HTTPFileSystem isdir downloads the whole file #1707

Open
mxmlnkn opened this issue Oct 4, 2024 · 2 comments

@mxmlnkn
Contributor

mxmlnkn commented Oct 4, 2024

I need to implement the FUSE getattr (stat) callback, i.e., for a given path I need at least the file type and size, and possibly the name.

I am failing to do this with the HTTP filesystem implementation because:

  • info(path) always returns the information for the HTML page itself, so the reported type is always "file", even for a directory listing. This is inconsistent with all other fsspec implementations. The same goes for isfile, which always returns True.
  • isdir(path) hangs. My local HTTP server log, and my network bandwidth when testing against an external server, show that this call downloads the whole file. This means that an ls -la currently downloads every file in the given folder.

Test to reproduce:

import pprint
import time
import fsspec

prefix="https://ash-speed.hetzner.com/"

def timedCall(f, *args):
    t0 = time.time()
    result = f(*args)
    t1 = time.time()
    print(f"{f} took {t1 - t0:.3f} s")
    pprint.pprint(result)
    print()

f = fsspec.open(prefix)

print(f"# Testing {prefix}\n")
timedCall(f.fs.exists, prefix)
timedCall(f.fs.listdir, prefix)
timedCall(f.fs.info, prefix)
timedCall(f.fs.isfile, prefix)
timedCall(f.fs.isdir, prefix)

path = prefix + "100MB.bin"
print(f"# Testing {path}\n")
timedCall(f.fs.exists, path)
timedCall(f.fs.info, path)
timedCall(f.fs.isfile, path)
timedCall(f.fs.isdir, path)

Output:

# Testing https://ash-speed.hetzner.com/

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.362 s
True

<bound method AbstractFileSystem.listdir of <fsspec.implementations.http.HTTPFileSystem object at 0x7fb0beaca680>> took 0.110 s
[{'name': 'https://ash-speed.hetzner.com/10GB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/100MB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/1GB.bin',
  'size': None,
  'type': 'file'}]

<function HTTPFileSystem._info at 0x7fb0be244550> took 0.763 s
{'ETag': '"60f52d50-143"',
 'mimetype': 'text/html',
 'name': 'https://ash-speed.hetzner.com/',
 'size': 323,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.108 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 0.108 s
True

# Testing https://ash-speed.hetzner.com/100MB.bin

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.216 s
True

<function HTTPFileSystem._info at 0x7fb0be244550> took 1.098 s
{'ETag': '"60c9b8bd-6400000"',
 'mimetype': 'application/octet-stream',
 'name': 'https://ash-speed.hetzner.com/100MB.bin',
 'size': 104857600,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/100MB.bin'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.428 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 38.450 s
False

Imho, isdir should be implemented via a listdir of the parent if there is no other way. I am also wondering what it actually checks. Is it simply a mimetype check for HTML? If so, the first 1000 or so bytes would suffice. But then, wouldn't it wrongly detect arbitrary HTML files inside a given "folder" as folders?
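
The parent-listing idea could be sketched roughly as follows. This is only an illustration, not fsspec code: `isdir_via_parent` is a hypothetical helper, and it assumes that the parent listing marks sub-directories with `'type': 'directory'`. The listing shown above reports `'file'` for every entry (and no sizes), so whether this helps depends on what the server page looks like. It fetches the parent page but never the target itself.

```python
from urllib.parse import urlsplit


def isdir_via_parent(fs, url):
    """Guess whether `url` is a directory by listing its parent.

    Hypothetical helper. `fs` is any object with an fsspec-style
    `ls(path, detail=True)` returning dicts with "name" and "type".
    Returns True/False when the parent lists the entry, else None.
    """
    stripped = url.rstrip("/")
    if urlsplit(stripped).path == "":
        # The server root has no parent page to list.
        return None
    parent = stripped.rsplit("/", 1)[0] + "/"
    for entry in fs.ls(parent, detail=True):
        if entry["name"].rstrip("/") == stripped:
            # Assumption: directories are labelled "directory".
            return entry["type"] == "directory"
    # Not linked from the parent page, so nothing can be concluded.
    return None
```

Returning `None` for "unknown" leaves it to the caller to decide whether to fall back to another check.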

My current workaround is to call info first and only call isdir if mimetype is text/html. This logic could also be implemented in HTTPFileSystem if there is no better way.
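
A minimal sketch of that workaround, for reference. `cheap_isdir` is a hypothetical name, and the code relies on the `mimetype` key that `info` returned in the output above, which may be missing if the server sends no Content-Type header.

```python
def cheap_isdir(fs, url):
    """Call fs.isdir only for URLs whose info() reports an HTML page.

    Hypothetical helper. `fs` is expected to behave like the
    HTTPFileSystem above: info() returns a dict that may contain
    a "mimetype" key, and isdir() may download the whole body.
    """
    info = fs.info(url)
    mimetype = (info.get("mimetype") or "").lower()
    if not mimetype.startswith("text/html"):
        # Anything that is not HTML cannot be a link listing,
        # so skip the expensive isdir call entirely.
        return False
    # Only an HTML page (typically small) is downloaded here.
    return fs.isdir(url)
```

With this, `cheap_isdir(f.fs, path)` for the 100MB.bin URL above would return False after the info request alone, without calling isdir.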

@martindurant
Member

For HTTP, ls works by downloading the page at the URL and looking for links to URLs that look like children of the original page. This works well for "ftp-style" servers (like python -m http.server). That's the only way to know if a URL is a directory.
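
To illustrate the idea (this is a simplified sketch using the standard library, not the actual fsspec code; `child_links` and the example.com URLs are made up for the example):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def child_links(page_url, html_text):
    """Return absolute URLs linked from the page that lie below page_url."""
    base = page_url.rstrip("/") + "/"
    parser = _HrefCollector()
    parser.feed(html_text)
    children = []
    for href in parser.hrefs:
        absolute = urljoin(base, href)
        # Keep only links that sit under the page itself.
        if absolute.startswith(base) and absolute != base and absolute not in children:
            children.append(absolute)
    return children


page = '<a href="../">up</a> <a href="a.bin">a</a> <a href="sub/">sub</a>'
print(child_links("https://example.com/data/", page))
```

Note that the whole page text has to be available before any link can be found, which is why a non-HTML URL ends up being downloaded in full.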

I guess it would be reasonable to shortcut isdir to return False for ANY URL that isn't HTML?

@martindurant
Member

I suppose the shortcut would actually go in info or ls. Would you like to have a go at coding that? Checking the content-type of r before calling .text here, and returning nothing for any response that isn't HTML, should avoid the download.
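
Something along these lines might do for the check itself. It is a rough sketch, not a patch: `is_html_response` is a hypothetical helper, and `headers` stands for whatever mapping of response headers is at hand (a plain dict is case-sensitive, so the exact key spelling is an assumption here).

```python
def is_html_response(headers):
    """Return True if the Content-Type header announces an HTML page.

    Hypothetical helper. `headers` is any mapping with a .get method.
    Parameters such as "; charset=utf-8" are ignored.
    """
    content_type = headers.get("Content-Type") or ""
    media_type = content_type.partition(";")[0].strip().lower()
    return media_type == "text/html"


print(is_html_response({"Content-Type": "text/html; charset=utf-8"}))
print(is_html_response({"Content-Type": "application/octet-stream"}))
```

The listing code would then return an empty result whenever this is False, instead of reading the response body.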
