HTTPFileSystem isdir downloads the whole file #1707

mxmlnkn · 2024-10-04T21:26:23Z

I need to implement the FUSE getattr (stat) callback. That is, for a given path I need at least the file type and size, and possibly the name.

I am failing to do this with the HTTP filesystem implementation because:

info(path) always returns information about the HTML page itself, so the reported type is always "file", even for a directory URL. This is inconsistent with all other fsspec implementations. The same goes for isfile, which always returns True.
isdir(path) hangs. My local HTTP server log, and my network bandwidth when testing against an external server, show that this call downloads the whole file. This means that an ls -la currently downloads every file in the given folder.

Test to reproduce:

import pprint
import time
import fsspec

prefix="https://ash-speed.hetzner.com/"

def timedCall(f, *args):    
    t0 = time.time()
    result = f(*args)
    t1 = time.time()
    print(f"{f} took {t1 - t0:.3f} s")
    pprint.pprint(result)
    print()

f = fsspec.open(prefix)

print(f"# Testing {prefix}\n")
timedCall(f.fs.exists, prefix)
timedCall(f.fs.listdir, prefix)
timedCall(f.fs.info, prefix)
timedCall(f.fs.isfile, prefix)
timedCall(f.fs.isdir, prefix)

path = prefix + "100MB.bin"
print(f"# Testing {path}\n")
timedCall(f.fs.exists, path)
timedCall(f.fs.info, path)
timedCall(f.fs.isfile, path)
timedCall(f.fs.isdir, path)

Output:

# Testing https://ash-speed.hetzner.com/

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.362 s
True

<bound method AbstractFileSystem.listdir of <fsspec.implementations.http.HTTPFileSystem object at 0x7fb0beaca680>> took 0.110 s
[{'name': 'https://ash-speed.hetzner.com/10GB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/100MB.bin',
  'size': None,
  'type': 'file'},
 {'name': 'https://ash-speed.hetzner.com/1GB.bin',
  'size': None,
  'type': 'file'}]

<function HTTPFileSystem._info at 0x7fb0be244550> took 0.763 s
{'ETag': '"60f52d50-143"',
 'mimetype': 'text/html',
 'name': 'https://ash-speed.hetzner.com/',
 'size': 323,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.108 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 0.108 s
True

# Testing https://ash-speed.hetzner.com/100MB.bin

<function HTTPFileSystem._exists at 0x7fb0be244700> took 0.216 s
True

<function HTTPFileSystem._info at 0x7fb0be244550> took 1.098 s
{'ETag': '"60c9b8bd-6400000"',
 'mimetype': 'application/octet-stream',
 'name': 'https://ash-speed.hetzner.com/100MB.bin',
 'size': 104857600,
 'type': 'file',
 'url': 'https://ash-speed.hetzner.com/100MB.bin'}

<function HTTPFileSystem._isfile at 0x7fb0be2445e0> took 0.428 s
True

<function HTTPFileSystem._isdir at 0x7fb0be244670> took 38.450 s
False

Imho, isdir should be implemented via a listdir on the parent if there is no other way. I am also wondering what it actually checks. Is it simply a mimetype check for HTML? If so, the first 1000 or so bytes would suffice. But then, wouldn't it wrongly detect arbitrary HTML files inside a given "folder" as folders?
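A minimal sketch of the "listdir on the parent" idea, not existing fsspec code. The helper name isdir_via_parent is hypothetical. It assumes the parent listing reports subdirectories with type "directory", and it treats the server root as a directory because the root has no parent to list.

```python
from urllib.parse import urlsplit, urlunsplit


def isdir_via_parent(fs, url):
    """Hypothetical helper: decide whether `url` is a directory by listing
    its parent instead of requesting `url` itself."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path:
        # Assumption: the server root is treated as a directory.
        return True

    # Build the parent URL and a normalized target without a trailing slash.
    parent_path = path.rsplit("/", 1)[0] + "/"
    parent = urlunsplit((parts.scheme, parts.netloc, parent_path, "", ""))
    target = urlunsplit((parts.scheme, parts.netloc, path, "", ""))

    # Only the parent's listing page is fetched here.
    for entry in fs.ls(parent, detail=True):
        if entry["name"].rstrip("/") == target:
            return entry["type"] == "directory"

    # Not linked from the parent page: report "not a directory".
    return False
```

This only works when the parent page actually links to the child, so it cannot replace the current check on servers without index pages.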

My current workaround is to call info first and only call isdir if mimetype is text/html. This logic could also be implemented in HTTPFileSystem if there is no better way.
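The workaround can be sketched roughly as follows. The helper name cheap_isdir is hypothetical, and the sketch assumes that info() returns a "mimetype" key, as it does in the output above.

```python
def cheap_isdir(fs, url):
    """Hypothetical helper: only fall back to fs.isdir() when info()
    reports an HTML page, so binary files are never downloaded."""
    info = fs.info(url)
    # The mimetype may be missing or None for some servers.
    mimetype = info.get("mimetype") or ""
    if not mimetype.startswith("text/html"):
        # Anything that is not HTML cannot be a directory listing.
        return False
    # HTML pages are usually small, so the full isdir check is affordable.
    return fs.isdir(url)
```

With fs = fsspec.filesystem("https"), cheap_isdir(fs, prefix + "100MB.bin") would return after the info call without invoking isdir. An ordinary HTML file inside a folder still goes through the full isdir check.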

martindurant · 2024-10-07T14:25:39Z

The way ls works for HTTP is to download the URL/page and look for links to URLs that look like children of the original page. This works well for "ftp-style" servers (like python -m http.server). That's the only way to know if a URL is a directory.

I guess it would be reasonable to shortcut isdir to return False for ANY URL that isn't HTML?
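To illustrate the idea of "links that look like children", here is a minimal standard-library sketch. It is not the actual fsspec implementation, and the names _LinkCollector and child_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def child_links(base_url, html_text):
    """Return the sorted absolute URLs linked from the page that lie
    below base_url."""
    base = base_url.rstrip("/") + "/"
    parser = _LinkCollector()
    parser.feed(html_text)
    children = set()
    for href in parser.hrefs:
        absolute = urljoin(base, href)
        # Keep only links below the page itself; drop parents and other hosts.
        if absolute.startswith(base) and absolute != base:
            children.add(absolute)
    return sorted(children)
```

Note that the page body has to be downloaded before any links can be extracted, which is why pointing this logic at a large binary file is so expensive.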

martindurant · 2024-10-07T15:31:51Z

I suppose the shortcut would be in info or ls actually - would you like to have a go at coding that? Before calling .text here, checking the content-type of r and returning nothing for any request that isn't HTML should avoid the download.
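A sketch of what such a content-type check could look like, independent of the HTTP client. The function name is_html_response and the set of accepted media types are assumptions for illustration, not part of fsspec.

```python
# Assumption: these media types are the ones worth parsing for links.
HTML_TYPES = ("text/html", "application/xhtml+xml")


def is_html_response(headers):
    """Return True if the response headers declare an HTML body.

    `headers` is any mapping of header names to values; names are
    compared case-insensitively.
    """
    for name, value in headers.items():
        if name.lower() == "content-type":
            # Strip parameters such as "; charset=utf-8".
            media_type = value.split(";", 1)[0].strip().lower()
            return media_type in HTML_TYPES
    # Assumption: without a Content-Type header, do not try to parse links.
    return False
```

The listing code could then check is_html_response(r.headers) and return an empty result before reading the body. Whether a missing Content-Type header should count as HTML is a judgment call; this sketch takes the conservative option.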

mxmlnkn mentioned this issue Oct 11, 2024

Ls / listdir fails when using a public gateway because a non-CID path is requested fsspec/ipfsspec#39

Open
