-
-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #225 zfs-check: efficient handling of sparse files #234
base: master
Are you sure you want to change the base?
Conversation
BlockHasher.py * hash_class is sourced from cli args instead of hardcoding it. * hash_factory() lays the groundwork to support arbitrary hash libs. * Detection of and use of xxhash lib. ZfsCheck.py * Implement new cli arg --hash. The choices for the arg are generated based on what is detected in the python env. * The input to --hash is is validated against the arg choices. * Implemented helper method determine_algorithms_available(). This tries to pick a performant default with a fallback to sha1. * Detection of and use of xxhash lib.
very nice, however i think its a bit tricky to make the default algorithm dynamic: if one node has xxhash and the other node hasn't , it will look to the user as if all the checksums mismatch. so maybe keep the default a safe sha1? we can add information to the performance tips page about installing the xxhash and select it? |
That is fair. I had not considered it because I'm used to being cognitive of the technical details of the checksum etc. Let me have a think. I agree it makes sense for the out of box / default behaviour to be very predictable. Best user experience etc. Perhaps stderr output should contain checksum details and a hint to performance tips? |
Additional thoughts. Perhaps the stdout should contain a predictable header (defining the specification of the operation - how This approach is also nice for the use case where someone stores the checksums long term so they can be aware of the the spec that was used to generate the checksums? Perhaps offering a generated cli invocation which could be copy + pasted for repeatability? Hypothetically, in the |
That sounds good, first line as a comment with the "hashing" config? (Things like the hash and chunksize and skip option?) |
Yes, along those lines. I was thinking that we could make it an object like json so its easy to parse. I think using the It looks like the You've probably seen that a new OpenZFS silent corruption bug has recently surfaced (silent for ~18 months?!), I'm currently doing analysis on the available info to see how I might be impacted and how to "check" it 😉 see: openzfs/zfs#15526 and https://redd.it/1826lgs. I pinged you a chat on reddit if you want to chat about this PR or the corruption. |
oh no another silent corruption bug in zfs? :( i think that the header should be read earlier. Blockhasher is a low level function, and by then its too late to change any settings? but i have to look into it. |
Coincidently the datacorruption bug is also related to sparse handling of files it seems. They just released a fix: https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14 |
Per #225, this PR offers a solution - critique and changes very welcome.
The PR introduces two main changes:
xxhash
module is available,xxh3_64
is becomes the default--hash
algorithm, else fallback tosha1
.zfs-check --hash
which enables invocation with a specific hash algorithm.This PR proposes to use xxHash - an extremely fast non-cryptographic hash algorithm, as the default hash algorithm for
zfs-check
. This algorithm is able to hash sparse chunks efficiently. This previous revisions hardcodedsha1
algorithm was inefficient in handling sparse chunks. This is documented in #225.One is able to get a quick idea of the performance of various hash algorithms here: https://github.com/Cyan4973/xxHash#benchmarks. Its clear to see
sha1
is near the bottom of the leader-board.💡 It might be desirable to add
xxhash
torequirements.txt
BUT the caveat is that operating system needs sharedlibxxhash
installed. I've omitted this for now, deferring to the leaders of the project to make a decision on this.To manually install
xxhash
in a python env:python3 -m pip install --upgrade xxhash
.Note that the code in the PR gracefully handles the absence of the
xxhash
module and implements fallback logic.BlockHasher.py
ZfsCheck.py
--hash
. The choices for the arg are generated based on what is detected in the python env.--hash
is is validated against the arg choices.