Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Change zstd compression levels mapping #643

Open
wants to merge 86 commits into
base: master
Choose a base branch
from

Conversation

LRFLEW
Copy link

@LRFLEW LRFLEW commented Jan 28, 2024

I noticed that bcachefs is mapping its compression levels (1-15) to zstd's compression levels (1-22) using floor(level * 3 / 2). This PR proposes to change this mapping to be direct, and use only zstd compression levels 1-15 (see later comments about the new mapping). My reasoning for this change is as follows:

  • This mapping, requiring integer truncation, results in an uneven mapping of compression levels. Increasing the bcachefs compression level by one will increase the underlying compression level by either one or two depending on the starting level. Using a direct mapping of compression levels fixes this.
  • bcachefs uses a default compression level of 3, which seems to map to zstd's default compression level of 3. However, the remapping of compression levels means that a bcachefs compression level of 3 is actually a zstd compression level of 4, which is more than the default for zstd. Using a direct mapping will bring the default compression levels of bcachefs and zstd in line with each other.
  • Referring to zstd's compression levels as being "from 1 to 22" is a bit disingenuous, as it ignores the seven "negative" compression levels (accessed in the command line tool with --fast=N). Having bcachefs ignore some of the higher compression levels is not a significant deal, considering that it's already ignoring some of the lower compression levels.
  • Similarly, with the command-line tool zstd, the compression levels 20-22 are not accessible without an additional command-line argument (--ultra) to tell it you really want to use compression levels that high. Counting those higher compression levels as being equivalent to the others can be misleading, and excluding them shouldn't be a significant change.
  • btrfs's zstd compression also has 15 compression levels, but they map their compression levels to zstd's compression levels 1-15 directly. Changing bcachefs's compression level mapping to match btrfs means that existing benchmarks for btrfs compression levels will be more accurate for bcachefs.

If keeping access to these higher compression levels is a priority, then perhaps a non-linear mapping would be better suited. One option would be to map the lower levels directly to the zstd levels, and then have the higher levels skip every other compression level (clamped to <=22). This would provide a compromise between providing more low compression levels and maintaining access to the max compression level, all while addressing the above issues. Let me know if you'd rather this solution, and I can update the PR with it.

@boomshroom
Copy link

Since background_compression happens, well, in the background, it's not as important for it to be fast and it makes sense to let the system use spare background cycles getting the maximum possible compression for very infrequently accessed files. As such, being able to access level 22 for this purpose would be nice and I'd appreciate not losing it.

If we wanted a piecewise-linear mapping with a slope of 1 at (0,0) and a slope of 2 at (15,22), then they would meet at exactly (8,8), giving one extra point to the lower range.

@LRFLEW
Copy link
Author

LRFLEW commented Jan 29, 2024

I noticed that piecewise-linear mapping while experimenting with it, and I do think that's a good option. The only issue I have with it is that levels 20-22 might want to be considered separate.

Levels >= 20, labeled --ultra, should be used with caution, as they require more memory. (source)

Because of that, I think level 19, the maximum non-ultra level, should be included in the mapping. As such, I would propose using the mapping points (0,0) and (14,19), giving an intersection point of (9,9). Then the top level 15 would be mapped to the top ultra level 22. The only possible issue I see is that it adds additional complexity to the mapping.

@koverstreet koverstreet force-pushed the master branch 6 times, most recently from d0860cf to 253d7fb Compare February 10, 2024 22:41
Kent Overstreet added 22 commits February 11, 2024 00:58
This prevents going emergency read only when the user has specified
replicas_required > replicas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.

This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For DT_SUBVOL, we now print both parent and child subvol IDs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor renaming for clarity, bit of refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __
from the name - this is a normal interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eliminates some error paths - no longer have a hardcoded
BCH_REPLICAS_MAX limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This gives us a way to record the date and time every journal entry was
written - useful for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Recently a severe performance regression was discovered, which bisected
to

  a6548c8 bcachefs: Avoid flushing the journal in the discard path

It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.

The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.

This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.

lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Files within a subvolume cannot be renamed into another subvolume, but
subvolumes themselves were intended to be.

This implements subvolume renaming - we need to ensure that there's only
a single dirent that points to a subvolume key (not multiple versions in
different snapshots), and we need to ensure that dirent.d_parent_subol
and inode.bi_parent_subvol are updated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
switch the statfs code from something horrible and open coded to the
more standard uuid_to_fsid()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
@koverstreet koverstreet force-pushed the master branch 2 times, most recently from e4b827e to b3a7824 Compare November 4, 2024 07:06
@koverstreet koverstreet force-pushed the master branch 3 times, most recently from 48c34ec to c7e3dd8 Compare November 15, 2024 01:20
@koverstreet koverstreet force-pushed the master branch 7 times, most recently from 8903f5f to bc01863 Compare November 30, 2024 02:27
@koverstreet koverstreet force-pushed the master branch 8 times, most recently from f1d736e to f22a971 Compare December 8, 2024 05:20
@koverstreet koverstreet force-pushed the master branch 6 times, most recently from 79c1539 to f8ed207 Compare December 12, 2024 07:29
@koverstreet koverstreet force-pushed the master branch 3 times, most recently from 379189f to c684780 Compare December 25, 2024 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants