[6.11, 6.12] Constant heavy reads when there is unfinishable "Pending rebalance work" #795

Open
nitinkmr333 opened this issue Dec 4, 2024 · 2 comments

nitinkmr333 commented Dec 4, 2024

On a multi-device filesystem, I have noticed that whenever background_target becomes full, the rebalance thread does constant heavy reads.

Steps to reproduce:

Create two disk images to use as loop devices. One will be the foreground_target (disk0), the other the background_target (disk1):

❯ mkdir -p ~/bcachefs
❯ cd ~/bcachefs
❯ dd if=/dev/zero of=disk0 bs=1G count=40 status=progress
42949672960 bytes (43 GB, 40 GiB) copied, 16 s, 2.7 GB/s
40+0 records in
40+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 16.1028 s, 2.7 GB/s
❯ dd if=/dev/zero of=disk1 bs=1G count=40 status=progress
41875931136 bytes (42 GB, 39 GiB) copied, 15 s, 2.7 GB/s
40+0 records in
40+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 15.7211 s, 2.7 GB/s

Both images are 40 GiB.

Attach them as loop devices so they can be formatted and mounted:

❯ sudo losetup --find --show disk0
/dev/loop0
❯ sudo losetup --find --show disk1
/dev/loop1
❯ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0    40G  0 loop
loop1         7:1    0    40G  0 loop

Format the loop devices as bcachefs. disk0 gets the label ssd (foreground_target) and disk1 gets the label hdd (background_target):

❯ sudo bcachefs format --label ssd /dev/loop0 --label hdd /dev/loop1 --foreground_target=ssd --background_target=hdd
External UUID:                             99e865e6-ee40-480a-bd5d-c2fb1b805583
Internal UUID:                             80195311-407a-492f-a297-5d2e3e78892d
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              1
Label:                                     (none)
Version:                                   1.13: inode_has_child_snapshots
Version upgrade complete:                  0.0: (unknown version)
Oldest version on disk:                    1.13: inode_has_child_snapshots
Created:                                   Wed Dec  4 19:11:21 2024
Sequence number:                           0
Time of last write:                        Thu Jan  1 05:30:00 1970
Superblock size:                           1.25 KiB/1.00 MiB
Clean:                                     0
Devices:                                   2
Sections:                                  members_v1,disk_groups,members_v2
Features:                                  new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:

Options:
  block_size:                              512 B
  btree_node_size:                         256 KiB
  errors:                                  continue [fix_safe] panic ro
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash
  data_checksum:                           none [crc32c] crc64 xxhash
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash]
  metadata_target:                         none
  foreground_target:                       ssd
  background_target:                       hdd
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none
  nocow:                                   0

members_v2 (size 304):
Device:                                    0
  Label:                                   ssd (0)
  UUID:                                    5819c971-b9fe-448a-b1d0-d488591e61f6
  Size:                                    40.0 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             256 KiB
  First bucket:                            0
  Buckets:                                 163840
  Last mount:                              (never)
  Last superblock write:                   0
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                (none)
  Btree allocated bitmap blocksize:        1.00 B
  Btree allocated bitmap:                  0000000000000000000000000000000000000000000000000000000000000000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   0
Device:                                    1
  Label:                                   hdd (1)
  UUID:                                    18483694-8f70-454e-a5cd-719c2499ac11
  Size:                                    40.0 GiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             256 KiB
  First bucket:                            0
  Buckets:                                 163840
  Last mount:                              (never)
  Last superblock write:                   0
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                (none)
  Btree allocated bitmap blocksize:        1.00 B
  Btree allocated bitmap:                  0000000000000000000000000000000000000000000000000000000000000000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   0
starting version 1.13: inode_has_child_snapshots opts=foreground_target=ssd,background_target=hdd
initializing new filesystem
going read-write
initializing freespace
shutdown complete, journal seq 16

Mount the filesystem and write a 60 GB file (larger than the background_target):

❯ sudo bcachefs mount /dev/loop0:/dev/loop1 /mnt
❯ sudo dd if=/dev/zero of=/mnt/hugefile bs=1G count=60 status=progress
64424509440 bytes (64 GB, 60 GiB) copied, 75 s, 854 MB/s
60+0 records in
60+0 records out
64424509440 bytes (64 GB, 60 GiB) copied, 75.4496 s, 854 MB/s
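
The steps above can be collected into one script. This is only a sketch: the `run` helper, the `DRY_RUN` switch and the `reproduce` function are hypothetical names, while the commands themselves are the ones from this report. By default nothing is executed and the commands are just printed, since a real run writes about 80 GB of image files and needs root.

```shell
#!/usr/bin/env bash
# Reproduction sketch. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# reproduce [image_dir] [mountpoint] [image_size_gib] [write_size_gib]
reproduce() {
    local dir=${1:-$HOME/bcachefs} mnt=${2:-/mnt}
    local size_gib=${3:-40} write_gib=${4:-60}
    # Device names seen in the report; replaced by losetup's answer on a real run.
    local ssd=/dev/loop0 hdd=/dev/loop1

    run mkdir -p "$dir"
    run dd if=/dev/zero of="$dir/disk0" bs=1G count="$size_gib" status=progress
    run dd if=/dev/zero of="$dir/disk1" bs=1G count="$size_gib" status=progress

    if [ "$DRY_RUN" = 1 ]; then
        run sudo losetup --find --show "$dir/disk0"
        run sudo losetup --find --show "$dir/disk1"
    else
        ssd=$(sudo losetup --find --show "$dir/disk0")
        hdd=$(sudo losetup --find --show "$dir/disk1")
    fi

    run sudo bcachefs format --label ssd "$ssd" --label hdd "$hdd" \
        --foreground_target=ssd --background_target=hdd
    run sudo bcachefs mount "$ssd:$hdd" "$mnt"
    # Write more data than the background_target can hold.
    run sudo dd if=/dev/zero of="$mnt/hugefile" bs=1G count="$write_gib" status=progress
    run sudo bcachefs fs usage "$mnt" -h
}
```

Calling `reproduce` prints the command list; `DRY_RUN=0 reproduce` would actually run it.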

Output of bcachefs fs usage:

❯ sudo bcachefs fs usage /mnt -h
Filesystem: 99e865e6-ee40-480a-bd5d-c2fb1b805583
Size:                       73.6 GiB
Used:                       60.2 GiB
Online reserved:                 0 B

Data type       Required/total  Durability    Devices
btree:          1/1             1             [loop0]              228 MiB
user:           1/1             1             [loop0]             21.6 GiB
user:           1/1             1             [loop1]             38.4 GiB
cached:         1/1             1             [loop0]             16.3 GiB

Btree usage:
extents:            87.0 MiB
inodes:              256 KiB
dirents:             256 KiB
alloc:              42.3 MiB
subvolumes:          256 KiB
snapshots:           256 KiB
lru:                2.75 MiB
freespace:           256 KiB
need_discard:        256 KiB
backpointers:       80.0 MiB
bucket_gens:         256 KiB
snapshot_trees:      256 KiB
rebalance_work:     13.8 MiB
accounting:          256 KiB

Pending rebalance work:
21.6 GiB

hdd (device 1):                loop1              rw
                                data         buckets    fragmented
  free:                     1.26 GiB            5177
  sb:                       3.00 MiB              13       252 KiB
  journal:                   320 MiB            1280
  btree:                         0 B               0
  user:                     38.4 GiB          157370
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 40.0 GiB          163840

ssd (device 0):                loop0              rw
                                data         buckets    fragmented
  free:                     1.27 GiB            5205
  sb:                       3.00 MiB              13       252 KiB
  journal:                   320 MiB            1280
  btree:                     228 MiB             912
  user:                     21.6 GiB           88390
  cached:                   16.3 GiB           66785       256 KiB
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:              314 MiB            1255
  unstriped:                     0 B               0
  capacity:                 40.0 GiB          163840
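
The figures in this output are consistent with each other, which a quick calculation makes visible. The sketch below only re-does the arithmetic on numbers copied from the output; the helper names `remaining_on_ssd` and `fits` are made up for illustration.

```shell
#!/usr/bin/env bash
# Figures copied from the `bcachefs fs usage` output above, in GiB.
written=60.0     # size of /mnt/hugefile
user_hdd=38.4    # user data on loop1 (background_target)
user_ssd=21.6    # user data still on loop0 (foreground_target)
free_hdd=1.26    # space reported free on loop1

# Data that was written but never reached the hdd.
remaining_on_ssd() {
    awk -v w="$1" -v h="$2" 'BEGIN { printf "%.1f\n", w - h }'
}

# Succeeds if "need" GiB fits into "free" GiB.
fits() {
    awk -v need="$1" -v free="$2" 'BEGIN { exit !(need <= free) }'
}

remaining_on_ssd "$written" "$user_hdd"   # matches "Pending rebalance work"
if fits "$user_ssd" "$free_hdd"; then
    echo "rebalance can finish"
else
    echo "rebalance cannot finish"
fi
```

The difference equals both the user data left on loop0 and the reported pending rebalance work, and it is far larger than the space reported free on loop1.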

There is pending rebalance work, but the background_target is full, so the data cannot be moved. I can see the rebalance thread doing constant reads even after the write has finished:
(Screenshot: Screenshot_20241204_193257)

I would expect some periodic I/O while the filesystem checks whether background_target has free space, but 300+ MB/s seems excessive. I waited for more than an hour and it did not stop. It starts again if I remount the filesystem, and it only stops once I delete the file I created and free up space on background_target.
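
For anyone who wants a number rather than a screenshot, the read rate of a device can be sampled from /proc/diskstats, where the sixth field is the count of 512-byte sectors read. This is a minimal sketch; the function names are hypothetical, and it measures all reads on the device, not only those issued by the rebalance thread.

```shell
#!/usr/bin/env bash
# Sectors read so far by a block device (field 6 of /proc/diskstats).
# An alternative stats file can be passed as the second argument.
sectors_read() {
    awk -v dev="$1" '$3 == dev { print $6 }' "${2:-/proc/diskstats}"
}

# Convert two sector counts taken "seconds" apart into MiB/s.
# Sectors in /proc/diskstats are always 512 bytes.
read_rate_mib() {
    local before=$1 after=$2 seconds=$3
    awk -v b="$before" -v a="$after" -v s="$seconds" \
        'BEGIN { printf "%.1f\n", (a - b) * 512 / 1048576 / s }'
}

# Sample a device for a number of seconds and print its read rate in MiB/s.
sample_read_rate() {
    local dev=$1 seconds=${2:-5} before after
    before=$(sectors_read "$dev")
    sleep "$seconds"
    after=$(sectors_read "$dev")
    read_rate_mib "$before" "$after" "$seconds"
}
```

For the setup in this report, `sample_read_rate loop0 10` would sample the foreground device for ten seconds.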

The underlying filesystem (where the loop device images are stored) is btrfs (with compression=zstd:3).

Host: NixOS
❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.11.5-zen1, NixOS, 24.11 (Vicuna), 24.11.20241202.f9f0d5c`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.24.10`
 - nixpkgs: `/nix/store/45bzbkwnyb6nikgc7jkrn7vjibhy4xhk-source`

bcachefs-tools version: 6.13.0

I will do some more testing on actual hardware.

nitinkmr333 commented Dec 12, 2024

I can confirm this also happens on actual hardware. There are heavy reads when the background target is full. Writes are unaffected.

nitinkmr333 commented Dec 29, 2024

It looks like the issue is related to Pending rebalance work in general and not just to a full background_target. The bug appears whenever there is Pending rebalance work that cannot be completed for some reason (maybe the pending rebalance work is being rescanned constantly, resulting in I/O?).

For example, create a filesystem with two disks (foreground_target=ssd, background_target=hdd, replicas=1) and write some data with data_replicas=2:

❯ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0  1000M  0 loop /mnt
loop1         7:1    0  1000M  0 loop

Output of show-super:

❯ sudo bcachefs show-super /dev/loop0
Device:                                     (unknown device)
External UUID:                             a7f2bff7-29ee-4e49-9e01-0cbe16c7332a
Internal UUID:                             63201fc7-0bff-4422-baa4-d243bc81a483
Magic number:                              c68573f6-66ce-90a9-d96a-60cf803df7ef
Device index:                              0
Label:                                     (none)
Version:                                   1.13: inode_has_child_snapshots
Version upgrade complete:                  1.13: inode_has_child_snapshots
Oldest version on disk:                    1.13: inode_has_child_snapshots
Created:                                   Sun Dec 29 14:58:13 2024
Sequence number:                           20
Time of last write:                        Sun Dec 29 15:01:46 2024
Superblock size:                           4.67 KiB/1.00 MiB
Clean:                                     0
Devices:                                   2
Sections:                                  members_v1,replicas_v0,disk_groups,clean,journal_v2,counters,members_v2,errors,ext,downgrade
Features:                                  new_siphash,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                           alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                              512 B
  btree_node_size:                         128 KiB
  errors:                                  continue [fix_safe] panic ro 
  metadata_replicas:                       1
  data_replicas:                           1
  metadata_replicas_required:              1
  data_replicas_required:                  1
  encoded_extent_max:                      64.0 KiB
  metadata_checksum:                       none [crc32c] crc64 xxhash 
  data_checksum:                           none [crc32c] crc64 xxhash 
  compression:                             none
  background_compression:                  none
  str_hash:                                crc32c crc64 [siphash] 
  metadata_target:                         none
  foreground_target:                       ssd
  background_target:                       hdd
  promote_target:                          none
  erasure_code:                            0
  inodes_32bit:                            1
  shard_inode_numbers:                     1
  inodes_use_key_cache:                    1
  gc_reserve_percent:                      8
  gc_reserve_bytes:                        0 B
  root_reserve_percent:                    0
  wide_macs:                               0
  promote_whole_extents:                   1
  acl:                                     1
  usrquota:                                0
  grpquota:                                0
  prjquota:                                0
  journal_flush_delay:                     1000
  journal_flush_disabled:                  0
  journal_reclaim_delay:                   100
  journal_transaction_names:               1
  allocator_stuck_timeout:                 30
  version_upgrade:                         [compatible] incompatible none 
  nocow:                                   0

members_v2 (size 304):
Device:                                    0
  Label:                                   ssd (0)
  UUID:                                    72944387-5f76-407e-8152-6e25d95d8cc3
  Size:                                    1000 MiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             128 KiB
  First bucket:                            0
  Buckets:                                 8000
  Last mount:                              Sun Dec 29 15:01:46 2024
  Last superblock write:                   20
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                journal,btree,user
  Btree allocated bitmap blocksize:        4.00 KiB
  Btree allocated bitmap:                  0000010000000000000000000000000000000000000000000000000001100000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1
Device:                                    1
  Label:                                   hdd (1)
  UUID:                                    75c87134-d657-4ae5-91aa-4fff722d2a11
  Size:                                    1000 MiB
  read errors:                             0
  write errors:                            0
  checksum errors:                         0
  seqread iops:                            0
  seqwrite iops:                           0
  randread iops:                           0
  randwrite iops:                          0
  Bucket size:                             128 KiB
  First bucket:                            0
  Buckets:                                 8000
  Last mount:                              Sun Dec 29 15:01:46 2024
  Last superblock write:                   20
  State:                                   rw
  Data allowed:                            journal,btree,user
  Has data:                                user
  Btree allocated bitmap blocksize:        1.00 B
  Btree allocated bitmap:                  0000000000000000000000000000000000000000000000000000000000000000
  Durability:                              1
  Discard:                                 0
  Freespace initialized:                   1

Now write some data to a directory that has data_replicas=2 (set via xattr):

cd /mnt
sudo mkdir data_xattr
sudo bcachefs set-file-option --data_replicas=2 data_xattr
sudo dd if=/dev/zero of=data_xattr/file bs=200M count=1 status=progress

The background_target has enough free space, but it can store only one of the two replicas, so there is Pending rebalance work:

❯ sudo bcachefs fs usage -h /mnt
Filesystem: a7f2bff7-29ee-4e49-9e01-0cbe16c7332a
Size:                       1.80 GiB
Used:                        403 MiB
Online reserved:                 0 B

Data type       Required/total  Durability    Devices
btree:          1/1             1             [loop0]             3.13 MiB
user:           1/2             2             [loop0 loop1]        400 MiB

Btree usage:
extents:             512 KiB
inodes:              128 KiB
dirents:             128 KiB
alloc:               640 KiB
subvolumes:          128 KiB
snapshots:           128 KiB
lru:                 128 KiB
freespace:           128 KiB
need_discard:        128 KiB
backpointers:        640 KiB
bucket_gens:         128 KiB
snapshot_trees:      128 KiB
rebalance_work:      128 KiB
accounting:          128 KiB

Pending rebalance work:
200 MiB

hdd (device 1):                loop1              rw
                                data         buckets    fragmented
  free:                      789 MiB            6313
  sb:                       3.00 MiB              25       124 KiB
  journal:                  7.75 MiB              62
  btree:                         0 B               0
  user:                      200 MiB            1600
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 1000 MiB            8000

ssd (device 0):                loop0              rw
                                data         buckets    fragmented
  free:                      786 MiB            6288
  sb:                       3.00 MiB              25       124 KiB
  journal:                  7.75 MiB              62
  btree:                    3.13 MiB              25
  user:                      200 MiB            1600
  cached:                        0 B               0
  parity:                        0 B               0
  stripe:                        0 B               0
  need_gc_gens:                  0 B               0
  need_discard:                  0 B               0
  unstriped:                     0 B               0
  capacity:                 1000 MiB            8000

This causes heavy reads on the filesystem.
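
To check whether the pending work ever shrinks, the value can be pulled out of the `bcachefs fs usage -h` output and compared over time. The sketch below assumes the output layout shown in this report (a "Pending rebalance work:" line followed by the value on the next line); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the value reported under "Pending rebalance work:" in the
# `bcachefs fs usage -h` output read from stdin.
pending_rebalance() {
    awk '/^Pending rebalance work:/ {
        if ((getline value) > 0) print value
        exit
    }'
}

# Succeeds if two non-empty samples are identical, i.e. no visible
# progress was made between them.
rebalance_stalled() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example use (needs a mounted bcachefs filesystem and root):
#   before=$(sudo bcachefs fs usage -h /mnt | pending_rebalance)
#   sleep 600
#   after=$(sudo bcachefs fs usage -h /mnt | pending_rebalance)
#   rebalance_stalled "$before" "$after" && echo "no progress: $after"
```

Because the `-h` output is rounded, identical samples only suggest a stall; small changes below the rounding step would go unnoticed.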

nitinkmr333 changed the title on Dec 29, 2024, from "[6.11] Constant heavy reads when background_target is full" to '[6.11, 6.12] Constant heavy reads when there is unfinishable "Pending rebalance work"'.