bch-rebalance dies with IP at 0x0, immediately after filesystem goes readwrite #814

Open
thememika opened this issue Jan 11, 2025 · 3 comments

@thememika

Hello!
I'm using commit: https://evilpiepirate.org/git/bcachefs.git/commit/?id=3dfbfa502c5592e1e29ae0a4ac6eb14b1c4aad85
The filesystem works when mounted read-only, but going read-write leads to a weird crash:

[   85.434223] [  T993] bcachefs (dm-17): going read-write
[   85.442160] [  T996] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   85.442163] [  T996] #PF: supervisor instruction fetch in kernel mode
[   85.442165] [  T996] #PF: error_code(0x0010) - not-present page
[   85.442167] [  T996] PGD 0 P4D 0
[   85.442170] [  T996] Oops: Oops: 0010 [#1] PREEMPT_RT SMP
[   85.442174] [  T996] CPU: 2 UID: 0 PID: 996 Comm: bch-rebalance/d Not tainted 6.12.0-blahaj-lts-rt-2+ #50
[   85.442177] [  T996] Hardware name: OEM X79G/X79G, BIOS 4.6.5 04/01/2024
[   85.442179] [  T996] RIP: 0010:0x0
[   85.442184] [  T996] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   85.442185] [  T996] RSP: 0018:ffffc9000a1afa20 EFLAGS: 00010246
[   85.442187] [  T996] RAX: 0000000000000001 RBX: ffff888122805240 RCX: 00000006f4c88318
[   85.442189] [  T996] RDX: 0000000000000000 RSI: ffff888124697808 RDI: ffff888124697800
[   85.442191] [  T996] RBP: ffffc9000a1afa58 R08: fffffffffff7071c R09: 0000000000000000
[   85.442192] [  T996] R10: 0000000000000000 R11: ffffffff84f89bc8 R12: 0000000000000000
[   85.442194] [  T996] R13: 0000000000000000 R14: ffff888124697800 R15: 0000000000000000
[   85.442195] [  T996] FS:  0000000000000000(0000) GS:ffff888ffb600000(0000) knlGS:0000000000000000
[   85.442198] [  T996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   85.442199] [  T996] CR2: ffffffffffffffd6 CR3: 00000000072f0003 CR4: 00000000001726f0
[   85.442202] [  T996] Call Trace:
[   85.442203] [  T996]  <TASK>
[   85.442205] [  T996]  ? __die_body+0x6e/0xb0
[   85.442212] [  T996]  ? __die+0x8b/0xc0
[   85.442215] [  T996]  ? page_fault_oops+0x362/0x420
[   85.442220] [  T996]  ? bch2_bkey_pack_pos_lossy+0x327/0xaa0
[   85.442227] [  T996]  ? do_user_addr_fault+0x6cb/0x13b0
[   85.442231] [  T996]  ? __bch2_btree_node_relock+0x8a/0x350
[   85.442236] [  T996]  ? bch2_btree_node_iter_init+0x2cc/0xfe0
[   85.442242] [  T996]  ? sched_clock_noinstr+0xd/0x10
[   85.442247] [  T996]  ? sched_clock_noinstr+0xd/0x10
[   85.442250] [  T996]  ? __lock_acquire+0xc7a/0xe60
[   85.442269] [  T996]  ? exc_page_fault+0x74/0x120
[   85.442273] [  T996]  ? asm_exc_page_fault+0x2b/0x30
[   85.442280] [  T996]  bch2_io_timer_add+0xf9/0x150
[   85.442283] [  T996]  ? bch2_rebalance_thread+0x80/0x920
[   85.442289] [  T996]  bch2_kthread_io_clock_wait+0xca/0x190
[   85.442291] [  T996]  ? bch2_io_clock_schedule_timeout+0xe0/0xe0
[   85.442292] [  T996]  ? bch2_rebalance_thread+0x80/0x920
[   85.442298] [  T996]  bch2_rebalance_thread+0x80/0x920
[   85.442300] [  T996]  ? bch2_rebalance_thread+0x4a/0x920
[   85.442313] [  T996]  ? bch2_rebalance_thread+0x1ab/0x920
[   85.442317] [  T996]  ? bch2_rebalance_thread+0x1ab/0x920
[   85.442335] [  T996]  ? local_clock_noinstr+0x30/0xc0
[   85.442338] [  T996]  ? local_clock+0x19/0x30
[   85.442343] [  T996]  ? lock_release+0x126/0x4d0
[   85.442345] [  T996]  ? kthread+0x11b/0x170
[   85.442351] [  T996]  kthread+0x157/0x170
[   85.442354] [  T996]  ? bch2_rebalance_start+0x100/0x100
[   85.442358] [  T996]  ? kthread_blkcg+0x40/0x40
[   85.442361] [  T996]  ret_from_fork+0x3a/0x50
[   85.442365] [  T996]  ? kthread_blkcg+0x40/0x40
[   85.442369] [  T996]  ret_from_fork_asm+0x11/0x20
[   85.442376] [  T996]  </TASK>
[   85.442377] [  T996] Modules linked in:
[   85.442433] [  T996] CR2: 0000000000000000
[   85.442436] [  T996] ---[ end trace 0000000000000000 ]---
[   85.450587] [  T996] pstore: backend (erst) writing error (-28)
[   85.450587] [  T996] RIP: 0010:0x0
[   85.450587] [  T996] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   85.450587] [  T996] RSP: 0018:ffffc9000a1afa20 EFLAGS: 00010246
[   85.450587] [  T996] RAX: 0000000000000001 RBX: ffff888122805240 RCX: 00000006f4c88318
[   85.450587] [  T996] RDX: 0000000000000000 RSI: ffff888124697808 RDI: ffff888124697800
[   85.450587] [  T996] RBP: ffffc9000a1afa58 R08: fffffffffff7071c R09: 0000000000000000
[   85.450587] [  T996] R10: 0000000000000000 R11: ffffffff84f89bc8 R12: 0000000000000000
[   85.450587] [  T996] R13: 0000000000000000 R14: ffff888124697800 R15: 0000000000000000
[   85.450587] [  T996] FS:  0000000000000000(0000) GS:ffff888ffb600000(0000) knlGS:0000000000000000
[   85.450587] [  T996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   85.450587] [  T996] CR2: ffffffffffffffd6 CR3: 00000000072f0003 CR4: 00000000001726f0
[   85.450587] [  T996] note: bch-rebalance/d[996] exited with irqs disabled

Thanks for any help.

@thememika

thememika commented Jan 11, 2025

Is a NULL function pointer being called, or something like that? I'll run faddr2line later when I get to a machine. The last bcachefs function that ran before the trap seems to be bch2_io_timer_add.
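The error code in the log, `error_code(0x0010)`, already points in that direction. As a rough user-space sketch (not kernel code), the architectural x86 page-fault error bits can be decoded like this. The `PF_*` macro names and the helper `is_kernel_bad_jump` are names chosen here for illustration, not identifiers taken from the kernel tree.

```c
#include <stdint.h>
#include <stdio.h>

/* x86 page-fault error code bits (names are local to this sketch). */
#define PF_PROT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1)  /* 0: read access, 1: write access */
#define PF_USER  (1u << 2)  /* 0: supervisor (kernel) mode, 1: user mode */
#define PF_RSVD  (1u << 3)  /* reserved bit set in a page-table entry */
#define PF_INSTR (1u << 4)  /* fault happened on an instruction fetch */

/*
 * Hypothetical helper: returns 1 when the error code describes a kernel-mode
 * instruction fetch from a not-present page, which is what a call through a
 * NULL (or otherwise bogus) function pointer typically produces.
 */
int is_kernel_bad_jump(uint32_t error_code)
{
	return (error_code & PF_INSTR) &&
	       !(error_code & PF_USER) &&
	       !(error_code & PF_PROT);
}

/* Print a one-line description of the fault, similar in spirit to the #PF lines. */
void describe_fault(uint32_t error_code)
{
	printf("%s %s in %s mode, %s page\n",
	       (error_code & PF_INSTR) ? "instruction" : "data",
	       (error_code & PF_INSTR) ? "fetch" :
	       (error_code & PF_WRITE) ? "write" : "read",
	       (error_code & PF_USER) ? "user" : "kernel",
	       (error_code & PF_PROT) ? "present (protection fault)" : "not-present");
}
```

With the value from the oops, `is_kernel_bad_jump(0x10)` returns 1, which matches the "supervisor instruction fetch in kernel mode" and "not-present page" lines together with `RIP: 0010:0x0`.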

@thememika

I think I found where it jumps to 0: https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/clock.c?id=3dfbfa502c5592e1e29ae0a4ac6eb14b1c4aad85#n28

if (time_after_eq64((u64) atomic64_read(&clock->now), timer->expire)) {
	spin_unlock(&clock->timer_lock);
	timer->fn(timer);	/* <---- jumps to 0 here */
	return;
}

timer is of type pointer to 'struct io_timer'. This structure is declared in fs/bcachefs/clock_types.h:

typedef void (*io_timer_fn)(struct io_timer *);

struct io_timer {
	io_timer_fn		fn;
	void			*fn2;
	u64			expire;
};

fn is a pointer to a function that takes a 'struct io_timer *' and returns void.
It is NULL at the point where it is called, but I don't understand why yet.
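To make the failure mode concrete, here is a minimal user-space sketch, not the bcachefs code. The struct layout is copied from the snippet above (with `uint64_t` standing in for `u64`). The helper `io_timer_fire_checked`, the callback `count_fire` and the counter `fired` are hypothetical names invented for this illustration. It only shows how a timer whose `fn` was never assigned could be caught before the indirect call. It does not say why `fn` is NULL in this crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct io_timer;
typedef void (*io_timer_fn)(struct io_timer *);

/* Same fields as the struct quoted from clock_types.h. */
struct io_timer {
	io_timer_fn	fn;
	void		*fn2;
	uint64_t	expire;
};

/* Hypothetical counter so the example callback has an observable effect. */
int fired;

/* Hypothetical callback used only in this sketch. */
void count_fire(struct io_timer *timer)
{
	(void)timer;
	fired++;
}

/*
 * Hypothetical guarded dispatch: returns 0 if the callback ran, -1 if
 * timer->fn was NULL. An unguarded timer->fn(timer) with fn == NULL is an
 * indirect call to address 0, which is what RIP: 0010:0x0 looks like.
 */
int io_timer_fire_checked(struct io_timer *timer)
{
	assert(timer != NULL);

	if (!timer->fn) {
		fprintf(stderr, "io_timer %p: fn is NULL, not calling it\n",
			(void *)timer);
		return -1;
	}

	timer->fn(timer);
	return 0;
}
```

For a zero-initialised `struct io_timer t = {0};`, `io_timer_fire_checked(&t)` returns -1 instead of faulting. After `t.fn = count_fire;` it returns 0 and increments `fired`.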

@thememika

thememika commented Jan 11, 2025

@koverstreet can you help? I didn't edit bcachefs code.
My previous commit was https://evilpiepirate.org/git/bcachefs.git/commit/?h=bcachefs-testing&id=2794dc06189a973519cef3b44d04e73f8f059ce9.
I cloned the latest state of your master branch (with the fix for erofs_journal) and copied its bcachefs folder into my tree. There weren't any conflicts during the build (this is how I always update).
