Latest:
------
For even more detail, use "git log" or visit https://github.com/LINBIT/drbd/commits/master.
9.2.12 (api:genl2/proto:86-101,118-122/transport:19)
--------
* Fix a complicated distributed deadlock corner case that caused
DRBD to be unable to reconnect after losing connection during
a resync
* Fix the RDMA transport for use with an Intel card; fixed various
aspects where we depended on Mellanox cards' behavior
* Changes merged from 9.1.22
- Fix a corner case that can happen when DRBD establishes multiple
connections in parallel, which could lead one connection to end up in
an inconsistent replication state of WFBitMapT/Established
- Fix a corner case in which a reconciliation resync ends up in
WFBitMapT/Established
- Restrict protocol compatibility to the most recent 8.4 and 9.0 releases
- Fix a corner case causing a module ref leak on drbd_transport_tcp;
if it hits, you cannot rmmod the module
- rate-limit resync progress while resync is paused
- resync-target inherits history UUIDs when resync finishes,
this can prevent unexpected "unrelated data" events later
- Updated compatibility code for Linux 6.11 and 6.12
9.2.11 (api:genl2/proto:86-122/transport:19)
--------
* Do not block del-minor or down operations if the RDMA/Infiniband
stack cleans up slowly.
* Changes merged from 9.1.22
- Upgrade from partial resync to a full resync if necessary when the
user manually resolves a split-brain situation
- Fix a potential NULL deref when a disk fails while doing a
forget-peer operation.
- Fix a rcu_read_lock()/rcu_read_unlock() imbalance
- Restart the open() syscall when a process auto promoting a drbd device gets
interrupted by a signal
- Remove a deadlock that sometimes caused DRBD to connect
exceptionally slowly
- Make detach operations interruptible
- Added dev_is_open to events2 status information
- Improve log readability for 2PC state changes and drbd-threads
- Updated compatibility code for Linux 6.9
9.2.10 (api:genl2/proto:86-122/transport:19)
--------
* Changes merged from 9.1.21
- fix a deadlock that can trigger when deleting a connection and
another connection going down in parallel. This is a regression of
9.1.20
- Fix an out-of-bounds access when scanning the bitmap. It leads to a
crash when the bitmap ends on a page boundary, and this is also a
regression in 9.1.20.
9.2.9 (api:genl2/proto:86-122/transport:19)
--------
* Allow resync operations between secondaries if the sync source is
not connected with the primary node
* Changes merged from 9.1.20
- Fix a kernel crash that is sometimes triggered when downing drbd
resources in a specific, unusual order (was triggered by the
Kubernetes CSI driver)
- Fix a rarely triggering kernel crash upon adding paths to a
connection by overhauling the path lists' locking
- Fix the continuation of an interrupted initial resync
- Fix the state engine so that an incapable primary does not outdate
indirectly reachable secondary nodes
- Fix a logic bug that caused drbd to pretend that a peer's disk is
outdated when doing a manual disconnect on a down connection; this
fix also cures the impact on fencing and quorum.
- Fix forceful demotion of suspended devices
- Overhaul the build system to apply compatibility patches out of
place; this allows one to build for different target kernels from a
single drbd source tree
- Updated compatibility code for Linux 6.8
9.2.8 (api:genl2/proto:86-122/transport:19)
--------
* Fix the not-terminating-resync phenomenon between two nodes with
backing disk in the presence of a diskless primary node under
heavy I/O
* Fix a rare race condition aborting connections claiming wrong
protocol magic
* Fix various problems of the checksum-based resync, including kernel
crashes
* Fix soft lockup messages in the RDMA transport under heavy I/O
* changes merged from drbd-9.1.19
- Fix a resync decision case where drbd wrongly decided to do a full
resync, where a partial resync was sufficient; that happened in a
specific connect order when all nodes were on the same data
generation (UUID)
- Fix the online resize code to obey cached size information about
temporarily unreachable nodes
- Fix a rare corner case in which DRBD on a diskless primary node
failed to re-issue a read request to another node with a backing
disk upon connection loss on the connection where it shipped the
read request initially
- Make timeout during promotion attempts interruptible
- No longer write activity-log updates on the secondary node in a
cluster with precisely two nodes with backing disk; this is a
performance optimization
- Reduce CPU usage of acknowledgment processing
9.2.7 (api:genl2/proto:86-122/transport:19)
--------
* Fixed wrong tx-timeouts in the RDMA transport
* Recreate buffers promptly for the RDMA transport, improving
performance a lot
* changes merged from drbd-9.1.18
- Fixed connecting nodes with differently sized backing disks,
specifically when the smaller node is primary, before establishing
the connections
- Fixed thawing a device that has I/O frozen after loss of quorum
when a configuration change eases its quorum requirements
- Properly fail TLS if requested (only available in drbd-9.2)
- Fixed a race condition that can cause auto-demote to trigger right
after an explicit promote
- Fixed a rare race condition that could mess up the handshake result
before it is committed to the replication state.
- Preserve "tiebreaker quorum" over a reboot of the last node (3-node
clusters only)
- Update compatibility code for Linux 6.6
9.2.6 (api:genl2/proto:86-122/transport:19)
--------
* a series of fixes to the RDMA transport, making it compatible with
more recent Mellanox cards and fixes in general to the RDMA code
* Tuning parameter rdma-ctrl-(snd|rcv)buf-size for fine tuning
* Makefile updates for compiling with OFED
* optional TLS encryption for the TCP transport, based on kTLS with
TLS handshakes in userspace
* a new load-balancing TCP transport "lb-tcp" that establishes all
configured paths in parallel and distributes the packet load
over them
* a new config net option 'load-balance-paths' that eases
the steps of renaming the transports tcp to tcp-legacy and
lb-tcp to tcp and the final removal of the older tcp
implementation
* changes merged from drbd-9.1.17
- fix a potential crash when configuring drbd to bind to a
non-existent local IP address (this is a regression of drbd-9.1.8)
- Cure a very rarely triggered race condition bug while
establishing connections; when triggered, it caused an OOPS
hinting at list corruption
- fix a race condition regarding operations on the bitmap while
forgetting a bitmap slot and a pointless warning
- Fix handling of unexpected (on a resource in secondary role) write
requests
- Fix a corner case that can cause a process to hang when closing the
DRBD device, while a connection gets re-established
- Correctly block signal delivery during auto-demote
- Improve the reliability of establishing connections
- Do not clear the transport with `net-options --set-defaults`. This
fix avoids unexpected disconnect/connect cycles upon an `adjust`
when using the 'lb-tcp' or 'rdma' transports in drbd-9.2.
- New netlink packet to report path status to drbdsetup
- Improvements to the content and rate-limiting of many log messages
- Update compatibility code and follow Linux upstream development
until Linux 6.5
9.2.5 (api:genl2/proto:86-122/transport:18)
--------
* changes merged from drbd-9.1.16
- shorten the time DRBD keeps IRQs disabled on one CPU; long periods
with IRQs disabled could lead to connection interruption under
specific conditions
- fix a corner case where resync did not start after resync-pause
state flapped
- fix online adding of volumes/minors to an already connected resource
- fix a possible split-brain situation with quorum enabled and
ping-timeout set to an (unusually) high value
- fix a locking problem that could lead to kernel OOPS
- ensure resync can continue (bitmap-based) after interruption
also when it started as a full-resync first
- correctly handle meta-data when forgetting diskless peers
- fix a possibility of getting a split-brain although quorum is enabled
- correctly propagate UUIDs after resync following a resize operation.
The consequence could be a full resync instead of a bitmap-based one
- fix a rare race condition that can cause a drbd device to end up
with WFBitMapS/Established replication states
9.2.4 (api:genl2/proto:86-122/transport:18)
--------
* fix a possible deadlock when disconnecting during a resync
* fix a possible hard kernel-lockup
* changes merged from drbd-9.1.15
- fix how flush requests are marked when submitted to the Linux IO
stack on the secondary node
- when establishing a connection failed with a two-pc timeout, a
receiver thread deadlocked, causing drbdsetup calls to block on
that resource (difficult to trigger)
- fixed a NULL-ptr deref (an OOPS) caused by a rare race condition
while taking a resource down
- updated kernel compatibility to at least Linux head and also fixed
a bug in the compat checks/rules that caused OOPSes of the previous
drbd releases when compiled with Linux-6.2 (or on RHEL 9.2 kernel).
- fix an aspect of the data-generation (UUID) handling where DRBD
failed to do a resync when a diskless node in the remaining
partition promotes and demotes while a diskful node is isolated
- fix an aspect of the data-generation (UUID) handling where DRBD
considered a node to have unrelated data; this bug was triggered by
a sequence involving removing two nodes from a cluster and re-adding
one with the "day-0" UUIDs.
- do not block specific state changes (promote, demote, attach, and
detach) when only some nodes add a new minor
9.2.3 (api:genl2/proto:86-122/transport:18)
--------
* improve matching ACKs to in-memory request objects;
inexact matches were a source of unexpected connection losses
* merge changes from drbd-9.1.14
- fix a race with concurrent promotion and demotion, which can
lead to an unexpected "split-brain" later on
- fix a specific case where promotion was allowed where it should not
- fix a race condition between auto-promote and a second two-phase
commit that can lead to a DRBD thread locking up in an endless loop
- fix several bugs with "resync-after":
- missing resync-resume when minor numbers run in the opposite
direction to the resync-after dependencies
- a race that might lead to an OOPS in add_timer()
- fix an OOPS when reading from in_flight_summary in debugfs
- fix a race that might lead to an endless loop of printing
"postponing start_resync" while starting a resync
- fix diskless node with a diskful peer that has a 4KiB backend
- simplify remembering two-pc parents, maybe fixing a one-time-seen bug
- derive abort_local_transaction timeout from ping-timeout
9.2.2 (api:genl2/proto:86-121/transport:18)
--------
* fix spurious PingAck timeout a second time; we need to use a drbd
owned workqueue to guarantee the required low-latency replies
* Fix connection abort during resync with log message
"Unexpected resync write ack at ...", a regression of drbd-9.2
* fix a race condition that can lead to NULL-ptr deref during resync
* merged changes from drbd-9.1.13
- when calculating if a partition has quorum, take into account if
the missing nodes might have quorum
- fix forget-peer for diskless peers
- clear the resync_again counter upon disconnect
- also call the unfence handler when no resync happens
- do not set bitmap bits when attaching to an up-to-date disk (late)
- work on bringing the out-of-tree DRBD9 closer to DRBD in the upstream
kernel; use lru_cache.ko from the installed kernel whenever possible
9.2.1 (api:genl2/proto:86-121/transport:18)
--------
* fix spurious PingAck timeout, a regression of ack-processing in bottom half
(introduced with 9.2.0)
* support merging of discards during resync even if the discard granularity
of the backing device is larger than 128MiB
* merged changes from the drbd-9.1 branch, including
- fix a race that could result in connection attempts getting aborted
with the message "sock_recvmsg returned -11"
- rate limit messages in case the peer cannot write to the backing storage
and it does not finish the necessary state transitions
- reduced the receive timeout during connecting to the intended 5 seconds
(ten times ping-ack timeout)
- losing the connection at a specific point in time during establishing
a connection could cause a transition to StandAlone; fixed that, so
that it keeps trying to connect
- fix a race that could lead to a fence-peer handler being called
unexpectedly when the fencing policy is changed at the moment before
promoting
9.2.0 (api:genl2/proto:86-121/transport:18)
--------
* merged changes from the drbd-9.1 branch, including
- fix a race that could lead to an unexpected loss of connection
related to internal concurrency; it manifested itself as
"BAD! BarrierAck #X received with n_writes=..." in the logs
- fix a race that could lead to an unexpected loss of connection
if a node in the quorate partition promotes (too) quickly
- follow upstream and compat code for up to Linux 5.19
9.2.0-rc.8 (api:genl2/proto:86-121/transport:18)
--------
* fix the RDMA transport to work on more recent kernels (like RHEL9)
* improve transmit timeout handling of the RDMA transport
* register DRBD as a pernet device for namespace management
* merged fixes from the drbd-9.1 branch, including
- request handling (9.1.11)
- fix quorum when fresh nodes join a quorate but incomplete partition
- minor state handling fixes/improvements
9.2.0-rc.7 (api:genl2/proto:86-121/transport:18)
--------
* support for network namespaces
* multiple fixes to the merge-discards-during-resync functionality
* fix reference counting in AL with drbd-8.4 peers
* stricter limit for the set of characters allowed in resource names
* merge changes from the drbd 9.1.9 release
9.2.0-rc.6 (api:genl2/proto:86-121/transport:18)
--------
* fixes to the new way of coordinating resync and application IO
* merge discard requests during resync on a resync target node;
This can speed up the resync progress by multiple orders of magnitude
* Merged changes from the 9.1.8 release
- restore protocol compatibility with drbd-8.4
- detect peers that died silently when starting a two-phase-commit
- correctly abort two-phase-commits when a connection breaks between
phases 1 and 2
- allow re-connect to a node that was forced into secondary role and
where an opener is still present from the last time it was primary
- fix a race condition that allowed configuring two peers with the
same node id
- ensure that an open() call fails within the auto-promote timeout
if it can not succeed
- build fixes for RHEL9
- following upstream changes to DRBD up to Linux 5.17 and updated compat
9.2.0-rc.4 (api:genl2/proto:110-121/transport:18)
--------
* Merged fixes from the 9.1.6 release
- fix IO to internal meta-data for backing devices larger than 128TB
- fix resending requests towards diskless peers, this is relevant when
fencing is enabled, but the connection is re-established before fencing
succeeds; when the bug triggered it led to "stuck" requests
- remove lockless buffer pages handling; it still contained very hard to
trigger bugs
- make sure DRBD's resync does not cause unnecessary allocation in
a thinly provisioned backing device on a resync target node
- avoid unnecessary resync (or split-brain) due to a wrongly generated
new current UUID when an already IO frozen DRBD gets new writes
- small fix to autopromote, when an application tries a read-only open
before it does a read-write open immediately after the peer primary
vanished ungracefully
- split out the secure boot key into a package on its own, that is
necessary to allow installation of multiple drbd kernel module packages
- Support for building containers for Flatcar Linux
9.2.0-rc.3 (api:genl2/proto:110-121/transport:18)
--------
* fix a corner case that might cause conflicting requests (touching
a storage area that is under resync) to not terminate
* merge from drbd-9.0: fix failing read-only open immediately after
a primary peer left the cluster ungracefully
9.2.0-rc.2 (api:genl2/proto:110-121/transport:18)
--------
* Fix broken wire compatibility with drbd-9.x
* Fix DKMS builds when the kernel config has CONFIG_INFINIBAND=n
9.2.0-rc.1 (api:genl2/proto:110-121/transport:18)
--------
* was forked off between 9.1.4 and 9.1.5
* implemented a genuine way of coordinating resync and application IO;
removed the internal resync_lru
* receive and process ack-packets in TCP/RDMA SOFTIRQ context, which
improves latency on all write operations; removed ack_receiver thread
* add RDMA transport for IB, RoCE networking
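
Note on the release headers: each release above starts with a line like
"9.2.12 (api:genl2/proto:86-101,118-122/transport:19)". The following is a
minimal sketch, not part of DRBD, of how such a header could be parsed to
see which protocol numbers two releases have in common. It assumes that the
proto field lists inclusive ranges of supported protocol versions; the
function names parse_header and common_protocols are made up for this
illustration.

```python
import re

# Matches e.g. "9.2.12 (api:genl2/proto:86-101,118-122/transport:19)"
HEADER_RE = re.compile(
    r"^(?P<version>\S+) \(api:(?P<api>[^/]+)"
    r"/proto:(?P<proto>[^/]+)/transport:(?P<transport>\d+)\)$"
)


def parse_header(line):
    """Parse one release header line; return None if it is not a header."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    ranges = []
    for part in m.group("proto").split(","):
        lo, _, hi = part.partition("-")
        # Assumption: "86-101" is an inclusive range, a bare "110" one value.
        ranges.append((int(lo), int(hi or lo)))
    return {
        "version": m.group("version"),
        "api": m.group("api"),
        "proto": ranges,
        "transport": int(m.group("transport")),
    }


def common_protocols(a, b):
    """Return the sorted protocol numbers listed by both parsed headers."""
    def expand(ranges):
        return {n for lo, hi in ranges for n in range(lo, hi + 1)}
    return sorted(expand(a["proto"]) & expand(b["proto"]))


new = parse_header("9.2.12 (api:genl2/proto:86-101,118-122/transport:19)")
old = parse_header("9.2.0-rc.1 (api:genl2/proto:110-121/transport:18)")
print(new["proto"])                 # [(86, 101), (118, 122)]
print(common_protocols(new, old))   # [118, 119, 120, 121]
```

Whether two nodes can actually talk to each other depends on more than this
overlap; the sketch only reads what the header lines state.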