ChangeLog

Latest:
------
 For even more detail, use "git log" or visit https://github.com/LINBIT/drbd/commits/master.

9.2.12 (api:genl2/proto:86-101,118-122/transport:19)
--------
 * Fix a complicated distributed deadlock corner case that caused
   DRBD to being unable to reconnect after losing connection during
   a resync
 * Fix the RDMA transport for use with an intel card; fixed various
   aspects where we depended on Mellanox cards' behavior
 * Changes merged from 9.1.22
  - Fix a corner case that can happen when DRBD establishes multiple
    connections in parallel, which could lead one connection to end up in
    an inconsistent replication state of WFBitMapT/Established
  - Fix a corner case in which a reconciliation resync ends up in
    WFBitMapT/Established
  - Restrict protocol compatibility to the most recent 8.4 and 9.0 releases
  - Fix a corner case causing a module ref leak on drbd_transport_tcp;
    if it hits, you can not rmmod it
  - rate-limit resync progress while resync is paused
  - resync-target inherits history UUIDs when resync finishes,
    this can prevent unexpected "unrelared data" events later
  - Updated compatibility code for Linux 6.11 and 6.12

9.2.11 (api:genl2/proto:86-122/transport:19)
--------
 * Do not block del-minor or down operations if the RDMA/Infiniband
   stack cleans up slowly.
 * Changes merged from 9.1.22
  - Upgrade from partial resync to a full resync if necessary when the
    user manually resolves a split-brain situation
  - Fix a potential NULL deref when a disk fails while doing a
    forget-peer operation.
  - Fix a rcu_read_lock()/rcu_read_unlock() imbalance
  - Restart the open() syscall when a process auto promoting a drbd device gets
    interrupted by a signal
  - Remove a deadlock that caused DRBD to connect sometimes
    exceptionally slow
  - Make detach operations interruptible
  - Added dev_is_open to events2 status information
  - Improve log readability for 2PC state changes and drbd-threads
  - Updated compability code for Linux 6.9

9.2.10 (api:genl2/proto:86-122/transport:19)
--------
 * Changes merged from 9.1.21
  - fix a deadlock that can trigger when deleting a connection and
    another connection going down in parallel. This is a regression of
    9.1.20
  - Fix an out-of-bounds access when scanning the bitmap. It leads to a
    crash when the bitmap ends on a page boundary, and this is also a
    regression in 9.1.20.

9.2.9 (api:genl2/proto:86-122/transport:19)
--------
 * Allow resync operations between secondaries if the sync source is
   not connected with the primary node
 * Changes merged from 9.1.20
  - Fix a kernel crash that is sometimes triggered when downing drbd
    resources in a specific, unusual order (was triggered by the
    Kubernetes CSI driver)
  - Fix a rarely triggering kernel crash upon adding paths to a
    connection by rehauling the path lists' locking
  - Fix the continuation of an interrupted initial resync
  - Fix the state engine so that an incapable primary does not outdate
    indirectly reachable secondary nodes
  - Fix a logic bug that caused drbd to pretend that a peer's disk is
    outdated when doing a manual disconnect on a down connection; with
    that cured impact on fencing and quorum.
  - Fix forceful demotion of suspended devices
  - Rehaul of the build system to apply compatibility patches out of
    place that allows one to build for different target kernels from a
    single drbd source tree
  - Updated compability code for Linux 6.8

9.2.8 (api:genl2/proto:86-122/transport:19)
--------
 * Fix the not-terminating-resync phenomenon between two nodes with
   backing disk in the presence of a diskless primary node under
   heavy I/O
 * Fix a rare race condition aborting connections claiming wrong
   protocol magic
 * Fix various problems of the checksum-based resync, including kernel
   crashes
 * Fix soft lockup messages in the RDMA transport under heavy I/O
 * changes merged from drbd-9.1.19
  - Fix a resync decision case where drbd wrongly decided to do a full
    resync, where a partial resync was sufficient; that happened in a
    specific connect order when all nodes were on the same data
    generation (UUID)
  - Fix the online resize code to obey cached size information about
    temporal unreachable nodes
  - Fix a rare corner case in which DRBD on a diskless primary node
    failed to re-issue a read request to another node with a backing
    disk upon connection loss on the connection where it shipped the
    read request initially
  - Make timeout during promotion attempts interruptible
  - No longer write activity-log updates on the secondary node in a
    cluster with precisely two nodes with backing disk; this is a
    performance optimization
  - Reduce CPU usage of acknowledgment processing

9.2.7 (api:genl2/proto:86-122/transport:19)
--------
 * Fixed wrong tx-timeouts in the RDMA transport
 * Recreate buffers promptly for the RDMA transport, improving
   performance a lot
 * changes merged from drbd-9.1.18
  - Fixed connecting nodes with differently sized backing disks,
    specifically when the smaller node is primary, before establishing
    the connections
  - Fixed thawing a device that has I/O frozen after loss of quorum
    when a configuration change eases its quorum requirements
  - Properly fail TLS if requested (only available in drbd-9.2)
  - Fixed a race condition that can cause auto-demote to trigger right
    after an explicit promote
  - Fixed a rare race condition that could mess up the handshake result
    before it is committed to the replication state.
  - Preserve "tiebreaker quorum" over a reboot of the last node (3-node
    clusters only)
  - Update compatibility code for Linux 6.6

9.2.6 (api:genl2/proto:86-122/transport:19)
--------
 * a series of fixes to the RDMA transport, making it compatible with
   more recent Mellanox cards and fixes in general to the RDMA code
 * Tuning parameter rdma-ctrl-(snd|rcv)buf-size for fine tuning
 * Makefile updates for compiling with OFED
 * optional TLS encryption for the TCP transport, based on kTLS with
   TLS handshakes in userspace
 * a new load-balancing TCP transport "lb-tcp" that establises all
   configured paths in paralle and distributes the packet load
   over them
 * a new config net option 'load-balance-paths' that easens
   the steps of renaming the transports tcp to tcp-legacy and
   lb-tcp to tcp and the final removal of the older tcp
   implementation
 * changes merged from drbd-9.1.17
  - fix a potential crash when configuring drbd to bind to a
    non-existent local IP address (this is a regression of drbd-9.1.8)
  - Cure a very seldom triggering race condition bug during
    establishing connections; when you triggered it, you got an OOPS
    hinting to list corruption
  - fix a race condition regarding operations on the bitmap while
    forgetting a bitmap slot and a pointless warning
  - Fix handling of unexpected (on a resource in secondary role) write
    requests
  - Fix a corner case that can cause a process to hang when closing the
    DRBD device, while a connection gets re-established
  - Correctly block signal delivery during auto-demote
  - Improve the reliability of establishing connections
  - Do not clear the transport with `net-options --set-defaults`. This
    fix avoids unexpected disconnect/connect cycles upon an `adjust`
    when using the 'lb-tcp' or 'rdma' transports in drbd-9.2.
  - New netlink packet to report path status to drbdsetup
  - Improvements to the content and rate-limiting of many log messages
  - Update compatibility code and follow Linux upstream development
    until Linux 6.5

9.2.5 (api:genl2/proto:86-122/transport:18)
--------
 * changes merged from drbd-9.1.16
  - shorten times DRBD keeps IRQs on one CPU disabled. Could lead
    to connection interruption under specific conditions
  - fix a corner case where resync did not start after resync-pause
    state flapped
  - fix online adding of volumes/minors to an already connected resource
  - fix a possible split-brain situation with quorum enabled with
    ping-timeout set to (unusual) high value
  - fix a locking problem that could lead to kernel OOPS
  - ensure resync can continue (bitmap-based) after interruption
    also when it started as a full-resync first
  - correctly handle meta-data when forgetting diskless peers
  - fix a possibility of getting a split-brain although quorum enabled
  - correctly propagate UUIDs after resync following a resize operation.
    Consequence could be a full resync instead of a bitmap-based one
  - fix a rare race condition that can cause a drbd device to end up
    with WFBitMapS/Established replication states

9.2.4 (api:genl2/proto:86-122/transport:18)
--------
 * fix a possible deadlock when disconnecting during a resync
 * fix a possible hard kernel-lockup
 * changes merged from drbd-9.1.15
  - fix how flush requests are marked when submitted to the Linux IO
    stack on the secondary node
  - when establishing a connection failed with a two-pc timeout, a
    receiver thread deadlocked, causing drbdsetup calls to block on
    that resource (difficult to trigger)
  - fixed a NULL-ptr deref (a OOPS) caused by a rare race condition
    while taking a resource down
  - updated kernel compatibility to at least Linux head and also fixed
    a bug in the compat checks/rules that caused OOPSes of the previous
    drbd releases when compiled with Linux-6.2 (or on RHEL 9.2 kernel).
  - fix an aspect of the data-generation (UUID) handling where DRBD
    failed to do a resync when a diskless node in the remaining
    partition promotes and demotes while a diskful node is isolated
  - fix an aspect of the data-generation (UUID) handling where DRBD
    considered a node to have unrelated data; this bug was triggered by
    a sequence involving removing two nodes from a cluster and readding
    one with the "day-0" UUIDs.
  - do not block specific state changes (promote, demote, attach, and
    detach) when only some nodes add a new minor

9.2.3 (api:genl2/proto:86-122/transport:18)
--------
 * improve matching ACKs to in-memory request objects;
   inexact matches were a source of unexpected connection losses
 * merge changes from drbd-9.1.14
  - fix a race with concurrent promotion and demotion, which can
    lead to an unexpected "split-brain" later on
  - fix a specific case where promotion was allowed where it should not
  - fix a race condition between auto-promote and a second two-phase
    commit that can lead to a DRBD thread locking up in an endless loop
  - fix several bugs with "resync-after":
   - missing resync-resume when minor numbers run in opposite
     direction as the resync-after dependencies
   - a race that might lead to an OOPS in add_timer()
  - fix an OOPS when reading from in_flight_summary in debugfs
  - fix a race that might lead to an endless loop of printing
    "postponing start_resync" while starting a resync
  - fix diskless node with a diskfull with a 4KiB backend
  - simplify remembering two-pc parents, maybe fixing a one-time-seen bug
  - derive abort_local_transaction timeout from ping-timeout

9.2.2 (api:genl2/proto:86-121/transport:18)
--------
 * fix spurious PingAck timeout a second time; we need to use a drbd
   owned workqueue to guarantee the required low-latency replies
 * Fix connection abort during resync with log message
   "Unexpected resync write ack at ..." a regression of drbd-9.2
 * fix a race condition that can lead to NULL-ptr deref during resync
 * merged changes from drbd-9.1.13
  - when calculating if a partition has quorum, take into account if
    the missing nodes might have quorum
  - fix forget-peer for diskless peers
  - clear the resync_again counter upon disconnect
  - also call the unfence handler when no resync happens
  - do not set bitmap bits when attaching to an up-to-date disk (late)
  - work on bringing the out-of-tree DRBD9 closer to DRBD in the upstream
    kernel; Use lru_cahche.ko from the installed kernel whenever possible

9.2.1 (api:genl2/proto:86-121/transport:18)
--------
 * fix spurious PingAck timeout, a regression of ack-processing in bottom half
   (introduced with 9.2.0)
 * support merging of discards during resync even if the discard granularity
   of the backing device is larger than 128MiB
 * merged changes from the drbd-9.1 branch, including
  - fix a race that could result in connection attempts getting aborted
    with the message "sock_recvmsg returned -11"
  - rate limit messages in case the peer can not write the backing storage
    and it does not finish the necessary state transitions
  - reduced the receive timeout during connecting to the intended 5 seconds
    (ten times ping-ack timeout)
  - losing the connection at a specific point in time during establishing
    a connection could cause a transition to StandAlone; fixed that, so
    that it keeps trying to connect
  - fix a race that could lead to a fence-peer handler being called
    unexpectedly when the fencing policy is changed at the moment before
    promoting

9.2.0 (api:genl2/proto:86-121/transport:18)
--------
 * merged changes from the drbd-9.1 branch, including
  - fix a race that could lead to an unexpected loss of connection
    related to internal concurrency; it manifested itself as
    "BAD! BarrierAck #X received with n_writes=..." in the logs
  - fix a race that could lead to an unexpected loss of connection
    if a node in the quorate partition promotes (too) quickly
  - follow upstream and compat code for up to Linux 5.19

9.2.0-rc.8 (api:genl2/proto:86-121/transport:18)
--------
 * fix the RDMA transport to work on more recent kernels (like RHEL9)
 * improve transmit timeout handling of the RDMA transport
 * register DRBD has pernet device for namespace management
 * merged fixes from the drbd-9.1 branch, including
  - request handling (9.1.11)
  - fix quorum when fresh nodes join a quorate but incomplete partition
  - minor state handling fixes/improvements

9.2.0-rc.7 (api:genl2/proto:86-121/transport:18)
--------
 * support for network namespaces
 * multiple fixes to the merge-discards-during-resync functionality
 * fix reference counting in AL with drbd-8.4 peers
 * stricter limit for the set of characters allowed in resource names
 * merge changes from the drbd 9.1.9 release

9.2.0-rc.6 (api:genl2/proto:86-121/transport:18)
--------
 * fixes to the new way of coordinating resync and application IO
 * merge discard requests during resync on a resync target node;
   This can speed up the resync progress by multiple orders of magnitude
 * Merged changes from the 9.1.8 release
  - restore protocol compatibility with drbd-8.4
  - detect peers that died silently when starting a two-phase-commit
  - correctly abort two-phase-commits when a connection breaks between
    phases 1 and 2
  - allow re-connect to a node that was forced into secondary role and
    where an opener is still present from the last time it was primary
  - fix a race condition that allowed to configure two peers with the
    same node id
  - ensure that an open() call fails within the auto-promote timeout
    if it can not succeed
  - build fixes for RHEL9
  - following upstream changes to DRBD up to Linux 5.17 and updated compat

9.2.0-rc.4 (api:genl2/proto:110-121/transport:18)
--------
 * Merged fixes from the 9.1.6 release
  - fix IO to internal meta-data for backing device larger than 128TB
  - fix resending requests towards diskless peers, this is relevant when
    fencing is enabled, but the connection is re-established before fencing
    succeeds; when the bug triggered it lead to "stuck" requests
  - remove lockless buffer pages handling; it still contained very hard to
    trigger bugs
  - make sure DRBD's resync does not cause unnecessary allocation in
    a thinly provisioned backing device on a resync target node
  - avoid unnecessary resync (or split-brain) due to a wrongly generated
    new current UUID when an already IO frozen DBRD gets new writes
  - small fix to autopromote, when an application tries a read-only open
    before it does a read-write open immediately after the peer primary
    vanished ungracefully
  - split out the secure boot key into a package on its own, that is
    necessary to allow installation of multiple drbd kernel module packages
  - Support for building containers for flacar linux

9.2.0-rc.3 (api:genl2/proto:110-121/transport:18)
--------
 * fix a corner case that might cause conflicting requests (touching
   a storage area that is under resync) to not terminate
 * merge from drbd-9.0: fix failing read-only open immediately after
   a primary peer left the cluster ungracefully

9.2.0-rc.2 (api:genl2/proto:110-121/transport:18)
--------
 * Fix broken wire compatibility with drbd-9.x
 * Fix DKMS builds when the kernel config has CONFIG_INFINIBAND=n

9.2.0-rc.1 (api:genl2/proto:110-121/transport:18)
--------
 * was forked off between 9.1.4 and 9.1.5
 * implemented a geniue way of coordinating resync and application IO;
   removed the internal resync_lru
 * receive and process ack-packets in TCP/RDMA SOFTIRQ context, that
   improves latency on all write operations; removed ack_receiver thread
 * add RDMA transport for IB, RoCE networking