HDDS-11714. resetDeletedBlockRetryCount with --all may fail and can cause long db lock in large cluster #7665

aryangupta1998 · 2025-01-08T12:14:15Z

What changes were proposed in this pull request?

When resetDeletedBlockRetryCount is run with the --all option, SCM takes a lock, fetches every transaction that has reached the maximum retry count, and then updates the DB to reset each count to 0. In a large-scale environment the number of such transactions can be huge, which leads to two problems:

i) The lock is held for a long time and blocks all other normal operations.

ii) The update is passed through Ratis, and the request can fail because the message is too large.

Instead, this operation should be done in batches to avoid both the long lock hold and the Ratis message size failure.
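The batching idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this patch: the class `BatchedRetryReset`, its in-memory `TreeMap`, and the methods `put`, `nextFailedBatch`, and `resetAll` are hypothetical stand-ins for the SCM table, the Ratis-replicated update, and the real DeletedBlockLog API. The only detail taken from the diff is that a retry count of -1 marks a transaction that has exhausted its retries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the deleted-block log: txID -> retry count.
// A count of -1 marks a transaction that has exhausted its retries.
class BatchedRetryReset {
  private final TreeMap<Long, Integer> retryCounts = new TreeMap<>();
  private final ReentrantLock lock = new ReentrantLock();

  void put(long txId, int count) {
    lock.lock();
    try {
      retryCounts.put(txId, count);
    } finally {
      lock.unlock();
    }
  }

  // Collect at most batchSize failed txIDs, scanning from startTxId upward.
  // The lock is held only for the duration of this bounded scan.
  List<Long> nextFailedBatch(int batchSize, long startTxId) {
    List<Long> batch = new ArrayList<>();
    lock.lock();
    try {
      for (Map.Entry<Long, Integer> e
          : retryCounts.tailMap(startTxId, true).entrySet()) {
        if (batch.size() >= batchSize) {
          break;
        }
        if (e.getValue() == -1) {
          batch.add(e.getKey());
        }
      }
    } finally {
      lock.unlock();
    }
    return batch;
  }

  // Reset all failed transactions, one bounded batch at a time.
  // Each batch would correspond to one bounded-size replicated request
  // in a real system; here it is just a local map update.
  int resetAll(int batchSize) {
    int total = 0;
    long startTxId = 0;
    while (true) {
      List<Long> batch = nextFailedBatch(batchSize, startTxId);
      if (batch.isEmpty()) {
        break;
      }
      lock.lock();
      try {
        for (Long txId : batch) {
          retryCounts.put(txId, 0);
        }
      } finally {
        lock.unlock();
      }
      total += batch.size();
      // Resume the next scan just past the last txID handled.
      startTxId = batch.get(batch.size() - 1) + 1;
    }
    return total;
  }
}
```

Because the lock is released between batches, other operations can interleave with the reset, and each update carries at most `batchSize` txIDs. A suitable batch size would depend on the configured Ratis message limit, which this sketch does not model.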

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11714

How was this patch tested?

Tested Manually.

nandakumar131 · 2025-01-10T16:00:03Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java

+  @Override
+  public List<DeletedBlocksTransaction> getFailedTransactionsBatch(
+      int batchSize, long startTxId) throws IOException {
+    List<DeletedBlocksTransaction> failedTXs = new ArrayList<>();
+
+    lock.lock();
+    try {
+      try (
+          TableIterator<Long, ? extends Table.KeyValue<Long, DeletedBlocksTransaction>> iter =
+              deletedBlockLogStateManager.getReadOnlyIterator()) {
+
+        iter.seek(startTxId);
+
+        while (iter.hasNext() && failedTXs.size() < batchSize) {
+          DeletedBlocksTransaction delTX = iter.next().getValue();
+          if (delTX.getCount() == -1) {
+            failedTXs.add(delTX);
+          }
+        }
+      }
+    } finally {
+      lock.unlock();
+    }
+
+    return failedTXs;
+  }
+


Why do we need this additional method? The same thing can be achieved with the existing getFailedTransactions method.

nandakumar131 · 2025-01-10T16:14:04Z

hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/block/DeletedBlockLogImpl.java

+
+        } while (!batch.isEmpty());
+      } else {
+        // Process txIDs provided by the user in batches


The user provided list of txIDs reaches SCM via RPC call, so it's ok to process this in single go.

Aryan Gupta and others added 3 commits January 8, 2025 17:41

HDDS-11714. resetDeletedBlockRetryCount with --all may fail and can c…

8ce09f7

…ause long db lock in large cluster

Fixed TestDeletedBlocksTxnShell.

b8def14

Merge remote-tracking branch 'origin/master' into HDDS-11714

840a7d8

nandakumar131 requested changes Jan 10, 2025

Addressed Comments.

e2df43e
