HDDS-12356. granular locking framework for obs #8217

sumitagrawl · 2025-04-02T13:11:27Z

What changes were proposed in this pull request?

A granular locking framework for OBS in the existing flow. This PR only adds the framework and its bindings; the lock is not yet called in the respective request flows.

Locking is added for external requests at the entry point. This allows requests to execute at the leader and in the existing flow simultaneously without impacting the cache.

Refer to obs-locking.md for the locking added for OBS requests (HDDS-11898. design doc leader side execution).
Refer to leader-execution.md for the step-by-step integration of existing requests (interoperability).

Next PR will include:

integration of locking framework to flow
locking for obs key operation, bucket operation, volume operation and MPU cases
https://issues.apache.org/jira/browse/HDDS-12386 configuration for lock bucket length and timeout

Parent Jira:
https://issues.apache.org/jira/browse/HDDS-11900

Its parent epic:
https://issues.apache.org/jira/browse/HDDS-11897

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12356

How was this patch tested?

Unit tests were added for the lock.

adoroszlai · 2025-04-02T13:13:14Z

Why do we need yet another PR?

errose28 · 2025-04-02T14:25:28Z

@adoroszlai We are working on prototypes to discuss different approaches. I agree it is cluttering the main PR queue, and I suggest that others raise PRs against their own forks and link them in a Jira comment like this instead. cc @szetszwo

adoroszlai · 2025-04-02T14:28:39Z

We are working on prototypes to discuss different approaches

That's fine. But this is the third PR from @sumitagrawl with basically the same text description. Having 3 different approaches is OK even from the same developer, but please describe them properly.

Based on commits, this seems to be a continuation of #8176. There is absolutely no need to open a new PR for addressing review comments, it just splits the conversation unnecessarily.

errose28

Thanks @sumitagrawl. I'm still not sure what the story on the lock stats is. Are we planning to expose these as metrics? They would have to be aggregated so I'm not sure how helpful that would be. Are we planning to log warnings when wait time or hold time crosses a given value? That would probably be useful. How we use the lock stats will depend on whether we need a long running counter to aggregate the information, or a new set of stats for each lock acquisition. Maybe we will need both.

errose28 · 2025-04-02T14:40:28Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+    keyLocks = SimpleStriped.readWriteLock(NUM_KEY_STRIPES, false);
+  }
+
+  public OmLockObject lock(OmLockInfo lockInfo) throws IOException {


Why return OmLockObject and require the caller to pass it back to unlock instead of using AutoCloseable?

OmLockObject is now defined as AutoCloseable, so depending on the use case it can be used with try-with-resources.

We need to return this object because it holds the LockStats, which must be set in the Hadoop RPC metrics for each request calling this interface.
This might also be added to the Ozone metrics to capture lock stats.
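
A minimal sketch of the pattern discussed in this thread: a per-request handle that is AutoCloseable and still exposes its timing stats after it is closed. The names `LockHandleSketch`, `LockHandle`, `getWaitNanos` and `getHeldNanos` are hypothetical and are not taken from the PR; the real OmLockObject may differ.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockHandleSketch {

  /** Hypothetical per-request handle: owns the acquired locks and their timing stats. */
  static final class LockHandle implements AutoCloseable {
    private final Deque<Lock> acquired = new ArrayDeque<>();
    private final long waitNanos;
    private final long acquiredAtNanos;
    private long heldNanos = -1;
    private boolean closed;

    LockHandle(List<? extends Lock> locks) {
      long start = System.nanoTime();
      for (Lock lock : locks) {
        lock.lock();
        acquired.push(lock); // pushed so that close() releases in reverse order
      }
      acquiredAtNanos = System.nanoTime();
      waitNanos = acquiredAtNanos - start;
    }

    long getWaitNanos() {
      return waitNanos;
    }

    /** Returns -1 until the handle has been closed. */
    long getHeldNanos() {
      return heldNanos;
    }

    @Override
    public void close() {
      if (closed) {
        return; // closing twice must not unlock twice
      }
      closed = true;
      while (!acquired.isEmpty()) {
        acquired.pop().unlock();
      }
      heldNanos = System.nanoTime() - acquiredAtNanos;
    }
  }

  public static void main(String[] args) {
    ReentrantReadWriteLock bucket = new ReentrantReadWriteLock();
    ReentrantReadWriteLock key = new ReentrantReadWriteLock();
    LockHandle handle = new LockHandle(Arrays.asList(bucket.readLock(), key.writeLock()));
    try (LockHandle h = handle) {
      System.out.println("key write-locked inside block: " + key.isWriteLocked());
    }
    // The stats remain readable after the locks are released,
    // so the caller can report them to its metrics.
    System.out.println("key write-locked after block: " + key.isWriteLocked());
    System.out.println("wait ns=" + handle.getWaitNanos() + ", held ns=" + handle.getHeldNanos());
  }
}
```

Keeping a reference to the handle outside the try block is what lets the caller read the stats after release; a caller that does not need them can declare the handle directly in the try header.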

errose28 · 2025-04-02T14:41:45Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+
+  public OmLockObject lock(OmLockInfo lockInfo) throws IOException {
+    OmLockObject omLockObject = new OmLockObject(lockInfo);
+    List<Lock> locks = omLockObject.getLocks();


Nothing initializes the OmLockObject lock list now and we already have the locks inside the stripes in this class. Why would we try to get them out of OmLockObject, or even add them there in the first place?

As in the response to the comment above, it is required because it keeps track of the lock stats that are reported to the Hadoop call metrics for each request.

errose28 · 2025-04-02T14:45:29Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+  public static class OmLockObject {
+    private final OmLockInfo omLockInfo;
+    private final List<Lock> locks = new ArrayList<>();
+    private final OmLockStats lockStats = new OmLockStats();


Are these fields intended to be static?

No. These fields are per-request, so they cannot be static.

errose28 · 2025-04-02T14:47:01Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+      return writeLockNanos;
+    }
+
+    void add(long timeNanos, Type type) {


We can remove the Type enum altogether and just create one method for updating each stat instead of one method to unpack the enum.
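
A minimal sketch of this suggestion, with one update method per stat and no enum to unpack. Only `writeLockNanos` appears in the diff above; the class name `LockStatsSketch` and the other field and method names are assumptions made for illustration.

```java
final class LockStatsSketch {

  /** Hypothetical stats holder: each stat has its own updater. */
  static final class LockStats {
    private long waitLockNanos;
    private long readLockNanos;
    private long writeLockNanos;

    void addWaitLockNanos(long nanos) {
      waitLockNanos += nanos;
    }

    void addReadLockNanos(long nanos) {
      readLockNanos += nanos;
    }

    void addWriteLockNanos(long nanos) {
      writeLockNanos += nanos;
    }

    long getWaitLockNanos() {
      return waitLockNanos;
    }

    long getReadLockNanos() {
      return readLockNanos;
    }

    long getWriteLockNanos() {
      return writeLockNanos;
    }
  }

  public static void main(String[] args) {
    LockStats stats = new LockStats();
    stats.addWaitLockNanos(1_500);
    stats.addWriteLockNanos(20_000);
    stats.addWriteLockNanos(5_000);
    System.out.println("wait=" + stats.getWaitLockNanos()
        + " read=" + stats.getReadLockNanos()
        + " write=" + stats.getWriteLockNanos());
  }
}
```

With this shape the call site names the stat it updates, and adding a new stat does not require touching a shared switch.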

szetszwo · 2025-04-02T19:03:30Z

That's fine. But this is the third PR from @sumitagrawl with basically the same text description. Having 3 different approaches is OK even from the same developer, but please describe them properly.

@sumitagrawl, I am also confused by these similar PRs. We should have a better description of why there are three PRs and what the differences between them are, and close the previous PRs if they are no longer needed.

Based on commits, this seems to be a continuation of #8176. There is absolutely no need to open a new PR for addressing review comments, it just splits the conversation unnecessarily.

Agree. It is even better to keep using the same PR.

sodonnel · 2025-04-03T09:49:47Z

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shippable.

sumitagrawl · 2025-04-04T05:32:11Z

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shippable.

This is done incrementally, with no impact on the existing flow. The original PR was split, as suggested by @szetszwo, for easier review and for merging back to master in stages.

sumitagrawl · 2025-04-04T05:33:51Z

@errose28 @szetszwo @swamirishi The PR is updated with all review fixes.

kerneltime · 2025-04-04T05:50:27Z

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shippable.

The introduction of locking is not re-architecting but adding the capability to serialize in an alternate scheme. Going with a feature branch might be too heavy-handed for this.

sodonnel · 2025-04-04T09:28:43Z

The introduction of locking is not re-architecting but adding the capability to serialize in an alternate scheme. Going with a feature branch might be too heavy-handed for this.

@kerneltime Recall the addition of a much smaller change to introduce atomic rewrite, where you and @errose28 insisted on doing the development on a branch. That change was much smaller and less impactful than this one.

While I agree this change is "framework code" and will not be impactful to existing logic, the followup tasks might be impactful and done in several PRs if they result in large changes:

Next PR will include:

    integration of locking framework to flow
    locking for obs key operation, bucket operation, volume operation and MPU cases
    https://issues.apache.org/jira/browse/HDDS-12386 configuration for lock bucket length and timeout

In general I don't agree with pushing development to a branch too quickly, because it is a pain to get the branch merged as it requires a vote. However, we do need consistency on the project and to decide on what requires a branch and what does not. Especially if the goal is to always have "master shippable".

Perhaps the test is, if a PR can leave master in a state where it is not shippable without another PR (eg half the code using the old locking and half using new locking), then the feature should be on a branch, even if the branch is only 3 or 4 commits before merging back to master.

Another question is whether such a branch should need a vote to merge. I'd argue that a short-lived branch with a handful of commits is a useful thing and should not require a vote to merge: it encourages small PRs, which are easier to review and build up the feature in a more natural way. It would make using branches for this sort of thing lightweight, a tool to be used rather than a burden.

That all said, I am not involved in this area of work and I am not following how involved or risky this change is overall.

errose28 · 2025-04-04T15:36:55Z

@sodonnel I am actually in favor of doing this initial set of OBS migration tasks on a feature branch.

szetszwo

@sumitagrawl , thanks for working on this! Copied my comments on errose28#5 to here.

BTW, could you please address my comments posted on May 27 in #7583 ? If we are not clear about the design, it does not make sense to start writing code.

szetszwo · 2025-04-04T18:55:38Z

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmLockInfo.java

+   */
+  public static final class LockInfo implements Comparable<LockInfo> {
+    private final String name;
+    private final boolean isWriteLock;


Using an enum is better than a boolean; see "Never Use Booleans for Something That Has Two States Now, but Might Have More Later".
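
A minimal sketch of what the enum variant could look like. The names `LockTypeSketch`, `LockType` and `select` are hypothetical; only the `name` field and the `Comparable<LockInfo>` declaration come from the diff above, and the ordering shown (name first, then type) is an assumption.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockTypeSketch {

  /** Hypothetical replacement for the isWriteLock boolean. */
  enum LockType {
    READ {
      @Override
      Lock select(ReadWriteLock rw) {
        return rw.readLock();
      }
    },
    WRITE {
      @Override
      Lock select(ReadWriteLock rw) {
        return rw.writeLock();
      }
    };

    /** Picks the matching side of a read/write lock. */
    abstract Lock select(ReadWriteLock rw);
  }

  static final class LockInfo implements Comparable<LockInfo> {
    private final String name;
    private final LockType type;

    LockInfo(String name, LockType type) {
      this.name = name;
      this.type = type;
    }

    String getName() {
      return name;
    }

    LockType getType() {
      return type;
    }

    @Override
    public int compareTo(LockInfo other) {
      // Assumed ordering: by name, then by lock type.
      int byName = name.compareTo(other.name);
      return byName != 0 ? byName : type.compareTo(other.type);
    }
  }

  public static void main(String[] args) {
    ReadWriteLock rw = new ReentrantReadWriteLock();
    LockInfo info = new LockInfo("vol1", LockType.WRITE);
    Lock lock = info.getType().select(rw);
    lock.lock();
    try {
      System.out.println("holding " + info.getType() + " lock on " + info.getName());
    } finally {
      lock.unlock();
    }
  }
}
```

Putting the selection logic on the enum removes the repeated if/else on `isWriteLock()` at each call site, and a third lock mode would only need a new constant.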

szetszwo · 2025-04-04T18:57:36Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+  public OmLockObject lock(OmLockInfo lockInfo) throws IOException {
+    OmLockObject omLockObject = new OmLockObject(lockInfo);
+    long startTime = Time.monotonicNowNanos();
+    Optional<OmLockInfo.LockInfo> optionalVolumeLock = lockInfo.getVolumeLock();
+    Optional<OmLockInfo.LockInfo> optionalBucketLock = lockInfo.getBucketLock();
+    Optional<Set<OmLockInfo.LockInfo>> optionalKeyLocks = lockInfo.getKeyLocks();
+    List<Lock> locks = new ArrayList<>();
+
+    if (optionalVolumeLock.isPresent()) {
+      OmLockInfo.LockInfo volumeLockInfo = optionalVolumeLock.get();
+      if (volumeLockInfo.isWriteLock()) {
+        omLockObject.setReadStatsType(false);
+        locks.add(volumeLocks.get(volumeLockInfo.getName()).writeLock());
+      } else {
+        locks.add(volumeLocks.get(volumeLockInfo.getName()).readLock());
+      }
+    }
+
+    if (optionalBucketLock.isPresent()) {
+      OmLockInfo.LockInfo bucketLockInfo = optionalBucketLock.get();
+      if (bucketLockInfo.isWriteLock()) {
+        omLockObject.setReadStatsType(false);
+        locks.add(bucketLocks.get(bucketLockInfo.getName()).writeLock());
+      } else {
+        locks.add(bucketLocks.get(bucketLockInfo.getName()).readLock());
+      }
+    }
+
+    if (optionalKeyLocks.isPresent()) {
+      for (ReadWriteLock keyLock: keyLocks.bulkGet(optionalKeyLocks.get())) {
+        omLockObject.setReadStatsType(false);
+        locks.add(keyLock.writeLock());
+      }
+    }
+
+    try {
+      acquireLocks(locks, omLockObject.getLocks());
+      lockStatsBegin(omLockObject.getLockStats(), Time.monotonicNowNanos(), startTime);
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+      throw new OMException("Waiting for locks is interrupted, " + lockInfo, OMException.ResultCodes.INTERNAL_ERROR);
+    } catch (TimeoutException e) {
+      throw new OMException("Timeout occurred for locks " + lockInfo, OMException.ResultCodes.TIMEOUT);
+    }
+    return omLockObject;
+  }


The lock overhead probably is quite high in this method. It may not be able to improve the performance much.

szetszwo · 2025-04-04T18:58:53Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+    return omLockObject;
+  }
+
+  private void acquireLocks(List<Lock> locks, Stack<Lock> acquiredLocks) throws TimeoutException, InterruptedException {


Concurrently, if the lock.tryLock(..) is interrupted, the partial lock won't be released. Getting everything right with timeout is not easy.

If we are adding timeout support, we have to update the design doc first.

Added a section on lock timeouts to the design doc.
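
A minimal sketch of one way to address the partial-lock concern raised above: every lock acquired so far is released if a later `tryLock` times out or is interrupted. This is not the PR's `acquireLocks` implementation; `TimedAcquireSketch` and `acquireAll` are hypothetical names, and the timeout here applies to each lock separately rather than to the whole acquisition.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

final class TimedAcquireSketch {

  /** Acquires all locks in order, or none of them. */
  static Deque<Lock> acquireAll(List<? extends Lock> locks, long timeoutMillis)
      throws InterruptedException, TimeoutException {
    Deque<Lock> acquired = new ArrayDeque<>();
    boolean success = false;
    try {
      for (Lock lock : locks) {
        // tryLock may throw InterruptedException; the finally block still runs.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
          throw new TimeoutException("Timed out after " + timeoutMillis + " ms");
        }
        acquired.push(lock);
      }
      success = true;
      return acquired;
    } finally {
      if (!success) {
        // Roll back in reverse order on timeout, interrupt, or any other failure.
        while (!acquired.isEmpty()) {
          acquired.pop().unlock();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    ReentrantLock first = new ReentrantLock();
    ReentrantLock second = new ReentrantLock();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    // Another thread holds the second lock so that the acquisition times out.
    Thread holder = new Thread(() -> {
      second.lock();
      held.countDown();
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        second.unlock();
      }
    });
    holder.start();
    held.await();

    try {
      acquireAll(Arrays.asList(first, second), 50);
    } catch (TimeoutException e) {
      System.out.println("timed out; first lock still held: " + first.isLocked());
    } finally {
      release.countDown();
      holder.join();
    }
  }
}
```

The demo exercises the timeout path only; an interrupt during `tryLock` leaves the method through the same finally block, so the same rollback applies.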

szetszwo · 2025-04-07T19:15:07Z

Comparing this PR with #8192 :

OmRequestGatekeeper (this PR)

bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   104.336 MB (before)
OmRequestGatekeeper
time  :     1.245 s
memory:   378.625 MB (after)
memory:   274.289 MB (diff)

OmLockManager (HDDS-12356. Introduce OmComponentLock and OmOperationLock #8192)

bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   102.547 MB (before)
OmLockManager
time  :     0.514 s
memory:   109.228 MB (after)
memory:     6.681 MB (diff)

szetszwo · 2025-04-07T19:30:46Z

Benchmark program for this PR: https://issues.apache.org/jira/secure/attachment/13075848/8217_benchmark.patch

szetszwo · 2025-04-07T21:47:48Z

UseEpsilonGC (no gc): https://blogs.oracle.com/javamagazine/post/epsilon-the-jdks-do-nothing-garbage-collector

OmRequestGatekeeper

bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   167.723 MB (before)
OmRequestGatekeeper
time  :     0.552 s
memory:  1303.723 MB (after)
memory:  1136.000 MB (diff)
GC    :         0 ms (0)

OmLockManager

bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   172.629 MB (before)
OmLockManager
time  :     0.283 s
memory:   368.629 MB (after)
memory:   196.000 MB (diff)
GC    :         0 ms (0)

errose28 · 2025-04-07T23:12:56Z

Some analysis of the benchmark would help, i.e. where are the memory savings coming from? Based on an initial look I think bucketKeyListCount = 1_000 is skewing the results. Each lock will usually only be taken on 1 key, and sometimes 2. I don't think optimizing for 1k namespace locks at a time is a priority, but if there's a simple way to get those savings in this PR we can look into it.

kerneltime · 2025-04-09T16:34:07Z

@sumitagrawl is away for a few days so the updates will be incoming after that.
I think a feature branch should be okay, but I want it to contain small changes that get merged in. If the change is a single PR, it does raise questions about why we are doing a feature branch. The migration of request processing for Ozone needs to be done incrementally, and there will be a period when request processing is in mixed mode.

szetszwo · 2025-04-09T17:28:34Z

@errose28 , thanks for looking at the benchmark results!

... where is the memory savings coming from? ...

The memory consumption is from the multiple unnecessary data structures used in this PR.

... bucketKeyListCount = 1_000 is skewing the results. Each lock will usually only be taken on 1 key, and sometimes 2. ...

According to @kerneltime and @sumitagrawl , most cases only have 1 key. That's why I used a smaller bucketKeyListCount compared to bucketKeyCount. We may try different combinations. Any suggestions?

swamirishi

@sumitagrawl thanks for the patch. Left some comments inline

swamirishi · 2025-05-05T12:43:44Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+  public OmRequestGatekeeper() {
+    volumeLocks = SimpleStriped.readWriteLock(NUM_VOLUME_STRIPES, false);
+    bucketLocks = SimpleStriped.readWriteLock(NUM_BUCKET_STRIPES, false);
+    keyLocks = SimpleStriped.readWriteLock(NUM_KEY_STRIPES, false);


Why is the number of stripes not configurable?

swamirishi · 2025-05-05T12:53:03Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+    for (Lock lock: locks) {
+      if (lock.tryLock(LOCK_TIMEOUT_DEFAULT, TimeUnit.MILLISECONDS)) {
+        try {
+          acquiredLocks.add(lock);


Use acquiredLocks.push() instead

swamirishi · 2025-05-05T12:53:29Z

...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/lock/OmRequestGatekeeper.java

+          acquiredLocks.add(lock);
+        } catch (Throwable e) {
+          // We acquired this lock but were unable to add it to our acquired locks list.
+          lock.unlock();


This catch block seems unnecessary as no exception would be thrown

sumitagrawl added 2 commits March 27, 2025 14:14

HDDS-12356. granular locking framework

302e9a6

fix review comments

dcc371b

errose28 reviewed Apr 2, 2025

fix review comments

247d0c2

sumitagrawl mentioned this pull request Apr 4, 2025

HDDS-12356. granular locking framework #8176

Closed

sumitagrawl requested review from errose28, kerneltime, szetszwo and swamirishi April 4, 2025 05:32

sumitagrawl marked this pull request as ready for review April 4, 2025 06:10

szetszwo reviewed Apr 4, 2025

adoroszlai marked this pull request as draft April 7, 2025 19:27

sumitagrawl changed the title ~~HDDS-12356. granular locking framework~~ HDDS-12356. granular locking framework for obs Apr 21, 2025

swamirishi reviewed May 5, 2025
