Skip to content

HDDS-12356. granular locking framework for obs #8217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

sumitagrawl
Copy link
Contributor

What changes were proposed in this pull request?

Granular locking framework for OBS for existing flow. Its just framework binding code flow but pending to call the lock in respective flow.

locking is added for external request at entry point. This provides execution of request at leader and existing flow simultaneouly without impacting for cache.

  • refer obs-locking.md for locking added for obs request (HDDS-11898. design doc leader side execution)
  • refer leader-execution.md for Step-by-step integration of existing request (interoperability)

Next PR will include:

Parent Jira:
https://issues.apache.org/jira/browse/HDDS-11900

Its Parent for Epic;
https://issues.apache.org/jira/browse/HDDS-11897

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12356

How was this patch tested?

  • UT cases added for lock

@adoroszlai
Copy link
Contributor

Why do we need yet another PR?

@errose28
Copy link
Contributor

errose28 commented Apr 2, 2025

@adoroszlai We are working on prototypes to discuss different approaches. I agree it is cluttering the main PR queue and I suggest others to raise the PR against their own fork and link it in a Jira comment like this instead cc @szetszwo

@adoroszlai
Copy link
Contributor

adoroszlai commented Apr 2, 2025

We are working on prototypes to discuss different approaches

That's fine. But this is the third PR from @sumitagrawl with basically the same text description. Having 3 different approaches is OK even from same developer, but please describe properly.

Based on commits, this seems to be a continuation of #8176. There is absolutely no need to open a new PR for addressing review comments, it just splits the conversation unnecessarily.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sumitagrawl. I'm still not sure what the story on the lock stats is. Are we planning to expose these as metrics? They would have to be aggregated so I'm not sure how helpful that would be. Are we planning to log warnings when wait time or hold time crosses a given value? That would probably be useful. How we use the lock stats will depend on whether we need a long running counter to aggregate the information, or a new set of stats for each lock acquisition. Maybe we will need both.

keyLocks = SimpleStriped.readWriteLock(NUM_KEY_STRIPES, false);
}

public OmLockObject lock(OmLockInfo lockInfo) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why return OmLockObject and require caller to pass it back to unlock instead of using Autocloseable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defined OmLockObject as autoclosable, so based on usecase, it can be used with try-with-resouce.

We need return this object as it holds LockStats which is required to be set to HadoopRPC metrics with each request calling this interface.
This might be added to ozone metrics also to capture lock stats.


public OmLockObject lock(OmLockInfo lockInfo) throws IOException {
OmLockObject omLockObject = new OmLockObject(lockInfo);
List<Lock> locks = omLockObject.getLocks();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing initializes the OmLockObject lock list now and we already have the locks inside the stripes in this class. Why would we try to get them out of OmLockObject, or even add them there in the first place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As response for above comments, its required as,

  • it keep track of lock stats to be responsed to Hadoop Call metrics for each request

public static class OmLockObject {
private final OmLockInfo omLockInfo;
private final List<Lock> locks = new ArrayList<>();
private final OmLockStats lockStats = new OmLockStats();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these fields intended to be static?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, as this is referred for each request, and can not be static.

return writeLockNanos;
}

void add(long timeNanos, Type type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove the Type enum all together and just create one method for updating each stat instead of one method to unpack the enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@szetszwo
Copy link
Contributor

szetszwo commented Apr 2, 2025

That's fine. But this is the third PR from @sumitagrawl with basically the same text description. Having 3 different approaches is OK even from same developer, but please describe properly.

@sumitagrawl , I am also confused for these similar PRs. We should have a better description on why there are three PRs and what are difference between them, and close the previous PRs if there are no longer needed.

Based on commits, this seems to be a continuation of #8176. There is absolutely no need to open a new PR for addressing review comments, it just splits the conversation unnecessarily.

Agree. It is even better to keep using the same PR.

@sodonnel
Copy link
Contributor

sodonnel commented Apr 3, 2025

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shipable.

@sumitagrawl
Copy link
Contributor Author

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shipable.

This is done incrementally with no impact to existing flow. Original PR is splitted for merge as suggested to @szetszwo for better review and merge back to master in continuation.

@sumitagrawl
Copy link
Contributor Author

@errose28 @szetszwo @swamirishi PR is updated is all review fix.

@kerneltime
Copy link
Contributor

This work to re-architect the locking should probably be performed on a branch and merged to master when it is completed and tested, in order to keep master shipable.

The introduction of locking is not re architecting but adding the capability to serialize in an alternate scheme. Going with feature branch might be too heavy handed for this.

@sumitagrawl sumitagrawl marked this pull request as ready for review April 4, 2025 06:10
@sodonnel
Copy link
Contributor

sodonnel commented Apr 4, 2025

The introduction of locking is not re architecting but adding the capability to serialize in an alternate scheme. Going with feature branch might be too heavy handed for this.

@kerneltime Recall the addition of a much smaller change to introduce atomic rewrite, where yourself and @errose28 insisted on doing the development on a branch. That change was much smaller and less impactful than this one.

While I agree this change is "framework code" and will not be impactful to existing logic, the followup tasks might be impactful and done in several PRs if they result in large changes:

Next PR will include:

    integration of locking framework to flow
    locking for obs key operation, bucket operation, volume operation and MPU cases
    https://issues.apache.org/jira/browse/HDDS-12386 configuration for lock bucket length and timeout

In general I don't agree with pushing development to a branch too quickly, because It is a pain to get the branch merged as it requires a vote. However, we do need consistency on the project and to decide on what requires a branch and what does not. Especially if the goal is to always have "master shippable".

Perhaps the test is, if a PR can leave master in a state where it is not shippable without another PR (eg half the code using the old locking and half using new locking), then the feature should be on a branch, even if the branch is only 3 or 4 commits before merging back to master.

Another question is whether such a branch should need a vote to merge? I'd argue that a short lived branch with a handful of commits is a useful thing and should not require a vote to merge - it encourages small PRs which are easier to review and build up the feature in a more natural way. It would make using branches for this sort of thing light weight and tool to be used rather than a burden.

That all said, I am not involved in this area of work and I am not following how involved or risky this change is overall.

@errose28
Copy link
Contributor

errose28 commented Apr 4, 2025

@sodonnel I am actually in favor of doing this initial set of OBS migration tasks on a feature branch.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumitagrawl , thanks for working on this! Copied my comments on errose28#5 to here.

BTW, could you please address my comments posted on May 27 in #7583 ? If we are not clear about the design, it does not make sense to start writing code.

*/
public static final class LockInfo implements Comparable<LockInfo> {
private final String name;
private final boolean isWriteLock;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +54 to +99
public OmLockObject lock(OmLockInfo lockInfo) throws IOException {
OmLockObject omLockObject = new OmLockObject(lockInfo);
long startTime = Time.monotonicNowNanos();
Optional<OmLockInfo.LockInfo> optionalVolumeLock = lockInfo.getVolumeLock();
Optional<OmLockInfo.LockInfo> optionalBucketLock = lockInfo.getBucketLock();
Optional<Set<OmLockInfo.LockInfo>> optionalKeyLocks = lockInfo.getKeyLocks();
List<Lock> locks = new ArrayList<>();

if (optionalVolumeLock.isPresent()) {
OmLockInfo.LockInfo volumeLockInfo = optionalVolumeLock.get();
if (volumeLockInfo.isWriteLock()) {
omLockObject.setReadStatsType(false);
locks.add(volumeLocks.get(volumeLockInfo.getName()).writeLock());
} else {
locks.add(volumeLocks.get(volumeLockInfo.getName()).readLock());
}
}

if (optionalBucketLock.isPresent()) {
OmLockInfo.LockInfo bucketLockInfo = optionalBucketLock.get();
if (bucketLockInfo.isWriteLock()) {
omLockObject.setReadStatsType(false);
locks.add(bucketLocks.get(bucketLockInfo.getName()).writeLock());
} else {
locks.add(bucketLocks.get(bucketLockInfo.getName()).readLock());
}
}

if (optionalKeyLocks.isPresent()) {
for (ReadWriteLock keyLock: keyLocks.bulkGet(optionalKeyLocks.get())) {
omLockObject.setReadStatsType(false);
locks.add(keyLock.writeLock());
}
}

try {
acquireLocks(locks, omLockObject.getLocks());
lockStatsBegin(omLockObject.getLockStats(), Time.monotonicNowNanos(), startTime);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new OMException("Waiting for locks is interrupted, " + lockInfo, OMException.ResultCodes.INTERNAL_ERROR);
} catch (TimeoutException e) {
throw new OMException("Timeout occurred for locks " + lockInfo, OMException.ResultCodes.TIMEOUT);
}
return omLockObject;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock overhead probably is quite high in this method. It may not be able to improve the performance much.

return omLockObject;
}

private void acquireLocks(List<Lock> locks, Stack<Lock> acquiredLocks) throws TimeoutException, InterruptedException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concurrently, if the lock.tryLock(..) is interrupted, the partial lock won't be released. Getting everything right with timeout is not easy.

If we are adding timeout support, we have to update the design doc first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added section for having timeout in design doc for lock

@szetszwo
Copy link
Contributor

szetszwo commented Apr 7, 2025

Comparing this PR with #8192 :

  • OmRequestGatekeeper (this PR)
bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   104.336 MB (before)
OmRequestGatekeeper
time  :     1.245 s
memory:   378.625 MB (after)
memory:   274.289 MB (diff)
bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   102.547 MB (before)
OmLockManager
time  :     0.514 s
memory:   109.228 MB (after)
memory:     6.681 MB (diff)

@adoroszlai adoroszlai marked this pull request as draft April 7, 2025 19:27
@szetszwo
Copy link
Contributor

szetszwo commented Apr 7, 2025

@szetszwo
Copy link
Contributor

szetszwo commented Apr 7, 2025

UseEpsilonGC (no gc): https://blogs.oracle.com/javamagazine/post/epsilon-the-jdks-do-nothing-garbage-collector

  • OmRequestGatekeeper
bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   167.723 MB (before)
OmRequestGatekeeper
time  :     0.552 s
memory:  1303.723 MB (after)
memory:  1136.000 MB (diff)
GC    :         0 ms (0)
  • OmLockManager
bucketKeyCount     = 1000000
bucketKeyListCount = 1000
memory:   172.629 MB (before)
OmLockManager
time  :     0.283 s
memory:   368.629 MB (after)
memory:   196.000 MB (diff)
GC    :         0 ms (0)

@errose28
Copy link
Contributor

errose28 commented Apr 7, 2025

Some analysis of the benchmark would help, i.e. where is the memory savings coming from? Based on an initial look I think bucketKeyListCount = 1_000 is skewing the results. Each lock will usually only be taken on 1 key, and sometimes 2. I don't think optimizing on 1k namespace locks at a time is priority, but if there's simple way to get those savings in this PR we can look into it.

@kerneltime
Copy link
Contributor

@sumitagrawl is away for a few days so the updates will be incoming after that.
I think a feature branch should be okay, but I want it to contain small changes that get merged in. If the change is a single PR, it does raise questions about why we are doing a feature branch. The migration of request processing for Ozone needs to be done incrementally, and there will be a period when request processing is in mixed mode.

@szetszwo
Copy link
Contributor

szetszwo commented Apr 9, 2025

@errose28 , thanks for looking at the benchmark results!

... where is the memory savings coming from? ...

The memory consumption is from the multiple unnecessary data structures used in this PR.

... bucketKeyListCount = 1_000 is skewing the results. Each lock will usually only be taken on 1 key, and sometimes 2. ...

According to @kerneltime and @sumitagrawl , most cases only have 1 key. That's why I used a smaller bucketKeyListCount compared to bucketKeyCount. We may try different combinations. Any suggestions?

@sumitagrawl sumitagrawl changed the title HDDS-12356. granular locking framework HDDS-12356. granular locking framework for obs Apr 21, 2025
Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumitagrawl thanks for the patch. Left some comments inline

public OmRequestGatekeeper() {
volumeLocks = SimpleStriped.readWriteLock(NUM_VOLUME_STRIPES, false);
bucketLocks = SimpleStriped.readWriteLock(NUM_BUCKET_STRIPES, false);
keyLocks = SimpleStriped.readWriteLock(NUM_KEY_STRIPES, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are number of stripes not configurable?

for (Lock lock: locks) {
if (lock.tryLock(LOCK_TIMEOUT_DEFAULT, TimeUnit.MILLISECONDS)) {
try {
acquiredLocks.add(lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use acquiredLocks.push() instead

acquiredLocks.add(lock);
} catch (Throwable e) {
// We acquired this lock but were unable to add it to our acquired locks list.
lock.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch block seems unnecessary as no exception would be thrown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants