HDDS-11898. design doc pre-ratis execution #7583


Open. Wants to merge 12 commits into base: master.

Conversation

@sumitagrawl (Contributor) commented Dec 16, 2024

What changes were proposed in this pull request?

Design doc for pre-ratis side execution

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11898

How was this patch tested?

NA

@errose28 errose28 self-requested a review December 17, 2024 19:33
@sumitagrawl sumitagrawl marked this pull request as ready for review December 20, 2024 12:54
@adoroszlai (Contributor) left a comment


Thanks @sumitagrawl for the docs.

Please combine the files into a single markdown file with headers (title, author, status, etc.), license (please see other design docs for example).

This will help readers know where to start, and it is also needed for display on the website: https://ozone.apache.org/docs/edge/design.html

@sumitagrawl (Contributor Author)

> Thanks @sumitagrawl for the docs.
>
> Please combine the files into a single markdown file with headers (title, author, status, etc.), license (please see other design docs for example).
>
> This will help readers know where to start, and it is also needed for display on the website: https://ozone.apache.org/docs/edge/design.html

@adoroszlai Please recheck; the content is now split into the following separate files:

  1. Leader-execution.md
  2. obs-lock.md
  3. requests:
  • obs-create-key.md
  • obs-commit-key.md

These are kept separate because they are independent features within leader execution, and their designs will continue to evolve independently.

@sumitagrawl sumitagrawl requested a review from adoroszlai January 6, 2025 10:23
Comment on lines +190 to +254
- lock: granular level locking
- unlock: unlock locked keys

Contributor

What happens while we are holding the lock down? Shouldn't this be where processing is happening? This seems like a duplicate of the information in the "Leader Execution" section but both sections are missing steps. For example submitting to Ratis is not mentioned here anywhere.

- [Create key](request/obs-create-key.md)
- [Commit key](request/obs-commit-key.md)

### Execution persist and distribution

@errose28 (Contributor) commented Jan 11, 2025

I think this whole section needs to be redesigned. In theory, Ratis + RocksDB should be able to exist in its own module as a replicated DB with no dependencies on anything Ozone specific. We will need this eventually to bring the same code flow to SCM (for rolling upgrade) and Recon (for non-voting follower) without rewriting these critical pieces that deal with replication and persistence. Actually moving the code to separate modules may be outside the scope of this feature, but we need to define the API surface such that it is possible to avoid having to rewrite/refactor what is soon to be already new code. For this example I will refer to the replicated DB as its own module, even if V1 of the code does not structure it this way for migration purposes. It is the API surface used by each request that is more important to lock down now.

Input to this module should be of the form of protos that define the DB updates to perform. The actual values written to the DB should already have been serialized to bytes by this point and they should not be deserialized at any point later in the flow (with the exception of merges). This means the module has no knowledge of client ID, quota info, etc.

We would have one proto message defining each operation supported by the DB. The module takes one Batch which contains these operations and will be treated as one Ratis request

message Put {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Delete {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
}

message Merge {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Checkpoint {
  // Path to place the checkpoint
  optional string destination = 1;
}

// Only one field should be present to define the operation to do.
// The module can validate this input.
message Operation {
  optional Put put = 1;
  optional Delete delete = 2;
  optional Merge merge = 3;
  optional Checkpoint checkpoint = 4;
}

// Each OM request would result in one list of ordered operations submitted to the module.
// The module can internally combine these lists into one Batch proto that gets submitted to Ratis.
// The update to the transaction ID table needs to be handled within the module for each batch applied.
message Batch {
  repeated Operation operations = 1;
}

Now to translate each proto to a DB update in Ratis' applyTransaction:

  • Put and Delete simply map to existing RocksDB put and delete key ops. Note that RocksDB does not have a move operation.
  • Checkpoint creates a RocksDB checkpoint and will be used by snapshots.
  • Merge will be used to implement any increments required, like quota using the RocksDB associative merge operator. Initializers of the module will pass in a mapping of column families to their corresponding merge operators if required.
    • For example, the OM would initialize the module with a BucketInfoMergeOperator on the BucketTable, a VolumeInfoMergeOperator on the VolumeTable, etc.

Then the API surface between OM or any other service and the replicated DB module is just a list of column families to open, with some optionally mapped to merge operator callbacks provided on construction, and calls to submit new Operation lists to the module.
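
To make that boundary concrete, here is a minimal Java sketch of what such an API surface could look like; the names (ReplicatedDb, MergeOperator, ReplicatedDbFactory, and the Operation stub standing in for the generated proto message) are illustrative, not existing Ozone code:

```
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Stand-in for the class that would be generated from the Operation proto above.
final class Operation { }

/** Merge callback for one column family, e.g. applying quota increments. */
interface MergeOperator {
  byte[] merge(byte[] existingValue, byte[] operand);
}

/** Replicated DB module (Ratis + RocksDB) with no Ozone-specific knowledge. */
interface ReplicatedDb extends AutoCloseable {
  /**
   * Submits one ordered operation list (one OM request). The module may
   * combine several lists into a single Batch proto before sending to Ratis;
   * the future completes once the batch has been applied.
   */
  CompletableFuture<Void> submit(List<Operation> operations);
}

/** Construction: column families to open, some mapped to merge operators. */
final class ReplicatedDbFactory {
  static ReplicatedDb open(List<String> columnFamilies,
                           Map<String, MergeOperator> mergeOperators)
      throws IOException {
    // Open RocksDB with the given column families, attach the merge operators
    // where provided, and wrap it in the Ratis state machine.
    throw new UnsupportedOperationException("sketch only");
  }
}
```

Under this sketch the OM only builds Operation lists per request; values are serialized before crossing this boundary and never deserialized again inside the module (except by merge operators).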

Contributor

A big +1 for this!

@ivandika3 ivandika3 self-requested a review January 13, 2025 08:20

@ivandika3 (Contributor) left a comment


Thanks for the patch. Left some initial comments.

Regarding the request flow, could you add a more detailed sequence diagram? Similar to https://issues.apache.org/jira/browse/HDDS-1595 so that it's easier to visualize the new flow.

[sequence diagram image from HDDS-1595]

#### Memory caching:
```
Memory Map: ClientId#CallId Vs Response
Expiry: 10 minute (as current default for ratis)
```

Contributor

Is this expiry handled independently on each OM node? It won't be replicated from the leader?

Contributor Author

The leader will sync the following information along with the DB changes:

  • callId and clientId (used to identify a retry of the request)
  • timestamp
  • response message (to be replied back if the same request is retried, matching Ratis behavior)

Each node will make use of this information once it is received in applyTransaction, and handles expiry independently.
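
As a rough illustration (not the actual implementation), each node could keep this retry information in a local expiring map, for example with Guava's cache and the 10-minute default mentioned above:

```
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

/** Per-node retry cache; expiry is handled locally, not replicated. */
final class RetryResponseCache {
  // ClientId#CallId -> serialized response, expiring after the 10-minute default.
  private final Cache<String, byte[]> responses = CacheBuilder.newBuilder()
      .expireAfterWrite(10, TimeUnit.MINUTES)
      .build();

  /** Called from applyTransaction with the info synced by the leader. */
  void record(String clientId, long callId, byte[] serializedResponse) {
    responses.put(clientId + "#" + callId, serializedResponse);
  }

  /** Returns the saved response for a retried request, or null if absent/expired. */
  byte[] lookup(String clientId, long callId) {
    return responses.getIfPresent(clientId + "#" + callId);
  }
}
```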

@ivandika3 (Contributor)

cc: @xichen01 @symious

@sumitagrawl (Contributor Author)

@ivandika3 @errose28 Updated the docs for all comments. The following items are pending:

  1. detailed flow diagram including the lock/unlock flow (tried an abstract flow with Gateway responsibilities, but this needs more detail and further visualization of the flow)

  2. Ratis + RocksDB separation from OM logic - to check further

Batching of Request:
- Request 1..n are executed and db changes are identified and added to queue (and request will be waiting for update via ratis over Future waiting)
- Batcher will retrieve Request 1..n and db changes, merge those request to single Ratis Request message
- Send Merged Request message to all nodes via ratis and receive reply

Contributor

If we only need to update one field of a key (e.g. the mtime of a key), should we send the entire DB value of the key to the follower?

Contributor Author

Yes, we send the entire key object. This kind of operation is rare, so it does not need optimization here. The idea is to keep the synced DB operation as close as possible to the actual DB update, instead of doing further processing on the follower node (unless a specific case requires it).

Contributor

@xichen01 brought up a good point: for example, for the large multipart upload part issue (https://issues.apache.org/jira/browse/HDDS-8238), the OM leader might need to send a very large DB value, a few hundred MBs. This can cause very large network overhead.

Contributor

Multipart uploads must be looked at from scratch, and incremental updates can be added later. The goal is to keep the applyTransaction code minimal and idempotent. Introducing a read-modify-write cycle in applyTransaction has implications for features such as rolling upgrades, which need to be looked into (followers can have old code that the leader is unaware of). For now, I would keep this part of the code simple (even if it is a bit inefficient) and add complexity later, with benchmarks and clarity on the performance gains, without compromising correctness.
The current proposal is to move OM APIs to the new implementation incrementally. Making the code complex from the beginning without measuring the performance can lead us to add complexity without clear gains.

- Send Merged Request message to all nodes via ratis and receive reply
- Batcher will reply to each request 1..n with db update success notifying future object of each request.

There are multiple batchers waiting over queue,

Contributor

What are the rules for assigning requests to different queues, for requests with dependencies?

Contributor Author

Dependencies between requests are handled via locking. If one request depends on another, it will wait on the lock before being processed.
For example:

  1. Client 1 commits key1.
  2. Another client overwrites key1 with a parallel commit.
    So if the first client gets the lock on key1, the second client will wait.

Similarly, a bucket update operation takes a write lock on the bucket, so all key requests on that bucket will wait for the lock.

So the queue will contain only independent requests to be written to the DB, and no ordering is required.

As a use case:
Most operations are at the key level -- create/commit/delete. Operations on different keys are independent of each other, which helps achieve better parallelism.

Contributor

If it depends on the request lock, how long will the lock be held?

It may be necessary to wait until Ratis replies successfully (majority of nodes to reply in quorum) before releasing the lock; if the lock is released before that, the multi-threaded model may lead to ambiguities in the order in which the follower and leader execute.

And for some “indirectly related” locks, for example if the FSO bucket uses a more fine-grained lock (currently a bucket write lock), this can lead to many complex cases. In the HA model as I understand it, where synchronized journal services are usually sequential, the next batch is sent to Ratis only after the previous batch has been explicitly written.

Contributor

> It may be necessary to wait until Ratis replies successfully (majority of nodes to reply in quorum) before releasing the lock

Yes, this is necessary. The lock must be held until the request is applied on the leader (which requires committing to the quorum). Locking only happens on the leader and happens before submitting to Ratis, so followers will apply in the order directed by the leader. I think the doc should clarify the scope of the locking, since currently the step-by-step examples are a little ambiguous here.
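
A simplified sketch of that ordering, using hypothetical names (LeaderSideFlow, Batcher) rather than actual Ozone classes: the granular lock is taken on the leader before execution and released only after the batched changes are committed by the quorum and applied.

```
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.Lock;
import java.util.function.Supplier;

final class LeaderSideFlow {

  /** Queues DB changes; the future completes after quorum commit and apply. */
  interface Batcher {
    CompletableFuture<Void> submit(List<byte[]> dbChanges);
  }

  private final Batcher batcher;

  LeaderSideFlow(Batcher batcher) {
    this.batcher = batcher;
  }

  /**
   * @param keyLock granular (bucket read / key write) lock, leader side only
   * @param execute validates the request and produces the serialized DB changes
   */
  void handle(Lock keyLock, Supplier<List<byte[]>> execute) {
    keyLock.lock();                            // 1. lock before execution
    try {
      List<byte[]> dbChanges = execute.get();  // 2. execute, collect DB changes
      batcher.submit(dbChanges).join();        // 3. wait for quorum commit + apply
      // 4. the response is sent to the client here, after apply
    } finally {
      keyLock.unlock();                        // 5. unlock only after that wait
    }
  }
}
```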


### Apply Transaction (via ratis at all nodes)
With new flow as change,
- all nodes during ratis apply transaction will just only update the DB for changes.

Contributor

Is it OK to just send "db changes" instead of a journal to the follower nodes? As I understand it, what HA usually syncs is the journal, so that the follower can perform operations such as updating the iNode in memory, etc. This would not be possible if we only sync the DB changes.
If Ozone has some operations that need to be applied on the follower, how should this be handled? For example, Ozone intends to make the FSO bucket support in-memory iNode trees.

Contributor Author

Yes, certain actions are defined for specific cases, such as:

  • bucket quota changes need a DB update and a cache update (differential)
  • bucket creation needs an addition to the cache; volumes also need to be added to the cache

So the statement above is meant generically: avoid any special handling on the follower side other than the DB update, but there are defined actions where extra handling is needed. The goal is to avoid anything beyond the DB update on the follower unless it is mandatory.

Will try to clarify this further in the design doc.

Contributor

However, I understand that in mainstream filesystem HA models it is more of a "journal" synchronization (maybe I'm not correct). I don't know if there are other filesystem HA models that synchronize "DB changes", and I'm concerned that synchronizing only the "DB changes" might limit some future features.

Contributor

Are you referencing HDFS here, with in-memory inodes and journal nodes?

Synchronizing on DB changes is the way consensus usually works IME. Synchronizing on abstract commands like we do currently is very error-prone because the critical applyTransaction step is unique for each command. If in-memory state like caches needs to be updated, that would be considered part of the state machine and updated on applyTransaction as well.

The separation proposed here does make this type of write-through caching more difficult because the applyTransaction entries are already serialized. This should be called out in the doc and tested as well. Are things like a full cache of the bucket table, which saves the deserialization step, actually helping read performance, or can we drop them? If they are needed, there would likely need to be some sort of caching support added:

  • The module would support callbacks to update a cache on applyTransaction in an async manner.
  • Caching could be applied at the column family level, similar to what we currently have.
  • Cache misses would read through to the DB.
  • Cache update failures would not be fatal since the value is already in the DB.
    • This prevents cases like we currently have where failures in applyTransaction unique to specific requests crash the OM.
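
A small sketch of how such a per-column-family cache hook might look (illustrative names, not a committed design): the module calls back on applyTransaction, and a failed cache update is only logged because the value is already durable in the DB.

```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Optional cache hooks registered per column family at module construction. */
final class CacheCallbacks {
  // columnFamily -> callback receiving (serialized key, serialized value)
  private final Map<String, BiConsumer<byte[], byte[]>> callbacks =
      new ConcurrentHashMap<>();

  void register(String columnFamily, BiConsumer<byte[], byte[]> onApply) {
    callbacks.put(columnFamily, onApply);
  }

  /** Invoked (asynchronously in practice) after a batch is applied to the DB. */
  void onApply(String columnFamily, byte[] key, byte[] value) {
    BiConsumer<byte[], byte[]> cb = callbacks.get(columnFamily);
    if (cb == null) {
      return;
    }
    try {
      cb.accept(key, value);
    } catch (RuntimeException e) {
      // Not fatal: a cache miss simply reads through to the DB.
    }
  }
}
```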

@sumitagrawl (Contributor Author)

@xichen01 Thanks for the review; some good questions were asked in the comments. I will try to add those cases to the design doc.

Please recheck my replies and raise any further questions.

@sumitagrawl sumitagrawl requested a review from xichen01 January 28, 2025 15:42
@sumitagrawl sumitagrawl added the om-pre-ratis-execution label (PRs related to https://issues.apache.org/jira/browse/HDDS-11897) Mar 20, 2025
@sumitagrawl sumitagrawl changed the title from "HDDS-11898. design doc leader side execution" to "HDDS-11898. design doc pre-ratis execution" Mar 20, 2025

@szetszwo (Contributor) left a comment


@sumitagrawl, thanks for working on the design. Please see the comments inlined for the OBS locking.

BTW, I suggest splitting out the OBS locking into a separate design doc. It will be easier for everyone.

Comment on lines 25 to 30
OBS case just involves volume, bucket and key. So this is more simplified in terms of locking.

There will be:
1. Volume Strip Lock: locking for volume
2. Bucket Strip Lock: locking for bucket
3. Key Strip Lock: Locking for key

Contributor

Please describe what these locks are protecting.

A lock structure in general should look like a tree. Why don't we need a root lock here?

Contributor Author

We have bucketed the key space into "volume" --> "bucket", so based on the current operating model in Ozone we do not need to lock the entire key space.

  • for a key operation, a volume lock is generally not required, just a bucket read lock and then the lock on the related key.
  • for a bucket operation, a volume read lock.

Currently we do not have any operation that locks the entire tree structure, and that should be avoided.

**Note**: Multiple keys locking (like delete multiple keys or rename operation), lock needs to be taken in order, i.e. using StrippedLocking order to avoid deadlock.

Stripped locking ordering:
- Strip lock is obtained over a hash bucket.

Contributor

What are hash buckets? Are they the same as Ozone buckets?

Contributor Author

It's not an Ozone bucket. It refers to the Striped class (from Guava), where a hash is computed on the key to place keys into different lock bucket spaces for concurrency, i.e. hash buckets.
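
For illustration, a minimal Java sketch of that striped (hash-bucket) locking using Guava's Striped utility; the stripe count and key format are placeholders. For multi-key operations, bulkGet returns the stripes in a consistent order, which gives the deadlock-free ordering mentioned in the note above.

```
import com.google.common.util.concurrent.Striped;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.Lock;

final class KeyStripedLock {
  // Keys are hashed onto a fixed number of stripes ("hash buckets").
  private final Striped<Lock> keyLocks = Striped.lock(1024);

  /** Locks several keys (e.g. deleteKeys/renameKeys) in deadlock-free order. */
  void runWithKeys(List<String> keys, Runnable action) {
    Iterable<Lock> ordered = keyLocks.bulkGet(keys);
    for (Lock lock : ordered) {
      lock.lock();
    }
    try {
      action.run();
    } finally {
      for (Lock lock : ordered) {
        lock.unlock();
      }
    }
  }

  public static void main(String[] args) {
    new KeyStripedLock().runWithKeys(
        Arrays.asList("/vol1/bucket1/key1", "/vol1/bucket1/key2"),
        () -> System.out.println("both keys locked"));
  }
}
```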

Comment on lines 36 to 37
- All keys needs to be ordered with hash bucket
- And then need take lock in sequence order

Contributor

Define "ordered with hash bucket" and "sequence order".

Contributor Author

Updated the doc, separating the striped lock behavior from the OBS locking.

- And then need take lock in sequence order

## OBS operation
Bucket read lock will be there default. This is to ensure:

Contributor

Even for volume level operations?

Contributor Author

No need, as per the current Ozone design.

If, in the future, the volume needs to be locked, we can include it via the builder pattern. But currently there is no need, and the existing Ozone implementation also takes only a bucket-level lock for all operations.

- key operation uses updated bucket acl, quota and other properties
- key does not become dangling when parallel bucket is deleted

Note: Volume lock is not required as key depends on bucket only to retrieve information and bucket as parent.

Contributor

Usually, a lower level lock requires locking all its ancestors; e.g. see https://en.wikipedia.org/wiki/Multiple_granularity_locking

If it does not have the volume lock, what if the volume is renamed?

Contributor Author

Currently there is no volume rename operation. Other volume update operations, such as quota and owner, have no relation to key operations. These changed parameters are picked up by the bucket only at creation time, as designed in the current implementation as well.

So there is no impact.

If, in the future, the volume needs to be locked, we can include it via the builder pattern. But currently there is no need, and the existing Ozone implementation also takes only a bucket-level lock.


Batch Operation:
1. deleteKeys: batch will be divided to multiple threads in Execution Pool to run parallel calling DeleteKey
2. RenameKeys: This is `deprecated`, but for compatibility, will be divided to multiple threads in Execution Pool to run parallel calling RenameKey

Contributor

Compatibility might not be important here if our target is Ozone 3.0.0.

Contributor Author

OK, we can remove this interface from Ozone.

OBS case just involves volume, bucket and key. So this is more simplified in terms of locking.

There will be:
1. Volume Strip Lock: locking for volume

Contributor

Strip on what? I guess the answer is name? Why name but not id?

Contributor Author

I have updated the doc.

1. deleteKeys: batch will be divided to multiple threads in Execution Pool to run parallel calling DeleteKey
2. RenameKeys: This is `deprecated`, but for compatibility, will be divided to multiple threads in Execution Pool to run parallel calling RenameKey

For batch operation, atomicity is not guaranteed for above api, and same is behavior for s3 perspective.

Contributor

> ... atomicity is not guaranteed for above api ...

I guess you mean atomicity is not guaranteed for the batch but it is guaranteed for the above API. No?

Contributor Author

No. For an API involving a batch of keys being deleted or renamed, there is a possibility that some operations succeed while others fail, in the OBS case.

This is as per the S3 specification, and we may have a different implementation that simplifies things by breaking a huge batch into smaller batches; hence it is stated up front as not guaranteed.

E.g., one API call batch can have the renames "key1 to key2, key3 to key4" together, so
the key1-to-key2 rename can succeed while the key3-to-key4 rename fails.
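
A minimal sketch of that non-atomic fan-out (class and method names are illustrative): each key in the batch becomes an independent task on the execution pool, and the caller gets per-key results rather than an all-or-nothing outcome.

```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BatchKeyOps {
  private final ExecutorService executionPool = Executors.newFixedThreadPool(8);

  /** Runs deleteKey for each key in parallel; the batch as a whole is not atomic. */
  Map<String, Boolean> deleteKeys(List<String> keys) {
    Map<String, CompletableFuture<Boolean>> futures = new LinkedHashMap<>();
    for (String key : keys) {
      futures.put(key,
          CompletableFuture.supplyAsync(() -> deleteKey(key), executionPool));
    }
    Map<String, Boolean> results = new LinkedHashMap<>();
    futures.forEach((key, future) ->
        results.put(key, future.exceptionally(e -> false).join()));
    return results;  // some keys may have succeeded while others failed
  }

  private boolean deleteKey(String key) {
    // Each delete goes through the normal single-key flow (lock, execute, Ratis).
    return true;
  }
}
```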


## Bucket and volume locking as required for concurrency for obs key handling

### Volume Operation

Contributor

We probably should start with Volume, then Bucket and then Key.

Contributor Author

done

@sumitagrawl sumitagrawl requested a review from szetszwo April 21, 2025 11:51
Labels: design, om-pre-ratis-execution (PRs related to https://issues.apache.org/jira/browse/HDDS-11897)

8 participants