HDDS-11898. design doc pre-ratis execution #7583


Open. Wants to merge 12 commits into base: master.

Conversation

@sumitagrawl (Contributor) commented Dec 16, 2024

What changes were proposed in this pull request?

Design doc for pre-ratis side execution

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11898

How was this patch tested?

NA

@errose28 errose28 self-requested a review December 17, 2024 19:33
@sumitagrawl sumitagrawl marked this pull request as ready for review December 20, 2024 12:54
@adoroszlai (Contributor) left a comment


Thanks @sumitagrawl for the docs.

Please combine the files into a single markdown file with headers (title, author, status, etc.), license (please see other design docs for example).

This will help readers know where to start, and it is also needed for display on the website: https://ozone.apache.org/docs/edge/design.html

@sumitagrawl (Contributor Author)

> Thanks @sumitagrawl for the docs.
>
> Please combine the files into a single markdown file with headers (title, author, status, etc.), license (please see other design docs for example).
>
> This will help readers know where to start, and it is also needed for display on the website: https://ozone.apache.org/docs/edge/design.html

@adoroszlai Please recheck; the content is now split into the following separate files:

  1. Leader-execution.md
  2. obs-lock.md
  3. requests:
  • obs-create-key.md
  • obs-commit-key.md

These are kept separate because they are independent features within leader execution, and their designs will continue to evolve independently.

@sumitagrawl sumitagrawl requested a review from adoroszlai January 6, 2025 10:23
Comment on lines +190 to +254
- lock: granular level locking
- unlock: unlock locked keys

Contributor

What happens while we are holding the lock down? Shouldn't this be where processing is happening? This seems like a duplicate of the information in the "Leader Execution" section but both sections are missing steps. For example submitting to Ratis is not mentioned here anywhere.

- [Create key](request/obs-create-key.md)
- [Commit key](request/obs-commit-key.md)

### Execution persist and distribution

@errose28 (Contributor) commented Jan 11, 2025

I think this whole section needs to be redesigned. In theory, Ratis + RocksDB should be able to exist in its own module as a replicated DB with no dependencies on anything Ozone specific. We will need this eventually to bring the same code flow to SCM (for rolling upgrade) and Recon (for non-voting follower) without rewriting these critical pieces that deal with replication and persistence. Actually moving the code to separate modules may be outside the scope of this feature, but we need to define the API surface such that it is possible to avoid having to rewrite/refactor what is soon to be already new code. For this example I will refer to the replicated DB as its own module, even if V1 of the code does not structure it this way for migration purposes. It is the API surface used by each request that is more important to lock down now.

Input to this module should be of the form of protos that define the DB updates to perform. The actual values written to the DB should already have been serialized to bytes by this point and they should not be deserialized at any point later in the flow (with the exception of merges). This means the module has no knowledge of client ID, quota info, etc.

We would have one proto message defining each operation supported by the DB. The module takes one Batch which contains these operations and will be treated as one Ratis request

message Put {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Delete {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
}

message Merge {
  optional bytes columnFamily = 1;
  optional bytes key = 2;
  optional bytes value = 3;
}

message Checkpoint {
  // Path to place the checkpoint
  optional string destination = 1;
}

// Only one field should be present to define the operation to do.
// The module can validate this input.
message Operation {
  optional Put put = 1;
  optional Delete delete = 2;
  optional Merge merge = 3;
  optional Checkpoint checkpoint = 4;
}

// Each OM request would result in one list of ordered operations submitted to the module.
// The module can internally combine these lists into one Batch proto that gets submitted to Ratis.
// The update to the transaction ID table needs to be handled within the module for each batch applied.
message Batch {
  repeated Operation operations = 1;
}

Now to translate each proto to a DB update in Ratis' applyTransaction:

  • Put and Delete simply map to existing RocksDB put and delete key ops. Note that RocksDB does not have a move operation.
  • Checkpoint creates a RocksDB checkpoint and will be used by snapshots.
  • Merge will be used to implement any increments required, like quota using the RocksDB associative merge operator. Initializers of the module will pass in a mapping of column families to their corresponding merge operators if required.
    • For example, the OM would initialize the module with a BucketInfoMergeOperator on the BucketTable, a VolumeInfoMergeOperator on the VolumeTable, etc.

Then the API surface between OM or any other service and the replicated DB module is just a list of column families to open, with some optionally mapped to merge operator callbacks provided on construction, and calls to submit new Operation lists to the module.
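
To make that boundary concrete, here is a minimal Java sketch of what such an API surface could look like; the names (ReplicatedDb, MergeOperator, ReplicatedDbFactory, and the Operation stub standing in for the generated proto message) are illustrative, not existing Ozone code:

```
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Stand-in for the class that would be generated from the Operation proto above.
final class Operation { }

/** Merge callback for one column family, e.g. applying quota increments. */
interface MergeOperator {
  byte[] merge(byte[] existingValue, byte[] operand);
}

/** Replicated DB module (Ratis + RocksDB) with no Ozone-specific knowledge. */
interface ReplicatedDb extends AutoCloseable {
  /**
   * Submits one ordered operation list (one OM request). The module may
   * combine several lists into a single Batch proto before sending to Ratis;
   * the future completes once the batch has been applied.
   */
  CompletableFuture<Void> submit(List<Operation> operations);
}

/** Construction: column families to open, some mapped to merge operators. */
final class ReplicatedDbFactory {
  static ReplicatedDb open(List<String> columnFamilies,
                           Map<String, MergeOperator> mergeOperators)
      throws IOException {
    // Open RocksDB with the given column families, attach the merge operators
    // where provided, and wrap it in the Ratis state machine.
    throw new UnsupportedOperationException("sketch only");
  }
}
```

Under this sketch the OM only builds Operation lists per request; values are serialized before crossing this boundary and never deserialized again inside the module (except by merge operators).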

Contributor

A big +1 for this!

@ivandika3 ivandika3 self-requested a review January 13, 2025 08:20

@ivandika3 (Contributor) left a comment


Thanks for the patch. Left some initial comments.

Regarding the request flow, could you add a more detailed sequence diagram? Similar to https://issues.apache.org/jira/browse/HDDS-1595 so that it's easier to visualize the new flow.

[sequence diagram image from HDDS-1595]

#### Memory caching:
```
Memory Map: ClientId#CallId Vs Response
Expiry: 10 minute (as current default for ratis)
```

Contributor

Is this expiry handled independently on each OM node? It won't be replicated from the leader?

Contributor Author

The leader will sync the following information along with the DB changes:

  • callId and clientId (used to identify a retry of the request)
  • timestamp
  • response message (to be replied back if the same request is retried, matching Ratis behavior)

Each node will make use of this information once it is received in applyTransaction, and handles expiry independently.
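
As a rough illustration (not the actual implementation), each node could keep this retry information in a local expiring map, for example with Guava's cache and the 10-minute default mentioned above:

```
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

/** Per-node retry cache; expiry is handled locally, not replicated. */
final class RetryResponseCache {
  // ClientId#CallId -> serialized response, expiring after the 10-minute default.
  private final Cache<String, byte[]> responses = CacheBuilder.newBuilder()
      .expireAfterWrite(10, TimeUnit.MINUTES)
      .build();

  /** Called from applyTransaction with the info synced by the leader. */
  void record(String clientId, long callId, byte[] serializedResponse) {
    responses.put(clientId + "#" + callId, serializedResponse);
  }

  /** Returns the saved response for a retried request, or null if absent/expired. */
  byte[] lookup(String clientId, long callId) {
    return responses.getIfPresent(clientId + "#" + callId);
  }
}
```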

@ivandika3 (Contributor)

cc: @xichen01 @symious

@sumitagrawl (Contributor Author)

@ivandika3 @errose28 Updated the docs for all comments. The following items are pending:

  1. detailed flow diagram including the lock/unlock flow (tried an abstract flow with Gateway responsibilities, but this needs more detail and further visualization of the flow)

  2. Ratis + RocksDB separation from OM logic - to check further

Batching of Request:
- Request 1..n are executed and db changes are identified and added to queue (and request will be waiting for update via ratis over Future waiting)
- Batcher will retrieve Request 1..n and db changes, merge those request to single Ratis Request message
- Send Merged Request message to all nodes via ratis and receive reply

Contributor

If we only need to update one field of a key (e.g. the mtime of a key), should we send the entire DB value of the key to the follower?

Contributor Author

Yes, we send the entire key object. This kind of operation is rare, so it does not need optimization here. The idea is to keep the synced DB operation as close as possible to the actual DB update, instead of doing further processing on the follower node (unless a specific case requires it).

Contributor

@xichen01 brought up a good point: for example, for the large multipart upload part issue (https://issues.apache.org/jira/browse/HDDS-8238), the OM leader might need to send a very large DB value, a few hundred MBs. This can cause very large network overhead.

Contributor

Multipart uploads must be looked at from scratch, and incremental updates can be added later. The goal is to keep the applyTransaction code minimal and idempotent. Introducing a read-modify-write cycle in applyTransaction has implications for features such as rolling upgrades, which need to be looked into (followers can have old code that the leader is unaware of). For now, I would keep this part of the code simple (even if it is a bit inefficient) and add complexity later, with benchmarks and clarity on the performance gains, without compromising correctness.
The current proposal is to move OM APIs to the new implementation incrementally. Making the code complex from the beginning without measuring the performance can lead us to add complexity without clear gains.

- Send Merged Request message to all nodes via ratis and receive reply
- Batcher will reply to each request 1..n with db update success notifying future object of each request.

There are multiple batchers waiting over queue,

Contributor

What are the rules for assigning requests to different queues, for requests with dependencies?

Contributor Author

Dependencies between requests are handled via locking. If one request depends on another, it will wait on the lock before being processed.
For example:

  1. Client 1 commits key1.
  2. Another client overwrites key1 with a parallel commit.
    So if the first client gets the lock on key1, the second client will wait.

Similarly, a bucket update operation takes a write lock on the bucket, so all key requests on that bucket will wait for the lock.

So the queue will contain only independent requests to be written to the DB, and no ordering is required.

As a use case:
Most operations are at the key level -- create/commit/delete. Operations on different keys are independent of each other, which helps achieve better parallelism.

Contributor

If it depends on the request lock, how long will the lock be held?

It may be necessary to wait until Ratis replies successfully (majority of nodes to reply in quorum) before releasing the lock; if the lock is released before that, the multi-threaded model may lead to ambiguities in the order in which the follower and leader execute.

And for some “indirectly related” locks, for example if the FSO bucket uses a more fine-grained lock (currently a bucket write lock), this can lead to many complex cases. In the HA model as I understand it, where synchronized journal services are usually sequential, the next batch is sent to Ratis only after the previous batch has been explicitly written.

Contributor

> It may be necessary to wait until Ratis replies successfully (majority of nodes to reply in quorum) before releasing the lock

Yes, this is necessary. The lock must be held until the request is applied on the leader (which requires committing to the quorum). Locking only happens on the leader and happens before submitting to Ratis, so followers will apply in the order directed by the leader. I think the doc should clarify the scope of the locking, since currently the step-by-step examples are a little ambiguous here.
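
A simplified sketch of that ordering, using hypothetical names (LeaderSideFlow, Batcher) rather than actual Ozone classes: the granular lock is taken on the leader before execution and released only after the batched changes are committed by the quorum and applied.

```
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.Lock;
import java.util.function.Supplier;

final class LeaderSideFlow {

  /** Queues DB changes; the future completes after quorum commit and apply. */
  interface Batcher {
    CompletableFuture<Void> submit(List<byte[]> dbChanges);
  }

  private final Batcher batcher;

  LeaderSideFlow(Batcher batcher) {
    this.batcher = batcher;
  }

  /**
   * @param keyLock granular (bucket read / key write) lock, leader side only
   * @param execute validates the request and produces the serialized DB changes
   */
  void handle(Lock keyLock, Supplier<List<byte[]>> execute) {
    keyLock.lock();                            // 1. lock before execution
    try {
      List<byte[]> dbChanges = execute.get();  // 2. execute, collect DB changes
      batcher.submit(dbChanges).join();        // 3. wait for quorum commit + apply
      // 4. the response is sent to the client here, after apply
    } finally {
      keyLock.unlock();                        // 5. unlock only after that wait
    }
  }
}
```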


### Apply Transaction (via ratis at all nodes)
With new flow as change,
- all nodes during ratis apply transaction will just only update the DB for changes.

Contributor

Is it OK to just send "db changes" instead of a journal to the follower nodes? As I understand it, what HA usually syncs is the journal, so that the follower can perform operations such as updating the iNode in memory, etc. This would not be possible if we only sync the DB changes.
If Ozone has some operations that need to be applied on the follower, how should this be handled? For example, Ozone intends to make the FSO bucket support in-memory iNode trees.

Contributor Author

Yes, certain actions are defined for specific cases, such as:

  • bucket quota changes need a DB update and a cache update (differential)
  • bucket creation needs an addition to the cache; volumes also need to be added to the cache

So the statement above is meant generically: avoid any special handling on the follower side other than the DB update, but there are defined actions where extra handling is needed. The goal is to avoid anything beyond the DB update on the follower unless it is mandatory.

Will try to clarify this further in the design doc.

Contributor

However, I understand that in mainstream filesystem HA models it is more of a "journal" synchronization (maybe I'm not correct). I don't know if there are other filesystem HA models that synchronize "DB changes", and I'm concerned that synchronizing only the "DB changes" might limit some future features.

Contributor

Are you referencing HDFS here, with in-memory inodes and journal nodes?

Synchronizing on DB changes is the way consensus usually works IME. Synchronizing on abstract commands like we do currently is very error-prone because the critical applyTransaction step is unique for each command. If in-memory state like caches needs to be updated, that would be considered part of the state machine and updated on applyTransaction as well.

The separation proposed here does make this type of write-through caching more difficult because the applyTransaction entries are already serialized. This should be called out in the doc and tested as well. Are things like a full cache of the bucket table, which saves the deserialization step, actually helping read performance, or can we drop them? If they are needed, there would likely need to be some sort of caching support added:

  • The module would support callbacks to update a cache on applyTransaction in an async manner.
  • Caching could be applied at the column family level, similar to what we currently have.
  • Cache misses would read through to the DB.
  • Cache update failures would not be fatal since the value is already in the DB.
    • This prevents cases like we currently have where failures in applyTransaction unique to specific requests crash the OM.
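
A small sketch of how such a per-column-family cache hook might look (illustrative names, not a committed design): the module calls back on applyTransaction, and a failed cache update is only logged because the value is already durable in the DB.

```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Optional cache hooks registered per column family at module construction. */
final class CacheCallbacks {
  // columnFamily -> callback receiving (serialized key, serialized value)
  private final Map<String, BiConsumer<byte[], byte[]>> callbacks =
      new ConcurrentHashMap<>();

  void register(String columnFamily, BiConsumer<byte[], byte[]> onApply) {
    callbacks.put(columnFamily, onApply);
  }

  /** Invoked (asynchronously in practice) after a batch is applied to the DB. */
  void onApply(String columnFamily, byte[] key, byte[] value) {
    BiConsumer<byte[], byte[]> cb = callbacks.get(columnFamily);
    if (cb == null) {
      return;
    }
    try {
      cb.accept(key, value);
    } catch (RuntimeException e) {
      // Not fatal: a cache miss simply reads through to the DB.
    }
  }
}
```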

@sumitagrawl (Contributor Author)

@xichen01 Thanks for the review; some good questions were asked in the comments. I will try to add those cases to the design doc.

Please recheck my replies and raise any further questions.

@sumitagrawl sumitagrawl requested a review from xichen01 January 28, 2025 15:42
@sumitagrawl sumitagrawl added the om-pre-ratis-execution label (PRs related to https://issues.apache.org/jira/browse/HDDS-11897) Mar 20, 2025
@sumitagrawl sumitagrawl changed the title from "HDDS-11898. design doc leader side execution" to "HDDS-11898. design doc pre-ratis execution" Mar 20, 2025

@szetszwo (Contributor) left a comment


@sumitagrawl, thanks for working on the design. Please see the comments inlined for the OBS locking.

BTW, I suggest splitting out the OBS locking into a separate design doc. It will be easier for everyone.

Comment on lines 25 to 30
OBS case just involves volume, bucket and key. So this is more simplified in terms of locking.

There will be:
1. Volume Strip Lock: locking for volume
2. Bucket Strip Lock: locking for bucket
3. Key Strip Lock: Locking for key

Contributor

Please describe what these locks are protecting.

A lock structure in general should look like a tree. Why don't we need a root lock here?

Contributor Author

We have bucketed the key space into "volume" --> "bucket", so based on the current operating model in Ozone we do not need to lock the entire key space.

  • for a key operation, a volume lock is generally not required, just a bucket read lock and then the lock on the related key.
  • for a bucket operation, a volume read lock.

Currently we do not have any operation that locks the entire tree structure, and that should be avoided.

**Note**: Multiple keys locking (like delete multiple keys or rename operation), lock needs to be taken in order, i.e. using StrippedLocking order to avoid deadlock.

Stripped locking ordering:
- Strip lock is obtained over a hash bucket.

Contributor

What are hash buckets? Are they the same as Ozone buckets?

Contributor Author

It's not an Ozone bucket. It refers to the Striped class (from Guava), where a hash is computed on the key to place keys into different lock bucket spaces for concurrency, i.e. hash buckets.
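
For illustration, a minimal Java sketch of that striped (hash-bucket) locking using Guava's Striped utility; the stripe count and key format are placeholders. For multi-key operations, bulkGet returns the stripes in a consistent order, which gives the deadlock-free ordering mentioned in the note above.

```
import com.google.common.util.concurrent.Striped;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.Lock;

final class KeyStripedLock {
  // Keys are hashed onto a fixed number of stripes ("hash buckets").
  private final Striped<Lock> keyLocks = Striped.lock(1024);

  /** Locks several keys (e.g. deleteKeys/renameKeys) in deadlock-free order. */
  void runWithKeys(List<String> keys, Runnable action) {
    Iterable<Lock> ordered = keyLocks.bulkGet(keys);
    for (Lock lock : ordered) {
      lock.lock();
    }
    try {
      action.run();
    } finally {
      for (Lock lock : ordered) {
        lock.unlock();
      }
    }
  }

  public static void main(String[] args) {
    new KeyStripedLock().runWithKeys(
        Arrays.asList("/vol1/bucket1/key1", "/vol1/bucket1/key2"),
        () -> System.out.println("both keys locked"));
  }
}
```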

Comment on lines 36 to 37
- All keys needs to be ordered with hash bucket
- And then need take lock in sequence order

Contributor

Define "ordered with hash bucket" and "sequence order".

Contributor Author

Updated the doc, separating the striped lock behavior from the OBS locking.

- And then need take lock in sequence order

## OBS operation
Bucket read lock will be there default. This is to ensure:

Contributor

Even for volume level operations?

Contributor Author

No need, as per the current Ozone design.

If, in the future, the volume needs to be locked, we can include it via the builder pattern. But currently there is no need, and the existing Ozone implementation also takes only a bucket-level lock for all operations.

- key operation uses updated bucket acl, quota and other properties
- key does not become dangling when parallel bucket is deleted

Note: Volume lock is not required as key depends on bucket only to retrieve information and bucket as parent.

Contributor

Usually, a lower level lock requires locking all its ancestors; e.g. see https://en.wikipedia.org/wiki/Multiple_granularity_locking

If it does not have the volume lock, what if the volume is renamed?

Contributor Author

Currently there is no volume rename operation. Other volume update operations, such as quota and owner, have no relation to key operations. These changed parameters are picked up by the bucket only at creation time, as designed in the current implementation as well.

So there is no impact.

If, in the future, the volume needs to be locked, we can include it via the builder pattern. But currently there is no need, and the existing Ozone implementation also takes only a bucket-level lock.


Batch Operation:
1. deleteKeys: batch will be divided to multiple threads in Execution Pool to run parallel calling DeleteKey
2. RenameKeys: This is `deprecated`, but for compatibility, will be divided to multiple threads in Execution Pool to run parallel calling RenameKey

Contributor

Compatibility might not be important here if our target is Ozone 3.0.0.

Contributor Author

OK, we can remove this interface from Ozone.

OBS case just involves volume, bucket and key. So this is more simplified in terms of locking.

There will be:
1. Volume Strip Lock: locking for volume

Contributor

Strip on what? I guess the answer is name? Why name but not id?

Contributor Author

I have updated the doc.

1. deleteKeys: batch will be divided to multiple threads in Execution Pool to run parallel calling DeleteKey
2. RenameKeys: This is `deprecated`, but for compatibility, will be divided to multiple threads in Execution Pool to run parallel calling RenameKey

For batch operation, atomicity is not guaranteed for above api, and same is behavior for s3 perspective.

Contributor

> ... atomicity is not guaranteed for above api ...

I guess you mean atomicity is not guaranteed for the batch but it is guaranteed for the above API. No?

Contributor Author

No. For an API involving a batch of keys being deleted or renamed, there is a possibility that some operations succeed while others fail, in the OBS case.

This is as per the S3 specification, and we may have a different implementation that simplifies things by breaking a huge batch into smaller batches; hence it is stated up front as not guaranteed.

E.g., one API call batch can have the renames "key1 to key2, key3 to key4" together, so
the key1-to-key2 rename can succeed while the key3-to-key4 rename fails.
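
A minimal sketch of that non-atomic fan-out (class and method names are illustrative): each key in the batch becomes an independent task on the execution pool, and the caller gets per-key results rather than an all-or-nothing outcome.

```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BatchKeyOps {
  private final ExecutorService executionPool = Executors.newFixedThreadPool(8);

  /** Runs deleteKey for each key in parallel; the batch as a whole is not atomic. */
  Map<String, Boolean> deleteKeys(List<String> keys) {
    Map<String, CompletableFuture<Boolean>> futures = new LinkedHashMap<>();
    for (String key : keys) {
      futures.put(key,
          CompletableFuture.supplyAsync(() -> deleteKey(key), executionPool));
    }
    Map<String, Boolean> results = new LinkedHashMap<>();
    futures.forEach((key, future) ->
        results.put(key, future.exceptionally(e -> false).join()));
    return results;  // some keys may have succeeded while others failed
  }

  private boolean deleteKey(String key) {
    // Each delete goes through the normal single-key flow (lock, execute, Ratis).
    return true;
  }
}
```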


## Bucket and volume locking as required for concurrency for obs key handling

### Volume Operation

Contributor

We probably should start with Volume, then Bucket and then Key.

Contributor Author

done

@sumitagrawl sumitagrawl requested a review from szetszwo April 21, 2025 11:51
Labels: design, om-pre-ratis-execution (PRs related to https://issues.apache.org/jira/browse/HDDS-11897)

8 participants