HDDS-9279. OM HA: support read from followers. #5288
Conversation
My immediate thought on this, without fully understanding the details, is that this is a major change in behavior and should probably be off by default until it sees more testing. Is it possible for some followers to fall behind - if so, how is this handled? Or are the followers always fully up to date? E.g., is a committed key on the leader guaranteed to be available on all followers when the client call to commit the key returns, so that if the client immediately issues a read and it hits a different OM, it will see the data? Is there a performance implication for writes when this is enabled? E.g., in the old way the followers are slightly behind as they apply the Ratis log, while with this on it is more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?
Thanks a lot @szetszwo for working on this.
Can we mark it as "draft" until then?
@sodonnel , thanks for taking a look! Sure, let's disable it by default.
In short, the read-index algorithm handles it. For more details, see the design doc in RATIS-1557 and also the Raft thesis, Section 6.4.
The writes are the same. The feature only changes the behavior for reads.
@adoroszlai , Done.
I don't have any real understanding of Ratis or how it is applied to OM HA, so it's hard for me to understand how this would work. For an OM write - the write updates the leader OM data (cache / RocksDB) and then writes the transaction to Ratis before the call returns to the client. For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis logs, or just into a majority of the Ratis logs? When the Ratis transaction is received by a follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower's Ratis log AND applied to the follower's memory state before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs? If the original client call doesn't return until all 3 OMs have been updated, does this mean the 3 OMs have a strictly consistent memory state, rather than eventually consistent?
Only one Follower is needed, i.e. 2 OMs (including the Leader) out of 3 OMs are needed.
When the Leader replies to the client, the transaction must be committed (i.e. replicated to at least one follower) and applied at the Leader. The follower may not have applied it yet. The read-index algorithm has the following steps:
1. The server handling the read asks the Leader for its current commit index (the read index); the Leader confirms it is still the leader with a majority heartbeat round before replying.
2. The server waits until its own applied index reaches the read index.
3. The server then serves the read from its local state machine.
So a follower only answers a read after it has applied everything that was committed before the read started.
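For illustration, here is a minimal, self-contained sketch of the follower-side flow described above; all class and method names below are hypothetical simplifications, not the Ratis implementation.

import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified follower-side read-index flow (not the Ratis code). */
class FollowerReadIndexSketch {
  interface Leader {
    /** Leader confirms leadership with a majority, then returns its commit index. */
    long fetchReadIndex();
  }
  interface StateMachine {
    long getAppliedIndex();
    byte[] read(String key);
  }

  private final Leader leader;
  private final StateMachine stateMachine;

  FollowerReadIndexSketch(Leader leader, StateMachine stateMachine) {
    this.leader = leader;
    this.stateMachine = stateMachine;
  }

  byte[] linearizableRead(String key) throws InterruptedException {
    // Step 1: obtain the read index from the leader (which has confirmed leadership).
    final long readIndex = leader.fetchReadIndex();
    // Step 2: wait until this server has applied everything up to the read index.
    while (stateMachine.getAppliedIndex() < readIndex) {
      TimeUnit.MILLISECONDS.sleep(1);
    }
    // Step 3: serve the read locally; the state now reflects all writes
    // committed before this read started.
    return stateMachine.read(key);
  }
}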
Maybe we can rerun some subset of the tests after changing the config to
I think this is a good change to commit, as it gets us read-from-standby almost for free. But I do have some concerns - e.g. if a follower is struggling and requests hit it, they will always be slow because it is always behind, etc. I know a lot of thought went into this sort of thing with HDFS, but my memory of it is too old now to remember any of the details.
Hi, I would like to ask a question. I did a simple test by running
@whbing, do you mean that you had applied this change and then ran the test?
I guess the serverId was translated to the pipeline somewhere on the client side. Then, the client uses the pipeline to contact the leader.
I applied the PR and tested it in my cluster.
@whbing , it looks like the current code always uses the first OM on the list. We should choose the closest OM or randomize it.
@whbing , pushed a change to shuffle omNodeIDList.
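A tiny sketch of that kind of shuffling (the names below are illustrative only, not the exact PR code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OmNodeShuffleSketch {
  public static void main(String[] args) {
    // Instead of always contacting the first OM in the configured list,
    // shuffle the list so read traffic spreads across all OMs.
    List<String> omNodeIdList = new ArrayList<>(Arrays.asList("om1", "om2", "om3"));
    Collections.shuffle(omNodeIdList);
    System.out.println("First OM to try: " + omNodeIdList.get(0));
  }
}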
Thanks for the update; I have tested it and it works.
Thank you for driving this forward @szetszwo . We are interested in trying out the linearizable read feature in our cluster to possibly shed some load from the OM leader.
I have not looked into the Ratis linearizable read implementation in depth yet, but I have some initial comments. Will add follow-up comments in the coming weeks after I go through the Ratis implementation.
cc: @symious
// Read from leader or followers using linearizable read
if (omRatisServer.isLinearizableRead()) {
  return handler.handleReadRequest(request);
}
Please correct me if I am wrong, but OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). This might cause an OM to incorrectly read stale follower data. From my understanding, linearizable read should only work if the request is sent to the OM Ratis server (just like OzoneManagerRatisServer#submitRequest).
Just some questions regarding OM consistency: currently (before linearizable read), since reads and writes must go through the leader, does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server? If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?
Reference: Jepsen consistency model (https://jepsen.io/consistency)
... OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). ...
This is a good point! Adding the code here seems incorrect.
... does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server?
For the same client, since the calls are blocking, it will have "read-after-write" consistency (an earlier write must complete before a later read is processed).
... If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?
Yes.
It also guarantees read-after-write consistency even for the Ratis async APIs.
Thank you for the info.
Adding the code here seems incorrect.
I think this can be fixed by also calling submitRequestToRatis even on the read path.
I'm curious, is there any historical reason that OM currently does not send read requests to the OM Ratis server? Maybe some overhead consideration? My understanding is that it should not add much overhead, since a read request through Ratis (StateMachine#query) does not generate an append log entry.
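A rough sketch of what routing a read through the local Ratis server could look like, assuming the Ratis RaftClientRequest.readRequestType() and RaftServer#submitClientRequestAsync APIs; the class and wiring below are simplified placeholders, not the actual OM code from this PR.

import java.util.concurrent.CompletableFuture;
import org.apache.ratis.protocol.ClientId;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.RaftClientRequest;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.server.RaftServer;
import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

/** Sketch: submit a read to the local Ratis server so the linearizable read path runs. */
class ReadThroughRatisSketch {
  private final RaftServer server;     // the OM's embedded Ratis server
  private final RaftGroupId groupId;
  private final RaftPeerId peerId;
  private final ClientId clientId = ClientId.randomId();

  ReadThroughRatisSketch(RaftServer server, RaftGroupId groupId, RaftPeerId peerId) {
    this.server = server;
    this.groupId = groupId;
    this.peerId = peerId;
  }

  CompletableFuture<RaftClientReply> submitRead(byte[] serializedOmRequest, long callId) {
    // Marking the request as a read lets Ratis answer it via StateMachine#query
    // (after the read-index barrier) instead of appending a log entry.
    final RaftClientRequest raftRequest = RaftClientRequest.newBuilder()
        .setClientId(clientId)
        .setServerId(peerId)
        .setGroupId(groupId)
        .setCallId(callId)
        .setMessage(Message.valueOf(ByteString.copyFrom(serializedOmRequest)))
        .setType(RaftClientRequest.readRequestType())
        .build();
    return server.submitClientRequestAsync(raftRequest);
  }
}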
... currently OM does not send read request to OM Ratis server? ...
For non-HA, there is no Ratis server in OM. That is probably the reason.
I have another question. I noticed that the follower's RPC server on port 9862 would restart while running. Related ticket: HDDS-10177.
The client will fail and retry. It seems okay if the client can fail over to the other OMs. We should test it.
@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.
@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.
@szetszwo Thanks! I'm not an expert on Ratis. As far as I know it is almost complete; I can help add some tests to this PR later.
Hi @symious, what were the results of your performance tests? I also ran some performance tests; see https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.o61uifuxltgn
About 20% performance improvement.
@symious , I checked the PR; the code looks good.
How did you test it? Could you share the results? Suppose the test only has read requests but no write requests (i.e. the commit index won't change). Then, the
@whbing , thanks for testing the performance and sharing the results. I commented on the doc. My major comment is that a single freon command may not be able to test the performance correctly. We should try running the commands from multiple client machines.
BTW, updating Ratis may help. Deployed a
Another observation is that the benchmark with writes will eventually converge to the leader, which makes most traffic hit the leader eventually for a persistent client (e.g. the Ozone client in S3G). Example case 1: READ -> follower 1
One possible way to ensure READs are sent to the followers is to modify the client (OmFailoverProxyProviderBase) to keep track of the current leader and followers of the OM service. All read requests would be sent to the followers, while write requests would be sent to the leader. We might adapt some of the logic from HDFS's
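A hedged sketch of that client-side routing idea follows; the class and fields are hypothetical illustrations of the approach, not OmFailoverProxyProviderBase or the HDFS provider itself.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: send writes to the known leader, spread reads across followers. */
class OmReadWriteRoutingSketch {
  private volatile String leaderNodeId;          // refreshed on failover / not-leader replies
  private volatile List<String> followerNodeIds; // all OM nodes except the current leader

  OmReadWriteRoutingSketch(String leaderNodeId, List<String> followerNodeIds) {
    this.leaderNodeId = leaderNodeId;
    this.followerNodeIds = followerNodeIds;
  }

  /** Writes must still go to the leader. */
  String pickNodeForWrite() {
    return leaderNodeId;
  }

  /** Reads can go to any follower; fall back to the leader if none are known. */
  String pickNodeForRead() {
    final List<String> followers = followerNodeIds;
    if (followers == null || followers.isEmpty()) {
      return leaderNodeId;
    }
    return followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
  }
}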
FYI: I created a comment about possible considerations around OM follower read in the ticket (https://issues.apache.org/jira/browse/HDDS-9279). Hopefully this will highlight some ideas around follower read.
@symious Did you encounter any errors with this code? I encountered the following error many times when running
java.lang.NullPointerException
at org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:69)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:532)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:289)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:287)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:267)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:260)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:267)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:201)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
reply.getMessage() is null.
OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011 (OM also fails to start with the above jar). But this PR doesn't look like it needs 3.1.0, so that can be skipped for now.
Can test on 1.4.0 first.
@whbing , RATIS-2011 was to fix a memory leak issue (otherwise, some TransactionContext objects would not be removed from the map). Why does it cause the failure?
1.4.0 is OK. And I updated the test results in https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.r9ym4lnx88xm.
java.lang.NoClassDefFoundError: Could not initialize class org.apache.ratis.protocol.RaftClientRequest
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequestImpl(OzoneManagerRatisServer.java:458)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$3(OzoneManagerRatisServer.java:296)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequest(OzoneManagerRatisServer.java:294)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:259)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:264)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:271)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:211)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:171)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:162)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1098)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1021)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
I've done some tests just based on ratis-common-3.0.1.jar and updated them in the above doc.
@whbing , thanks for the update! @kerneltime pointed out that the OM thread pool may be the bottleneck. Need to check.
Filed RATIS-2057. Let me deploy a Ratis snapshot.
I would recommend the following:
@whbing ,
Sorry, added it back in RATIS-2057.
Not sure why. Could you try
@whbing , please also include HDDS-10690.
@kerneltime Thanks for your suggestion. A lot of improvement when bumping up the thread pool size.
(1) default:
$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
count = 8000000
mean rate = 59583.14 calls/second
1-minute rate = 59483.53 calls/second
5-minute rate = 53146.84 calls/second
15-minute rate = 50789.33 calls/second
min = 0.42 milliseconds
max = 17.52 milliseconds
mean = 6.55 milliseconds
stddev = 2.69 milliseconds
median = 6.54 milliseconds
75% <= 8.44 milliseconds
95% <= 10.92 milliseconds
98% <= 11.89 milliseconds
99% <= 12.32 milliseconds
99.9% <= 13.52 milliseconds
Total execution time (sec): 134
Failures: 0
Successful executions: 8000000
(2) ipc.server.read.threadpool.size=20
$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
count = 8000000
mean rate = 273248.49 calls/second
1-minute rate = 251957.31 calls/second
5-minute rate = 239593.34 calls/second
15-minute rate = 237211.86 calls/second
min = 0.16 milliseconds
max = 131.59 milliseconds
mean = 1.46 milliseconds
stddev = 4.73 milliseconds
median = 1.23 milliseconds
75% <= 1.31 milliseconds
95% <= 1.63 milliseconds
98% <= 2.13 milliseconds
99% <= 2.96 milliseconds
99.9% <= 71.47 milliseconds
Total execution time (sec): 29
Failures: 0
Successful executions: 8000000
Other tests will be supplemented later. But I experienced a performance decrease with mixed read/write operations in the previous tests, which is unexpected.
This is also unexpected to me. Not sure if the recent commits could fix it.
@szetszwo I included the commits you mentioned and ran freon on the following branch (please point out anything incorrect in my code):
Unfortunately, LINEARIZABLE was worse than DEFAULT when I ran the same freon test. I think once LINEARIZABLE is better than DEFAULT, we can do a benchmark based on the flow @kerneltime pointed out. (By the way, the master branch (ozone-1.5) https://github.com/whbing/ozone/tree/follower-read-1.5 is not yet compatible with follower read (RPC requests appear not to be handled), so I am temporarily testing based on ozone-1.4.)
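For context, a sketch of how the DEFAULT vs LINEARIZABLE read option is toggled on the Ratis side, assuming the RaftServerConfigKeys.Read.Option API from RATIS-1557; where exactly Ozone wires this up may differ in the PR.

import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;

class LinearizableReadOptionSketch {
  static RaftProperties withLinearizableRead() {
    RaftProperties properties = new RaftProperties();
    // Assumed API from RATIS-1557: DEFAULT answers queries without the read-index
    // barrier (leader only); LINEARIZABLE runs the read-index algorithm so any
    // server, including a follower, can serve linearizable reads.
    RaftServerConfigKeys.Read.setOption(properties, RaftServerConfigKeys.Read.Option.LINEARIZABLE);
    return properties;
  }
}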
@whbing do you plan to continue work on this? This is an important feature for Ozone.
@kerneltime Sorry for the late follow-up. I might need a few more days (maybe next week) before I have time to continue working on this. (BTW, anyone has edit access to the document "OM HA support read from followers testing" and can continue in this document.)
Thanks @szetszwo for working on this. This should really help reduce the load on the leader OM once it's in. I think maybe we could add an option to let users choose to read directly from follower OMs. There’s a risk of reading stale data, but the benefit is lower latency compared to linearizable reads, even if those are implemented perfectly. It would also help further reduce the load on the leader. I think it’d be useful to make this trade-off configurable so users can choose based on their needs.
@chungen0126 , sure, we could add a conf to support stale read.
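Purely as an illustration of such a switch, a sketch with entirely hypothetical names (this config key and class do not exist in Ozone today):

import org.apache.hadoop.hdds.conf.OzoneConfiguration;

class StaleReadSwitchSketch {
  // Hypothetical key, not an existing Ozone property.
  static final String ALLOW_STALE_READ_KEY = "ozone.om.read.allow.stale";

  static boolean useStaleRead(OzoneConfiguration conf) {
    // true  -> answer reads from the local OM DB immediately (may lag the leader)
    // false -> route reads through the Ratis linearizable read path
    return conf.getBoolean(ALLOW_STALE_READ_KEY, false);
  }
}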
@szetszwo Would you like to include stale read in this patch, or would you prefer to create a new JIRA ticket for it? I’m happy to help if needed.
What changes were proposed in this pull request?
Ratis has a new Linearizable Read (RATIS-1557) feature, including reading from the followers. In this JIRA, we will change OM to serve read requests from any OM server, including the follower OMs.
What is the link to the Apache JIRA
HDDS-9279
How was this patch tested?
Will add new tests.