
HDDS-9279. OM HA: support read from followers. #5288

Draft
wants to merge 1 commit into base: master

Conversation

szetszwo
Contributor

What changes were proposed in this pull request?

Ratis has a new Linearizable Read (RATIS-1557) feature, including reading from the followers. In this JIRA, we will change OM to serve read requests from any OM server, including the follower OMs.

What is the link to the Apache JIRA

HDDS-9279

How was this patch tested?

Will add new tests.

@sodonnel
Contributor

My immediate thought on this, without fully understanding the details, is that this is a major change in behavior and should probably be off by default until it sees more testing.

Is it possible for some followers to fall behind - if so how is this handled?

Or are the followers always totally up to date? E.g., is a key committed on the leader absolutely available on all followers when the client call to commit the key returns, so that if a client immediately issues a read and it hits a different OM, it will see the data?

Is there a performance implication for writes by setting this on? E.g., in the old way the followers are slightly behind as they apply the Ratis log, while with this on it's more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?

@kerneltime
Contributor

kerneltime commented Sep 14, 2023

cc @tanvipenumudy @muskan1012

@adoroszlai
Contributor

Thanks a lot @szetszwo for working on this.

Will add new tests.

Can we mark as "draft" until then?

@szetszwo
Contributor Author

... should probably be off by default until it sees more testing.

@sodonnel , thanks for taking a look! Sure, let's disable it by default.

Is it possible for some followers to fall behind - if so how is this handled?

Or are the followers always totally up to date? E.g., is a key committed on the leader absolutely available on all followers when the client call to commit the key returns, so that if a client immediately issues a read and it hits a different OM, it will see the data?

In short, the read-index algorithm handles it. For more details, see the design doc in RATIS-1557 and the Raft thesis, Section 6.4.

Is there a performance implication for writes by setting this on? E.g., in the old way the followers are slightly behind as they apply the Ratis log, while with this on it's more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?

The writes are the same. The feature only changes the behavior of reads.

@szetszwo szetszwo marked this pull request as draft September 14, 2023 20:37
@szetszwo
Contributor Author

Can we mark as "draft" until then?

@adoroszlai , Done.

@sodonnel
Contributor

I don't have any real understanding of Ratis or how it is applied to OM HA, so it's hard for me to understand how this would work.

For an OM write - the write updates the leader OM data (cache / RocksDB) and then writes the transaction to Ratis before the call returns to the client. For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis logs, or just into a majority of the Ratis logs?

When the Ratis transaction is received by the follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower Ratis log AND applied to the follower memory state too before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs?

If the original client call doesn't return until all 3 OMs have been updated, does this mean the 3 OMs have a strictly consistent memory state, rather than eventually consistent?

@szetszwo
Contributor Author

... For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis logs, or just into a majority of the Ratis logs?

Only one Follower is needed, i.e. 2 OMs (including the Leader) out of 3 OMs are needed.

When the Ratis transaction is received by the follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower Ratis log AND applied to the follower memory state too before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs?

When the Leader replies to the client, the transaction must be committed (i.e. replicated to at least one follower) and applied at the Leader. The followers may not have applied it yet.

The read index algorithm has the following steps (a minimal sketch follows the list):

  1. a client sends a read request to a follower,
  2. the follower asks the Leader for its current commit index, say 10 -- this is the read index,
  3. the follower may have applied the log only up to a smaller index, say 8,
  4. the follower simply waits until its applied index advances to 10,
  5. the follower replies to the read request.
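
A minimal, self-contained sketch of steps 2-5 (assuming hypothetical stand-ins for the leader call, the applied index, and the local lookup; the real algorithm lives inside Ratis per RATIS-1557, not in OM code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the follower-side read-index wait; not actual Ratis or OM code. */
public class ReadIndexSketch {
  // Advanced by the state-machine apply thread as log entries are applied.
  private final AtomicLong appliedIndex = new AtomicLong(8);

  /** Step 2: ask the Leader for its current commit index (hard-coded here). */
  long fetchReadIndexFromLeader() {
    return 10;
  }

  /** Steps 3-5: wait until the applied index reaches the read index, then serve the read. */
  CompletableFuture<String> read(String key) {
    final long readIndex = fetchReadIndexFromLeader();
    return CompletableFuture.supplyAsync(() -> {
      while (appliedIndex.get() < readIndex) {
        Thread.onSpinWait();  // Ratis uses async notification, not busy-waiting; simplified here
      }
      return queryLocalState(key);  // local state now reflects at least the read index
    });
  }

  /** Called by the (simulated) apply thread as entries are applied. */
  void onApply(long index) {
    appliedIndex.accumulateAndGet(index, Math::max);
  }

  /** Placeholder for the local metadata lookup (e.g. a RocksDB read in OM). */
  String queryLocalState(String key) {
    return "value-of-" + key;
  }

  public static void main(String[] args) throws Exception {
    ReadIndexSketch follower = new ReadIndexSketch();
    CompletableFuture<String> pending = follower.read("k1");  // waits: applied=8 < readIndex=10
    follower.onApply(10);                                     // the apply thread catches up
    System.out.println(pending.get());                        // the read now completes
  }
}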

@kerneltime
Contributor

Maybe we can rerun some subset of the tests after changing the config to LINEARIZABLE.

@sodonnel
Contributor

sodonnel commented Oct 2, 2023

I think this is a good change to commit, as it gets us read-from-standby almost for free. But I do have some concerns - e.g. if a follower is struggling and requests that hit it are always slow because it is always behind, etc. I know a lot of thought went into this sort of thing with HDFS, but my memory of it is too old now to remember any of the details.

@whbing
Contributor

whbing commented Feb 1, 2024

Hi, I would like to ask a question. I did a simple test by running hadoop fs -ls ofs://follower/xxx and found that it can be accessed via the follower. But when I attempt to access it via 'serverId' using ofs://serverId/xxx, it consistently hits the leader (per the audit log). So under what conditions will it be forwarded to the follower? Or is it random?

@szetszwo
Contributor Author

szetszwo commented Feb 1, 2024

... I did a simple test by running ...

@whbing, do you mean that you had applied this change and then ran the test?

But I attempt to access it via 'serverId' using ofs://serverId/xxx, ...

I guess the serverId was translated to the pipeline somewhere on the client side. Then, the client uses the pipeline to contact the leader.

@whbing
Contributor

whbing commented Feb 2, 2024

... I did a simple test by running ...

@whbing, do you mean that you had applied this change and then ran the test?

But I attempt to access it via 'serverId' using ofs://serverId/xxx, ...

I guess the serverId was translated to the pipeline somewhere on the client side. Then, the client uses the pipeline to contact the leader.

I applied the PR and tested it in my cluster.
Sorry, the above description is not accurate. It does not hit the leader every time, but rather the first OM in ozone.om.nodes.xxx in the client-side conf. For example, om2 will always be selected for reads if configured as
<name>ozone.om.nodes.myozone</name>
<value>om2,om0,om1</value>

@szetszwo
Contributor Author

szetszwo commented Feb 6, 2024

... the first one in the ozone-om.nodes.xxx in client side conf. ...

@whbing , it looks like the current code always uses the first OM in the list. We should choose the closest OM or randomize it.
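
A trivial sketch of the randomization idea, with a hypothetical hard-coded node list standing in for the client-side ozone.om.nodes.<serviceId> configuration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShuffleOmNodesSketch {
  public static void main(String[] args) {
    // Hypothetical node IDs as they would come from ozone.om.nodes.<serviceId>.
    List<String> omNodeIDList = new ArrayList<>(Arrays.asList("om2", "om0", "om1"));
    Collections.shuffle(omNodeIDList);  // avoid always picking the first configured OM
    System.out.println("first read target: " + omNodeIDList.get(0));
  }
}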

@szetszwo
Contributor Author

szetszwo commented Feb 7, 2024

@whbing , pushed a change to shuffle omNodeIDList.

@whbing
Contributor

whbing commented Feb 22, 2024

@whbing , pushed a change to shuffle omNodeIDList.

Thanks for the update, and I have tested that it is OK

Contributor

@ivandika3 ivandika3 left a comment

Thank you for driving this forward @szetszwo . We are interested in trying out the linearizable read feature in our cluster to possibly shed some load from the OM leader.

I have not looked into the Ratis linearizable read implementation in depth yet, but I have some initial comments. Will add follow-up comments in the coming weeks after I go through the Ratis implementation.

cc: @symious

Comment on lines +246 to +190
// Read from leader or followers using linearizable read
if (omRatisServer.isLinearizableRead()) {
  return handler.handleReadRequest(request);
}
Contributor

@ivandika3 ivandika3 Mar 12, 2024

Please correct me if I am wrong, but OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). This might cause OM to incorrectly read stale data from a follower. From my understanding, linearizable read should only work if the request is sent to the OM Ratis server (just like OzoneManagerRatisServer#submitRequest).

Just a question regarding OM consistency: currently (before linearizable read), since reads and writes must go through the leader, does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server? If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?

Reference: Jepsen consistency model (https://jepsen.io/consistency)

Contributor Author

... OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). ...

This is a good point! Adding the code here seems incorrect.

... does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server?

For the same client, since the calls are blocking, it will have "read-after-write" consistency (an earlier write must complete before a later read is processed).

... If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?

Yes.

It also guarantees read-after-write consistency even for the Ratis async APIs.

Contributor

@ivandika3 ivandika3 Mar 13, 2024

Thank you for the info.

Adding the code here seems incorrect.

I think this can be fixed by also calling submitRequestToRatis even on the read path.

I'm curious: is there any historical reason that OM currently does not send read requests to the OM Ratis server? Maybe some overhead consideration? My understanding is that it should not add much overhead, since a read request through Ratis (StateMachine#query) does not generate a log append entry.
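
For illustration, a minimal sketch of routing a read through the Ratis client API so it is served by StateMachine#query (and, with LINEARIZABLE enabled, goes through the readIndex protocol). This is a simplified stand-in: the actual OM change builds a read-type request inside OzoneManagerRatisServer, and the payload below is a placeholder.

import java.io.IOException;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

/** Sketch only: send the read via Ratis instead of reading local tables directly. */
public final class RatisReadSketch {
  static Message readThroughRatis(RaftClient client, String payload) throws IOException {
    // sendReadOnly does not append a log entry; the server answers it from StateMachine#query.
    RaftClientReply reply = client.io().sendReadOnly(Message.valueOf(payload));
    if (!reply.isSuccess()) {
      throw new IOException("Read failed: " + reply.getException());
    }
    return reply.getMessage();  // the Message produced by StateMachine#query
  }
}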

Contributor Author

... currently OM does not send read request to OM Ratis server? ...

For non-HA, there is no Ratis server in OM. That is probably the reason.

@whbing
Contributor

whbing commented Mar 20, 2024

I have another question. I noticed that the follower's 9862 RPC port would restart while running; related ticket: HDDS-10177.
Not sure what the impact on follower reads is during the RPC port restart.

@szetszwo
Contributor Author

... Not sure what the impact on follower reads is during the RPC port restart.

The client will fail and retry. It seems okay if the client can fail over to the other OMs. We should test it.

@szetszwo
Contributor Author

@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.

@symious
Contributor

symious commented Apr 3, 2024

@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.

@whbing
Contributor

whbing commented Apr 3, 2024

@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.

@szetszwo Thanks! I'm not an expert on Ratis. As far as I know it is almost complete; I can help add some tests to this PR later.

@whbing
Contributor

whbing commented Apr 3, 2024

@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.

Hi @symious, what was the performance in your tests? I also ran some performance tests; see https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.o61uifuxltgn

@symious
Contributor

symious commented Apr 3, 2024

What was the performance in your tests?

About 20% performance improvement.

@szetszwo
Contributor Author

szetszwo commented Apr 3, 2024

..., We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, ...

@symious , I checked the PR, the code looks good.

... but the performance improvement is not very satisfactory.

How did you test it? Could you share the results?

Suppose the test only has read requests but no write requests (i.e. the commit index won't change). Then, the ops should have ~3x improvement since each OM should serve 1/3 of the requests.

The latency may have much smaller or no improvement since

  1. the readIndex algorithm has additional overhead, and
  2. without enabling LINEARIZABLE, the read is NOT linearizable even if it reads from the Leader.

@szetszwo
Contributor Author

szetszwo commented Apr 3, 2024

... I also ran some performance tests, ...

@whbing , thanks for testing the performance and sharing the results. I commented on the doc. My major comment is that a single freon command may not be able to measure the performance correctly since the bottleneck may be on the client side. In other words, using a single client machine to test the performance of multiple server machines does not sound right.

We should try running the commands in multiple client machines.

@szetszwo
Contributor Author

szetszwo commented Apr 3, 2024

BTW, updating Ratis may help. Deployed a 3.1.0-c73a3eb-SNAPSHOT release https://repository.apache.org/content/groups/snapshots/org/apache/ratis/ratis-common/3.1.0-c73a3eb-SNAPSHOT/ , see if you could try it.

@ivandika3
Contributor

ivandika3 commented Apr 4, 2024

Another observation is that a benchmark with writes will eventually converge to the leader, which makes most traffic eventually hit the leader for a persistent client (e.g. the Ozone client in S3G).

Example case

1: READ -> follower 1
2: WRITE -> follower 1 will throw NotLeaderException, redirect to leader
3: subsequent READs -> leader (until the leader fails)

One possible way to ensure a READ is sent to a follower is to modify the client (OmFailoverProxyProviderBase) to keep track of the current leader and followers of the OM service. All read requests would be sent to a follower, while write requests would be sent to the leader (a rough sketch follows below).

We might adapt some of the logic from HDFS's ObserverReadProxyProvider.
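
A rough, hypothetical sketch of that routing idea (not Ozone code; the fields and update hook are stand-ins for state that OmFailoverProxyProviderBase would need to track):

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: send writes to the leader, spread reads across known followers. */
class ReadWriteRoutingSketch {
  private volatile String leaderNodeId = "om1";
  private volatile List<String> followerNodeIds = List.of("om2", "om3");

  /** Pick the target OM node for a request. */
  String pickNode(boolean isWrite) {
    List<String> followers = followerNodeIds;
    if (isWrite || followers.isEmpty()) {
      return leaderNodeId;  // writes (and reads with no known follower) go to the leader
    }
    // Reads go to a randomly chosen follower to spread the load.
    return followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
  }

  /** Called when a NotLeaderException or a leader change reveals the new topology. */
  void updateTopology(String newLeader, List<String> newFollowers) {
    this.leaderNodeId = newLeader;
    this.followerNodeIds = newFollowers;
  }
}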

@ivandika3
Contributor

FYI: I created a comment about possible considerations around OM follower read in the ticket (https://issues.apache.org/jira/browse/HDDS-9279). Hopefully this will highlight some ideas around follower read.

@whbing
Contributor

whbing commented Apr 9, 2024

... with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279 ...

@symious Did you encounter any errors based on this code? I encountered the following error many times when running ozone freon ockrw -r 1000 -t 3 --linear --contiguous --duration 5s --size 4096 -v s3v -b $bucket -p $prefix with ozone.om.ratis.server.read.option=LINEARIZABLE

Total execution time (sec): 7
Failures: 0
Successful executions: 0
java.lang.NullPointerException
        at org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:69)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:532)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:289)
        at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:287)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:267)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:260)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:267)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:201)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
        at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)

reply.getMessage() is null.
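
For reference, a defensive sketch (not the actual OMRatisHelper code) of failing fast with context when the Raft reply carries no message; OMResponse is the Ozone protobuf type seen in the stack trace above:

import java.io.IOException;

import org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos.OMResponse;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;

/** Sketch: surface a clear error instead of an NPE when reply.getMessage() is null. */
public final class OmReplySketch {
  static OMResponse toOmResponse(RaftClientReply reply) throws IOException {
    final Message message = reply.getMessage();
    if (message == null) {
      throw new IOException("Raft reply has no message: success=" + reply.isSuccess()
          + ", exception=" + reply.getException());
    }
    return OMResponse.parseFrom(message.getContent().toByteArray());
  }
}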

@whbing
Contributor

whbing commented Apr 9, 2024

BTW, updating Ratis may help. Deployed a 3.1.0-c73a3eb-SNAPSHOT release https://repository.apache.org/content/groups/snapshots/org/apache/ratis/ratis-common/3.1.0-c73a3eb-SNAPSHOT/ , see if you could try it.

OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011. (OM also fails to start with the above jar.) But this PR doesn't look like it needs 3.1.0, so that can be skipped for now.

@symious
Contributor

symious commented Apr 9, 2024

reply.getMessage() is null.

Can test on 1.4.0 first.

@szetszwo
Contributor Author

szetszwo commented Apr 9, 2024

OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011. ...

@whbing , RATIS-2011 was to fix a memory leak issue (otherwise, some TransactionContext objects would not be removed from the map). Why does it cause the failure?

@whbing
Contributor

whbing commented Apr 11, 2024

reply.getMessage() is null.

Can test on 1.4.0 first.

1.4.0 is OK. And I updated the test results in https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.r9ym4lnx88xm.
The results are indeed not ideal. The performance improvement is about 20% in pure-read scenarios. However, in mixed read-write scenarios, using LINEARIZABLE did not increase throughput. In fact, both QPS and latency became worse.

@whbing
Contributor

whbing commented Apr 11, 2024

OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011. ...

@whbing , RATIS-2011 was to fix a memory leak issue (otherwise, some TransactionContext objects would not be removed from the map). Why does it cause the failure?

  1. OM build failed with Ratis 3.1.0 because it calls RatisHelper.attemptUntilTrue, which was removed in RATIS-2011.
  2. OM also failed to start with the new ratis-common-.jar, with the error:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.ratis.protocol.RaftClientRequest
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequestImpl(OzoneManagerRatisServer.java:458)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$3(OzoneManagerRatisServer.java:296)
        at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequest(OzoneManagerRatisServer.java:294)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:259)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:264)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:271)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:211)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:171)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:162)
        at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1098)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1021)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)

I've done some tests based on ratis-common-3.0.1.jar and updated the results in the above doc.

@szetszwo
Contributor Author

However, in mixed read-write scenarios, using LINEARIZABLE did not increase throughput. In fact, both QPS and latency became worse.

@whbing , thanks for the update! @kerneltime pointed out that the OM thread pool may be the bottleneck. Need to check.

@szetszwo
Contributor Author

  1. OM build failed with Ratis 3.1.0 because it calls RatisHelper.attemptUntilTrue, which was removed in RATIS-2011.

Filed RATIS-2057. Let me deploy a Ratis snapshot.

@kerneltime
Contributor

kerneltime commented Apr 16, 2024

I would recommend the following:

  1. Establish the absolute peak performance possible in the cluster by using the ozone freon om-echo command; I want to see what pure RPC performance can be reached. Also, try bumping up the config ipc.server.read.threadpool.size to a value around 10 or more and see if the echo command does better (a config sketch follows this list). This number can be really high, 300k+, on a reasonably modern server. If we are seeing numbers much lower (less than 100k), there is some other issue in the test bed that needs to be looked into. You might need --clients and --thread to be bumped up.
  2. Once the peak is established, we should run a pure read load; on a modern architecture with NVMe, it is fair to expect pure metadata reads (GetKeyInfo) to reach 150k+ ops/sec. I have not run this against the latest master, so I am not sure if there is some regression, but I have seen performance around 200k or better. The expectation here is that CPU utilization reaches numbers higher than 50%. It is possible that we do not have enough client threads or client instances.
  3. After we establish the peak performance that the cluster can hit, we should switch to measuring the impact of reading from followers using Ratis.
  4. OM should have enough idle threads under the write load to respond with the information needed to linearize the reads, so tracking OM thread utilization metrics is important.
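
A sketch of the suggested reader-thread tuning (the value 20 matches the test below; it is not a recommended default):

<property>
  <name>ipc.server.read.threadpool.size</name>
  <value>20</value>
</property>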

@szetszwo
Contributor Author

szetszwo commented Apr 16, 2024

@whbing ,

OM build failed with Ratis 3.1.0 because it calls RatisHelper.attemptUntilTrue, which was removed in RATIS-2011.

Sorry, added it back by RATIS-2057.

OM also failed to start with the new ratis-common-.jar, ...

Not sure why? Could you try 3.1.0-f8b1692-SNAPSHOT 3.1.0-2e778f7-SNAPSHOT, which was built from https://github.com/szetszwo/ratis/tree/Revert_RATIS-2045?

@szetszwo
Contributor Author

@whbing , please include also HDDS-10690.

@whbing
Contributor

whbing commented Apr 17, 2024

try bumping up the config ipc.server.read.threadpool.size

@kerneltime Thanks for your suggestion. There is a lot of improvement when bumping up ipc.server.read.threadpool.size.
(1) ipc.server.read.threadpool.size=1 (default)

$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
             count = 8000000
         mean rate = 59583.14 calls/second
     1-minute rate = 59483.53 calls/second
     5-minute rate = 53146.84 calls/second
    15-minute rate = 50789.33 calls/second
               min = 0.42 milliseconds
               max = 17.52 milliseconds
              mean = 6.55 milliseconds
            stddev = 2.69 milliseconds
            median = 6.54 milliseconds
              75% <= 8.44 milliseconds
              95% <= 10.92 milliseconds
              98% <= 11.89 milliseconds
              99% <= 12.32 milliseconds
            99.9% <= 13.52 milliseconds


Total execution time (sec): 134
Failures: 0
Successful executions: 8000000

(2) ipc.server.read.threadpool.size=20

$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
             count = 8000000
         mean rate = 273248.49 calls/second
     1-minute rate = 251957.31 calls/second
     5-minute rate = 239593.34 calls/second
    15-minute rate = 237211.86 calls/second
               min = 0.16 milliseconds
               max = 131.59 milliseconds
              mean = 1.46 milliseconds
            stddev = 4.73 milliseconds
            median = 1.23 milliseconds
              75% <= 1.31 milliseconds
              95% <= 1.63 milliseconds
              98% <= 2.13 milliseconds
              99% <= 2.96 milliseconds
            99.9% <= 71.47 milliseconds


Total execution time (sec): 29
Failures: 0
Successful executions: 8000000

Other tests will be added later. But I experienced a performance decrease with mixed read/write operations in the previous tests, which is unexpected.

@szetszwo
Contributor Author

szetszwo commented Apr 17, 2024

... I experienced a performance decrease with mixed read/write operations in the previous tests, which is unexpected.

... Could you try 3.1.0-f8b1692-SNAPSHOT 3.1.0-2e778f7-SNAPSHOT, which was built from https://github.com/szetszwo/ratis/tree/Revert_RATIS-2045?

This is also unexpected to me. Not sure if the recent commits could fix it.

@whbing
Contributor

whbing commented Apr 21, 2024

Not sure if the recent commits could fix it.

@szetszwo I included the commits you mentioned and ran freon on the following branches (please point out anything incorrect in my code):

  1. Ratis: https://github.com/whbing/ratis/tree/test-follower-read
  2. Ozone-1.4: https://github.com/whbing/ozone/tree/follower-read-1.4 (use above ratis commits)

Unfortunately, LINEARIZABLE was worse than DEFAULT when I ran the same ozone freon ockrw ... command with mixed reads and writes from multiple clients. LINEARIZABLE: read QPS 1.5k, write QPS 2.4k; DEFAULT: read QPS 8.7k, write QPS 2.8k. Although this does not hit the throughput limit, it is also not as expected.

I think once LINEARIZABLE is better than DEFAULT, we can run benchmarks based on the flow @kerneltime pointed out.

(By the way, the master branch (ozone-1.5) https://github.com/whbing/ozone/tree/follower-read-1.5 is not yet compatible with follower read (RPC requests appear not to be handled), so I am temporarily testing based on ozone-1.4.)

@kerneltime
Contributor

@whbing do you plan to continue work on this? This is an important feature for Ozone.

@whbing
Contributor

whbing commented Jun 14, 2024

@whbing do you plan to continue work on this? This is an important feature for Ozone.

@kerneltime Sorry for the late follow-up. I might need a few more days (maybe next week) before I have time to continue working on this.
@ivandika3 @symious have made significant contributions; I look forward to your continued involvement if you have time.

(BTW, anyone has edit access to the document "OM HA support read from followers testing" and can continue in that document.)

@peterxcli
Member

peterxcli commented May 9, 2025

I just left a comment on a PR that is part of the Ratis read-index work: apache/ratis#730 (comment).
I guess it would help LINEARIZABLE reads from followers to have smaller overhead.

The lease read feature is already enabled by default. apache/ratis#730 (comment)

@chungen0126
Contributor

Thanks @szetszwo for working on this. This should really help reduce the load on the leader OM once it's in. I think maybe we could add an option to let users choose to read directly from follower OMs.

There’s a risk of reading stale data, but the benefit is lower latency compared to linearizable reads — even if those are implemented perfectly. It would also help further reduce the load on the leader. I think it’d be useful to make this trade-off configurable so users can choose based on their needs.
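
As an illustration only, a hypothetical configuration for such a trade-off; no such value exists today, and the only key exercised in this thread is ozone.om.ratis.server.read.option with LINEARIZABLE:

<!-- Hypothetical: a stale-read value like this does not exist yet. -->
<property>
  <name>ozone.om.ratis.server.read.option</name>
  <value>STALE</value>
</property>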

@szetszwo
Contributor Author

@chungen0126 , sure, we could add a conf to support stale read.

@chungen0126
Contributor

@szetszwo Would you like to include stale read in this patch, or prefer to create a new JIRA ticket for it? I’m happy to help if needed.
