HDDS-9279. OM HA: support read from followers. #5288
Conversation
My immediate thought on this, without fully understanding the details, is that this is a major change in behavior and should probably be off by default until it sees more testing. Is it possible for some followers to fall behind - if so, how is this handled? Or are the followers always fully up to date? E.g., is a committed key on the leader guaranteed to be available on all followers when the client call to commit the key returns, so that if the client immediately issues a read and it hits a different OM, it will see the data? Is there a performance implication for writes when this is enabled? E.g., in the old way the followers are slightly behind as they apply the Ratis log, while with this on it is more like a 3-way commit for each write, so a delay (e.g. a GC pause) on one OM would affect the write time on the leader?
Thanks a lot @szetszwo for working on this.
Can we mark it as "draft" until then?
@sodonnel , thanks for taking a look! Sure, let's disable it by default.
In short, the read-index algorithm handles it. For more details, see the design doc in RATIS-1557 and also the Raft thesis, Section 6.4.
The writes are the same. The feature only changes the behavior for reads.
@adoroszlai , Done.
I don't have any real understanding of Ratis or how it is applied to OM HA, so it's hard for me to understand how this would work. For an OM write - the write updates the leader OM data (cache / RocksDB) and then writes the transaction to Ratis before the call returns to the client. For this Ratis write to succeed, must it make it onto the other 2 OM nodes and into their Ratis logs, or just into a majority of the Ratis logs? When the Ratis transaction is received by a follower OM, what happens before that call returns to the leader who called it? Is the transaction written to the follower's Ratis log AND applied to the follower's memory state before the Ozone client returns? Or is the Ratis log applied to the follower's memory asynchronously by a separate thread, meaning the original client call returns BEFORE the memory state is updated in all OMs? If the original client call doesn't return until all 3 OMs have been updated, does this mean the 3 OMs have a strictly consistent memory state, rather than eventually consistent?
Only one Follower is needed, i.e. 2 OMs (including the Leader) out of 3 OMs are needed.
When the Leader replies to the client, the transaction must be committed (i.e. replicated to at least one follower) and applied at the Leader. The follower may not have applied it yet. The read-index algorithm has the following steps:
1. The server handling the read asks the Leader for its current commit index (the read index); the Leader confirms it is still the leader with a majority heartbeat round before replying.
2. The server waits until its own applied index reaches the read index.
3. The server then serves the read from its local state machine.
So a follower only answers a read after it has applied everything that was committed before the read started.
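For illustration, here is a minimal, self-contained sketch of the follower-side flow described above; all class and method names below are hypothetical simplifications, not the Ratis implementation.

import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified follower-side read-index flow (not the Ratis code). */
class FollowerReadIndexSketch {
  interface Leader {
    /** Leader confirms leadership with a majority, then returns its commit index. */
    long fetchReadIndex();
  }
  interface StateMachine {
    long getAppliedIndex();
    byte[] read(String key);
  }

  private final Leader leader;
  private final StateMachine stateMachine;

  FollowerReadIndexSketch(Leader leader, StateMachine stateMachine) {
    this.leader = leader;
    this.stateMachine = stateMachine;
  }

  byte[] linearizableRead(String key) throws InterruptedException {
    // Step 1: obtain the read index from the leader (which has confirmed leadership).
    final long readIndex = leader.fetchReadIndex();
    // Step 2: wait until this server has applied everything up to the read index.
    while (stateMachine.getAppliedIndex() < readIndex) {
      TimeUnit.MILLISECONDS.sleep(1);
    }
    // Step 3: serve the read locally; the state now reflects all writes
    // committed before this read started.
    return stateMachine.read(key);
  }
}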
Maybe we can rerun some subset of the tests after changing the config to
I think this is a good change to commit, as it gets us read-from-standby almost for free. But I do have some concerns - e.g. if a follower is struggling and requests hit it, they will always be slow because it is always behind, etc. I know a lot of thought went into this sort of thing with HDFS, but my memory of it is too old now to remember any of the details.
Hi, I would like to ask a question. I did a simple test by running
@whbing, do you mean that you had applied this change and then ran the test?
I guess the serverId was translated to the pipeline somewhere on the client side. Then, the client uses the pipeline to contact the leader.
I applied the PR and tested it in my cluster.
@whbing , it looks like the current code always uses the first OM on the list. We should choose the closest OM or randomize it.
@whbing , pushed a change to shuffle omNodeIDList.
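A tiny sketch of that kind of shuffling (the names below are illustrative only, not the exact PR code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OmNodeShuffleSketch {
  public static void main(String[] args) {
    // Instead of always contacting the first OM in the configured list,
    // shuffle the list so read traffic spreads across all OMs.
    List<String> omNodeIdList = new ArrayList<>(Arrays.asList("om1", "om2", "om3"));
    Collections.shuffle(omNodeIdList);
    System.out.println("First OM to try: " + omNodeIdList.get(0));
  }
}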
Thanks for the update; I have tested it and it works.
Thank you for driving this forward @szetszwo . We are interested in trying out the linearizable read feature in our cluster to possibly shed some load from the OM leader.
I have not looked into the Ratis linearizable read implementation in depth yet, but I have some initial comments. Will add follow-up comments in the coming weeks after I go through the Ratis implementation.
cc: @symious
// Read from leader or followers using linearizable read
if (omRatisServer.isLinearizableRead()) {
  return handler.handleReadRequest(request);
}
Please correct me if I am wrong, but OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). This might cause an OM to incorrectly read stale follower data. From my understanding, linearizable read should only work if the request is sent to the OM Ratis server (just like OzoneManagerRatisServer#submitRequest).
Just some questions regarding OM consistency: currently (before linearizable read), since reads and writes must go through the leader, does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server? If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?
Reference: Jepsen consistency model (https://jepsen.io/consistency)
... OzoneManagerRequestHandler#handleReadRequest queries the OM metadata tables directly without going through Ratis (OzoneManagerStateMachine#query). ...
This is a good point! Adding the code here seems incorrect.
... does OM provide "read-after-write" consistency (or is it stronger?) even though the read does not go through the Ratis server?
For the same client, since the calls are blocking, it will have "read-after-write" consistency (an earlier write must complete before a later read is processed).
... If we enable linearizable read, can we increase the consistency guarantee to "linearizable"?
Yes.
It also guarantees read-after-write consistency even for the Ratis async APIs.
Thank you for the info.
Adding the code here seems incorrect.
I think this can be fixed by also calling submitRequestToRatis even on the read path.
I'm curious, is there any historical reason that OM currently does not send read requests to the OM Ratis server? Maybe some overhead consideration? My understanding is that it should not add much overhead, since a read request through Ratis (StateMachine#query) does not generate an append log entry.
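A rough sketch of what routing a read through the local Ratis server could look like, assuming the Ratis RaftClientRequest.readRequestType() and RaftServer#submitClientRequestAsync APIs; the class and wiring below are simplified placeholders, not the actual OM code from this PR.

import java.util.concurrent.CompletableFuture;
import org.apache.ratis.protocol.ClientId;
import org.apache.ratis.protocol.Message;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.RaftClientRequest;
import org.apache.ratis.protocol.RaftGroupId;
import org.apache.ratis.protocol.RaftPeerId;
import org.apache.ratis.server.RaftServer;
import org.apache.ratis.thirdparty.com.google.protobuf.ByteString;

/** Sketch: submit a read to the local Ratis server so the linearizable read path runs. */
class ReadThroughRatisSketch {
  private final RaftServer server;     // the OM's embedded Ratis server
  private final RaftGroupId groupId;
  private final RaftPeerId peerId;
  private final ClientId clientId = ClientId.randomId();

  ReadThroughRatisSketch(RaftServer server, RaftGroupId groupId, RaftPeerId peerId) {
    this.server = server;
    this.groupId = groupId;
    this.peerId = peerId;
  }

  CompletableFuture<RaftClientReply> submitRead(byte[] serializedOmRequest, long callId) {
    // Marking the request as a read lets Ratis answer it via StateMachine#query
    // (after the read-index barrier) instead of appending a log entry.
    final RaftClientRequest raftRequest = RaftClientRequest.newBuilder()
        .setClientId(clientId)
        .setServerId(peerId)
        .setGroupId(groupId)
        .setCallId(callId)
        .setMessage(Message.valueOf(ByteString.copyFrom(serializedOmRequest)))
        .setType(RaftClientRequest.readRequestType())
        .build();
    return server.submitClientRequestAsync(raftRequest);
  }
}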
... currently OM does not send read request to OM Ratis server? ...
For non-HA, there is no Ratis server in OM. That is probably the reason.
I have another question. I noticed that the follower's RPC server on port 9862 would restart while running. Related ticket: HDDS-10177.
The client will fail and retry. It seems okay if the client can fail over to the other OMs. We should test it.
@whbing , @ivandika3 , I am currently not able to continue this work. Are you interested in working on this? Please feel free to let me know.
@szetszwo We tested this feature with a revised version of this PR, https://github.com/symious/ozone/tree/HDDS-9279, but the performance improvement is not very satisfactory.
@szetszwo Thanks! I'm not an expert on Ratis. As far as I know it is almost complete; I can help add some tests to this PR later.
Hi @symious, what were the results of your performance tests? I also ran some performance tests; see https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.o61uifuxltgn
About 20% performance improvement.
@symious , I checked the PR; the code looks good.
How did you test it? Could you share the results? Suppose the test only has read requests but no write requests (i.e. the commit index won't change). Then, the
@whbing , thanks for testing the performance and sharing the results. I commented on the doc. My major comment is that a single freon command may not be able to test the performance correctly. We should try running the commands from multiple client machines.
BTW, updating Ratis may help. Deployed a
Another observation is that the benchmark with writes will eventually converge to the leader, which makes most traffic hit the leader eventually for a persistent client (e.g. the Ozone client in S3G). Example case 1: READ -> follower 1
One possible way to ensure READs are sent to the followers is to modify the client (OmFailoverProxyProviderBase) to keep track of the current leader and followers of the OM service. All read requests would be sent to the followers, while write requests would be sent to the leader. We might adapt some of the logic from HDFS's
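A hedged sketch of that client-side routing idea follows; the class and fields are hypothetical illustrations of the approach, not OmFailoverProxyProviderBase or the HDFS provider itself.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: send writes to the known leader, spread reads across followers. */
class OmReadWriteRoutingSketch {
  private volatile String leaderNodeId;          // refreshed on failover / not-leader replies
  private volatile List<String> followerNodeIds; // all OM nodes except the current leader

  OmReadWriteRoutingSketch(String leaderNodeId, List<String> followerNodeIds) {
    this.leaderNodeId = leaderNodeId;
    this.followerNodeIds = followerNodeIds;
  }

  /** Writes must still go to the leader. */
  String pickNodeForWrite() {
    return leaderNodeId;
  }

  /** Reads can go to any follower; fall back to the leader if none are known. */
  String pickNodeForRead() {
    final List<String> followers = followerNodeIds;
    if (followers == null || followers.isEmpty()) {
      return leaderNodeId;
    }
    return followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
  }
}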
FYI: I created a comment about possible considerations around OM follower read in the ticket (https://issues.apache.org/jira/browse/HDDS-9279). Hopefully this will highlight some ideas around follower read.
@symious Did you encounter any errors with this code? I encountered the following error many times when running
java.lang.NullPointerException
at org.apache.hadoop.ozone.om.helpers.OMRatisHelper.getOMResponseFromRaftClientReply(OMRatisHelper.java:69)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponseImpl(OzoneManagerRatisServer.java:532)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$2(OzoneManagerRatisServer.java:289)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createOmResponse(OzoneManagerRatisServer.java:287)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:267)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:260)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:267)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:201)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:161)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:152)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)
reply.getMessage() is null.
OM is not yet compatible with Ratis 3.1.0, e.g. RATIS-2011 (OM also fails to start with the above jar). But this PR doesn't look like it needs 3.1.0, so that can be skipped for now.
Can test on 1.4.0 first.
@whbing , RATIS-2011 was to fix a memory leak issue (otherwise, some TransactionContext objects would not be removed from the map). Why does it cause the failure?
1.4.0 is OK. And I updated the test results in https://docs.google.com/document/d/1xVkaQYDXJmztETJVZQHkkij_j8j6MGQ4XB8ehathhG8/edit#heading=h.r9ym4lnx88xm.
java.lang.NoClassDefFoundError: Could not initialize class org.apache.ratis.protocol.RaftClientRequest
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequestImpl(OzoneManagerRatisServer.java:458)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$3(OzoneManagerRatisServer.java:296)
at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.createRaftRequest(OzoneManagerRatisServer.java:294)
at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:259)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:264)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:271)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:211)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:171)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:162)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1098)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1021)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
I've done some tests just based on ratis-common-3.0.1.jar and updated them in the above doc.
@whbing , thanks for the update! @kerneltime pointed out that the OM thread pool may be the bottleneck. Need to check.
Filed RATIS-2057. Let me deploy a Ratis snapshot.
I would recommend the following:
@whbing ,
Sorry, added it back in RATIS-2057.
Not sure why. Could you try
@whbing , please also include HDDS-10690.
@kerneltime Thanks for your suggestion. A lot of improvement when bumping up the thread pool size.
(1) default:
$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
count = 8000000
mean rate = 59583.14 calls/second
1-minute rate = 59483.53 calls/second
5-minute rate = 53146.84 calls/second
15-minute rate = 50789.33 calls/second
min = 0.42 milliseconds
max = 17.52 milliseconds
mean = 6.55 milliseconds
stddev = 2.69 milliseconds
median = 6.54 milliseconds
75% <= 8.44 milliseconds
95% <= 10.92 milliseconds
98% <= 11.89 milliseconds
99% <= 12.32 milliseconds
99.9% <= 13.52 milliseconds
Total execution time (sec): 134
Failures: 0
Successful executions: 8000000
(2) ipc.server.read.threadpool.size=20
$ ozone freon ome -t=400 --clients=400 -n=8000000
-- Timers ----------------------------------------------------------------------
rpc-payload
count = 8000000
mean rate = 273248.49 calls/second
1-minute rate = 251957.31 calls/second
5-minute rate = 239593.34 calls/second
15-minute rate = 237211.86 calls/second
min = 0.16 milliseconds
max = 131.59 milliseconds
mean = 1.46 milliseconds
stddev = 4.73 milliseconds
median = 1.23 milliseconds
75% <= 1.31 milliseconds
95% <= 1.63 milliseconds
98% <= 2.13 milliseconds
99% <= 2.96 milliseconds
99.9% <= 71.47 milliseconds
Total execution time (sec): 29
Failures: 0
Successful executions: 8000000
Other tests will be supplemented later. But I experienced a performance decrease with mixed read/write operations in the previous tests, which is unexpected.
This is also unexpected to me. Not sure if the recent commits could fix it.
@szetszwo I included the commits you mentioned and ran freon on the following branch (please point out anything incorrect in my code):
Unfortunately, LINEARIZABLE was worse than DEFAULT when I ran the same freon test. I think once LINEARIZABLE is better than DEFAULT, we can do a benchmark based on the flow @kerneltime pointed out. (By the way, the master branch (ozone-1.5) https://github.com/whbing/ozone/tree/follower-read-1.5 is not yet compatible with follower read (RPC requests appear not to be handled), so I am temporarily testing based on ozone-1.4.)
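For context, a sketch of how the DEFAULT vs LINEARIZABLE read option is toggled on the Ratis side, assuming the RaftServerConfigKeys.Read.Option API from RATIS-1557; where exactly Ozone wires this up may differ in the PR.

import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;

class LinearizableReadOptionSketch {
  static RaftProperties withLinearizableRead() {
    RaftProperties properties = new RaftProperties();
    // Assumed API from RATIS-1557: DEFAULT answers queries without the read-index
    // barrier (leader only); LINEARIZABLE runs the read-index algorithm so any
    // server, including a follower, can serve linearizable reads.
    RaftServerConfigKeys.Read.setOption(properties, RaftServerConfigKeys.Read.Option.LINEARIZABLE);
    return properties;
  }
}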
@whbing do you plan to continue work on this? This is an important feature for Ozone.
@kerneltime Sorry for the late follow-up. I might need a few more days (maybe next week) before I have time to continue working on this. (BTW, anyone has edit access to the document "OM HA support read from followers testing" and can continue in this document.)
Thanks @szetszwo for working on this. This should really help reduce the load on the leader OM once it's in. I think maybe we could add an option to let users choose to read directly from follower OMs. There’s a risk of reading stale data, but the benefit is lower latency compared to linearizable reads, even if those are implemented perfectly. It would also help further reduce the load on the leader. I think it’d be useful to make this trade-off configurable so users can choose based on their needs.
@chungen0126 , sure, we could add a conf to support stale read.
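Purely as an illustration of such a switch, a sketch with entirely hypothetical names (this config key and class do not exist in Ozone today):

import org.apache.hadoop.hdds.conf.OzoneConfiguration;

class StaleReadSwitchSketch {
  // Hypothetical key, not an existing Ozone property.
  static final String ALLOW_STALE_READ_KEY = "ozone.om.read.allow.stale";

  static boolean useStaleRead(OzoneConfiguration conf) {
    // true  -> answer reads from the local OM DB immediately (may lag the leader)
    // false -> route reads through the Ratis linearizable read path
    return conf.getBoolean(ALLOW_STALE_READ_KEY, false);
  }
}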
@szetszwo Would you like to include stale read in this patch, or would you prefer to create a new JIRA ticket for it? I’m happy to help if needed.
What changes were proposed in this pull request?
Ratis has a new Linearizable Read (RATIS-1557) feature, including reading from the followers. In this JIRA, we will change OM to serve read requests from any OM server, including the follower OMs.
What is the link to the Apache JIRA
HDDS-9279
How was this patch tested?
Will add new tests.