-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Reindex data stream indices on different nodes #125171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reindex data stream indices on different nodes #125171
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @masseyke, I've created a changelog YAML for you. |
…masseyke/elasticsearch into round-robin-reindex-data-stream-indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a round-robin strategy for reindexing data stream indices by routing reindex requests to different ingest nodes, aiming to alleviate pipeline execution bottlenecks on a single node. Key changes include:
- Updating ReindexDataStreamIndexTransportAction to use transportService.sendRequest to distribute work among ingest nodes.
- Modifying tests in ReindexDataStreamIndexTransportActionTests to simulate cluster states with various ingest node configurations and to validate round-robin behavior.
- Adding a changelog entry to document the enhancement.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
x-pack/plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/action/ReindexDataStreamIndexTransportAction.java | Updates reindex logic to round-robin requests among ingest nodes, replacing client.execute with transportService.sendRequest. |
x-pack/plugin/migrate/src/test/java/org/elasticsearch/xpack/migrate/action/ReindexDataStreamIndexTransportActionTests.java | Updates tests to properly mock cluster states and verify that round-robin node assignment behaves as expected. |
docs/changelog/125171.yaml | Adds a changelog entry summarizing the enhancement for reindexing data stream indices. |
Comments suppressed due to low confidence (2)
x-pack/plugin/migrate/src/test/java/org/elasticsearch/xpack/migrate/action/ReindexDataStreamIndexTransportActionTests.java:270
- [nitpick] Consider capturing and asserting the entire sequence of nodes returned by the round-robin logic (e.g. using getAllValues() from the captor) to more robustly verify the round-robin ordering.
for (int i = 0; i < ingestNodeCount - 1; i++) {
x-pack/plugin/migrate/src/main/java/org/elasticsearch/xpack/migrate/action/ReindexDataStreamIndexTransportAction.java:323
- [nitpick] The error message 'No ingest nodes in cluster' could be further enhanced by suggesting possible steps or checks for cluster configuration to help users troubleshoot the issue.
if (ingestNodes.length == 0) {
if (ingestNodes.length == 0) { | ||
listener.onFailure(new NoNodeAvailableException("No ingest nodes in cluster")); | ||
} else { | ||
DiscoveryNode ingestNode = ingestNodes[Math.floorMod(ingestNodeOffsetGenerator.incrementAndGet(), ingestNodes.length)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking the API's for Math.floorMod
and AtomicInteger
, I think this line should correctly handle the case where the AtomicInteger
overflows and the dividend becomes negative but is it worth adding an test for that case to future proof or are the current tests OK to handle it passively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried it out, and it is a minor bug -- Math.floorMod(Integer.MAX_VALUE, 17)
is equal to Math.floorMod(Integer.MAX_VALUE + 1, 17)
. So if the test were to start with a value near Integer.MAX_VALUE it would fail (although I'm not too worried about the round-robin repeating a node twice in that very rare situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the maximum initial value of that random number to be much smaller so that it never exceeds MAX_VALUE and the test will never fail. I think the actual behavior (choosing the same node twice once every 4.3 billion times) is harmless, and not worth the complexity of fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One NABD comment but otherwise looks like a nice speed improvement :-) 👍🏻
…masseyke/elasticsearch into round-robin-reindex-data-stream-indices
(cherry picked from commit 24132d3)
(cherry picked from commit 24132d3)
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
(cherry picked from commit 24132d3)
ReindexDataStreamIndexTransportAction currently always runs TransportReindexAction on the local node. TransportReindexAction allows for a pipeline to be passed in. When this happens, the pipeline is run on the local node, and then the output data is sent to whichever node has its shards for indexing. We have found this to be a bottleneck in performance testing. For example, we tested using a single data stream with 100 10-GB indices on a 10-node cluster. We configured data stream reindex to allow for 100 indices to be reindexed at once, and no throttling. We found that one node (the one where the reindex data stream task was running) to average 100% CPU use (with almost all of that being pipeline execution), and the other nodes to average ~15% (almost all indexing-related).
An initial change was to modify ReindexTransportAction to round-robin the nodes that it uses to handle slices of the data. In the example above, each index is divided into several slices, and each slice is sent to a different node. This led to the total reindex time for the 100-index data stream to be 1/3 of what it had been before.
This PR is a little less risky. It does the round-robin logic inside of data stream reindex. Pipelines for all documents of all slices for a single index will still be executed on a single node, but each index within a data stream is potentially sent to a different node. This will get us a pretty similar performance increase in most data stream reindex uses, without the risk of touching something that is in much wider use. And I don't think that this change prevents us from making the other change later on. They are both actually good for two different use cases. This change is good if you have many indices, and especially if you have many small indices. The change to TransportReindexAction would be good if you have one very large index.
The pipeline that is run in most cases is the one added in #121617.