Do not recommend increasing max_shards_per_node #120458

Merged

Conversation

DaveCTurner
Contributor

Today if the `shards_capacity` health indicator detects a problem then
it recommends increasing the limit, which goes against the advice in the
manual about not increasing these limits and also makes it rather
pointless having a limit in the first place.

This commit improves the recommendation to suggest either adding nodes
or else reducing the shard count.
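
For context, this diagnosis is surfaced through the health report API, so the wording in question is exactly what operators see when they approach the limit. A minimal sketch of where it appears, assuming an 8.x cluster; the response below is heavily abbreviated and its field values are illustrative only (the rendered diagnosis id may carry a longer prefix than the bare id shown here):

GET _health_report/shards_capacity

{
  "indicators": {
    "shards_capacity": {
      "status": "yellow",
      "diagnosis": [
        {
          "id": "decrease_shards_per_non_frozen_node",
          "action": "..."
        }
      ]
    }
  }
}
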
@DaveCTurner added the >bug, auto-backport, :Data Management/Health, v9.0.0 and v8.18.0 labels on Jan 20, 2025
@DaveCTurner requested a review from gmarouli on January 20, 2025 09:10
@DaveCTurner requested a review from a team as a code owner on January 20, 2025 09:10
@elasticsearchmachine added the Team:Data Management label on Jan 20, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Comment on lines 81 to 88
 static final Diagnosis SHARDS_MAX_CAPACITY_REACHED_DATA_NODES = SHARD_MAX_CAPACITY_REACHED_FN.apply(
-    "increase_max_shards_per_node",
+    "decrease_shards_per_non_frozen_node",
     ShardLimitValidator.SETTING_CLUSTER_MAX_SHARDS_PER_NODE,
-    "data"
+    "non-frozen"
 );
 static final Diagnosis SHARDS_MAX_CAPACITY_REACHED_FROZEN_NODES = SHARD_MAX_CAPACITY_REACHED_FN.apply(
-    "increase_max_shards_per_node_frozen",
+    "decrease_shards_per_frozen_node",
     ShardLimitValidator.SETTING_CLUSTER_MAX_SHARDS_PER_NODE_FROZEN,
@DaveCTurner (Contributor Author)

The bad "increase the limit" advice was baked into the actual diagnosis IDs - fixed here, and see also https://github.com/elastic/telemetry/pull/4362 for the corresponding change to the telemetry cluster

@gmarouli
Contributor

Hey @DaveCTurner, you are bringing up a very good point here. I do have a concern though.

If I am not mistaken the current limit is quite low, so it is probable that it would make sense to first increase the limit before expanding the cluster or reducing the shards. So I am thinking of two options to make this more useful to users:

  • we could combine this PR with increasing the default to a more realistic value, considering the many-shards improvements,
  • or, if that is too risky or difficult to implement, add one more sentence to the diagnosis giving the user one last option to increase the limit if they are sure that their cluster can handle the load.

Does this make sense?
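
For reference, the limit discussed here is the cluster.max_shards_per_node setting (with cluster.max_shards_per_node.frozen covering frozen nodes). The "one last option" of increasing it would amount to a cluster settings update roughly like the sketch below; the value 1500 is purely illustrative, and the point of this PR is that the health API should stop steering users towards it:

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1500
  }
}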

@DaveCTurner
Contributor Author

The default of 1000 shards per node is still rather relaxed IMO, at least for high-segment-count or high-field-count indices, and we do want users to stick to it for now. We do get support cases involving egregiously high shard-per-node counts sometimes, and we need to be able to point at the guidance in the manual when telling users to scale up their clusters. That argument is rather weakened if the health API has been telling them specifically to keep relaxing the limit every time they got close.

A better limit would be nice ofc, maybe one based on #111123, but that won't be a quick process and I don't think we can in good conscience block this change on that work.
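
To make the scale-out alternative concrete (a rough sketch with illustrative numbers): the limit is enforced per non-frozen data node, so the cluster-wide budget is approximately max_shards_per_node multiplied by the number of non-frozen data nodes, e.g.

  3 data nodes × 1000 shards/node = 3000 open shards allowed
  4 data nodes × 1000 shards/node = 4000 open shards allowed

which is why the new diagnosis can reasonably recommend adding nodes (or reducing the shard count) instead of relaxing the limit. Frozen nodes are budgeted separately against the frozen setting.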

@gmarouli (Contributor) left a comment

LGTM! Thanks for raising and addressing this, @DaveCTurner.

@DaveCTurner merged commit 9c0709f into elastic:main on Jan 20, 2025
16 checks passed
@DaveCTurner deleted the 2025/01/20/shards-capacity-advice branch on January 20, 2025 at 14:33
@DaveCTurner
Contributor Author

Thanks @gmarouli

@elasticsearchmachine
Collaborator

💔 Backport failed

Branch 8.x: the commit could not be cherry-picked due to conflicts.

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 120458

DaveCTurner added a commit that referenced this pull request Jan 20, 2025
@DaveCTurner
Contributor Author

Backported to 8.x in de5be24
