Skip to content

HDDS-12928. datanode min free space configuration #8388

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

sumitagrawl
Copy link
Contributor

@sumitagrawl sumitagrawl commented May 5, 2025

What changes were proposed in this pull request?

  1. Design document with conclusion for changes
  2. Min free space default value to 20GB and Max of MinFreeSpace and MinFreeSpace percent

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12928

How was this patch tested?

  • Unit test cases updated

@sumitagrawl
Copy link
Contributor Author

As discussion over community and above discussion, Approach 2 is better.

Another point came for default min.free.space -- instead of 10 GB, it can be 20GB as to consider 4 times the default container size. This is case considering activity like container re-concillation, disk space required for export/import under error case, to avoid any blockage of disk.

@sumitagrawl sumitagrawl marked this pull request as ready for review May 9, 2025 06:04
@sumitagrawl sumitagrawl changed the title HDDS-12928. design approach for min free space configuration HDDS-12928. datanode min free space configuration May 9, 2025
@@ -29,6 +29,7 @@
import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_DB_VOLUMES_TOLERATED_KEY;
import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_METADATA_VOLUMES_TOLERATED_KEY;
import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.FAILED_VOLUMES_TOLERATED_DEFAULT;
import static org.apache.hadoop.ozone.container.common.statemachine.DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE_PERCENT_DEFAULT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a new unit test which doesn't explicitly set any of the two properties.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is considered in org.apache.hadoop.ozone.container.common.statemachine.TestDatanodeConfiguration#isCreatedWitDefaultValues

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isCreatedWitDefaultValues unsets DatanodeConfiguration.HDDS_DATANODE_VOLUME_MIN_FREE_SPACE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unset ensure default value is used in ozone configuration, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unset is done for ozone-site.xml as defined in test module, so that it can use default value if not defined. comment added.

@sumitagrawl sumitagrawl requested a review from ChenSammi May 12, 2025 03:37
}

@Test
void useMinFreeSpaceIfBothMinFreeSpacePropertiesSet() {
void useMaxAsPercentIfBothMinFreeSpacePropertiesSet() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useMaxAsPercentIfBothMinFreeSpacePropertiesSet ->

useMaxIfBothMinFreeSpacePropertiesSet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@sumitagrawl sumitagrawl requested a review from ChenSammi May 12, 2025 09:25
@sumitagrawl
Copy link
Contributor Author

Please don't mix design doc and implementation in the same PR. Updates to documentation-only PR are cheap due to limited set of checks. If code changes are mixed in, all checks need to be run just to fix a typo in the doc.

@adoroszlai This is concluded to have in same PR if design is very small and need discussion. Having multiple PR for such changes may not be required. This is confirmed in community meeting.

Please share if have different opinion over this ?

@adoroszlai
Copy link
Contributor

adoroszlai commented May 14, 2025

Please share if have different opinion over this ?

I already mentioned why I think separate PR is better even for small design doc. Alternatively: once code changes are added to the PR, push any further design doc changes together with code changes.

No need to split this PR now, but please consider it next time.

@sumitagrawl sumitagrawl marked this pull request as ready for review May 14, 2025 06:18
@smengcl
Copy link
Contributor

smengcl commented May 15, 2025

@sumitagrawl Could you resolve the code conflict? Thanks.

@kerneltime
Copy link
Contributor

Please don't mix design doc and implementation in the same PR. Updates to documentation-only PR are cheap due to limited set of checks. If code changes are mixed in, all checks need to be run just to fix a typo in the doc.

I think for small changes, it should be ok to put the design doc and the code change together. If the PR is churning on the design doc we can keep the PR in draft mode and switch it to ready once design doc looks good and the code works in the fork. This would make it much easier to to browse the git log and understand the code changes done in one go. From a usability stand point I would much rather have a small design doc live together with a code change when browsing history.

@sumitagrawl
Copy link
Contributor Author

sumitagrawl commented May 16, 2025

@sumitagrawl Could you resolve the code conflict? Thanks.

done

@adoroszlai
Copy link
Contributor

I would much rather have a small design doc live together with a code change when browsing history.

That's a good point. It should be fine, if developers can follow one of:

If the PR is churning on the design doc we can keep the PR in draft mode and switch it to ready once design doc looks good and the code works in the fork.

Alternatively: once code changes are added to the PR, push any further design doc changes together with code changes.

Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sumitagrawl for the patch.

This case is more useful for test environment where disk space is less and no need any additional configuration.

# Conclusion
1. Going with Approach 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting both explicit size and percent is good, but there's a few issues still not addressed:

  • Do we only want to support setting one global size for all volumes or supporting individual volume configs?
  • If we are adjusting how hdds.datanode.volume.min.free.space works, we should also adjust hdds.datanode.dir.du.reserved to support configuration in a consistent way.
  • It is bad UX to have two different configs (percent and value) for the same thing. The user has no intuition as to what happens when both are configured.
    • Having a max function buried in the code to resolve this instead of making them exclusive is even worse.

Probably the most user friendly thing to do is deprecate the percent config keys and have one config that takes either a size or percent based value. Whether we want to continue supporting individual volume mappings in the config is still an open question that needs to be resolved in this proposal.

Copy link
Contributor Author

@sumitagrawl sumitagrawl May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@errose28
The config is applicable for each volume as global config only, do not support individual volume. As min-free-space as ozone maintain is global in nature for each volume.
hdds.datanode.dir.du.reserved config simplification is not in scope of this JIRA/PR.

Using 2 config has been discussed in community meeting, and concluded to have both. Any concern now, need re-discuss over community again.

Single config: Approach "2" is not being opted with majority, and hence went with Approach 1 as max of 2. I have updated in design doc for both Approach 1 and approach 2 pros/cons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using 2 config has been discussed in community meeting, and concluded to have both. Any concern now, need re-discuss over community again.

Community meetings are for synchronous discussion, not definitive decisions. There are many other forums (mailing list, PRs, Jira, Github discussion). I think this kind of issue is fine for discussion in PR. If you are concerned about visibility, please discuss on mailing list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@errose28 after discussion over community, will go with Approach 1 only.

  1. purpose of du.reserved config is identifying the disk to be reserved for the application sharing the disk, and hence its at disk level. But here, since its ozone managed space, this needs to be flat configuration. So both need not be same.

  2. For simplicity for min.free.space config, its at global level, and may not be required to be disk level similar to reserved.

  3. Max of min.free space and percent is done, min.free space represent min threshold for most of the disk ranges, and percent to be if some disk are exceptionally higher size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try to add some guiding principles for config modifications which can help us compare one decision or another. The following are usability issues that can occur with config keys:

  1. Inconsistent config format: Configs that operate on similar entities (space usage, address + port, percentages) that read those values differently.
  2. Hidden config dependencies: When one configuration whose value is unchanged functions differently based on the value applied to a different config.
    • This does not include invalid config combinations that fail component startup, since that is easily caught and called out with an error message. We know that no actively running system will have this configuration.

Both hdds.datanode.du.reserved{.percent} and hdds.datanode.min.free.space{.percent} have issues here, and this is our chance to fix them. Now let's look at how our options either help or hurt the above points.

Inconsistent config format

hdds.datanode.du.reserved and hdds.datanode.min.free.space are both used to configure space reservation on datanode drives, so as stated in point 1 it is most intuitive if they accept the same value format. It is ok if one format is more useful for one than another. For example per-volume configuration may be required for hdds.datanode.du.reserved but not for hdds.datanode.min.free.space. It's still ok for both to have that option because it is not invalid for hdds.datanode.min.free.space, there is still only one set of formatting options for users to remember, and only one parser in the code. If we pick and choose different valid formats for each config we will have two formats to remember and two parsers in the code. Therefore even removing allowed config formats from hdds.datanode.min.free.space that are still present in hdds.datanode.du.reserved actually adds complexity. Based on this hdds.datanode.du.reserved and hdds.datanode.min.free.space must accept values of the same format to avoid introducing new config usability problems.

Hidden config dependencies

Next let's look at how the percent variations affect point 2. Anything other than failing startup if the percent and non-percent variations are specified creates this problem, so if a percent and non-percent config key are given like hdds.datanode.min.free.space.percent and hdds.datanode.min.free.space it must be considered invalid and fail the datanode.

There is another option though: get rid of the percentage specific config keys but still support percentage based configuration with the one hdds.datanode.min.free.space config. Let's look at why this works:

  • hdds.datanode.du.reserved needs to support volume specific configuration in the form of <volume-path>:reserved-size since not all volumes may be used as spill for compute, or the volumes may be utilized differently.
    • This means we will always have a parsing method like VolumeUsage#getReserved to handle converting config strings into long values for a volume.
  • hdds.datanode.min.free.space and hdds.datanode.du.reserved should support the same value format, so hdds.datanode.min.free.space also needs to use this same parser.
  • If we are already need a string parser for both configs, we might as well make it differentiate between percentage and size based configs too.

Proposal to address all requirements

The following layout meets all the constraints defined above:

  • Only two config keys: hdds.datanode.min.free.space and hdds.datanode.du.reserved
  • The valid formats for either config key are:
    • A fixed size, like 20GB
    • A percentage as a float, like 0.001. The lack of a unit differentiates it from the first option.
    • A mapping of volumes to sizes, like /data/hdds1:20GB,/data/hdds2:10GB
  • Only one parser is required for both types of configs.
    • This is not new since a parser is already required and cannot be removed without removing support for per-volume configuration in hdds.datanode.du.reserved.

We should never introduce usability issues in our configurations. We have enough of them already : ) If you can show how an alternate proposal meets all the configuration requirements without impacting usability we can consider that as well, but currently none of the proposals in the doc satisfy this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@errose28 You mean we need have another config for min.free.space?
min.free.space.volumes:/data/hdds1:20GB,/data/hdds2:10GB. -- to be similar to reserve? as similar to du.reserve

I do not feel being in name of similar config for space, we should go with this approach, These are if different purpose. Making similar just in name of both represent free space will make configuration complex for min.free.space as user need config for all disk. There is no usecase till not for min.free.space for this.

I do not agree with this approach. In future if there is a need for this for volume mapping for min.free.space, we can ad as separate requirement and handle.

Share your suggestion for this PR if can be merged .....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding one more to @errose28's list of requirements: cross-compatibility. When extending the possible values allowed for existing configuration, e.g.:

  • adding suffix
  • starting to support percentage
  • allowing list of items instead of single one

we need to consider that even old version may encounter values understood only by new one, and fail. (See HDDS-13077 for a specific example.)

In such cases it may be better to deprecate the existing config properties and add new one(s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumitagrawl please re-read the Proposal to address all requirements section in my reply. I think this very clearly states the proposal but the things you are referring to in your reply are not mentioned there.

You mean we need have another config for min.free.space?

No, two configs, one for min free space and one for DU reserved that each use the same value schema. I very clearly said in the previous response "Only two config keys: hdds.datanode.min.free.space and hdds.datanode.du.reserved".

I do not feel being in name of similar config for space, we should go with this approach, These are if different purpose.

This is your take as developer. You need to look at this from a user's perspective. Our consistent failure to consider this perspective is why the system is difficult to use. Configs representing the same "type" of configuration, be it an address, percentage, disk space, time duration, etc must accept the same types of values. Users are not going to understand the nuance of why two similar configs accept different value formats, and in a few months I probably won't either.

Making similar just in name of both represent free space will make configuration complex for min.free.space as user need config for all disk.

This is not part of the proposal. Please re-read it. Min space can be configured with one value across all disks, OR it can use a volume mapping.

There is no usecase till not for min.free.space for this.

Lack of use case is not a valid reason to create a separate value schema for configs that work on the same type. There is also no use case for setting hdds.heartbeat.interval to 7d, but the same value makes perfect sense for hdds.container.scrub.data.scan.interval. Yet they use the same value schema because they both represent time intervals. Your suggestion is analogous to rejecting the d suffix for hdds.heartbeat.interval because it would never be set that long.

@adoroszlai

we need to consider that even old version may encounter values understood only by new one, and fail.

We definitely need to formalize our configuration compatibility guarantees. This probably warrants a dedicated discussion somewhere more visible. My initial take is that we should always support "new software old config", but that supporting "old software new config" is not sustainable because it closes our config for extensions. Especially on the server side this would seem like a deployment error. Maybe our client side config compat guarantees would be different from the server.

Copy link
Contributor Author

@sumitagrawl sumitagrawl May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@errose28

No, two configs, one for min free space and one for DU reserved that each use the same value schema

DU reserved is special case carried from Hadoop, for case of disk sharing by other application. This may not be required to have same value Schema. This needs user input over various disk as sharing may differ, so this schema is specialized. They are not of same type.

This is your take as developer. You need to look at this from a user's perspective. Our consistent failure to consider this perspective is why the system is difficult to use.

From user perspective only, user have no knowledge how to configure the min-free-space, this is more internal to Ozone working.

volume mapping

This might be additional config can be added later on on need basis. May be we should not add just based on
intuition, as this may go to be dead config.
Please share any possible use case in practical env, we can take up this as enhancement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per disk configuration is an abomination that stems from needing to run other applications on nodes/drives along with HDFS in the past. It makes sense for the du config where essentially we tell the Datanode to spare a few drives. This is very different from the min configurations, which has to do with operations and uptime of applications. We must keep configurations for min same across all drives as it has to do with space for repairs and recovery and nothing to do with configuration of the cluster with regards to co-existing with peer applications.

I am all for consistency but in this case it implies a capability that I am not sure we wish to implement.

@adoroszlai adoroszlai marked this pull request as draft May 20, 2025 05:19
@sumitagrawl sumitagrawl marked this pull request as ready for review May 20, 2025 15:17
@sumitagrawl
Copy link
Contributor Author

@errose28 @kerneltime please share opinion, based on this will merge this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants