HDDS-8387. Improved Storage Volume Handling in Datanodes #8405

Draft: wants to merge 8 commits into base: master

Conversation

Contributor

@errose28 errose28 commented May 6, 2025

What changes were proposed in this pull request?

This document outlines a proposal to improve storage volume failure handling in datanodes by:

  • Introducing a new degraded state to handle partially failed volumes without reducing the durability of readable data.
  • Improving volume health observability through CLI and metrics.

This can be used as the focus for Phase II of the container and volume scanners, which is currently just a collection of miscellaneous tasks.

What is the link to the Apache JIRA

HDDS-8387

How was this patch tested?

N/A

@errose28
Contributor Author

errose28 commented May 6, 2025

cc @ptlrs who helped work on this design.

@sodonnel
Contributor

sodonnel commented May 7, 2025

My observation from past problems on HDFS is that partially failed disks are a very large problem. They are hard to detect and sometimes reads on them can block for a very long time, resulting in hard to explain slow reads. I'd be more in favor of failing bad volumes completely, rather than

I understand the idea this doc is going with, but it does add quite a bit of new complexity to the system:

  • The new degraded state and DN excluding it for writes
  • DN capacity reduction perhaps? Balancer has to factor this in.
  • SCM tracking container to volume mappings
  • The replication flow considering the new state
  • Probably a need for SCM to tell clients to try the degraded volume last

The system is intended to handle the abrupt loss of a datanode or disk at any time, so what is driving the need for this proposal? Are volumes being failed too easily resulting in dataloss?

If volumes are being failed too eagerly, then for what reason? Disk full, checksum errors, outright failed reads?

We do have mechanisms to repair bad containers already (scanner and reconciler), so that part is handled.

What is considered an IO error which can trigger an on-demand scan? Is it a checksum validation failure or an unexpected EOF / data length error? Are we keeping a sliding window count of each unique block so that 10 failures on the same block only count as 1 rather than 10?

@errose28
Contributor Author

errose28 commented May 7, 2025

Thanks for checking this out @sodonnel. I can improve the motivation at the top of this doc, but the driving factor is the same as any changes we have made to replication manager, reconstruction, or reconciliation: As a storage system, we must prioritize data durability over everything else, and we should never deliberately reduce data durability.

My observation from past problems on HDFS is that partially failed disks are a very large problem. They are hard to detect and sometimes reads on them can block for a very long time, resulting in hard to explain slow reads. I'd be more in favor of failing bad volumes completely,

This is conflating two different issues with partially failed volumes: performance and durability. This doc is only concerned with data durability, which is more important. If a disk is causing performance problems then that should be identified with metrics and alerting, which we also don't do well, but that would be a different proposal. We should not remove readable replicas without first copying them just to improve system performance.

The system is intended to handle the abrupt loss of a datanode or disk at any time, so what is driving the need for this proposal? Are volumes being failed too easily resulting in dataloss?

There is a difference between us losing copies of data because of an external issue we are responding to, and us losing copies of data because we removed them ourselves. In the latter case we are in control, and need to make new copies before removing existing ones. For reference, previously our handling of unhealthy replicas did not do this (we deleted them on sight) and this was rightfully changed.

If volumes are being failed too eagerly, then for what reason? Disk full, checksum errors, outright failed reads?

This seems to imply that there is an exact set of criteria for failing a volume, and anything outside of that is either "too eager" or "not eager enough". Disk failures are a fuzzy problem and I don't think such an exact set of criteria exists. The purpose of adding an intermediate state is to safely account for this unknown, rather than pin down a binary definition of volume health which becomes closely tied to our durability guarantees.

We do have mechanisms to repair bad containers already (scanner and reconcilor), so that part is handled.

This is true. An alternate proposal would be to keep the current criteria we are using for volume failure, and discard all checks that this doc currently proposes using to move a volume to degraded health. Then let scanner + reconciler fix things as we go. I considered this approach and I'm actually not opposed to it, my hesitation was that it seems irresponsible to treat volumes that are frequently throwing errors the same as if they are totally healthy. We cannot choose to fail these reachable volumes without first copying all their data though.

What is considered an IO error which can trigger an on-demand scan? Is it a checksum validation failure or an unexpected EOF / data length error? Are we keeping a sliding window count of each unique block so that 10 failures on the same block only count as 1 rather than 10?

Everything listed here could trigger an on-demand scan. Currently the on-demand volume scanner is plugged into the catch blocks of most datanode IO paths. The sliding windows are planned to be tracked at a per-disk level, but this raises a good point that if one bad sector becomes hot it may artificially cause the volume to seem worse than it is purely based on scan counts.

Overall I agree that there is complexity involved here, and I am not tied to this particular solution. One alternate proposal could be to improve our disk health metrics and dashboards, maybe putting some info in Recon, to alert when disks have reached a degraded state. But at that point the safe way out would be disk decommissioning, which would be a new feature that looks similar to this one.

Regardless of the proposal, I do think we need change in this area. As stated at the top of the doc, currently our only two options to handle partial volume failures are to reduce durability by removing all data on a disk that is potentially still readable, or swallow disk errors with the scanner and continue to put new data on this volume as if nothing is wrong.

@errose28
Contributor Author

errose28 commented May 8, 2025

@sodonnel based on your comments I have another proposal to handle this issue. I can write that up in this doc as well so we can compare.

The current proposal mixes a degraded volume state with a sort of volume decommissioning feature. The latter is where most of the complexity comes from. As an initial change, we can make the degraded state purely a sort of alert that shows up via metrics, CLI, Recon, etc when a volume is experiencing numerous IO errors but is still reachable. The state does not need to be persisted in this case. At a later time, we can add volume decommissioning as a separate feature, which would handle persistence of the decom state, space calculation, moving data, and all that work similar to full datanode decommissioning. We could optionally add a config to have the system automatically decom degraded volumes. However, in this proposal volume decommissioning would be left as a future improvement, and the current scope of work would just be about flagging a degraded state for volumes.

@errose28
Contributor Author

The original version of this doc can be viewed from the commit history. I've updated it to treat volume decommissioning as a separate feature that is not scoped here.

@kerneltime kerneltime requested a review from Copilot May 13, 2025 22:02

@Copilot Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Once degraded volumes can be reported to SCM, it is possible to take action for the affected data on that volume. When a degraded volume is encountered, we should:
- Stop writing new data onto the volume
- Make additional copies of the data on that volume, ideally using other nodes as sources if possible
- Notify the admin when all data from the volume has been fully replicated so it is safe to remove
Contributor

Another possibly simpler option, but not without its own challenges, would be to use a variation of the disk balancer to copy "as much as possible" off the old disk onto other disks on the same node. Then fail the disk and let the hopefully few containers with issues be replicated via the normal path.

This does get into just how bad the disk is - if most reads are failing or are very slow, this is unworkable probably, but then also the disk is probably affecting normal cluster read ops.

Contributor

Yes, this shows another possible way to handle degraded volumes quickly.

Contributor Author

Yes this is getting into the decom implementation which is not really in scope for this document. This section is just to demonstrate how degraded volumes could be handled by a generic volume decom feature with only a loose coupling between the two features' implementations.

@sodonnel
Contributor

I think it is a good idea to decouple a disk decommission feature from the detection design.

I do see that disk decommission would be potentially valuable, but at the same time HDFS has lived without it. Disks are getting denser so that is an argument in favor of decommissioning a disk, as the recovery time gets longer with the density.

HDFS does have a hot swap disk feature. An admin can adjust config on a running datanode process to remove or add a volume without restarting the process. If you remove a volume, then the DN just reports its blocks to the NN without that disk and the NN replicates as usual.

One thing about soft failing a disk, by which I mean we fail it in Ozone, but to the OS the disk is still there and working, is that the data on it is not lost. If several disks on the cluster were soft failed at the same time, then you could have a data availability problem, but not a data loss problem. The data is still there such as it was (readable or not). That makes the graceful decommission of a bad disk nice to have, but not essential. If replication cannot keep up before the next disk fails, the admin can bring the soft failed disks back to allow the data to be recovered.

One thing that is important, is that there is a clear indication somewhere (SCM WebUI, Metrics, etc) that some configured disks on the cluster are offline, so that in the event of data availability problems it is clear that some disks have been dropped and that can be investigated.


- Add metrics for volume scanners. They currently don't have any.
- This includes metrics for the current counts in each health state sliding window.
- Add metrics for volume health state (including the new degraded state), which can be used for alerting.
Contributor

Coming from a monitoring background, I have three thoughts here:

  • How will the state be represented as a metric? Is there a known/intended approach that fits the models in Metrics2, Prometheus (metrics and alert rules), and Grafana?
  • Do you think the resulting cardinality will scale to large clusters (e.g., 1 EB) with the typical monitoring system?
  • As a follow-up, with alerting being the primary use case, are we fine with the costs of individually tracking each volume? (Time series are fairly expensive in general, so this could work out to anything between a non-issue and a serious concern, in my opinion.)

Contributor Author

These are good questions. I should solidify a proposal for metrics in the doc. Basically we need to make a decision whether to expose volume health state per volume as a metric, or to just track the number of volumes in each state per datanode. The latter is much simpler and I believe that is all that HDFS has (NumFailedVolumes). I think I am leaning towards this approach, and admins can use the CLI to figure out which specific volumes on the alerted datanode are bad.

How will the state be represented as a metric? Is there a known/intended approach that fits the models in Metrics2, Prometheus (metrics and alert rules), and Grafana?

If we want to expose health per volume as a metric I think we have two options:

  • A gauge per health state per volume, set to either 1 to indicate the volume is in that state, or 0 to indicate it is not.
  • One gauge per volume, which takes a different numeric value for each state. For example 0=healthy, 1=degraded, 2=failed.

I don't think there is a good way to expose state as a string like healthy, degraded, failed without being coupled to a particular framework afaik. However, now that I'm looking at this problem I'm not sure we want to track health state per volume as a metric.
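
For comparison, here is a rough sketch of the simpler per-datanode counting approach mentioned above. The class, enum, and metric names are illustrative assumptions, not existing Ozone code:

```java
import java.util.EnumMap;
import java.util.Map;

// Rough sketch only: track counts of volumes per health state for one
// datanode, similar in spirit to HDFS's NumFailedVolumes. Exported gauges
// (e.g. NumHealthyVolumes, NumDegradedVolumes, NumFailedVolumes) would just
// read these counters.
public class VolumeHealthCounts {
  public enum State { HEALTHY, DEGRADED, FAILED }

  private final Map<State, Integer> counts = new EnumMap<>(State.class);

  public VolumeHealthCounts(int totalVolumes) {
    for (State s : State.values()) {
      counts.put(s, 0);
    }
    // All volumes start healthy since health state is not persisted.
    counts.put(State.HEALTHY, totalVolumes);
  }

  // Called whenever a volume changes state.
  public synchronized void onStateChange(State from, State to) {
    counts.merge(from, -1, Integer::sum);
    counts.merge(to, 1, Integer::sum);
  }

  public synchronized int get(State s) {
    return counts.get(s);
  }
}
```

The per-volume gauge options above would replace these few counters with one gauge per volume, which is where the cardinality concern comes in.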

Do you think the resulting cardinality will scale to large clusters (e.g., 1 EB) with the typical monitoring system?

I'm not sure I follow this part entirely. Is the concern that the monitoring will not be able to keep up with checking each volume for an alert even if the underlying metrics database contains the information?

are we fine with the costs of individually tracking each volume?

Currently datanodes already track one set of metrics per volume using VolumeInfoMetrics, although this is only used for hdds/data volumes right now and should also be used for metadata volumes. If we choose to track health state as a metric per volume, we would not significantly increase the amount of data being tracked beyond this.

@errose28
Contributor Author

I've updated the doc with new sections to specify improvements to metrics and CLI for identifying volume health. See also #7266 which is related to this area.


Once degraded volumes can be reported to SCM, it is possible to take action for the affected data on that volume. When a degraded volume is encountered, we should:
- Stop writing new data onto the volume
- Make additional copies of the data on that volume, ideally using other nodes as sources if possible
Contributor

What's the extra benefit of keeping the volume as degraded instead of failed, if the post-processing for both states is to replicate all the data on the volume?

Contributor Author

This is specified in the beginning of the doc:

In many cases the volume may still be mostly or partially readable. Containers on this volume that were still readable would be removed by the system and have their redundancy reduced unnecessarily. This is not a safe operation.

It is the same difference between decommissioning a node and just shutting it down.

Contributor

Are we going to explicitly exclude reading data from these degraded volumes?

Contributor Author

No, the point of the degraded state is to avoid availability or durability issues with healthy data. I've updated the doc to specify that reading from a degraded volume is supported.

- Make additional copies of the data on that volume, ideally using other nodes as sources if possible
- Notify the admin when all data from the volume has been fully replicated so it is safe to remove

This summary is very similar to the process of datanode decommissioning, just at the disk level instead of the node level. Having decommissioning for each failure domain (full node or disk) in the system is generally useful, so we can implement this as a volume decommissioning feature that is completely separate from disk health. Designing this feature comes with its own challenges and is not in the scope of this document.
Contributor

Maybe it's more like a DeadNodeHandler, say a FailedVolumeHandler, which removes all replicas on this volume from SCM's in-memory container map, and then lets the replication manager do the rest of the replication work.

Contributor Author

Right now SCM just learns of volume failures through an FCR sent when the volume fails which shows the containers missing. If we do end up creating a volume to container map in SCM we can re-evaluate whether we want to continue using this approach on volume failures. Volume decommissioning would need some kind of handler, probably similar to the DatanodeAdminMonitorImpl. We aren't really designing this decom/replication feature yet though.

- Keep the volume healthy
- Containers on this volume will not have extra copies made until the container scanner finds corruption and marks them unhealthy, after which we have already lost redundancy.

For the common case of soft volume failures, neither of these are good options. This document outlines a proposal to classify and handle soft volume failures in datanodes.
Contributor

Can you explain a little bit more about what kind of failure is categorized as a hard failure, and what kind of failure will be treated as a soft failure? Some examples would help with understanding the goal of this proposal.

Contributor Author

Specifications are provided later in the doc. Do you still have questions after finishing the document?

Contributor

@ChenSammi ChenSammi May 20, 2025

I found these two. Am I missing anything else?

Failure threshold of the **degraded volume sliding window** is crossed.
Failure threshold of the **failed volume sliding window** is crossed.

What will be the recommended (default) threshold values for the degraded and failed states, and what will be the default sliding window duration mentioned? An explanation of why we chose these default values would also be helpful.

Contributor Author

Yes, there is a mapping of which checks correspond to which sliding window defined in the Sliding Window section, and when the threshold of the window is crossed, the state is changed. Defining the specific thresholds for the windows is going to take some thought, so for now I've left that detail to one of the tasks in the Task Breakdown section. If we are able to decide on this earlier we can specify the initial recommendation in the doc as well.

Contributor

So here is a situation: I hit a bad sector, and an IO error is reported, which triggers an on-demand scan: the value of X is incremented. Now, in the current behavior, RM replicates the good replicas from other sources immediately. So, full durability is restored by the system.
With the proposed model, I have compromised durability because until my window length of (x-y) is hit, my container has only 2 good copies elsewhere. Instead, a more desirable situation is: if X = 1 and the degraded volume has the last copy of the container, RM replicates from it as the source, and the rest of the behavior is left identical. That increases the overall durability of the system even more than what is available today.

Contributor

When the first on-demand container scan is triggered, we could speed up the degraded/failed state detection of a disk by throttling up the background volume scanner. This would reduce the time required to satisfy the sliding window criteria at the expense of operational reads and increased IO.

The durability of data is a priority. One of the points discussed was that the replication manager changes required for acting upon a degraded volume would align with the changes required for a volume-decommissioning feature. As a result, this proposal suggests taking on the replication manager changes as the next step.

An alternative would be to first have a simplified detection of the degraded state and improve the existing replication manager's actions to consider the new degraded volume state when replicating. Improving the detection of degraded state and decommissioning of volumes could be done at a later stage. What do you think @errose28?

Contributor Author

So here is a situation: I hit a bad sector, and an IO error is reported, which triggers an on-demand scan: the value of X is incremented. Now, in the current behavior, RM replicates the good replicas from other sources immediately. So, full durability is restored by the system.
With the proposed model, I have compromised durability because until my window length of (x-y) is hit, my container has only 2 good copies elsewhere.

This would still happen in the proposed model. There are no proposed changes to replication manager or container states in this document. I think there is some confusion between the on-demand container scanner and on-demand volume scanners here as well. On-demand container scanner will be triggered when a bad sector is read within the container, and if that fails it will mark the container unhealthy triggering the normal replication process. There is no sliding window for the on-demand container scanner.

What is proposed in this doc is that if the on-demand container scanner marks a container unhealthy, it should also trigger an on-demand volume scan. For each on-demand volume scan requested, it would add a counter towards the degraded state sliding window of that volume.

Instead, a more desirable situation is if X = 1, degraded volume has the last copy of the container, RM replicated from this as the source, rest of the behavior is left identical.

If there is only one copy of a container then it is already under-replicated and RM will copy from this volume as long as it is not failed. This doc does not propose any changes here.

@smengcl smengcl requested a review from Copilot May 20, 2025 18:06

@Copilot Copilot AI left a comment


Pull Request Overview

This PR proposes a new design for handling storage volume health in Ozone datanodes by adding a degraded state for volumes experiencing soft failures, along with improvements to volume-scan metrics and CLI support.

  • Introduces a degraded volume state with associated checks and sliding window mechanisms.
  • Proposes CLI improvements and enhanced metrics for volume health status reporting.
  • Outlines changes in documentation and task breakdown for volume scanner and CLI features.
Comments suppressed due to low confidence (1)

hadoop-hdds/docs/content/design/degraded-storage-volumes.md:90

  • The term 'reuqirements' earlier in the document appears to be a typo; consider using 'requirements'.
This document does not propose adding persistence for volume health states, so all volumes will return to healthy on restart until checks move them to a different state.

@errose28 errose28 added the scanners Changes related to datanode container and volume scanners label May 22, 2025
- **File Check**:
- If the sync portion of the check fails, add an entry to the **failed health sliding window**
- If any other part of this check fails, add an entry to the **degraded health sliding window**
- **IO Error Count**: When an on-demand volume scan is requested, add an entry to the **degraded health sliding window**
Contributor

@swagle swagle May 22, 2025

This is how the container scanner informs the volume scanner of a problem, correct?
Why do we need a sliding window? If a total of X I/O errors is reported, decide to fail it. A sliding window, to me, makes a decision to prioritize errors based on time, but it is complicated to implement; a plain threshold is a simpler measure. Even in this case, how do you decide what X is? What heuristic guides this decision?
Also, if the value of X cannot be decided very easily, ignoring the X - length(window) events seems even more arbitrary.

Contributor Author

This is how the container scanner informs the volume scanner of a problem, correct?

We currently have code that triggers on-demand volume scans in the catch blocks of most IO operations. It is currently missing from when the container scanner marks a container unhealthy, but we should add it since that is also an IO error.

Why do we need a sliding window? If the total of X I/O errors was reported, decide to fail it. Sliding window to me makes a decision to prioritize errors based on time, but then it is complicated to implement, instead, a threshold is a simple measure.

There is still a time based component in this suggestion: datanode uptime. A very long running datanode will eventually hit X even on a healthy volume. Creating a fixed time in the sliding window normalizes for this.

Even in this case, how do you decide what is X? What heuristic guides this decision?

This is a tricky problem, and I'm not sure I have a good heuristic right now. But we should note it is not unique to this proposal. Even the current volume scanner uses a counter based sliding window where 2/3 of the last checks must have passed to fail a volume. The only other option is to fail a volume on a single IO error which would be too aggressive. Even a healthy disk is going to have some IO bumps occasionally.
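
To make the time-based idea concrete, here is a rough sketch of what such a window could look like; the class and method names are assumptions for illustration, not the actual implementation:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch: a check failure is recorded with a timestamp, and the window
// only "trips" if at least `threshold` failures landed within the last
// `windowLength`. Old entries expire, so datanode uptime alone cannot
// accumulate enough stale failures to fail a healthy volume.
public class TimeBasedSlidingWindow {
  private final Duration windowLength;
  private final int threshold;
  private final Deque<Instant> failures = new ArrayDeque<>();

  public TimeBasedSlidingWindow(Duration windowLength, int threshold) {
    this.windowLength = windowLength;
    this.threshold = threshold;
  }

  /** Record one failure and report whether the threshold is now crossed. */
  public synchronized boolean recordFailureAndCheck(Instant now) {
    failures.addLast(now);
    // Drop entries that fell outside the window.
    Instant cutoff = now.minus(windowLength);
    while (!failures.isEmpty() && failures.peekFirst().isBefore(cutoff)) {
      failures.removeFirst();
    }
    return failures.size() >= threshold;
  }
}
```

Each volume would hold two of these, one for the degraded window and one for the failed window, with separately configurable lengths and thresholds.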

- **File Check**:
- If the sync portion of the check fails, add an entry to the **failed health sliding window**
- If any other part of this check fails, add an entry to the **degraded health sliding window**
- **IO Error Count**: When an on-demand volume scan is requested, add an entry to the **degraded health sliding window**
Contributor

@errose28 In the current design, the definition and tracking of I/O errors are critical for evaluating disks in borderline states. These disks often exhibit intermittent failures—functioning normally during certain time windows and frequently failing during others. Should we consider further optimizing the configuration of the sliding window mechanism to avoid repeatedly triggering the degraded state due to error fluctuations that have not yet escalated, thereby preventing unnecessary data replication or alerts?

Contributor

In addition, I have another concern: I believe that certain administrative operations themselves can contribute to performance degradation in Datanodes. Tasks such as disk scanning and data recovery introduce additional I/O overhead, especially when the disk is already under stress.

What is the current time interval configured for the sliding window? If the interval is too short, it may lead to frequent state changes due to temporary fluctuations. If it's too long, it might delay fault detection and cause us to miss the optimal window for intervention.

Would it be possible to introduce a pre-warning mechanism that can proactively detect potential disk degradation based on performance trends, before actual failure thresholds are reached? For example, if a disk's read/write latency or throughput is significantly worse than other disks on the same node, could the system flag it as "performance abnormal" or "under observation" and trigger an alert? This would allow administrators to review and decide whether to manually degrade or replace the disk.

Such proactive handling may be more effective than waiting for hard errors to trigger degradation, especially in environments where soft failures and high node load are common.

Contributor Author

These are very good practical points, thanks for adding them. Let me try to address each one.

These disks often exhibit intermittent failures—functioning normally during certain time windows and frequently failing during others. Should we consider further optimizing the configuration of the sliding window mechanism to avoid repeatedly triggering the degraded state due to error fluctuations that have not yet escalated, thereby preventing unnecessary data replication or alerts?

The sliding window of the degraded state is intended to deal with this exact situation: intermittent errors that are not enough to escalate to full volume failure. Degraded is the only state that can move back to healthy, so there would be no fluctuation of volumes from failed to healthy triggering re-replication, only possible fluctuation from degraded to healthy. In this case it just provides more monitoring options. The current system provides no optics into intermittent volume errors, so it is as if all of these types of alerts are ignored. If the concern is with spurious alerts, then alerting can be ignored for degraded volume metrics, which puts it on par with the current system. The sliding windows can also be tuned to adjust how sensitive the disk is to health state changes.
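
To make the allowed transitions concrete, here is a minimal sketch; the enum and method are illustrative only, not the actual implementation:

```java
// Rough sketch of the volume health transitions discussed here: degraded is
// the only state that can move back to healthy, and failed is terminal for
// the life of the process (state is not persisted, so a restart begins at
// healthy again).
public enum VolumeHealth {
  HEALTHY, DEGRADED, FAILED;

  public boolean canTransitionTo(VolumeHealth next) {
    switch (this) {
      case HEALTHY:
        // Degrade via the degraded window, or fail outright (for example a
        // failed directory check or crossing the failed window).
        return next == DEGRADED || next == FAILED;
      case DEGRADED:
        // Recover if IO errors stop, or escalate to failed.
        return next == HEALTHY || next == FAILED;
      case FAILED:
      default:
        return false;
    }
  }
}
```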

In addition, I have another concern: I believe that certain administrative operations themselves can contribute to performance degradation in Datanodes. Tasks such as disk scanning and data recovery introduce additional I/O overhead, especially when the disk is already under stress.

This is a good point. Right now such situations may cause the volume to be marked as degraded for alerting purposes, but should not fail the volume. Container import/export and container scanner can have their bandwidth throttled with configs if those operations themselves are burdening the node to the point where it is unhealthy.

What is the current time interval configured for the sliding window? If the interval is too short, it may lead to frequent state changes due to temporary fluctuations. If it's too long, it might delay fault detection and cause us to miss the optimal window for intervention.

Yes I will add a proposal for specific values in the document, although it will be tricky to pick a "best" value. I'm still working on this area and will update the doc soon.

Would it be possible to introduce a pre-warning mechanism that can proactively detect potential disk degradation based on performance trends, before actual failure thresholds are reached? For example, if a disk's read/write latency or throughput is significantly worse than other disks on the same node, could the system flag it as "performance abnormal" or "under observation" and trigger an alert?

This would be a good detection mechanism, but I'm not sure it needs to be handled within Ozone. Ozone can and should report issues it sees while operating, but IO wait can be detected by other systems like smartctl, iostat, and prometheus node exporter. We don't need to re-invent the wheel within Ozone when we have these dedicated tools available.

Contributor

Thank you very much for the detailed response — it gave me a much clearer understanding of the design logic behind this feature. Overall, it seems that your considerations are already quite thorough, and I'm looking forward to seeing it implemented.

I'd also like to add one more thought from my side:

These disks often exhibit intermittent failures—functioning normally during certain time windows and frequently failing during others. Should we consider further optimizing the configuration of the sliding window mechanism to avoid repeatedly triggering the degraded state due to error fluctuations that have not yet escalated, thereby preventing unnecessary data replication or alerts?

The sliding window of the degraded state is intended to deal with this exact situation: intermittent errors that are not enough to escalate to full volume failure. Degraded is the only state that can move back to healthy, so there would be no fluctuation of volumes from failed to healthy triggering re-replication, only possible fluctuation from degraded to healthy. In this case it just provides more monitoring options. The current system provides no optics into intermittent volume errors, so it is as if all of these types of alerts are ignored. If the concern is with spurious alerts, then alerting can be ignored for degraded volume metrics, which puts it on par with the current system. The sliding windows can also be tuned to adjust how sensitive the disk is to health state changes.

Regarding point 1 on optimizing the sliding window mechanism — I fully agree with your explanation, and it's clear that the current design addresses intermittent errors.

However, I do have a follow-up question: how exactly does a volume transition from degraded to failed? Is there a clearly defined threshold or set of criteria for this transition?

In addition, I have another concern: I believe that certain administrative operations themselves can contribute to performance degradation in Datanodes. Tasks such as disk scanning and data recovery introduce additional I/O overhead, especially when the disk is already under stress.

Would it be possible to introduce a pre-warning mechanism that can proactively detect potential disk degradation based on performance trends, before actual failure thresholds are reached? For example, if a disk's read/write latency or throughput is significantly worse than other disks on the same node, could the system flag it as "performance abnormal" or "under observation" and trigger an alert?

This would be a good detection mechanism, but I'm not sure it needs to be handled within Ozone. Ozone can and should report issues it sees while operating, but IO wait can be detected by other systems like smartctl, iostat, and prometheus node exporter. We don't need to re-invent the wheel within Ozone when we have these dedicated tools available.

This is a good point. Right now such situations may cause the volume to be marked as degraded for alerting purposes, but should not fail the volume. Container import/export and container scanner can have their bandwidth throttled with configs if those operations themselves are burdening the node to the point where it is unhealthy.

Regarding point 2, your explanation has largely addressed my concerns. However, I’m wondering if we could take it a step further by supporting dynamic configuration of bandwidth limits for these operations. In real-world scenarios, we’ve observed cases where disk scanning introduced I/O pressure that affected normal read/write performance. Allowing bandwidth limits to be adjusted at runtime based on node load could help better balance stability and performance.

What is the current time interval configured for the sliding window? If the interval is too short, it may lead to frequent state changes due to temporary fluctuations. If it's too long, it might delay fault detection and cause us to miss the optimal window for intervention.

Yes I will add a proposal for specific values in the document, although it will be tricky to pick a "best" value. I'm still working on this area and will update the doc soon.

Regarding point 3, I fully understand that it’s difficult to define a value, as disk usage patterns and environments can vary significantly across deployments. That said, I believe it would be helpful to include a clear explanation in the documentation. Speaking from personal experience, when I come across a critical configuration parameter, I really appreciate seeing a detailed description — for example, how increasing or decreasing the value would affect system behavior. This kind of guidance makes it much easier for users to understand the design rationale and make informed tuning decisions.

Would it be possible to introduce a pre-warning mechanism that can proactively detect potential disk degradation based on performance trends, before actual failure thresholds are reached? For example, if a disk's read/write latency or throughput is significantly worse than other disks on the same node, could the system flag it as "performance abnormal" or "under observation" and trigger an alert?

This would be a good detection mechanism, but I'm not sure it needs to be handled within Ozone. Ozone can and should report issues it sees while operating, but IO wait can be detected by other systems like smartctl, iostat, and prometheus node exporter. We don't need to re-invent the wheel within Ozone when we have these dedicated tools available.

Regarding point 4, I think you raised a great point, and I generally agree with your approach. However, I’d like to offer an additional perspective.

While there are indeed many external tools available for monitoring I/O performance, relying entirely on them can lead to a fragmented view of system health. Monitoring data becomes scattered across multiple sources, and I personally believe that it would be more effective if Ozone could provide some built-in, conclusive metrics to help assess disk health directly — rather than requiring SREs to piece together information from various systems to make a judgment.

I’ve experienced this challenge firsthand. When users report performance issues in Ozone — especially in scenarios where performance is critical — I often find myself digging through different metrics and dashboards to locate the root cause. This process is time-consuming and mentally taxing. If Ozone could consolidate key signals and present them in a unified way, it would significantly improve troubleshooting efficiency and reliability.

Take I/O performance as an example — we can retrieve read/write latency or throughput data simply by reading certain system files. This doesn’t require much effort or any complex tooling. In fact, I’ve already made some progress on this in #7273 , where I exposed some of these metrics directly through Ozone’s built-in metrics system. This kind of integration is much more intuitive, centralized, and operationally helpful.

Contributor

Overall, this upgrade proposal is already quite comprehensive, and I’m happy to give it a +1.

Contributor

@sumitagrawl sumitagrawl left a comment

@errose28 Thanks for the doc, I have a few questions.


- **Directory Check**: No sliding window required. If the volume is not present based on filesystem metadata it should be failed immediately.
- **Database Check**: On failure, add an entry to the **failed health sliding window**
- **File Check**:
Contributor

Need to define what we are adding as part of this:

  1. File Check already has a sliding window mechanism.
  2. IO Error: what kind of IO error? A container read IO error or a container write IO error? Further definition of the kinds of IO errors is required for clarification. Maybe these are not explicitly thrown so that they can be caught easily.
  3. Database check: What happens if the DB is not accessible? During that time, all container reads/writes will fail, so this might be more critical. Do we have a strategy to check more frequently, with a time period specified for such a critical failure? For example, an immediate retry within 1 second, 3 times. So I think this may not fall into a sliding window like the others.

Contributor Author

File Check already have sliding window mechanism.

This is noted in the above section already, explaining it is currently based on counters, not time, and why we should switch to a time based implementation.

IO Error: what kind of IO Error?

The counter would increment for each request for an on-demand volume scan. I can clarify this in the above "IO Error Count" section, but it is defined below as well. Locations for on-demand volume scanning are already present in the current code, although we may want to review where they have been placed.

Database check: What happens if DB is not accessible? during that time, all container read / write will fail. So this might be more critical ... do have strategy to check more frequently and time-period specified for such critical failure ?

I'm still working on the proposed config changes and plan to update the doc later today with that information, but with a shorter minimum volume scan gap (like 1 minute vs the current 15 minutes), the requests for on-demand volume scans should handle this.

For example, say the DB has become completely inaccessible and our default configs are:

  • Sliding window length for disk failure = 1 hour
  • Sliding window failure count = 3
  • Minimum disk scan gap = 1 minute

All IO, including the container scanner which needs to read DB metadata, would flag this and trigger an on-demand volume scan which would execute the DB check. Since all ops are failing, this means a volume with a completely inaccessible DB would still be failed in 3 minutes.
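
As a rough sketch of that arithmetic, under the example values above (the names and numbers here are just this illustration, not defaults):

```java
import java.time.Duration;

// Rough sketch: with every on-demand volume scan failing the DB check and
// scans throttled by the minimum scan gap, roughly `failureThreshold` scans
// (one per gap) are needed to trip the failed-health window, giving the
// ~3 minute estimate above.
public class DetectionTimeEstimate {
  public static void main(String[] args) {
    Duration windowLength = Duration.ofHours(1);  // failed sliding window length
    int failureThreshold = 3;                     // failures needed to trip it
    Duration minScanGap = Duration.ofMinutes(1);  // minimum gap between scans

    Duration detectionTime = minScanGap.multipliedBy(failureThreshold);
    // The estimate only holds if all the failures fit inside the window.
    boolean fitsInWindow = detectionTime.compareTo(windowLength) <= 0;

    System.out.println("Estimated detection time: " + detectionTime
        + " (within window: " + fitsInWindow + ")");
  }
}
```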


We can use one time based sliding window to track errors that would cause a volume to be degraded, and a second one for errors that would cause a volume to be failed. When a check fails, it can add the result to whichever sliding window it corresponds to. We can create the following assignments of checks:

- **Directory Check**: No sliding window required. If the volume is not present based on filesystem metadata it should be failed immediately.
Contributor

Metadata volume failure is similar to DN failure, as it holds the Ratis dir. Not sure about other common metadata. I think we should define which checks are applicable.

Contributor Author

The code already has a tolerance set for how many of each volume type must be available for a datanode to continue running, and for metadata directory it is set to 1. So if the metadata volume is marked as failed the datanode stops running. Do we need any more specific checks for this volume type?

We can use one time based sliding window to track errors that would cause a volume to be degraded, and a second one for errors that would cause a volume to be failed. When a check fails, it can add the result to whichever sliding window it corresponds to. We can create the following assignments of checks:

- **Directory Check**: No sliding window required. If the volume is not present based on filesystem metadata it should be failed immediately.
- **Database Check**: On failure, add an entry to the **failed health sliding window**
Contributor

Similarly, there is another volume type - the DB volume, a separate volume that can hold the container DBs. This is an optional configuration. I have not checked the details, but it may have some impact on defining the rules.

Contributor Author

Yes that feature is not widely used because it creates a single point of failure for all data on the datanode. We could add a database check for this volume type as well for completeness.


Degraded volumes are still reachable, but are reporting numerous IO errors when interacting with them. For identification purposes, we can escalate this with SCM, Recon, and metrics which will allow admins to decide whether to leave the volume or remove it. Identification of degraded volumes will be based on errors reported by ongoing datanode IO. This includes the container scanner to ensure that degraded volumes can still be detected even on idle nodes. The container scanner will continue to determine whether individual containers are healthy or not, and the volume scanner will still run to determine if the volume needs to be moved to **failed**. **Degraded** volumes may move back to **healthy** if IO errors are no longer being reported. Data can still be read from **degraded** volumes, the client's read checksums will verify whether the data is valid or not.

- **Enter**:
Contributor

Can we define the time window of the check if it is different from the existing checker? And do all types of failure follow the same time window, or different ones?

Contributor Author

Time window would be configurable separately for degraded and failed sliding windows. This gives us a way to tweak our sensitivity to disk issues. The defaults could use the same value though. I'm not sure which component you are referring to as the "existing checker". If it's the background volume scanner, the interval that runs at would still be configured separately from the sliding window interval. There is somewhat of a dependency here which is called out at the end of the doc:

  • Sliding window timeouts and scan intervals need to be set accordingly so that background scanners alone have a chance to cross the check threshold even if no foreground load is triggering on-demand scans.

I don't yet have a proposal for specific values, it is something we would need to discuss and can add in this document since there are a lot of questions in this area.


To identify degraded and failed volumes through metrics which can be plugged into an alerting system, we can expose counters on each datanode tracking the number of healthy volumes. The following metrics can be used:

- `TotalVolumes`: The total number of volumes on the datanode regardless of health state.
Contributor

Are these metrics only for data volumes, or are DB volumes and metadata volumes reported separately?

Contributor Author

@errose28 errose28 May 27, 2025

This would be total number of volumes on the whole datanode regardless of volume type. It might be useful to split this by volume type though since it would only add two additional gauges. I still think TotalVolumes across all types is a useful metric as well since we don't want to generate a matrix of volume type x volume health metrics, so TotalVolumes can still be checked against NumHealthyVolumes as identified in the use cases below.


### Improve CLI

- Improve `ozone admin datanode volume list/info` for identifying volumes based on health state.
Contributor

I think we need to pass a datanode ID to list the volumes specific to a datanode. This is similar to ozone admin datanode usageinfo, but summarized at the DN level; maybe this can have a similar API with a volume extension.

Contributor Author

This content is stale, I forgot to update it when I updated the CLI section. Will fix this. The proposal is for volumes to be identified through ozone admin datanode list + ozone admin datanode info (a new command) instead of introducing a volume specific subcommand.
