HDDS-11463. Track and display failed DataNode storage locations in SCM. #7266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Conversation
@errose28 Could you please help review this PR? Thank you very much! We discussed the relevant implementation together in HDDS-11463.
Thanks for working on this @slfan1989, this looks like a useful addition. I only had time for a quick high level look for now.
 * Handler of ozone admin scm volumesfailure command.
 */
@Command(
    name = "volumesfailure",
For the CLI, we should probably use something like ozone admin datanode volume list. The datanode subcommand is already used to retrieve information about datanodes from SCM. Splitting the commands so that volume has its own subcommand gives us more options in the future.
To distinguish failed and healthy volumes and filter out different nodes, we can either add some kind of filter flag, or leave it up to grep/jq to be applied to the output.
This also means we should make the RPC more generic to support pulling all volume information.
Thank you for helping to review this PR! I will continue to improve the relevant code based on your suggestions.
@@ -382,6 +383,7 @@ public abstract static class Builder<T extends Builder<T>> {
    private boolean failedVolume = false;
    private String datanodeUuid;
    private String clusterID;
    private long failureDate;
Let's use failureTime. I'm assuming this is being stored as millis since epoch, so it will have date and time information.
I have improved the relevant code.
// Ensure it is set only once,
// which is the time when the failure was first detected.
if (failureDate == 0L) {
  setFailureDate(Time.now());
Let's use Instant.now() per HDDS-7911.
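The set-only-once semantics being discussed can be sketched as follows. This is a minimal illustration, not the PR's actual code: FailureTracker and markFailed are hypothetical names, and it assumes the failure time is stored as epoch millis obtained via Instant.now() as suggested above.

```java
import java.time.Instant;

public class FailureTracker {
    // Epoch millis of the first detected failure; 0 means "never failed".
    private long failureTime;

    // Record the failure time only once, when the failure is first
    // detected; later detections of the same volume keep the original
    // timestamp instead of overwriting it.
    public void markFailed() {
        if (failureTime == 0L) {
            failureTime = Instant.now().toEpochMilli();
        }
    }

    public long getFailureTime() {
        return failureTime;
    }
}
```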
@errose28 Can you help review this PR again? Thank you very much!
Thanks @slfan1989 for working on this. Converted it to draft because there is a failing test:
https://github.com/slfan1989/ozone/actions/runs/11471452180
@adoroszlai Thank you for reviewing this PR! I am currently making improvements, and once the changes pass the CI tests in my branch, I will reopen the PR. cc: @errose28
@adoroszlai Thank you for reviewing this PR! I will also pay closer attention to CI issues in future development. I understand that CI testing resources are valuable. I have made improvements to the code based on @errose28's suggestions and also fixed the related unit test errors. The CI for my branch has passed (https://github.com/slfan1989/ozone/actions/runs/11719380711), and I have updated the PR status to "Ready for Review".
@errose28 Could you please help review this PR again? Thank you very much! I've made some additional improvements to this PR, as we wanted to print all the disk information. However, since there's quite a lot of disk data, I've added pagination functionality.
Temporarily converted to draft and assigned to myself, to resolve conflicts.
@adoroszlai Thank you for your attention to this PR. I will continue to follow up on it.
Previously the license header was a javadoc in this new file, so the problem was hidden.
Thanks @slfan1989 for the patch.
private String uuid;

// The HostName identifier of the DataNode.
@Option(names = { "--hostName" },
Please avoid camelCase for options. (Also for --displayMode.)
- @Option(names = { "--hostName" },
+ @Option(names = { "--hostname" },
This still applies to the latest patch.
// PageSize refers to the number of items displayed per page
// in a paginated view.
@Option(names = { "--pageSize" },
    defaultValue = "20",
    description = "The number of volume information items displayed per page.")
private int pageSize;

// The current page.
@Option(names = { "--currentPage" },
    defaultValue = "1",
    description = "The current page.")
private int currentPage;
Other Ozone CLI commands allow pagination (via the ListOptions reusable mixin) with --start to specify the start item, --length for "page size", and --all (only one of these last two is allowed). Please try to use options consistent with that. Using ListOptions directly may not work, since --prefix does not seem to be applicable here.
Other Ozone CLI commands allow pagination (via ListOptions reusable mixin) with --start to specify the start item, --length for "page size", and --all (only one of these last two are allowed). Please try to use options consistent with that.
I have thoroughly reviewed ListOptions, and it is indeed a powerful tool class. However, my requirements differ slightly from its original functionality. I want to implement a pagination display, as there may sometimes be a certain number of damaged disks. For example, if there are 30 damaged disks and the limit is set to 10, the currentPage can range from 1 to 3, allowing for pagination. In this case, currentPage represents the current page number. Therefore, I plan to add the currentPage option to ListOptions to meet this need.
Using ListOption directly may not work, since --prefix does not seem to be applicable here.
The --start and --prefix options have limited relevance for our functionality. Currently, we support filtering by hostName, which I believe should be sufficient. I'm unsure whether we need to filter by disk letter as well. What do you think?
The problem with numeric "current page" is that the list is not fixed. By the time you request the next page, it may be shorter or longer. Then you may get the same item twice (if new item is added in the range of the earlier pages) or not at all (if item is removed from earlier pages).
Using an anchor item eliminates those problems by allowing the size of "previous pages" to change.
Compare GitHub's list of commits, which uses a commit SHA as anchor, to its list of pull requests, which uses a simple numeric page:
https://github.com/apache/ozone/pulls?page=2
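The difference between the two schemes can be sketched as follows. This is an illustrative snippet with hypothetical names (AnchorPagination, page), assuming volumes are listed in a stable sort order: the page boundary is an item rather than a numeric offset, so items added or removed in earlier pages cannot shift the next page.

```java
import java.util.List;

public class AnchorPagination {
    // Return up to 'length' items strictly after the anchor 'start';
    // a null anchor means "from the beginning". The caller passes the
    // last item of the previous page as the anchor for the next page.
    public static List<String> page(List<String> sorted, String start, int length) {
        int from = sorted.size();
        for (int i = 0; i < sorted.size(); i++) {
            if (start == null || sorted.get(i).compareTo(start) > 0) {
                from = i;
                break;
            }
        }
        return sorted.subList(from, Math.min(from + length, sorted.size()));
    }

    public static void main(String[] args) {
        List<String> vols = List.of("data0", "data1", "data2", "data3");
        System.out.println(page(vols, null, 2));      // first page
        System.out.println(page(vols, "data1", 2));   // next page, anchored at data1
    }
}
```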
Created HDDS-12995 to move out --prefix, to allow reusing ListOptions for lists that are filtered by other parameters.
Thank you for your detailed explanation! I now understand the reliability of using anchor points and plan to make the corresponding improvements in the code. Regarding HDDS-12995, I understand that you mean to move the --prefix option from its current position, specifically placing it within the command, rather than having it as a global option. Am I understanding this correctly?
Yes.
// Display it in JSON format.
@Option(names = { "--json" },
    defaultValue = "false",
    description = "Format output as JSON.")
private boolean json;

// Display it in TABLE format.
@Option(names = { "--table" },
    defaultValue = "false",
    description = "Format output as Table.")
private boolean table;
Please make these exclusive, like:
ozone/hadoop-ozone/cli-shell/src/main/java/org/apache/hadoop/ozone/shell/ListOptions.java
Lines 27 to 28 in af1f98c

@CommandLine.ArgGroup(exclusive = true)
private ExclusiveLimit exclusiveLimit = new ExclusiveLimit();

ozone/hadoop-ozone/cli-shell/src/main/java/org/apache/hadoop/ozone/shell/ListOptions.java
Lines 63 to 74 in af1f98c

static class ExclusiveLimit {
  @CommandLine.Option(names = {"--length", "-l"},
      description = "Maximum number of items to list",
      defaultValue = "100",
      showDefaultValue = CommandLine.Help.Visibility.ALWAYS)
  private int limit;

  @CommandLine.Option(names = {"--all", "-a"},
      description = "List all results",
      defaultValue = "false")
  private boolean all;
}
That's a good suggestion! I will improve the code based on this recommendation.
description = "failed is used to display failed disks, " +
    "normal is used to display normal disks.")
description seems to be a dup of --displayMode. (Same for --hostName.)
I will improve the help information for displayMode, hostname, and uuid.
@Option(names = { "--displayMode" },
    defaultValue = "all",
    description = "failed is used to display failed disks, " +
        "normal is used to display normal disks.")
private String displayMode;
Please use an enum to limit the values allowed: https://picocli.info/#_enum_types
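The idea can be sketched like this. The enum and filter helper below are hypothetical names, not the PR's code; once the option field's type is an enum, picocli rejects any value outside the declared constants at parse time, so the filtering code only ever sees valid states.

```java
import java.util.List;
import java.util.stream.Collectors;

public class VolumeStateFilter {
    // Allowed values for the option; an enum-typed picocli field makes
    // the parser reject anything outside this set automatically.
    public enum State { ALL, HEALTHY, FAILED }

    public record Volume(String name, boolean failed) { }

    // Filter volumes according to the requested state: ALL keeps
    // everything, FAILED keeps failed volumes, HEALTHY keeps the rest.
    public static List<Volume> filter(List<Volume> volumes, State state) {
        return volumes.stream()
            .filter(v -> state == State.ALL || (state == State.FAILED) == v.failed)
            .collect(Collectors.toList());
    }
}
```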
System.setOut(new PrintStream(outContent, false, DEFAULT_ENCODING));
System.setErr(new PrintStream(errContent, false, DEFAULT_ENCODING));
Do we need to manually set out/err when using SystemOutCapturer?
This is a good question! After using SystemOutCapturer, we no longer need to manually set out/err. I have already improved this part of the code.
private HddsProtos.DatanodeDetailsProto createDatanodeDetails() {
  Random random = ThreadLocalRandom.current();
  String ipAddress = random.nextInt(256)
      + "." + random.nextInt(256)
      + "." + random.nextInt(256)
      + "." + random.nextInt(256);

  DatanodeDetails.Builder dn = DatanodeDetails.newBuilder()
      .setUuid(UUID.randomUUID())
      .setHostName("localhost" + "-" + ipAddress)
      .setIpAddress(ipAddress)
      .setPersistedOpState(HddsProtos.NodeOperationalState.IN_SERVICE)
      .setPersistedOpStateExpiry(0);

  for (DatanodeDetails.Port.Name name : ALL_PORTS) {
    dn.addPort(DatanodeDetails.newPort(name, 0));
  }
Use MockDatanodeDetails.
required string uuid = 1;
required string hostName = 2;
required string volumeName = 3;
required bool failed = 4;
required int64 failureTime = 5;
required int64 capacity = 6;
@@ -49,7 +51,7 @@ public class DatanodeInfo extends DatanodeDetails {
  private volatile long lastHeartbeatTime;
  private long lastStatsUpdatedTime;
  private int failedVolumeCount;

  private List<VolumeInfoProto> volumeInfos;
volumeInfos is never assigned. I think the intention was to update this variable in updateStorageReports, in the block holding writeLock.
Thank you for pointing out this issue! I hadn't noticed these details before.
name = "volumes",
description = "Display the list of volumes on the DataNode.",
mixinStandardHelpOptions = true,
versionProvider = HddsVersionProvider.class)
public class VolumeSubCommand extends ScmSubcommand {
Currently this subcommand is dangling, not added to any parent. If you add it somewhere, it will be available as ozone ... volumes. Suggestion from @errose28 was to add this as ozone admin datanode volume list, which allows adding other subcommands related to datanode volumes later.
Subcommands are added like this:
Lines 33 to 40 in af1f98c

subcommands = {
    ListInfoSubcommand.class,
    DecommissionSubCommand.class,
    MaintenanceSubCommand.class,
    RecommissionSubCommand.class,
    StatusSubCommand.class,
    UsageInfoSubcommand.class
})
@adoroszlai Thank you very much for reviewing this PR! I will continue to improve it.
@adoroszlai Could you help review this PR? Thank you very much! I have conducted tests on my own branch, and it currently passes the key CI tests. https://github.com/slfan1989/ozone/actions/runs/15209585380
// If startItem is specified, find its position in the volumeInfos list
int startIndex = 0;
if (StringUtils.isNotBlank(startItem)) {
@adoroszlai I added logic to skip startItem in this part of the code, but after thinking it through, I realized it's better to use the server's hostname or UUID as startItem instead of a disk prefix. That's because many machines name their disks like data0 to data9, and using a disk name could lead to unexpected filtering behavior.
@adoroszlai Can we move forward with this PR? I would appreciate your advice.
Thanks @slfan1989 for updating the patch.
private static final Codec<VolumeInfo> CODEC = new DelegatedCodec<>(
    Proto2Codec.get(HddsProtos.VolumeInfoProto.getDefaultInstance()),
    VolumeInfo::fromProtobuf,
    VolumeInfo::getProtobuf,
    VolumeInfo.class);

public static Codec<VolumeInfo> getCodec() {
  return CODEC;
}
Codec is required only for storing in DB, but VolumeInfo does not seem to be persisted by either datanode or SCM. So I think this can be removed.
I have removed the CODEC.
@@ -304,6 +304,15 @@ message RemoveScmResponseProto {
  optional string scmId = 2;
}

message VolumeInfoProto {
  optional string uuid = 1;
Please use DatanodeIDProto.
 */
public final class VolumeInfo implements Comparable<VolumeInfo> {

  private String uuid;
Please use DatanodeID.
@Override
public int compareTo(VolumeInfo that) {
  Preconditions.checkNotNull(that);
nit: prefer the builtin Objects.requireNonNull.
 * @throws IOException
 *     I/O exceptions that may occur during the process of querying the volume.
 */
StorageContainerLocationProtocolProtos.GetVolumeInfosResponseProto getVolumeInfos(
nit: please import GetVolumeInfosResponseProto instead of StorageContainerLocationProtocolProtos.
private String uuid;

// The HostName identifier of the DataNode.
@Option(names = { "--hostName" },
This still applies to the latest patch.
@CommandLine.Mixin
private ListPaginationOptions listOptions;

enum DISPLAYMODE { all, normal, failed }
nit: Enums should be named like other types (classes, interfaces): DisplayMode. Also, please consider using all-caps for values (ALL, etc.)
@Option(names = { "--displayMode" },
    defaultValue = "all",
    description = "Display mode for disks: 'failed' shows failed disks, " +
        "'normal' shows healthy disks, 'all' shows all disks.")
private DISPLAYMODE displayMode;
On another look, I think "display mode" is confusing. JSON and Table are display modes; "all/normal/failed" filters the list by volume state. I suggest renaming the option to --state, and renaming normal to healthy. Then the description can be simplified to "Filter disks by state".
// If displayed in JSON format.
if (json) {
  System.out.print(JsonUtils.toJsonStringWithDefaultPrettyPrinter(volumeInfos));
This is still applicable.
Also, please use println, to avoid situations like HDDS-13100.
@AfterEach
public void tearDown() {
}
nit: unnecessary
Hi @slfan1989, thanks for working on this change. I think there are three attributes being added here which should be reviewed separately:
1. The RPC to retrieve volume information is definitely required going forward, regardless of the other two items, to create some sort of CLI to query volume state.
2. Tracking the failure time of the volume seems like a somewhat invasive change, since it spans the datanode, heartbeat, and SCM. Is this necessary, or is it enough to depend on a metrics database to track the timing of cluster events? Of course we need improvements to our volume metrics as well, as mentioned in #8405.
3. On the CLI front, I do think we need a dedicated
@errose28 Thank you for your message! I'd like to share some thoughts from a different perspective.

As it stands, this feature does not conflict with the proposal in #8405. #8405 represents a more innovative and forward-looking design, and although it's still under discussion, it will certainly be valuable if implemented as planned. At the same time, I believe this feature does not impact HDDS-13096 or HDDS-13097. My comment on #8405 was more about expressing expectations for the system's future capabilities (I hope Ozone can gradually support such features) rather than raising any objections to #8405 itself.

The design of #7266 is inspired by HDFS's disk failure detection mechanism, with the goal of improving the system's ability to identify and locate failed disks. For users migrating from HDFS to Ozone, using the volume command to directly view failed disks can offer a more intuitive and convenient operational experience.

From my perspective, we all play different roles in this project. Your team focuses on evolving and optimizing the system's architecture, while we, as external users, are more focused on refining specific functional details based on real-world use. Ultimately, however, we share the same goal: to make Ozone more robust, more user-friendly, and more widely adopted. Naturally, it's not easy to fully align these detail-oriented changes with larger, ongoing feature developments, for example making #7266 fully consistent with #8405. This is mainly because #8405 is broader in scope, with a longer timeline, whereas #7266 focuses on a very specific aspect. While we fully respect the overall direction, we also hope to move forward with some smaller, incremental improvements to address current practical issues.

In addition to this PR, we're also working on several other enhancements. For instance, we've implemented mechanisms to collect DataNode I/O statistics to more precisely manage container replication. We've also introduced time-based peak/off-peak control logic for various DataNode management operations (such as deletion, container replication, and EC container reconstruction). These improvements are driven by real-world production needs, and from our perspective, they've shown positive results. However, since many of these PRs have some degree of code coupling with our previous contributions, it's difficult for us to combine everything into a single, unified patch for upstream submission.

Therefore, we hope to proceed with #7266 for now. If #8405 later results in a more complete or improved solution, we'd be happy to continue refining things in that direction. In the meantime, this also gives us a valuable opportunity to participate in the community and contribute to Ozone's development.
@adoroszlai Thank you very much for reviewing the code. I will make improvements based on your suggestions. @errose28's comments are essentially not in conflict with #7266, and I'm looking forward to seeing #7266 progress so that we can move forward with the subsequent work.
@adoroszlai Thank you very much for your detailed suggestions! I've made the changes accordingly. Could you review this PR again? Thank you very much! I respect @errose28's perspective. However, I believe this PR does not conflict with #8405, nor with HDDS-13096 or HDDS-13097; they can coexist. We've already spent considerable time reviewing this PR together, and I'd like to continue moving it forward.
@adoroszlai @errose28 Can I still continue to follow up on this PR? I feel that I've put in some effort, but right now I've lost a clear direction on how to proceed. According to @errose28's suggestion, this PR only needs to keep the RPC part, but I'm not sure how to continue working on the related functionality from here.
@slfan1989 Thanks for all your efforts on this PR. The concerns/suggestions raised by @errose28 make sense though. Please try to reach agreement. I won't be able to re-review until next week in any case.
@adoroszlai Thank you very much for your message and for your continued support and assistance! Since @errose28 is currently planning some new features, I believe this PR could be considered as part of that effort, especially given the amount of work we've already invested. As for which specific features should be retained, it would be helpful if @errose28 could review and provide guidance.
What changes were proposed in this pull request?
Currently, we lack tools on the SCM side to track failed disks on DataNodes. DataNodes have already reported this information, and we need to display it.
In this PR, we will display the failed disks on the DataNode. The information can be displayed in JSON format or using the default format.
What is the link to the Apache JIRA
JIRA: HDDS-11463. Track and display failed DataNode storage locations in SCM.
How was this patch tested?
Added JUnit tests and tested in a test environment.