HDDS-10374. Make container scanner generate merkle trees during the scan #7490


Draft · wants to merge 63 commits into base: HDDS-10239-container-reconciliation

Conversation

@errose28 errose28 commented Nov 26, 2024

What changes were proposed in this pull request?

Allow the container scanner to update the container Merkle tree with the hashes it computes from the data currently on disk. If the tree does not already exist, it is created. Even if the container is unhealthy, the Merkle tree is still updated to reflect the current state of the container. Hashes stored in RocksDB that were supplied by the client at write time are not modified.

Specific changes:

  • KeyValueContainerCheck has been modified to support building the merkle tree from the data the scanner sees.
  • Reworked the container checksum update methods in KeyValueHandler.
    • In particular, added a public updateContainerChecksum method for scanners to call, so that ICRs are sent for checksum updates even if the container's state does not change.
    • This also allows the KeyValueHandler to manage the in-memory data checksum update in the ContainerData object, so we don't need to do it from within ContainerChecksumTreeManager anymore.
  • A new ContainerScanHelper class was created to consolidate duplicate code among the scanners.
  • ContainerMerkleTree can now be built to contain blocks without any chunks, which is necessary to represent states like an empty block file.
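
The last point above can be sketched as follows. This is an illustrative, self-contained example only; the class and method names are hypothetical and the hashing is simplified, so it does not match the actual Ozone `ContainerMerkleTree` code. The point it shows is that a block with zero chunks is still a valid tree entry, which lets the tree represent an empty block file:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch: a two-level tree of chunk and block checksums. */
class MerkleTreeSketch {
  private final SortedMap<Long, List<Long>> blockChunkChecksums = new TreeMap<>();

  /** Register a block; an empty chunk list represents an empty block file. */
  void addBlock(long blockId, List<Long> chunkChecksums) {
    blockChunkChecksums.put(blockId, new ArrayList<>(chunkChecksums));
  }

  /** Fold chunk checksums into block checksums, then into a container checksum. */
  long containerChecksum() {
    CRC32 container = new CRC32();
    for (Map.Entry<Long, List<Long>> e : blockChunkChecksums.entrySet()) {
      CRC32 block = new CRC32();
      for (long chunk : e.getValue()) {
        block.update(longToBytes(chunk));
      }
      // A block with zero chunks still contributes its (empty) block checksum.
      container.update(longToBytes(block.getValue()));
    }
    return container.getValue();
  }

  private static byte[] longToBytes(long v) {
    byte[] b = new byte[8];
    for (int i = 7; i >= 0; i--) {
      b[i] = (byte) v;
      v >>>= 8;
    }
    return b;
  }
}
```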

What is the link to the Apache JIRA

HDDS-10374

How was this patch tested?

This patch contains unit, integration, and acceptance test additions and modifications. Major ones to note:

  • Unit tests for all container scanners were updated to verify that a failure to update the checksum does not prevent moving a container to the unhealthy state.
  • The new KeyValueHandler#updateContainerChecksum method has a unit test added to check both ICR and disk updates happen correctly.
  • The TestContainerReconciliationWithMockDatanodes suite has been enabled after HDDS-12980 (Add unit test framework for reconciliation, #8402).
  • Container scanner integration tests have been updated to check container checksums after corruption is injected.
  • Existing tests have been migrated to use the on-demand scanner to generate merkle trees instead of manually building them. This removes numerous TODOs from the code.

More robust testing of fault combinations will be added later using the framework being developed under HDDS-11942.

@errose28 errose28 marked this pull request as ready for review January 9, 2025 17:45
@@ -1348,7 +1358,6 @@ public void markContainerUnhealthy(Container container, ScanResult reason)
} finally {
container.writeUnlock();
}
createContainerMerkleTree(container);
Member

If a container moves from OPEN to UNHEALTHY state we should try to build a merkle tree with whatever data we have at the moment before the scanner builds the actual merkle tree. If for some reason (either metadata/data error) we are unable to build it, then we can log the exception and move.

Contributor Author

Yes, when we have an unhealthy container we should build the tree from the actual data, but this method only builds the tree from the metadata. I'm renaming it in the next commit to make that clearer. We should not depend on just the metadata if we suspect corruption. The way this should work is open -> unhealthy would trigger an on-demand scan of the container, but that code path is not in place right now. We may want to add that to our branch actually.
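
The open -> unhealthy flow discussed above can be sketched as follows. This is a minimal illustration only; `UnhealthyTransitionSketch`, `OnDemandScanner`, and their methods are hypothetical names, not the Ozone code. The idea is that marking a container unhealthy requests an on-demand scan, and the scan (which reads the actual data) rebuilds the Merkle tree, rather than trusting possibly-corrupt metadata at transition time:

```java
/** Hypothetical sketch of triggering an on-demand scan on open -> unhealthy. */
class UnhealthyTransitionSketch {
  interface OnDemandScanner {
    void requestScan(long containerId);
  }

  private final long containerId;
  private final OnDemandScanner scanner;
  private String state = "OPEN";

  UnhealthyTransitionSketch(long containerId, OnDemandScanner scanner) {
    this.containerId = containerId;
    this.scanner = scanner;
  }

  void markUnhealthy() {
    state = "UNHEALTHY";
    // Defer tree building to the scanner, which reads the actual on-disk data.
    scanner.requestScan(containerId);
  }

  String getState() {
    return state;
  }
}
```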

Member

Yes, I think we should add the on-demand scan here to generate the Merkle tree.

Member

@aswinshakil aswinshakil left a comment

Overall the change looks good to me. I have left a minor follow-up to my previous comment.

write(data, checksumInfoBuilder.build());
LOG.debug("Data merkle tree for container {} updated", containerID);
// If write succeeds, update the checksum in memory. Otherwise 0 will be used to indicate the metadata failure.
Contributor

Should there be a way to tell whether checksum generation failed vs. the scanner simply not having run yet? Should a failure to generate the checksum be recorded as -1?

Contributor

As discussed, the in-memory hash should match what is written to disk, since it is an in-memory cache. If updating the Merkle tree fails, we can add a metric plus a log message, and/or a status report to SCM, or a new admin API on the datanode to query the Merkle tree (there will likely be multiple debug scenarios where we want to ask a datanode what it knows, independent of SCM reports).
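
The cache-on-success pattern agreed on above can be sketched like this. It is an illustrative example only; `ChecksumCacheSketch` and `DiskWriter` are hypothetical names, not Ozone's `ContainerData`/`ContainerChecksumTreeManager` API. The in-memory checksum mirrors what was durably written, and stays 0 (the "no tree persisted" sentinel) when the write fails:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: update the cached checksum only after a durable write. */
class ChecksumCacheSketch {
  interface DiskWriter {
    void write(long checksum) throws Exception;
  }

  private final AtomicLong dataChecksum = new AtomicLong(0L); // 0 == nothing persisted yet

  /** Returns true, and updates the cache, only if the disk write succeeded. */
  boolean updateChecksum(long newChecksum, DiskWriter writer) {
    try {
      writer.write(newChecksum);      // persist first
      dataChecksum.set(newChecksum);  // cache only after a successful write
      return true;
    } catch (Exception e) {
      // Leave the cached value untouched; callers keep seeing the last
      // successfully persisted checksum (0 if there has never been one).
      return false;
    }
  }

  long getDataChecksum() {
    return dataChecksum.get();
  }
}
```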

…anner-builds-mt

* HDDS-10239-container-reconciliation: (646 commits)
  HDDS-11763. Implement container repair logic within datanodes. (apache#7474)
  HDDS-11887. Recon - Identify container replicas difference based on content checksums (apache#7942)
  HDDS-12397. Persist putBlock for closed container. (apache#7943)
  HDDS-12361. Mark testGetSnapshotDiffReportJob as flaky
  HDDS-12375. Random object created and used only once (apache#7933)
  HDDS-12335. Fix ozone admin namespace summary to give complete output (apache#7908)
  HDDS-12364. Require Override annotation for overridden methods (apache#7923)
  HDDS-11530. Support listMultipartUploads max uploads and markers (apache#7817)
  HDDS-11867. Remove code paths for non-Ratis SCM. (apache#7911)
  HDDS-12363. Add import options to IntelliJ IDEA style settings (apache#7921)
  HDDS-12215. Mark testContainerStateMachineRestartWithDNChangePipeline as flaky
  HDDS-12331. BlockOutputStream.failedServers is not thread-safe (apache#7885)
  HDDS-12188. Move server-only upgrade classes from hdds-common to hdds-server-framework (apache#7903)
  HDDS-12362. Remove temporary checkstyle suppression file (apache#7920)
  HDDS-12343. Fix spotbugs warnings in Recon (apache#7902)
  HDDS-12286. Fix license headers and imports for ozone-tools (apache#7919)
  HDDS-12284. Fix license headers and imports for ozone-s3-secret-store (apache#7917)
  HDDS-12275. Fix license headers and imports for ozone-integration-test (apache#7904)
  HDDS-12164. Rename and deprecate DFSConfigKeysLegacy config keys (apache#7803)
  HDDS-12283. Fix license headers and imports for ozone-recon-codegen (apache#7916)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestKeyValueContainerCheck.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestBackgroundContainerDataScanner.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerDataScanner.java
hadoop-ozone/dist/src/main/smoketest/admincli/container.robot
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerDataScannerIntegration.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestBackgroundContainerMetadataScannerIntegration.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/scanner/TestOnDemandContainerDataScannerIntegration.java
@errose28 errose28 marked this pull request as draft April 11, 2025 17:15
Contributor

@hemantk-12 hemantk-12 left a comment

Thanks @errose28 for the patch.

@@ -465,26 +476,46 @@ private static List<ContainerScanError> verifyChecksum(BlockData block,
StringUtils.bytes2Hex(expected.asReadOnlyByteBuffer()),
StringUtils.bytes2Hex(actual.asReadOnlyByteBuffer()),
block.getBlockID());
chunkHealthy = false;
Contributor

We can break out of the loop here, because the chunk is the smallest unit of verification.
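
The suggested early exit can be sketched as follows. This is an illustrative, self-contained example only; `ChunkVerifySketch` is a hypothetical name and the real `verifyChecksum` compares protobuf byte strings, not `long` arrays. Once one checksum segment of a chunk mismatches, the chunk is already known bad, so the remaining segments are skipped:

```java
import java.util.List;

/** Hypothetical sketch: stop comparing a chunk after the first mismatch. */
class ChunkVerifySketch {
  static boolean verifyChunk(long[] expected, long[] actual, List<String> errors) {
    boolean healthy = true;
    for (int i = 0; i < expected.length; i++) {
      if (expected[i] != actual[i]) {
        errors.add("checksum mismatch at segment " + i);
        healthy = false;
        break; // the chunk is the smallest unit of repair; one mismatch is enough
      }
    }
    return healthy;
  }
}
```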

@@ -32,9 +32,10 @@ public enum FailureType {
MISSING_METADATA_DIR,
MISSING_CONTAINER_FILE,
MISSING_CHUNKS_DIR,
MISSING_CHUNK_FILE,
MISSING_DATA_FILE,
Contributor

qq: Is this for the case when a *.block file is missing?

Contributor Author

Yes. I changed the name so it is agnostic of container layout.

errose28 added 4 commits May 5, 2025 14:10
…anner-builds-mt

* HDDS-10239-container-reconciliation:
  HDDS-12849. On demand container scanner should not be static (apache#8296)
  HDDS-12745. Container checksum should be reported during container close and DN restart (apache#8204)

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerDataScanner.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/impl/TestContainerSet.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/ozoneimpl/TestOnDemandContainerDataScanner.java
…builds-mt

* reconcile-unit-test-framework:
  Checkstyle
  Update test name
  Improve char data generation, reset scan metrics
  Wait for on-demand scans to complete in test
  Use tree writer between peer updates. All tests pass
  Add checksum validation, generate readable data
  Logging improvements
  Use mixin to standardize scanner operations, log checksum changes in scanner
  Fix tree tracking during reconcile process
  Improve logging in test and prod code
  Most tests passing
  WIP migrate reconciliation unit tests

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerDataScanner.java
errose28 added 14 commits May 12, 2025 22:49
Clean up checksum update invocation using container controller.
…anner-builds-mt

* HDDS-10239-container-reconciliation:
  HDDS-12980. Add unit test framework for reconciliation. (apache#8402)

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/ContainerController.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeTestUtils.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/checksum/TestContainerMerkleTreeWriter.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/keyvalue/TestContainerReconciliationWithMockDatanodes.java
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/dn/checksum/TestContainerCommandReconciliation.java
…anner-builds-mt

* HDDS-10239-container-reconciliation: (433 commits)
  Revert "HDDS-11232. Spare InfoBucket RPC call for the FileSystem#getFileStatus calls for the general case. (apache#6988)" (apache#8358)
  HDDS-12993. Use DatanodeID in ReconNodeManager. (apache#8456)
  HDDS-13024. HTTP connect to 0.0.0.0 failed (apache#8439)
  HDDS-12914. Bump awssdk to 2.31.40, test ResumableFileDownload (apache#8455)
  HDDS-4677. Document Ozone Ports and Connection End Points (apache#8226)
  HDDS-13044. Remove DatanodeDetails#getUuid usages from hdds-common/client/container-service (apache#8462)
  HDDS-12568. Implement MiniOzoneCluster.Service for Recon (apache#8452)
  HDDS-13046. Add .vscode to .gitignore (apache#8461)
  HDDS-13022. Split up exclusive size tracking for key and directory cleanup in SnapshotInfo (apache#8433)
  HDDS-12966. DBDefinitionFactory should not throw InvalidArnException (apache#8454)
  HDDS-12948. [Snapshot] Increase `ozone.om.fs.snapshot.max.limit` default value to 10k (apache#8376)
  HDDS-11904. Fix HealthyPipelineSafeModeRule logic. (apache#8386)
  HDDS-12989. Throw CodecException for the Codec byte[] methods (apache#8444)
  HDDS-11298. Support owner field in the S3 listBuckets request (apache#8315)
  HDDS-12810. Check and reserve space atomically in VolumeChoosingPolicy (apache#8360)
  HDDS-13016. Add a getAllNodeCount() method to NodeManager. (apache#8445)
  HDDS-12299. Merge OzoneAclConfig into OmConfig (apache#8383)
  HDDS-12501. Remove OMKeyInfo from SST files in backup directory. (apache#8214)
  HDDS-13021. Make callId unique for each request in AbstractKeyDeletingService (apache#8432)
  HDDS-13000. [Snapshot] listSnapshotDiffJobs throws NPE. (apache#8418)
  ...
Member

@aswinshakil aswinshakil left a comment

One thing that struck my mind is not related to this PR, but we can have it as a separate JIRA: we shouldn't allow reconciliation of a container that is already reconciling. That can lead to other bugs where we update the metadata or the data in parallel.

@errose28
Contributor Author

errose28 commented May 19, 2025

We shouldn't allow reconciliation of a container that is already reconciling. This can lead to some other bugs where we update the metadata or the data in parallel.

Most of this should be handled for us by the ReplicationSupervisor, and is tested generally here. However, this depends on how we define the uniqueness of reconciliation commands, and here I think we do have the problem you mentioned: we can get multiple simultaneous reconciliation tasks for the same container if the peer lists differ. We should fix this in a follow-up so that ReconcileContainerCommand equality is based only on the container ID, and add a test for this that is unique to reconciliation commands.

Update: Filed HDDS-13086 for this.
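
The follow-up described above (filed as HDDS-13086) can be sketched like this. It is an illustrative example only; `ReconcileCommandSketch` is a hypothetical stand-in for `ReconcileContainerCommand`. Equality ignores the peer list, so two reconcile commands for the same container deduplicate even when their peer lists differ:

```java
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch: command identity is the container ID alone. */
class ReconcileCommandSketch {
  private final long containerId;
  private final Set<String> peers; // carried along, but excluded from equality

  ReconcileCommandSketch(long containerId, Set<String> peers) {
    this.containerId = containerId;
    this.peers = peers;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ReconcileCommandSketch)) {
      return false;
    }
    // Only the container ID matters: at most one reconciliation per container.
    return containerId == ((ReconcileCommandSketch) o).containerId;
  }

  @Override
  public int hashCode() {
    return Objects.hash(containerId);
  }
}
```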

@errose28
Contributor Author

Thanks for the review @aswinshakil. All comments should be addressed. I still need to stabilize this test on our branch before trying to get green CI with this change.

@aswinshakil
Member

The changes for the scan gap look good. I see there are some checkstyle-related CI failures.
