
HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full #8492


Open · wants to merge 2 commits into master
Conversation

@siddhantsangwan (Contributor) commented May 20, 2025

What changes were proposed in this pull request?

This pull request implements part of the design proposed in HDDS-12929. It covers only detecting a full volume, getting the latest storage report, adding the container action, and then immediately triggering (or throttling) a heartbeat.
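For context, here is a minimal sketch of the trigger-and-throttle flow, stitched together from the diff hunks quoted later on this page; the compareAndSet-based update and the heartbeatIntervalMs name are assumptions for illustration, not the exact implementation:

// Sketch only: a full volume triggers at most one out-of-band heartbeat
// per interval. Assumes the fields quoted in the diff hunks below.
private final AtomicLong fullVolumeLastHeartbeatTriggerMs = new AtomicLong(-1);

private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
  long current = System.currentTimeMillis();
  long last = fullVolumeLastHeartbeatTriggerMs.get();
  // Throttle: skip if a full-volume heartbeat was already sent within the
  // interval; compareAndSet lets concurrent writers race safely.
  if (current - last >= heartbeatIntervalMs
      && fullVolumeLastHeartbeatTriggerMs.compareAndSet(last, current)) {
    NodeReportProto nodeReport = context.getParent().getContainer().getNodeReport();
    context.refreshFullReport(nodeReport);
    context.getParent().triggerHeartbeat();
    LOG.info("Triggering heartbeat for full volume {}, with node report: {}.",
        volume, nodeReport);
  }
}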

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13045

How was this patch tested?

Modified existing unit tests. Also did some manual testing using the Ozone docker-compose cluster.

a. Simulated a close-to-full volume with a capacity of 2 GB, available space of 150 MB, and min free space of 100 MB. Datanode log:

2025-05-20 09:47:05,899 [main] INFO volume.HddsVolume: HddsVolume: { id=DS-64dd669c-71fe-492f-903c-4fc7dbe4440a dir=/data/hdds/hdds type=DISK capacity=2147268899 used=1990197248 available=157071651 minFree=104857600 committed=0 }

b. Wrote 100 MB of data using freon, with the expectation that an immediate heartbeat would be triggered as soon as the available space dropped to 100 MB. The datanode log shows that this happened at 09:50:52:

2025-05-20 09:50:52,028 [f8714dd7-31fc-4c63-9703-6fdb1a59b5c4-ChunkWriter-7-0] INFO impl.HddsDispatcher: Triggering heartbeat for full volume /data/hdds/hdds, with node report storageReport {
   storageUuid: "DS-bd34474b-8fd4-49be-be78-72e708b543c0"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 2042626048
   remaining: 104642851
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

c. In the SCM, the last storage report BEFORE the write operation was received at 09:50:09:

2025-05-20 09:50:09,399 [IPC Server handler 12 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Node Report storageReport {
storageUuid: "DS-27210be2-ee53-4035-a3a3-63ec8a162456"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

So the next storage report should be received a minute later, at 09:51:09, unless one is triggered immediately because a volume is full. The SCM log shows that the immediately triggered report was received at 09:50:52, matching the DN log:

2025-05-20 09:50:52,033 [IPC Server handler 4 on default port 9861] INFO server.SCMDatanodeHeartbeatDispatcher: Dispatching Node Report storageReport {
   storageUuid: "DS-bd34474b-8fd4-49be-be78-72e708b543c0"
   storageLocation: "/data/hdds/hdds"
   capacity: 2147268899
   scmUsed: 2042626048
   remaining: 104642851
   storageType: DISK
   failed: false
   committed: 0
   freeSpaceToSpare: 104857600
 }
 metadataStorageReport {
   storageLocation: "/data/metadata/ratis"
   storageType: DISK
   capacity: 2147268899
   scmUsed: 1990197248
   remaining: 157071651
   failed: false
 }

The next storage report was then received at the expected time of 09:51:09, showing that throttling also worked.

Green CI in my fork: https://github.com/siddhantsangwan/ozone/actions/runs/15135787944/job/42547140475

@siddhantsangwan siddhantsangwan marked this pull request as ready for review May 21, 2025 05:02
nodeReport = context.getParent().getContainer().getNodeReport();
context.refreshFullReport(nodeReport);
context.getParent().triggerHeartbeat();
LOG.info("Triggering heartbeat for full volume {}, with node report: {}.", volume, nodeReport);
Contributor Author @siddhantsangwan:

This is on the write path, so we must be extra careful about performance. An info log will reduce performance, but I wonder if it's ok in this case because this won't happen often? What do others think?
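One hypothetical mitigation (a sketch, not something in this PR): since the log sits behind the throttle it fires at most once per interval anyway, but the protobuf node report could still be demoted to debug, keeping only a cheap info line on the hot path:

// Sketch: short info line on the write path, full report only at debug.
LOG.info("Triggering heartbeat for full volume {}.", volume);
if (LOG.isDebugEnabled()) {
  LOG.debug("Node report for full volume {}: {}", volume, nodeReport);
}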

Contributor Author @siddhantsangwan:

Moreover, the future plan is to fail the write anyway if the size exceeds the min free and reserved space boundary.

@peterxcli (Member) left a comment:

Thanks @siddhantsangwan for this improvement!

@@ -130,6 +134,10 @@ public HddsDispatcher(ConfigurationSource config, ContainerSet contSet,
    this.tokenVerifier = tokenVerifier != null ? tokenVerifier
        : new NoopTokenVerifier();
    this.slowOpThresholdNs = getSlowOpThresholdMs(conf) * 1000000;
    fullVolumeLastHeartbeatTriggerMs = new AtomicLong(-1);
    long heartbeatInterval =
        config.getTimeDuration("hdds.heartbeat.interval", 30000, TimeUnit.MILLISECONDS);
@ChenSammi (Contributor) commented May 27, 2025:

Can we call HddsServerUtil#getScmHeartbeatInterval instead?

And there is HDDS_NODE_REPORT_INTERVAL for the node report. Shall we use the node report property instead of the heartbeat property?
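A sketch of what the suggestion could look like, assuming HddsServerUtil#getScmHeartbeatInterval returns milliseconds and that HddsConfigKeys defines the node report interval key and default (check the actual signatures):

// Reuse the existing helper instead of a hard-coded key and default:
long heartbeatInterval = HddsServerUtil.getScmHeartbeatInterval(config);

// Or, since the trigger sends a node report, key the throttle off the
// node report interval instead:
long nodeReportInterval = config.getTimeDuration(
    HddsConfigKeys.HDDS_NODE_REPORT_INTERVAL,
    HddsConfigKeys.HDDS_NODE_REPORT_INTERVAL_DEFAULT,
    TimeUnit.MILLISECONDS);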

try {
  handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
  ContainerUtils.logAndReturnError(LOG, e, msg);
Contributor:

Are we going to return here?

*/
private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
  long current = System.currentTimeMillis();
  long last = fullVolumeLastHeartbeatTriggerMs.get();
Contributor:

Consider the case where different volumes get full: for example, at P0 /data1 gets full, and at P1 /data2 gets full, with (P1 - P0) < interval. Do we expect two emergent container reports, or one report?
