HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full #8492
nodeReport = context.getParent().getContainer().getNodeReport();
context.refreshFullReport(nodeReport);
context.getParent().triggerHeartbeat();
LOG.info("Triggering heartbeat for full volume {}, with node report: {}.", volume, nodeReport);
This is on the write path, so we must be extra careful about performance. An info log will reduce performance, but I wonder if it's ok in this case because this won't happen often? What do others think?
Moreover, the future plan is to fail the write anyway if the size exceeds the min free and reserved space boundary.
Thanks @siddhantsangwan for this improvement!
...iner-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/HddsDispatcher.java
@@ -130,6 +134,10 @@ public HddsDispatcher(ConfigurationSource config, ContainerSet contSet,
this.tokenVerifier = tokenVerifier != null ? tokenVerifier
    : new NoopTokenVerifier();
this.slowOpThresholdNs = getSlowOpThresholdMs(conf) * 1000000;
fullVolumeLastHeartbeatTriggerMs = new AtomicLong(-1);
long heartbeatInterval =
    config.getTimeDuration("hdds.heartbeat.interval", 30000, TimeUnit.MILLISECONDS);
Can we call HddsServerUtil#getScmHeartbeatInterval instead?
And there is HDDS_NODE_REPORT_INTERVAL for node report. Shall we use node report property instead of heartbeat property?
try {
  handleFullVolume(container.getContainerData().getVolume());
} catch (StorageContainerException e) {
  ContainerUtils.logAndReturnError(LOG, e, msg);
Are we going to return here?
*/
private void handleFullVolume(HddsVolume volume) throws StorageContainerException {
  long current = System.currentTimeMillis();
  long last = fullVolumeLastHeartbeatTriggerMs.get();
Consider the case where different volumes get full. For example, /data1 gets full at time P0 and /data2 gets full at time P1, with (P1 - P0) < interval. Do we expect two emergency container reports, or one?
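To make the two possible answers concrete, here is a minimal sketch contrasting a single shared timestamp with a per-volume timestamp. It is not the code in this PR: the class `VolumeThrottleDemo`, its methods, and the 30-second interval are hypothetical, and the shared-timestamp variant is deliberately not thread-safe to keep the comparison short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; names and interval are assumptions.
final class VolumeThrottleDemo {
  static final long INTERVAL_MS = 30_000;

  // Option A: one timestamp shared by every volume (-1 means "never triggered").
  private long globalLastMs = -1;

  // Option B: one timestamp per volume path.
  private final Map<String, Long> perVolumeLastMs = new ConcurrentHashMap<>();

  /** Shared throttle: any volume filling up within the interval is suppressed. */
  boolean globalAllows(long nowMs) {
    if (globalLastMs != -1 && nowMs - globalLastMs < INTERVAL_MS) {
      return false;
    }
    globalLastMs = nowMs;
    return true;
  }

  /** Per-volume throttle: each volume gets its own interval window. */
  boolean perVolumeAllows(String volume, long nowMs) {
    Long last = perVolumeLastMs.get(volume);
    if (last != null && nowMs - last < INTERVAL_MS) {
      return false;
    }
    perVolumeLastMs.put(volume, nowMs);
    return true;
  }
}
```

With /data1 filling at P0 = 0 ms and /data2 at P1 = 10,000 ms, the shared timestamp yields one report and the per-volume map yields two.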
What changes were proposed in this pull request?
This pull request is for implementing a part of the design proposed in HDDS-12929. This only contains the implementation for detecting a full volume, getting the latest storage report, adding the container action, then immediately triggering (or throttling) a heartbeat.
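The "immediately triggering (or throttling)" step can be sketched roughly as below. This is an illustration under stated assumptions, not the implementation in this PR: `HeartbeatThrottle` and `tryAcquire` are hypothetical names, and only the use of an `AtomicLong` initialised to -1 mirrors the diff.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; not a class from the Ozone code base.
final class HeartbeatThrottle {
  private final long intervalMs;
  // -1 means no heartbeat has been triggered for a full volume yet.
  private final AtomicLong lastTriggerMs = new AtomicLong(-1);

  HeartbeatThrottle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Returns true if the caller should trigger a heartbeat at time nowMs. */
  boolean tryAcquire(long nowMs) {
    long last = lastTriggerMs.get();
    boolean due = last == -1 || nowMs - last >= intervalMs;
    // compareAndSet lets only one of several racing write threads win the slot.
    return due && lastTriggerMs.compareAndSet(last, nowMs);
  }
}
```

A caller on the write path would check `tryAcquire(System.currentTimeMillis())` and trigger the heartbeat only when it returns true, so concurrent writers to a full volume produce at most one trigger per interval.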
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13045
How was this patch tested?
Modified existing unit tests. Also did some manual testing using the ozone docker compose cluster.
a. Simulated a close to full volume with a capacity of 2 GB, available space of 150 MB and min free space of 100 MB. Datanode log:
b. Wrote 100 MB of data using freon, with the expectation that an immediate heartbeat will be triggered as soon as the available space drops to 100 MB. Datanode log shows that this happened at 09:50:52:
c. In the SCM, the last storage report BEFORE the write operation was received at 09:50:09:
So, the next storage report should be received a minute later at 09:51:09, unless it's triggered immediately due to volume full. The SCM log shows that the immediately triggered report was received at 09:50:52, corresponding to the DN log:
The next storage report is received at the expected time of 09:51:09, showing that throttling also worked.
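The timing argument above can be checked with a few lines of `java.time` arithmetic. The timestamps and the one-minute interval come from the test description; the class `ReportTimeline` and its method are hypothetical names used only for this sketch.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for reasoning about the report timeline.
final class ReportTimeline {
  /** Time at which the next regularly scheduled report is expected. */
  static LocalTime nextScheduled(LocalTime lastReport, Duration interval) {
    return lastReport.plus(interval);
  }

  public static void main(String[] args) {
    LocalTime last = LocalTime.of(9, 50, 9);       // last report before the write
    LocalTime immediate = LocalTime.of(9, 50, 52); // report triggered by the full volume
    LocalTime next = nextScheduled(last, Duration.ofMinutes(1));

    // The triggered report arrives before the scheduled one.
    System.out.println("next scheduled: " + next);
    System.out.println("arrived early: " + immediate.isBefore(next));
    System.out.println("seconds after last report: "
        + Duration.between(last, immediate).getSeconds());
  }
}
```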
Green CI in my fork: https://github.com/siddhantsangwan/ozone/actions/runs/15135787944/job/42547140475