Skip to content

HDDS-12554. Support callback on completed reconfiguration #8391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

sarvekshayr
Copy link
Contributor

@sarvekshayr sarvekshayr commented May 5, 2025

What changes were proposed in this pull request?

ReconfigurableBase updates properties one by one. But some group of related properties may be reconfigurable only atomically. Example: replication is configured with "type" and "replication" (which is the type-specific description of the replication config). Updating type or replication separately results in invalid configuration (RATIS/THREE -> EC/THREE -> EC/rs-3-2-1024k should be updated in one step).

To handle this situation, reconfiguration framework now allows registering callbacks for "reconfiguration completed" event. Reconfigured properties are accessible via ReconfigurableBase#getReconfigurationTaskStatus.

What is the link to the Apache JIRA

HDDS-12554

How was this patch tested?

Tested on a docker cluster.

bash-5.1$ ozone admin reconfig --service=OM --address=om:9862 start
OM: Started reconfiguration task on node [om:9862].
bash-5.1$ ozone admin reconfig --service=OM --address=om:9862 status
OM: Reconfiguring status for node [om:9862]: started at Fri May 16 09:04:04 UTC 2025 and finished at Fri May 16 09:04:04 UTC 2025.
SUCCESS: Changed property ozone.directory.deleting.service.interval
        From: "2m"
        To: "3m"

Logs

2025-05-16 14:34:04 2025-05-16 09:04:04,371 [Reconfiguration Task] INFO conf.ReconfigurableBase: Starting reconfiguration task.
2025-05-16 14:34:04 2025-05-16 09:04:04,393 [Reconfiguration Task] INFO conf.ReconfigurableBase: Change property: ozone.directory.deleting.service.interval from "2m" to "3m".
2025-05-16 14:34:04 2025-05-16 09:04:04,393 [Reconfiguration Task] INFO conf.ReconfigurableBase: Reconfiguration completed. 1 properties were updated.
2025-05-16 14:34:04 2025-05-16 09:04:04,404 [Reconfiguration Task] INFO om.OzoneManager: Restarting DirectoryDeletingService with new interval: 180s
2025-05-16 14:34:04 2025-05-16 09:04:04,404 [Reconfiguration Task] INFO service.DirectoryDeletingService: Updating and restarting DirectoryDeletingService with interval: 180 SECONDS
2025-05-16 14:34:04 2025-05-16 09:04:04,404 [Reconfiguration Task] INFO utils.BackgroundService: Stopping DirectoryDeletingService
2025-05-16 14:34:04 2025-05-16 09:04:04,404 [Reconfiguration Task] INFO utils.BackgroundService: Starting DirectoryDeletingService with interval 180 SECONDS
2025-05-16 14:34:04 2025-05-16 09:04:04,405 [Reconfiguration Task] INFO om.OzoneManager: Reconfiguration completed. Properties are updated.

For invalid time duration format

bash-5.1$ ozone admin reconfig --service=OM --address=om:9862 start
OM: Started reconfiguration task on node [om:9862].
bash-5.1$ ozone admin reconfig --service=OM --address=om:9862 status
OM: Reconfiguring status for node [om:9862]: started at Mon May 19 06:51:26 UTC 2025 and finished at Mon May 19 06:51:26 UTC 2025.
FAILED: Change property ozone.directory.deleting.service.interval
        From: "1m"
        To: "2w"
        Error: Invalid time duration format: 2w.

@sarvekshayr
Copy link
Contributor Author

Please review this PR @sumitagrawl @adoroszlai. TIA!

@Tejaskriya Tejaskriya self-requested a review May 6, 2025 07:24
Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarvekshayr we need register final callback to be triggered with required information as framework

/**
* Base class to support dynamic reconfiguration of configuration properties at runtime.
*/
public abstract class ReconfigurableBase extends Configured implements Reconfigurable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need have test cases for this, as now ozone own this class.

@adoroszlai adoroszlai marked this pull request as draft May 14, 2025 14:16
Copy link
Contributor

@aryangupta1998 aryangupta1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @sarvekshayr.
Some minor comments inline.

public void start() {
exec.scheduleWithFixedDelay(service, 0, interval, unit);
public synchronized void start() {
scheduledHandle = exec.scheduleWithFixedDelay(service, 0, interval, unit);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add a check to not start the thread if scheduledHandle is not null?
cc @sumitagrawl

@@ -84,6 +84,7 @@ public class DirectoryDeletingService extends AbstractKeyDeletingService {
private final DeletedDirSupplier deletedDirSupplier;

private AtomicInteger taskCount = new AtomicInteger(0);
private String dirDeletingServiceInterval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can remove this from the constructor as well as the getter function!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change requires removing the corresponding test in TestOmReconfiguration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe it's safe to remove the test right now, as it was just checking the value of OZONE_DIR_DELETING_SERVICE_INTERVAL, not tracing the actual restart of the directory deleting service.
I'm not sure, but if possible, it would be better to write a test that checks the calling of "addReconfigurationCompleteCallback()" and "updateAndRestart()" upon changing the directory deletion interval.

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given few more comments and test cases for deeleteDirectoryService is not present

@@ -5107,6 +5140,37 @@ public void validateReplicationConfig(ReplicationConfig replicationConfig)
}
}

private void initOrUpdateDirDeletingService(OzoneConfiguration conf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the owner of dirDeletingService is KeyManagerImpl, so OM should not have callback and handling here.

@@ -5097,6 +5122,14 @@ private String reconfigureAllowListAllVolumes(String newVal) {
return String.valueOf(allowListAllVolumes);
}

private String reconfOzoneDirDeletingServiceInterval(String newVal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this validation be registered by the service using this configuration, ie directorydeleteingService, and below logic also do validation, can that be used instead of using pattern if not null., configuration.getTimeDuration(
OZONE_BLOCK_DELETING_SERVICE_INTERVAL,
TimeUnit.MILLISECONDS);

public synchronized void stop() {
LOG.info("Stopping {}", serviceName);
if (scheduledHandle != null) {
scheduledHandle.cancel(false); // don't interrupt running tasks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need interrupt running task make wait till its completed. this is as when start() new task, should not conflict is already running task. DirectoryDeletingTask can have reverse impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants