Details
- Type: Bug
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version: 1.4.1
- Fix Version: None
Description
Problem
This is the method that sends the delete command in MoveManager:
```java
private void sendDeleteCommand(
    final ContainerInfo containerInfo, final DatanodeDetails datanode)
    throws ContainerReplicaNotFoundException, ContainerNotFoundException,
    NotLeaderException {
  int replicaIndex = getContainerReplicaIndex(
      containerInfo.containerID(), datanode);
  long deleteTimeout = moveTimeout - replicationTimeout;
  long now = clock.millis();
  replicationManager.sendDeleteCommand(
      containerInfo, replicaIndex, datanode, true, now + deleteTimeout);
}
```
It calculates deleteTimeout as moveTimeout - replicationTimeout, and then sends the delete command with an SCM expiration timestamp of current time + deleteTimeout. This is wrong: the delete expiration timestamp should actually be "the time at which the move was started + moveTimeout."
This diagram can help with visualisation; the key is that move = replicate + delete.
/A/------------------------------------------------/B/-----------/C/
A = move start time
B = move start time + replication timeout
C = move start time + move timeout
The time duration that replicate command gets is replicationTimeout, and the time duration that the total move gets is moveTimeout.
So, the timestamp at which the replicate command should expire is moveStart + replicationTimeout (which is what the code already does). And the timestamp at which the delete command should expire is moveStart + moveTimeout (this is the correction that needs to be made in the code).
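The two expiration timestamps can be sketched as follows (an illustrative standalone snippet; the method and variable names are hypothetical stand-ins, not the actual MoveManager fields):

```java
public class MoveTimeouts {
  // B in the diagram: replicate command expires at
  // move start + replication timeout (already correct in the code).
  static long replicateExpiry(long moveStartMs, long replicationTimeoutMs) {
    return moveStartMs + replicationTimeoutMs;
  }

  // C in the diagram: delete command should expire at
  // move start + move timeout (the proposed correction).
  static long deleteExpiry(long moveStartMs, long moveTimeoutMs) {
    return moveStartMs + moveTimeoutMs;
  }

  public static void main(String[] args) {
    long minuteMs = 60_000L;
    long moveStartMs = 0L;                       // A: move start time
    long replicationTimeoutMs = 50 * minuteMs;   // 50m
    long moveTimeoutMs = 55 * minuteMs;          // 55m
    // Replicate gets [A, B]; delete gets the remainder, ending at C.
    System.out.println(replicateExpiry(moveStartMs, replicationTimeoutMs));
    System.out.println(deleteExpiry(moveStartMs, moveTimeoutMs));
  }
}
```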
This bug causes the delete expiration timestamp to be in the past on the Datanode, because Replication Manager (through which the command is actually sent) further reduces the Datanode-side expiration timestamp by event.timeout.datanode.offset. So whenever moveTimeout - replicationTimeout < event.timeout.datanode.offset, the expiration time on the DN is in the past.
Example and Repro
For example, consider the following configs:
hdds.container.balancer.move.replication.timeout=50m, hdds.container.balancer.move.timeout=55m,
hdds.scm.replication.event.timeout.datanode.offset=6m.
MoveManager#sendDeleteCommand calls ReplicationManager#sendDeleteCommand with an SCM expiration timestamp of now + moveTimeout - moveReplicationTimeout, which is now + 55m - 50m, i.e. now + 5 minutes.
The Replication Manager method further calls sendDatanodeCommand, which calculates the Datanode expiration timestamp as
datanodeDeadline = scmDeadlineEpochMs - rmConf.getDatanodeTimeoutOffset()
which translates to now + 5 minutes - 6 minutes, which is in the past.
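The arithmetic above can be reproduced in isolation (a minimal sketch; the local variable names are illustrative and do not correspond to actual Ozone fields):

```java
public class DeadlineRepro {
  public static void main(String[] args) {
    long minuteMs = 60_000L;
    long now = 1_700_000_000_000L; // any fixed "current time" in epoch millis

    // Example configs from the description:
    long moveTimeout = 55 * minuteMs;        // move.timeout = 55m
    long replicationTimeout = 50 * minuteMs; // move.replication.timeout = 50m
    long datanodeOffset = 6 * minuteMs;      // event.timeout.datanode.offset = 6m

    // Buggy SCM deadline: now + (moveTimeout - replicationTimeout) = now + 5m
    long scmDeadline = now + (moveTimeout - replicationTimeout);

    // Datanode deadline: scmDeadline - offset = now + 5m - 6m = now - 1m
    long datanodeDeadline = scmDeadline - datanodeOffset;

    System.out.println(datanodeDeadline < now); // true: already expired on arrival
  }
}
```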
We also need to ensure the balancer cannot be configured like this in the first place; that validation can be handled in another Jira - https://issues.apache.org/jira/browse/HDDS-13068.
Solution
For this Jira, a simple fix is to record the time at which the move is scheduled in the MoveManager#pendingMoves map, then use that time to calculate the delete expiration timestamp when sending the delete command.
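A sketch of that approach (the class, field, and method names here are hypothetical; in the real code the scheduling time would live in, or alongside, the pendingMoves map entries):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MoveStartTimes {
  // Hypothetical stand-in for MoveManager#pendingMoves bookkeeping:
  // container id -> clock.millis() at the time the move was scheduled.
  private final Map<Long, Long> moveStartTimes = new ConcurrentHashMap<>();

  public void recordMoveStart(long containerId, long nowMs) {
    moveStartTimes.put(containerId, nowMs);
  }

  // Delete expiry = moveStart + moveTimeout (point C in the diagram),
  // instead of the buggy now + (moveTimeout - replicationTimeout).
  public long deleteExpiry(long containerId, long moveTimeoutMs) {
    return moveStartTimes.get(containerId) + moveTimeoutMs;
  }
}
```

With this, the delete command's remaining lifetime is whatever is left of the move's budget at the moment the replicate phase finishes, rather than a fixed (and possibly too-small) moveTimeout - replicationTimeout window measured from "now".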