Skip to content

HDDS-9151. Close EC Pipeline when container transitions to closing #5191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 23, 2023

Conversation

sodonnel
Copy link
Contributor

What changes were proposed in this pull request?

In testing we have found an issues in the ECWritableContainerProvider.

For EC a pipeline is used for only one container, when the container gets closed, the pipeline also gets closed. At the moment, the only place in the code which closes the EC piplines which no longer have an open container is inside the ECWritableContainerProvider. It first gets the list of open piplines and enforces the pipeline limit, then for all open pipelines, it tries top find one the client can use.

If the client has had problems writing to the pipelines (eg it was given a container/pipeline and then the write failed as the container was closed on the DN), the pipelines get added to the exclude list. Then we can get into a situation where many pipelines need to be closed on the write path, slowing down block allocation.

Ideally, when a container transitions to CLOSING in SCM, if the container is an EC container, we should also close the associated pipeline to avoid it counting toward the limit and to avoid needing to close it during the write (block allocation) path.

This could be achieved relatively simply inside the PipelineManagerImpl.removeContainersFromPipeline() method which is called as soon as the container transitions to CLOSING via ContainerStateManagerImpl.updateContainerState() when it executes the containerStateChangeActions. Wrapping the container close and pipeline close in a lock inside PipelineManagerImpl ensure we have a consistent "ec container close" flow and it should avoid the ECWritableContainerProvider needing to close the pipelines internally. However we can leave that code in place in ECWritableContainerProvider incase some pipelines slip through somehow.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9151

How was this patch tested?

New unit test added to ensure EC pipelines get closed and Ratis do not.

@guohao-rosicky
Copy link
Contributor

Thanks @sodonnel work on this, I think it's a good idea, this can close the pipeline in advance, so that the pipeline in the open state can be used to apply the ec block after getting it in advance.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this javadoc from PipelineStateManagerImpl#removeContainerFromPipeline:

        // Typically, SCM can send a pipeline close Action to datanode and
        // receive pipelineCloseAction to close the pipeline which will remove
        // the pipelineId both from the piplineStateMap as well as
        // pipeline2containerMap Subsequently, close container handler event can
        // also try to close the container as a part of which , it will also
        // try to remove the container from the pipeline2container Map which
        // will fail with PipelineNotFoundException. These are executed over
        // ratis, and if the exception is propagated to SCMStateMachine, it will
        // bring down the SCM. Ignoring it here.

The method that's annotated with @Replicate is updateContainerState and is executed over Ratis for all the SCMs.

The javadoc suggests that we also need to catch PipelineNotFoundException in PipelineManagerImpl#removeContainerFromPipeline when getting a pipeline:

Pipeline pipeline = stateManager.getPipeline(pipelineID);

since this also throws the same exception. Still trying to understand the Ratis part in-depth, but meanwhile what do you think?

@sodonnel
Copy link
Contributor Author

The javadoc suggests that we also need to catch PipelineNotFoundException in PipelineManagerImpl#removeContainerFromPipeline when getting a pipeline:

Pipeline pipeline = stateManager.getPipeline(pipelineID);

I think you are correct. It would be safer to catch the PNF exception from the getPipeline call too. I have updated it.

I didn't even notice the replicate annotation, as it was annotating the interface rather than the implementation, and I did not check the interface!

Assertions.assertEquals(1, pipelineManager
.getContainersInPipeline(ratisPipeline.getId()).size());
Assertions.assertEquals(1, pipelineManager
.getContainersInPipeline(ratisPipeline.getId()).size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ecPipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted. I just pushed a commit to fix this.

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for working on this!

@siddhantsangwan siddhantsangwan merged commit 77dc4db into apache:master Aug 23, 2023
sodonnel pushed a commit to sodonnel/hadoop-ozone that referenced this pull request Sep 11, 2023
adoroszlai pushed a commit that referenced this pull request Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants