HDDS-9542. Ozone debug chunkinfo command shows incorrect number of entries #5703

Merged (3 commits) on Nov 30, 2023

@aryangupta1998 aryangupta1998 commented Nov 29, 2023

What changes were proposed in this pull request?

If we stop one or more replica datanodes of a key and then run the "ozone debug chunkinfo" command for that key, we get an ExecutionException:

23/10/26 03:30:35 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock. Exception Class: java.util.concurrent.ExecutionException, Exception Message:

This happens because the stopped datanode returns no container response and its pipeline is closed. The exception goes away once the dead node interval elapses, i.e. after the datanode is marked dead and a new pipeline is created. With this PR, the command still reports the ExecutionException but continues printing the getBlock results from the other nodes.
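The intended behaviour can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: `ChunkInfoSketch`, `BlockFetcher`, and `collectBlockInfo` are hypothetical names invented for illustration, and the fetcher merely stands in for a per-datanode GetBlock call. The sketch only shows the pattern of catching the failure for one datanode, reporting it, and moving on to the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;

public class ChunkInfoSketch {

  /** Hypothetical stand-in for a GetBlock call against a single datanode. */
  interface BlockFetcher {
    String getBlock(String datanode) throws ExecutionException;
  }

  /**
   * Queries every datanode in turn. A failure on one datanode is reported
   * on stderr and skipped, so results from the remaining nodes are kept.
   */
  static List<String> collectBlockInfo(List<String> datanodes,
                                       BlockFetcher fetcher) {
    List<String> results = new ArrayList<>();
    for (String dn : datanodes) {
      try {
        results.add(dn + ": " + fetcher.getBlock(dn));
      } catch (ExecutionException e) {
        // Report the failure, then continue with the next datanode.
        System.err.println("Failed to execute command GetBlock on datanode "
            + dn + ": " + e.getMessage());
      }
    }
    return results;
  }

  public static void main(String[] args) {
    // Assumed scenario: three replicas, with dn-2 shut down.
    List<String> datanodes = Arrays.asList("dn-1", "dn-2", "dn-3");
    BlockFetcher fetcher = dn -> {
      if (dn.equals("dn-2")) {
        throw new ExecutionException(
            new RuntimeException("UNAVAILABLE: io exception"));
      }
      return "block-info";
    };
    for (String line : collectBlockInfo(datanodes, fetcher)) {
      System.out.println(line);
    }
  }
}
```

In this sketch the error for the unreachable node goes to stderr while the entries for the reachable nodes are still returned, which mirrors the behaviour the PR describes.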

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9542

How was this patch tested?

Tested on a cluster.
Configs set:

"hdds.scm.replication.thread.interval": "5m",
"ozone.scm.stale.node.interval": "2m",
"ozone.scm.dead.node.interval": "4m"

When all DNs were up:

23/11/29 19:58:51 INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
{
  "KeyLocations": [
    [
      {
        "Datanode-HostName": "ozone-rr-5.ozone-rr.root.hwx.site",
        "Datanode-IP": "172.27.34.203",

When one of the replica DNs went down:

23/11/29 20:00:05 INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started
23/11/29 20:00:07 ERROR scm.XceiverClientGrpc: Failed to execute command GetBlock on datanode ozone-rr-4.ozone-rr.root.hwx.site Exception Class: java.util.concurrent.ExecutionException, Exception Message: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
{
  "KeyLocations": [
    [
      {
        "Datanode-HostName": "ozone-rr-5.ozone-rr.root.hwx.site",
        "Datanode-IP": "172.27.34.203",
        "Container-ID": 3,

We got the exception, but the command still printed the block locations on the remaining DNs.

@sadanand48 sadanand48 left a comment

Thanks @aryangupta1998 , change looks good.

@sadanand48 sadanand48 left a comment

LGTM, pending CI

@nandakumar131
+1, LGTM.

@nandakumar131 nandakumar131 added the tools Tools that helps with debugging label Nov 30, 2023
@sadanand48 sadanand48 merged commit c65da9e into apache:master Nov 30, 2023
@sadanand48
Thanks @aryangupta1998 for the change, @nandakumar131 for the review.
