HDDS-11566. Replication tasks add transferredBytes and queuedTime metrics #7306
Conversation
@jianghuazhu I have some questions about these metrics. My understanding is that metrics should have practical measurement value. Each machine may receive a large number of EC reconstruction commands and replication commands; how does the number of bytes transferred help us? Additionally, these commands have a timeout, so they won't execute beyond that time. What is the value of the queueing time? cc: @whbing @weimingdiit
@slfan1989 @jianghuazhu I agree with this view. The average waiting time in the queue may be a useful metric, because it can tell us about the backlog of commands in the queue, but I don't know whether the transferredBytes metric is meaningful for observing system performance.
We should have some way to approximate how much network traffic is happening due to replication tasks. To really visualize this it would have to go in a dashboard. I think this would look something like this:
@kerneltime is this the way such a thing is usually implemented? I'm not sure if we have a standard in other Ozone dashboards for how to chart continuous events like network traffic.
Thanks @errose28 for the comment and review.
@kerneltime @slfan1989 @weimingdiit how should we proceed on this PR?
Regarding this, I think there are two ways to achieve it:
@adoroszlai @errose28 @kerneltime, what do you think?
@jianghuazhu Thanks for your contribution! However, I still don't fully understand the specific purpose of this metric. In other words, how do fluctuations in this metric help us assess the state of a DataNode? We have a cluster where 99% of the data uses EC, but based on my maintenance experience, I typically don't pay much attention to traffic changes caused by reconstruction on a DataNode, because compared to read/write traffic, the impact should be negligible. If the practical use of this metric cannot be clearly explained, I think it would be better not to add it to the system, as it would only increase the complexity of the metric framework.
I partially agree with this viewpoint: network traffic should indeed be presented in something like the monitoring graph. However, I believe that such a complex statistical module should not be added inside Ozone itself. Instead, it could be implemented in an external metric collection system.
What changes were proposed in this pull request?
Added transferredBytes and queuedTime metrics for replication tasks, covering both EC reconstruction and container replication.
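As a rough illustration of what these two metrics measure, here is a minimal, self-contained Java sketch. It is not the code in this patch: the class ReplicationTaskMetricsSketch and all of its method names are hypothetical, and it uses plain AtomicLong counters rather than Ozone's actual metrics framework. It assumes queued time is the gap between when a task is enqueued and when a worker picks it up, and that transferred bytes are added when a task finishes.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical stand-in for per-task-type replication metrics.
 * Names and structure are assumptions for illustration only.
 */
final class ReplicationTaskMetricsSketch {
    // Cumulative bytes moved by completed tasks (monotonic counter).
    private final AtomicLong transferredBytes = new AtomicLong();
    // Cumulative time tasks spent waiting in the queue, in milliseconds.
    private final AtomicLong queuedTimeTotalMs = new AtomicLong();
    // Number of tasks that have left the queue.
    private final AtomicLong dequeuedTasks = new AtomicLong();

    /** Record the wait of one task when a worker takes it off the queue. */
    void onTaskDequeued(long enqueuedAtMs, long nowMs) {
        // Clamp at zero in case the two timestamps are out of order.
        queuedTimeTotalMs.addAndGet(Math.max(0L, nowMs - enqueuedAtMs));
        dequeuedTasks.incrementAndGet();
    }

    /** Record the bytes moved by one finished task. */
    void onTaskCompleted(long bytes) {
        transferredBytes.addAndGet(bytes);
    }

    long getTransferredBytes() {
        return transferredBytes.get();
    }

    long getQueuedTimeTotalMs() {
        return queuedTimeTotalMs.get();
    }

    /** Average wait per dequeued task; 0 when nothing has been dequeued. */
    double getAverageQueuedTimeMs() {
        long n = dequeuedTasks.get();
        return n == 0 ? 0.0 : (double) queuedTimeTotalMs.get() / n;
    }

    public static void main(String[] args) {
        ReplicationTaskMetricsSketch metrics = new ReplicationTaskMetricsSketch();
        metrics.onTaskDequeued(1_000L, 1_250L);
        metrics.onTaskCompleted(4_096L);
        System.out.println("transferredBytes=" + metrics.getTransferredBytes());
        System.out.println("avgQueuedTimeMs=" + metrics.getAverageQueuedTimeMs());
    }
}
```

Because both values are cumulative counters in this sketch, a dashboard would chart their rate of change over a time window rather than the raw values.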
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-11566
How was this patch tested?
CI:
https://github.com/jianghuazhu/ozone/actions/runs/11310889447
datanode jmx:

