HDDS-11566. Replication tasks add transferredBytes and queuedTime metrics #7306

Draft: wants to merge 1 commit into base: master

Conversation

jianghuazhu (Contributor)

What changes were proposed in this pull request?

Added transferredBytes and queuedTime metrics for replication tasks, covering both EC reconstruction and container replication.
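For reference, a minimal sketch of how such metrics could be exposed through Hadoop's metrics2 framework, which Ozone datanodes already use for their JMX metrics. The class and method names here are illustrative assumptions, not necessarily what the patch implements:

```java
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Hypothetical metrics source; names are illustrative only.
@Metrics(about = "Replication task metrics", context = "dfs")
public final class ReplicationTaskMetricsSketch {

  @Metric("Total bytes transferred by replication tasks")
  private MutableCounterLong transferredBytes;

  @Metric("Time replication tasks spend queued before execution")
  private MutableRate queuedTime;

  public static ReplicationTaskMetricsSketch create() {
    return DefaultMetricsSystem.instance().register(
        "ReplicationTaskMetricsSketch", "Replication task metrics",
        new ReplicationTaskMetricsSketch());
  }

  // Called after a task finishes copying data.
  public void incrTransferredBytes(long bytes) {
    transferredBytes.incr(bytes);
  }

  // Called when a task is dequeued; elapsed = dequeueTime - enqueueTime.
  public void addQueuedTime(long elapsedMillis) {
    queuedTime.add(elapsedMillis);
  }
}
```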

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11566

How was this patch tested?

CI:
https://github.com/jianghuazhu/ozone/actions/runs/11310889447

Datanode JMX: (screenshots of the new metrics in the JMX output omitted)

@jianghuazhu (Contributor Author)

@sodonnel @errose28 @fapifta, can you help review this PR?
Thanks.

@slfan1989 (Contributor)

slfan1989 commented Oct 13, 2024

@jianghuazhu I have some questions about these metrics. My understanding is that metrics should have practical measurement value. Each machine may receive a large number of EC recover commands and replication commands; how can the bytes transferred help us? Additionally, these commands themselves have a timeout, so they won't execute beyond that time. What is the value of the queueing time?

cc: @whbing @weimingdiit

@weimingdiit (Contributor)

weimingdiit commented Oct 14, 2024

> My understanding is that metrics should have practical measurement value

@slfan1989 @jianghuazhu I agree with this view. The average waiting time in the queue may be a useful metric, because it tells us the backlog of commands in the queue, but I am not sure whether the transferredBytes metric is meaningful for observing system performance.

@errose28 (Contributor)

We should have some way to approximate how much network traffic is happening due to replication tasks. To really visualize this, it would have to go in a dashboard. I think it would look something like this:

  • Ozone tracks the total number of bytes transferred since the last restart.
  • In the dashboard (Grafana for example) we approximate network traffic over time by subtracting the previously sampled metric value from the current sample.
    • Note that the metric's implementation in Ozone would only ever increment the counter, so the values would never decrease unless the cluster is restarted.
    • The chart would then look like a step function where each plateau shows the amount of data transferred between its start and end times (a sketch of this delta logic appears after this comment).

@kerneltime is this the way such a thing is usually implemented? I'm not sure if we have a standard in other Ozone dashboards for how to chart continuous events like network traffic.
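For illustration, a minimal Java sketch of that delta computation (a hypothetical helper, not part of the patch; in practice the dashboard's own rate/delta functions would apply this to the sampled values):

```java
// Sketch of the sampling idea above: the counter only grows, and
// traffic per interval is derived by subtracting consecutive samples.
// All names here are illustrative.
public final class CounterDeltaSampler {

  private long lastValue = -1;  // previous counter sample, -1 = none yet
  private long lastTimeMs = -1;

  // Returns bytes/second since the previous sample, or -1 on the first
  // sample or when the counter went backwards (e.g. datanode restart).
  public double sample(long counterValue, long nowMs) {
    double rate = -1;
    if (lastValue >= 0 && counterValue >= lastValue && nowMs > lastTimeMs) {
      rate = (counterValue - lastValue) * 1000.0 / (nowMs - lastTimeMs);
    }
    lastValue = counterValue;
    lastTimeMs = nowMs;
    return rate;
  }
}
```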

@jianghuazhu (Contributor Author)

jianghuazhu commented Oct 17, 2024

Thanks @errose28 for the comment and review.
@kerneltime , do you have any new suggestions?
Thanks.

@adoroszlai (Contributor)

@kerneltime @slfan1989 @weimingdiit how should we proceed on this PR?

@jianghuazhu (Contributor Author)

jianghuazhu commented Dec 13, 2024

> In the dashboard (Grafana for example) we approximate network traffic over time by subtracting the previously sampled metric value from the current sample.

Regarding this, I think there are two ways to achieve it:

  1. Now that we have the total number of bytes transferred, we can compute the difference outside the Ozone system whenever we want to check the transfer trend. For example, if the total at 09:10:20 is 300 MB and the total at 09:10:30 is 400 MB, then the traffic during 09:10:20~09:10:30 was 100 MB.

  2. A common module could be designed inside Ozone to compute traffic trends. It would aggregate over time periods, for example 5s, 1min, 1h, regularly collecting the difference between two points in time: 09:10:20 ~ 09:10:25 -> 40 MB, 09:10:30 ~ 09:11:30 -> 500 MB, 09:10:30 ~ 10:10:30 -> 2 GB. These differences should be treated as instantaneous values.
    In general, the effect would look like a step chart (mockup image omitted); a sketch of this idea follows below.
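To make option 2 concrete, here is a purely illustrative sketch of such a module, assuming a single counter and a single fixed window (a real module would track several window sizes, e.g. 5s, 1min, 1h):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Samples a monotonic counter at a fixed period and reports the delta
// for each period as an "instantaneous" value. Names are illustrative.
public final class TrafficTrendAggregator {

  private final LongSupplier counter;  // e.g. metrics::getTransferredBytes
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private long lastSample;

  public TrafficTrendAggregator(LongSupplier counter) {
    this.counter = counter;
  }

  // Report the bytes transferred in each window of the given length.
  public void start(long window, TimeUnit unit) {
    lastSample = counter.getAsLong();
    scheduler.scheduleAtFixedRate(() -> {
      long now = counter.getAsLong();
      long delta = Math.max(0, now - lastSample);  // guard against restarts
      lastSample = now;
      System.out.printf("last %d %s: %d bytes%n", window, unit, delta);
    }, window, window, unit);
  }
}
```

For example, `new TrafficTrendAggregator(metrics::getTransferredBytes).start(5, TimeUnit.SECONDS);` would report a 5-second traffic figure, assuming a hypothetical getter on the metrics class.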

@adoroszlai @errose28 @kerneltime , what do you think?

@slfan1989 (Contributor)

@jianghuazhu Thanks for your contribution! However, I still don't fully understand the specific purpose of this metric. In other words, how do fluctuations of this metric help us assess the state of a DataNode? We have a cluster where 99% of the data uses EC, but based on my maintenance experience, I typically don't pay much attention to traffic changes caused by reconstruction on a DataNode, because compared to read/write traffic the impact should be negligible. If the practical use of this metric cannot be clearly explained, I think it would be better not to add it to the system, as it would only increase the complexity of the metrics framework.

cc: @errose28 @adoroszlai @weimingdiit

@slfan1989 (Contributor)

> A common module could be designed inside Ozone to compute traffic trends. It would aggregate over time periods, for example 5s, 1min, 1h, regularly collecting the difference between two points in time: 09:10:20 ~ 09:10:25 -> 40 MB, 09:10:30 ~ 09:11:30 -> 500 MB, 09:10:30 ~ 10:10:30 -> 2 GB. These differences should be treated as instantaneous values.

I partially agree with this: network traffic monitoring should indeed look like such a graph. However, I believe such a complex statistical module should not be added inside Ozone; instead, it could be implemented in an external metric collection system.

@adoroszlai adoroszlai marked this pull request as draft January 15, 2025 13:44