HDDS-11566. Replication tasks add transferredBytes and queuedTime metrics #7306

Draft: wants to merge 1 commit into base: master

Conversation

jianghuazhu (Contributor)

What changes were proposed in this pull request?

Added transferredBytes and queuedTime metrics for replication tasks, covering both EC reconstruction and container replication.
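For reference, a minimal sketch of how such metrics could be exposed through Hadoop's metrics2 framework, which Ozone datanodes already use for their JMX metrics. The class and method names here are illustrative assumptions, not necessarily what the patch implements:

```java
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableRate;

// Hypothetical metrics source; names are illustrative only.
@Metrics(about = "Replication task metrics", context = "dfs")
public final class ReplicationTaskMetricsSketch {

  @Metric("Total bytes transferred by replication tasks")
  private MutableCounterLong transferredBytes;

  @Metric("Time replication tasks spend queued before execution")
  private MutableRate queuedTime;

  public static ReplicationTaskMetricsSketch create() {
    return DefaultMetricsSystem.instance().register(
        "ReplicationTaskMetricsSketch", "Replication task metrics",
        new ReplicationTaskMetricsSketch());
  }

  // Called after a task finishes copying data.
  public void incrTransferredBytes(long bytes) {
    transferredBytes.incr(bytes);
  }

  // Called when a task is dequeued; elapsed = dequeueTime - enqueueTime.
  public void addQueuedTime(long elapsedMillis) {
    queuedTime.add(elapsedMillis);
  }
}
```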

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11566

How was this patch tested?

CI:
https://github.com/jianghuazhu/ozone/actions/runs/11310889447

Datanode JMX: (screenshots of the new metrics in the JMX output omitted)

@jianghuazhu (Contributor Author)

@sodonnel @errose28 @fapifta, can you help review this PR?
Thanks.

@slfan1989 (Contributor)

slfan1989 commented Oct 13, 2024

@jianghuazhu I have some questions about these metrics. My understanding is that metrics should have practical measurement value. Each machine may receive a large number of EC recover commands and replication commands; how can the bytes transferred help us? Additionally, these commands themselves have a timeout, so they won't execute beyond that time. What is the value of the queueing time?

cc: @whbing @weimingdiit

@weimingdiit (Contributor)

weimingdiit commented Oct 14, 2024

> My understanding is that metrics should have practical measurement value

@slfan1989 @jianghuazhu I agree with this view. The average waiting time in the queue may be a useful metric, because it tells us the backlog of commands in the queue, but I am not sure whether the transferredBytes metric is meaningful for observing system performance.

@errose28 (Contributor)

We should have some way to approximate how much network traffic is happening due to replication tasks. To really visualize this, it would have to go in a dashboard. I think it would look something like this:

  • Ozone tracks the total number of bytes transferred since the last restart.
  • In the dashboard (Grafana for example) we approximate network traffic over time by subtracting the previously sampled metric value from the current sample.
    • Note that the metric's implementation in Ozone would only ever increment the counter, so the values would never decrease unless the cluster is restarted.
    • The chart would then look like a step function where each plateau shows the amount of data transferred between its start and end times (a sketch of this delta logic appears after this comment).

@kerneltime is this the way such a thing is usually implemented? I'm not sure if we have a standard in other Ozone dashboards for how to chart continuous events like network traffic.
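For illustration, a minimal Java sketch of that delta computation (a hypothetical helper, not part of the patch; in practice the dashboard's own rate/delta functions would apply this to the sampled values):

```java
// Sketch of the sampling idea above: the counter only grows, and
// traffic per interval is derived by subtracting consecutive samples.
// All names here are illustrative.
public final class CounterDeltaSampler {

  private long lastValue = -1;  // previous counter sample, -1 = none yet
  private long lastTimeMs = -1;

  // Returns bytes/second since the previous sample, or -1 on the first
  // sample or when the counter went backwards (e.g. datanode restart).
  public double sample(long counterValue, long nowMs) {
    double rate = -1;
    if (lastValue >= 0 && counterValue >= lastValue && nowMs > lastTimeMs) {
      rate = (counterValue - lastValue) * 1000.0 / (nowMs - lastTimeMs);
    }
    lastValue = counterValue;
    lastTimeMs = nowMs;
    return rate;
  }
}
```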

@jianghuazhu (Contributor Author)

jianghuazhu commented Oct 17, 2024

Thanks @errose28 for the comment and review.
@kerneltime , do you have any new suggestions?
Thanks.

@adoroszlai (Contributor)

@kerneltime @slfan1989 @weimingdiit how should we proceed on this PR?

@jianghuazhu (Contributor Author)

jianghuazhu commented Dec 13, 2024

> In the dashboard (Grafana for example) we approximate network traffic over time by subtracting the previously sampled metric value from the current sample.

Regarding this, I think there are two ways to achieve it:

  1. Now that we have the total number of bytes transferred, we can compute the difference outside the Ozone system whenever we want to check the transfer trend. For example, if the total at 09:10:20 is 300 MB and the total at 09:10:30 is 400 MB, then the traffic during 09:10:20~09:10:30 was 100 MB.

  2. A common module could be designed inside Ozone to compute traffic trends. It would aggregate over time periods, for example 5s, 1min, 1h, regularly collecting the difference between two points in time: 09:10:20 ~ 09:10:25 -> 40 MB, 09:10:30 ~ 09:11:30 -> 500 MB, 09:10:30 ~ 10:10:30 -> 2 GB. These differences should be treated as instantaneous values.
    In general, the effect would look like a step chart (mockup image omitted); a sketch of this idea follows below.
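To make option 2 concrete, here is a purely illustrative sketch of such a module, assuming a single counter and a single fixed window (a real module would track several window sizes, e.g. 5s, 1min, 1h):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Samples a monotonic counter at a fixed period and reports the delta
// for each period as an "instantaneous" value. Names are illustrative.
public final class TrafficTrendAggregator {

  private final LongSupplier counter;  // e.g. metrics::getTransferredBytes
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private long lastSample;

  public TrafficTrendAggregator(LongSupplier counter) {
    this.counter = counter;
  }

  // Report the bytes transferred in each window of the given length.
  public void start(long window, TimeUnit unit) {
    lastSample = counter.getAsLong();
    scheduler.scheduleAtFixedRate(() -> {
      long now = counter.getAsLong();
      long delta = Math.max(0, now - lastSample);  // guard against restarts
      lastSample = now;
      System.out.printf("last %d %s: %d bytes%n", window, unit, delta);
    }, window, window, unit);
  }
}
```

For example, `new TrafficTrendAggregator(metrics::getTransferredBytes).start(5, TimeUnit.SECONDS);` would report a 5-second traffic figure, assuming a hypothetical getter on the metrics class.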

@adoroszlai @errose28 @kerneltime , what do you think?

@slfan1989 (Contributor)

@jianghuazhu Thanks for your contribution! However, I still don't fully understand the specific purpose of this metric. In other words, how do fluctuations of this metric help us assess the state of a DataNode? We have a cluster where 99% of the data uses EC, but based on my maintenance experience, I typically don't pay much attention to traffic changes caused by reconstruction on a DataNode, because compared to read/write traffic the impact should be negligible. If the practical use of this metric cannot be clearly explained, I think it would be better not to add it to the system, as it would only increase the complexity of the metrics framework.

cc: @errose28 @adoroszlai @weimingdiit

@slfan1989 (Contributor)

> A common module could be designed inside Ozone to compute traffic trends. It would aggregate over time periods, for example 5s, 1min, 1h, regularly collecting the difference between two points in time: 09:10:20 ~ 09:10:25 -> 40 MB, 09:10:30 ~ 09:11:30 -> 500 MB, 09:10:30 ~ 10:10:30 -> 2 GB. These differences should be treated as instantaneous values.

I partially agree with this: network traffic monitoring should indeed look like such a graph. However, I believe such a complex statistical module should not be added inside Ozone; instead, it could be implemented in an external metric collection system.

@adoroszlai adoroszlai marked this pull request as draft January 15, 2025 13:44