Skip to content

[CELEBORN-2005] Introduce numBytesIn, numBytesOut, numBytesInPerSecond, numBytesOutPerSecond metrics for RemoteShuffleServiceFactory #3272

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

SteNicholas
Copy link
Member

@SteNicholas SteNicholas commented May 22, 2025

What changes were proposed in this pull request?

Introduce numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics for RemoteShuffleServiceFactory.

Scope Infix Metrics Description Type
Task Shuffle.Remote.[ShuffleId] numBytesIn The total number of bytes this shuffle has read. Counter 
Task Shuffle.Remote.[ShuffleId] numBytesOut The total number of bytes this shuffle has written. Counter 
Task Shuffle.Remote.[ShuffleId] numRecordsOut The total number of records this shuffle has written. Counter 
Task Shuffle.Remote.[ShuffleId] numBytesInPerSecond The number of bytes this shuffle reads per second. Meter 
Task Shuffle.Remote.[ShuffleId] numBytesOutPerSecond The number of bytes this shuffle writes per second. Meter 
Task Shuffle.Remote.[ShuffleId] numRecordsOutPerSecond The number of records this shuffle writes per second. Meter 

Note:

  • numBytesIn and numBytesOut metrics include the total number of bytes for records and events.
  • numRecordsOut metric only includes the total number of records, instead of records and events.

Why are the changes needed?

There is no any metrics related to shuffle read operations and operations writing shuffle data for flink shuffle. It's proposed to introduce numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics for RemoteShuffleServiceFactory.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • RemoteShuffleOutputGateSuiteJ#testSimpleWriteData
  • RemoteShuffleResultPartitionSuiteJ

@SteNicholas SteNicholas marked this pull request as draft May 22, 2025 07:12
@SteNicholas SteNicholas marked this pull request as ready for review May 22, 2025 09:00
@SteNicholas SteNicholas requested a review from codenohup May 22, 2025 09:05
Copy link
Contributor

@codenohup codenohup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @SteNicholas

Thank you for your contribution! This PR looks good to me. Could you please clarify the related metric names and information in the documentation?

@SteNicholas
Copy link
Member Author

SteNicholas commented May 22, 2025

@codenohup, I will clarify the related metric names and information in the documentation in another pull request.

@SteNicholas SteNicholas requested review from FMX and RexXiong May 22, 2025 10:04
@SteNicholas SteNicholas force-pushed the CELEBORN-2005 branch 3 times, most recently from fbd0997 to f920934 Compare May 22, 2025 16:15
@SteNicholas SteNicholas requested a review from FMX May 22, 2025 16:46
@SteNicholas SteNicholas marked this pull request as draft May 23, 2025 02:35
@SteNicholas SteNicholas force-pushed the CELEBORN-2005 branch 2 times, most recently from 4c201e3 to 1541c89 Compare May 23, 2025 03:12
…BytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics for RemoteShuffleServiceFactory
@SteNicholas SteNicholas marked this pull request as ready for review May 23, 2025 03:17
@SteNicholas SteNicholas requested review from reswqa and codenohup May 23, 2025 03:17
@SteNicholas
Copy link
Member Author

@reswqa, @codenohup, thanks for review. I have already added numRecordsOut and numRecordsOutPerSecond metrics in this pull request. PTAL.

…BytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics for RemoteShuffleServiceFactory
…BytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics for RemoteShuffleServiceFactory
Copy link
Contributor

@codenohup codenohup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution; this PR looks good to me.

It would be advisable to test this PR by running a Flink job and obtaining shuffle metrics through the Flink metric reporter. Then compare the metrics from Flink and Celeborn to ensure they are effective and reasonable.

Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks. Merged into main(v0.7.0).

@FMX FMX closed this in 11ca1a7 May 23, 2025
reswqa pushed a commit that referenced this pull request May 30, 2025
…, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics

### What changes were proposed in this pull request?

Introduce `ShuffleMetricGroup` for `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics.

Follow up #3272.

### Why are the changes needed?

`numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics should put shuffle id into variables, which could introduce `ShuffleMetricGroup` to support. Meanwhile, #3272 would print many same logs as follows that shoud be improved:

```
2025-05-28 10:48:54,433 WARN  [flink-akka.actor.default-dispatcher-18] org.apache.flink.metrics.MetricGroup                         [] - Name collision: Group already contains a Metric with the name 'numRecordsOut'. Metric will not be reported.[11.66.62.202, taskmanager, antc4flink3980005426-taskmanager-3-70, antc4flink3980005426, [vertex-2]HashJoin(joinType=[LeftOuterJoin], where=[(f0 = f00)], select=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60], build=[right]) -> Sink: Sink(table=[default_catalog.default_database.sink], fields=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60]), 2, Shuffle, Remote, 1]
```

### Does this PR introduce _any_ user-facing change?

Introduce `celeborn.client.flink.metrics.scope.shuffle` config option to define the scope format string that is applied to all metrics scoped to a shuffle:

- Variables:
   - Shuffle: `<task_id>, <task_name>, <task_attempt_id>, <task_attempt_num>, <subtask_index>, <shuffle_id>`.
- Metrics:

Scope | Metrics | Description | Type
-- | -- | -- | --
Shuffle | numBytesIn | The total number of bytes this shuffle has read. | Counter |
Shuffle | numBytesOut| The total number of bytes this shuffle has written. | Counter |
Shuffle | numRecordsOut | The total number of records this shuffle has written. | Counter |
Shuffle | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter |
Shuffle | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter |
Shuffle | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter |

### How was this patch tested?

Manual test.

![image](https://github.com/user-attachments/assets/a10c18ab-84f9-44f5-bb2d-e6b08e5bc64e)
![image](https://github.com/user-attachments/assets/0cb29c17-3388-4608-b7a4-ee7e3c9b43c1)

Closes #3296 from SteNicholas/CELEBORN-2005.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Weijie Guo <[email protected]>
SteNicholas added a commit that referenced this pull request May 30, 2025
…d, numBytesOutPerSecond metrics for RemoteShuffleServiceFactory

### What changes were proposed in this pull request?

Introduce `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics for `RemoteShuffleServiceFactory`.

Scope | Infix | Metrics | Description | Type
-- | -- | -- | -- | --
Task | Shuffle.Remote.[ShuffleId] | numBytesIn | The total number of bytes this shuffle has read. | Counter |
Task | Shuffle.Remote.[ShuffleId] | numBytesOut | The total number of bytes this shuffle has written. | Counter |
Task | Shuffle.Remote.[ShuffleId] | numRecordsOut | The total number of records this shuffle has written. | Counter |
Task | Shuffle.Remote.[ShuffleId] | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter |
Task | Shuffle.Remote.[ShuffleId] | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter |
Task | Shuffle.Remote.[ShuffleId] | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter |

Note:

- `numBytesIn` and `numBytesOut` metrics include the total number of bytes for records and events.
- `numRecordsOut` metric only includes the total number of records, instead of records and events.

### Why are the changes needed?

There is no any metrics related to shuffle read operations and operations writing shuffle data for flink shuffle. It's proposed to introduce `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics for `RemoteShuffleServiceFactory`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `RemoteShuffleOutputGateSuiteJ#testSimpleWriteData`
- `RemoteShuffleResultPartitionSuiteJ`

Closes #3272 from SteNicholas/CELEBORN-2005.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: mingji <[email protected]>
SteNicholas added a commit that referenced this pull request May 30, 2025
…, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics

### What changes were proposed in this pull request?

Introduce `ShuffleMetricGroup` for `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics.

Follow up #3272.

### Why are the changes needed?

`numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics should put shuffle id into variables, which could introduce `ShuffleMetricGroup` to support. Meanwhile, #3272 would print many same logs as follows that shoud be improved:

```
2025-05-28 10:48:54,433 WARN  [flink-akka.actor.default-dispatcher-18] org.apache.flink.metrics.MetricGroup                         [] - Name collision: Group already contains a Metric with the name 'numRecordsOut'. Metric will not be reported.[11.66.62.202, taskmanager, antc4flink3980005426-taskmanager-3-70, antc4flink3980005426, [vertex-2]HashJoin(joinType=[LeftOuterJoin], where=[(f0 = f00)], select=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60], build=[right]) -> Sink: Sink(table=[default_catalog.default_database.sink], fields=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60]), 2, Shuffle, Remote, 1]
```

### Does this PR introduce _any_ user-facing change?

Introduce `celeborn.client.flink.metrics.scope.shuffle` config option to define the scope format string that is applied to all metrics scoped to a shuffle:

- Variables:
   - Shuffle: `<task_id>, <task_name>, <task_attempt_id>, <task_attempt_num>, <subtask_index>, <shuffle_id>`.
- Metrics:

Scope | Metrics | Description | Type
-- | -- | -- | --
Shuffle | numBytesIn | The total number of bytes this shuffle has read. | Counter |
Shuffle | numBytesOut| The total number of bytes this shuffle has written. | Counter |
Shuffle | numRecordsOut | The total number of records this shuffle has written. | Counter |
Shuffle | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter |
Shuffle | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter |
Shuffle | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter |

### How was this patch tested?

Manual test.

![image](https://github.com/user-attachments/assets/a10c18ab-84f9-44f5-bb2d-e6b08e5bc64e)
![image](https://github.com/user-attachments/assets/0cb29c17-3388-4608-b7a4-ee7e3c9b43c1)

Closes #3296 from SteNicholas/CELEBORN-2005.

Authored-by: SteNicholas <[email protected]>
Signed-off-by: Weijie Guo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants