HDDS-13014. Improve PrometheusMetricsSink#normalizeName performance #8438

ivandika3 · 2025-05-13T06:08:43Z

What changes were proposed in this pull request?

From a 5-minute flamegraph of S3G, we see that nearly 40% of the CPU usage is spent in PrometheusMetrics#putMetrics.
Moreover, 30% of the CPU usage is attributed to PrometheusMetricsSinkUtil#normalizeName alone. Most of that time is spent on regex matching (probably due to the lookbehind and lookahead mechanisms).

We need to find a way to improve the performance. Two possible improvements:

1. Optimize the regex matching or replace it entirely: perhaps we can borrow some logic from the Prometheus JMX exporter (https://github.com/prometheus/jmx_exporter), or replace it with the jmx_exporter implementation.
2. Add a name conversion cache between the Hadoop metrics name and the Prometheus metrics name.
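The cache idea can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the class name, the method names, and both regex patterns are assumptions made for the example (the split pattern here is deliberately simpler than whatever normalizeName actually uses), and the map is unbounded.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch of a name conversion cache; names are illustrative only.
final class CachedNameNormalizer {
  // Assumed split pattern (NOT the one used in Ozone): a zero-width boundary
  // between a lowercase letter or digit and an uppercase letter.
  private static final Pattern CAMEL_BOUNDARY =
      Pattern.compile("(?<=[a-z0-9])(?=[A-Z])");
  private static final Pattern NON_ALPHANUMERIC =
      Pattern.compile("[^a-zA-Z0-9]+");

  // Original metric name -> normalized name. Unbounded in this sketch.
  private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

  // The expensive path: regex split, join, lowercase, replace delimiters.
  static String normalizeUncached(String name) {
    String joined = String.join("_", CAMEL_BOUNDARY.split(name))
        .toLowerCase(Locale.ROOT);
    return NON_ALPHANUMERIC.matcher(joined).replaceAll("_");
  }

  // The cheap path: the regex work runs only the first time a name is seen.
  static String normalize(String name) {
    return CACHE.computeIfAbsent(name, CachedNameNormalizer::normalizeUncached);
  }

  public static void main(String[] args) {
    System.out.println(normalize("NumOpenKeys"));
  }
}
```

Because metric names are requested again on every scrape, each name pays the regex cost once and later lookups become plain hash lookups. The trade-off is memory that grows with the number of distinct metric names.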

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13014

How was this patch tested?

Clean CI: https://github.com/ivandika3/ozone/actions/runs/15001669302

Simple microbenchmark. See: https://issues.apache.org/jira/secure/attachment/13076478/TestPrometheusMetricsSinkUtilPerformance.java

# With cache

Warming up...

Running performance test...

Performance Test Results:
Total test cases: 5
Total iterations: 100000
Total operations: 500000
Total time: 47.00 ms
Average time per operation: 0.000 ms

Testing cache hit performance...
Cache hit results:
Total time with cache: 39.00 ms
Average time per operation with cache: 0.000 ms


# Without cache

Warming up...

Running performance test...

Performance Test Results:
Total test cases: 5
Total iterations: 100000
Total operations: 500000
Total time: 1236.00 ms
Average time per operation: 0.002 ms

Testing cache hit performance...
Cache hit results:
Total time with cache: 1167.00 ms
Average time per operation with cache: 0.002 ms
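The per-operation averages above are rounded to three decimals in milliseconds, which hides the difference in the cached run. Converting the reported totals to nanoseconds per operation makes the comparison easier to read. The helper below is only a small arithmetic sketch over the numbers printed above; the class and method names are made up for illustration.

```java
// Hypothetical helper: converts the totals reported above to ns per operation.
final class BenchmarkArithmetic {
  static double nanosPerOp(double totalMillis, long operations) {
    return totalMillis * 1_000_000.0 / operations;
  }

  public static void main(String[] args) {
    long operations = 5L * 100_000L; // 5 test cases x 100000 iterations
    double withCache = nanosPerOp(47.0, operations);
    double withoutCache = nanosPerOp(1236.0, operations);
    System.out.printf("with cache:    %.0f ns/op%n", withCache);
    System.out.printf("without cache: %.0f ns/op%n", withoutCache);
    System.out.printf("ratio:         %.1fx%n", withoutCache / withCache);
  }
}
```

With these totals the main test run works out to 94 ns per operation with the cache and 2472 ns without it, roughly a 26x difference in this microbenchmark.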

ivandika3 · 2025-05-14T01:56:13Z

Currently, if a datanode still uses schema V2 (i.e. one RocksDB per container), there might be high memory overhead due to the RocksDB metrics.

To reduce the number of metrics stored in the cache, I think we might need to change the metrics to use a tag instead.
So instead of

rocksdb_ds_4545195c_120e_47c0_832a_25c4e618250e_container_db_write_raw_block_micros_percentile95{hostname="<redacted>"}

We use

rocksdb_container_db_write_raw_block_micros_percentile95{storageid="ds_4545195c_120e_47c0_832a_25c4e618250", hostname="<redacted>"}

This applies to other similar metrics
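To make the intended mapping concrete, here is a small sketch that separates the storage ID from the metric name. It is illustrative only: in practice the tag would more likely be attached where the metric is registered rather than parsed back out of the name, and the class name, method name, and regex are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split "rocksdb_<storage id>_<rest>" into name and tag.
final class StorageIdTagSketch {
  // Assumes the storage ID looks like "ds_" followed by a UUID whose dashes
  // have been replaced with underscores, as in the example above.
  private static final Pattern PER_STORAGE_NAME = Pattern.compile(
      "^rocksdb_(ds_[0-9a-f]{8}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{4}_[0-9a-f]{12})_(.+)$");

  /** Returns {metricName, storageId}; storageId is null if no ID is embedded. */
  static String[] split(String name) {
    Matcher m = PER_STORAGE_NAME.matcher(name);
    if (!m.matches()) {
      return new String[] {name, null};
    }
    return new String[] {"rocksdb_" + m.group(2), m.group(1)};
  }

  public static void main(String[] args) {
    String[] parts = split(
        "rocksdb_ds_4545195c_120e_47c0_832a_25c4e618250e"
            + "_container_db_write_raw_block_micros_percentile95");
    System.out.println(parts[0] + "{storageid=\"" + parts[1] + "\"}");
  }
}
```

With the ID moved into a tag, the number of distinct metric names no longer grows with the number of containers, so the name cache stays small.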

adoroszlai · 2025-05-14T07:32:53Z

Thanks @ivandika3 for the patch. Can we use some kind of size-limited cache?

> we might need to change the metrics to use tag
> instead of rocksdb_ds_4545195c_120e_47c0_832a_25c4e618250e_container_db_write_raw_block_micros_percentile95{hostname="<redacted>"}
> We use rocksdb_container_db_write_raw_block_micros_percentile95{storageid="ds_4545195c_120e_47c0_832a_25c4e618250", hostname="<redacted>"}
>
> This applies to other similar metrics

I think that would also improve usability of metrics, regardless of the cache size problem.

ivandika3 · 2025-05-15T03:14:46Z

@adoroszlai Thanks for the review. I have updated the patch to use a fixed cache size with max size 100,000. Please let me know what you think.
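For readers unfamiliar with size-limited caches, the idea can be sketched with the JDK alone. Note that the patch itself uses Guava's LoadingCache (see the review snippet in this thread); the version below swaps in an access-ordered LinkedHashMap with LRU eviction purely to stay dependency-free, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical JDK-only stand-in for a size-limited loading cache.
final class BoundedNameCache {
  private final Function<String, String> loader;
  private final Map<String, String> entries;

  BoundedNameCache(int maxSize, Function<String, String> loader) {
    this.loader = loader;
    // accessOrder=true keeps the least recently used entry first.
    this.entries = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        // Evict the least recently used entry once the limit is exceeded.
        return size() > maxSize;
      }
    };
  }

  // Coarse locking keeps the sketch simple; a real cache would be finer-grained.
  synchronized String get(String name) {
    String cached = entries.get(name);
    if (cached == null) {
      cached = loader.apply(name);
      entries.put(name, cached);
    }
    return cached;
  }

  synchronized int size() {
    return entries.size();
  }

  public static void main(String[] args) {
    // 100,000 matches the maximum size mentioned above; the loader is a placeholder.
    BoundedNameCache cache = new BoundedNameCache(100_000, String::toLowerCase);
    System.out.println(cache.get("NumOpenKeys"));
  }
}
```

The upper bound caps memory use even when many distinct metric names exist, at the cost of recomputing names that have been evicted.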

adoroszlai

> I have updated the patch to use a fixed cache size with max size 100,000

Thanks @ivandika3 for updating the patch, LGTM.

adoroszlai · 2025-05-16T10:10:02Z

hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/PrometheusMetricsSinkUtil.java

+  // Original metric name -> Normalized Prometheus metric name
+  private static final CacheLoader<String, String> NORMALIZED_NAME_CACHE_LOADER =
+      CacheLoader.from(PrometheusMetricsSinkUtil::normalizeImpl);
+  private static final com.google.common.cache.LoadingCache<String, String> NORMALIZED_NAME_CACHE =


nit (change only if you need to update the patch for any other reason): add import.

Apologies, I was previously using the Guava Cache interface, which conflicts with our internal Cache class.

ivandika3 · 2025-05-20T09:36:29Z

FYI, I raised a similar patch for Hadoop as HADOOP-19571 (apache/hadoop#7692).

adoroszlai · 2025-05-22T14:16:36Z

Thanks @ivandika3 for the patch.

ivandika3 · 2025-05-22T14:20:46Z

Thanks @adoroszlai for the review.

ivandika3 force-pushed the HDDS-13014 branch from f15bc7e to 401f577 Compare May 13, 2025 06:14

HDDS-13014. Improve PrometheusMetricsSink#normalizeName performance

5a1b162

ivandika3 force-pushed the HDDS-13014 branch from 401f577 to 5a1b162 Compare May 13, 2025 06:16

ivandika3 self-assigned this May 13, 2025

ivandika3 added performance metrics labels May 13, 2025

Merge remote-tracking branch 'origin/master' into HDDS-13014

ec64357

ivandika3 marked this pull request as ready for review May 14, 2025 01:16

Use fixed cache size

47d14ad

adoroszlai reviewed May 16, 2025


adoroszlai requested a review from kerneltime May 16, 2025 10:10

adoroszlai merged commit 87dfa5a into apache:master May 22, 2025
42 checks passed

ivandika3 deleted the HDDS-13014 branch May 22, 2025 14:20
