HDDS-13014. Improve PrometheusMetricsSink#normalizeName performance #8438
Conversation
Currently, if a datanode still uses schema V2 (i.e. one RocksDB per container), there might be high memory overhead due to the RocksDB metrics. To reduce the number of metrics stored in the cache, we might need to change those metrics to use tags instead, for example as sketched below. This applies to other similar metrics.
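For illustration (hypothetical metric and label names, not the actual RocksDB metric set), embedding the container ID in the metric name produces one time-series name per container, while a tag collapses them into a single name:

# One metric name per container (hypothetical names):
rocksdb_container_1_block_cache_usage 42
rocksdb_container_2_block_cache_usage 17

# One metric name with a container tag:
rocksdb_block_cache_usage{container="1"} 42
rocksdb_block_cache_usage{container="2"} 17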
Thanks @ivandika3 for the patch. Can we use some kind of size-limited cache?
I think that would also improve usability of metrics, regardless of the cache size problem.
@adoroszlai Thanks for the review. I have updated the patch to use a fixed cache size with max size 100,000. Please let me know what you think.
I have updated the patch to use a fixed cache size with max size 100,000
Thanks @ivandika3 for updating the patch, LGTM.
// Original metric name -> Normalized Prometheus metric name
private static final CacheLoader<String, String> NORMALIZED_NAME_CACHE_LOADER =
    CacheLoader.from(PrometheusMetricsSinkUtil::normalizeImpl);
private static final com.google.common.cache.LoadingCache<String, String> NORMALIZED_NAME_CACHE =
nit (change only if you need to update the patch for any other reason): add import.
Apologies, previously I was importing the Guava Cache interface, but it conflicts with our internal Cache class, so the fully qualified name is used instead.
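For reference, a minimal sketch of the resulting setup inside PrometheusMetricsSinkUtil, assuming the maximum size of 100,000 mentioned above and a normalizeName(String) entry point (that method name and signature are assumptions; only the two fields come from the diff):

import com.google.common.cache.CacheLoader;

// Original metric name -> Normalized Prometheus metric name.
// The loader computes the normalized form once per cache miss.
private static final CacheLoader<String, String> NORMALIZED_NAME_CACHE_LOADER =
    CacheLoader.from(PrometheusMetricsSinkUtil::normalizeImpl);

// LoadingCache is fully qualified to avoid the clash with the internal Cache class;
// maximumSize bounds memory use even if the set of metric names is unbounded.
private static final com.google.common.cache.LoadingCache<String, String> NORMALIZED_NAME_CACHE =
    com.google.common.cache.CacheBuilder.newBuilder()
        .maximumSize(100_000)
        .build(NORMALIZED_NAME_CACHE_LOADER);

// Hypothetical entry point: looks up the cached result, computing it on first use.
public static String normalizeName(String metricName) {
  // getUnchecked is safe here because normalizeImpl throws no checked exceptions.
  return NORMALIZED_NAME_CACHE.getUnchecked(metricName);
}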
FYI, raised a similar patch to Hadoop as HADOOP-19571 (apache/hadoop#7692)
Thanks @ivandika3 for the patch.
Thanks @adoroszlai for the review.
What changes were proposed in this pull request?
From a 5-minute flamegraph of S3G, we see that nearly 40% of the CPU usage is spent in PrometheusMetrics#putMetrics.
Moreover, 30% of the CPU usage is attributed to PrometheusMetricsSinkUtil#normalizeName alone. We see that most of that time is spent on regex matching (probably due to the lookbehind and lookahead mechanisms).
We need to find a way to improve this performance.
Two possible improvements were considered; this patch caches the normalized names (original metric name -> normalized Prometheus metric name) in a size-limited cache, so the regex runs at most once per unique name. The normalization regex is sketched below.
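For context, the split regex in question looks roughly like the one in Hadoop's PrometheusMetricsSink (a sketch; details of the surrounding normalization may differ):

import java.util.regex.Pattern;

// Splits at CamelCase boundaries. The (?<!...) lookbehind and (?=...) lookahead
// are evaluated at nearly every character position, which is what dominates the profile.
private static final Pattern SPLIT_PATTERN =
    Pattern.compile("(?<!(^|[A-Z_]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])");

static String normalizeImpl(String metricName) {
  // e.g. "BytesWritten" -> ["Bytes", "Written"] -> "bytes_written"
  String[] parts = SPLIT_PATTERN.split(metricName);
  return String.join("_", parts).toLowerCase();
}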
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-13014
How was this patch tested?
Clean CI: https://github.com/ivandika3/ozone/actions/runs/15001669302
Simple microbenchmark. See: https://issues.apache.org/jira/secure/attachment/13076478/TestPrometheusMetricsSinkUtilPerformance.java
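The attached benchmark is essentially of the following shape (a simplified sketch, not the attached file verbatim; the metric names and the normalizeName entry point are illustrative):

public static void main(String[] args) {
  // Measures throughput of normalizeName over a small, fixed set of metric names.
  // With the cache, the regex runs only once per unique name, so the loop should
  // be dominated by cache lookups rather than regex matching.
  String[] names = {"BytesWritten", "NumKeys", "BlockCacheUsage"};
  long start = System.nanoTime();
  for (int i = 0; i < 1_000_000; i++) {
    PrometheusMetricsSinkUtil.normalizeName(names[i % names.length]);
  }
  System.out.println("1M calls took " + (System.nanoTime() - start) / 1_000_000 + " ms");
}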