-
Notifications
You must be signed in to change notification settings - Fork 457
[Nvidia/GPU] Introduce Nvidia GPU Integration #12768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Nvidia/GPU] Introduce Nvidia GPU Integration #12768
Conversation
Todo: Add k8s container, pod, and namespace info from labels Should also include labels / mapping for kubernetes container, pod, and namespace
container => kubernetes.container.name
Perhaps with corresponding dashboard elements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution.
Shared inital set of comments from the 1st review.
target_field: gpu.nvlink.bandwidth.total | ||
ignore_missing: true | ||
- rename: | ||
field: prometheus.DCGM_FI_DEV_GPU_UTIL.value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While going through this issue, there is a mention about
- high resource utilisation
- deprecation of a few metrics including
DCGM_FI_DEV_GPU_UTIL
.
Is this scenario observed while integration testing? If there are metrics that are deprecated or result in high resource intensive, it would be best to not consider this metric for creating dashboard visualisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove this metric and consider replacing with the referenced ones
Added @daniela-elastic as the reviewer for the dashboard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and great to see starting to provide support for GPUs. Minor comments (important if we want to move this integration to GA) - it would be great to have some totals at the top of the dashboard - eg total energy consumption, total number of errors, etc. Also, it would be great to have the dashboard follow best practices for integration dashboards described here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the comments, I understand that there is certain improvements we must make on the dashboard.
The PR has been in the hold state for some time.
We can take up the dashboard improvements as part of the new release of the package.
PR Apporved.
@strawgate , you have all the approvals for this PR. Do you want to consider merging? Do let us know if you need any help. |
Enabling the expensive metrics is a client side change and so us showing them isn't a problem afaik. I'm good with merging and opening an issue with the current feedback will bump min kibana ver to try to fix build |
It appears that the build.yml, reference, is missing from this package causing the error. Also, a few fields mappings are also missing, leading to the following errors
|
b37d119
to
24ebd64
Compare
24ebd64
to
54f2b09
Compare
|
💚 Build Succeeded
History
|
Package nvidia_gpu - 0.1.0 containing this change is available at https://epr.elastic.co/package/nvidia_gpu/0.1.0/ |
Proposed commit message
Introduce NVIDIA GPU Monitoring Integration
Checklist
changelog.yml
file.Author's Checklist
How to test this PR locally
Deploy NVIDIA DGCM on a device with an NVIDIA GPU to get a prometheus metrics endpoint that you can provide to the integration.
If you have docker this just requires:
Configure the integration to point at the host running the container and GPU
http://nvidiahost:9400/metrics
Some metrics are not enabled by default with the container, enabling all metrics requires some extra steps.
Related issues
Fixes #11930
Screenshots
WIP: