-
Notifications
You must be signed in to change notification settings - Fork 457
[System] Fix metrics overview dashboard #9771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[System] Fix metrics overview dashboard #9771
Conversation
🚀 Benchmarks reportTo see the full report comment with |
While comparing the before migration system metrics dashboards to this PR, I have observed that the fields are getting changed. And from this change, the metrics visualization and the table chart for CPU usage are showing different values! Were these changes expected? |
Do we have closure on this ? |
@milan-elastic : Was this issue reprodicible where in 10s we don't see the CPU values? |
No, not yet! I am looking forward to get a confirmation from @drewdaemon on this.
The panel previously displayed "-" instead of the actual value, resulting in 0% being shown. This issue has been resolved as part of this pull request. |
Yes, it is reproducible and fixed now as part of this PR |
I believe there are two things going on here. Differences between TSVB and LensThe number shown could be different even if the field hadn't changed since the old TSVB visualizations used an aggregation over a recent slice of the selected timeframe, while Lens's See #1437 (comment) for more discussion about this. Different field being usedI actually don't remember this change. But, looking at the system docs it seems like it is a correct one. The old field
The new field
Feels like you'd want to see total CPU usage, instead of just the user space. That said, I'm no expert on the data here. Maybe @cmacknz could double-check |
The total system CPU usage definitely seems more correct. @fearful-symmetry can you sanity check the correct CPU metrics are being used here? |
A side-note I wanted to state out-loud here. Completely removing the reduced time range setting will mean that the table may show out-of-date metrics and this may not be clear to the user. For example, maybe the last CPU metric for host Not saying it's the wrong choice to remove this, just saying there's a trade-off. Another possibility would be increasing the reduced time range to a larger window to account for slight variances in the ingest frequency. That said, the user can always check the health of their agents and ingest through other means than these visualizations. |
Increasing the reduced time range seems like it might be a better choice. Showing 1 week old data for metrics feels wrong. The data collection interval defaults to 10s, so at most you'd have one sample in that period. Using a longer one makes more sense. What value to use feels a bit arbitrary, maybe 15m instead of 10s to guarantee we have something unless there is a real problem? I think in an ideal world we'd want this to be some fixed multiple of the configured collection period, but I'm not sure that's possible. |
FWIW, I think this makes sense.
Not possible today. But, I think variables in the integration assets would be powerful. |
After our team discussion, we've decided to adjust the time range to 15 minutes. While this won't always ensure complete data population,But any day it's a better option than sticking with a reduced time range of 10 seconds. |
}, | ||
{ | ||
"color": "#cc5642", | ||
"stop": 100 | ||
"stop": 1.85 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be 1.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@drewdaemon Yes, Ideally it should be 1
instead of 1.85
, seems like it has calculated it's value automatically! From the UI there is no place I've found from where I can manipulate this value! To make it 1
I've changed the value manually in json file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@milan-elastic I'm sorry... I misunderstood. If this is the value the Kibana automatically set, we should stick to that. Let's revert to 1.85
.
I think Kibana is probably doing its best to make sure that we still get a red color if the percentage value rises above 1
…stic/integrations into bugfix-system-metrics-overview
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work.
I recommend respecting the final color stop value from Kibana (ref).
Approving to unblock.
@milan-elastic : I hope the dashboards json is autogenerated from Kibana and not manipulated manually. Nit: There are 3 files showing in diff due to extra line addition. Please take care of that. |
@ishleenk17 After @drewdaemon comment I've reverted the
I think we should not revert them back, because those changes are result of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I require code owner review to merge this PR, can someone from @elastic/sec-linux-platform and @elastic/sec-windows-platform review this PR and provide approval ? |
💚 Build Succeeded
History
|
|
Package system - 1.58.1 containing this change is available at https://epr.elastic.co/search?package=system |
Proposed commit message
Top hosts by CPU usage over time
andTop hosts by memory usage over time
.Related Issues
Checklist
changelog.yml
file.Screenshots
Before
After