[System] Fix metrics overview dashboard #9771

milan-elastic · 2024-05-01T14:07:32Z

Bug

Proposed commit message

Fixed the Table chart visualization, where CPU usage was shown as '-'.
Changed the palette param from 'Percent' to 'Number' for the Table chart visualization.
Added a data_stream.dataset filter at the panel level for Top hosts by CPU usage over time and Top hosts by memory usage over time.

Related Issues

closes [Metric System] Overview - wrongly configured table #9733

Checklist

I have reviewed tips for building integrations and this pull request is aligned with them.
I have verified that all data streams collect metrics or logs.
I have added an entry to my package's changelog.yml file.
I have verified that Kibana version constraints are current according to guidelines.

Screenshots

Before

After

elasticmachine · 2024-05-01T14:23:30Z

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

milan-elastic · 2024-05-02T06:36:52Z

While comparing the pre-migration system metrics dashboards with this PR, I noticed that the fields have changed. As a result, the metrics visualization and the table chart for CPU usage show different values!

Before Migration

After Migration

Were these changes expected?
cc: @drewdaemon

ishleenk17 · 2024-05-03T06:25:54Z

While comparing the pre-migration system metrics dashboards with this PR, I noticed that the fields have changed. As a result, the metrics visualization and the table chart for CPU usage show different values!

Before Migration

After Migration

Were these changes expected? cc: @drewdaemon

Do we have closure on this?
Do we know why the CPU values are shown as 0%?

ishleenk17 · 2024-05-03T06:27:02Z

@milan-elastic: Was this issue reproducible, where we don't see the CPU values within 10s?

milan-elastic · 2024-05-03T06:45:45Z

Do we have closure on this?

No, not yet! I am waiting for confirmation from @drewdaemon on this.

Do we know why the CPU values are shown as 0%?

The panel previously displayed "-" instead of the actual value, resulting in 0% being shown. This issue has been resolved as part of this pull request.

milan-elastic · 2024-05-03T06:48:49Z

@milan-elastic: Was this issue reproducible, where we don't see the CPU values within 10s?

Yes, it is reproducible, and it is now fixed as part of this PR.

drewdaemon · 2024-05-03T14:56:43Z

Hi @milan-elastic

I believe there are two things going on here.

Differences between TSVB and Lens

The number shown could be different even if the field hadn't changed since the old TSVB visualizations used an aggregation over a recent slice of the selected timeframe, while Lens's Last value function retrieves literally the last-reported value. The TSVB approach consistently confused... well... everybody.

See #1437 (comment) for more discussion about this.
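To make the difference concrete, here is a minimal Python sketch. It is not Kibana code: the function names, the 60-second slice width, and the sample readings are assumptions made up for illustration. It only shows that averaging a recent slice and taking the last-reported value can disagree on the same data.

```python
from datetime import datetime, timedelta
from statistics import mean


def recent_slice_average(samples, range_end, slice_width):
    """Average the samples reported in the final slice of the time range.

    This only approximates the TSVB behaviour described above; the slice
    width is an assumption made for this sketch.
    """
    slice_start = range_end - slice_width
    recent = [value for ts, value in samples if slice_start < ts <= range_end]
    return mean(recent) if recent else None


def last_value(samples):
    """Return the value of the most recently reported sample, or None."""
    if not samples:
        return None
    return max(samples, key=lambda sample: sample[0])[1]


# Made-up CPU readings (as fractions) for a single host.
range_end = datetime(2024, 5, 3, 12, 0, 0)
samples = [
    (range_end - timedelta(seconds=30), 0.20),
    (range_end - timedelta(seconds=20), 0.40),
    (range_end - timedelta(seconds=10), 0.90),
]

slice_average = recent_slice_average(samples, range_end, timedelta(seconds=60))
latest = last_value(samples)
print(slice_average, latest)
```

With these readings the slice average comes out at roughly 0.5 while the last value is 0.9, so the two panels would not match even on an unchanged field.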

Different field being used

I don't actually remember this change, but looking at the system docs, it seems like a correct one.

The old field system.cpu.user.norm.pct is

The percentage of CPU time spent in user space.

The new field system.process.cpu.total.norm.pct is

The percentage of CPU time spent by the process since the last event. This value is normalized by the number of CPU cores and it ranges from 0 to 100%.

Feels like you'd want to see total CPU usage, instead of just the user space.

That said, I'm no expert on the data here. Maybe @cmacknz could double-check

cmacknz · 2024-05-03T15:18:35Z

The total system CPU usage definitely seems more correct.

@fearful-symmetry can you sanity check the correct CPU metrics are being used here?

drewdaemon · 2024-05-06T16:00:51Z

A side-note I wanted to state out-loud here. Completely removing the reduced time range setting will mean that the table may show out-of-date metrics and this may not be clear to the user. For example, maybe the last CPU metric for host A was reported at 80% yesterday. My dashboard time range is set to the last 7 days. I will see 80% as the "current" value even though it was reported yesterday.

Not saying it's the wrong choice to remove this, just saying there's a trade-off. Another possibility would be increasing the reduced time range to a larger window to account for slight variances in the ingest frequency.

That said, the user can always check the health of their agents and ingest through other means than these visualizations.
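The trade-off can be sketched in a few lines of Python. This is an illustration only, not how Lens is implemented: `last_value_in_window` and the host data are hypothetical. It replays the example above, where host A last reported 80% yesterday and the dashboard covers the last 7 days.

```python
from datetime import datetime, timedelta


def last_value_in_window(samples, now, window):
    """Return the newest value reported within `window` before `now`, else None."""
    cutoff = now - window
    in_window = [(ts, value) for ts, value in samples if cutoff <= ts <= now]
    if not in_window:
        return None
    return max(in_window, key=lambda sample: sample[0])[1]


now = datetime(2024, 5, 6, 16, 0, 0)
# Host A last reported 80% CPU yesterday and nothing since.
host_a = [(now - timedelta(days=1), 0.80)]

# No reduced time range: the whole 7-day dashboard range is searched.
without_reduced_range = last_value_in_window(host_a, now, timedelta(days=7))
# A reduced time range (15 minutes here): yesterday's sample is out of scope.
with_reduced_range = last_value_in_window(host_a, now, timedelta(minutes=15))
print(without_reduced_range, with_reduced_range)
```

Without a reduced time range the stale 0.8 is returned as if it were current. With one, no value is returned for host A, which at least signals that nothing was reported recently.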

cmacknz · 2024-05-06T19:41:59Z

Not saying it's the wrong choice to remove this, just saying there's a trade-off. Another possibility would be increasing the reduced time range to a larger window to account for slight variances in the ingest frequency.

Increasing the reduced time range seems like it might be a better choice. Showing 1 week old data for metrics feels wrong.

The data collection interval defaults to 10s, so at most you'd have one sample in that period. Using a longer one makes more sense. What value to use feels a bit arbitrary, maybe 15m instead of 10s to guarantee we have something unless there is a real problem?

I think in an ideal world we'd want this to be some fixed multiple of the configured collection period, but I'm not sure that's possible.
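As a rough back-of-the-envelope check, the arithmetic can be written out in Python. The helper names are hypothetical and this is not something integration assets support; it only shows how many samples each window would be expected to hold at the default 10s collection period, and what a "fixed multiple" would look like.

```python
from datetime import timedelta


def expected_samples(window, collection_period):
    """Number of whole collection periods that fit in the window."""
    return window // collection_period


def window_from_period(collection_period, multiple):
    """Derive a reduced time range as a fixed multiple of the period."""
    return collection_period * multiple


period = timedelta(seconds=10)  # default collection interval

samples_in_10s = expected_samples(timedelta(seconds=10), period)
samples_in_15m = expected_samples(timedelta(minutes=15), period)
window = window_from_period(period, 90)
print(samples_in_10s, samples_in_15m, window)
```

A 10s window leaves room for a single sample, so any ingest jitter can empty it, while 15m corresponds to about 90 collection periods.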

drewdaemon · 2024-05-06T22:02:02Z

What value to use feels a bit arbitrary, maybe 15m instead of 10s to guarantee we have something unless there is a real problem?

FWIW, I think this makes sense.

I think in an ideal world we'd want this to be some fixed multiple of the configured collection period, but I'm not sure that's possible.

Not possible today. But, I think variables in the integration assets would be powerful.

milan-elastic · 2024-05-08T13:55:03Z

After our team discussion, we've decided to adjust the reduced time range to 15 minutes. While this won't always guarantee that the data is populated, it is a better option than keeping a reduced time range of 10 seconds.
cc: @tommyers-elastic @ishleenk17 @lalit-satapathy

drewdaemon · 2024-05-09T14:58:40Z

packages/system/kibana/dashboard/system-Metrics-system-overview.json

                                                    },
                                                    {
                                                        "color": "#cc5642",
-                                                        "stop": 100
+                                                        "stop": 1.85


Should this be 1.0?

@drewdaemon Yes, ideally it should be 1 instead of 1.85. It seems the value was calculated automatically! I haven't found anywhere in the UI to change it, so to make it 1 I edited the value manually in the JSON file.

@milan-elastic I'm sorry... I misunderstood. If this is the value that Kibana set automatically, we should stick to that. Let's revert to 1.85.

I think Kibana is probably doing its best to make sure that we still get a red color if the percentage value rises above 1.

drewdaemon

Great work.

I recommend respecting the final color stop value from Kibana (ref).

Approving to unblock.

ishleenk17 · 2024-05-14T06:09:21Z

@milan-elastic: I hope the dashboard JSON is exported from Kibana and not edited manually.
That is, the changes are made on the dashboards and then exported to JSON.
I see from one of the comments that a change was made manually.

Nit: Three files show up in the diff because of an extra added line. Please take care of that.

milan-elastic · 2024-05-14T06:19:08Z

@milan-elastic: I hope the dashboard JSON is exported from Kibana and not edited manually. That is, the changes are made on the dashboards and then exported to JSON. I see from one of the comments that a change was made manually.

@ishleenk17 After @drewdaemon's comment, I reverted the value I had manually changed to 1 back to 1.85, so that has been taken care of.

Nit: Three files show up in the diff because of an extra added line. Please take care of that.

I think we should not revert them, because those changes are produced by the elastic-package check command, so the same diff will appear every time anyone runs it!

ishleenk17

Looks good!

milan-elastic · 2024-05-14T08:59:00Z

I need a code owner review to merge this PR. Can someone from @elastic/sec-linux-platform and @elastic/sec-windows-platform review it and provide approval?
cc: @lalit-satapathy @ishleenk17

elasticmachine · 2024-05-23T07:17:11Z

💚 Build Succeeded

Buildkite Build
Commit: e284471

History

💚 Build #11405 succeeded c9e8f56
💔 Build #11377 failed af90798
💚 Build #11374 succeeded 5b98008
💚 Build #11277 succeeded 005b20a
💔 Build #11275 failed 4833769
💚 Build #11044 succeeded 41510b0

cc @milan-elastic

elastic-sonarqube · 2024-05-23T07:17:13Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

elasticmachine · 2024-05-28T05:38:51Z

Package system - 1.58.1 containing this change is available at https://epr.elastic.co/search?package=system

fix metrics overview dashboard

b782ac5

update pr link in changelog

41510b0

milan-elastic marked this pull request as ready for review May 2, 2024 06:34

milan-elastic requested review from a team as code owners May 2, 2024 06:34

milan-elastic requested a review from drewdaemon May 2, 2024 06:34

milan-elastic requested a review from harnish-elastic May 2, 2024 07:09

harnish-elastic assigned milan-elastic May 3, 2024

harnish-elastic approved these changes May 3, 2024

cmacknz requested review from fearful-symmetry and leehinman May 3, 2024 15:16

milan-elastic added 2 commits May 8, 2024 20:15

reevaluate reducedtimerange

4833769

resolve conflicts

005b20a

drewdaemon reviewed May 9, 2024

harnish-elastic and others added 3 commits May 13, 2024 12:04

Merge branch 'elastic:main' into bugfix-system-metrics-overview

5b98008

updated the ndjson

e817bdb

Merge branch 'main' of github.com:milan-elastic/integrations into bug…

93df9b1

…fix-system-metrics-overview

Merge branch 'bugfix-system-metrics-overview' of github.com:milan-ela…

af90798

…stic/integrations into bugfix-system-metrics-overview

milan-elastic requested a review from drewdaemon May 13, 2024 09:36

drewdaemon approved these changes May 13, 2024

resolve review comments

c9e8f56

ishleenk17 approved these changes May 14, 2024

fearful-symmetry approved these changes May 21, 2024

resolve merge conflict

e284471

milan-elastic merged commit 0e26f73 into elastic:main May 28, 2024
5 checks passed

andrewkroh added the Integration:system System label Jul 22, 2024
