chronosphere.io
Choose Your Own Adventure
Eric D. Schabell
Director Evangelism
@ericschabell{@fosstodon.org}
KCD Porto, 27-28 Sep 2024
Cloud Native Observability Pitfalls
Cloud Native Observability
Cloud Native
Data volume
Experiment:
- Hello World app on a 4-node Kubernetes cluster with Tracing, End User Metrics (EUM), Logs, and Metrics (containers / nodes)
- 30 days == +450 GB
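The speaker notes break that +450 GB figure down by signal type; a back-of-the-envelope sketch using the same rates (1 trace/sec, one EUM beacon per trace, roughly 5 MB of logs per hour, ~1.1 MB of metrics per 10-second scrape) lands in the same ballpark. The per-item sizes are approximations taken from the notes, not exact measurements:

```python
# Rough monthly data volume for the Hello World experiment.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000

traces_gb  = SECONDS_PER_MONTH * 66_000 / 1e9   # ~66 kB per trace, 1 trace/sec
eum_gb     = SECONDS_PER_MONTH * 397 / 1e9      # ~397 B per EUM beacon
logs_gb    = 5 * 24 * 30 / 1e3                  # ~5 MB of logs per hour
metrics_gb = (SECONDS_PER_MONTH / 10) * 1.1 / 1e3  # 259,200 scrapes of ~1.1 MB

total_gb = traces_gb + eum_gb + logs_gb + metrics_gb
print(round(total_gb))  # ~461; the deck's more careful accounting lands at ~452 GB
```

Either way you slice it, a trivial app on a small cluster produces hundreds of gigabytes of telemetry per month before anyone flips on extra collection.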
Retention
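The speaker notes point out that most companies default to 13 months of retention for all data, while short-lived environments may only need weeks. Since steady-state storage footprint is roughly daily ingest times the retention window, tuning retention per data type pays off immediately. A minimal sketch (the 15 GB/day ingest figure is hypothetical, for illustration only):

```python
# Steady-state storage footprint ≈ daily ingest × retention window.
daily_ingest_gb = 15                 # hypothetical ingest rate

default_retention_days = 13 * 30     # the common "13 months for everything" default
lab_retention_days = 14              # lab env is torn down bi-weekly anyway

default_footprint_gb = daily_ingest_gb * default_retention_days
lab_footprint_gb = daily_ingest_gb * lab_retention_days
print(default_footprint_gb, lab_footprint_gb)  # 5850 vs 210 GB held at any moment
```

Same ingest, same workload: the only variable is how long you insist on keeping data that a bi-weekly rebuilt environment can never use.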
Cloud Native at Scale
Observability…
Cloud Native Observability at Scale
O11y at Scale (need)
Picking Your Pitfalls
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
(Click on a pitfall to jump to that section)
1. Ignoring existing landscape
If they can’t see me… they can’t hurt me…
Prometheus for metrics, alerting, queries
Prometheus auto discovery
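As a sketch of what that auto discovery looks like in practice, here is a minimal Prometheus scrape config using Kubernetes service discovery and the conventional (but not mandatory) `prometheus.io/scrape` pod annotation:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod            # discover every pod via the Kubernetes API
    relabel_configs:
      # keep only pods that opt in via the conventional annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

New workloads become scrape targets the moment they are deployed, with no Prometheus config changes.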
Manual instrumentation (java client lib)
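With the Java client library (or any Prometheus client), your manually instrumented counters and gauges ultimately get rendered in the Prometheus text exposition format on the `/metrics` endpoint. A stdlib-only Python sketch of that output format (not the real client library API) shows what a single sample line looks like:

```python
def render_counter(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_counter("http_requests_total", {"method": "GET", "code": "200"}, 1027)
print(line)  # http_requests_total{code="200",method="GET"} 1027
```

A real endpoint also carries `# HELP` and `# TYPE` metadata lines per metric; the point here is that every label you add multiplies the lines (series) this endpoint exposes.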
Short link: bit.ly/prom-workshop
[Diagram] OpenTelemetry (auto) instrumentation: Applications (Java) use OTel auto-instrumentation libraries together with the OTel API and SDK, exporting telemetry over OTLP.
[Diagram] OpenTelemetry Collector (agent): applications with OTel auto instrumentation (API, SDK) send telemetry over OTLP to an OTel Collector agent on the same host, which forwards it over OTLP to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
[Diagram] OpenTelemetry Collector (gateway): Collector agents on each host forward telemetry over OTLP to a standalone Collector gateway, which exports it to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
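A minimal sketch of a Collector gateway configuration along these lines: receive OTLP from the agents, batch, and export to the backend. The endpoint and backend name here are placeholders, not from the deck:

```yaml
receivers:
  otlp:
    protocols:
      grpc:    # agents push OTLP over gRPC (port 4317 by default)
      http:    # or OTLP over HTTP (port 4318 by default)

processors:
  batch: {}    # batch telemetry before export

exporters:
  otlphttp:
    endpoint: http://observability-backend:4318   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

The same pipeline pattern extends to `metrics:` and `logs:` sections, which is what makes the gateway a single control point per cluster or region.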
Short link: bit.ly/opentelemetry-workshop
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
2. Focusing on The Pillars
Pillars
Phases
Developer
Technology
Bottom up
Pillar problems…
Car is on fire…
Better outcomes…
Faster remediation…
Easier detection…
Happier customers…
Phase 1
Know something is happening as fast as possible…
Phase 2
Triage with specific information…
Phase 3
Understand to ensure it never happens again…
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
3. Sneaky sprawling mess
Over 66% of organizations use more than 10 different observability tools.
– ESG report on the exploding volumes of observability data
Know
Triage
Understand
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
4. Controlling costs
“It’s remarkable how common this situation is, where an organization is paying more for their observability data than they do for their production infrastructure.”
– ?
O11y data storage costs are broken.
The “keep everything” model?
Know the cost of observability metrics data?
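Most teams cannot answer that question because metrics billing compounds quietly: active series times scrape frequency times the per-sample rate. A rough cost sketch makes the shape of the bill visible; every number here is hypothetical, for illustration only (real pricing models differ per vendor):

```python
# Rough cost model: active series × samples per month × $/million samples.
active_series = 2_000_000                # hypothetical series count
scrape_interval_s = 30
samples_per_month = active_series * (60 * 60 * 24 * 30 // scrape_interval_s)
cost_per_million_samples = 0.01          # hypothetical $ per 1M samples ingested

monthly_cost = samples_per_month / 1e6 * cost_per_million_samples
print(f"${monthly_cost:,.0f}/month")
```

Note that halving the scrape interval or doubling the label cardinality each doubles this bill; that is the lever a control plane gives you before the data is stored.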
[Diagram] Control costs and improve productivity: an Observability Platform (data collection, control plane, store, lens) fed by a Telemetry Pipeline that reduces, enriches, and secures data, transforming and routing data in your environment and storing data in third-party log & SIEM solutions.
Chronosphere named a Leader in the 2024 Gartner® Magic Quadrant™ for Observability Platforms
Gartner, Magic Quadrant for Observability Platforms: By Gregg Siegfried, Padraig Byrne, Mrudula Bangera, Matt Crossley (12 August 2024)
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered
trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors
with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be
construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability
or fitness for a particular purpose.
https://chronosphere.io/2024-gartner-magic-quadrant
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
5. The protocol jungle
Without open standards, you’ll not find a way back…
[Diagram] OpenTelemetry Collector (agent): applications with OTel auto instrumentation (API, SDK) send telemetry over OTLP to an OTel Collector agent on the same host, which forwards it over OTLP to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
Prometheus for metrics, alerting, queries
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
6. Underestimating cardinality
The struggle is real
“I don't yet collect spans/traces because I can hardly get our devs to care about basic metrics, let alone traces.”
“This is a large enterprise with approx. 1000 developers. Cultivating a culture of engineering that cares about availability is a challenge that we need to solve alongside any technical implementations.”
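Cardinality is easy to underestimate because it is pure multiplication: the active series count for one metric name is the product of the distinct values of every label attached to it. A short sketch with illustrative (hypothetical) label counts:

```python
# Series count for ONE metric name = product of distinct values per label.
label_cardinalities = {"service": 50, "endpoint": 30, "status_code": 5, "pod": 200}

series = 1
for distinct_values in label_cardinalities.values():
    series *= distinct_values

print(f"{series:,}")  # 1,500,000 series from a single metric name
```

Add one more label, say a `user_id` with thousands of values, and the same innocent-looking metric explodes by three orders of magnitude.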
10 hours
on average, per week, trying to triage and understand incidents - a quarter of a 40-hour work week
33% said those issues disrupted their personal life
39% admitted they are frequently stressed out
Cloud Native Observability at Scale
[Diagram] Control costs and improve productivity: an Observability Platform (data collection, control plane, store, lens) fed by a Telemetry Pipeline that reduces, enriches, and secures data, transforming and routing data in your environment and storing data in third-party log & SIEM solutions.
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
What should be the #1 item on your cloud native observability wishlist?
Questions?
Eric D. Schabell
Director Evangelism
@ericschabell{@fosstodon.org}

Editor's Notes

  • #1: Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? When you're moving so fast with agile practices across your DevOps, SREs, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives. A hasty decision up front leads to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud native observability, followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them! Key takeaways - This session is never the same twice, as you the audience / attendees choose from a list of cloud native observability pitfalls that DevOps have to contend with in their daily cloud native lives! Super engaging and fun to tour the challenges that interest you most!
  • #2: Cloud native is one thing, but doing it at scale requires great observability. This is the promise of cloud native observability and it feels like we’ve been let down…. let’s examine what we mean by cloud native and observability at scale along with some of the unexpected things you will have to deal with.
  • #3: Cloud native means your organization is using the following to deliver and run its applications and infrastructure: Kubernetes; containers; cloud providers; automated deployment; delivery pipelines; DevOps methodologies and teams
  • #4: Observability metric data explosion will cause plenty of issues, not to mention costs… dare to flip the switch on new data collection? An experiment: Hello World application was deployed to a four node Kubernetes cluster on GKE. Load was generated using the script that comes with the app. Wrote some additional scripting to scrape the Prometheus end points and record the size of the data payloads. Another script accepted Jaeger tracing spans and EUM beacons, recording the size of the data payloads. Fluentd collected all the logs and concatenated them all into one flat file. Using the timestamps from the log file, one hour was extracted into a new file, which was then measured. Observability Data Volume: Tracing At a rate of 1 trace per second, over 24 hours per day and 30 days in a month, the total number of traces is 2.5 million. The average trace size was 66kB. Therefore, the total data size for traces was 161GB. Looks like my estimate of fitting inside 100GB has already been proved wrong. While Tracing can be sampled at source, that would mean having to throw away nearly half of the data to fit inside the original estimate of 100GB. Observability Data Volume: EUM Each back-end call is triggered by a user interaction at the browser, which produces an EUM beacon – conveniently making the number of beacons generated the same as the number of traces – 2.5 million. The average size of an End User Metrics (EUM) beacon is a lot smaller at 397 bytes, making our total data size for a month of EUM beacons 1 GB. Observability Data Volume: Logs For logs, especially when it comes to data volumes, your mileage may vary – depending on your app, configuration settings, etc. The application logs generate quite a bit at INFO level, though not nearly as much as some other real-world applications. From the experiment, the log file size for one hour was 5 MB, making the total log volume for one month 3.4 GB. 
Observability Data Volume: Metrics Collected metrics – using Prometheus – from every container, each worker node, and from kube-state-metrics for the cluster, giving a total of 1.1 MB per sample period. With a sample every ten seconds, that’s 259,200 samples per month, which results in a total data volume of 285 GB. Total Observability Data Volumes The grand total across all datasets is 452 GB per month for a simple Hello World application running on a small Kubernetes cluster. A note on data granularity: As you may or may not know, Instana collects all metrics at 1-second granularity. Doing this with Prometheus would devastatingly skew the experiment results, since Prometheus has none of the optimizations built into the Instana sensors and agents. Thus, the experiment was conducted at a 10-second sample rate for Prometheus metrics. The load generation script produces one request per second to the application back-end services. (Source: The Hidden Cost of Data Observability)
  • #5: Most companies default to 13 months retention for all data. But in the modern cloud native architecture, where we are deploying multiple times a day, and a container is only around for a couple of hours, a huge amount of that modern observability data does not need to be retained for 13 months. One tactic for reducing your data footprint is setting the optimal retention period for each data type. For example, you might only need to keep observability data from your lab environment for two weeks if the environment is torn down and rebuilt on a bi-weekly basis anyways. Source -- The Growth of Observability Data is out of Control
  • #6: It’s one thing to use Cloud Native technologies, but it’s a whole other story to do it at scale: DevOps methodologies and teams; auto scaling; cloud data; on-call engineers; shift-left as stress climbs; tooling sprawl
  • #7: Observability data in the old world of monolithic, VM-based architectures. What does that look like when life is simple: open source standards; single pane of glass as the first attempt; metrics and logs; on-call engineers
  • #8: Data flood is the reality of o11y at scale… drowning. What does that look like when life is not simple: metrics, logs, tracing mess; MTTD & MTTR (remediation) failures; on-call engineers; tooling sprawl; cloud data spend
  • #9: This is what it can and should look like; even though the world is complex, we can create functionality and order: know, triage, understand; worst/shortest time to repair (MTTR to focus on WTTR); observability SREs; centralized o11y teams; platform engineering teams; on-call engineers
  • #10: NOTE: It’s choose your own adventure time! All items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, so you can jump around).
  • #11: Don't ignore the cost of instrumenting your application landscape: open standards; tracing with OTel; metrics with Prometheus client libraries or custom instrumentation (Java example)
  • #12: You have a lot of applications out there in your architecture, but during the decision-making process around cloud native observability, they are often forgotten. The cost that they bring is in the hidden facts around instrumentation. You have auto-instrumentation that is quick and easy, but often does not bring the needed insights. On top of that, auto-instrumentation presents you with extra data points from metrics and tracing activities that you most likely are not all that interested in. Manual instrumentation is the work that is needed to provide the actual insights and data you want to collect from your application landscape, but that brings an often unquantified amount of work (aka costs) with it to change, test, and deploy new versions of existing applications.
  • #13: Where can you find the cloud native standards? The CNCF is the community for all things cloud native and observability, home to the widely adopted open standards for all manner of observability protocols: Prometheus, OpenTelemetry, Jaeger, Fluentd, and Fluent Bit.
  • #14: Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL the query language for visualizing the data.
  • #15: This also includes standards in communication to detect services across various cloud native technologies.
  • #16: While some of the data can be automatically gathered, that’s just generic information often based on the language you are using for your applications and services. Manual instrumentation is the cost you can’t forget, where you need to make code changes and redeploy.
  • #17: Explore instrumenting your applications in this workshop, where a Java example lets developers experience what it’s like to instrument an application (lab 8); see the entire workshop at https://bit.ly/prom-workshop.
  • #18: Instrumentation Libraries OpenTelemetry supports a broad number of components that generate relevant telemetry data from popular libraries and frameworks for supported languages. For example, inbound and outbound HTTP requests from an HTTP library will generate data about those requests. It is a long-term goal that popular libraries are authored to be observable out of the box, such that pulling in a separate component is not required. For more information, see Instrumenting Libraries. Automatic Instrumentation If applicable a language specific implementation of OpenTelemetry will provide a way to instrument your application without touching your source code. While the underlying mechanism depends on the language, at a minimum this will add the OpenTelemetry API and SDK capabilities to your application. Additionally they may add a set of Instrumentation Libraries and exporter dependencies. For more information, see Instrumenting.
  • #19: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Agent: A Collector instance running with the application or on the same host as the application (e.g. binary, sidecar, or daemonset). For more information, see Collector.
  • #20: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Gateway: One or more Collector instances running as a standalone service (e.g. container or deployment) typically per cluster, data center or region. For more information, see Collector.
  • #21: This OpenTelemetry workshop also has a lab for exploring a tracing example for developers to experience what it’s like to instrument an application in lab 5, see entire workshop at https://bit.ly/opentelemetry-workshop.
  • #22: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #23: Don't focus on pillars, but on phases: logs, metrics, traces, events; know, triage, and understand; more important integration; use and track what you need, not what you produce
  • #24: When exploring the world of o11y, there are two very distinct lines of discussion: Pillars of observability Phases of observability
  • #25: Pillars of Observability: It’s the same as is often found in a developer world, where it's all about technology. A very developer centric and bottom up approach to any technical problem.
  • #26: The problem with the three pillars is that you are talking about technology aspects and not about solutions. It's like talking about the tools in a mechanic's toolbox used to make your car run again…
  • #27: …instead of focusing on the blue smoke coming out of the exhaust, the rising engine temperature, and using that data to quickly remediate the problem by replacing the seals to prevent oil leaking in the engine.
  • #28: We all want to have better business outcomes for our organizations' solutions, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value.
  • #29: The phases you go through start with knowing the problem is happening as fast as possible and might even lead to fixing it immediately.
  • #30: If not immediately fixable, then you start triaging based on specific information related to the problem which quickly leads to fixing it.
  • #31: Finally, you want to have a very deep understanding of the issues you just encountered to ensure it never happens again.
  • #32: Do you still want to be faced with these kinds of dashboards, where you are forced into the pillars of observability?
  • #33: Or are you ready for the phases of observability where none of these phases require you to focus on data types or specific technology details. They do need you to have the o11y platform in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.
  • #34: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #35: Tooling sprawl: avoid individual tooling for each type of monitoring data Using multiple tools for each pillar of observability—metrics, events, logs, and traces—can often become a burden when selecting observability tools. According to a report by ESG, two-thirds of organizations today use more than 10 different observability tools for their needs. https://chronosphere.io/learn/esg-report-managing-the-exploding-volumes-of-observability-data/?utm_source=schabell-blog&utm_medium=eric
  • #36: A new tool for each type of observability data? A report by ESG says two-thirds of organizations today use more than 10 different observability tools for their needs. https://chronosphere.io/learn/esg-report-managing-the-exploding-volumes-of-observability-data/?utm_source=schabell-blog&utm_medium=eric
  • #37: Do you still want to be faced with these kinds of dashboards, where you are forced into the pillars of observability?
  • #38: We all want to have better business outcomes for our organizations' solutions, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value. The problem with the popular three pillars (metrics, logs, tracing) is that you are talking about technology aspects and not about solutions. It's like talking about the tools in a mechanic's toolbox used to make your convertible run again, instead of focusing on the blue smoke coming out of the exhaust, the rising engine temperature, and using that data to quickly remediate the problem by replacing the seals to prevent oil leaking in the engine. Let’s quickly tour the phases that lead to better outcomes and get our focus back on effective observability goals. Key takeaways - Modern cloud native observability needs three guiding phases to provide better outcomes, not tooling. Based on article https://www.schabell.org/2022/09/o11y-guide-cloud-native-observability-needs-phases.html
  • #39: Or are you ready for the phases of observability where none of these phases require you to focus on data types or specific technology details. They do need you to have the o11y platform in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.
  • #40: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #41: Don't ignore the tooling cost models: storage costs; ability to filter out data not used; scaling costs
  • #42: And for what purpose? If these organizations could draw a straight line from more data to better outcomes — higher levels of availability, happier customers, faster remediation, more revenue — this tradeoff might make sense. But in many cases, this isn’t true. “Paying more for logging/metrics/tracing doesn’t equate to a positive user experience. Consider how much data can be generated and shipped. $$$. You still need good people to turn data into action.” It’s remarkable how common this situation is, where an organization is paying more for their observability data (typically metrics, logs, traces, and sometimes events), than they do for their production infrastructure. -- The Growth of Observability Data is out of Control
  • #43: Does your observability tooling need to store everything? Cloud native data is a flood these days, so you need observability that provides filtering, aggregation, and the plain ability to disregard data of no importance to your organization. You can pre-process at collection, use data pipeline features, or ensure your observability tooling has some sort of data control plane with real-time insight into the data. If you can decide what is important before you store it, you are going to be a much happier observability team!
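  The filter-then-aggregate idea above can be sketched in a few lines. This is a conceptual toy, not any real pipeline's API; the metric names, label keys, and the KEEP_METRICS allowlist are all illustrative assumptions.

```python
# Toy sketch: drop metrics nobody queries, then aggregate away a
# high-churn label (pod) before anything hits storage.
from collections import defaultdict

KEEP_METRICS = {"http_requests_total", "cpu_usage"}  # hypothetical allowlist

def filter_and_aggregate(samples):
    """Keep only queried metrics, then sum away the per-pod label."""
    totals = defaultdict(float)
    for s in samples:
        if s["name"] not in KEEP_METRICS:
            continue  # dropped here: never stored, never billed
        # Collapse the 'pod' label, keep only the 'service' dimension
        key = (s["name"], s["labels"].get("service"))
        totals[key] += s["value"]
    return [{"name": n, "service": svc, "value": v}
            for (n, svc), v in totals.items()]

samples = [
    {"name": "http_requests_total", "labels": {"service": "cart", "pod": "cart-1"}, "value": 5},
    {"name": "http_requests_total", "labels": {"service": "cart", "pod": "cart-2"}, "value": 3},
    {"name": "debug_internal_gauge", "labels": {"service": "cart", "pod": "cart-1"}, "value": 42},
]
print(filter_and_aggregate(samples))  # one series left, value 8.0
```

  Three incoming samples become one stored series: the unqueried debug gauge is dropped and the two per-pod counters collapse into one per-service total. That is the whole "decide what is important before you store it" argument in miniature.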
  • #44: Ask the audience to share their accounts, customers, and personal experiences as feedback. A banking customer's OpSec team wanted to leverage the cloud provider's observability in the load balancers by using one load balancer label per application. However, the number of labels per load balancer was limited, so even though utilization was under 10%, the dev team had to buy more load balancers (one of the most expensive components in the cloud) to meet that expectation. After many escalations that was solved; however, they then ran into another issue: the load balancers supported only a limited number of context paths. So again they had to multiply the number of load balancers, without ever hitting the traffic limit. The alternative, a simple NGINX behind the cloud load balancer, was not permitted because of lifecycle management (LCM); nobody wanted to own the LCM of that instance. Who observes the cost of subpar architectural decisions? Auditing, monitoring, tracing: beautiful capabilities, highly necessary for properly observing the health of the app and its ability to serve our customers in a timely and secure manner. But if each customer purchase generates N logs, those logs will grow exponentially. Who owns that data strategy? It looks like we'll be kicking off with FinOps.
  • #45: Key takeaway: Chronosphere offers two products, an observability platform and a telemetry pipeline, and we are leaders in reliability and scalability. Our products help you control data volumes and let developers solve problems faster. The Observability Platform is a full-service, end-to-end SaaS observability platform that ingests metrics, events, traces, and logs. The Telemetry Pipeline can ingest any log data, transform it, and route it to any destination.
Data collection: Collect and manage any telemetry type in hundreds of open source or proprietary formats, from Prometheus and OpenTelemetry to Datadog or Splunk, regardless of scale. Speaking of OSS, with Chronosphere there is no vendor lock-in: the platform is open source compatible end to end, from ingestion to querying, dashboarding, and alerting.
Control plane: The data then flows into the control plane we just talked about. As a reminder, the control plane analyzes the value of the data and lets you act on that information to optimize the data before it's stored, so you only pay for what's most useful. We analyze all of the ways you use the data, from dashboards to alerts to queries, so you know which data is actually being used. We've seen customers like Snap, Affirm, and Zillow optimize their data by more than 85%. We also help you govern your data set and provide the ability to perform chargeback.
Data store: The data is then sent to our data stores. We have the most efficient engine for collecting and storing your data; we learned how to do industrial, performant, and highly available storage from building M3 at Uber. We can ingest hundreds of millions of data points per second and petabytes of log data per day. Our promised SLA is 99.9% uptime, but our delivered uptime has historically been better than 99.99%. We can achieve this because our single-tenant architecture prevents noisy-neighbor problems.
Lens: This is our own user interface, just discussed, that optimizes how end users debug the system. You have hundreds, maybe thousands of dashboards; imagine if users could jump in and immediately see the most relevant data. Whether they own services or maintain infrastructure, there is a tailored view that brings together insights from all telemetry types. Everything is in one place and tailored for resolving incidents and troubleshooting.
  • #46: “Chronosphere, the leading cloud native observability platform, announced its entrance into the FinOps Foundation today to help address the rising costs of cloud infrastructure. Chronosphere's 2023 Cloud Native Observability report revealed 87% of engineers using cloud native architectures say it has increased the complexity of discovering and troubleshooting incidents —leading to greater costs such as increases in solution charges and inefficient use of engineer's time. As more companies transition to the cloud, there's been a movement to form centralized FinOps teams, which embed financial governance and accountability into engineering and cloud operations teams. The rapid growth of this movement is staggering, leading to the formation of the FinOps Foundation in 2019, which today boasts a community of more than 11,000. Today, 90% of the Fortune 50 now have FinOps teams and Chronosphere is now the first observability company to join the foundation.” (Source: https://www.prnewswire.com/news-releases/chronosphere-becomes-first-observability-company-to-join-the-finops-foundation-301884878.html)
  • #47: Get the report: https://chronosphere.io/2024-gartner-magic-quadrant
  • #48: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #49: Don't take the path of non-open standards: the CNCF community is full of observability standards, including OpenTelemetry, Prometheus, PromQL, and Fluentd / Fluent Bit.
  • #50: Open standards are the holy grail for any architect and any organization wanting to survive longer than a few years in the cloud native world. Without a standard, you don’t have an exit strategy for your cloud native observability tooling.
  • #51: The CNCF is the community for all things cloud native and observability, home to the widely adopted open standards for all manner of observability protocols: Prometheus, OpenTelemetry, Jaeger, Fluentd, and Fluent Bit.
  • #52: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Agent: A Collector instance running with the application or on the same host as the application (e.g. binary, sidecar, or daemonset). For more information, see Collector.
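  The Collector's receive, process, and export stages described above can be illustrated with a tiny conceptual sketch. To be clear, this is a toy model in Python, not the real Collector (which is written in Go and driven by YAML config); all function and field names here are invented for illustration.

```python
# Conceptual model of the Collector pipeline: receiver -> processor -> exporter(s).

def receive(batch):
    # Stands in for a receiver (OTLP, Jaeger, Prometheus, ...): accept telemetry.
    return list(batch)

def process(spans):
    # Stands in for a filter/attributes processor: drop unwanted data
    # before it is ever exported.
    return [s for s in spans if s.get("status") != "DEBUG"]

def export(spans, backends):
    # Stands in for one or more exporters: fan the same processed
    # data out to multiple backends.
    return {backend: spans for backend in backends}

incoming = [
    {"name": "GET /cart", "status": "OK"},
    {"name": "internal noise", "status": "DEBUG"},
]
routed = export(process(receive(incoming)), ["jaeger", "prometheus"])
print(routed["jaeger"])  # only the non-DEBUG span survives
```

  The point of the sketch is the shape, not the code: because processing sits between receiving and exporting, filtering happens once, centrally, regardless of how many formats come in or how many backends the data fans out to.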
  • #53: Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL, the query language for visualizing the data.
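  As a feel for what PromQL computes, here is the core arithmetic behind `rate()`: the per-second increase of a counter over a window. This is a deliberately simplified pure-Python sketch with made-up sample data; real PromQL `rate()` additionally handles counter resets and extrapolates to the window boundaries.

```python
# Simplified model of PromQL's rate(): per-second increase of a counter.
def rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    (t0, v0) = samples[0]
    (t1, v1) = samples[-1]
    increase = v1 - v0  # counters only go up (ignoring resets here)
    return increase / (t1 - t0)

# A request counter scraped every 15 seconds over one minute:
samples = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(rate(samples))  # 2.0 requests per second
```

  The counter grew by 120 over 60 seconds, so the query answers "about 2 requests per second", which is the question a dashboard actually asks, rather than the raw ever-growing total.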
  • #54: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #55: Underestimating the effects of data cardinality: what are the paths to effective action when the cardinality bomb explodes? How do you corral costs while troubleshooting? Controlling cardinality is the single most cost-effective measure a tool can offer.
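  Why does the cardinality bomb explode? Each unique combination of label values is a separate time series, so the series count is the product of the label value counts. A small sketch with invented label names and sizes makes the multiplication concrete:

```python
# Active series for ONE metric = product of the value counts of its labels.
from itertools import product

labels = {
    "service":  [f"svc-{i}" for i in range(50)],    # 50 services
    "pod":      [f"pod-{i}" for i in range(200)],   # 200 pods
    "endpoint": [f"/api/{i}" for i in range(30)],   # 30 endpoints
}

series = list(product(*labels.values()))
print(len(series))  # 50 * 200 * 30 = 300,000 series for a single metric name
```

  Now imagine someone adds a `user_id` label with 10,000 values: 300,000 becomes 3 billion series for one metric. This multiplicative growth is why an innocent-looking label is usually the detonator, and why dropping or aggregating one high-churn label (as in the filtering discussion earlier) recovers so much cost.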
  • #56: Setting the stage with the problem in a nutshell: the struggles that DevOps teams, in all their variations, face. Let's look at why they struggle to understand each other and/or have no time to invest in caring about availability. Source: https://www.reddit.com/r/sre/comments/xyt2re/request_tracing_or_not
  • #57: How do you survive cloud native complexity? With great observability, according to the 2023 Cloud Native Observability Report, a survey of 500 engineers and software developers who weighed in on the ways cloud native complexity makes their jobs harder and their hours longer. With observability, the report concludes, businesses can quickly mitigate incidents, teams innovate faster, and engineering time ROI improves. Done right, observability improves both the top and bottom lines. Findings:
- 96% spend most of their time resolving low-level issues
- 88% report that the time spent troubleshooting IT issues negatively impacts them and their careers
- 33% said those issues disrupted their personal life
- 39% admit they are frequently stressed out
- 22% said they want to quit
- 40% frequently get alerts from their observability solution without enough context to triage the incident
- 59% said half of the incident alerts they receive from their current observability solution aren't actually helpful or usable
- 49% struggle with inconsistent performance using their current approach to observability
- 45% said their current observability solution requires a lot of manual time and labor
- 42% of those using a vendor solution experienced high-severity incidents quarterly or more, versus 61% of those relying on a platform they built
- Among organizations not using a vendor solution, the majority would consider one to enhance team productivity (61%) or improve reliability (54%)
https://go.chronosphere.io/2023-observability-report.html
  • #58: How do you survive cloud native complexity? With great observability, according to the 2023 Cloud Native Observability Report, a survey of 500 engineers and software developers who weighed in on the ways cloud native complexity makes their jobs harder and their hours longer. With observability, the report concludes, businesses can quickly mitigate incidents, teams innovate faster, and engineering time ROI improves. Done right, observability improves both the top and bottom lines. Findings:
- 96% spend most of their time resolving low-level issues
- 88% report that the time spent troubleshooting IT issues negatively impacts them and their careers
- 33% said those issues disrupted their personal life
- 39% admit they are frequently stressed out
- 22% said they want to quit
- 40% frequently get alerts from their observability solution without enough context to triage the incident
- 59% said half of the incident alerts they receive from their current observability solution aren't actually helpful or usable
- 49% struggle with inconsistent performance using their current approach to observability
- 45% said their current observability solution requires a lot of manual time and labor
- 42% of those using a vendor solution experienced high-severity incidents quarterly or more, versus 61% of those relying on a platform they built
- Among organizations not using a vendor solution, the majority would consider one to enhance team productivity (61%) or improve reliability (54%)
https://go.chronosphere.io/2023-observability-report.html
  • #59: Remember, it can be an unchecked data flood; the reality of o11y at scale is the feeling of drowning in too much information and data.
  • #60: Key takeaway: Chronosphere offers two products, an observability platform and a telemetry pipeline, and we are leaders in reliability and scalability. Our products help you control data volumes and let developers solve problems faster. The Observability Platform is a full-service, end-to-end SaaS observability platform that ingests metrics, events, traces, and logs. The Telemetry Pipeline can ingest any log data, transform it, and route it to any destination.
Data collection: Collect and manage any telemetry type in hundreds of open source or proprietary formats, from Prometheus and OpenTelemetry to Datadog or Splunk, regardless of scale. Speaking of OSS, with Chronosphere there is no vendor lock-in: the platform is open source compatible end to end, from ingestion to querying, dashboarding, and alerting.
Control plane: The data then flows into the control plane we just talked about. As a reminder, the control plane analyzes the value of the data and lets you act on that information to optimize the data before it's stored, so you only pay for what's most useful. We analyze all of the ways you use the data, from dashboards to alerts to queries, so you know which data is actually being used. We've seen customers like Snap, Affirm, and Zillow optimize their data by more than 85%. We also help you govern your data set and provide the ability to perform chargeback.
Data store: The data is then sent to our data stores. We have the most efficient engine for collecting and storing your data; we learned how to do industrial, performant, and highly available storage from building M3 at Uber. We can ingest hundreds of millions of data points per second and petabytes of log data per day. Our promised SLA is 99.9% uptime, but our delivered uptime has historically been better than 99.99%. We can achieve this because our single-tenant architecture prevents noisy-neighbor problems.
Lens: This is our own user interface, just discussed, that optimizes how end users debug the system. You have hundreds, maybe thousands of dashboards; imagine if users could jump in and immediately see the most relevant data. Whether they own services or maintain infrastructure, there is a tailored view that brings together insights from all telemetry types. Everything is in one place and tailored for resolving incidents and troubleshooting.
  • #61: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #62: Bonus pitfall… as a final thought, what’s the number one item at the top of your cloud native observability wishlist?
  • #63: Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? When you're moving fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives. A hasty decision up front leads to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud native observability, followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them! Key takeaway: this session is never the same twice, as you the audience choose from a list of cloud native observability pitfalls that DevOps teams contend with in their daily cloud native lives! Super engaging and fun to tour the challenges that interest you most!