chronosphere.io
Choose Your Own Adventure
Eric D. Schabell
Director Evangelism
@ericschabell{@fosstodon.org}
KCD Porto, 27-28 Sep 2024
Cloud Native Observability Pitfalls
Cloud Native Observability
Cloud Native
Data volume
Experiment:
- Hello World app on a 4-node Kubernetes cluster with Tracing, End User Metrics (EUM), Logs, and Metrics (containers / nodes)
- 30 days == +450 GB
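The speaker notes break that +450 GB figure down by signal type; a back-of-the-envelope sketch using the same rates (1 trace/sec, one EUM beacon per trace, roughly 5 MB of logs per hour, ~1.1 MB of metrics per 10-second scrape) lands in the same ballpark. The per-item sizes are approximations taken from the notes, not exact measurements:

```python
# Rough monthly data volume for the Hello World experiment.
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000

traces_gb  = SECONDS_PER_MONTH * 66_000 / 1e9   # ~66 kB per trace, 1 trace/sec
eum_gb     = SECONDS_PER_MONTH * 397 / 1e9      # ~397 B per EUM beacon
logs_gb    = 5 * 24 * 30 / 1e3                  # ~5 MB of logs per hour
metrics_gb = (SECONDS_PER_MONTH / 10) * 1.1 / 1e3  # 259,200 scrapes of ~1.1 MB

total_gb = traces_gb + eum_gb + logs_gb + metrics_gb
print(round(total_gb))  # ~461; the deck's more careful accounting lands at ~452 GB
```

Either way you slice it, a trivial app on a small cluster produces hundreds of gigabytes of telemetry per month before anyone flips on extra collection.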
Retention
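The speaker notes point out that most companies default to 13 months of retention for all data, while short-lived environments may only need weeks. Since steady-state storage footprint is roughly daily ingest times the retention window, tuning retention per data type pays off immediately. A minimal sketch (the 15 GB/day ingest figure is hypothetical, for illustration only):

```python
# Steady-state storage footprint ≈ daily ingest × retention window.
daily_ingest_gb = 15                 # hypothetical ingest rate

default_retention_days = 13 * 30     # the common "13 months for everything" default
lab_retention_days = 14              # lab env is torn down bi-weekly anyway

default_footprint_gb = daily_ingest_gb * default_retention_days
lab_footprint_gb = daily_ingest_gb * lab_retention_days
print(default_footprint_gb, lab_footprint_gb)  # 5850 vs 210 GB held at any moment
```

Same ingest, same workload: the only variable is how long you insist on keeping data that a bi-weekly rebuilt environment can never use.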
Cloud Native at Scale
Observability…
Cloud Native Observability at Scale
O11y at Scale (need)
Picking Your Pitfalls
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
(Click on a pitfall to jump to that section)
1. Ignoring existing landscape
If they can’t see me… they can’t hurt me…
Prometheus for metrics, alerting, queries
Prometheus auto discovery
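As a sketch of what that auto discovery looks like in practice, here is a minimal Prometheus scrape config using Kubernetes service discovery and the conventional (but not mandatory) `prometheus.io/scrape` pod annotation:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod            # discover every pod via the Kubernetes API
    relabel_configs:
      # keep only pods that opt in via the conventional annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

New workloads become scrape targets the moment they are deployed, with no Prometheus config changes.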
Manual instrumentation (java client lib)
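With the Java client library (or any Prometheus client), your manually instrumented counters and gauges ultimately get rendered in the Prometheus text exposition format on the `/metrics` endpoint. A stdlib-only Python sketch of that output format (not the real client library API) shows what a single sample line looks like:

```python
def render_counter(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_counter("http_requests_total", {"method": "GET", "code": "200"}, 1027)
print(line)  # http_requests_total{code="200",method="GET"} 1027
```

A real endpoint also carries `# HELP` and `# TYPE` metadata lines per metric; the point here is that every label you add multiplies the lines (series) this endpoint exposes.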
Short link: bit.ly/prom-workshop
[Diagram] OpenTelemetry (auto) instrumentation: Applications (Java) use OTel auto-instrumentation libraries together with the OTel API and SDK, exporting telemetry over OTLP.
[Diagram] OpenTelemetry Collector (agent): applications with OTel auto instrumentation (API, SDK) send telemetry over OTLP to an OTel Collector agent on the same host, which forwards it over OTLP to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
[Diagram] OpenTelemetry Collector (gateway): Collector agents on each host forward telemetry over OTLP to a standalone Collector gateway, which exports it to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
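A minimal sketch of a Collector gateway configuration along these lines: receive OTLP from the agents, batch, and export to the backend. The endpoint and backend name here are placeholders, not from the deck:

```yaml
receivers:
  otlp:
    protocols:
      grpc:    # agents push OTLP over gRPC (port 4317 by default)
      http:    # or OTLP over HTTP (port 4318 by default)

processors:
  batch: {}    # batch telemetry before export

exporters:
  otlphttp:
    endpoint: http://observability-backend:4318   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

The same pipeline pattern extends to `metrics:` and `logs:` sections, which is what makes the gateway a single control point per cluster or region.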
Short link: bit.ly/opentelemetry-workshop
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
2. Focusing on The Pillars
Pillars
Phases
Developer
Technology
Bottom up
Pillar problems…
Car is on fire…
Better outcomes…
Faster remediation…
Easier detection…
Happier customers…
Phase 1
Know something is happening as fast as possible…
Phase 2
Triage with specific information…
Phase 3
Understand to ensure it never happens again…
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
3. Sneaky sprawling mess
Over 66% of organizations use more than 10 different observability tools.
– ESG report on the exploding volumes of observability data
Know
Triage
Understand
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
4. Controlling costs
“It’s remarkable how common this situation is, where an organization is paying more for their observability data than they do for their production infrastructure.”
– ?
O11y data storage costs are broken.
The “keep everything” model?
Know the cost of observability metrics data?
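Most teams cannot answer that question because metrics billing compounds quietly: active series times scrape frequency times the per-sample rate. A rough cost sketch makes the shape of the bill visible; every number here is hypothetical, for illustration only (real pricing models differ per vendor):

```python
# Rough cost model: active series × samples per month × $/million samples.
active_series = 2_000_000                # hypothetical series count
scrape_interval_s = 30
samples_per_month = active_series * (60 * 60 * 24 * 30 // scrape_interval_s)
cost_per_million_samples = 0.01          # hypothetical $ per 1M samples ingested

monthly_cost = samples_per_month / 1e6 * cost_per_million_samples
print(f"${monthly_cost:,.0f}/month")
```

Note that halving the scrape interval or doubling the label cardinality each doubles this bill; that is the lever a control plane gives you before the data is stored.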
[Diagram] Control costs and improve productivity: an Observability Platform (data collection, control plane, store, lens) fed by a Telemetry Pipeline that reduces, enriches, and secures data, transforming and routing data in your environment and storing data in third-party log & SIEM solutions.
Chronosphere named a Leader in the 2024 Gartner® Magic Quadrant™ for Observability Platforms
Gartner, Magic Quadrant for Observability Platforms: By Gregg Siegfried, Padraig Byrne, Mrudula Bangera, Matt Crossley (12 August 2024)
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered
trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved.
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors
with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be
construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability
or fitness for a particular purpose.
https://chronosphere.io/2024-gartner-magic-quadrant
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
5. The protocol jungle
Without open standards, you’ll not find a way back…
[Diagram] OpenTelemetry Collector (agent): applications with OTel auto instrumentation (API, SDK) send telemetry over OTLP to an OTel Collector agent on the same host, which forwards it over OTLP to the observability backend (Prometheus, Jaeger, Fluent Bit, etc.).
Prometheus for metrics, alerting, queries
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
6. Underestimating cardinality
The struggle is real
“I don't yet collect spans/traces because I can hardly get our devs to care about basic metrics, let alone traces.”
“This is a large enterprise with approx. 1000 developers. Cultivating a culture of engineering that cares about availability is a challenge that we need to solve alongside any technical implementations.”
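Cardinality is easy to underestimate because it is pure multiplication: the active series count for one metric name is the product of the distinct values of every label attached to it. A short sketch with illustrative (hypothetical) label counts:

```python
# Series count for ONE metric name = product of distinct values per label.
label_cardinalities = {"service": 50, "endpoint": 30, "status_code": 5, "pod": 200}

series = 1
for distinct_values in label_cardinalities.values():
    series *= distinct_values

print(f"{series:,}")  # 1,500,000 series from a single metric name
```

Add one more label, say a `user_id` with thousands of values, and the same innocent-looking metric explodes by three orders of magnitude.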
10 hours
on average, per week, trying to triage and understand incidents - a quarter of a 40-hour work week
33% said those issues disrupted their personal life
39% admitted they are frequently stressed out
Cloud Native Observability at Scale
[Diagram] Control costs and improve productivity: an Observability Platform (data collection, control plane, store, lens) fed by a Telemetry Pipeline that reduces, enriches, and secures data, transforming and routing data in your environment and storing data in third-party log & SIEM solutions.
1. Ignoring existing landscape
2. Focusing on The Pillars
3. Sneaky sprawling mess
4. Controlling costs
5. The protocol jungle
6. Underestimating cardinality
Picking Your Next Pitfall
What should be the #1 item on your cloud native observability wishlist?
Questions?
Eric D. Schabell
Director Evangelism
@ericschabell{@fosstodon.org}

Editor's Notes

  • #1: Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? When you're moving so fast with agile practices across your DevOps, SREs, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives. A hasty decision up front leads to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud native observability, followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them! Key takeaways - This session is never the same twice, as you the audience / attendees choose from a list of cloud native observability pitfalls that DevOps have to contend with in their daily cloud native lives! Super engaging and fun to tour the challenges that interest you most!
  • #2: Cloud native is one thing, but doing it at scale requires great observability. This is the promise of cloud native observability and it feels like we’ve been let down…. let’s examine what we mean by cloud native and observability at scale along with some of the unexpected things you will have to deal with.
  • #3: Cloud native means your organization is using the following to deliver and run its applications and infrastructure: Kubernetes; containers; cloud providers; automated deployment; delivery pipelines; DevOps methodologies and teams
  • #4: Observability metric data explosion will cause plenty of issues, not to mention costs… dare to flip the switch on new data collection? An experiment: Hello World application was deployed to a four node Kubernetes cluster on GKE. Load was generated using the script that comes with the app. Wrote some additional scripting to scrape the Prometheus end points and record the size of the data payloads. Another script accepted Jaeger tracing spans and EUM beacons, recording the size of the data payloads. Fluentd collected all the logs and concatenated them all into one flat file. Using the timestamps from the log file, one hour was extracted into a new file, which was then measured. Observability Data Volume: Tracing At a rate of 1 trace per second, over 24 hours per day and 30 days in a month, the total number of traces is 2.5 million. The average trace size was 66kB. Therefore, the total data size for traces was 161GB. Looks like my estimate of fitting inside 100GB has already been proved wrong. While Tracing can be sampled at source, that would mean having to throw away nearly half of the data to fit inside the original estimate of 100GB. Observability Data Volume: EUM Each back-end call is triggered by a user interaction at the browser, which produces an EUM beacon – conveniently making the number of beacons generated the same as the number of traces – 2.5 million. The average size of an End User Metrics (EUM) beacon is a lot smaller at 397 bytes, making our total data size for a month of EUM beacons 1 GB. Observability Data Volume: Logs For logs, especially when it comes to data volumes, your mileage may vary – depending on your app, configuration settings, etc. The application logs generate quite a bit at INFO level, though not nearly as much as some other real-world applications. From the experiment, the log file size for one hour was 5 MB, making the total log volume for one month 3.4 GB. 
Observability Data Volume: Metrics Collected metrics – using Prometheus – from every container, each worker node, and from kube-state-metrics for the cluster, giving a total of 1.1 MB per sample period. With a sample every ten seconds, that’s 259,200 samples per month, which results in a total data volume of 285 GB. Total Observability Data Volumes The grand total across all datasets is 452 GB per month for a simple Hello World application running on a small Kubernetes cluster. A note on data granularity: As you may or may not know, Instana collects all metrics at 1-second granularity. Doing this with Prometheus would devastatingly skew the experiment results, since Prometheus has none of the optimizations built into the Instana sensors and agents. Thus, the experiment was conducted at a 10-second sample rate for Prometheus metrics. The load generation script produces one request per second to the application back-end services. (Source: The Hidden Cost of Data Observability)
  • #5: Most companies default to 13 months retention for all data. But in the modern cloud native architecture, where we are deploying multiple times a day, and a container is only around for a couple of hours, a huge amount of that modern observability data does not need to be retained for 13 months. One tactic for reducing your data footprint is setting the optimal retention period for each data type. For example, you might only need to keep observability data from your lab environment for two weeks if the environment is torn down and rebuilt on a bi-weekly basis anyways. Source -- The Growth of Observability Data is out of Control
  • #6: It’s one thing to use Cloud Native technologies, but it’s a whole other story to do it at scale: DevOps methodologies and teams; auto scaling; cloud data; on-call engineers; shift-left as stress climbs; tooling sprawl
  • #7: Observability data in the old world of monolithic, VM-based architectures. What does that look like when life is simple: open source standards; single pane of glass as the first attempt; metrics and logs; on-call engineers
  • #8: Data flood is the reality of o11y at scale… drowning. What does that look like when life is not simple: metrics, logs, tracing mess; MTTD & MTTR (remediation) failures; on-call engineers; tooling sprawl; cloud data spend
  • #9: This is what it can and should look like; even though the world is complex, we can create functionality and order: know, triage, understand; worst/shortest time to repair (MTTR to focus on WTTR); observability SREs; centralized o11y teams; platform engineering teams; on-call engineers
  • #10: NOTE: It’s choose your own adventure time! All items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, so you can jump around).
  • #11: Don't ignore the cost of instrumenting your application landscape: open standards; tracing with OTel; metrics with Prometheus client libraries or custom instrumentation (Java example)
  • #12: You have a lot of applications out there in your architecture, but during the decision-making process around cloud native observability, they are often forgotten. The cost that they bring is in the hidden facts around instrumentation. You have auto-instrumentation that is quick and easy, but often does not bring the needed insights. On top of that, auto-instrumentation presents you with extra data points from metrics and tracing activities that you most likely are not all that interested in. Manual instrumentation is the work that is needed to provide the actual insights and data you want to collect from your application landscape, but that brings an often unquantified amount of work (aka costs) with it to change, test, and deploy new versions of existing applications.
  • #13: Where can you find the cloud native standards? The CNCF is the community for all things cloud native and observability, home to the widely adopted open standards for all manner of observability protocols: Prometheus, OpenTelemetry, Jaeger, Fluentd, and Fluent Bit.
  • #14: Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL the query language for visualizing the data.
  • #15: This also includes standards in communication to detect services across various cloud native technologies.
  • #16: While some of the data can be automatically gathered, that’s just generic information often based on the language you are using for your applications and services. Manual instrumentation is the cost you can’t forget, where you need to make code changes and redeploy.
  • #17: Explore instrumenting your applications in this workshop, where a Java example lets developers experience what it’s like to instrument an application (lab 8); see the entire workshop at https://bit.ly/prom-workshop.
  • #18: Instrumentation Libraries OpenTelemetry supports a broad number of components that generate relevant telemetry data from popular libraries and frameworks for supported languages. For example, inbound and outbound HTTP requests from an HTTP library will generate data about those requests. It is a long-term goal that popular libraries are authored to be observable out of the box, such that pulling in a separate component is not required. For more information, see Instrumenting Libraries. Automatic Instrumentation If applicable a language specific implementation of OpenTelemetry will provide a way to instrument your application without touching your source code. While the underlying mechanism depends on the language, at a minimum this will add the OpenTelemetry API and SDK capabilities to your application. Additionally they may add a set of Instrumentation Libraries and exporter dependencies. For more information, see Instrumenting.
  • #19: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Agent: A Collector instance running with the application or on the same host as the application (e.g. binary, sidecar, or daemonset). For more information, see Collector.
  • #20: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Gateway: One or more Collector instances running as a standalone service (e.g. container or deployment) typically per cluster, data center or region. For more information, see Collector.
  • #21: This OpenTelemetry workshop also has a lab for exploring a tracing example for developers to experience what it’s like to instrument an application in lab 5, see entire workshop at https://bit.ly/opentelemetry-workshop.
  • #22: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #23: Don't focus on pillars, but on phases: logs, metrics, traces, events; know, triage, and understand; more important integration; use and track what you need, not what you produce
  • #24: When exploring the world of o11y, there are two very distinct lines of discussion: Pillars of observability Phases of observability
  • #25: Pillars of Observability: It’s the same as is often found in a developer world, where it's all about technology. A very developer centric and bottom up approach to any technical problem.
  • #26: The problem with the three pillars is that you are talking about technology aspects and not about solutions. It's like talking about the tools in a mechanic's toolbox used to make your car run again…
  • #27: …instead of focusing on the blue smoke coming out of the exhaust, the rising engine temperature, and using that data to quickly remediate the problem by replacing the seals to prevent oil leaking in the engine.
  • #28: We all want to have better business outcomes for our organizations' solutions, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value.
  • #29: The phases you go through start with knowing the problem is happening as fast as possible and might even lead to fixing it immediately.
  • #30: If not immediately fixable, then you start triaging based on specific information related to the problem which quickly leads to fixing it.
  • #31: Finally, you want to have a very deep understanding of the issues you just encountered to ensure it never happens again.
  • #32: Do you still want to be faced with these kinds of dashboards, where you are forced into the pillars of observability?
  • #33: Or are you ready for the phases of observability where none of these phases require you to focus on data types or specific technology details. They do need you to have the o11y platform in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.
  • #34: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #35: Tooling sprawl: avoid individual tooling for each type of monitoring data Using multiple tools for each pillar of observability—metrics, events, logs, and traces—can often become a burden when selecting observability tools. According to a report by ESG, two-thirds of organizations today use more than 10 different observability tools for their needs. https://chronosphere.io/learn/esg-report-managing-the-exploding-volumes-of-observability-data/?utm_source=schabell-blog&utm_medium=eric
  • #36: A new tool for each type of observability data? A report by ESG says two-thirds of organizations today use more than 10 different observability tools for their needs. https://chronosphere.io/learn/esg-report-managing-the-exploding-volumes-of-observability-data/?utm_source=schabell-blog&utm_medium=eric
  • #37: Do you still want to be faced with these kinds of dashboards, where you are forced into the pillars of observability?
  • #38: We all want to have better business outcomes for our organizations' solutions, such as faster remediation of problems, easier problem detection, greater revenue generation, happier customers, and engineering teams that can remain focused on delivering more business value. The problem with the popular three pillars (metrics, logs, tracing) is that you are talking about technology aspects and not about solutions. It's like talking about the tools in a mechanic's toolbox used to make your convertible run again, instead of focusing on the blue smoke coming out of the exhaust, the rising engine temperature, and using that data to quickly remediate the problem by replacing the seals to prevent oil leaking in the engine. Let’s quickly tour the phases that lead to better outcomes and get our focus back on effective observability goals. Key takeaways - Modern cloud native observability needs three guiding phases to provide better outcomes, not tooling. Based on article https://www.schabell.org/2022/09/o11y-guide-cloud-native-observability-needs-phases.html
  • #39: Or are you ready for the phases of observability where none of these phases require you to focus on data types or specific technology details. They do need you to have the o11y platform in place that can provide sharply focused insights and put enough information at your fingertips for you to make informed decisions quickly.
  • #40: NOTE: all items on this list are links that jump to the pitfall listed, allowing you to let the audience pick the next one (it’s copied to the end of each pitfall section, allowing you to jump around).
  • #41: Don't ignore the tooling cost models: storage costs; ability to filter out data not used; scaling costs
  • #42: And for what purpose? If these organizations could draw a straight line from more data to better outcomes — higher levels of availability, happier customers, faster remediation, more revenue — this tradeoff might make sense. But in many cases, this isn’t true. “Paying more for logging/metrics/tracing doesn’t equate to a positive user experience. Consider how much data can be generated and shipped. $$$. You still need good people to turn data into action.” It’s remarkable how common this situation is, where an organization is paying more for their observability data (typically metrics, logs, traces, and sometimes events), than they do for their production infrastructure. -- The Growth of Observability Data is out of Control
  • #43: Does your observability tooling need to store everything? Cloud native data is a flood these days, so you need observability that provides filtering, aggregation, and the plain ability to disregard data of no importance to your organization. You can pre-process at collection, use data pipeline features, or ensure your observability tooling has some sort of data control plane with real-time insight into the data. If you can decide what is important before you store it, you are going to be a much happier observability team!
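  The filter-then-aggregate idea above can be sketched in a few lines. This is a conceptual toy, not any real pipeline's API; the metric names, label keys, and the KEEP_METRICS allowlist are all illustrative assumptions.

```python
# Toy sketch: drop metrics nobody queries, then aggregate away a
# high-churn label (pod) before anything hits storage.
from collections import defaultdict

KEEP_METRICS = {"http_requests_total", "cpu_usage"}  # hypothetical allowlist

def filter_and_aggregate(samples):
    """Keep only queried metrics, then sum away the per-pod label."""
    totals = defaultdict(float)
    for s in samples:
        if s["name"] not in KEEP_METRICS:
            continue  # dropped here: never stored, never billed
        # Collapse the 'pod' label, keep only the 'service' dimension
        key = (s["name"], s["labels"].get("service"))
        totals[key] += s["value"]
    return [{"name": n, "service": svc, "value": v}
            for (n, svc), v in totals.items()]

samples = [
    {"name": "http_requests_total", "labels": {"service": "cart", "pod": "cart-1"}, "value": 5},
    {"name": "http_requests_total", "labels": {"service": "cart", "pod": "cart-2"}, "value": 3},
    {"name": "debug_internal_gauge", "labels": {"service": "cart", "pod": "cart-1"}, "value": 42},
]
print(filter_and_aggregate(samples))  # one series left, value 8.0
```

  Three incoming samples become one stored series: the unqueried debug gauge is dropped and the two per-pod counters collapse into one per-service total. That is the whole "decide what is important before you store it" argument in miniature.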
  • #44: Ask the audience to share their accounts, customers, and personal experiences as feedback. A banking customer's OpSec team wanted to leverage the cloud provider's observability in the load balancers by using one load balancer label per application. However, the number of labels per load balancer was limited, so even though utilization was under 10%, the dev team had to buy more load balancers (one of the most expensive components in the cloud) to meet that expectation. After many escalations that was solved; however, they then ran into another issue: the load balancers supported only a limited number of context paths. So again they had to multiply the number of load balancers, without ever hitting the traffic limit. The alternative, a simple NGINX behind the cloud load balancer, was not permitted because of lifecycle management (LCM); nobody wanted to own the LCM of that instance. Who observes the cost of subpar architectural decisions? Auditing, monitoring, tracing: beautiful capabilities, highly necessary for properly observing the health of the app and its ability to serve our customers in a timely and secure manner. But if each customer purchase generates N logs, those logs will grow exponentially. Who owns that data strategy? It looks like we'll be kicking off with FinOps.
  • #45: Key takeaway: Chronosphere offers two products, an observability platform and a telemetry pipeline, and we are leaders in reliability and scalability. Our products help you control data volumes and let developers solve problems faster. The Observability Platform is a full-service, end-to-end SaaS observability platform that ingests metrics, events, traces, and logs. The Telemetry Pipeline can ingest any log data, transform it, and route it to any destination.
Data collection: Collect and manage any telemetry type in hundreds of open source or proprietary formats, from Prometheus and OpenTelemetry to Datadog or Splunk, regardless of scale. Speaking of OSS, with Chronosphere there is no vendor lock-in: the platform is open source compatible end to end, from ingestion to querying, dashboarding, and alerting.
Control plane: The data then flows into the control plane we just talked about. As a reminder, the control plane analyzes the value of the data and lets you act on that information to optimize the data before it's stored, so you only pay for what's most useful. We analyze all of the ways you use the data, from dashboards to alerts to queries, so you know which data is actually being used. We've seen customers like Snap, Affirm, and Zillow optimize their data by more than 85%. We also help you govern your data set and provide the ability to perform chargeback.
Data store: The data is then sent to our data stores. We have the most efficient engine for collecting and storing your data; we learned how to do industrial, performant, and highly available storage from building M3 at Uber. We can ingest hundreds of millions of data points per second and petabytes of log data per day. Our promised SLA is 99.9% uptime, but our delivered uptime has historically been better than 99.99%. We can achieve this because our single-tenant architecture prevents noisy-neighbor problems.
Lens: This is our own user interface, just discussed, that optimizes how end users debug the system. You have hundreds, maybe thousands of dashboards; imagine if users could jump in and immediately see the most relevant data. Whether they own services or maintain infrastructure, there is a tailored view that brings together insights from all telemetry types. Everything is in one place and tailored for resolving incidents and troubleshooting.
  • #46: “Chronosphere, the leading cloud native observability platform, announced its entrance into the FinOps Foundation today to help address the rising costs of cloud infrastructure. Chronosphere's 2023 Cloud Native Observability report revealed 87% of engineers using cloud native architectures say it has increased the complexity of discovering and troubleshooting incidents —leading to greater costs such as increases in solution charges and inefficient use of engineer's time. As more companies transition to the cloud, there's been a movement to form centralized FinOps teams, which embed financial governance and accountability into engineering and cloud operations teams. The rapid growth of this movement is staggering, leading to the formation of the FinOps Foundation in 2019, which today boasts a community of more than 11,000. Today, 90% of the Fortune 50 now have FinOps teams and Chronosphere is now the first observability company to join the foundation.” (Source: https://www.prnewswire.com/news-releases/chronosphere-becomes-first-observability-company-to-join-the-finops-foundation-301884878.html)
  • #47: Get the report: https://chronosphere.io/2024-gartner-magic-quadrant
  • #48: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #49: Don't take the path of non-open standards: the CNCF community is full of observability standards, including OpenTelemetry, Prometheus, PromQL, and Fluentd / Fluent Bit.
  • #50: Open standards are the holy grail for any architect and any organization wanting to survive longer than a few years in the cloud native world. Without a standard, you don’t have an exit strategy for your cloud native observability tooling.
  • #51: The CNCF is the community for all things cloud native and observability, home to the widely adopted open standards for all manner of observability protocols: Prometheus, OpenTelemetry, Jaeger, Fluentd, and Fluent Bit.
  • #52: The OpenTelemetry Collector is a vendor-agnostic proxy that can receive, process, and export telemetry data. It supports receiving telemetry data in multiple formats (for example, OTLP, Jaeger, Prometheus, as well as many commercial/proprietary tools) and sending data to one or more backends. It also supports processing and filtering telemetry data before it gets exported. Collector contrib packages bring support for more data formats and vendor backends. Agent: A Collector instance running with the application or on the same host as the application (e.g. binary, sidecar, or daemonset). For more information, see Collector.
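  The Collector's receive, process, and export stages described above can be illustrated with a tiny conceptual sketch. To be clear, this is a toy model in Python, not the real Collector (which is written in Go and driven by YAML config); all function and field names here are invented for illustration.

```python
# Conceptual model of the Collector pipeline: receiver -> processor -> exporter(s).

def receive(batch):
    # Stands in for a receiver (OTLP, Jaeger, Prometheus, ...): accept telemetry.
    return list(batch)

def process(spans):
    # Stands in for a filter/attributes processor: drop unwanted data
    # before it is ever exported.
    return [s for s in spans if s.get("status") != "DEBUG"]

def export(spans, backends):
    # Stands in for one or more exporters: fan the same processed
    # data out to multiple backends.
    return {backend: spans for backend in backends}

incoming = [
    {"name": "GET /cart", "status": "OK"},
    {"name": "internal noise", "status": "DEBUG"},
]
routed = export(process(receive(incoming)), ["jaeger", "prometheus"])
print(routed["jaeger"])  # only the non-DEBUG span survives
```

  The point of the sketch is the shape, not the code: because processing sits between receiving and exporting, filtering happens once, centrally, regardless of how many formats come in or how many backends the data fans out to.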
  • #53: Widely adopted and accepted standards for metrics can be found in the Prometheus project, including time-series storage, communication protocols to scrape (pull) data from targets, and PromQL, the query language for visualizing the data.
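  As a feel for what PromQL computes, here is the core arithmetic behind `rate()`: the per-second increase of a counter over a window. This is a deliberately simplified pure-Python sketch with made-up sample data; real PromQL `rate()` additionally handles counter resets and extrapolates to the window boundaries.

```python
# Simplified model of PromQL's rate(): per-second increase of a counter.
def rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    (t0, v0) = samples[0]
    (t1, v1) = samples[-1]
    increase = v1 - v0  # counters only go up (ignoring resets here)
    return increase / (t1 - t0)

# A request counter scraped every 15 seconds over one minute:
samples = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(rate(samples))  # 2.0 requests per second
```

  The counter grew by 120 over 60 seconds, so the query answers "about 2 requests per second", which is the question a dashboard actually asks, rather than the raw ever-growing total.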
  • #54: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #55: Underestimating the effects of data cardinality: what are the paths to effective action when the cardinality bomb explodes? How do you corral costs while troubleshooting? Controlling cardinality is the single most cost-effective measure a tool can offer.
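  Why does the cardinality bomb explode? Each unique combination of label values is a separate time series, so the series count is the product of the label value counts. A small sketch with invented label names and sizes makes the multiplication concrete:

```python
# Active series for ONE metric = product of the value counts of its labels.
from itertools import product

labels = {
    "service":  [f"svc-{i}" for i in range(50)],    # 50 services
    "pod":      [f"pod-{i}" for i in range(200)],   # 200 pods
    "endpoint": [f"/api/{i}" for i in range(30)],   # 30 endpoints
}

series = list(product(*labels.values()))
print(len(series))  # 50 * 200 * 30 = 300,000 series for a single metric name
```

  Now imagine someone adds a `user_id` label with 10,000 values: 300,000 becomes 3 billion series for one metric. This multiplicative growth is why an innocent-looking label is usually the detonator, and why dropping or aggregating one high-churn label (as in the filtering discussion earlier) recovers so much cost.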
  • #56: Setting the stage with the problem in a nutshell: the struggles that DevOps teams, in all their variations, face. Let's look at why they struggle to understand each other and/or have no time to invest in caring about availability. Source: https://www.reddit.com/r/sre/comments/xyt2re/request_tracing_or_not
  • #57: How do you survive cloud native complexity? With great observability, according to the 2023 Cloud Native Observability Report, a survey of 500 engineers and software developers who weighed in on the ways cloud native complexity makes their jobs harder and their hours longer. With observability, the report concludes, businesses can quickly mitigate incidents, teams innovate faster, and engineering time ROI improves. Done right, observability improves both the top and bottom lines. Findings:
- 96% spend most of their time resolving low-level issues
- 88% report that the time spent troubleshooting IT issues negatively impacts them and their careers
- 33% said those issues disrupted their personal life
- 39% admit they are frequently stressed out
- 22% said they want to quit
- 40% frequently get alerts from their observability solution without enough context to triage the incident
- 59% said half of the incident alerts they receive from their current observability solution aren't actually helpful or usable
- 49% struggle with inconsistent performance using their current approach to observability
- 45% said their current observability solution requires a lot of manual time and labor
- 42% of those using a vendor solution experienced high-severity incidents quarterly or more, versus 61% of those relying on a platform they built
- Among organizations not using a vendor solution, the majority would consider one to enhance team productivity (61%) or improve reliability (54%)
https://go.chronosphere.io/2023-observability-report.html
  • #58: How do you survive cloud native complexity? With great observability, according to the 2023 Cloud Native Observability Report, a survey of 500 engineers and software developers who weighed in on the ways cloud native complexity makes their jobs harder and their hours longer. With observability, the report concludes, businesses can quickly mitigate incidents, teams innovate faster, and engineering time ROI improves. Done right, observability improves both the top and bottom lines. Findings:
- 96% spend most of their time resolving low-level issues
- 88% report that the time spent troubleshooting IT issues negatively impacts them and their careers
- 33% said those issues disrupted their personal life
- 39% admit they are frequently stressed out
- 22% said they want to quit
- 40% frequently get alerts from their observability solution without enough context to triage the incident
- 59% said half of the incident alerts they receive from their current observability solution aren't actually helpful or usable
- 49% struggle with inconsistent performance using their current approach to observability
- 45% said their current observability solution requires a lot of manual time and labor
- 42% of those using a vendor solution experienced high-severity incidents quarterly or more, versus 61% of those relying on a platform they built
- Among organizations not using a vendor solution, the majority would consider one to enhance team productivity (61%) or improve reliability (54%)
https://go.chronosphere.io/2023-observability-report.html
  • #59: Remember, it can be an unchecked data flood; the reality of o11y at scale is the feeling of drowning in too much information and data.
  • #60: Key takeaway: Chronosphere offers two products, an observability platform and a telemetry pipeline, and we are leaders in reliability and scalability. Our products help you control data volumes and let developers solve problems faster. The Observability Platform is a full-service, end-to-end SaaS observability platform that ingests metrics, events, traces, and logs. The Telemetry Pipeline can ingest any log data, transform it, and route it to any destination.
Data collection: Collect and manage any telemetry type in hundreds of open source or proprietary formats, from Prometheus and OpenTelemetry to Datadog or Splunk, regardless of scale. Speaking of OSS, with Chronosphere there is no vendor lock-in: the platform is open source compatible end to end, from ingestion to querying, dashboarding, and alerting.
Control plane: The data then flows into the control plane we just talked about. As a reminder, the control plane analyzes the value of the data and lets you act on that information to optimize the data before it's stored, so you only pay for what's most useful. We analyze all of the ways you use the data, from dashboards to alerts to queries, so you know which data is actually being used. We've seen customers like Snap, Affirm, and Zillow optimize their data by more than 85%. We also help you govern your data set and provide the ability to perform chargeback.
Data store: The data is then sent to our data stores. We have the most efficient engine for collecting and storing your data; we learned how to do industrial, performant, and highly available storage from building M3 at Uber. We can ingest hundreds of millions of data points per second and petabytes of log data per day. Our promised SLA is 99.9% uptime, but our delivered uptime has historically been better than 99.99%. We can achieve this because our single-tenant architecture prevents noisy-neighbor problems.
Lens: This is our own user interface, just discussed, that optimizes how end users debug the system. You have hundreds, maybe thousands of dashboards; imagine if users could jump in and immediately see the most relevant data. Whether they own services or maintain infrastructure, there is a tailored view that brings together insights from all telemetry types. Everything is in one place and tailored for resolving incidents and troubleshooting.
  • #61: NOTE: all items on this list are links that jump to the listed pitfall, letting the audience pick the next one. (The list is copied to the end of each pitfall section so you can jump around.)
  • #62: Bonus pitfall… as a final thought, what’s the number one item at the top of your cloud native observability wishlist?
  • #63: Are you looking at your organization's efforts to enter or expand into the cloud native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud native observability? When you're moving fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud native initiatives. A hasty decision up front leads to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud native observability, followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them! Key takeaway: this session is never the same twice, as you the audience choose from a list of cloud native observability pitfalls that DevOps teams contend with in their daily cloud native lives! Super engaging and fun to tour the challenges that interest you most!