How Observability is Changing in Technology


  • View profile for David Hope

    AI, LLMs, Observability product @ Elastic

    4,481 followers

    Reflecting on 16+ years in tech: the evolution of observability has been remarkable. From basic monitoring dashboards to today's AI-powered systems, the transformation continues to accelerate. Remember troubleshooting by jumping between dozens of dashboards, manually correlating metrics to pinpoint issues? Those days now feel distant.

    I've witnessed three distinct phases in this evolution:

    𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴 𝟭.𝟬: Simple threshold-based alerts answering "is it up or down?" Those 3 AM pages often turned out to be non-critical issues.

    𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝟮.𝟬: Unified collection of logs, metrics, and traces. We gained context but still spent hours correlating data points across complex cloud environments.

    𝗔𝗜-𝗣𝗼𝘄𝗲𝗿𝗲𝗱 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 𝟯.𝟬: Where we're heading now. AI recognizes patterns across vast infrastructures, predicts issues before user impact, and suggests remediation steps.

    At Elastic, I've witnessed how AI transforms observability from reactive monitoring to proactive intelligence. Key shifts include:

    1️⃣ Automated anomaly detection that learns your environment's normal behavior
    2️⃣ Intelligent notifications that understand context and priority
    3️⃣ AI-assisted incident resolution that identifies likely root causes
    4️⃣ Natural language interfaces that let you query your infrastructure in plain English

    Perhaps most valuable: observability is becoming increasingly invisible. The best systems shouldn't demand constant attention; they should notify you only when needed, with actionable context.

    For organizations scaling cloud environments, this means fewer midnight alerts, faster mean time to resolution, and infrastructure management that shifts from firefighting to strategic planning.

    I'm curious: Where do you see AI taking observability in the next 5 years? Will systems eventually self-heal autonomously, or will human expertise remain irreplaceable?

    #Observability #AIOps #CloudInfrastructure #SRE #TechEvolution
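To make shift 1️⃣ concrete: the simplest form of "learning your environment's normal behavior" is a rolling baseline with an outlier cutoff. A toy sketch (this is not Elastic's implementation; window size, threshold, and the latency numbers are invented for illustration):

```python
from collections import deque
import math

class BaselineDetector:
    """Learns a running baseline from recent samples and flags outliers.

    A toy stand-in for 'anomaly detection that learns normal behavior';
    real systems use richer models (seasonality, multivariate signals)."""

    def __init__(self, window=60, threshold=3.0):
        self.samples = deque(maxlen=window)  # sliding window of recent values
        self.threshold = threshold           # z-score cutoff

    def observe(self, value):
        """Record a sample; return True if it deviates from the baseline."""
        if len(self.samples) >= 10:  # need some history before judging
            mean = sum(self.samples) / len(self.samples)
            var = sum((x - mean) ** 2 for x in self.samples) / len(self.samples)
            std = math.sqrt(var) or 1e-9  # avoid divide-by-zero on flat data
            is_anomaly = abs(value - mean) / std > self.threshold
        else:
            is_anomaly = False  # still warming up
        self.samples.append(value)
        return is_anomaly

detector = BaselineDetector()
for v in [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 102]:
    detector.observe(v)           # learn "normal" latency around 100 ms
spike = detector.observe(500)     # a 500 ms spike stands out
```

No static threshold was configured for the spike; the cutoff adapts to whatever the window has seen, which is the contrast with Monitoring 1.0's fixed alerts.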

  • View profile for Josh Clemm

    Vice President of Engineering, Dropbox

    5,997 followers

    The headlines for Anthropic's multi-agent blog post missed the killer feature: self-healing infra.

    I recently posted how that architecture resembles large-scale microservice architectures we've been building for years. And those architectures have a major flaw: one flaky downstream and you're in trouble. So we designed for failure. We built architectures that feature load balancers, circuit breakers, bulkheads, auto-scalers, timeouts, retries, exponential backoff, rate limits, and more. Really, anything to stop cascading failures before they took out half the fleet. And the more self-healing the solutions, the more resilient we were.

    Fast-forward to multi-agent AI. We're still going to orchestrate across "services," but the edges look different: prompts, tool calls, and context windows. And the best part? The prompt layer can now heal itself. Whenever Claude trips up on a bad endpoint, it will evaluate the situation, rewrite its own prompt, and ship the fix. All with zero human intervention. Future agents will then use that fixed tool. Reliability moves from "catch, contain, retry, or fallback" to "learn-and-adapt."

    What does this represent for the future? We can shift toil from architects and SREs to the models themselves. Each self-patch of the prompt compounds. The knowledge base of what works grows as fast as your token budget allows. And observability changes from metrics and traces to prompt revision logs.

    Similar playbook, new altitude. We still need many of the common resiliency patterns like retries, but we can now layer on adaptive prompts that fix problems at the reasoning layer before they escalate. We're building this at Dropbox. Who else is building this into their stack?
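One of the classic resiliency patterns the post lists, retries with exponential backoff and jitter, fits in a few lines. A minimal sketch (the `flaky` service and the delay constants are made up for illustration; real clients would catch narrower exception types):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry fn, doubling the delay each attempt, with jitter to avoid
    synchronized retry storms ('thundering herd') across callers."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)

# A downstream that fails twice, then recovers.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("flaky downstream")
    return "ok"

result = call_with_backoff(flaky, sleep=lambda _: None)  # no real waits in the demo
```

The "learn-and-adapt" layer the post describes would sit above this: instead of only retrying the same call, an agent could rewrite the request itself between attempts.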

  • View profile for Krishna Gade

    Founder & CEO at Fiddler

    15,030 followers

    Much of the narrative around 2025 focuses on AI agents, but the real transformation is happening in multi-agent systems.

    LLMs changed how we generate and interact with information. Agents extended this with autonomous planning, decision-making, and tool use. The next evolution is coordination: systems of agents working together, asynchronously and autonomously, across complex workflows.

    This shift comes with new challenges. Multi-agent systems can be an order of magnitude more complex to monitor. Each agent has its own reasoning trace, tool interactions, and decisions, plus shared state and communication dependencies with other agents. Traditional APM tools weren't built for this kind of semantic, dynamic behavior.

    That's why at Fiddler AI we're working on Agentic Observability: a new paradigm that merges infrastructure telemetry, LLM introspection, and agent-level reasoning into a unified control plane. It enables developers to:

    🔍 Trace decisions across sessions, agents, and tools
    🧠 Understand emergent behaviors from agent interactions
    ⚠️ Surface misalignments and regressions before they escalate

    This isn't just about uptime or latency: it's about debugging cognition and coordination in distributed intelligent systems. As agent ecosystems become more autonomous and complex, we'll need new primitives for observability, ones that account for goal alignment, adaptive behavior, and runtime reasoning. That's the direction we're heading.

    If you're building or thinking about agent teams, we'd love to hear how you're approaching this. Please read this blog to learn more and watch a quick demo: https://lnkd.in/ge5ZyA8x

    #AgenticAI #MultiAgentSystems #Observability #LLMOps #AIInfrastructure #AIEngineering #SemanticTracing #ControlPlaneForAI

  • View profile for Neha Pawar

    Head of Data Infra at StarTree

    4,416 followers

    I recently had the opportunity to write an article for The New Stack titled Reimagining Observability: The Case for a Disaggregated Stack https://lnkd.in/gVKd7c6s 🧠 🎨 🖌️

    Here's a brief summary of what I discuss in the article: The observability stack as we know it is changing. The traditional, all-or-nothing o11y stack leads to a loss of flexibility and higher costs 💸 🪢

    Briefly, examining each layer of the stack to understand why:

    🔭 Agents
    - Vendors have heavily invested in their agents, which tend to be tailored to specific formats within their stacks.
    - But today, agents have become commoditized, and customers are less willing to pay a premium for proprietary options. They want the flexibility to use their own agents alongside standards like OTEL, which make it easy to send data to various backends and reuse it for multiple use cases.

    📩 Collection
    - In traditional o11y vendor solutions, egress costs inevitably skyrocket: agents are deployed within customer accounts, and the massive volume of metrics, logs, and traces collected must be shipped to the vendor's account.
    - In a disaggregated stack, you can leverage streaming infrastructure (like Kafka or Redpanda) that's likely already part of your data ecosystem for collection. These systems are agnostic to agent formats, easily interface with standards like OTEL, and often have native integrations with storage systems. Most importantly, they give you the flexibility to use your data for many more applications beyond o11y.

    📦 🔍 Storage and Query
    - The storage and query layer is the most challenging piece. It must handle extremely high volume and velocity of data, which directly translates to extremely high cost. It must also handle high variety, in the form of diverse input formats, data types, unstructured payloads, and high-cardinality dimensions.
    - Compared to all-in-one solutions, systems purpose-built for low-latency real-time analytics (such as Apache Pinot, ClickHouse, and Apache Druid) are far better suited for such data. In particular, Apache Pinot offers robust real-time ingestion integrations, along with an army of encoding, compression, and indexing techniques, rich query capabilities, and native storage tiering, and it has been proven at external-facing real-time analytics scale. The biggest advantage of disaggregation, however, is again having full access to your own data and being able to utilize it for many more use cases.

    🪄 🎩 Visualization
    - In an all-in-one stack, this layer is highly inflexible. You can't use the tools from the stack to visualize other datasets, nor can you use your own visualization tools with the data in the stack.
    - In a disaggregated stack, you have the flexibility to use popular tools like Grafana or Superset, or even build your own app.

    In the blog, I dive into the challenges of each layer and explain why disaggregation is the better solution in terms of cost, performance, and flexibility. Do give it a read!
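The key property of the collection layer described above is fan-out: telemetry is written once to a format-agnostic stream and read independently by many consumers. A toy illustration, with a plain Python class standing in for a Kafka/Redpanda topic (no real broker or OTEL pipeline involved; record shapes are invented):

```python
import json
from collections import defaultdict

class Stream:
    """Stand-in for a Kafka/Redpanda topic: an append-only byte log that
    is agnostic to agent formats and readable by many consumers."""
    def __init__(self):
        self.records = []

    def produce(self, record: dict):
        self.records.append(json.dumps(record).encode())  # serialize once

    def consume(self):
        # Each consumer gets the full log; reading does not remove data,
        # which is what lets o11y, alerting, and ML share one pipeline.
        return [json.loads(r) for r in self.records]

telemetry = Stream()

# An agent writes mixed telemetry to the one stream...
telemetry.produce({"type": "metric", "name": "cpu.util", "value": 0.93})
telemetry.produce({"type": "log", "body": "OOMKilled pod checkout-7f"})

# ...and a downstream consumer routes by type to different backends.
by_type = defaultdict(list)
for rec in telemetry.consume():
    by_type[rec["type"]].append(rec)
```

Because the stream retains everything, adding a new use case (say, training an anomaly model) means adding a consumer, not re-instrumenting agents.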

  • View profile for Sagar Navroop

    Multi-Cloud Data Architect | AI | SIEM | Observability

    3,664 followers

    Is indexless storage the new default for AI- and quantum-powered observability?

    𝐎𝐛𝐬𝐞𝐫𝐯𝐚𝐛𝐢𝐥𝐢𝐭𝐲 is seeing inside systems to spot problems before users find them. As cloud-native systems evolve, this practice is shifting from traditional monitoring to AI- and quantum-enhanced insights. The choice between indexed and indexless storage becomes vital, especially when scalability, anomaly detection, and deep correlations are essential.

    𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 & 𝐈𝐧𝐠𝐞𝐬𝐭𝐢𝐨𝐧: Indexed storage excels at structured queries and near real-time insights. It enforces schema on write by default, which makes it slower for ingestion but lightning-fast for structured access and alerting. Indexless storage flips the model, ingesting telemetry at scale with minimal delay, making it ideal for dynamic, schema-less data.

    𝐀𝐈-𝐃𝐫𝐢𝐯𝐞𝐧 𝐎𝐛𝐬𝐞𝐫𝐯𝐚𝐛𝐢𝐥𝐢𝐭𝐲: Indexed systems are good for SLA/SLO tracking, alerts, and AI workloads. However, indexless storage enables long-term retention of raw telemetry, perfect for model training, anomaly scanning, and behavior prediction. AI can parse massive unindexed data pools to find subtle trends over time, when precision matters.

    𝐐𝐮𝐚𝐧𝐭𝐮𝐦-𝐃𝐫𝐢𝐯𝐞𝐧 𝐀𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬: Quantum computing uses quantum bits to process complex data faster than classical computers. For observability workloads, it could speed up anomaly detection, optimize trace correlation, and enhance predictive analytics across massive telemetry datasets, delivering real-time insights in highly dynamic, large-scale cloud environments that are hard to optimize using pre-indexed data. Indexless storage supports this by allowing full-scan queries across massive datasets without schema bias.

    𝐈𝐦𝐩𝐚𝐜𝐭 𝐎𝐧 𝐌𝐞𝐦𝐨𝐫𝐲, 𝐃𝐢𝐬𝐤 𝐒𝐩𝐚𝐜𝐞 & 𝐂𝐨𝐦𝐩𝐮𝐭𝐞: Indexed systems consume more CPU, disk space, and memory during ingestion and querying due to indexing and caching. Indexless systems reduce resource strain during ingestion but spike in CPU/memory during broad queries. For AI + quantum workloads, compute optimization often favors indexless, especially when paired with powerful query engines or serverless analytics frameworks.

    𝐇𝐨𝐰 𝐎𝐛𝐬𝐞𝐫𝐯𝐚𝐛𝐢𝐥𝐢𝐭𝐲 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧 𝐏𝐫𝐨𝐯𝐢𝐝𝐞𝐫𝐬 𝐒𝐭𝐚𝐜𝐤-𝐔𝐩: Datadog, AppDynamics, Grafana Labs, and New Relic offer schema-on-write, indexed storage. Dynatrace, Sumo Logic, and Honeycomb default to indexless storage. Coralogix and Elastic offer a bit more flexibility, supporting both indexed and indexless storage options.

    𝐓𝐚𝐤𝐞𝐚𝐰𝐚𝐲: For modern, high-volume observability workloads, indexless storage delivers scale, cost-efficiency, and agility, lowering total cost of ownership (TCO).

    Do you notice similar trends in other industries as well? Please add your thoughts!

    #observability #loganalytics #performancemonitoring #infrastructuremonitoring #twominutedigest
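The indexed-versus-indexless trade-off described above can be shown in miniature: schema-on-write pays a parsing and indexing cost on every ingested line in exchange for cheap lookups, while schema-on-read appends raw bytes and pays with a full scan at query time. A toy sketch (the log lines and field syntax are invented):

```python
from collections import defaultdict

RAW = [  # raw telemetry lines, as an indexless store would keep them
    "level=info svc=checkout msg=ok",
    "level=error svc=checkout msg=timeout",
    "level=error svc=search msg=oom",
]

# Indexed (schema-on-write): parse every line at ingest and build an
# inverted index; ingestion does more work, lookups are near-instant.
index = defaultdict(list)
for i, line in enumerate(RAW):
    for token in line.split():
        index[token].append(i)  # token -> line numbers containing it

indexed_hits = [RAW[i] for i in index["level=error"]]

# Indexless (schema-on-read): ingest was a plain append; the query
# scans everything, trading query-time CPU for cheap, fast ingestion.
scan_hits = [line for line in RAW if "level=error" in line]
```

Both paths return the same rows; the difference is purely where the cost lands, which is the resource trade-off the post describes for memory, disk, and compute.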

  • View profile for Jaya Gupta

    Partner @ Foundation Capital

    21,517 followers

    We've seen countless buzzwords come and go. "AIOps" is the latest in a long line of catchy but ultimately misguided terms that fail to capture the true potential of AI in the world of IT ops and observability.

    The observability market has been fragmented for years, with leading vendors like Datadog, New Relic, and Splunk rarely capturing more than 20% market share. Why? Because fundamentally, observability has been treated as a big data problem rather than an intelligence problem.

    Why is there an opportunity to change this paradigm now? LLMs offer a unified approach to data understanding. Their ability to process and correlate heterogeneous data types – logs, metrics, and traces – in their raw formats breaks down some of the silos that have plagued observability. The transformer architecture underlying LLMs excels at capturing long-range dependencies, crucial for understanding system-wide patterns. LLMs also bring the power of zero-shot and few-shot learning, meaning they can adapt to new failure modes without extensive retraining, addressing the perennial issue of concept drift in rapidly evolving systems.

    Beware the technical challenges: LLMs currently struggle with both tabular and time series data, common formats in observability. Although we anticipate that innovations in newer architectures, multimodality, and multi-agent systems will mitigate some of these challenges over time, near-term solutions will require creative workarounds from builders.

    Read more on our point of view below! Foundation Capital https://lnkd.in/grqHXNMF
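The few-shot adaptation mentioned above amounts to showing the model a handful of labeled examples at query time instead of retraining. A minimal sketch of the prompt-assembly side only (no model call; the label set, log lines, and function name are invented for illustration):

```python
def few_shot_triage_prompt(log_line, examples):
    """Build a few-shot prompt asking an LLM to label a raw log line.

    Adapting to a new failure mode means editing `examples`,
    not retraining a model -- the point of few-shot learning."""
    shots = "\n".join(f"Log: {l}\nLabel: {lab}" for l, lab in examples)
    return (
        "Classify each log line as 'benign' or 'failure'.\n\n"
        f"{shots}\n"
        f"Log: {log_line}\n"
        "Label:"
    )

examples = [
    ("GET /health 200 3ms", "benign"),
    ("OOMKilled container payments-5c", "failure"),
]
prompt = few_shot_triage_prompt("connection reset by peer upstream=db-1", examples)
```

The trailing "Label:" leaves the completion to the model; the same raw-text interface works for logs, metrics summaries, or trace snippets, which is the unification argument of the post.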

  • View profile for Ashu Garg

    Enterprise VC-engineer-company builder. Early investor in @databricks, @tubi and 6 other unicorns - @cohesity, @eightfold, @turing, @anyscale, @alation, @amperity, | GP@Foundation Capital

    37,341 followers

    The observability market has been fragmented for 20 years. Modern distributed systems generate an astronomical amount of telemetry data. As a result, enterprises juggle 10 different tools, each with unique query languages and data models, to get a holistic view of system health, wasting their engineers' time. AI was supposed to make this better, but "smart" observability attempts fell short (for technical and multifaceted reasons we detail in the post).

    LLMs put us on the cusp of a fundamental shift in how organizations monitor, debug, and optimize their increasingly complex software systems. They provide unified data understanding, correlating logs, metrics, and traces. They bring zero-shot learning, natural language interfaces, and context-aware analysis. They give us a path to truly automated root cause analysis, reducing MTTR by an order of magnitude.

    "AI ops" is a buzzword that doesn't capture the true potential of AI in the world of IT ops and observability. While there are challenges, the economic shift will be profound. Gartner's predictions for AIOps are modest ($3.1B by 2025). But when Jaya and I started exploring this space, we saw something much larger: automating SREs is worth 50x-100x that.

    Read for more: https://lnkd.in/gvBgfN7k
