The Neuroscience of Mistakes in SRE: Why Stress, Not Process, Causes Catastrophic Failures—and How to Fix It
Photo Credit: Aarna Sahu

When One Mistake Brings the World to a Halt

In July 2024, a single faulty update from CrowdStrike grounded flights, shut down hospitals, and paralyzed businesses worldwide—costing over $5 billion in economic losses.

A year earlier, a misconfigured Microsoft Azure update locked millions out of their accounts, crippling businesses reliant on cloud authentication. And in October 2021, a Facebook BGP misconfiguration took down Facebook, Instagram, and WhatsApp for more than six hours, costing over $60 million in lost revenue.

Why do these catastrophic failures keep happening—despite robust DevOps pipelines, automation, and safeguards?

Because failures aren’t just about broken code—they’re about broken cognition.

Even with crystal-clear instructions, stress, cognitive load, and human limitations disrupt decision-making—often in unpredictable ways.

The Left-Hand Experiment: When Stress Overrides Logic

In a simple experiment, Google executive Dave Rensin asked a packed auditorium to "raise your left hand." Despite the clear instruction, 20% raised their right hand and 10% raised both. This was not a lack of intelligence or a misunderstanding; it was cognitive interference caused by stress and mental overload. The experiment reveals a powerful truth: under stress, our brains don’t always execute instructions as expected.

Why Mistakes Happen in Production

In high-stakes production environments, even the best SDLC processes, automated tests, and change management procedures can’t prevent all errors. Why? Because stress and cognitive load override logical decision-making, leading to mistakes despite well-defined processes.

Two Approaches to Prevent Mistakes in Production Engineering

To prevent mistakes, there are two main approaches: the process-driven approach, which relies on structured workflows and automation, and the cognitive-aware approach, which addresses the impact of stress and cognitive load. While most companies focus on process, in high-pressure environments like SRE, stress often overrides logic, leading to human errors. Unless teams address the cognitive side of failure, mistakes will keep happening.

The Neuroscience of Mistakes

Jeff Hawkins, in his book A Thousand Brains, explains that the neocortex consists of thousands of cortical columns, each building its own model of the world and predicting outcomes from sensory data. Under stress, these models fail to update correctly, causing the brain to rely on outdated or incorrect predictions. This is why, during high-pressure incidents, engineers often default to past fixes, even when they no longer apply.

Brain global workspace model, Photo Credits: Aarushi Sahu

🔍 In short: When we’re under pressure, our brain relies on outdated or incorrect mental models—causing errors even when we “know” what to do.

How This Translates to SRE and Beyond

Dave Rensin didn’t just diagnose the problem—he also provided practical solutions.

One of the key methodologies he applied was Chaos Engineering for people—a practice that involves intentionally introducing failures into a system to test its resilience.

By creating controlled environments where teams could experience and learn from failures, he helped reduce the anxiety associated with making mistakes.

This approach not only improved system reliability but also created a culture where mistakes were seen as opportunities for learning—not sources of shame.

Another concept Dave introduced was “Stayaction”—a play on the words “stay” and “action.”

The idea? Create moments of pause in high-stress situations, allowing individuals to reset their emotional state before making a decision.

🚀 A simple yet powerful technique to reduce cognitive load and improve decision-making.
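One way to make such a pause concrete is to build it into the deployment tooling itself. The sketch below is a minimal illustration, not Dave Rensin's method: `PauseGate`, `run_with_pause`, the pause length, and the checklist wording are all hypothetical names and values chosen for this example. It waits for a fixed interval, then requires an explicit "y" for each checklist item before the risky step runs.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PauseGate:
    """Hypothetical helper: enforce a short pause plus a checklist before a risky step."""

    pause_seconds: float
    checklist: List[str]
    # Both callables are injectable so the gate can be exercised without waiting or prompting.
    sleep: Callable[[float], None] = time.sleep
    ask: Callable[[str], str] = input

    def confirm(self) -> bool:
        """Wait for the pause, then require an explicit 'y' for every checklist item."""
        self.sleep(self.pause_seconds)
        for item in self.checklist:
            answer = self.ask(f"{item} [y/N]: ").strip().lower()
            if answer != "y":
                return False
        return True


def run_with_pause(gate: PauseGate, action: Callable[[], str]) -> str:
    """Run the action only if the gate is confirmed; otherwise return 'skipped'."""
    if gate.confirm():
        return action()
    return "skipped"
```

In practice a team might wrap only its highest-risk steps this way, for example with a 60-second pause and two or three questions such as "Is this the right environment?" and "Is a rollback ready?". Keeping the pause short and the checklist small matters; a long gate tends to get bypassed.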

Gratitude and Reflection


Raj with Dave Rensin

As I reflect on the past year, I am grateful for the mentors who have guided me, including my mother and Dave Rensin. In a year marked by rapid technological changes, layoffs, and downsizing, it’s more important than ever to focus on the emotional well-being of our teams and create environments where they can thrive.

How Can You Build a Cognitive-Resilient Team?

🔹 Want to reduce mistakes in production?

  1. Run a Chaos Engineering drill to see how your team handles stress-induced failures.
  2. Introduce "Stayaction" pauses in high-stress deployments.
  3. Assess your team’s cognitive load—are they operating on outdated mental models?
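As a starting point for the first and third steps, a drill can be planned and scored with very little tooling. The sketch below is an assumption-laden illustration: the scenarios, field names, and the `deviated_from_expected_check` measure are hypothetical and are not taken from Dave Rensin's work. It does not inject any fault itself; it only picks a scenario for a facilitator to run and records whether the responder's first action matched the check the scenario calls for.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scenario:
    """A drill scenario; all values used below are illustrative placeholders."""

    name: str
    injected_fault: str
    expected_first_check: str


# Hypothetical scenarios; replace them with ones that fit your own systems.
SCENARIOS = [
    Scenario("stale-dns", "DNS record points at a retired host", "check DNS resolution"),
    Scenario("bad-config", "config push with a wrong timeout", "check recent changes"),
    Scenario("full-disk", "log volume fills the disk", "check disk usage"),
]


def pick_scenario(scenarios: Sequence[Scenario], seed: Optional[int] = None) -> Scenario:
    """Pick one scenario at random; pass a seed to make the choice repeatable."""
    rng = random.Random(seed)
    return rng.choice(scenarios)


@dataclass
class DrillResult:
    """What the facilitator records after the drill."""

    scenario: Scenario
    first_action: str
    minutes_to_diagnosis: float

    def deviated_from_expected_check(self) -> bool:
        """True when the responder's first action was not the scenario's expected first check."""
        return self.first_action != self.scenario.expected_first_check
```

A responder who repeatedly reaches for a familiar fix instead of the expected first check may be working from an outdated mental model, which is worth discussing in a blameless review after the drill.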

And if you haven’t already, I highly recommend reading A Thousand Brains by Jeff Hawkins to better understand the neuroscience behind decision-making.

Let’s build better, more resilient teams—by understanding the brain, not just the process.

Finally, as they say:

"The best product people wear two hats—the naïve optimist and the paranoid realist."

As leaders, perhaps we should do the same: dream boldly but plan meticulously, and always remember—the human brain is both our greatest asset and our biggest vulnerability.

References:

  1. Auditorium experiment by Dave Rensin
  2. Chaos Engineering for people, by Dave Rensin
  3. A Thousand Brains: A New Theory of Intelligence, by Jeff Hawkins
  4. We’ll never have true AI without first understanding the brain (MIT Press)
  5. Numenta research behind the Thousand Brains Theory
  6. Reverse engineering the neocortex 🧠 to revolutionize AI/ML 🤖: an open-source initiative

Thank you for sharing your insights on production failures in critical infrastructure. We've learned that proactive risk assessment can make a significant difference. What preventative measures have you found most effective?

Ravinder Kumar

Cloud Operations | Platform Engineering | Cybersecurity | CI/CD | DevSecOps | Advisor

6mo

It's a timely reminder that even the most robust processes can't shield teams from stress and cognitive overload. Your framing of mistakes as cognitive failures—not just technical ones—is compelling, and I deeply appreciate the emphasis on tools like Chaos Engineering for people and “Stayaction” to build resilience. But what stood out most was the call to leadership. As mentors and leaders, we must take responsibility for creating environments where our teams can think, operate safely, and learn without fear. Psychological safety isn’t a nice-to-have—it’s foundational. Thank you for reminding us that investing in the emotional well-being of our teams is not just kind; it’s critical to reliability and long-term success.


This is great research; thanks for making a complex topic so digestible and for supplying actionable suggestions. I would ideally automate 100% of operations and push decisions up the management chain to an appropriate decision maker. When fear and blame are the norm, though, decisions can get pushed downward too far. I've seen situations where the pressure and scapegoating were driven by the kind of "fear of failure" you talk about in the article, Raj. That's why it's so important to have servant-leaders on the team. Leaders take on appropriate responsibility, and take some heat off the SRE who would rather be fixing it than talking to management. So my advice to the SREs out there (and the platform coders that help them out with automation) is to make it as easy as possible for your management to make decisions, and then hold them accountable for making them.

Pandarinath Siddineni

Domain Head - Systems & Software @ Tata Elxsi | Technology Leader | Designing AI-led Solutions

7mo

Very interesting article Raj.
