For a few hours yesterday, the internet trembled - AWS's us-east-1 region went down 😱, taking with it a noticeable chunk of the online world. And it wasn't just tech companies or streaming platforms that felt the impact – the outage spilled into real life.

🥼 I heard stories of people unable to see their doctor because the clinic's system ran on AWS.
🧹 Everyday helpers stopped working – smart home devices, voice assistants, even robotic vacuum cleaners refused to start because they couldn't connect to the server, which of course relies on AWS.
⏱️ One colleague couldn't set a timer with Alexa!

But the lesson isn't "cloud is unsafe." The real takeaway is that resilience must be designed, not assumed. Building fault-tolerant systems means preparing for the unexpected:
👉 deploying across multiple regions,
👉 designing for graceful degradation,
👉 isolating failure domains,
👉 having recovery playbooks that actually work when things go sideways,
👉 testing how the system behaves during such an outage in a safe and controlled environment.

At SoftwareMill, we help companies prepare for the unpredictable – designing systems that stay reliable even when the cloud stumbles. If yesterday's outage left you wondering how your system would handle the next big disruption, let's talk. We'll review your setup and help you strengthen your architecture to survive the next battle.
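To illustrate what "graceful degradation" from the list above can look like in code, here is a minimal sketch (hypothetical service URL and a plain in-process cache, not a prescription): when a dependency in the failing region stops answering, the caller gets the last known good response, clearly flagged as stale, instead of an error page.

```python
# A minimal sketch of one graceful-degradation pattern. The upstream URL and the
# cache are illustrative placeholders, not a real implementation.
import time
import urllib.error
import urllib.request

PROFILE_SERVICE_URL = "https://profiles.example.internal/users/"  # hypothetical upstream

_cache: dict[str, str] = {}  # user_id -> last known good payload


def get_user_profile(user_id: str) -> tuple[str, bool]:
    """Return (payload, degraded); degraded=True means stale or default data."""
    try:
        with urllib.request.urlopen(PROFILE_SERVICE_URL + user_id, timeout=2) as resp:
            payload = resp.read().decode("utf-8")
            _cache[user_id] = payload
            return payload, False
    except (urllib.error.URLError, TimeoutError):
        # The upstream (or the region it lives in) is down: fall back to the
        # last known good copy rather than propagating the failure to the user.
        if user_id in _cache:
            return _cache[user_id], True
        # No cached copy either: degrade to a safe default instead of crashing.
        return '{"name": null, "note": "profile temporarily unavailable"}', True
```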
Someone said: Today at Amazon, 70% of code is being pushed by AI. 😀
It was devastating. My Wordle streak was almost ended when I couldn't log on.
I think I'm having déjà vu... Don't just copy & paste the first answer from a guide/manual/SO/ChatGPT. Stop and think about what problem you're trying to solve and find a proper solution. The post below describes the case of hacking a Hyundai IVI system because someone used the example AES key/IV pair from the NIST document SP800-38A: https://programmingwithstyle.com/posts/howihackedmycarguidescreatingcustomfirmware/
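To make the crypto point concrete, here is a hedged sketch (not the firmware's actual code, and assuming the third-party `cryptography` package) of the difference between reusing the published NIST example values and doing it properly - generating your own secret key and a fresh IV for every message.

```python
# The constants below are the *published example* AES-128 key and CBC IV from
# NIST SP800-38A - anyone can look them up, so encrypting with them protects nothing.
from os import urandom

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

NIST_EXAMPLE_KEY = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # never use in production
NIST_EXAMPLE_IV = bytes.fromhex("000102030405060708090a0b0c0d0e0f")   # never use in production


def encrypt(plaintext: bytes) -> tuple[bytes, bytes, bytes]:
    """AES-CBC with a freshly generated secret key and a unique IV per message."""
    key = urandom(32)  # 256-bit key from the OS CSPRNG; keep it in a KMS/HSM, not in source
    iv = urandom(16)   # new random IV for every encryption; stored alongside the ciphertext
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return key, iv, encryptor.update(padded) + encryptor.finalize()
```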
I would say the lesson is more: "Invest in Hardware" :)
I mean, wasn't the initial trigger a DNS issue? I wouldn't really blame AWS for anything. Still, it's a good reason to avoid vendor lock-in and go for a multi-cloud solution.
Absolutely spot on - yesterday's AWS us-east-1 incident was a great reminder that resilience isn't automatic just because we're in the cloud. Even major providers can (and will) fail, and it's on us as engineers to design for failure, not react to it. Multi-region deployments, robust fallbacks, and chaos testing should be part of every serious architecture review. What's fascinating is how these outages ripple far beyond tech, from clinics to vacuum cleaners. It really highlights how deeply cloud reliability has become part of everyday life. Great insights, thanks for framing the lesson so clearly!
It's not necessarily just about being multi-region - as most have said, that's often not enough - but also about being DR-ready. If your company depends on zero downtime, then you need a reliable DR strategy to be able to move your services either to a new region or even to another cloud provider. Easier said than done, of course, when it comes to DNS for example, but most are not ready. This is what I am bringing up with all my clients at the moment. It's a must, not a maybe, and has been for years!
Hmm. But is cloud unsafe? And by cloud I mean governments letting more and more of our infrastructure get into the hands of mega corporations with revenues that would put some countries to shame and the responsibility levels of a toddler in a candy shop? Is that unsafe?
There is a SPOF at AWS: some services are only managed from one region … so being multi-region is good, but not enough!