For a few hours yesterday, the internet trembled - AWS's us-east-1 region went down 😱, taking with it a noticeable chunk of the online world. And it wasn't just tech companies or streaming platforms that felt the impact – the outage spilled into real life.

🥼 I heard stories of people unable to see their doctor because the clinic's system ran on AWS.
🧹 Everyday helpers stopped working – smart home devices, voice assistants, even robotic vacuum cleaners refused to start because they couldn't connect to the server, which of course relies on AWS.
⏱️ One colleague couldn't set a timer with Alexa!

But the lesson isn't "cloud is unsafe." The real takeaway is that resilience must be designed, not assumed. Building fault-tolerant systems means preparing for the unexpected:
👉 deploying across multiple regions,
👉 designing for graceful degradation,
👉 isolating failure domains,
👉 having recovery playbooks that actually work when things go sideways,
👉 testing how the system behaves during such an outage in a safe and controlled environment.

At SoftwareMill, we help companies prepare for the unpredictable – designing systems that stay reliable even when the cloud stumbles. If yesterday's outage left you wondering how your system would handle the next big disruption, let's talk. We'll review your setup and help you strengthen your architecture to survive the next battle.
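To illustrate what "graceful degradation" from the list above can look like in code, here is a minimal sketch (hypothetical service URL and a plain in-process cache, not a prescription): when a dependency in the failing region stops answering, the caller gets the last known good response, clearly flagged as stale, instead of an error page.

```python
# A minimal sketch of one graceful-degradation pattern. The upstream URL and the
# cache are illustrative placeholders, not a real implementation.
import time
import urllib.error
import urllib.request

PROFILE_SERVICE_URL = "https://profiles.example.internal/users/"  # hypothetical upstream

_cache: dict[str, str] = {}  # user_id -> last known good payload


def get_user_profile(user_id: str) -> tuple[str, bool]:
    """Return (payload, degraded); degraded=True means stale or default data."""
    try:
        with urllib.request.urlopen(PROFILE_SERVICE_URL + user_id, timeout=2) as resp:
            payload = resp.read().decode("utf-8")
            _cache[user_id] = payload
            return payload, False
    except (urllib.error.URLError, TimeoutError):
        # The upstream (or the region it lives in) is down: fall back to the
        # last known good copy rather than propagating the failure to the user.
        if user_id in _cache:
            return _cache[user_id], True
        # No cached copy either: degrade to a safe default instead of crashing.
        return '{"name": null, "note": "profile temporarily unavailable"}', True
```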
Someone said: Today at Amazon, 70% of code is being pushed by AI. 😀
It was devastating. My Wordle streak was almost ended when I couldn't log on.
I think I'm having déjà vu... Don't just copy & paste the first answer from a guide/manual/SO/ChatGPT. Stop and think about what problem you're trying to solve and find a proper solution. The post below describes the case of hacking a Hyundai IVI system because someone used the example AES key/IV pair from the NIST document SP800-38A: https://programmingwithstyle.com/posts/howihackedmycarguidescreatingcustomfirmware/
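To make the crypto point concrete, here is a hedged sketch (not the firmware's actual code, and assuming the third-party `cryptography` package) of the difference between reusing the published NIST example values and doing it properly - generating your own secret key and a fresh IV for every message.

```python
# The constants below are the *published example* AES-128 key and CBC IV from
# NIST SP800-38A - anyone can look them up, so encrypting with them protects nothing.
from os import urandom

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

NIST_EXAMPLE_KEY = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # never use in production
NIST_EXAMPLE_IV = bytes.fromhex("000102030405060708090a0b0c0d0e0f")   # never use in production


def encrypt(plaintext: bytes) -> tuple[bytes, bytes, bytes]:
    """AES-CBC with a freshly generated secret key and a unique IV per message."""
    key = urandom(32)  # 256-bit key from the OS CSPRNG; keep it in a KMS/HSM, not in source
    iv = urandom(16)   # new random IV for every encryption; stored alongside the ciphertext
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return key, iv, encryptor.update(padded) + encryptor.finalize()
```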
I would say the lesson is more: "Invest in Hardware" :)
I mean, wasn't the initial trigger a DNS issue? I wouldn't really blame AWS for anything. Still, it's a good reason to avoid vendor lock-in and go for a multi-cloud solution.
Absolutely spot on - yesterday's AWS us-east-1 incident was a great reminder that resilience isn't automatic just because we're in the cloud. Even major providers can (and will) fail, and it's on us as engineers to design for failure, not react to it. Multi-region deployments, robust fallbacks, and chaos testing should be part of every serious architecture review. What's fascinating is how these outages ripple far beyond tech, from clinics to vacuum cleaners. It really highlights how deeply cloud reliability has become part of everyday life. Great insights, thanks for framing the lesson so clearly!
It's not necessarily just about being multi-region - as most have said, that's often not enough - but also about being DR-ready. If your company depends on zero downtime, then you need a reliable DR strategy to be able to move your services either to a new region or even to another cloud provider. Easier said than done, of course, when it comes to DNS for example, but most are not ready. This is what I am bringing up with all my clients at the moment. It's a must, not a maybe, and has been for years!
Hmm. But is cloud unsafe? And by cloud I mean governments letting more and more of our infrastructure get into the hands of mega corporations with revenues that would put some countries to shame and the responsibility levels of a toddler in a candy shop? Is that unsafe?
There is a SPOF at AWS: some services are only managed from one region … so being multi-region is good, but not enough!