The official RCA (Root Cause Analysis), "Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region," covering this week's outage is now available at https://lnkd.in/gN2bFPNW Here's a truncated summary generated using an LLM:

<<TL;DR>>
The event occurred in the Northern Virginia (US-EAST-1) Region on October 19 and 20, 2025. It affected Amazon DynamoDB, Network Load Balancer (NLB), and EC2 instance launches. DynamoDB experienced increased API error rates due to a latent defect in its automated DNS management system, causing endpoint resolution failures. This was caused by a race condition in the DNS Enactor, which led to an incorrect empty DNS record for the service's regional endpoint. NLB experienced increased connection errors on some load balancers due to health check failures in the NLB fleet. New EC2 instance launches failed, and some newly launched instances experienced connectivity issues. These issues were resolved by 1:50 PM on October 20. AWS has implemented additional controls to prevent similar issues in the future and is working on a more resilient design for the DynamoDB DNS management system. They are also enhancing the robustness of the NLB health check system and improving the EC2 instance launch process.

...

In closing
We apologize for the impact this event caused our customers. While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.
<<End of TL;DR>>

#aws #rca
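To make the DNS failure mode easier to picture, here is a minimal, hypothetical Python sketch of a check-then-act race between workers applying DNS "plans" and an automated cleanup step. All names (Plan, enactor, cleaner, the endpoint string) are invented for illustration; this is not AWS's actual DNS Enactor code, only a rough analogy for how a stale write plus cleanup can leave an endpoint with an empty record.

```python
import threading
import time
from dataclasses import dataclass

# Simulated DNS store: endpoint name -> set of IPs currently published.
records = {}
records_lock = threading.Lock()
ENDPOINT = "dynamodb.region.example"

@dataclass
class Plan:
    version: int
    ips: set

PLANS = {1: Plan(1, {"10.0.0.1"}), 2: Plan(2, {"10.0.0.2"})}
latest_applied = 0  # highest plan version believed to be applied

def enactor(plan, delay):
    """Check that the plan is not older than the latest applied, then write it.
    The sleep between the check and the write is where the race lives."""
    global latest_applied
    if plan.version >= latest_applied:      # check...
        time.sleep(delay)                   # ...slow worker is delayed...
        with records_lock:
            records[ENDPOINT] = plan.ips    # ...act: a stale plan can overwrite a newer one
            latest_applied = max(latest_applied, plan.version)

def cleaner(active_plan):
    """Remove anything that does not match the plan believed to be active.
    If a stale enactor overwrote the record, this wipes the endpoint entirely."""
    with records_lock:
        if records.get(ENDPOINT) != active_plan.ips:
            records[ENDPOINT] = set()       # empty record -> endpoint resolution failures

# The slow worker passes its version check first, the fast worker applies the
# newer plan, then the slow worker clobbers it with the older plan.
slow = threading.Thread(target=enactor, args=(PLANS[1], 0.2))
fast = threading.Thread(target=enactor, args=(PLANS[2], 0.0))
slow.start()
time.sleep(0.05)    # let the slow worker pass its check before the fast one writes
fast.start()
slow.join(); fast.join()

cleaner(PLANS[2])
print(records)      # {'dynamodb.region.example': set()} -- nothing left to resolve
```

The actual defect AWS describes is more involved, but the shape is similar: a delayed actor plus automated cleanup can converge on a state no one intended.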
Thanks for sharing the RCA, Mani Chandrasekaran! It's helpful to see the detailed breakdown of the root causes and the steps being taken to prevent recurrence. Transparency like this is crucial for building trust and learning as a community. 👍
TL;DR version: as usual, the DNS architecture and configuration got messed up. Not the first time, and I'm afraid it won't be the last. The RCA is good, but where are the preventive actions to ensure this does not happen in the future?
Thanks. Customers need to design for resilience. Moving off of AWS / public cloud etc. would be a knee-jerk reaction. If the RPO and RTO demand multi-region resiliency, then it is time to design for it. I wrote a short post for CIOs and referenced thought leaders in it. https://www.linkedin.com/posts/srikrishnan-sundararajan_my-heart-goes-out-to-all-the-on-call-engineers-activity-7386263915665510400-We2p?utm_source=share&utm_medium=member_ios&rcm=ACoAAAA5WfABFLobni29lCQlzKIiqAf6mS1xcLA
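Agreed on designing for it. As one small, hypothetical illustration of that principle (not a full DR strategy), a read path can fall back to another region's replica of a DynamoDB global table. The table name, key shape, and regions below are assumptions made for the example.

```python
# Hypothetical sketch: fall back to a second region's replica when the primary
# region's DynamoDB endpoint is unreachable. Assumes a global table replicated
# to both regions.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"
FAILOVER_REGION = "us-west-2"
TABLE_NAME = "orders"  # hypothetical table name

def get_item_with_fallback(key):
    """Read from the primary region first; on connectivity or service errors,
    retry the same read against the failover region's replica."""
    last_error = None
    for region in (PRIMARY_REGION, FAILOVER_REGION):
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(TABLE_NAME)
            return table.get_item(Key=key).get("Item")
        except (EndpointConnectionError, ClientError) as exc:
            last_error = exc  # e.g. endpoint resolution failure in the primary region
    raise last_error

# Example usage:
# item = get_item_with_fallback({"order_id": "12345"})
```

This only covers reads; write routing, replication lag, and failback need their own design, which is where the RPO/RTO conversation really starts.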