http://highscalability.com/blog/2011/5/2/the-updated-big-list-of-articles-on-the-amazon-outage.html
The Updated Big List of Articles on the Amazon Outage
Monday, May 2, 2011 at 9:25AM
Todd Hoff in amazon
Since The Big List Of Articles On The Amazon Outage was published
we've had a few updates that people might not have seen. Amazon, of
course, released their Summary of the Amazon EC2 and Amazon
RDS Service Disruption in the US East Region. Netflix shared their
Lessons Learned from the AWS Outage, as did Heroku (How Heroku
Survived the Amazon Outage), SmugMug (How SmugMug survived
the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up
During the AWS Downtime).
The curious thing from my perspective is the general lack of response to
Amazon's explanation. I expected more discussion. There's been almost
none that I've seen. My guess is that very few people understand what
Amazon was talking about well enough to comment, whereas almost everyone
feels qualified to talk about the event itself.
http://weibo.com/developerworks 2012-11-11 整理 第 1/7页
Lesson for crisis handlers: deep dive post-mortems that are timely, long,
honestish, and highly technical are the most effective means of staunching
the downward spiral of media attention.
Amazon's Explanation of What Happened
Summary of the Amazon EC2 and Amazon RDS Service
Disruption in the US East Region
Hacker News thread on AWS Service Disruption Post
Mortem
Quite Funny Commentary on the Summary
AWS outage follow-up: if you wanted details, you got
details! by RightScale
Amazon’s Own Post Mortem by Jeff Darcy
Experiences from Specific Companies, Both Good and Bad
Lessons Netflix Learned from the AWS Outage by several
Netflixians on the Netflix Tech Blog
How Heroku Survived the Amazon Outage on the Heroku status
page
How SimpleGeo Stayed Up During the AWS Downtime by
Mike Malone
How SmugMug survived the Amazonpocalypse by Don
MacAskill (Hacker News discussion)
How Bizo survived the Great AWS Outage of 2011 relatively
unscathed... by Someone at Bizo
Joe Stump's explanation of how SimpleGeo survived
How Netflix Survived the Outage
Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio
Engineering's Blog (Hacker News thread)
On reddit's outage
What caused the Quora problems/outage in April 2011?
Availability, redundancy, failover and data backups at
LearnBoost
How our small startup survived the Amazon EC2 Cloud-
pocalypse from mobile app developer
Recovering from Amazon cloud outage by Drew Engelson of
PBS.
❍ PBS was affected for a while primarily because we do use
EBS-backed RDS databases. Despite being spread across
multiple availability-zones, we weren’t easily able to launch
new resources ANYWHERE in the East region since everyone
else was trying to do the same. I ended up pushing the RDS
stuff out West for the time being. From Comment
Amazon Web Services Discussion Forum
A fascinating peek into the experiences of people who were dealing with
the outage while they were experiencing it. Great real-time social
archeology in action.
Amazon Web Services Discussion Forum
Cost-effective backup plan from now on?
Life of our patients is at stake - I am desperately asking you to
contact
Why did the EBS, RDS, Cloudformation, Cloudwatch and
Beanstalk all fail?
Moved all resources off of AWS
Any success stories?
Is the mass exodus from East going to cause demand
problems in the West?
Finally back online after about 71 hours
Amazon EC2 features vs windows azure
Aren't Availability Zones supposed to be "insulated from
failures"?
What a lot of people aren't realizing about the downtime:
ELB CNAME
Availability Zones were used in a misleading manner
Tip: How to recover your instance
Crying in Forum Gets Results, Silver-level AWS Premium
Support Doesn't
Well-worth reading: "design for failure" cloud deployment
strategy
New best practice
Don't bother with Premium Support
Best practices for multi-region redundancy
"Postmortum"
Learning from this case
Amazon, still no instructions what to do?
Anyone else prepared for an all-nighter?
Is Jeff Bezos going to give a public statement?
Rackspace, GoGrid, StormonDemand and Others
Jeff Barr, Werner Vogels and other AWS persons - where
have you been???
After you guys fix EBS do I have do anything on my side?
Need Help!!! Lives of people and billions in revenue are at risk
now!!!
I've Got A Suspicion
Farewell EC2, Farewell
There were also many, many instances of support and help in the log.
In Summary
Amazon EC2 outage: summary and lessons learned by
RightScale
AWS outage timeline & downtimes by recovery strategy by
Eric Kidd
The Aftermath of Amazon’s Cloud Outage by Rich Miller
Taking Sides: It's the Customer's Fault
So Your AWS-based Application is Down? Don’t Blame
Amazon by The Storage Architect
The Cloud is not a Silver Bullet by Joe Stump (Hacker
News thread)
The AWS Outage: The Cloud's Shining Moment by George
Reese (Hacker News discussion)
Failing to Plan is Planning to Fail by Ted Theodoropoulos
Get a life and build redundancy/resiliency in your apps on the
Cloud Computing group
Taking Sides: It's Amazon's Fault
Stop Blaming the Customers - the Fault is on Amazon Web
Services by Klint Finley
AWS is down: Why the sky is falling by Justin Santa
Barbara (Hacker News thread)
Amazon Web Services are down - Huge Hacker News thread
The EC2/EBS outage: What Amazon didn’t tell you by Jeremy
Gaddis
Lessons Learned and Other Insight Articles
Amazon’s EBS outage by Robin Harris of StorageMojo
People Using Amazon Cloud: Get Some Cheap Insurance At
Least by Bob Warfield
Basic scalability principles to avert downtime by Ronald
Bradford
Amazon crash reveals 'cloud' computing actually based on
data centers by Kevin Fogarty
Seven lessons to learn from Amazon's outage by Phil
Wainewright
The Cloud and Outages : Five Key Lessons by Patrick Baillie
(Cloud Computing Group discussion)
Some thoughts on outages by Till Klampaeckel
Amazon.com’s real problem isn’t the outage, it’s the
communication by Keith Smith
How to work around Amazon EC2 outages by James Cohen
(Hacker News thread)
Today’s EC2 / EBS Outage: Lessons learned on Agile
Sysadmin
Amazon EC2 has gone down -what would a prefered hosting
platform be? on Focus
Single Points of Failure by Mat
Coping with Cloud Downtime with Puppet
Amazon Outage Concerns Are Overblown by Tim Crawford
Where There Are Clouds, It Sometimes Rains by Clay Loveless
Availability, redundancy, failover and data backups at
LearnBoost by Guillermo Rauch
Cloud hosting vs colocation by Chris Chandler (Hacker News
thread)
Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
Complex Systems Have Complex Failures. That’s Cloud
Computing by Greg Ferro
Amazon Web Services, Hosting in the Cloud and
Configuration Management by Ian Chilton
Lessons learned from deploying a production database in
EC2 by Grig Gheorghiu of Agile Testing
Bezos on Amazon as a technology and invention company by
John Gruber on Daring Fireball.
On Importance of Planning for Failure by Dmitriy Samovskiy
Vendor's Vent
Amazon Outage Proves Value of Riak’s Vision by Basho
Magical Block Store: When Abstractions Fail Us by Mark
Joyent (Hacker News discussion)
On Cascading Failures and Amazon’s Elastic Block Store by
Jason
An unofficial EC2 outage postmortem - the sky is not
falling from CloudHarmony
Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal
Article originally appeared on High Scalability (http://highscalability.com/).
See website for complete article licensing information.