
Friday, December 12, 2014

Interesting Videos from AWS Re:Invent 2014

There are too many, but here are the ones I've watched that I found particularly interesting. I'll add to this list as I find time to watch more of them.

To start with here's my own talk. I'm looking for feedback on it, as it's the first time I've tried this topic, and I'd like more data to support the concepts I'm discussing.



The "must see" talk for Re:Invent is James Hamilton on Innovation at Scale.



The new EC2 Container Service that supports Docker, described by Deepak Singh



An excellent and entertaining talk on networking optimization by Becky Weiss - learn what SR-IOV means.



EC2 Instance Deep Dive by John Phillips explains HVM and storage path optimization.



The VPC Deep Dive by Kevin Miller talks about some long awaited features such as migration tools from EC2 Classic to VPC.



There were lots of Netflix talks, but from Asgard to Zuul by Ruslan included the announcement of Docker images to make trying out NetflixOSS tools trivial.



One of the most popular talks at the event was Brendan Gregg of Netflix on EC2 Performance Tuning



PFC305 Embracing Failure: Fault Injection and Service Reliability - covers the monkeys and the new FIT failure injection system that replaces Latency Monkey.


More Netflix Talks
BDT403 Next Generation Big Data Platform at Netflix
ARC317 Maintaining a Highly Available Front Door at Massive Scale
PFC304 Effective IPC for Microservices in the Cloud
ENT209 Cloud Migration, DevOps and Distributed Systems
APP310 Using Apache Mesos in the Cloud

Thursday, April 03, 2014

Public Cloud Instance Pricing Wars - Detailed Context and Analysis

As part of my opening keynote at Cloud Connect in Las Vegas I summarized the latest moves in cloud; the slides are available via the new Powered by Battery site as "The Good the Bad and the Ugly: Critical Decisions for the Cloud Enabled Enterprise". This blog post is a detailed analysis of just part of what happened.

Summary points
  • AWS users should migrate from obsolete m1, m2, c1, c2 to the new m3, r3, c3 instances to get better performance at lower prices with the latest Intel CPUs.
  • Any cloud benchmark or cost comparison that uses the AWS m1 family as a basis should be called out as bogus benchmarketing.
  • AWS and Google instance prices are essentially the same for similar specs.
  • Microsoft doesn’t appear to have the latest Intel CPUs generally available and only matches prices for obsolete AWS instances.
  • IBM Softlayer pricing is still higher, especially on small instance types
  • Google's statement that prices should follow Moore’s law implies that we should expect prices to halve every 18-24 months
  • Pricing pages by AWS, Google Compute Engine, Microsoft Azure, IBM Softlayer
  • Adrian’s spreadsheet summary of instances from the above vendors at http://bit.ly/cloudinstances
  • Analysis of the prices by Rightscale

On Tuesday 25th March 2014 Google announced some new features and steep price cuts, and the next day Amazon Web Services also announced new features and matching price cuts. On Monday 31st March Microsoft Azure also reduced prices. Many pundits repeated talking points from the press releases in their blog posts, but unfortunately there was little attempt to understand what really happened or to explain the context and outcome. When I wrote up a summary for my opening keynote at Cloud Connect on 31st March I looked at the actual end result and came up with a different perspective and a list of gaps.

I’m only going to discuss instance types and on-demand prices here. There was a lot more in the announcements that other people have done a good job of summarizing. The Rightscale blog linked above also gives an accurate and broader view on what was announced. I will discuss other pricing models beyond on-demand in future blog posts.

There are some things you need to know to get the right background context for the instance price cuts. The most important is to understand that AWS has two generations of instance types, and is in a transition from Intel CPU technology introduced five or more years ago to a new generation introduced in the last year. The new generation CPUs are based on an architecture known as Sandy Bridge. The latest tweak is called Ivy Bridge and has incremental improvements that give more cores per chip and slightly higher performance. Since Google is a recent entrant to the public cloud market, all their instance types are based on Sandy Bridge. To correctly compare AWS prices and features with Google, there is a like-for-like comparison that can be made. AWS is encouraging the transition by pricing its newer faster instances at a lower cost than the older slower ones. In the recent announcement, AWS cut prices for the obsolete instance type families by a smaller percentage than for the newer instance type families, so the gap has just widened.

Old AWS instance types have names starting with m1, m2 and c1, c2. They all have newer replacements known as m3, r3 and c3 except the smallest one – the m1.small. The newer instances have a similar amount of RAM and CPU threads, but the CPU performance is significantly higher. The new equivalents also replace small slow local disks with smaller but far faster and more reliable solid-state disks, and the underlying networks move from 1Gbit/s to 10Gbit/s. The newer instance families should also have lower failure rates.
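
As a rough guide, the upgrade path looks something like this (the mapping below is my own approximation from the points above, not an official AWS table, so check current specifications before migrating):

    # Approximate old-to-new AWS instance family mapping (illustrative only;
    # RAM and vCPU counts are close but not identical, so verify before use).
    INSTANCE_UPGRADE_MAP = {
        "m1.small":   None,          # no current-generation equivalent yet
        "m1.medium":  "m3.medium",
        "m1.large":   "m3.large",
        "m1.xlarge":  "m3.xlarge",
        "m2.xlarge":  "r3.large",    # memory-optimized family moves to r3
        "m2.2xlarge": "r3.xlarge",
        "m2.4xlarge": "r3.2xlarge",
        "c1.medium":  "c3.large",    # compute-optimized family moves to c3
        "c1.xlarge":  "c3.2xlarge",
    }

    def suggest_upgrade(instance_type):
        """Return the suggested newer instance type, or None if there isn't one."""
        return INSTANCE_UPGRADE_MAP.get(instance_type)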

Most people are much more familiar with the old generation instance types, so when competitors write their press releases they are able to get away with claiming that they are both faster and cheaper than AWS by comparing against the old generation products. This is an old "benchmarketing" trick: compare your new product against the competition's older and more recognizable product.

For the most commonly used instance types there is a close specification match between the AWS m3 and the Google n1-standard. They are also exactly the same price per hour. Since AWS released its changes after Google, this implies that AWS deliberately matched Google's price. The big architectural difference between the vendors is that Google instances are diskless (all their storage is network attached), while AWS includes various amounts of local SSD. The AWS hypervisor also makes slightly more memory available per instance, and ratings for the c3 imply that AWS is supplying a slightly higher CPU clock rate for that instance type. I think that this is because AWS has based its compute intensive c3 instance types on a higher clock rate Ivy Bridge CPU rather than the earlier Sandy Bridge specification. For the high memory capacity instance types it is a little different. The Google n1-highmem instances have less memory available than the AWS r3 equivalents, and cost a bit less. This makes intuitive sense, as this instance type is normally bought for its memory capacity.

Microsoft previously committed to match AWS prices, and in their announcement their comparisons matched the m1 range exactly at its new price. They compared their memory oriented A5 instance as cheaper than an old m2.xlarge, but the A5 uses an older, slower CPU type, is more expensive ($0.22 vs $0.18) and has less memory (14GB vs. 15GB) than the AWS r3.large. The common CPU options on Azure are aligned with the older AWS instance types. Azure does have Intel Sandy Bridge CPUs for compute use cases as the A8 and A9 models, but I couldn't find list pricing for them and they appear to be a low volume special option. The Azure pricing strategy ignores the current generation AWS product, so the price match guarantee doesn't deliver. In addition, the Google and AWS price changes were effective from April 1st, but Azure's don't take effect until May 1st.

IBM Softlayer has a choose-what-you-want model rather than a specific set of instance types. The smaller instances are $0.10/hr where AWS and Google n1-standard-1 are $0.07/hr. As you pick a bigger instance type on Softlayer the cost doesn’t scale up linearly, while Google and AWS double the price each time the configuration doubles. The Softlayer equivalent of the n1-standard-16 is actually slightly lower cost than Google. Softlayer pricing on most instances is in the same ballpark as AWS and Azure were before the cuts, so I expect they will eventually have to cut prices to match the new level.

Gaps and Missing Features

The remaining anomaly in AWS pricing is the low-end m1.small. There is no newer technology equivalent at present, so I wouldn't be surprised to see AWS do something interesting in this space soon. Generally AWS has a much wider range of instances than Google, but AWS is missing an m3.4xlarge to match Google's n1-standard-16, and the Google n1-highcpu range has double the CPU to RAM ratio of the AWS c3 range, so they aren't directly comparable.

Google has no equivalent to the highest memory and CPU AWS instances, and has no local disk or SSD options. Instead they have better attached disk performance than AWS Elastic Block Store, but attached disk adds to the instance cost, and can never be as fast as local SSD inside the instance.

Microsoft Azure needs to refresh its instance type options, it has a much smaller range, older slower CPUs, and no SSD options. It doesn’t look particularly competitive.

Conclusion

If you buy hardware and capitalize it over three years, and later on there is a price cut, you don't get to reduce your monthly costs. Towards the end your CPUs are getting old, leading to less competitive response times and higher failure rates. With public cloud vendors driving costs down several times a year and upgrading their instances, your model of public vs. private costs needs to factor in something like Moore's law for cost reductions and a technology refresh more often than every three years. Google actually said in their announcement that we should expect Moore's law to apply, which I interpret to mean that we can expect costs to halve about every 18-24 months. This isn't a race to zero; it's a proportional reduction every year. Over a three-year period the cost at the end is a third to a quarter of the cost at the start.
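
To make the arithmetic concrete, here is a minimal sketch of that projection, using the 18-24 month halving range quoted above rather than any vendor commitment:

    # Project an on-demand price forward, assuming it halves every N months.
    def projected_price(price_now, months_ahead, halving_period_months):
        return price_now * 0.5 ** (months_ahead / float(halving_period_months))

    # Over a three year capitalization period, starting from a normalized price of 1.0:
    print(projected_price(1.0, 36, 24))  # ~0.35, roughly a third of the starting cost
    print(projected_price(1.0, 36, 18))  # 0.25, a quarter of the starting cost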

I still hear CIOs worry that cloud vendor lock-in would let the vendor raise prices, and this ruse is used to justify private cloud investments. Even without switching vendors, you will see repeated price reductions for the public cloud systems you are already using. This was the 42nd price cut for AWS; the argument is ridiculous.


I've previously published presentation materials on cost optimization with AWS. I'm researching this area and over the coming months will publish a series of posts on all aspects of cloud optimization.

Monday, November 26, 2012

Lots of Netflix talks at AWS Re:Invent



[Update: here are videos of these talks along with slides: http://techblog.netflix.com/2012/12/videos-of-netflix-talks-at-aws-reinvent.html]

There is a Netflix booth in the expo center, we will be talking about our open source tools from http://netflix.github.com and collecting resumes from anyone interested in joining us.

Date/Time | Presenter | Topic
Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army

Friday, November 16, 2012

Cloud Outage Reports

The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. I've included some Google and Azure outages here because they illustrate different failure modes that should be taken into account. Recent AWS and Azure outage reports have far more detail than Google outage reports.

I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them. My naming convention is {vendor} {primary scope} {cause}. The scope may be global, a specific region, or a zone in the region. In some cases there are secondary impacts with a wider scope but shorter duration such as regional control planes becoming unavailable for a short time during a zone outage.

This post was written while researching my AWS Re:Invent talk.
Slides: http://www.slideshare.net/AmazonWebServices/arc203-netflixha
Video: http://www.youtube.com/watch?v=dekV3Oq7pH8


November 18th, 2014 - Azure Global Storage Outage

Microsoft Reports


January 10th, 2014 - Dropbox Global Outage

Dropbox Report


April 20th, 2013 - Google Global API Outage

Google Report


February 22nd, 2013 - Azure Global Outage Cert Expiry

Azure Report


December 24th, 2012 - AWS US-East Partial Regional ELB State Overwritten

AWS Service Event Report

http://aws.amazon.com/message/680587/

Netflix Techblog Report

http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html


October 26th, 2012 - Google AppEngine Network Router Overload

Google Outage Report


October 22, 2012 - AWS US-East Zone EBS Data Collector Bug

AWS Outage Report

Netflix Techblog Report


June 29th 2012 - AWS US-East Zone Power Outage During Storm 

AWS Outage Report

Netflix Techblog Report


June 13th, 2012 - AWS US-East SimpleDB Region Outage

AWS Outage Report


February 29th, 2012 - Microsoft Azure Global Leap-Year Outage

Azure Outage Report


August 17th, 2011 - AWS EU-West Zone Power Outage

AWS Outage Report


April 2011 - AWS US-East Zone EBS Outage

AWS Outage Report

Netflix Techblog Report

Google Forum Report


July 20th, 2008 - AWS Global S3 Gossip Protocol Corruption

AWS Outage Report



Saturday, June 16, 2012

Cloud Application Architectures: GOTO Aarhus, Denmark, October

Thanks to an invite via Michael Nygard, I ended up as a track chair for the GOTO Aarhus conference in Denmark in early October. The track subject is Cloud Application Architectures, and we have three speakers lined up. You can register with a discount using the code crof1000.

We are starting out with a talk from the Citrix Cloudstack folks about how to architect applications on open portable clouds. These implement a subset of functionality but have more implementation flexibility than public cloud services. [Randy Bias of Cloudscaling was going to give this talk but had to pull out of the trip due to other commitments].

To broaden our perspective somewhat, and get our hands dirty with real code, the next talk is a live demonstration by Ido Flatow, Building secured, scalable, low-latency web applications with the Windows Azure Platform.
In this session we will construct a secured, durable, scalable, low-latency web application with Windows Azure - Compute, Storage, CDN, ACS, Cache, SQL Azure, Full IIS, and more. This is a no-slides presentation!
Finally I will be giving my latest update on Globally Distributed Cloud Applications at Netflix.
Netflix grew rapidly and moved its streaming video service to the AWS cloud between 2009 and 2010. In 2011 the architecture was extended to use Apache Cassandra as a backend, and the service was internationalized to support Latin America. Early in 2012 Netflix launched in the UK and Ireland, using the combination of AWS capacity in Ireland and Cassandra to create a truly global backend service. Since then the code that manages and operates the global Netflix platform is being released as a series of open source projects at netflix.github.com (Asgard, Priam etc.). The platform is structured as a large scale PaaS, strongly leveraging advanced features of AWS to deploy many thousands of instances. The platform has primary language support for Java/Tomcat, with most management tools built using Groovy/Grails and operations tooling in Python. Continuous integration and deployment tooling leverages Jenkins, Ivy/Gradle and Artifactory. This talk will explain how to build your own custom PaaS on AWS using these components.

There are many other excellent speakers at this event, which is run by the same team as the global series of QCon conferences. Unfortunately, the cloud track runs at the same time as Michael Nygard and Jez Humble on Continuous Delivery and Continuous Integration. However, I'm doing another talk in the NoSQL track (along with Martin Fowler and Coda Hale): Running Netflix on Cassandra in the Cloud.
Netflix used to be a traditional Datacenter based architecture using a few large Oracle database backends. Now it is one of the largest cloud based architectures, with master copies of all data living in Cassandra. This talk will discuss how we made the transition, how we automated and open sourced Cassandra management for tens of clusters and hundreds of nodes using Priam and Astyanax, backups, archiving and performance and scalability benchmarks.

I'm looking forward to meeting old friends, getting to know some new people, and visiting Denmark for the first time.  See you there!

Monday, March 19, 2012

Ops, DevOps and PaaS (NoOps) at Netflix

There has been a sometimes heated discussion on twitter about the term NoOps recently, and I've been quoted extensively as saying that NoOps is the way developers work at Netflix. However, there are teams at Netflix that do traditional Operations, and teams that do DevOps as well. To try and clarify things I need to explain the history and current practices at Netflix in chunks of more than 140 characters at a time.

When I joined Netflix about five years ago, I managed a development team, building parts of the web site. We also had an operations team who ran the systems in the single datacenter that we deployed our code to. The systems were high end IBM P-series virtualized machines with storage on a virtualized Storage Area Network. The idea was that this was reliable hardware with great operational flexibility so that developers could assume low failure rates and concentrate on building features. In reality we had the usual complaints about how long it took to get new capacity, the lack of consistency across supposedly identical systems, and failures in Oracle, in the SAN and the networks, that took the site down too often for too long.

At that time we had just launched the streaming service, and it was still an experiment, with little content and no TV device support. As we grew streaming over the next few years, we saw that we needed higher availability and more capacity, so we added a second datacenter. This project took far longer than initial estimates, and it was clear that deploying capacity at the scale and rates we were going to need as streaming took off was a skill set that we didn't have in-house. We tried bringing in new ops managers, and new engineers, but they were always overwhelmed by the fire fighting needed to keep the current systems running.

Netflix is a developer oriented culture, from the top down. I sometimes have to remind people that our CEO Reed Hastings was the founder and initial developer of Purify, which anyone developing serious C++ code in the 1990's would have used to find memory leaks and optimize their code. Pure Software merged with Atria and Rational before being swallowed up by IBM. Reed left IBM and formed Netflix. Reed hired a team of very strong software engineers who are now the VPs who run developer engineering for our products. When we were deciding what to do next Reed was directly involved in deciding that we should move to cloud, and even pushing us to build an aggressively cloud optimized architecture based on NoSQL. Part of that decision was to outsource the problems of running large scale infrastructure and building new datacenters to AWS. AWS has far more resources to commit to getting cloud to work and scale, and to building huge datacenters. We could leverage this rather than try to duplicate it at a far smaller scale, with greater certainty of success. So the budget and responsibility for managing AWS and figuring out cloud was given directly to the developer organization, and the ITops organization was left to run its datacenters. In addition, the goal was to keep datacenter capacity flat, while growing the business rapidly by leveraging additional capacity on AWS.

Over the next three years, most of the ITops staff have left and been replaced by a smaller team. Netflix has never had a CIO, but we now have an excellent VP of ITops, Mike Kail (@mdkail), who runs the datacenters. These still support the DVD shipping functions of Netflix USA, and he also runs corporate IT, which is increasingly moving to SaaS applications like Workday. Mike runs a fairly conventional ops team and is usually hiring, so there are sysadmin, database, storage and network admin positions. The datacenter footprint hasn't increased since 2009, although there have been technology updates, and the overall size is order-of-magnitude a thousand systems.

As the developer organization started to figure out cloud technologies and build a platform to support running Netflix on AWS, we transferred a few ITops staff into a developer team that formed the core of our DevOps function. They build the Linux based base AMI (Amazon Machine Image) and after a long discussion we decided to leverage developer oriented tools such as Perforce for version control, Ivy for dependencies, Jenkins to automate the build process, Artifactory as the binary repository and to construct a "bakery" that produces complete AMIs that contain all the code for a service. Along with AWS Autoscale Groups this ensured that every instance of a service would be totally identical. Notice that we didn't use the typical DevOps tools Puppet or Chef to create builds at runtime. This is largely because the people making decisions are development managers, who have been burned repeatedly by configuration bugs in systems that were supposed to be identical.
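
To illustrate what a bakery step boils down to, here is a minimal sketch using boto 2; the base AMI ID, the install helper and the naming scheme are placeholders rather than the actual Netflix tooling:

    import time
    import boto.ec2

    def bake_ami(base_ami_id, app_version, region="us-east-1"):
        """Launch the base AMI, install one application build, snapshot it as a new AMI."""
        conn = boto.ec2.connect_to_region(region)

        # 1. Launch a temporary instance from the common base AMI.
        reservation = conn.run_instances(base_ami_id, instance_type="m1.large")
        instance = reservation.instances[0]
        while instance.state != "running":
            time.sleep(10)
            instance.update()

        # 2. Install the service's code (placeholder: the real bakery installs a
        #    versioned artifact that was built by Jenkins and stored in Artifactory).
        install_application(instance, app_version)  # hypothetical helper

        # 3. Snapshot the instance as a complete AMI for this version of the service.
        ami_id = conn.create_image(instance.id, "myservice-%s" % app_version)

        # 4. Clean up the temporary build instance; the AMI is now ready to deploy.
        conn.terminate_instances([instance.id])
        return ami_id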

By 2012 the cloud capacity has grown to be order-of-magnitude 10,000 instances, ten times the capacity of the datacenter, running in nine AWS Availability Zones (effectively separate datacenters) on the US East and West coasts and in Europe. A handful of DevOps engineers working for Carl Quinn (@cquinn - well known from the Java Posse podcast) are coding and running the build tools and bakery, and updating the base AMI from time to time. Several hundred development engineers use these tools to build code, run it in a test account in AWS, then deploy it to production themselves. They never have to have a meeting with ITops, or file a ticket asking someone from ITops to make a change to a production system, or request extra capacity in advance. They use a web based portal to deploy hundreds of new instances running their new code alongside the old code, then put one "canary" instance into traffic; if it looks good the developer flips all the traffic to the new code. If there are any problems they flip the traffic back to the previous version (in seconds), and if it's all running fine, some time later the old instances are automatically removed. This is part of what we call NoOps. The developers used to spend hours a week in meetings with Ops discussing what they needed, figuring out capacity forecasts and writing tickets to request changes for the datacenter. Now they spend seconds doing it themselves in the cloud. Code pushes to the datacenter are rigidly scheduled every two weeks, with emergency pushes in between to fix bugs. Pushes to the cloud are as frequent as each team of developers needs them to be; incremental agile updates several times a week are common, and some teams are working towards several updates a day. Other teams and more mature services update every few weeks or months. There is no central control; the teams are responsible for figuring out their own dependencies and managing AWS security groups that restrict who can talk to whom.
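
The push workflow reads roughly like the sketch below; the helper functions are hypothetical stand-ins for what the deployment portal does with the AWS Auto Scaling and ELB APIs, not the actual Netflix code:

    def red_black_push(service, new_ami, cluster_size):
        """Deploy new code alongside the old, canary it, then flip all the traffic."""
        old_group = current_autoscale_group(service)                 # hypothetical helpers
        new_group = create_autoscale_group(service, new_ami, size=cluster_size)

        # Put a single "canary" instance from the new group into live traffic.
        add_to_load_balancer(new_group, instances=1)
        if not canary_looks_healthy(service):
            remove_from_load_balancer(new_group)                     # rollback takes seconds
            delete_autoscale_group(new_group)
            return False

        # Canary is healthy: flip all traffic to the new code, but keep the old
        # group around for a while in case a fast rollback is needed.
        add_to_load_balancer(new_group)
        remove_from_load_balancer(old_group)
        schedule_deletion(old_group, delay_hours=8)                  # removed automatically later
        return True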

Automated deployment is part of the normal process of running in the cloud. The other big issue is what happens if something breaks. Netflix ITops always ran a Network Operations Center (NOC) which was staffed 24x7 with system administrators. They were familiar with the datacenter systems, but had no experience with cloud. If there was a problem, they would start and run a conference call, and get the right people on the call to diagnose and fix the issue. As the Netflix web site and streaming functionality moved to the cloud it became clear that we needed a cloud operations reliability engineering (CORE) team, and that it would be part of the development organization. The CORE team was lucky enough to get Jeremy Edberg (@jedberg - well known from running Reddit) as its initial lead engineer, and also picked up some of the 24x7 shift sysadmins from the original NOC. The CORE team is still staffing up, looking for the Site Reliability Engineer skill set, and is the second group of DevOps engineers within Netflix. There is a strong emphasis on building tools to make as much of their processes go away as possible; for example, they have no run-books, they develop code instead.

To get themselves out of the loop, the CORE team has built an alert processing gateway. It collects alerts from several different systems, does filtering, has quenching and routing controls (that developers can configure), and automatically routes alerts either to the PagerDuty system (a SaaS application service that manages on call calendars, escalation and alert life cycles) or to a developer team email address. Every developer is responsible for running what they wrote, and the team members take turns to be on call in the PagerDuty rota. Some teams never seem to get calls, and others are more often on the critical path. During a major production outage conference call, the CORE team never makes changes to production applications; they always call a developer to make the change. The alerts mostly refer to business transaction flows (rather than typical operations oriented Linux level issues) and contain deep links to dashboards and developer oriented Application Performance Management tools like AppDynamics, which let developers quickly see where the problem is at the Java method level and what to fix.
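
A minimal sketch of that kind of routing logic might look like this; the rule format and the delivery helpers are illustrative assumptions, not the actual gateway:

    import time

    # Developer-configurable rules: filter on the alert name, quench duplicates,
    # then route to either a PagerDuty service or a team email address.
    ROUTES = [
        {"match": "playback", "quench_minutes": 10, "pagerduty_service": "playback-oncall"},
        {"match": "billing",  "quench_minutes": 0,  "email": "billing-team@example.com"},
    ]

    recently_sent = {}  # alert name -> unix timestamp of the last delivery

    def route_alert(alert):
        now = time.time()
        for rule in ROUTES:
            if rule["match"] not in alert["name"]:
                continue
            # Quenching: drop duplicate alerts seen within the configured window.
            last = recently_sent.get(alert["name"])
            if last and now - last < rule["quench_minutes"] * 60:
                return "quenched"
            recently_sent[alert["name"]] = now
            if "pagerduty_service" in rule:
                send_to_pagerduty(rule["pagerduty_service"], alert)  # hypothetical helper
            else:
                send_email(rule["email"], alert)                     # hypothetical helper
            return "routed"
        return "unmatched"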

The transition from datacenter to cloud also invoked a transition from Oracle, initially to SimpleDB (which AWS runs) and now to Apache Cassandra, which has its own dedicated team. We moved a few Oracle DBAs over from the ITops team and they have become experts in helping developers figure out how to translate their previous experience in relational schemas into Cassandra key spaces and column families. We have a few key development engineers who are working on the Cassandra code itself (an open source Java distributed systems toolkit), adding features that we need, tuning performance and testing new versions. We have three key open source projects from this team available on github.com/Netflix. Astyanax is a client library for Java applications to talk to Cassandra, CassJmeter is a Jmeter plugin for automated benchmarking and regression testing of Cassandra, and Priam provides automated operation of Cassandra including creating, growing and shrinking Cassandra clusters, and performing full and incremental backups and restores. Priam is also written in Java. Finally we have three DevOps engineers maintaining about 55 Cassandra clusters (including many that span the US and Europe), a total of 600 or so instances. They have developed automation for rolling upgrades to new versions, and sequencing compaction and repair operations. We are still developing our Cassandra tools and skill sets, and are looking for a manager to lead this critical technology, as well as additional engineers. Individual Cassandra clusters are automatically created by Priam, and it's trivial for a developer to create their own cluster of any size without assistance (NoOps again). We have found that the first attempts to produce schemas for Cassandra use cases tend to cause problems for engineers who are new to the technology, but with some familiarity and assistance from the Cloud Database Engineering team, we are starting to develop better common patterns to work to, and are extending the Astyanax client to avoid common problems.

In summary, Netflix still does Ops to run its datacenter DVD business. We have a small number of DevOps engineers embedded in the development organization who are building and extending automation for our PaaS, and we have hundreds of developers using NoOps to get their code and datastores deployed in our PaaS and to get notified directly when something goes wrong. We have built tooling that removes many of the operations tasks completely from the developer, and which makes the remaining tasks quick and self service. There is no ops organization involved in running our cloud, no need for the developers to interact with ops people to get things done, and less time spent actually doing ops tasks than developers would spend explaining what needed to be done to someone else. I think that's different to the way most DevOps places run, but it's similar to other PaaS environments, so it needs its own name, NoOps. [Update: the DevOps community argues that although it's different, it's really just a more advanced end state for DevOps, so let's just call it PaaS for now, and work on a better definition of DevOps].

Friday, March 16, 2012

Cloud Architecture Tutorial

I presented a whole day tutorial at QCon London on March 5th, and presented subsets at an AWS Meetup, a Big Data / Cassandra Meetup and a Java Meetup the same week. I updated the slides and split them into three sections which are hosted on http://slideshare.net/adrianco along with all my other presentations. You can find many more related slide decks at http://slideshare.net/Netflix and other information at the Netflix Tech Blog.

The first section tells the story of why Netflix migrated to cloud, how we think about choosing AWS as our cloud supplier and what features of the Netflix site were moved to the cloud over the last three years.



The second section is a detailed explanation of the globally distributed Java based Platform as a Service (PaaS) we built, the open source components that we depend on, and the open source projects that we have started to share at http://netflix.github.com.



The final section talks about how we run these PaaS services in the cloud, and includes details of our million writes per second scalability benchmark.



If you would like to see these slides presented in person, I'm teaching a half day cloud architecture tutorial at Gluecon in Broomfield, Colorado on May 23rd-24th. I hope to see you there...

Thursday, January 19, 2012

Thoughts on SimpleDB, DynamoDB and Cassandra

I've been getting a lot of questions about DynamoDB, and these are my personal thoughts on the product, and how it fits into cloud architectures.

I'm excited to see the release of DynamoDB, it's a very impressive product with great performance characteristics. It should be the first option for startups and new cloud projects on AWS. I also think it marks the turning point on solid state disks, they will be the default for new database products and benchmarks going forward.

There are a few use cases where SimpleDB may still be useful, but DynamoDB replaces it in almost all cases. I've talked about the history of Netflix use of SimpleDB before, but it's relevant to the discussion on DynamoDB, so here goes.

When Netflix was looking at moving to cloud about three years ago we had an internal debate about how to handle storage on AWS. There were strong proponents for using MySQL, SimpleDB was fairly new, and other alternatives were nascent NoSQL projects or expensive enterprise software. We started some pathfinder projects to explore the two alternatives and decided to port an existing MySQL app to AWS, while building a replication pipeline that copied data out of our Oracle datacenter systems into SimpleDB to be consumed in the cloud. The MySQL experience showed that we would have trouble scaling, and SimpleDB seemed reliable, so we went ahead and kept building more data sources on SimpleDB, with large blobs of data on S3.

Along the way we put memcached in front of SimpleDB and S3 to improve read latency. The durability of SimpleDB is its strongest point, we have had Oracle and SAN data corruption bugs in the Datacenter over the last few years, but never lost or corrupted any SimpleDB data. The limitations of SimpleDB are its biggest problem. We worked around limits on table size, row and attribute size, and per-request overhead caused by http requests needing to be authenticated for every call.

So the lesson here is that for a first step into NoSQL, we went with a hosted solution so that we didn't have to build a team of experts to run it, and we didn't have to decide in advance how much scale we needed. Starting again from scratch today, I would probably go with DynamoDB. It's a low "friction" and developer friendly solution.

Late in 2010 we were planning the next step, turning off Oracle and making the cloud the master copy of the data. One big problem is that our backups and archives were based on Oracle, and there was no way to take a snapshot or incremental backup of SimpleDB. The only way to get data out in bulk is to run "SELECT * FROM table" over HTTP and page the requests. This adds load, takes too long, and costs a lot because SimpleDB charges for the time taken in SELECT calls.
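
For reference, the bulk extraction looks something like this with the boto 2 SimpleDB client (a sketch with a placeholder domain name; boto follows the NextToken paging behind the iterator, but it is still a full table scan):

    import boto

    def dump_simpledb_domain(domain_name):
        """Full scan of a SimpleDB domain - the only way to get all the data out in bulk."""
        sdb = boto.connect_sdb()  # uses AWS credentials from the environment
        domain = sdb.get_domain(domain_name)

        # Every page of results is a separate HTTP SELECT call, charged by query time.
        for item in domain.select('select * from `%s`' % domain_name):
            yield item.name, dict(item)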

We looked at about twenty NoSQL options during 2010, trying to understand what the differences were, and eventually settled on Cassandra as our candidate for prototyping. In a week or so, we had ported a major data source from SimpleDB to Cassandra and were getting used to the new architecture, running benchmarks etc. We evaluated some other options, but decided to take Cassandra to the next stage and develop a production data source using it.

The things we liked about Cassandra were that it is written in Java (we have a building full of Java engineers), it is packed full of state of the art distributed systems algorithms, we liked the code quality, we could get commercial support from Datastax, it is scalable, and as an extra bonus it had multi-region support. What we didn't like so much was that we had to staff a team to own running Cassandra for us, but we retrained some DBAs and hired some good engineers including Vijay Parthasarathy, who had worked on the multi-region Cassandra development at Cisco Webex and who recently became a core committer on the Apache Cassandra project. We also struggled with the Hector client library, and have written our own (which we plan to release soon). The blessing and curse of Cassandra is that it is an amazingly fast moving project. New versions come fast and furiously, which makes it hard to pick a version to stabilize on; however, the changes we make turn up in the mainstream releases after a few weeks. Saying "Cassandra doesn't do X" is more of a challenge than a statement. If we need "X" we work with the rest of the Apache project to add it.

Throughout 2011 the product teams at Netflix gradually moved their backend data sources to Cassandra; we worked out the automation we needed to do easy self service deployments and ended up with a large number of clusters. In preparation for the UK launch we also made changes to Cassandra to better support multi-region deployment on EC2, and we are currently running several Cassandra clusters that span the US and European AWS regions.

Now that DynamoDB has been released, the obvious question is whether Netflix has any plans to use it. The short answer is no, because it's a subset of the Cassandra functionality that we depend on. However that doesn't detract from the major step forward from SimpleDB in performance, scalability and latency. For new customers, or people who have outgrown the scalability of MySQL or MongoDB, DynamoDB is an excellent starting point for data sources on AWS. The advantages of zero administration combined with the performance and scalability of a solid state disk backend are compelling.

Personally my main disappointment with DynamoDB is that it doesn't have any snapshot or incremental backup capability. The AWS answer is that you can extract data into EMR then store it in S3. This is basically the same answer as SimpleDB, it's a full table scan data extraction (which takes too long and costs too much and isn't incremental). The mechanism we built for Cassandra leverages the way that Cassandra writes immutable files to get a callback and compress/copy them to S3 as they are written, it's extremely low overhead. If we corrupt our data with a code bug and need to roll back, or take a backup in production and restore in test, we have all the files archived in S3.
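
A rough sketch of that incremental backup idea, with placeholder bucket and key names rather than the actual Priam code, is to compress each newly written SSTable file and copy it to S3:

    import gzip
    import os
    import shutil
    import boto

    def backup_new_sstable(sstable_path, bucket_name="my-cassandra-backups"):
        """Called whenever Cassandra finishes writing an immutable SSTable file."""
        # SSTables are immutable once written, so they can be compressed safely in place.
        gz_path = sstable_path + ".gz"
        with open(sstable_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

        # Copy the compressed file to S3; a real backup would key it by cluster,
        # node and keyspace so that restores can rebuild the same layout.
        s3 = boto.connect_s3()
        bucket = s3.get_bucket(bucket_name)
        key = bucket.new_key("sstables/" + os.path.basename(gz_path))
        key.set_contents_from_filename(gz_path)
        os.remove(gz_path)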

One argument against DynamoDB is that it is only available on AWS, so customers could get locked in; however, it's easy to upgrade applications from SimpleDB to DynamoDB and then to Cassandra. They have similar schema models, consistency options and availability models. It's harder to go backwards, because Cassandra has more features and fewer restrictions. Porting between NoSQL data stores is trivial compared to porting between relational databases, due to the complexity of the SQL language dialects and features and the simplicity of the NoSQL offerings. Starting out on DynamoDB then switching to Cassandra when you need more direct control over the installation or Cassandra specific features like multi-region support is a very viable strategy.

As early adopters we have had to do a lot more pioneering engineering work than more recent cloud converts. Along the way we have leveraged AWS heavily to accelerate our own development, and built a lot of automation around Cassandra. While SimpleDB has been a minor player in the NoSQL world DynamoDB is going to have a much bigger impact. Cassandra has matured and got easier to use and deploy over the last year but it doesn't scale down as far. By that I mean a single developer in a startup can start coding against DynamoDB without needing any support and with low and incremental costs. The smallest Cassandra cluster we run is six m1.xlarge instances spread over three zones with triple replication.

I've been saying for a while that 2012 is the year that NoSQL goes mainstream, DynamoDB is another major step in validating that move. The canonical CEO to CIO conversation is moving from 2010: "what's our cloud strategy?", 2011: "what's our big data strategy?" to 2012: "what's our NoSQL strategy?".

Tuesday, August 30, 2011

I come to use clouds, not to build them...

[Update: Thanks for all the comments and Ryan Lawler's GigaOM summary - also I would like to credit James Urquhart's posting on Competing With Amazon Part 1 for giving me the impetus to write this.]

My question is what are the alternatives to AWS from a developer perspective, and when might they be useful? However I will digress into a little history to frame the discussion.

There are really two separate threads of development in cloud architectures, the one I care about is how to build applications on top of public cloud infrastructure, the other is about how to build cloud infrastructure itself.

In 1984 I didn't care about how the Zilog Z80 or the Motorola 6809 microprocessors were made, but I built my own home-brew 6809 machine and wrote a code generator for a C compiler because I thought it was the best architecture, and I needed something to distract me from a particularly mind-numbing project at work.

In 1988 I joined Sun Microsystems and was one of the people who could argue in detail how SPARC was better than MIPS or whatever as an instruction set, or how Solaris and OpenLook window system were better. However I never designed a RISC microprocessor, wrote kernel code or developed window systems libraries. I helped customers use them to solve their own problems.

In 1993 I moved to the USA and worked to help customers scale their applications on the new big multiprocessor systems that Sun had just launched. I didn't re-write the operating system myself, but I figured out how to measure it and explain how to get good performance in some books I wrote at the time.

In 1995 when Java was released and the WWW was taking off, I didn't work on the Java implementation or IETF standards body, I helped people to figure out how to use Java and to get the first web servers to scale, so they could build new kinds of applications.

In 2004 I made a career change to move from the Enterprise Computing market place with Sun to the Consumer Web Services space with eBay. At the time eBay was among the first sites to have a public web API. It seemed to me that the interesting innovation was now taking place in the creation and mash-up of web services and APIs, no-one cared about what operating system they ran, what hardware that ran on, or who sold those computers.

Over time, the interesting innovation that matters has moved up the food chain to higher and higher levels of abstraction, leveraging and taking for granted the layers underneath. A few years ago I had to explain to friends who still worked at Sun, how I was completely uninterested in whether my servers ran Linux or Solaris, but I did care what version of Java we were using.

Now I'm working on a cloud architecture for Netflix, we don't really care which Content Delivery Network is used to stream the TV shows over the Internet to the customers, we interchangeably use three vendors at present. I also don't care how the cloud works underneath, I hear that AWS uses Xen, but it's invisible to me. What I do care about is how the cloud behaves, i.e. does it scale and does it have the feature set that I need.

That brings me back to my original question, what are the alternatives to AWS and when might they be useful.

Last week I attended an OpenStack meetup, thinking that I might learn about its feature set, scale and roadmap as an alternative to AWS. However the main objective of the presenters seemed to be to recruit the equivalent of chip designers and kernel developers to help them build out the project itself, and to appeal to IT operations people who want to build and run their own cloud. There was no explanation or outreach to developers who might want to build applications that run on OpenStack.

I managed to get the panel to spend a little while explaining what OpenStack consists of, and figured out a few things. The initial release is only really usable via the AWS clone APIs and doesn't have an integrated authentication system across the features. The "Diablo" release this fall should be better integrated and will have native APIs, it is probably suitable for proof of concept implementations by people building private clouds. The "Essex" version targeted at March next year is supposed to be the first production oriented release.

There are several topics that I would like to have seen discussed, perhaps people could discuss them in the comments to this blog post? One is a feature set comparison with AWS, and a discussion of whether OpenStack plans to continue to support the AWS clone APIs for equivalent features as it adds them. So far I think OpenStack has a basic EC2 clone and S3 clone, plus some networking and identity management that doesn't map to equivalent AWS APIs.

The point of my history lesson in the introduction is that a few very specialized engineers are needed to build microprocessors, operating systems, servers, datacenters, CDNs and clouds. It's difficult and interesting work, but in the end if it's done right it's a commodity that is invisible to developers and their customers. One of the slides proudly showed how many developers OpenStack had, a few hundred, mostly building it from scratch. There wasn't room on the slide to show how many developers AWS has on the same scale. Someone said recently that the far bigger AWS team has more open headcount than the total number of people working on OpenStack. When you consider the developer ecosystem around AWS, there must be hundreds of thousands of developers familiar with AWS concepts and APIs.

Some of the proponents of OpenStack argue that because it's an open source community project it will win in the end. I disagree, the most successful open source projects I can think of have a strong individual leader who spends a lot of time saying no to keep the project on track. Some of the least successful are large multi-vendor industry consortiums.

The analogy that seems to fit is Apple's iOS vs. Google's Android in the smartphone market. The parts of the analogy that resonate with me are that Apple came out first and dominates the market, taking most of the profit and forcing its competitors to try and band together to compete, changing the rules of the game and creating new products like the iPad that leave their competition floundering. By adding all the incompatible fragmented Android markets together, it's possible to claim that Android is selling in a similar volume to iPhone. However it's far harder for developers to build Android apps that work on all devices, and then they usually make much less money from them. Apple and its ecosystem are dominant, growing fast, and extremely profitable.

In the cloud space, OpenStack appears to be the consortium of people who can't figure out how to compete with AWS on their own. AWS is dominant, growing its feature set and installed capacity very fast. Every month that passes, AWS is refining and extending its products to meet real customer needs. Measured by the reserved IP address ranges used by its instances, AWS has more than doubled in the last year and now has over 800,000 IP addresses assigned to its regions worldwide.

The problem with a consortium is that it is hard to get it to agree on anything, and Brooks law applies (The Mythical Man-Month - adding resources to a late software project makes it later). While it seems obvious that adding more members to OpenStack is a good thing, in practice, it will slow the project down. I was once told that the way to kill a standards body or consortium is to keep inviting new people to join and adding to its scope. With the huge diversity of datacenter hardware and many vendors with a lot to lose if they get sidelined I expect OpenStack to fracture into multiple vendor specific "stacks" with narrow test matrixes and extended features that lock customers in and don't interoperate well.

I come to use clouds, because I work for a developer oriented company that has decided that building and running infrastructure on a global scale is undifferentiated heavy lifting, and we can leverage outside investment from AWS and others to do a better job than we could do ourselves, while we focus on the real job of developing global streaming to TVs.

Operations oriented companies tend to focus on costs and ways to control their developers. They want to build clouds, and may use OpenStack, but their developers aren't going to wait; they may be allowed to use AWS "just for development and testing", but when the time comes to deploy on OpenStack, its lack of features is going to add a significant burden of complexity to the development team. OpenStack's lack of scale and immaturity compared to AWS is also going to make it harder to deploy products. I predict that the best developers will get frustrated and leave to work at places like Netflix (hint, we're hiring).

I haven't yet seen a viable alternative to AWS, but that doesn't mean I don't want to see one. My guess is that in about two to three years from now there may be a credible alternative. Netflix has already spent a lot of time helping AWS scale as we figured out our architecture, we don't want to do that again, so I'm also waiting for someone else (another large end-user) to kick the tires and prove that an alternative works.

Here's my recipe for a credible alternative that we could use:

AWS has too many features to list, we use almost all of them, because they were all built to solve real customer problems and make life easier for developers. The last slide of my recent cloud presentations at http://slideshare.net/adrianco contains a partial list as a terminology glossary. AWS is adding entirely new capabilities and additional detailed features every month, so this is a moving target that is accelerating fast away from the competition...

From a scale point of view Netflix has several thousand instances organized into hundreds of different instance types (services), and routinely allocates and deallocates over a thousand new instances each day as we autoscale to the traffic load and push new code. Often a few hundred instances are created in a few minutes. Some other cloud vendors we have talked to consider a hundred instances a large customer, and their biggest instances are too small for us to use. We mostly use m2.4xl and we need the 68GB RAM for memcached, Cassandra or our very large Java applications, so a 15GB max doesn't work.

In summary, although the CDN space is already a commodity with multiple interchangeable vendors, we are several years from having multiple cloud vendors that have enough features and scale to be interchangeable. The developer ecosystem around AWS concepts and APIs is dominant, so I don't see any value in alternative concepts and APIs, please try to build AWS clones that scale. Good luck :-)