
Monday, October 24, 2011

Rolling Forward and other Deployment Myths

There is more and more writing on Devops lately, which is good and bad. There still remains a small core of thoughtful people that are worth listening to and learning from. There’s more and more marketing from vendors and consultants jumping on the Devops bandwagon. There’s some naïve silliness (“Hire wicked smart people and give them all access to root.”) which can probably be safely ignored. And then there’s stuff that is half-right and half-wrong, too dangerous to be ignored. Like this recent post on Rollbacks and Other Deployment Myths, in which John Vincent lists 5 “myths” about system deployment, which I want to take some time to respond to here.

Change is change?

The author tries to make the point that “Change is neither good or bad. It’s just change.” Therefore we do not need to be afraid of making changes.

I don’t agree. This attitude to change ignores the fact that once a system is up and running and customers are relying on it to conduct their business, whatever change you are making to the system is almost never as important as making sure that the system keeps running properly. Unfortunately, changes often lead to problems. We know from Visible Ops, based on studies of hundreds of companies, that 80% of operational failures are caused by mistakes made during changes. This is where heavyweight process control frameworks like ITIL and COBIT, and detective change control tools like Tripwire, came from: companies had to find some way to get control over IT change and stop shit from breaking.

Yes, I get the point that in IT we tend to over-compensate, and I agree that calling in the Release Police and trying to put up a wall around all changes isn’t sustainable. People don’t have to work this way. But trivializing change, pretending that changes don't lead to problems, is dangerous.

Deploys are not risky?

You can be smart and careful and break changes down into small steps and try to automate code pushes and configuration changes, and plan ahead and stage and review and test all your changes, and after all of this you can still mess up the deploy. Even if you make frequent small changes and simplify the work and practice it a lot.

For systems like Facebook and online games and a small number of other cases, maybe deployment really is a non-issue. I don’t care if Facebook deploys in the middle of the day – I can usually tell when they are doing a “zero downtime” deploy (or maybe they are “transparently” recovering from a failure) because data disappears temporarily or shows up in the wrong order, functions aren’t accessible for a while, forms don’t resolve properly, and other weird shit happens, and then things come back later or they don’t. As a customer, do I care? No. It’s an inconvenience, and it’s occasionally unsettling ("WTF just happened?"), but I get used to it and so do millions of others. That’s because most of us don’t use Facebook or systems like this for anything important.

For business-critical systems handling thousands of transactions a second that are tied into hundreds of other companies’ systems (the world that I work in), this doesn’t cut it. Maybe I spend too much time at this extreme, where even small problems with compatibility that only affect a small number of customers, or slight and temporary performance slowdowns, are a big deal. But most people I work with and talk to in software development and maintenance and system operations agree that deployment is a big deal and needs to be done with care and attention, no matter how simple and small the changes are and no matter how clean and simple and automated the deployment process is.

Rollbacks are a myth?

Vincent wants us to “understand that it’s typically more risky to rollback than rolling forward. Always be rolling forward.”

Not even the Continuous Deployment advocates (who are often some of the most radical – and I think some of the most irresponsible – voices in the Devops community) agree with this – they still roll back if they find problems with changes.

“Rollbacks are a myth” is an echo of the “real men fail forward” crap I heard at Velocity last year and it is where I draw the line. It's one thing to state an extreme position for argument's sake or put up a straw man – but this is just plain wrong.

If you're going to deploy, you have to anticipate rollback, think about it when you make changes, and test rolling back to make sure that it works. All of this is hard. But without a working rollback you have no choice other than to fail forward (whatever that means, since nobody who talks about it actually explains how to do it), and this puts your customers and their business at unnecessary risk. It’s not another valid way of thinking. It’s irresponsible.

James Hamilton wrote an excellent paper, On Designing and Deploying Internet-Scale Services, when he was at Microsoft (now he’s an executive and Distinguished Engineer at Amazon). Hamilton’s paper remains one of the smartest things that anyone has written about how to deal with deployment and operational problems at scale. Everyone who designs, builds, maintains or operates an online system should read it. His position on rollback is simple and obvious and right:
Reverting to the previous version is a rip cord that should always be available on any deployment.
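For what it’s worth, keeping that rip cord available doesn’t have to be complicated. Here is a minimal sketch (the paths, service name and commands are hypothetical illustrations, not anything Hamilton describes): deploy each release into its own directory and flip a symlink, so reverting is the same cheap, atomic operation as going forward.

```python
#!/usr/bin/env python3
"""Sketch of a symlink-switch deploy that keeps rollback one step away.
Hypothetical layout: /srv/app/releases/<version> and /srv/app/current."""
import os
import subprocess

RELEASES = "/srv/app/releases"
CURRENT = "/srv/app/current"

def switch_to(version):
    """Atomically repoint the 'current' symlink at the given release."""
    tmp_link = CURRENT + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)                     # clean up a previous failed switch
    os.symlink(os.path.join(RELEASES, version), tmp_link)
    os.replace(tmp_link, CURRENT)               # atomic rename on POSIX filesystems
    subprocess.run(["systemctl", "reload", "app"], check=True)  # hypothetical service name

def deploy(new_version, previous_version):
    """Roll forward, but keep the previous release ready as the rip cord."""
    try:
        switch_to(new_version)
        # ... run smoke tests here; raise if anything looks wrong ...
    except Exception:
        switch_to(previous_version)             # rollback is the same cheap operation
        raise
```

Proving that the rollback path actually works belongs in your deployment testing, not in the middle of an outage.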
Everything fails. Embrace failure.

I agree that everything can and will fail some day, and that we can’t pretend that we can prevent failures in any system. But I don’t agree with embracing failure, at least not in business-critical enterprise systems, where recovering from a failure means lost business and requires unraveling chains of transactions between different upstream and downstream systems and different companies, messing up other companies' businesses as well as your own, and dealing with the follow-on compliance problems. Failures in these kinds of systems, and in a lot of other systems, are ugly and serious, and they should be treated seriously.

We do whatever we can to make sure that failures are controlled and isolated, and we make sure that we can recover quickly if something goes wrong (which includes being able to roll back!). But we also do everything that we can to prevent failures. Embracing failure is fine for online consumer web site startups – let’s leave it to them.

SLAs

I wanted to respond to the points about SLAs, but it’s not clear to me what the author was trying to say. SLAs are not about servers. Umm, yes that’s right of course...

SLAs are important to set business expectations with your customers (the people who are using the system) and with your partners and suppliers. So that your partners and suppliers know what you need from them and what you are paying them for, and so that you know if you can depend on them when you have to. So that your customers know what they are paying you for. And SLAs (not just high-level uptime SLAs, but SLAs for Recovery Time and Recovery Point goals and incident response and escalation) are important so that your team understands the constraints that they need to work under, what trade-offs to make in design and implementation and in operations.
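None of this has to stay as words in a contract. As a purely illustrative sketch (every number and name below is invented), writing the targets down in a form the team can check incidents against makes the trade-offs concrete:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class ServiceSLA:
    """Illustrative SLA targets; all of the figures here are invented examples."""
    uptime_pct: float              # e.g. 99.95%, measured monthly
    rto: timedelta                 # Recovery Time Objective: how long until service is restored
    rpo: timedelta                 # Recovery Point Objective: how much data loss is tolerable
    incident_response: timedelta   # time to first response and escalation

ORDER_SERVICE_SLA = ServiceSLA(
    uptime_pct=99.95,
    rto=timedelta(minutes=15),
    rpo=timedelta(seconds=0),          # no committed transactions may be lost
    incident_response=timedelta(minutes=5),
)

def breached_rto(outage_duration: timedelta, sla: ServiceSLA) -> bool:
    """Did an outage blow through the recovery-time objective?"""
    return outage_duration > sla.rto
```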

Under-compensating is worse than Over-compensating

I spent more time than I thought I would responding to this post, because some of what the author says is right – especially in the second part of his post, Deploy All the Things, where he provides some good practical advice on how to reduce risk in deployment. He’s right that Operations' main purpose isn’t to stop change – it can’t be. We have to be able to keep changing, and developers and operations have to work together to do this in safe and efficient ways. But trivializing the problems and risks of change, and over-simplifying how to deal with those risks and with failures, isn’t the way to do this. There has to be a middle way between the ITIL and COBIT world of controls and paper and process, and cool Web startups failing forward, a way that can really work for the rest of us.

Monday, May 24, 2010

Fast or Secure Software Development - but you can't have both

There is a lot of excitement in the software development community around Lean Software Development: applying Japanese manufacturing principles to software development to reduce waste and eliminate overhead, to streamline production, to get software out to customers and get their feedback as quickly as possible.

Some people are going so far as to eliminate review gates and release overhead as waste: “iterations are muda”. The idea of Continuous Deployment takes this to the extreme: developers push software changes and fixes out immediately to production, with all the good and bad that you would expect from doing this.

Continuous Deployment has been used successfully at IMVU and WordPress and Facebook and other Web 2.0 companies. The CEO of Automattic, the company behind WordPress.com, recently bragged about their success with Continuous Deployment:
“The other day we passed product release number 25,000 for WordPress.com. That means we’ve averaged about 16 product releases a day, every day for the last four and a half years!”
I am sure that he is not proud of their history of security problems however, which you can read about here, here, here, here, here and elsewhere.

And Facebook? You can read about how they use Continuous Deployment practices to push code out to production several times a day. As for their security posture, Facebook has "faced" a series of severe security and privacy problems and continues to run into them, as recently as last week.

I’ve ranted before about the risks that Continuous Deployment forces on customers. Continuous Deployment is based on the rather naïve assumption that if something is broken, if you did something wrong, you will know right away: either through your automated tests or by monitoring the state of production, errors and changes in usage patterns, or from direct customer feedback. If it doesn’t look like it’s working, you roll it back as soon as you can, before the next change is pushed out. It’s all tied to a direct feedback loop.
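Sketched as code, that assumption is essentially the following loop (the threshold, window and hooks are hypothetical placeholders, not anyone's actual pipeline):

```python
import time

ERROR_RATE_THRESHOLD = 0.01   # hypothetical: revert if more than 1% of requests fail
WATCH_WINDOW_SECONDS = 300    # hypothetical: watch for five minutes, then trust the change

def current_error_rate():
    """Placeholder for whatever production monitoring actually reports."""
    raise NotImplementedError

def deploy_change(push, rollback):
    """The continuous-deployment bet: if metrics look fine for a few minutes,
    the change is assumed good; otherwise revert before the next push."""
    push()
    deadline = time.time() + WATCH_WINDOW_SECONDS
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            rollback()
            return False
        time.sleep(10)
    return True
```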

Of course it’s not always that simple. Security problems don’t show up like that, they show up later as successful exploits and attacks and bad press and a damaged brand and upset customers and the kind of mess that Facebook is in again. I can’t believe that the CEO of Facebook appreciates getting this kind of feedback on his company's latest security and privacy problems.

Now, maybe the rules about how to build secure and reliable software don’t, and shouldn’t, all apply to Web 2.0, as Dr. Boaz Gelbord proposes:
“Facebook are not fools of course. You don't build a business that engages every tenth adult on the planet without honing a pretty good sense for which way the wind is blowing. The company realizes that it is under no obligation to provide any real security controls to its users.”
Maybe to be truly open and collaborative, you are obliged to make compromises on security and data integrity and confidentiality. Some of these Web 2.0 sites like Facebook are phenomenally successful, and it seems that most of their customers don’t care that much about security and privacy, and as long as you haven’t been foolish enough to use tools like Facebook to support your business in a major way, maybe that’s fine.

And I also don’t care how a startup manages to get software out the door. If Continuous Deployment helps you get software out faster to your customers, and your customers are willing to help you test and put up with whatever problems they find, if it gives you a higher chance of getting your business launched, then by all means consider giving it a try.

Just keep in mind that some day you may need to grow up and take a serious look at how you build and release software – that the approach that served you well as a startup may not cut it any more.

But let’s not pretend that this approach can be used for online mission-critical or business-critical enterprise or B2B systems, where your system may be hooked up to dozens or hundreds of other systems, where you are managing critical business transactions. Enterprise systems are not a game:
“I understand why people would think that a consumer internet service like IMVU isn't really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a sixteen-year-old girl complaining that your new release ruined their birthday party. That's where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder.”
This is a joke right?

But seriously, I get concerned when thoughtful people in the development community, people like Kent Beck and Michael Feathers start to explore Continuous Deployment Immersion and zero-length iterations. These aren’t kids looking to launch a Web 2.0 site, they are leaders who the development community looks to for insight, for what is important and right in building software.

There is a clear risk here of widening the already wide disconnect between the software development community and the security community.

On one side we have Lean and Continuous Deployment evangelists pushing us to get software out faster and cheaper, reducing the batch size, eliminating overhead, optimizing for speed, optimizing the feedback loop.

On the other side we have the security community pleading with us to do more upfront, to be more careful and disciplined and thoughtful, to invest more in training and tools and design and reviews and testing and good engineering, all of which adds to the cost and time of building software.

Our job in software development is to balance these two opposing pressures: to find a way to build software securely and efficiently, to take the good ideas from Lean, and from Continuous Deployment (yes, there are some good ideas there in how to make deployment more automated and streamlined and reliable), and marry them with disciplined secure development and engineering practices. There is an answer to be found, but we need to start working on it together.

Tuesday, March 2, 2010

Continuously Putting Your Customers at Risk

Over the past year or so there has been some buzz around Continuous Deployment: immediately and constantly pushing out new features directly to customers. This practice is followed by companies such as Facebook, Flickr and IMVU, where apparently programmers push changes out to production up to 50 times per day.

Continuous integration and rapid deployment have a number of advantages, and continuous deployment appears to maximize value from these ideas: immediate and continuous feedback from customers, and cutting deployment and release costs way down, eliminating overheads and manual checks.

A profile on Extreme Agility at Facebook describes how the company’s small development team deploys rapid updates to production:
“…Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of the system and surely fixing any bugs that would result from these frequent small changes.”
But there are some fundamental and serious challenges with continuous deployment. Success depends on a few key factors:
  1. a comprehensive and fast automated test suite to catch mistakes, especially regressions;
  2. customers that are willing to let you test in production;
  3. an architecture that catches and isolates failures, preventing problems from chaining or cascading across the cluster;
  4. a disciplined, proven, robust deployment model.

Automated Testing


At IMVU, all changes are run through an automated test suite, which executes on a cluster of test servers, before deploying to production.
"So what magic happens in our test suite that allows us to skip having a manual Quality Assurance step in our deploy process? The magic is in the scope, scale and thoroughness.”
The author goes on to say that
“We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day.”
Ummm, actually, no it isn't. That’s 15k test cases, run 70 times a day [with no explanation of why the tests are run 70 times per day, since we’re talking about pushing out 50 changes per day, but anyway…]. You could run 15k test cases a million times and it would still be 15k test cases, in the same way that running 1 test case a million times is still only 1 test case.

A regression suite of 15,000 automated unit and functional tests sounds impressive – of course what is much more important than the number of tests is the quality of tests. We’re a small shop, and we run more than 15,000 automated tests as part of our continuous integration environment, and we also run static analysis checks on all code, and we do peer code reviews, and manual testing including operations tests, exploratory and destructive tests, system trials, stress and performance testing, and end-to-end integration tests before we push to production.

From subsequent statements, it seems clear that most of the changes deployed this way, at IMVU at least, are trivial: bug fixes and minor modifications. Schema changes, for example, are made out of band (they take 2 days to roll out to production at IMVU). So we can assume that the scope, and therefore the risk, of any one change can be contained. This is backed up by a comment made by the author at 50 Deployments A Day and the Perpetual Beta:
“when working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks, etc), we’ll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll-out and A/B testing.”
I’m not sure why you wouldn’t have a controlled roll-out plan for solving scalability bottlenecks, but let’s assume that he was referring to minor tweaks to the code or configuration, say increasing the size of a resource pool or something.

Testing in Production


After hearing about the approach followed by IMVU last year, a couple of exploratory testing experts, Michael Bolton and James Bach, spent a few minutes trying out IMVU’s system. They, not surprisingly, found a lot of problems without making much of an effort:
“Yes folks, you can deploy 50 times a day. If you don’t care about the quality of what you’re deploying…”
The writer from IMVU admits that they have a lot of bugs:
“continuous deployment lets you write software *regression free*, it sure doesn’t gift you high quality software.”
Ignore the “*regression free*” claim which assumes that the test suite will always catch *all* regressions. Continuous Deployment essentially concedes the job of testing to your customers: you do some superficial reviews and regression tests, and leave the real work of finding problems to your customers. I can appreciate that this might be acceptable to some customers, for example people participating in online communities or online games, as they trade off the inconvenience of occasional glitches against the chance to try out cool new features quickly.

There’s nothing wrong with running experiments, trying out new ideas with part of your customer base through A/B split testing, seeing what sticks and what doesn’t. But this doesn’t mean you need to, or should, deploy every change directly to production. Again, if your customers are trusting you with financial transactions or sensitive personal information, you would be irresponsible if you took this approach, even if, like Facebook, you only push out changes incrementally to a small number of clients at a time.
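The incremental part, at least, is simple to sketch (the bucketing scheme below is a generic illustration, not Facebook's mechanism): hash each user into a stable bucket and expose the new code path to a small percentage at a time.

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Stable assignment: the same user always lands in the same bucket,
    so a feature can be exposed to, say, 1% of users and grown gradually."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000          # buckets 0..9999
    return bucket < rollout_percent * 100         # e.g. 1.0% -> buckets 0..99

# Example: show a hypothetical experimental checkout flow to 1% of users.
use_new_checkout = in_rollout("user-12345", "new-checkout", 1.0)
```

Stable hashing keeps the same customer in the same group as the percentage is turned up, which is what makes the A/B comparison meaningful.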

Failure Isolation and Fail Safe


If you are going to continually roll out changes, and anticipate that some of these changes will fail, the architecture of the system needs to isolate and contain failures, using internal firewalling, fast-fail techniques, timeouts and retries and so on to reduce the likelihood of a failure chaining through layers or cascading across servers and taking the cluster down.
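To make that concrete, here is a rough sketch of the sort of guard rail meant here (the limits are arbitrary examples): bound how long you will wait on a dependency, retry a couple of times, and fail fast once it is clearly unhealthy instead of letting requests pile up behind it.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class FastFailGuard:
    """Crude circuit-breaker sketch: time out slow calls, retry a bounded
    number of times, and after repeated failures stop calling the dependency
    at all for a cooldown period, failing fast instead of piling up work."""

    def __init__(self, call_timeout=2.0, max_retries=2, trip_after=5, cooldown=30.0):
        self.call_timeout = call_timeout
        self.max_retries = max_retries
        self.trip_after = trip_after
        self.cooldown = cooldown
        self.failures = 0
        self.tripped_until = 0.0
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call(self, fn, *args):
        if time.time() < self.tripped_until:
            raise RuntimeError("dependency unavailable, failing fast")
        for _ in range(self.max_retries + 1):
            future = self._pool.submit(fn, *args)
            try:
                result = future.result(timeout=self.call_timeout)  # bound the wait
                self.failures = 0
                return result
            except Exception:                                      # timeout or call error
                self.failures += 1
                if self.failures >= self.trip_after:
                    self.tripped_until = time.time() + self.cooldown
                    break
        raise RuntimeError("dependency call failed after retries")
```

A caller would wrap its remote lookups in something like guard.call(fetch_profile, user_id) and treat the fast-fail error as a degraded but contained response.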

Unfortunately, this doesn’t seem to be the case, at least in the example described here, where Alex, a programmer, is preparing to deploy code containing a 1-character typo which can cause a failure cascade and take out the site:
“Alex commits. Minutes later warnings go off that the cluster is no longer healthy. The failure is easily correlated to Alex’s change and her change is reverted. Alex spends minimal time debugging, finding the now obvious typo with ease. Her changes still caused a failure cascade, but the downtime was minimal."
So the development team is not only conceding that they cannot write good code, and that they are incapable of doing a decent job testing their work, but also that they cannot take the steps to fail safe at an architectural level and minimize the scope of any failures that they cause. This is not the same as conceding, as Google does, that at massive scale, failures will inevitably happen, and you will have to learn how to deal with them. This is simply giving up, and pushing risk out to customers, again.

A Deployment Model that Works


The deployment model at IMVU is cool: of course, if you are going to do something 50 times per day, you should be pretty good at it.
“The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed.”
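Mechanically, the push described in that quote amounts to a canary loop along these lines (a sketch only; the hosts, commands, metric sampling and the "significant regression" test are all invented stand-ins):

```python
import subprocess

CANARY_HOSTS = ["web01", "web02"]                   # hypothetical small subset
ALL_HOSTS = CANARY_HOSTS + ["web03", "web04"]       # stand-in for the whole cluster

def sample_error_rate(hosts):
    """Placeholder for sampling load average, CPU, errors, etc. across hosts."""
    raise NotImplementedError

def push(hosts, version):
    for host in hosts:
        subprocess.run(["rsync", "-a", f"releases/{version}/", f"{host}:/srv/app/{version}/"], check=True)
        subprocess.run(["ssh", host, "ln", "-sfn", f"/srv/app/{version}", "/srv/app/current"], check=True)

def canary_deploy(version, rollback):
    baseline = sample_error_rate(ALL_HOSTS)                 # metrics before the change
    push(CANARY_HOSTS, version)                             # live for the first few customers
    if sample_error_rate(CANARY_HOSTS) > baseline * 1.5:    # crude regression check
        rollback()
        return False
    push(ALL_HOSTS, version)                                # then everyone, watched the same way
    return True
```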
This rollback model assumes that problems will be found in the first minute, or few minutes of operation. It does not account for race conditions, deadlocks and other synchronization faults, or time-dependent problems or statistical bugs or downstream integration conflicts or intermittent problems which might not show up for hours or even days, by which time dozens or hundreds of other changes have been applied. Good luck finding out where the problem came from by then.

The rollback approach also assumes that all that needs to be done to fix the problem is to rollback the code. It does not account for what needs to be done to track down and repair broken transactions and corrupt data, which might be ok in an online gaming environment, but would be CTO-suicide in a real system.

And you wonder why some web sites have serious software security issues?


Let’s put aside the arguments above about reliability and quality and responsibility, and just look at the problem of security: building secure software in an environment where developers push each change immediately to production.

First, try to imagine someone detecting unauthorized changes with all of this going on. Was every one of those changes intended and authorized? Would you be able to detect an attack in the middle of all of this noise?
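To even ask that question you need something to compare against. A minimal sketch of the idea (not any particular tool, and nothing the post describes): record the file hashes each authorized release is supposed to produce, and flag anything on disk that doesn't match.

```python
import hashlib
import os

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def unauthorized_changes(deploy_root: str, approved_manifest: dict) -> list:
    """Compare what is actually deployed against the hashes recorded for
    authorized releases; anything unexpected is a change nobody signed off on."""
    findings = []
    for dirpath, _, filenames in os.walk(deploy_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, deploy_root)
            expected = approved_manifest.get(rel)
            if expected is None:
                findings.append(("unexpected file", rel))
            elif file_sha256(path) != expected:
                findings.append(("modified file", rel))
    return findings
```

With dozens of pushes a day, that manifest churns constantly, which is exactly the noise problem.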

Then there’s the problem of building secure software, which is hard enough even if you follow good practice, especially if you are moving fast. There has been a lot of work over the past couple of years defining how developers can build secure software in an agile way. Microsoft’s SDL for Agile maps secure software design and development practices to agile development models like Scrum, and cuts secure software development controls and practices down to fit rapid incremental development.

But with Continuous Deployment, at least as described so far, there is no time or opportunity to do even a minimal set of security checks and reviews before software changes are pushed out by developers.

It’s bad enough to build insecure software out of ignorance. But by following continuous deployment, you are consciously choosing to push out software before it is ready, before you have done even the minimum to make sure it is safe. You are putting business agility and cost savings ahead of protecting the integrity or privacy of customer data.

Continuous deployment sounds cool. In a world where safety and reliability and privacy and security aren’t important, it would be fun to try. But like a lot of other developers, I live in the real world. And I need to build real software.