
Wednesday, June 17, 2015

Does DevOps Reduce Technical Debt – or Make it Worse?

DevOps can help reduce technical debt in some fundamental ways.

Continuous Delivery/Deployment

First, building a Continuous Delivery/Deployment pipeline and automating the work of migration and deployment forces you to clean up holes and inconsistencies in configuration and code deployment, and inconsistencies between development, test and production environments.

And automated Continuous Delivery and Infrastructure as Code get rid of dangerous one-of-a-kind snowflakes and the configuration drift caused by making configuration changes and applying patches manually over time. Which makes systems easier to set up and manage, and reduces the risk of an unpatched system becoming the target of a security attack or the cause of an operational problem.

A CD pipeline also makes it easier, cheaper and faster to pay down other kinds of technical debt. With Continuous Delivery/Deployment, you can test and push out patches and refactoring changes and platform upgrades faster and with more confidence.

Positive Feedback

The Lean feedback cycle and Just-in-Time prioritization in DevOps ensures that you’re working on whatever is most important to the business. This means that bugs and usability issues and security vulnerabilities don’t have to wait until after the next feature release to get fixed. Instead, problems that impact operations or the users will get fixed immediately.

Teams that do Blameless Post-Mortems and Root Cause(s) Analysis when problems come up will go even further, and fix problems at the source and improve in fundamental and important ways.

But there’s a negative side to DevOps that can add to technical debt costs.

Erosive Change

Michael Feathers’ research has shown that constant, iterative change is erosive: the same code gets changed over and over, the same classes and methods become bloated (because it is naturally easier to add code to an existing method or a method to an existing class), structure breaks down and the design is eventually lost.

DevOps can make this even worse.

DevOps and Continuous Delivery/Deployment involve pushing out lots of small changes, running experiments and iteratively tuning features and the user experience based on continuous feedback from production use.

Many DevOps teams work directly on the code mainline, “branching in code” to “dark launch” code changes, while code is still being developed, using conditional logic and flags to skip over sections of code at run-time. This can make the code hard to understand, and potentially dangerous: if a feature toggle is turned on before the code is ready, bad things can happen.

Feature flags are also used to run A/B experiments and to control risk on release, by rolling out a change incrementally to a few users to start. But the longer feature flags are left in the code, the harder the code becomes to understand and change.
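
To make this concrete, here is a minimal sketch of a run-time feature flag with a percentage rollout. The flag store, flag names and checkout functions are hypothetical, for illustration only:

```python
import hashlib

# Hypothetical flag store: an on/off switch plus a percentage rollout,
# for incremental releases and A/B-style experiments.
FLAGS = {
    "new_checkout_flow": {"enabled": True, "rollout_percent": 10},
}

def flag_is_on(flag_name: str, user_id: str) -> bool:
    """Bucket users deterministically by hashing their id, so the same
    user gets the same experience for the whole rollout."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

def new_checkout(cart): ...   # dark-launched path, still being developed
def old_checkout(cart): ...   # the proven path everyone else gets

def checkout(user_id: str, cart):
    # "Branching in code": both paths live on the mainline, and the flag
    # decides at run-time which one executes.
    if flag_is_on("new_checkout_flow", user_id):
        return new_checkout(cart)
    return old_checkout(cart)
```

Every flag like this doubles the paths through the code that have to be understood and tested, and flipping "enabled" one deploy too early is exactly the risk described above – which is why cleaning up stale flags matters.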

There is a lot of housekeeping that needs to be done in DevOps: upgrading the CD pipeline and making sure that all of the tests are working; maintaining Puppet or Chef (or whatever configuration management tool you are using) recipes; disciplined, day-to-day refactoring; keeping track of features and options and cleaning them up when they are no longer needed; getting rid of dead code; and trying to keep the code as simple as possible.

Microservices and Technology Choices

Microservices are a popular architectural approach for DevOps teams.

This is because loosely-coupled Microservices are easier for individual teams to independently deploy, change, refactor or even replace.

And a Microservices-based approach provides developers with more freedom when deciding on language or technology stack: teams don’t necessarily have to work the same way, they can choose the right tool for the job, as long as they support an API contract for the rest of the system.

In the short term there are obvious advantages to giving teams more freedom in making technology choices. They can deliver code faster, quickly try out prototypes, and teams get a chance to experiment and learn about different technologies and languages.

But Microservices “are not a free lunch”. As you add more services, system testing costs and complexity increase. Debugging and problem solving gets harder. And as more teams choose different languages and frameworks, it’s harder to track vulnerabilities, harder to operate, and harder for people to switch between teams. Code gets duplicated because teams want to minimize coupling and it is difficult or impossible to share libraries in a polyglot environment. Data is often duplicated between services for the same reason, and data inconsistencies creep in over time.

Negative Feedback

There is a potentially negative side to the Lean delivery feedback cycle too.

Constantly responding to production feedback, always working on what’s most immediately important to the organization, doesn’t leave much space or time to consider bigger, longer-term technical issues, or to work on paying off deeper architectural and technical design debt that results from poor early decisions or incorrect assumptions.

Smaller, more immediate problems get fixed fast in DevOps. Bugs that matter to operations and the users can get fixed right away instead of waiting until all the features are done, and patches and upgrades to the run-time can be pushed out more often. Which means that you can pay off a lot of debt before costs start to compound.

But behind-the-scenes, strategic debt will continue to add up. Nothing’s broken, so you don’t have to fix anything right away. And you can’t refactor your way out of it either, at least not easily. So you end up living with a poor design or an aging technology platform, slowly losing your ability to respond to changes and to come up with new solutions. Or you keep filling in security holes as they come up, and scrambling to scale as load increases.

DevOps can reduce technical debt. But only if you work in a highly disciplined way. And only if you raise your head up from tactical optimization to deal with bigger, more strategic issues before they become real problems.

Friday, June 5, 2015

Software Architecture in DevOps

A new book by Len Bass, Ingo Weber and Liming Zhu “DevOps: A Software Architect’s Perspective”, part of the SEI Series in Software Engineering, looks at how DevOps affects architectural decisions, and a software architect’s role in DevOps.

The authors focus on the goals of DevOps: to get working software into production as quickly as possible while minimizing risk, balancing time-to-market against quality.

“DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while insuring high quality”
These fundamental practices are:
  1. Engaging operations as a customer and partner, “a first-class stakeholder”, in development. Understanding and satisfying requirements for deployment, logging, monitoring and security in development of an application.
  2. Engaging developers in incident handling. Developers taking responsibility for their code, making sure that it is working correctly, helping (often taking the role of first responders) to investigate and resolve production problems.
    This includes the role of a “reliability engineer” on every development team, someone who is responsible for coordinating downstream changes with operations and for ensuring that changes are deployed successfully.
  3. Ensuring that all changes to code and configuration are done using automated, traceable and repeatable mechanisms – a deployment pipeline.
  4. Continuous Deployment of changes from check-in to production, to maximize the velocity of delivery, using these pipelines.
  5. Infrastructure as Code. Operations provisioning and configuration through software, following the same kinds of quality control practices (versioning, reviews, testing) as application software.

Culture and collaboration between developers and operations, shared values and organizational issues – the softer, people side of DevOps – are considered only insofar as they are factors that could affect time-to-market, delivery velocity or quality.

Cloud Architecture and Microservices

As a reference for architects, the book focuses on architectural considerations for DevOps. It walks through how Cloud-based systems work, virtualization concepts and especially microservices.

While DevOps does not necessarily require making major architectural changes, the authors argue that most organizations adopting DevOps will find that a microservices-based approach, as pioneered at organizations like Netflix and Amazon, minimizes dependencies between different parts of the system and between different teams – and so minimizes the time required to get changes into production, the first goal of DevOps.

Conway’s Law also comes into play here. DevOps work is usually done by small agile cross-functional teams solving end-to-end problems independently, which means that they will naturally end up building small, independent services:

“Having an architecture composed of small services is a response to having small teams.”

But there are downsides and costs to a microservice-based approach.

As Martin Fowler and James Lewis point out, microservices introduce many more points of failure. Which means that resilience has to be designed and built into each service. Services cannot trust their clients or the other services that they call out to. You need to add defensive checking on data and anticipate failures of other services, implement time-outs and retries, and fallback alternatives or safe default behaviors if another service is unavailable. You also need to design your service to minimize the impact of failure on other services, and to make it easier and faster to recover/restart.
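
A sketch of what these defensive patterns look like in code – the downstream service, its URL and the fallback value are hypothetical:

```python
import time
import urllib.error
import urllib.request

def get_recommendations(user_id: str, retries: int = 2,
                        timeout: float = 0.5) -> bytes:
    """Call a (hypothetical) downstream service with a time-out, a bounded
    number of retries with back-off, and a safe default on failure."""
    url = f"https://recommendations.internal/users/{user_id}"
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt < retries:
                time.sleep(0.1 * (2 ** attempt))  # back off before retrying
    # Fall back to "no recommendations" rather than failing the caller.
    return b"[]"
```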

Microservices also increase the cost and complexity of end-to-end system testing. Run-time performance and latency degrade due to the overhead of remote calls. And monitoring and troubleshooting in production can be much more complicated, since a single action often involves many microservices working together (at LinkedIn, for example, a single user request may chain to as many as 70 services).

DevOps in Architecture: Monitoring

In DevOps, monitoring becomes a much more important factor in architecture and design, in order to meet operations requirements.

The chapter on monitoring explains what you need to monitor and why, DevOps metrics, challenges in monitoring systems under continuous change, monitoring microservices and monitoring in the Cloud, and common log management and monitoring tools for online systems.

Monitoring also becomes an important part of live testing in DevOps (Monitoring as Testing), and plays a key role in Continuous Deployment. The authors look at common kinds of live testing, including canaries, A/B testing, and Netflix’s famous Simian Army in terms of passive checking (Security Monkey, Compliance Monkey) and active live testing (Chaos Monkey and Latency Monkey).

DevOps in Architecture: Security

Security is another important cross-cutting concern in software architecture addressed in this book. It looks at security fundamentals, including how to identify threats (using Microsoft’s STRIDE model) and the resources that need to be protected, the CIA triad, identity management, and access controls. It provides an overview of the security controls in NIST 800-53, and common security issues with VMs and in Cloud architectures (specifically AWS).

In DevOps, security needs to be wired into Continuous Deployment:

  1. Enforcing that all changes to code and configuration are made through the Continuous Deployment pipeline
  2. Including security testing at different stages of the Continuous Deployment pipeline
  3. Securing the pipeline itself, including the logs and the artifacts (see the sketch below)
and security checks need to be part of monitoring (such as Netflix’s Compliance Monkey and Security Monkey).
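
As one example of point 3, here is a sketch of a deployment-time check that the artifact being deployed is exactly the artifact the pipeline built and recorded. The manifest format is an assumption; in practice the manifest itself would also be signed:

```python
import hashlib
import json
import sys

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(artifact_path: str, manifest_path: str) -> None:
    # The build stage is assumed to have recorded the artifact's hash in a
    # manifest; the deploy stage refuses anything that doesn't match it.
    with open(manifest_path) as f:
        manifest = json.load(f)
    if sha256_of(artifact_path) != manifest["sha256"]:
        sys.exit("artifact hash mismatch - refusing to deploy")

if __name__ == "__main__":
    verify_artifact(sys.argv[1], sys.argv[2])
```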

Continuous Deployment Pipeline and Gatekeepers

Developers – and architects – have to take responsibility for building their automated testing and deployment pipelines. The book explains how Continuous Deployment leverages Continuous Integration, and common approaches to code management and test automation. And it emphasizes the role of gatekeepers along the pipeline – manual decisions or automated checks at different points to determine if it is ok to go forward, from development to testing to staging to live production testing and then to production.

DevOps and Modern Software Architecture

“DevOps: A Software Architect’s Perspective” does a good job of explaining common DevOps practices, especially Continuous Deployment, in a development, instead of operations, context. It also looks at contemporary issues in software architecture, including virtualization and microservices.

It is less academic than Bass’s other book “Software Architecture in Practice”, and emphasizes the importance of real-world operations concerns like reliability, security and transparency (monitoring and live checks and testing) in architecture and deployment.

This is a book written mostly for enterprise software architects and managers who want to understand more about DevOps and Continuous Deployment and Cloud services.

If you’re already deep into DevOps and working with microservices in the Cloud, you probably won’t find much new here.

But if you are looking at how to apply DevOps at scale, or how to migrate legacy enterprise systems to microservices and the Cloud, or if you are a developer who wants to understand operations and what DevOps will mean to you and your job, this is worth reading.

Thursday, April 30, 2015

Can DevOps(Sec) make Software more Secure?

There was a lot of talk at RSA this year about DevOps and security: DevOpsSec or DevSecOps or Rugged DevOps or whatever people want to call it. This included a full-day seminar on DevOps before the conference opened and several talks and workshops throughout the conference which tried to make the case that DevOps isn’t just about delivering software faster, but making software better and more secure; and that DevOps isn't just for the Cloud, but that it can work in the enterprise.

Rugged DevOps

The Rugged DevOps story is based on a few core ideas:

Delivering smaller changes, more often, reduces complexity. Smaller, less complex changes are easier to code and test and review, and easier to troubleshoot when something goes wrong. And this should result in safer and more secure code: less complex code has fewer bugs, and code that has fewer bugs also has fewer vulnerabilities.

If you’re going to deliver code more often, you need to automate and streamline the work of testing and deployment. A standardized, repeatable and automated build and deployment pipeline, with built-in testing and checks, enables you to push changes out much faster and with much more confidence, which is important when you are trying to patch a critical vulnerability.

And using an automated deployment pipeline for all changes – changes to application code and configuration and changes to infrastructure – provides better change control. You know what was changed, by who and when, on every system, and you can track all changes back to your version control system.

But this means that you need to re-tool and re-think how you do deployment and configuration management, which is why so many vendors – not just Opscode and Puppet Labs, but classic enterprise vendors like IBM – are so excited about DevOps.

The DevOps Security Testing Problem

And you also need to re-tool and re-think how you do testing, especially system testing and security testing.

In DevOps, with Continuous Delivery or especially Continuous Deployment to production, you don’t have a “hardening sprint” where you can schedule a pen test or in-depth scans or an audit or operational reviews before the code gets deployed. Instead, you have to do your security testing and checks in-phase, as changes are checked-in. Static analysis engines that support incremental checking can work here, but most other security scanning and testing tools that we rely on today won’t keep up.

Which means that you’ll need to write your own security tests. But this raises a serious question. Who’s going to write these tests?

Infosec? There’s already a global shortage of people who understand application security today. And most of these people – the ones who aren’t working at consultancies or for tool vendors – are busy doing risk assessments and running scans and shepherding the results through development to get vulnerabilities fixed, or maybe doing secure code reviews or helping with threat modeling in a small number of more advanced shops. They don’t have the time or often the skills to write automated security tests in Ruby or whatever automated testing framework that you select.

QA? In more and more shops today, especially where Agile or DevOps methods are followed, there isn't anybody in QA, because manual testers who walk through testing checklists can’t keep up, so developers are responsible for doing their own testing.

When it comes to security testing, this is a problem. Most developers still don’t have the application security knowledge to understand how to write secure code, which means that they also don’t understand enough about security to know what security tests need to be written. And writing an automated attack in Gauntlt (and from what I can tell, more people are talking about Gauntlt than writing tests with it) is a lot different than writing happy path automated unit tests in JUnit or UI-driven functional tests in Selenium or Watir.

So we shouldn’t expect too much from automated security testing in DevOps. There’s not enough time in a Continuous Delivery pipeline to do deep scanning or comprehensive fuzzing especially if you want to deploy each day or multiple times per day, and we won’t get real coverage from some automated security tests written in Gauntlt or Mittn.

But maybe that’s ok, because DevOps could force us to change the way that we think about and the way that we do application security, just as Agile development changed the way that most of us design and build applications.

DevOpsSec – a Forcing Factor for Change

Agile development pushed developers to work more closely with each other and with the Customer, to understand real requirements and priorities and to respond to changes in requirements and priorities. And it also pushed developers to take more responsibility for code quality and for making sure that their code actually did what it was supposed to, through practices like TDD and relentless automated testing.

DevOps is pushing developers again, this time to work more closely with operations and infosec, to understand what’s required to make their code safe and resilient and performant. And it is pushing developers to take responsibility for making their code run properly in production:

“You build it, you run it”
Werner Vogels, CTO Amazon

When it comes to security, DevOps can force a fundamental change in how application security is done today, from "check-then-fix" to something that will actually work: building security in from the beginning, where it makes the most difference. But a lot of things have to change for this to succeed:

Developers need better appsec skills, and they need to work more closely with ops and with infosec, so that they can understand security and operational risks and understand how to deal with them proactively. Thinking more about security and reliability in requirements and design, understanding the security capabilities of their languages and frameworks and using them properly, writing more careful code and reviewing code more carefully.

Managers and Product Owners need to give developers the time to learn and build these skills, and the time to think through design and to do proper code reviews.

Infosec needs to become more iterative and more agile, to move out front, so that they can understand changing risks and threats as developers adopt new platforms and new technologies (the Cloud, Mobile, IoT, …). So that they can help developers design and write tools and tests and templates instead of preparing checklists – to do what Intuit calls “Security as Code”.

DevOps isn’t making software more secure – not yet. But it could, if it changes the way that developers design and build software and the way that most of us think about security.

Tuesday, April 7, 2015

Towards Compliance as Code

Infrastructure as Code is fundamental to DevOps. Automating the work of setting up and maintaining systems infrastructure. Making it defined, efficient, testable, auditable and standardized.

For the many of us who work in regulated environments, we need more. We need Compliance as Code.

Take regulatory constraints and policies and compliance procedures and the processes and constraints that they drive, and wire as much of this as possible into automated workflows and tests. Making it defined, efficient, testable, auditable and standardized.

DevOps Audit Defense Toolkit

Some big steps towards Compliance as Code are laid out in the DevOps Audit Defense Toolkit, a freely-available document which explains how compliance requirements – such as separation of duties between developers and operations, and detecting/preventing unauthorized changes – can be met in a DevOps environment, using some common, basic controls:

  1. Code Reviews. All code changes must be peer reviewed before check-in. Any changes to high-risk code must be reviewed a second time by an expert. Reviewers check code and tests for functional and operational correctness and consistency. They look for coding and design mistakes and gaps, operational dependencies, for back doors and for security vulnerabilities. Which means that developers must be trained and guided in how to do reviews properly. Peer reviews also ensure that changes can’t be pushed without at least one other person on the team understanding what is going on.
  2. Static analysis. Static analysis is run on all changes to catch security bugs and other problems. Any violations of coding rules will break the build.
  3. Automated testing is done in Continuous Integration/Continuous Delivery – unit and integration testing, and security testing. The Audit Toolkit assumes that developers follow TDD to ensure a high level of test coverage. All tests must pass.
  4. Traceability of all changes back to the original request, using a ticketing system like Jira – you can’t just use index cards on a wall to describe stories and throw them out when you are done. (A sketch of one way to enforce this follows the list.)
  5. Operations checks/asserts after deployment and startup, and feedback from operations monitoring and especially from production failures. Metrics and post mortem review findings are used to drive improvements to testing and instrumentation, as well as deeper changes to policy definition, training and hiring – see John Allspaw’s presentation Ops Meta-Metrics: The Currency you use to pay for Change, from Velocity 2010, on how this can be done.
  6. All changes to code and infrastructure definitions, including bug fixes and patches, are deployed through the same automated, auditable Continuous Delivery pipeline.
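
As a small example of how control 4 can be enforced rather than just documented, here is a sketch of a server-side git pre-receive hook that rejects pushes whose commits don’t reference a ticket. The Jira-style key format is an assumption:

```python
import re
import subprocess
import sys

TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # e.g. OPS-1234

def commit_message(rev: str) -> str:
    return subprocess.check_output(
        ["git", "log", "-1", "--format=%B", rev], text=True)

def main() -> None:
    # A pre-receive hook reads "<old> <new> <ref>" lines on stdin. For
    # simplicity this checks only the tip commit; a real hook would walk
    # every commit in old..new.
    for line in sys.stdin:
        old, new, ref = line.split()
        if not TICKET_PATTERN.search(commit_message(new)):
            sys.exit(f"rejected {new[:8]} on {ref}: no ticket reference")

if __name__ == "__main__":
    main()
```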

A starting point

The DevOps Audit Defense Toolkit provides a starting point, an example to build on. You can add your own rules, checks, reviews, tests, and feedback loops.

It is also a work in progress. There are a few important problems still to be worked out:

Major Changes

The Audit Toolkit describes how standard changes can be handled in Continuous Delivery: small, well-defined, low-impact changes that are effectively pre-approved. Operations and management are notified as these changes are deployed (the changes are logged, information is displayed on screens and included in reports), but there is no upfront communication or coordination of these changes, because it shouldn’t be necessary. Developers can push changes out as soon as they are ready, and they get deployed immediately after all reviews and tests and other checks pass.

But the Audit Toolkit is silent on how to manage larger scale changes, including changes to data and databases, changes to interfaces with other systems, changes required to comply with new laws and regulations, major new customer features and technical upgrades. Changes that are harder to rollout, that have wider impact and higher risk, and require much more coordination. Which is, of course, the stuff that matters most.

You need clear and explicit hand-offs to operations and customer service for larger changes, so that all stakeholders understand the dependencies and risks and impact on how they work so that they can plan ahead. This can still be done in a DevOps way, but it does require meetings and planning, and some project management and paperwork. As an example, see how Etsy manages feature launches.

You also need to ensure that the policies for defining which changes are small enough and simple enough to be pre-approved, and for deciding which code changes are high risk and need additional review, are reasonable and unambiguous and consistent. You need to do frequent reviews to ensure that these policies are rigorously followed and that people don’t misunderstand or try to get away with pushing higher-risk, non-standard changes through without management/CAB oversight and explicit change approval.

Done properly, this means that the full weight of change control is only brought to bear when it is needed – for changes that have real operational or business risk. Then you want to find ways to minimize these risks, to break changes down into smaller pieces, to simplify, streamline and automate as much of the work required as possible, leveraging the same testing and delivery infrastructure.

Security testing

There is a lot of attention to responsible security testing in the Audit Toolkit. Because changes are made incrementally and iteratively, and pushed out automatically, you’ll need tools and tests that work automatically, incrementally and iteratively. Which is unfortunately not how most security tools work, and not how most security testing is done today.

There aren’t that many organizations using tools like Gauntlt or BDD-Security to write higher-level automated security tests and checks as part of Continuous Integration or Continuous Delivery. Most of us depend on dynamic and static scanners and fuzzers that can take hours to run and require manual review and attention, or expensive, time-consuming manual pen tests. This clearly can’t be done on every check-in.

But as more teams adopt Agile and now DevOps practices, the way that security testing is done is also changing, in order to keep up. Static analysis tools are getting speedier, and many tools can provide feedback directly to developers in their IDEs, or work against incremental change sets. Dynamic testing tools and services are becoming more scriptable and more scalable and simpler to use, with open APIs.

Interactive security testing tools like Contrast or Quotium Seeker can catch security errors at run-time as the system is being tested in Continuous Integration/Delivery. And companies like Signal Sciences are working on new ways to do agile security for online systems. But this is new ground: there’s still lots of digging and hoeing that needs to be done.

Do developers need access to production?

The Audit Toolkit assumes that developers will have read access to production logs, and that they may also need direct access to production in order to help with troubleshooting and support. Even if you restrict developers to read only access, this raises concerns around data privacy and confidentiality.

And what if read access is not enough? What if developers need to make a hot fix to code or configuration that can’t be done through the automated pipeline, or repair production data? Now you have problems with separation of duties and data integrity.

What should developers be able to do, what should they be able to see? And how can this be controlled and tracked? If you are allowing developers in production, you need to have solid answers for these questions.

Continuous Deployment or Continuous Delivery?

The Audit Toolkit makes the argument that with proper controls in place, developers should be able to push changes directly out to production when they are ready – provided that these changes are low-risk and only if the changes pass through all of the reviews and tests in the automated deployment pipeline.

But this is not something that you have to do – or even can do, in some environments. Not necessarily because of compliance constraints, but because your business environment or your architecture may not support making changes on the fly. Continuous Delivery does not have to mean Continuous Deployment. You can still follow disciplined Continuous Delivery through to pre-production, with all of the reviews and checks in place, and then bundle changes together and release them when it makes sense.

Selling to regulators and auditors

You will need to explain and sell this approach to regulators and auditors – to lawyers or wanna-be lawyers. Convincing them – and helping them – to look at code and logs instead of legal policies and checklists. Convincing them that it’s ok for developers to push low-risk, pre-approved changes to production, if you want to go this far.

Just as beauty is in the eye of the beholder, compliance is in the opinion of the auditor. They may not agree with or understand what you are doing. And even if one auditor does, the next one may not. Be prepared for a hard sell, and for setbacks.

Disciplined, Agile and Lean

The DevOps Audit Defense Toolkit describes a disciplined, but Agile/Lean approach to managing software and system changes in a highly regulated environment.

This is definitely not easy. It’s not lightweight. It takes a lot of engineering discipline. And a lot of investment in automation and in management oversight to make it work.

But it’s still Agile. It supports the rapid pace and iterative, incremental way that development teams want to work today. And Lean. Because all of the work is clearly laid out and automated wherever possible. You can map the value chains and workflows, measure delays and optimize, review and improve.

Instead of detailed policies and procedures and checklists that nobody can be sure are actually being followed, you have automated delivery and deployment processes that you exercise all of the time, so you know they work. Policies and guidelines are used to drive decisions, which means that they can be simpler and clearer and more practical. Procedures and checklists are burned into automated steps and controls.

This could work. It should work. And it’s worth trying to make work. Instead of compliance theater and tedious and expensive overhead, it promises that changes to systems can be made simpler, more predictable, more efficient and safer. That’s something that’s worth doing.

Wednesday, February 25, 2015

DevOps is not a Race

Most of what we read or hear about DevOps emphasizes speed. Continuous Deployment. Fast feedback. Fail fast, fail often.

How many times do we have to hear about how many times Amazon or Facebook or Netflix or Etsy deploy changes every day or every hour or every minute?

Software Development at the Speed of DevOps

Security at the Speed of DevOps

DevOps at the Speed of Google

Devops Explained: A Philosophy of Speed, Not Momentum

It’s all about the Speed: DevOps and the Cloud

Even enterprise DevOps conferences are about speed and more speed.

Speed is Sexy, but...

Speed is sexy. Speed sells. But speed isn’t the point.

Go back to John Allspaw’s early work at Flickr, which helped kick off DevOps. Actually, look at all of Allspaw’s work. Most of it is about minimizing the operational and technical risk of change. Minimizing the chance of making mistakes. Minimizing the impact of mistakes. Minimizing the time needed to detect, understand and recover from mistakes. Learning from mistakes when they happen and improving so that you don't make the same kind of mistakes again or so that you can catch them and fix them quicker. Breaking down silos between dev and ops so that they can work together to solve problems.

La de da, everything’s fine … change happens….OMGWTF OUTAGES!!!!!

Infrastructure as Code and eliminating snowflake servers. Not about maximizing speed.

Checking everything into version control – code, application configuration, server and network configurations… not about maximizing speed.

Breaking releases down into small change sets with fewer moving parts and fewer dependencies makes changes easier to understand, easier to review, easier to test, and simpler and easier to deploy and to roll back or fix. This is not about maximizing speed.

Executing automated tests in Continuous Integration…

Building out test environments to match production so that developers can test and learn how their system will work under real-world conditions…

Building automated integration and deployment pipelines to test and to production so that you can push out a change or a fix immediately…

Change controls based on transparency and peer reviews and repeatable automated controls instead of CCB meetings…

Auditing all of this so that you know what was changed by who and when…

Developers talking to ops and learning and caring about run-time infrastructure and operations procedures….

Ops talking to developers and learning and caring about the application and how it is built and deployed and configured…

Wiring monitoring and metrics and alerting into the system from the beginning…

Running game days and testing your incident response capabilities with developers and ops…

Dev and ops working through Root Cause(s) Analysis in blameless post-mortems when something goes wrong so that they can learn and improve together…

Injecting automated security testing and checks into your build and deployment chain…

None of this is about speed. It is about building better communications paths and feedback loops between the business and developers and operations. About building a safe, open culture where people can confront mistakes and learn from them together. About building a repeatable, reliable deployment capability. Building better, more resilient software and a better, more resilient and responsive IT delivery and support organization.

DevOps is not a Race

Ignore the vendors who tell you that their latest “DevOps solution” will make your enterprise faster.

And unless you actually are an online consumer startup, ignore the hype about the Lean Startup and Continuous Deployment – this has nothing to do with running an enterprise.

DevOps is a lot of work. Don’t go into it thinking that it’s a race.

Tuesday, July 29, 2014

Devops isn't killing developers – but it is killing development and developer productivity

Devops isn't killing developers – at least not any developers that I know.

But Devops is killing development, or the way that most of us think of how we are supposed to build and deliver software. Agile loaded the gun. Devops is pulling the trigger.

Flow instead of Delivery

A sea change is happening in the way that software is developed and delivered. Large-scale waterfall software development projects gave way to phased delivery and Spiral approaches, and then to smaller teams delivering working code in time boxes using Scrum or other iterative Agile methods. Now people are moving on from Scrum to Kanban, and to One-Piece Continuous Flow with immediate and Continuous Deployment of code to production in Devops.

The scale and focus of development continue to shrink, and so does the time frame for making decisions and getting work done. From phases and milestones and project reviews, to sprints and sprint reviews, to Lean controls over WIP limits and task-level optimization. And the size of deliverables shrinks too: from what a project team could deliver in a year, to what a Scrum team could get done in a month or a week, to what an individual developer can get working in production in a couple of days or a couple of hours.

The definition of “Done” and “Working Software” changes from something that is coded and tested and ready to demo to something that is working in production – now (“Done Means Released”).

Continuous Delivery and Continuous Deployment extend – and effectively replace – Continuous Integration. Rapid deployment to production doesn't leave time for manual testing or for manual testers, which means developers are responsible for catching all of the bugs themselves before code gets to production – or doing their testing in production and trying to catch problems as they happen (aka “Monitoring as Testing”).

Because Devops brings developers much closer to production, operational risks become more important than project risks, and operational metrics become more important than project metrics. System uptime and cycle time to production replace Earned Value or velocity. The stress of hitting deadlines is replaced by the stress of firefighting in production and being on call.

Devops isn't about delivering a project or even delivering features. It’s about minimizing lead time and maximizing flow of work to production, recognizing and eliminating junk work and delays and hand offs, improving system reliability and cutting operational costs, building in feedback loops from production to development, standardizing and automating steps as much as possible. It’s more manufacturing and process control than engineering.

Devops kills Developer Productivity too

Devops also kills developer productivity.

Whether you try to measure developer productivity by LOC or Function Points or Feature Points or Story Points or velocity or some other measure of how much code is written, less coding gets done because developers are spending more time on ops work and dealing with interruptions, and less time writing code.

Time learning about the infrastructure and the platform and understanding how it is setup and making sure that it is setup right. Building Continuous Delivery and Continuous Deployment pipelines and keeping them running. Helping ops to investigate and resolve issues, responding to urgent customer requests and questions, looking into performance problems, monitoring the system to make sure that it is working correctly, helping to run A/B experiments, pushing changes and fixes out… all take time away from development and pre-empt thinking about requirements and designing and coding and testing (the work that developers are trained to do and are good at).

The Impact of Interruptions and Multi-Tasking

You can’t protect developers from interruptions and changes in priorities in Devops, even if you use Kanban with strict WIP limits, even in a tightly run shop – and you don’t want to. Developers need to be responsive to operations and customers, react to feedback from production, jump on problems and help detect and resolve failures as quickly as possible. This means everyone, especially your most talented developers, needs to be available for ops most if not all of the time.

Developers join ops on call after hours, which means carrying a pager (or being chased by PagerDuty) after the day’s work is done. Time gets wasted on support calls for problems that end up not being real problems, and on long nights and weekends firefighting and tracking down production issues and helping to recover from failures – then coming in tired the next day to spend more time on incident dry runs, testing failover and roll-forward and roll-back recovery, and participating in post mortems and root cause analysis sessions when something goes wrong and the failover or roll-forward or roll-back doesn’t work.

You can’t plan for interruptions and operational problems, and you can’t plan around them. Which means developers will miss their commitments more often. Then why make commitments at all? Why bother planning or estimating? Use just-in-time prioritization instead to focus in on the most important thing that ops or the customer need at the moment, and deliver it as soon as you can – unless something more important comes up and pre-empts it.

As developers take on more ops and support responsibilities, multi-tasking and task switching – and the interruptions and inefficiency that come with it – increase, fracturing time and destroying concentration. This has an immediate drag on productivity, and a longer term impact on people’s ability to think and to solve problems.

Even the Continuous Deployment feedback loop itself is an interruption to a developer’s flow.

After a developer checks in code, running unit tests in Continuous Integration is supposed to be fast, a few seconds or minutes, so that they can keep moving forward with their work. But to deploy immediately to production means running through a more extensive set of integration tests and systems tests and other checks in Continuous Delivery (more tests and more checks take more time), then executing the steps through to deployment, and then monitoring production to make sure that everything worked correctly, and jumping in if anything goes wrong. Even if most of the steps are automated and optimized, all of this takes extra time and the developer’s attention away from working on code.

Optimizing the flow of work in and out of operations means sacrificing developer flow, and slowing down development work itself.

Expectations and Metrics and Incentives have to Change

In Devops, the way that developers (and ops) work changes, and the way that they need to be managed changes too. It’s also critical to change expectations and metrics and incentives for developers.

Devops success is measured by operational IT metrics, not by meeting project delivery goals of scope, schedule and cost, not by meeting release goals or sprint commitments, or even meeting product design goals.

  • How fast can the team respond to important changes and problems: Change Lead Time and Cycle Time to production instead of delivery milestones or velocity
  • How often do they push changes to production (which is still the metric that most people are most excited about – how many times per day or per hour or minute Etsy or Netflix or Amazon deploy changes)
  • How often do they make mistakes - Change / Failure ratio
  • System reliability and uptime – MTBF and especially MTTD and MTTR
  • Cost of change – and overall Operations and Support costs
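
A toy sketch of computing two of these metrics from deployment and incident records – the record format is made up for illustration:

```python
from datetime import datetime, timedelta

deploys = [
    {"at": datetime(2015, 2, 2, 10, 0), "failed": False},
    {"at": datetime(2015, 2, 2, 14, 0), "failed": True},
    {"at": datetime(2015, 2, 3, 9, 30), "failed": False},
]
incidents = [
    {"detected": datetime(2015, 2, 2, 14, 5),
     "resolved": datetime(2015, 2, 2, 14, 35)},
]

# Change/Failure ratio: what fraction of deploys caused a failure.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# MTTR: mean time from detecting an incident to resolving it.
mttr = sum((i["resolved"] - i["detected"] for i in incidents),
           timedelta()) / len(incidents)

print(f"change failure rate: {change_failure_rate:.0%}")  # 33%
print(f"MTTR: {mttr}")                                    # 0:30:00
```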

Devops is more about Ops than Dev

As more software is delivered earlier and more often to production, development turns into maintenance. Project management is replaced by incident management and task management. Planning horizons get much shorter – or planning is replaced by just-in-time queue prioritization and triage.

With Infrastructure as Code, Ops become developers: designing and coding infrastructure and infrastructure changes, thinking about reuse and readability and duplication and refactoring, about technical debt and testability, and building on TDD to implement TDI (Test Driven Infrastructure). They become more agile and more Agile, making smaller changes more often, spending more time programming and less on paperwork.

And developers start to work more like ops. Taking on responsibilities for operations and support, putting operational risks first, caring about the infrastructure, building operations tools, finding ways to balance immediate short-term demands for operational support with longer-term design goals.

None of this will be a surprise to anyone who has been working in an online business for a while. Once you deliver a system and customers start using it, priorities change, and everything about the way that you work and plan has to change too.

This way of working isn't necessarily better or worse for developers. But it is fundamentally different from how many developers think and work today. More frenetic and interrupt-driven. At the same time, more disciplined and more Lean. More transparent. More responsibility and accountability. Less about development and more about release and deployment and operations and support.

Developers – and their managers – will need to get used to being part of the bigger picture of running IT, which is about much more than designing apps and writing and delivering code. This might be the future of software development. But not all developers will like it, or be good at it.

Thursday, March 27, 2014

Secure DevOps - Seems Simple

The DevOps security story is deceptively simple. It’s based on a few fundamental, straightforward ideas and practices:

Smaller Releases are Safer

One of these ideas is that smaller, incremental and more frequent releases are safer and cause fewer problems than big bang changes. Makes sense.

Smaller releases contain fewer code changes. Less code means less complexity and fewer bugs. And less risk, because smaller releases are easier to understand, easier to plan for, easier to test, easier to review, and easier to roll back if something goes wrong.

And it is easier to catch security risks by watching out for changes to high-risk areas of code: code that handles sensitive data, or security features or other important plumbing, new APIs, error handling. At Etsy, for example, they identify this code (in reviews or pen testing or wherever), hash it, and automatically alert the security team when it gets changed, so that they can make sure that the changes are safe.
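
A sketch of that control – the file list, baseline format and alert mechanism are stand-ins:

```python
import hashlib
import json

# High-risk code identified in reviews or pen tests (hypothetical paths).
HIGH_RISK_FILES = ["auth/login.py", "billing/charge.py", "crypto/keys.py"]

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def check(baseline_path: str = "high_risk_hashes.json") -> None:
    # The baseline maps each path to the hash taken when it was last reviewed.
    with open(baseline_path) as f:
        baseline = json.load(f)
    for path in HIGH_RISK_FILES:
        if sha256_of(path) != baseline.get(path):
            alert_security_team(path)

def alert_security_team(path: str) -> None:
    # Stand-in for email/chat/paging integration.
    print(f"ALERT: high-risk file changed and needs review: {path}")
```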

Changing the code more frequently may also make it harder for the bad guys to understand what you are doing and find vulnerabilities in your system – taking advantage of a temporary “Honeymoon Effect” between the time you change the system and the time that the bad guys figure out how to exploit weaknesses in it.

And changing more often forces you to simplify and automate application deployment, to make it repeatable, reliable, simpler, faster, easier to audit. This is good for change control: you can put more trust in your ability to deploy safely and consistently, you can trace what changes were made, who made them, and when.

And you can deploy application patches quickly if you find a problem.

“...being able to deploy quick is our #1 security feature”
Effective Approaches to Web Application Security, Zane Lackey

Standardized Ops Environment through Infrastructure as Code

DevOps treats “Infrastructure as Code”: infrastructure configurations are defined in code that is written and managed in the same way as application code, and deployed using automated tools like Puppet or Chef instead of by hand. Which means that you always know how your infrastructure is set up and that it is set up consistently (no more Configuration Drift). You can prove what changes were made, who made them, and when.

You can deploy infrastructure changes and patches quickly if you find a problem.

You can test your configuration changes in advance, using the same kinds of automated unit test and integration test suites that Agile developers rely on – including tests for security.

And you can easily set up test environments that match (or come closer to matching) production, which means you can do a more thorough and accurate job of all of your testing.
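
A sketch of the kind of automated configuration tests this makes possible, written with Python’s unittest and run against a provisioned host. The paths, service name and checks are examples, not a complete suite:

```python
import os
import stat
import subprocess
import unittest

class ServerConfigTest(unittest.TestCase):
    def test_ssh_root_login_disabled(self):
        with open("/etc/ssh/sshd_config") as f:
            self.assertIn("PermitRootLogin no", f.read())

    def test_secrets_file_not_world_readable(self):
        mode = os.stat("/etc/myapp/secrets.conf").st_mode  # hypothetical path
        self.assertFalse(mode & stat.S_IROTH)

    def test_web_service_is_running(self):
        out = subprocess.run(["systemctl", "is-active", "nginx"],
                             capture_output=True, text=True)
        self.assertEqual(out.stdout.strip(), "active")

if __name__ == "__main__":
    unittest.main()
```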

Automated Continuous Security Testing

DevOps builds on Agile development practices like automated unit/integration testing in Continuous Integration, to include higher level automated system testing in Continuous Delivery/Continuous Deployment.

You can do automated security testing using something like Gauntlt to “be mean to your code” by running canned attacks on the system in a controlled way.

Other ways of injecting security into Devops include:

  1. Providing developers with immediate feedback on security issues through self-service static analysis: running Static Analysis scans on every check-in, or directly in their IDEs as they are writing code.
  2. Helping developers to write automated security unit tests and integration tests and adding them to the Continuous testing pipelines (see the sketch after this list).
  3. Automating checks on Open Source and other third party software dependencies as part of the build or Continuous Integration, using something like OWASP’s Dependency Check to highlight dependencies that have known vulnerabilities.

Fast feedback loops using automated testing mean you can catch more security problems – and fix them – earlier.
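
For point 2, a sketch of what a security-focused unit test can look like – the render_comment function under test is hypothetical:

```python
import unittest
from html import escape

def render_comment(text: str) -> str:
    # Hypothetical application code: user input must be HTML-escaped.
    return f"<p>{escape(text)}</p>"

class CommentSecurityTest(unittest.TestCase):
    def test_script_tags_are_escaped(self):
        out = render_comment('<script>alert("xss")</script>')
        self.assertNotIn("<script>", out)

    def test_attribute_injection_is_escaped(self):
        out = render_comment('" onmouseover="steal()')
        self.assertNotIn('onmouseover="', out)

if __name__ == "__main__":
    unittest.main()
```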

Operations Checks and Feedback

DevOps extends the idea of feedback loops to developers from testing all the way into production, giving (and encouraging) developers visibility into production metrics, and getting developers and ops and security all monitoring the system for anomalies in order to catch performance problems, reliability problems and security problems.

Adding automated asserts and health checks to deployment (and before start/restart) in production makes sure that key operational dependencies are met, including security checks: that the configurations are correct, ports that should be closed are closed, ports that should be open are open, permissions are correct, SSL is set up properly…

Or even killing system processes that don’t conform (or sometimes just to make sure that they failover properly, like they do at Netflix).
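
A sketch of the kind of post-deployment asserts described above – the hosts, ports and TLS check are examples only:

```python
import socket
import ssl
import sys

def assert_port(host: str, port: int, expect_open: bool) -> None:
    with socket.socket() as s:
        s.settimeout(2)
        is_open = s.connect_ex((host, port)) == 0
    if is_open != expect_open:
        sys.exit(f"FAIL: {host}:{port} open={is_open}, expected {expect_open}")

def assert_valid_tls(host: str, port: int = 443) -> None:
    # Full handshake with certificate verification; raises if the cert is bad.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host):
            pass

if __name__ == "__main__":
    assert_port("localhost", 443, expect_open=True)   # service is up
    assert_port("localhost", 23, expect_open=False)   # telnet stays closed
    assert_valid_tls("www.example.com")
    print("post-deploy checks passed")
```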

People talking to each other and working together to solve problems

And finally DevOps is about people talking together and solving problems together. Not just developers talking to the business/customers. Developers talking to ops, ops talking to developers, and everybody talking to security. Sharing ideas, sharing tools and practices. Bringing ops and security into the loop early. Dev and ops and security working together on planning and on incident response and learning together in Root Cause Analysis sessions and other reviews. Building teams across silos. Building trust.

Making SecDevOps Work

There are good reasons to be excited by what these people are doing and the path that they are going down. It promises a new, more effective way for developers and security and ops to work together.

But there are some caveats.

Secure DevOps requires strong engineering disciplines and skills. DevOps engineering skills are still in short supply. And so are information security (and especially appsec) skills. People who are good at both DevOps and appsec are a small subset of these small subsets of the talent available.

Outside of configuration management and monitoring, the tooling is limited – you’ll probably have to write a lot of what you need yourself (which leads quickly back to the skills problem).

A lot more work needs to be done to make this apply to regulated environments, with enforced separation of duties and where regulators think of Agile as “the A Word” (so you can imagine what they think of developers pushing out changes to production in Continuous Deployment, even if they are using automated tools to do it). A small number of people are exploring these problems in a Google discussion group on DevOps for managers and auditors in regulated industries, but so far there are more people asking questions than offering answers.

And getting dev and ops and security talking, working together and collaborating across silos might take an extreme makeover of your organization’s structure and culture.

Secure DevOps practices and ideas aren't enough by themselves to make a system secure. You still need all of the fundamentals in place. Even if they are releasing software incrementally and running lots of automated tests, developers still need to understand software security, design security in and follow good software engineering practices. Whether they are using "Infrastructure as Code" or not, Ops still has to design and engineer the datacenter and the network and the rest of the infrastructure to be safe and reliable, and run things in a secure and responsible way. And security still needs to train everyone and follow up on what they are doing, run their scans and pen tests and audits to make sure that all of this is being done right.

Secure DevOps is not as simple as it looks. It needs disciplined secure development and secure ops fundamentals, and good tools and rare skills and a high level of organizational agility and a culture of trust and collaboration. Which is why only a small number of organizations are doing this today. It’s not a short term answer for most organizations. But it does show a way for ops and security to keep up with the high speed of Agile development, and to become more agile, and hopefully more effective, themselves.

Thursday, September 12, 2013

The Real Cost of Change in Software Development

There are two widely opposed (and often misunderstood) positions on how expensive it can be to change or fix software once it has been designed, coded, tested and implemented. One holds that it is extremely expensive to leave changes until late, that the cost of change rises exponentially. The other position is that changes should be left as late as possible, because the cost of changing software is – or at least can be – essentially flat (that’s why we call it “soft”ware).

Which position is right? Why should we care? And what can we do about it?

Exponential Cost of Change

Back in the early 1980s, Barry Boehm published some statistics (Software Engineering Economics, 1981) which showed that the cost of making a software change or fix increases significantly over time – you can see the original curve that he published here.

Boehm looked at data collected from Waterfall-based projects at TRW and IBM in the 1970s, and found that the cost of making a change increases as you move from the stages of requirements analysis to architecture, design, coding, testing and deployment. A requirements mistake found and corrected while you are still defining the requirements costs almost nothing. But if you wait until after you've finished designing, coding and testing the system and delivering it to the customer, it can cost up to 100x as much.

A few caveats here. First, the cost curve is much higher in large projects (in smaller projects, the cost curve is more like 1:4 instead of 1:100). Those cases where the cost of change rises up to 100x are rare, what Boehm calls Architecture-Breakers, where the team gets a fundamental architectural assumption wrong (scaling, performance, reliability) and doesn't find out until after customers are already using the system and running into serious operational problems. And this analysis was all done on a small data sample from more than 30 years ago, when developing code was much more expensive and time-consuming and paperworky, and the tools sucked.

A few other studies have been done since then which mostly back up Boehm's findings – at least the basic idea that the longer it takes for you to find out that you made a mistake, the more expensive it is to correct it. These studies have been widely referenced in books like Steve McConnell’s Code Complete, and used to justify the importance of early reviews and testing:

Studies over the last 25 years have proven conclusively that it pays to do things right the first time. Unnecessary changes are expensive. Researchers at Hewlett-Packard, IBM, Hughes Aircraft, TRW, and other organizations have found that purging an error by the beginning of construction allows rework to be done 10 to 100 times less expensively than when it's done in the last part of the process, during system test or after release (Fagan 1976; Humphrey, Snyder, and Willis 1991; Leffingwell 1997; Willis et al. 1998; Grady 1999; Shull et al. 2002; Boehm and Turner 2004).

In general, the principle is to find an error as close as possible to the time at which it was introduced. The longer the defect stays in the software food chain, the more damage it causes further down the chain. Since requirements are done first, requirements defects have the potential to be in the system longer and to be more expensive. Defects inserted into the software upstream also tend to have broader effects than those inserted further downstream. That also makes early defects more expensive.

There’s some controversy over how accurate and complete this data is, how much we can rely on it, and how relevant it is today when we have much better development tools and many teams have moved from heavyweight sequential Waterfall development to lightweight iterative, incremental development approaches.

Flattening the Cost of Changing Code

The rules of the game should change with iterative and incremental development – because they have to.

Boehm realized back in the 1980s that we could catch more mistakes early (and therefore reduce the cost of development) if we think about risks upfront and design and build software in increments, using what he called the Spiral Model, rather than trying to define, design and build software in a Waterfall sequence.

The same ideas are behind more modern, lighter Agile development approaches. In Extreme Programming Explained (the 1st edition, but not the 2nd) Kent Beck states that minimizing the cost of change is one of the goals of Extreme Programming, and that a flattened change cost curve is “the technical premise of XP”:

Under certain circumstances, the exponential rise in the cost of changing software over time can be flattened. If we can flatten the curve, old assumptions about the best way to develop software no longer hold…

You would make big decisions as late in the process as possible, to defer the cost of making the decisions and to have the greatest possible chance that they would be right. You would only implement what you had to, in hopes that the needs you anticipate for tomorrow wouldn't come true. You would introduce elements to the design only as they simplified existing code or made writing the next bit of code simpler.

It’s important to understand that Beck doesn't say that with XP the change curve is flat. He says that these costs can be flattened if teams work towards this, leveraging key practices and principles in XP, such as:
  • Simple Design, doing the simplest thing that works, and deferring design decisions as late as possible (YAGNI), so that the design is easy to understand and easy to change
  • continuous, disciplined Refactoring to keep the code easy to understand and easy to change
  • Test-First Development – writing automated tests upfront to catch coding mistakes immediately, and to build up a testing safety net to catch mistakes in the future (see the sketch after this list)
  • developers collaborating closely and constantly with the Customer to confirm their understanding of what they need to build and working together in Pairs to design solutions and solve problems, and catch mistakes and misunderstandings early
  • relying on working software over documentation to minimize the amount of paperwork that needs to be done with each change (write code, not specs)
  • the team’s experience working incrementally and iteratively – the more that people work and think this way, the better they will get at it.
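
As an illustration of the Test-First bullet above, here is a minimal Python sketch (the function and its behaviour are invented for this example): the test is written first, fails until the production code exists, and then stays behind as part of the regression safety net.

    import unittest

    def normalize_username(raw):
        # Production code, written only after the test below was red:
        # trim surrounding whitespace and lower-case the name.
        return raw.strip().lower()

    class NormalizeUsernameTest(unittest.TestCase):
        # Written first, before normalize_username existed.
        def test_strips_whitespace_and_lowercases(self):
            self.assertEqual(normalize_username("  Alice "), "alice")

    if __name__ == "__main__":
        unittest.main()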

All of this makes sense and sounds right, although there are no studies that back up these assertions, which is why Beck dropped the change curve discussion from the second edition of his XP book. But by then the idea that the cost of change could be flat with Agile development had already been accepted by many people.

The importance of Feedback

Scott Ambler agrees that the cost curve can be flattened in Agile development, not because of Simple Design, but because of the feedback loops which are fundamental to iterative, incremental development. Agile methods optimize feedback within the team, with developers working closely together with each other and with the Customer, relying on continuous face-to-face communication. And following technical practices like Test-First Development, Pair Programming and Continuous Integration makes these feedback loops even tighter.

But what really matters is getting feedback from the people using the system – it’s only then that you know if you got it right or what you missed. The longer that it takes to design and build something and get feedback from real users, the more time and work that is required to get working software into a real customer’s hands, the higher your cost of change really is.

Optimizing and streamlining this feedback loop is what is driving the Lean Startup approach to development: defining a Minimum Viable Product (something that just barely does the job), getting it out to customers as quickly as you can, and then responding to user feedback through Continuous Deployment and A/B testing techniques until you find out what customers really want.
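
A minimal sketch of how an A/B split like this is often implemented (the hashing scheme shown is a common pattern, not a description of any particular product): assign each user deterministically to a variant, so the same user always gets the same experience, then compare outcomes across variants.

    import hashlib

    def assign_variant(user_id, experiment, variants=("control", "treatment")):
        # Hash the user id together with the experiment name, so assignment
        # is stable for a user but independent across experiments.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user always lands in the same bucket for a given experiment:
    assert assign_variant(42, "new-signup-flow") == assign_variant(42, "new-signup-flow")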

Even flat change can still be expensive

Even if you do everything to optimize these feedback loops and minimize your overheads, this still doesn’t mean that change will come cheap. Being fast isn’t good enough if you make too many mistakes along the way.

The Post Agilist uses the example of painting a house: assume that it costs $1,000 each time you paint the house, whether you paint it blue or red or white. The cost of change is flat. But if you have to paint it blue first, then red, then white before everyone is happy, you’re wasting time and money.

“No matter how expensive or cheap the ‘cost of change’ curve may be, the fewer changes that are made, the cheaper and faster the result will be…Planning is not a four letter word.” [however, I would like to point out that “plan” is].

Spending too much time upfront in planning and design is waste. But so is not spending enough time upfront to understand what you should be building and how you should build it, and not taking the care to build it well.

Change gets more expensive over time

You also have to accept that the incremental cost of change will go up over the life of a system, especially once the system is being used. This is not just a technical debt problem. The more people using the system, the more people who might be impacted if you get a change wrong, and the more careful you have to be – which means spending more time on planning and communicating changes, building and testing a roll-back capability, and rolling changes out slowly using Canary Releases and Dark Launching, all of which adds cost and delays feedback.
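
A minimal sketch of what a percentage-based canary rollout can look like in code (the feature name and percentage are invented): a small, stable slice of users gets the new code path first, the slice is widened as confidence grows, and setting the percentage back to zero acts as a rollback.

    import zlib

    ROLLOUT_PERCENT = {"new-billing-engine": 5}  # start with 5% of users

    def canary_enabled(feature, user_id):
        # Stable hash into 100 buckets: a user stays in the canary as the
        # percentage grows, so their experience doesn't flip back and forth.
        bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
        return bucket < ROLLOUT_PERCENT.get(feature, 0)

    def handle_billing(user_id):
        if canary_enabled("new-billing-engine", user_id):
            return "new code path"  # the change being canaried
        return "old code path"      # still in place, so rollback is instant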

There are also more operational dependencies that you have to understand and take care of, and more data that you have to change or fix up, making changes even more difficult and expensive. If you do things right, keep a good team together and manage technical debt responsibly, these costs should rise gently over the life of a system – and if you don’t, that exponential change curve will kick in.

What is the Real Cost of Change?

Is the real cost of change exponential, or is it flat? The truth is somewhere in between.

There’s no reason that the cost of making a change to software has to be as high as it was 30 years ago. We can definitely do better today, with better tools and better, cheaper ways of developing software. The keys to minimizing the costs of change seem to be:

  1. Get your software into customer hands as quickly as you can. I am not convinced that any organization really needs to push out software changes 10-50-100x a day, but you don’t want to wait months or years for feedback either. Deliver less, but more often. And because you’re going to deliver more often, it makes sense to build a Continuous Delivery pipeline so that you can push changes out efficiently and with confidence. Use ideas from Lean Software Development and maybe Kanban to identify and eliminate waste and to minimize cycle time.
  2. We know that even with lots of upfront planning and design thinking, we won’t get everything right upfront – this is the Waterfall fallacy. But it’s also important not to waste time and money iterating when you don’t need to. Spending enough time upfront in understanding requirements and in design to get it at least mostly right the first time can save a lot later on.
  3. And whether you’re working incrementally and iteratively, or sequentially, it makes good sense to catch mistakes early when you can – through Test-First Development and pairing, or requirements workshops and code reviews, whatever works for you.

Tuesday, September 3, 2013

This is how Facebook Develops and Deploys Software. Should you care?

A recently published academic paper by Prof. Dror Feitelson at Hebrew University, Eitan Frachtenberg, a research scientist at Facebook, and Kent Beck (who is also doing something at Facebook), describes Facebook’s approach to developing and deploying their front-end software. While it would be more interesting to understand how back-end development is done (that is where the real heavy lifting is done, scaling up to handle hundreds of millions of users), there are a few things in the paper that are worth knowing about.

Continuous Deployment at Facebook is not Continuous Deployment

Rather than planning work out in projects or breaking work into time-boxed Sprints, Facebook developers do most of their work in small, independent changes which are released frequently. This makes sense for Facebook’s online business model, with everyone constantly tuning the platform, trying out new options and applications in different user communities, and seeing what sticks. It’s a credit to their architecture that so many small changes can actually be made independently and cheaply.

Facebook says that they follow Continuous Deployment, but it’s not Continuous Deployment the way that IMVU made popular, where every change is pushed out to customers immediately, or even the way that a company like Etsy does Continuous Deployment.

At Facebook, code can be released twice a day, but this is done mostly for bug fixes and internal code. New production code is released once per week: thousands of changes by hundreds of developers are packaged up by their small release team on Sundays, run through automated regression testing, and released on Tuesday if the developers who contributed the changes are present. Release engineers assess the risk of changes based on the size of the change, the amount of discussion done in code reviews (which is recorded through an internal code review tool), and on each developer’s “push karma”: how many problems they have seen from code by this developer before.

A tool called “Gatekeeper” controls which features are available to which customers to support dark launching, and all code is released incrementally – to staging, then a subset of users, and so on. Changes can be rolled back if necessary – individually, or, as a last resort, an entire code release. However, like a lot of Silicon Valley devops shops, they mostly follow the “Real Men only Roll Forward” motto.
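
The paper doesn’t describe Gatekeeper’s internals, but a minimal sketch of the general pattern it names – deciding per user, per request, whether a feature is visible – might look like this (the feature name and rules are invented):

    # Invented rules: each feature maps to a predicate over the user record.
    FEATURE_RULES = {
        "timeline-redesign": lambda user: user.get("is_employee")
                                          or user.get("country") == "NZ",
    }

    def feature_enabled(feature, user):
        # Unknown or retired features default to off, so stale flags fail safe.
        rule = FEATURE_RULES.get(feature)
        return bool(rule and rule(user))

    # Dark launch: employees exercise the new code in production first.
    print(feature_enabled("timeline-redesign", {"is_employee": True}))  # True
    print(feature_enabled("timeline-redesign", {"country": "CA"}))      # False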

Code Ownership

A key to the culture at Facebook is that developers are individually responsible for the code that they wrote, for testing it and supporting it in production. This is reflected in their code ownership model:

Developers must also support the operational use of their software — a combination that’s become known as “devops.” This further motivates writing good code and testing it thoroughly. Developers’ personal stake in keeping the system running smoothly complements the engineering procedures and lets the system maintain quality at scale. Methodologies and tools aren’t enough by themselves because they can always be misused. Thus, a culture of personal responsibility is critical.

Consequently, most source files are modified by only a few engineers. Although at least one other engineer reviews all changes before they’re committed, a third of the source files have only been edited by one engineer, and another quarter by two. Only 10 percent of the files are handled by more than seven engineers. On the other hand, the distribution of engineers per file has a heavy tail, with the most widely shared file handled by no fewer than 870 distinct engineers. These widely shared files are predominantly library files and also include major configuration and top-level PHP files.

Testing? We don’t need no stinking testing…

Facebook doesn't have an independent test team, because, they say, they don’t need one.

First, they depend a lot on code reviews to find bugs:

At Facebook, code review occupies a central position. Every line of code that’s written is reviewed by a different engineer than the original author. This serves multiple purposes: the original engineer is motivated to ensure that the code is of high quality, the reviewer comes with a fresh mind and might find defects or suggest alternatives, and, in general, knowledge about coding practices and the code itself spreads throughout the company.

Developers are also responsible for writing unit tests and their own regression tests – they have “tens of thousands of regression tests” (which doesn't sound like nearly enough for 10+ million lines of mostly PHP code compiled into C++, both languages in which it is easy to make coding mistakes) and automated performance tests.

And developers also test the software by using the development version of Facebook for their personal Facebook use. According to the authors, “this is just one aspect of the departure from traditional software development”. But Facebook developers using their own software internally (and passing this off as “testing”) is no different than the early days at Microsoft where employees were supposed to “eat their own dog food”, a practice that did little if anything to improve the quality of Microsoft products.

Facebook also depends on customers to test the software for them. Software is released in steps for A/B testing and “live experimentation” on subsets of the user base, whether customers want to participate in this testing or not. Because their customer base is so large, they can get meaningful feedback from testing with even a small percentage of users, which at least minimizes the risk and inconvenience to customers.

Security???

While performance is an important consideration for developers at Facebook, there is no mention of security checks or testing anywhere in this description of how Facebook develops and deploys software. No static analysis, dynamic analysis/scanning, pen testing, or explanation of how the security team and developers work together – not even for “privacy sensitive code”: although this code is “held to a higher standard”, they don’t explain what that higher standard is. Presumably they rely on libraries and frameworks to handle at least some appsec problems, and possibly look for security bugs in their code reviews, but they don't say.

There isn’t much information available on Facebook’s appsec program anywhere. The security team at Facebook seems to spend a lot of time educating people on how to use Facebook safely and how to develop Facebook apps safely, and running their bug bounty program, which pays outsiders to find security bugs for them.

A search on security on Facebook mostly comes back with a long list of public security failures, privacy violations and application security vulnerabilities found over the years and continuing up to the present day. Maybe the lack of an effective appsec program is the reason for this.

This is the way Facebook is Developed. Should you care?

While it’s interesting to get a look inside a high-profile organization like Facebook and see how they approach development at scale, it’s not clear why this paper was written. There is little about what Facebook is doing (in their front-end development at least) that is unique or innovative, except maybe the way they use BitTorrent to push code changes out to thousands of servers, like Twitter does – something I heard about a few years ago at Velocity, and that has been written about before.

I like the idea of developers being responsible for their work, all the way into production, which is a principle that we also follow. Code reviews are good. Dark launching features is a good idea, and has been common practice for a long time (even before it was called "dark launching"). Not having testers or doing appsec is not good. Otherwise, I'm not sure what the rest of us can learn from or would want to use from this.

Thursday, August 22, 2013

Getting Application Security Vulnerabilities Fixed

It’s a lot harder to fix application security vulnerabilities than it should be.

In their May 2013 security report, WhiteHat Security published some discouraging findings about how many application security vulnerabilities found in testing get fixed, and how long it takes to fix them. They found that only 61% of serious security vulnerabilities get fixed, and that fixing them takes 193 days on average.

Why some vulnerabilities get fixed, or don’t get fixed

Convincing management and the customers paying for software development work – and the developers that need to do the work – that security vulnerabilities really need to be fixed is one part of the problem. Proving that this can be done in a cost effective and safe way is another.

For most organizations, compliance – not operational risk, not customer requirements or other commercial considerations, and not concern for quality – determines whether security vulnerabilities get fixed. WhiteHat customers reported that the #1 reason that a vulnerability gets fixed is because it is required by compliance. And the #1 reason that people don’t fix a vulnerability is because it isn't required by compliance.

One of the other factors that influences how many security vulnerabilities get fixed and how fast they can be fixed is where the system is in its life. It’s a lot easier and much less expensive to fix security vulnerabilities found early in a project, before you’ve written a lot of code that needs to be reviewed, fixed and tested again; when the situation and the system are both plastic enough that you can make course corrections without a lot of time or cost.

Obviously it’s a much different story for legacy systems on life support, where nobody really understands the code or is confident that they can change it safely, and nobody is sure how long the system is going to be around (although these systems almost always hang on longer than anyone expects).

Everything in between is where decisions are difficult to make: the system is already in use and has been for a while, and the team maintaining and supporting it has a full book of committed work to deal with. It can be hard to make fixing security vulnerabilities a priority when things seem to be running fine, and everyone is busy trying to keep it this way, unless maybe compliance is standing over you holding a big enough hammer that you have to do something to show that you are taking them seriously.

What’s it going to cost?

If you can make the case that there are serious security problems that need to be taken care of, where do you start? A security review could uncover hundreds or even thousands of vulnerabilities – the first security scan or pen test of a big system can be overwhelming. How much work is it going to take to “make the system secure”, and what is it going to cost?

Denim Group has done some interesting research on understanding how much work is involved in fixing security vulnerabilities.

Like any other bugs in code, some vulnerabilities are easier to find and fix than others. An XSS vulnerability can take anywhere from 10 minutes (stored XSS) to about an hour and a half (stored and reflected) to fix – and most web apps have at least one, often hundreds, of these problems. Fixing a SQL Injection problem also takes an hour and a half on average. A missing authorization check? Only 7 minutes. And like any other bug fix, the coding work is only a small part of the time taken to get the fix done (on average, 30% of the total time). Testing takes around half of the time, and the rest goes to getting set up to make the change, getting the new code built and deployed, and overhead.

Unlike with a functional bug, the customer won’t see any immediate advantage in fixing a vulnerability – the code works fine right now as far as they can see. So it’s important that you can fix vulnerabilities without spending too much time or money, and that you can do it without breaking whatever is already working.

This is why Nick Galbreath stresses the value of a Continuous Deployment capability as a pre-requisite to a successful software security program, leveraging Continuous Integration and Continuous Delivery tools and practices so that when developers check in code changes, the system is automatically built and tested, and can be automatically deployed if all of the tests pass. It’s not about pushing every code change out immediately – it’s about having a proven pipeline in place for rolling changes out to production quickly and with minimal risk, knowing that you can make fixes and get them out cheaply and with confidence, and that you can respond to an emergency if you have to. This will pay dividends outside of security work, reducing the cost and risk of making any software change.
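
A minimal sketch of that kind of pipeline gate (the make targets are placeholders for whatever build, test and deploy steps your project actually uses): every change runs through the same scripted stages, and the deploy stage is only reached when everything before it passes.

    import subprocess
    import sys

    # Placeholder commands - substitute your project's real steps.
    PIPELINE = [
        ("build",  ["make", "build"]),
        ("test",   ["make", "test"]),
        ("deploy", ["make", "deploy"]),
    ]

    def run_pipeline():
        for stage, command in PIPELINE:
            print(f"running stage: {stage}")
            if subprocess.run(command).returncode != 0:
                # Fail fast: a red build or red tests means nothing gets deployed.
                sys.exit(f"stage '{stage}' failed - stopping before deploy")

    if __name__ == "__main__":
        run_pipeline()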

Getting security bugs fixed

Denim Group explains that remediating software security vulnerabilities has to be managed like any other software development project, and they provide some guidelines on how to do it in a waterfally kind of way, with upfront stakeholder engagement and planning: an approach that can work fine for many organizations, especially larger ones.

A more iterative, Agile approach could start with a short, time-boxed spike. Take a couple of smart developers and give them a couple of weeks to review the list of vulnerabilities (if possible with whoever found them), understand which ones are serious and filter out false positives, and pick some vulnerabilities to fix (a few each of different kinds). They should choose which vulnerabilities to work on by trading off what is easy to understand and fix, against the risk of not fixing them. Tools like the OWASP Top 10 or SANS/CWE Top 25 can help with understanding the issues and making these decisions.

SQL Injection would make a good first choice: a dangerous vulnerability that is easy to exploit and can have serious consequences, but also easy for a developer to understand and fix. Or a missed authorization check: another potentially serious bug that should be easy to understand, and trivial to fix and test. A problem like a mistake in secure password storage might be technically harder to solve, but still easy to isolate and test. Adding server-side validation (instead of validating only at the client) is another good and easy place to start.
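
To make the first two of these concrete – the SQL Injection fix and the missing authorization check – here is a minimal sketch of what the fixes typically amount to (the table, function and user-object names are all invented for illustration):

    import sqlite3  # stand-in for whatever database driver you use

    def get_orders(conn, customer_id, requesting_user):
        # The missing authorization check: verify server-side that this
        # user is allowed to see this customer's data at all.
        if not requesting_user.can_view(customer_id):  # hypothetical method
            raise PermissionError("not authorized for this customer")

        # Before (injectable):
        #   conn.execute("SELECT * FROM orders WHERE customer_id = '%s'" % customer_id)
        # After: the driver binds the value, so input can't rewrite the query.
        return conn.execute(
            "SELECT * FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchall()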

It is important that the developers take the time to understand what they are doing and how to do it right: that they understand the vulnerability, why it is a problem, how to fix it correctly, and how to test it (test it to make sure that they actually closed the security hole, and regression test it to make sure that they didn't break anything else by accident). The important thing here isn't to make a few fixes – it’s to learn what’s involved in correctly solving these problems, and know that you can build and deploy the fixed code properly.

Then run another spike. Pick a few other bugs, maybe some that are harder to understand and fix, and some that are easy to fix but less serious (like missing error handling or information leaks), and run through the same steps again.

With a small investment of time like this, you should get an understanding of what work needs to be done, how to do it, and what it’s going to cost, along with the confidence that you can do it safely. You should have good enough information to estimate the amount of work it will take to fix the remaining problems, and a good enough understanding of the risk and cost trade-offs to decide which problems need to be fixed – and can be fixed – sooner, and which can wait.

Now you can add the remaining bugs that you plan to fix to your backlog. You might decide to fix as many of them as you can all at once in a hardening sprint, or prioritize and fix them with the other work in your backlog.

You can’t fix, or effectively plan to fix, security vulnerabilities until you understand them. Once you understand the problems (what the bugs are, what needs to be fixed and why), and how much it is going to cost to fix them, and once you have the confidence that you can fix them properly, you can treat security vulnerabilities like other bugs – decide what needs to be fixed and when by trading off cost and risk with the other work that you have to get done. Remediation work becomes just another software development problem to be managed, something that developers and managers already know how to do.

Friday, March 8, 2013

Book Review: The Phoenix Project

Everyone who attended the “Rugged Devops” panel at RSA this year received a free copy of The Phoenix Project (by Gene Kim, Kevin Behr and George Spafford – the authors of Visible Ops), the fictional story of the education and transformation of an IT manager, an IT organization, and eventually of an entire company.

I'm not sure why they wrote The Phoenix Project as a novel. But they did. So I’ll review it first as a piece of storytelling, and then look at the messaging inside.

The reason that I don’t like didactic fiction is that so much of it is so poorly written – generally forced and artificial. I was pleasantly surprised by The Phoenix Project. The first half of the novel tells the story of an IT manager forced into taking on responsibility for saving his company. Our well-meaning hero is an ex-marine sergeant (for some reason unclear to me, the hero, the CEO, and even the mysterious guru all have a military background) with an MBA but without any ambition except to quietly provide for his family. He has been running his own little part of the IT group so successfully that he is dramatically promoted to take over all of IT operations (and so successfully that his own group is never mentioned again in the story – it seems to run on auto-pilot without him). For a successful manager, our hero knows alarmingly little about how the rest of the IT organization works, or about the business, and so is unprepared for his new responsibility. He reluctantly accepts the big job, and then regrets it immediately as he realizes what a shit storm he has walked into.

It’s a compelling narrative that draws you in and seems realistic even with the stock characters: the sociopathic SOB CEO, the unpopular everything-by-the-book CISO, the Software Development Manager who only cares about hitting deadlines (but not about delivering software that works), the Machiavellian Marketing executive, and the autistic genius that the entire IT organization of several hundred people depends on to get all the important stuff done or fixed.

As a pure piece of storytelling, things start to unravel in the second half with the emergence of the IT / Lean Manufacturing guru – when the story ends and the devops fable begins. From this point on, the plot depends on several miraculous conversions: everyone except the marketing exec sees the light and transforms overnight. They even start to groom themselves nicely and dress better. Lifetime friendships are made over a few drinks, and everyone learns to trust and share: there’s a particularly cringe-inducing management meeting in which people bare their souls and weep. Conflicts and problems disappear magically, usually because of the guru’s intervention – including an unnecessary scene where the guru shows up at a crucial audit meeting and helps convince the partner of the audit firm (an old buddy of his) that there aren't any real problems with the company’s messed-up IT controls (“these aren't the droids you’re looking for”).

But the real heroes are the people running the manufacturing group: the one part of this spectacularly mismanaged company that somehow functions perfectly is a manufacturing plant, where everyone can go to learn about Kanban and Lean process management and so on. Without the help of the smug and smart-alecky guru – who apparently helped create this manufacturing success – and his tiresome Zen master teaching approach (and sometimes even with his help), our hero is unable to understand what’s wrong or what to do about it. He doesn't understand anything about demand management, how to schedule and prioritize project work, that firefighting is actual work, how to get out of firefighting mode, how to recognize and manage bottlenecks in workflow, or even how important it is to align IT with business priorities (where did he get that MBA anyway?). Luckily, the factory is right there for him to learn from, if he only opens his eyes and mind.

What will you learn from The Phoenix Project?

The other problem with this storytelling approach is that it takes so damn long to get to the point – it’s a 338-page book with about 50 pages of meat. Like Goldratt's The Goal (which is referenced a couple of times in this book), The Phoenix Project leads the reader to understanding through a kind of detective story. You have to be patient and watch as the hero stumbles, goes down blind alleys, ignores obvious clues, and only with help eventually figures out the answers. Unfortunately, I'm not a patient reader.

This is a gentle introduction to Lean IT and devops. If you've read anything on Kanban and devops you won’t find anything surprising, although the authors do a good job of explaining how Lean manufacturing concepts can be applied to managing IT work. The ideas covered in the book are standard Lean and Theory of Constraints stuff, with a little of David Anderson’s Kanban and some devops – especially Continuous Deployment as originally described by John Allspaw and Continuous Delivery.

The guru’s lessons are mostly about visualizing, understanding and limiting demand – that you should stop taking on more work until you can get things actually working, so that you’re not spending all of your time task-switching and firefighting; identifying workflow bottlenecks and protecting or optimizing around them; how reducing batch size in development improves control and speeds up feedback; that in order to do this you have to work on simplifying and standardizing deployment; and how valuable it is to get developers and operations to work together.

My complaints aren't with the ideas – I buy into a lot of Devops, and agree that Kanban and Lean have a lot to offer IT ops and support teams (although I'm not sold on Kanban by itself for development, certainly not at scale). But I was disappointed with the unrealistic turnaround in the second half of the book. It’s all rainbows and unicorns at the end. IT, the business, development and security all start working together seamlessly. Management is completely aligned. Performance problems? No problem – just go to the Cloud. And they even bring in the famous Chaos Monkey in the last couple of pages, just because.

Spoiler Alert: Everything goes so well in a few months that the company is back on track, plans to outsource IT and break up the company are cancelled, the selfish head of marketing is canned, and our hero is promoted to CIO and put on the fast track to corporate second-in-command. Sorry: nothing this bad gets that good that easily. It is a fable after all, and too hard to swallow.

The Phoenix Project is a unique book – when was the last time that you read an actual novel about IT management?! It was worth reading, and if it introduces devops and Lean ideas to more people in IT, it will have accomplished something useful. But it’s not a book that you can use to get things done. There are lessons and recipes and patterns, but they take work to pull out of the story. There’s no index, and no good way to go back to find things that you thought were useful or interesting. So I am looking forward to the Devops Cookbook: something practical that hopefully explains how these ideas can work in businesses that don’t all look like Facebook and Twitter.
