DevOps – It is a culture shift, not a technology!

“What is it the customer is paying for? The running service! No running service, no business!”

In this loose series of blog posts I would like to share some insights, lessons learned, and success and failure stories from a large transformation project that we initiated at IBX/BPO three years ago and which is still running today. This project is about the journey of transforming a silo-based IT organization into an agile DevOps organization.

Most of the posts in this series, like the one you are reading now, are not overly technical and are geared more towards business people with an interest in technology. I will probably write one or two technical posts, but these will be the exception rather than the rule.

So, let’s take the first step on our DevOps journey. When we started it, the job of running the service was split between the Development, Operations, and Infrastructure silos. Our service is a large P2P SaaS platform with, at that time, 1000+ virtual machines, two data centers, 30+ applications and hundreds of thousands of end users. All of this is backed by a 24x7 availability SLA with a maximum of 24 hours of downtime in case of emergency (the Recovery Time Objective – RTO) and a maximum of 15 minutes of data loss (the Recovery Point Objective – RPO). Add ISO 27001 and ISAE 3402 SOC 1 Type 2 certifications and ITIL-compliant processes. Not a small operation.

Work between the individual silos traveled in a waterfall model with long cycle times and heavy-ceremony hand-overs. Even though software development was agile, the rest wasn’t.

Today, we only have integrated agile DevOps feature teams: people from Product Management, Development, QA, Security, Infrastructure and Operations work together to provide running services to our customers. Hand-overs are minimized and ceremonies are as lightweight as possible. The former silos still exist; however, their new job is to provide shared services to the DevOps feature teams and to maintain shared tooling and infrastructure.

So how did this journey get started? What was the trigger? Was it a magic tool that made all this possible? Or did we just wake up one day and know that we had to trigger a revolution?

No, of course not! It all started very gradually, with a certain kind of feeling – the feeling that you have been doing something quite successfully for the last few years, that a lot of people agree you are doing the right thing, that it is aligned with “best practices”, and yet something is amiss that you simply cannot put your finger on or clearly articulate.

And then suddenly you have a Eureka!-moment where you finally realize what has been going wrong all along. You don’t have the solution yet, but at least you have an idea of what to look at. And you realize that you have to start doing something about it, right now!

Let me explain how we got to this moment, and a little bit about software development processes and tool chains along the way.

For a long period of time, our software development department has been practicing continuous integration. The underlying idea is that every code change a software developer commits to the source code repository is immediately checked to see whether it breaks the existing code. This job is performed by a continuous integration system – probably the most well-known open source one is Jenkins, and this is also the one we are using. The output of this system is a build with a specific build number - a “CD” that contains the software. This is the first step in our service delivery tool chain.
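For the technically curious, a continuous integration run conceptually boils down to something like the following sketch. It is written in Python purely for illustration and is not our actual Jenkins configuration; the build and test commands are placeholders.

    # Minimal sketch of a continuous integration check, for illustration only.
    # The build and test commands are placeholders, not our real pipeline.
    import subprocess
    import sys

    def ci_check(build_number: int) -> bool:
        """Build the latest commit, run the tests, and archive a numbered build."""
        steps = [
            ["./build.sh"],      # placeholder: compile / package the application
            ["./run_tests.sh"],  # placeholder: run the automated test suite
        ]
        for step in steps:
            if subprocess.run(step).returncode != 0:
                print(f"Build #{build_number} is broken - stopping here.")
                return False
        # A successful run is archived as a numbered artifact - the "CD" from above.
        subprocess.run(["tar", "-czf", f"app-build-{build_number}.tar.gz", "dist/"])
        print(f"Build #{build_number} passed and was archived.")
        return True

    if __name__ == "__main__":
        sys.exit(0 if ci_check(build_number=42) else 1)

The important point is that every single commit goes through exactly the same automated check and, if it passes, ends up as a numbered, archived build.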

The next major step in the chain is that the build needs to get installed on the Quality Assurance (QA) environment servers, where the QA people can test it. This activity is called a deployment, and doing it continuously with every new build is called continuous deployment. This process is executed multiple times per day, every day of the year. A typical application can have hundreds of deployments per year, and across all our applications we perform more than 20,000 deployments a year.

As you can imagine, it is not possible to do this manually, so the software development department had invested a lot of effort in implementing a continuous deployment system. This system takes the various builds from the continuous integration server and deploys them automatically to the various QA environments. The QA people simply click a “deploy” button in the continuous deployment web interface and the rest happens “automagically”. No requests to the software development team to provide the “CD” and no requests to Operations to install the software on the servers are needed. If a deployment doesn’t work, the software development people triage with the operations people to determine what went wrong, fix it together and make sure it works the next time. The percentage of successful deployments increases dramatically, while at the same time the time it takes from code change to deployment is dramatically reduced.
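For those wondering what happens behind that “deploy” button, the essence of the automation looks something like the sketch below. Again, this is Python purely for illustration, not our actual tooling; the host names, paths and artifact layout are invented. The key property is that it is one parameterized, repeatable procedure, so the target environment is just configuration.

    # Simplified sketch of an automated deployment step, for illustration only.
    # Host names, paths and the artifact layout are invented.
    import subprocess

    ENVIRONMENTS = {
        # QA and production should differ only in configuration,
        # never in the deployment procedure itself.
        "qa":         {"hosts": ["qa-app-01", "qa-app-02"]},
        "production": {"hosts": ["prod-app-01", "prod-app-02", "prod-app-03"]},
    }

    def deploy(build_number: int, environment: str) -> None:
        """Push a numbered build to every server of the chosen environment."""
        artifact = f"app-build-{build_number}.tar.gz"
        for host in ENVIRONMENTS[environment]["hosts"]:
            # Copy the artifact and unpack it remotely. A real system adds
            # health checks, rollback and orchestration on top of this step.
            subprocess.run(["scp", artifact, f"{host}:/opt/app/releases/"], check=True)
            subprocess.run(
                ["ssh", host, f"tar -xzf /opt/app/releases/{artifact} -C /opt/app/current"],
                check=True,
            )
            print(f"Deployed build #{build_number} to {host} ({environment}).")

    # The "deploy" button in the web interface ends up calling something like:
    # deploy(build_number=42, environment="qa")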

Of course, there are some more steps and tools in the tool chain, like automated unit, integration and functional testing, automated security testing and vulnerability scanning, but I will come back to these in a later post.

The net result: Beautiful, everybody loves it – including Software Development and Operations!

Of course, this deployment process (installing the “CD”) also needs to be performed on the production environments. And it needs to work there – flawlessly, every time.

Because, and I might start sounding like a broken record, it is the running service the customer is paying for!

It is probably not a surprise to you when I say that this is NOT what happened on release weekends.

When we deployed to production, we often had a huge number of deployment failures, and very often it took a whole weekend, day and night, and we still didn’t manage to deploy in the end. The net result was anger and disappointment all around: customers who were expecting, and sometimes depending on, new functionality; exhausted and frustrated team members; my wife asking why I had to spend the whole weekend on the phone; the dog complaining that I didn’t take him out; you get the idea …

Lesson learned: Failed deployments are bad for everyone!

So why did we have these problems? Why did something that we were doing hundreds of times a year in QA suddenly become such a huge problem in production?

And then, three years ago, our software development manager, our operations manager and I had the Eureka!-moment I was speaking about earlier. The root causes of all these problems were very simple and obvious (in hindsight):

  1. In production, we didn’t use the automated continuous deployment process or the automated systems behind it.
  2. More importantly, we didn’t even use the same people who so successfully performed this hundreds of times per year. Instead, we used a completely different set of people who did this only two to three times a year and did everything manually, based on deployment instructions in Word documents.

My next post will explain in a bit more detail why we had this situation and, of course, what we did to change it. And I will also introduce the first two tenets of our DevOps transformation:

  1. It is about people and transformation, not tools!
  2. It is about continuous improvement, not about big bang change.

Stay tuned & please feel free to contact me for feedback, questions, improvements, criticism, etc.!

This is the first post in my DevOps series. The second post is called When Two Worlds Collide.



Uday Kumar


The other problem is that the QA deployment is used for functionality testing, but it also serves to test the deployment process itself (as long as the QA / staging environment is production-equivalent). The idea of continuous deployment automation is to have a consistent and repeatable process across QA and production. If we do this differently in different environments, then errors are not surprising. In my view, keeping a production-equivalent system is also very critical.


Great summary of the shift Stefan!

Surinderpal S Kumar


DevOps – It is a culture shift, not "just" a technology!

Tudor C.


This sounds very similar to the “Continuous Delivery” book written by Jez Humble and David Farley back in 2010.

Maja Panova


Very interesting! I am eager to read the next post.

