Introduction - Site Reliability Engineering Series - 01
The Birth Story of SRE
If you are reading this article, you have either heard about SRE, are implementing SRE, or are trying to explore what SRE is all about. Yes, it is quite the buzzword in the services industry right now; everyone wants to convert their existing TechOps into SRE mode, or make it SRE-enabled. But what is SRE all about? Why is SRE needed now? What makes SRE so attractive? Why, what, how, and so on. Hold on to all your questions and keep reading :-) And thanks very much for reading.
Remember, my way of understanding and implementing SRE may be completely or partially different from the famous Google SRE model of deployment. It is my own viewpoint; it may differ completely from others' and does not reflect any organization's strategy, goals, or vision. Take it with a pinch of salt, just as a way to understand what SRE is all about. But I won't deviate from the core pillars of SRE, I promise :-)
Let's start our deep dive.
What is SRE: It all started when Ben (Ben Treynor, VP of Engineering and founder of Google's Site Reliability Team) did something really smart and asked himself a challenging question: "What happens when a software engineer is tasked to do IT operations?" I can only guess why he challenged himself with that question; these are my possible reasons, and I may be wrong.
- He might not have had enough people or IT operations technicians to support his product and/or service in production. Since he came from the engineering vertical, all he had were developers, so he might have been forced to ask the developers to support the product and/or service in production.
- When a software developer (App-Dev) gives birth to a product and/or service, why isn't he/she bringing it up? Why does he/she leave the baby once it is in production? He/she gave birth to the baby (the product and/or service) and knows its internals, architecture, and codebase better than anyone else, so why isn't he/she part of the product operations team when it is alive and kicking in production? Why do we introduce a new set of babysitters, the Tech Operators (TechOps) team, to support the application and/or service in production?
- You may argue that TechOps people are less expensive than coders and that we can't waste a coder's valuable time on supporting a production application and/or service. But at what cost? At the cost of client satisfaction? At the cost of the client's trust in the product and/or service provider? Again, the same question: when your baby is sick, who will take better care of the baby, the parents or the babysitters? The obvious answer is the parents. Everyone knows it is the coders who best understand the product and/or service and can provide better remediation than anyone else when it is running in production.
But the major problem is that many coders don't really like doing production support; many feel it is a low-level job, not as interesting as coding (no offense, TechOps folks :-)). So Ben came up with a beautiful idea: treating "IT Operations as a Software Problem". The moment you say "IT Operations as a Software Problem", coders jump in to solve the curious software problem, be it in production or in development. Coding is in their DNA. Fixing software is their pastime :-)
Ben didn't want to play with words and mesmerize the developers into seeing production support as a fancy job. He really meant "Treat IT Operations as a Software Problem". So Ben was firmly convinced of two mantras:
- Engage developers in IT operations, not by force, of course, but by making it interesting.
- Treat every IT operation as a software problem.
With these two mantras as the base, Ben also promised his developers that they would devote 50% of their time to development and 50% to production support. Going back to the old question Ben had in his mind: "What happens when I employ my developers to do production support?" The answer is "magic", and it is called SRE.
It gave rise to a completely new methodology of IT support and operations called "Site Reliability Engineering". At the end of the day, a coder is a coder: if he/she sees any repeated, mundane task, he/she will write a piece of code to automate it. So automation becomes the first and core pillar of SRE. In the SRE world, we use automation to reduce repeated, mundane tasks, which are called toil in SRE terminology.
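To make the 50% split and the idea of toil a little more concrete, here is a minimal sketch in Python. It is only my own illustration, not how Google or any particular team measures things: the work log, the task names, the `OPS_CAP` value, and the rule that "a task done at least twice is toil" are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical one-week work log: (task name, hours spent, kind of work).
# "ops" = production support, "dev" = engineering work. All values are made up.
WORK_LOG = [
    ("restart stuck batch job", 1.0, "ops"),
    ("restart stuck batch job", 1.0, "ops"),
    ("restart stuck batch job", 1.5, "ops"),
    ("rotate logs by hand", 0.5, "ops"),
    ("rotate logs by hand", 0.5, "ops"),
    ("investigate new outage", 3.0, "ops"),
    ("build deployment tooling", 12.0, "dev"),
    ("write capacity dashboard", 8.0, "dev"),
]

OPS_CAP = 0.5          # the 50% production-support share described above
REPEAT_THRESHOLD = 2   # assumption: an ops task seen this many times counts as toil


def ops_share(log):
    """Return the fraction of logged hours spent on production support."""
    total = sum(hours for _, hours, _ in log)
    ops = sum(hours for _, hours, kind in log if kind == "ops")
    return ops / total if total else 0.0


def toil_candidates(log, threshold=REPEAT_THRESHOLD):
    """Return repeated ops tasks and their total hours, costliest first."""
    counts = Counter()
    hours_by_task = Counter()
    for name, hours, kind in log:
        if kind == "ops":
            counts[name] += 1
            hours_by_task[name] += hours
    repeated = {name: hours_by_task[name] for name, n in counts.items() if n >= threshold}
    return dict(sorted(repeated.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    share = ops_share(WORK_LOG)
    status = "over" if share > OPS_CAP else "within"
    print(f"Ops share: {share:.0%} ({status} the {OPS_CAP:.0%} cap)")
    for task, hours in toil_candidates(WORK_LOG).items():
        print(f"Automation candidate: {task} ({hours} h this week)")
```

With this made-up log, production support takes about 27% of the week, and the two repeated manual tasks show up as the first things a coder would want to script away. The one-off outage investigation is not flagged, because it was not repeated.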
So you may be curious to know what the other pillars and concepts of SRE are, what SRE really means, and how it differs from DevOps. Is SRE an advanced DevOps? Is it different from DevOps? You may have "n" number of questions in your mind. Thanks for being curious. Watch this space for further updates. Stay tuned.
SRE - To Be Continued...