NICTA Copyright 2012 From imagination to impact
Supporting Operations Personnel: A Software Engineering Perspective
Len Bass
About NICTA
National ICT Australia
• Federal and state funded research
company established in 2002
• Largest ICT research resource in
Australia
• National impact is an important
success metric
• ~700 staff/students working in 5 labs
across major capital cities
• 7 university partners
• Providing R&D services, knowledge
transfer to Australian (and global) ICT
industry
NICTA technology is
in over 1 billion mobile
phones
Traditional View from Software Engineers
Application
Cloud
Environment
Traditionally, the software engineering community
has viewed systems as being developed for users
and existing in an environment. Under this world
view, the motivating question has been: how can
development costs be reduced and run-time quality
improved?
End users
Developers
A Broader View
Application
Cloud
Environment
Applications are not only affected by the behavior of the
end users but also by actions of operators who control
the environment for a consumer’s application.
Consumer
Operator
End users
Developers
My Message: Consider the Operator in this
Picture
Application
Cloud
Environment
Consumer
Operator
End users
Developers
Computer operations is a domain that impacts every
application that operates in an enterprise environment. As
such, Software Engineers need to be aware of how
actions of operators can affect their application and how
actions of their application can simplify life for operators.
Business Context
“Through 2015, 80% of outages impacting mission-critical
services will be caused by people and process issues, and
more than 50% of those outages will be caused by
change/configuration/release integration and hand-off
issues.”
Change/configuration/release integration and hand off are
all operations issues.
Gartner - http://www.rbiassets.com/getfile.ashx/42112626510
"I&O [Infrastructure and operations] represents
approximately 60 percent of total IT spending worldwide,"
http://www.gartner.com/it/page.jsp?id=1807615
Outline
• Overview of operations domain
– What do operators do?
– What can go wrong with what they do?
• Some results NICTA has achieved or activities
we have ongoing
What Do Operators Do?
Akamai’s NOC in Cambridge, Massachusetts
• Monitor and control data center/network/system
activity
• Install new/upgraded
applications/middleware/configurations/hardware
• Support business continuity through back ups
and disaster recovery
Monitor and Control
• Data Center
– Total number and type of resources (may be virtual)
• Processors
• Storage
• Network
• Network
– Intrusion detection
– Routing
– Loading
• System
– Allocation to resources
– Install/uninstall
– Configure
What can go Wrong with Monitor and
Control?
Everything on the previous slide.
• Failure
– Installations can fail
– Resources fail and must be replaced
• Overload
– Resources are over/under loaded and must be
supplemented/removed
– Networks get overloaded and routing must be changed
• Error
– Routing may be incorrectly specified
– Allocation of systems to resources may be incorrect
– Configurations can be incorrectly specified
Install New/Upgraded Applications
• Specifying configuration for applications
• Synchronizing state for upgraded applications
• Testing new/upgraded applications in target
environment
• Allocating resources for new version
What Can go Wrong with Installation?
• Again, it's everything.
– Configuration can be misspecified
– Cut over to new version may leave inconsistent state
– Upgrade to level N of the stack may break software in
level >N of the stack
– Testing environment may not appropriately mirror real
environment
– Configuration of one level of the stack may be
inconsistent with requirements of another level.
Supporting Business Continuity
• Disasters happen – natural or human causes
• Backing up data provides recovery possibility
– Lag between last version backed up and when
disaster happens
– In the Cloud, backing up large amounts of data to
different geographic regions takes time.
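The two lag sources above can be put into rough numbers. The sketch below is back-of-envelope arithmetic, not a description of any product: it assumes each backup is a snapshot taken when the copy starts, that a copy is usable only once fully transferred, and that the link speed is constant. The function names and the sample figures are hypothetical.

```python
def transfer_seconds(size_gb: float, mbit_per_s: float) -> float:
    """Time to copy `size_gb` (decimal gigabytes) over a link of `mbit_per_s`."""
    return size_gb * 8 * 1000 / mbit_per_s


def worst_case_loss_window_s(interval_s: float, size_gb: float, mbit_per_s: float) -> float:
    """Oldest the newest complete remote copy can be when disaster strikes.

    Assumption: if the disaster hits just before the next copy finishes,
    the usable copy reflects the start of the previous backup.
    """
    return interval_s + transfer_seconds(size_gb, mbit_per_s)


# Illustrative numbers only: hourly backups of 500 GB over a 1 Gbit/s link.
print(transfer_seconds(500, 1000))                  # 4000.0 seconds
print(worst_case_loss_window_s(3600, 500, 1000))    # 7600.0 seconds
```

Under these assumptions, shortening the transfer time reduces the loss window even when the backup interval stays the same.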
Hand Offs
• Problems can arise when a shift changes
– What problems did old shift deal with?
– What problems were totally solved?
– What problems were partially solved?
– What operations activities are currently ongoing?
Operations is a Target Rich Environment
• There are many existing tools. Operation of data
centers would not work without tools
• Much room for improvement (see Gartner quote)
• Some general approaches for improvement
– Make software systems and operations tools process- and
incident-aware, e.g. aware of an upgrade or a shift
change
– Model operations processes and systems using a single model.
• Model analysis will provide opportunities for detecting trade offs between
human and automated activities.
• Model might also enable smoother error detection
Outline
• Overview of operations domain
• Some results we have achieved or activities
we have ongoing
– Disaster Recovery product
– Upgrade
– Operator undo
– Installation process.
Disaster Recovery
• Clouds fail – Amazon had three outages in 2011
that affected whole availability zones or regions.
• NICTA has a subsidiary (Yuruware) with a non-
intrusive disaster recovery product (Bolt).
• Bolt copies data periodically to a back up region.
• Bolt utilizes sophisticated data movement
techniques to reduce time required to back up
• This is an insurance policy.
Next Problem – Upgrade
• Upgrades are a very common occurrence
• Upgrade frequency of some common systems
• Some systems have multiple releases per day,
driven by developers – continuous deployment
Application            Average release interval
Facebook (platform)    < 7 days
Google Docs            < 50 days
MediaWiki              21 days
Joomla                 30 days
Various Upgrade Strategies
• How many at once?
– One at a time (rolling upgrade)
– Groups at a time (staged upgrade, e.g. canaries. This
is using production environment for testing)
– All at once (big flip)
• How long are new versions tested to determine
correctness?
– Period based – for some period of time
– Load based – under some utilization assumptions
• What happens to old versions?
– Replaced en masse
– Maintained for some period for compatibility purposes
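The "how many at once?" choice can be sketched as a function that splits a server list into upgrade batches. This is a minimal illustration rather than any tool's behavior; the function name, the strategy strings, and the `group_size` parameter are all hypothetical.

```python
from typing import Iterator, List


def upgrade_batches(servers: List[str], strategy: str, group_size: int = 2) -> Iterator[List[str]]:
    """Yield the groups of servers to upgrade together, in order."""
    if strategy == "rolling":
        size = 1                      # one server at a time
    elif strategy == "staged":
        size = group_size             # groups at a time, e.g. a canary group first
    elif strategy == "big_flip":
        size = max(len(servers), 1)   # everything in a single step
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for start in range(0, len(servers), size):
        yield servers[start:start + size]


servers = ["s1", "s2", "s3", "s4", "s5"]
for name in ("rolling", "staged", "big_flip"):
    print(name, list(upgrade_batches(servers, name)))
```

The sketch covers only the grouping. How long each batch is observed before the next one starts (period based or load based) would be a separate decision made between batches.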
Having Multiple Versions Simultaneously
Active May Lead to Mixed Version Race
Condition
[Figure: sequence diagram, steps numbered 1 to 5, between a Client (browser), Server 1 (old version), and Server 2 (new version). Labeled events: initial request; HTTP reply with embedded JavaScript; start of rolling upgrade; AJAX callback; ERROR.]
One Method for Preventing Mixed Version Race Condition
is to Make Load Balancers Version Aware
[Figure: routing hierarchy. A client may request a particular version of a service. An external-facing router (with respect to the cloud) sits above two internal routers, each of which fronts a mix of servers for Version A and Version B.]
At each level of the routing hierarchy
there are two possibilities for each
request
• Request is neutral with respect to
version
• Request specifies version
Routing must
• Be fast to ensure rapid response
• Satisfy “goodness” criteria for
scheduling
• Conform to the client's request with
respect to version.
In addition:
• Servers are being
upgraded to a later
version while servicing
client requests
• Load variation may
trigger elasticity rules
What is the Criterion for Measuring Load
Balancer Scheduling?
• What is “goodness” with respect to routing
decisions within the constraints of scheduling
strategy and version awareness?
– Uniform distribution of requests?
– Keeping utilization within bounds?
– Utilizing wide variety of clients?
– Other?
• Main result so far: version awareness is
incompatible with any of the above “goodness”
criteria for the staged upgrade strategy.
Canary or Staged Strategy
• Upgrade one or several servers to new version
and leave them for some time.
• Formulation:
– Staged upgrade
• M version A servers (constant number)
• N version B servers (constant number)
• Fixed number of clients
– Version aware
• Once a client has had a request serviced by a version B
server it cannot subsequently have any requests serviced by
any version A server.
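The version-aware constraint in this formulation can be sketched as a small router. This is an illustration under stated assumptions, not the load balancer studied in the talk: the class name, the per-client `seen_b` record, and the uniform random choice for clients that have not yet seen version B are all hypothetical.

```python
import random
from typing import Dict, List


class VersionAwareRouter:
    """Routes requests so that a client never returns to version A after seeing B."""

    def __init__(self, a_servers: List[str], b_servers: List[str], seed: int = 0):
        self.a_servers = a_servers          # M version A servers (fixed)
        self.b_servers = b_servers          # N version B servers (fixed)
        self.seen_b: Dict[str, bool] = {}   # hypothetical per-client record
        self.rng = random.Random(seed)

    def route(self, client_id: str) -> str:
        if self.seen_b.get(client_id):
            # Version-aware constraint: only version B servers are eligible.
            return self.rng.choice(self.b_servers)
        # Assumed policy for other clients: any server, chosen uniformly.
        server = self.rng.choice(self.a_servers + self.b_servers)
        if server in self.b_servers:
            self.seen_b[client_id] = True
        return server


router = VersionAwareRouter(["a1", "a2", "a3"], ["b1"])
print([router.route("client-1") for _ in range(10)])
```

Once a client's request lands on a version B server, every later request from that client is restricted to the version B pool, which is the source of the bifurcation discussed next.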
Bifurcation of Clients
• Clients are bifurcated into version A clients and
version B clients after some time
– Intuitively, each client is either serviced at some point by a
server with version B, and from then on never served
by any server with version A, or it is never served by a
server with version B. So each client ends up in the
Server A class or the Server B class, but not both.
• We call clients that end up being serviced only by
servers with version A class A clients, and
similarly for class B clients.
• Allowing additional clients does not
fundamentally change result.
Bifurcation of Clients Implies
• Cannot control for utilization unless new instances
of version B are created in response to demand
– There are a fixed number of clients sending requests to a fixed
number of servers with version B. Cannot vary the number of
servers to reflect the load generated by the fixed set of clients.
Consequently cannot control the utilization by servers with
version B.
• Cannot control for uniform distribution.
– Uniform distribution means that every request has an equal
chance of being sent to any server. If a client is in class A, then it
has 0% chance of being sent to a server with Version B.
• Difficult to control for wide variety of clients.
– Variations among the clients must be mirrored within class A and
class B clients since the classes are fixed after the bifurcation.
This is difficult to accomplish since types of variations that are
important are usually not known.
Questions to Answer.
– How long does it take to reach the bifurcated state, and
under what assumptions?
– How can the goals of staged upgrade be achieved
within the constraints of version awareness?
Next Problem
• Operators use scripts to perform actions such as
update
• Scripts may fail
– May be result of API failure (more on this later)
– May be desire to set up testing environment
– May be result of failure of underlying virtual machine.
• When a script fails, the operator may wish to
return to a known state (undo several
operations)
Operator Undo
• Not always that straight-forward:
– Attaching volume is no problem while the instance is
running, detaching might be problematic
– Creating / changing auto-scaling rules has effect on
number of running instances
• Cannot terminate additional instances, as the rule would
create new ones!
– Deleted / terminated / released resources are gone!
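The last point is why the slides list a "pseudo-delete" operation. The sketch below shows one plausible reading of that idea on a toy in-memory resource set: deletions are only recorded until commit, so a rollback can still bring the resource back. It uses a simple inverse-operation log, which is a swapped-in technique and not the planning-based approach outlined in the talk; the class and method names are hypothetical and no real cloud API is involved.

```python
from typing import Callable, List, Set


class UndoableSession:
    """Toy transaction over an in-memory set of resource names."""

    def __init__(self, resources: Set[str]):
        self.resources = resources
        self.undo_log: List[Callable[[], None]] = []
        self.pending_deletes: Set[str] = set()

    def create(self, name: str) -> None:
        self.resources.add(name)
        # Record how to reverse this operation.
        self.undo_log.append(lambda: self.resources.discard(name))

    def delete(self, name: str) -> None:
        # Pseudo-delete: hide the resource but keep it until commit.
        if name in self.resources:
            self.pending_deletes.add(name)

    def visible(self) -> Set[str]:
        return self.resources - self.pending_deletes

    def commit(self) -> None:
        self.resources -= self.pending_deletes   # deletions become real here
        self.pending_deletes.clear()
        self.undo_log.clear()

    def rollback(self) -> None:
        for undo in reversed(self.undo_log):     # reverse in last-in, first-out order
            undo()
        self.pending_deletes.clear()             # hidden resources reappear
        self.undo_log.clear()


resources = {"vm-1", "vol-1"}
session = UndoableSession(resources)
session.create("vm-2")
session.delete("vol-1")
print(sorted(session.visible()))
session.rollback()
print(sorted(resources))
```

A log of inverses does not cover the auto-scaling case above, where undoing one action triggers another; that interaction is part of what makes operator undo harder than it first appears.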
Undo for System Operators
[Figure: the Administrator issues begin-transaction, three do operations, and rollback. Added operations: commit and pseudo-delete.]
Approach
[Figure, built up over three slides: the Administrator issues begin-transaction, three do operations, and rollback. The Undo System senses cloud resource states (label shown twice) and works with an initial state and a goal state. Further labels: Plan, Generate code, Execute; Set of actions.]
What about API Failures?
• Operator scripts make heavy use of checking or
controlling state of resources
– Start/stop VM
– Is VM active?
• These script actions become calls to the cloud
provider’s API.
• Calls may fail
– Underlying VM has failed
– Eventual consistency.
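One common defensive pattern for script steps such as "Is VM active?" is to poll with a deadline instead of trusting a single answer. The sketch below assumes a caller-supplied `check` function that wraps whatever provider call is in use; `check`, `wait_for_state`, and the parameter names are hypothetical, and no real cloud API is called.

```python
import time
from typing import Callable, Optional


def wait_for_state(check: Callable[[], Optional[str]], wanted: str,
                   timeout_s: float = 30.0, interval_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it reports `wanted` or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            if check() == wanted:
                return True
        except Exception:
            # Assumption: a failed call is treated like a stale answer and retried.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with a stand-in for a provider call (no real API is contacted).
answers = iter(["pending", "pending", "running"])
print(wait_for_state(lambda: next(answers), "running", timeout_s=5.0, interval_s=0.0))
```

Polling tolerates answers that lag behind the real state, as eventual consistency allows. It does not help when the call itself never returns; that case would additionally need a timeout on the individual call.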
We Have Performed an Empirical Study of
API Failures in EC2
• 922 cases out of 1109 reported API-related
cases in the EC2 forum from 2010 to 2012 are
API failures (rather than feature requests or
general inquiries).
• We classified the extracted API failures into four
types of failures:
– content failures,
– late timing failures,
– halt failures, and
– erratic failures.
Results
• A majority (60%) of the cases of API failure are related to
stuck API calls or unresponsive API calls.
• A large portion (12%) of the cases are about slow-
responding API calls.
• 19% of the cases are related to output issues of API
calls, including failed calls with unclear error
messages, as well as missing output, wrong output, and
unexpected output of API calls.
• 9% of the cases reported that their calls were pending
for a certain time and then returned to the original state
without informing the caller properly or the calls were
reported to be successful first but failed later.
Next Problem - Operations Processes
• We are looking at the process of installing new
software
– Error Prone
– Potential process improvements.
Motivating Scenario
• You change the operating environment for an
application
– Configuration change
– Version change
– Hardware change
• Result is degraded performance
• When the software stack is deep with portions
from different suppliers, the result is frequently:
Why is Installation Error Prone?
• Installation is complicated.
– Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for
Linux are ~250 pages each
– Apache description of addresses and ports (one out of 16
descriptions) has following elements:
• Choosing and specifying ports for the server to listen to
• IPv4 and IPv6
• Protocols
• Virtual Hosts
– The number of configuration options that must be set can be
large
• Hadoop has 206 options
• HBase has 64
– Many dependencies are not visible until execution
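With option counts in the hundreds, one way to surface some problems before execution is to check a configuration against a declared schema. The sketch below is a minimal illustration; the option names and the `(type, required)` schema format are invented for the example and do not correspond to real Hadoop or HBase settings.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical option names and rules, for illustration only.
SCHEMA: Dict[str, Tuple[type, bool]] = {
    "listen_port": (int, True),     # (expected type, required?)
    "data_dir": (str, True),
    "replication": (int, False),
}


def validate(config: Dict[str, Any],
             schema: Dict[str, Tuple[type, bool]] = SCHEMA) -> List[str]:
    """Return a list of human-readable problems; empty means the config passed."""
    problems = []
    for key, (expected, required) in schema.items():
        if key not in config:
            if required:
                problems.append(f"missing required option: {key}")
        elif not isinstance(config[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(config[key]).__name__}")
    for key in config:
        if key not in schema:
            # A misspelled option name typically shows up here.
            problems.append(f"unknown option: {key}")
    return problems


for problem in validate({"listen_port": "8080", "data_dri": "/var/data"}):
    print(problem)
```

A static check like this catches misspelled names and wrong types. It cannot catch the dependencies mentioned above that only become visible at execution time, such as a port already in use by another layer of the stack.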
Installation Processes
• Processes may be
– Undocumented
– Out of date
– Insufficiently detailed
• Our goal is to build process model including
error recovery mechanisms
Our Activities
• Create up to date process models for installation
processes. Information sources are
– Process discovery from logs
– Process formalization from existing written
descriptions.
• Process descriptions can be used to
– Make trade offs
– Make recommendations in real time to operations
staff
– Recommend setting checkpoints for potential later
undo, before a risky part of a process is entered
– Assist in the detection of errors
Hard Problems
• Creating accurate process models
– Exception handling mechanisms are not well
documented
– Labor intensive.
– Our approach
• Top down modeling using process modeling formalism
• Bottom up process mining from error logs
• Diagnosing errors
Why is Error Diagnosis Hard?
In a distributed computing
environment, when an error
occurs during operations, it is
difficult and time consuming to
diagnose it.
Diagnosis involves correlating
messages from
• different distributed servers
• different portions of the
software stack
and determining the root
cause of the error.
The root cause, in turn, may
be within a portion of the stack
that is different from where the
error is observed.
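A first step in that correlation is to place messages from every server and stack layer on one timeline and look at what happened just before the observed error. The sketch below assumes each log is already sorted by time and that clocks are synchronized across servers, which is often not true in practice. The `LogLine` tuple layout, the function names, and the sample messages are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

LogLine = Tuple[float, str, str]   # (timestamp in seconds, source, message)


def merge_logs(*streams: Iterable[LogLine]) -> Iterator[LogLine]:
    """Merge per-source logs, each already sorted by time, into one timeline."""
    return heapq.merge(*streams, key=lambda line: line[0])


def window_before(timeline: List[LogLine], error_time: float, seconds: float) -> List[LogLine]:
    """Lines from any source in the `seconds` leading up to the observed error."""
    return [line for line in timeline if error_time - seconds <= line[0] <= error_time]


# Invented sample data: the error is observed in one layer, the cause logged in another.
app = [(1.0, "app", "started"), (5.0, "app", "request failed")]
middleware = [(2.0, "middleware", "config loaded"), (4.5, "middleware", "connection refused")]

timeline = list(merge_logs(app, middleware))
for line in window_before(timeline, error_time=5.0, seconds=1.0):
    print(line)
```

Merging by timestamp only lines the messages up. Deciding which earlier message is the root cause still requires knowledge of how the layers depend on each other.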
Test Bed
Our current test bed is the HBase stack.
Currently Performing Analysis of
Configuration Errors
• Cross stack errors may take hours to diagnose
– Log files are inconsistent
– Error message may not give context necessary to
determine root cause.
Where to Find Information about Operations
Domain?
• Every open source program requires a variety of
configuration parameters.
• Every modern application depends on a variety
of middleware so cross domain examples should
be readily available.
• Most organizations have extensive processes for
their operations personnel. Use these processes
as a framework for investigating process/product
interactions.
Summary
• Operations problems will account for the majority
of outages and IT costs in the next several
years.
• The operations space is a rich source of
research problems that has been insufficiently
mined.
• Best way to determine what problems to attack
is to monitor or interview operators
NICTA Team
• Anna Liu
• Alan Fekete
• Min Fu
• Jim Zhanwen Li
• Qinghua Lu
• Hiroshi Wada
• Ingo Weber
• Xiwei Xu
• Liming Zhu
47

More Related Content

PDF
Dev ops for software architects
PPTX
The quality attribute of upgradability
PPT
Remedy rapid deployment 1
PDF
Ibm innovate adoption of continuous delivery at scale at a large telco - pr...
PPTX
Sdlc models
PDF
Nilesh Profile 2016
PDF
IBM Collaborative Lifecycle Management Solution for DevOps v6
PPTX
ITIL Best Practice for Software Companies
Dev ops for software architects
The quality attribute of upgradability
Remedy rapid deployment 1
Ibm innovate adoption of continuous delivery at scale at a large telco - pr...
Sdlc models
Nilesh Profile 2016
IBM Collaborative Lifecycle Management Solution for DevOps v6
ITIL Best Practice for Software Companies

What's hot (20)

PDF
All About Jazz Team Server Technology
PDF
Software/System Development Life Cycle
PDF
Engineering Systems For The Cloud
PDF
IT Ops Mgmt in the New Virtualized, Software-defined World
 
PPTX
Virtual Private Data Center Solution Overview
PPTX
Software Lifecycle
PDF
DevOps for Enterprise Systems : Innovate like a Startup
DOCX
Chuck_Roden_Resume
PPTX
Beit 381 se lec 2 - 27 - 12 feb08
PPTX
The Changing Role of IT: From Service Managers to Advisors
PDF
Fifteen Years of DevOps -- LISA 2012 keynote
PPTX
Federating Subversion and Git
PPTX
Netpod - The Merging of NPM & APM
PDF
Real Cost of Software Remediation
PPTX
Building The Agile Enterprise - LSSC '12
DOCX
Chuck_Roden_Resume
PDF
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration
PPTX
From monolith to resilient microservices
All About Jazz Team Server Technology
Software/System Development Life Cycle
Engineering Systems For The Cloud
IT Ops Mgmt in the New Virtualized, Software-defined World
 
Virtual Private Data Center Solution Overview
Software Lifecycle
DevOps for Enterprise Systems : Innovate like a Startup
Chuck_Roden_Resume
Beit 381 se lec 2 - 27 - 12 feb08
The Changing Role of IT: From Service Managers to Advisors
Fifteen Years of DevOps -- LISA 2012 keynote
Federating Subversion and Git
Netpod - The Merging of NPM & APM
Real Cost of Software Remediation
Building The Agile Enterprise - LSSC '12
Chuck_Roden_Resume
Four Essential Steps for Removing Risk and Downtime from POWER9 Migration
From monolith to resilient microservices
Ad

Viewers also liked (16)

PDF
Architecting for the cloud scability-availability
PDF
Architecting for the cloud cloud providers
PDF
Error in hadoop
PPTX
Architecture patterns for continuous deployment
PPTX
WICSA 2012 tutorial
PDF
Architecture for the cloud deployment case study future
PDF
My first deployment pipeline
PDF
Introduction to dev ops
PDF
Deployability
PDF
Architecting for the cloud elasticity security
PDF
Architecting for the cloud intro, virtualization, iaa s
PDF
Architecting for the cloud storage build test
PPTX
Architectural Tactics for Large Scale Systems
PDF
Packaging tool options
PDF
Architecting for the cloud storage misc topics
PDF
Principles of software architecture design
Architecting for the cloud scability-availability
Architecting for the cloud cloud providers
Error in hadoop
Architecture patterns for continuous deployment
WICSA 2012 tutorial
Architecture for the cloud deployment case study future
My first deployment pipeline
Introduction to dev ops
Deployability
Architecting for the cloud elasticity security
Architecting for the cloud intro, virtualization, iaa s
Architecting for the cloud storage build test
Architectural Tactics for Large Scale Systems
Packaging tool options
Architecting for the cloud storage misc topics
Principles of software architecture design
Ad

Similar to Supporting operations personnel a software engineers perspective (20)

PDF
Eliciting Operations Requirements for Applications
PPTX
Dependable Operation - Performance Management and Capacity Planning Under Con...
PPT
Dependable Operations
PPTX
Challenges in Practicing High Frequency Releases in Cloud Environments
PPTX
issues with the use of canaries in upgrade
PPT
Modelling and Analysing Operation Processes for Dependability
PDF
Automatic Undo for Cloud Management via AI Planning
PPTX
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
PPT
Cloud API Issues: an Empirical Study and Impact
PPTX
Fosec2011 keynote address
PDF
Cloud Computing Conf 1209
PPTX
Cloud Computing basics Introduction, Startup
PDF
Rise of devops
PPT
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
PDF
Evolving to Cloud-Native - Anand Rao
PDF
Monthly Technology Brief
PDF
Orchestration Panel at Cloud Connect 2010
PPTX
HP CTO Summit, New Jersey, March 24, 2010
PPTX
PuppetConf2012GeneKim
PPTX
Cloud Computing: What it Means for Libraries, Library Staff, Training and Skills
Eliciting Operations Requirements for Applications
Dependable Operation - Performance Management and Capacity Planning Under Con...
Dependable Operations
Challenges in Practicing High Frequency Releases in Cloud Environments
issues with the use of canaries in upgrade
Modelling and Analysing Operation Processes for Dependability
Automatic Undo for Cloud Management via AI Planning
POD-Diagnosis: Error Detection and Diagnosis of Sporadic Operations on Cloud ...
Cloud API Issues: an Empirical Study and Impact
Fosec2011 keynote address
Cloud Computing Conf 1209
Cloud Computing basics Introduction, Startup
Rise of devops
Cloud Computing Outages - Analysis of Key Outages 2009 - 2012
Evolving to Cloud-Native - Anand Rao
Monthly Technology Brief
Orchestration Panel at Cloud Connect 2010
HP CTO Summit, New Jersey, March 24, 2010
PuppetConf2012GeneKim
Cloud Computing: What it Means for Libraries, Library Staff, Training and Skills

More from Len Bass (20)

PDF
Devops syllabus
PDF
DevOps Syllabus summer 2020
PDF
11 secure development
PDF
10 disaster recovery
PDF
9 postproduction
PDF
8 pipeline
PDF
7 configuration management
PDF
6 microservice architecture
PDF
5 infrastructure security
PPTX
4 container management
PDF
3 the cloud
PDF
1 virtual machines
PDF
2 networking
PDF
Quantum talk
PDF
Icsa2018 blockchain tutorial
PDF
Experience in teaching devops
PDF
Understanding blockchains
PDF
What is a blockchain
PDF
Dev ops and safety critical systems
PDF
Securing deployment pipeline
Devops syllabus
DevOps Syllabus summer 2020
11 secure development
10 disaster recovery
9 postproduction
8 pipeline
7 configuration management
6 microservice architecture
5 infrastructure security
4 container management
3 the cloud
1 virtual machines
2 networking
Quantum talk
Icsa2018 blockchain tutorial
Experience in teaching devops
Understanding blockchains
What is a blockchain
Dev ops and safety critical systems
Securing deployment pipeline

Recently uploaded (20)

PDF
A comparative study of natural language inference in Swahili using monolingua...
PPTX
Tartificialntelligence_presentation.pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Approach and Philosophy of On baking technology
PDF
Encapsulation theory and applications.pdf
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PPTX
TLE Review Electricity (Electricity).pptx
PDF
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
DASA ADMISSION 2024_FirstRound_FirstRank_LastRank.pdf
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
project resource management chapter-09.pdf
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
WOOl fibre morphology and structure.pdf for textiles
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
A comparative study of natural language inference in Swahili using monolingua...
Tartificialntelligence_presentation.pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Approach and Philosophy of On baking technology
Encapsulation theory and applications.pdf
Accuracy of neural networks in brain wave diagnosis of schizophrenia
TLE Review Electricity (Electricity).pptx
ENT215_Completing-a-large-scale-migration-and-modernization-with-AWS.pdf
A comparative analysis of optical character recognition models for extracting...
Programs and apps: productivity, graphics, security and other tools
DASA ADMISSION 2024_FirstRound_FirstRank_LastRank.pdf
gpt5_lecture_notes_comprehensive_20250812015547.pdf
MIND Revenue Release Quarter 2 2025 Press Release
Hindi spoken digit analysis for native and non-native speakers
Building Integrated photovoltaic BIPV_UPV.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
project resource management chapter-09.pdf
NewMind AI Weekly Chronicles - August'25-Week II
WOOl fibre morphology and structure.pdf for textiles
Agricultural_Statistics_at_a_Glance_2022_0.pdf

Supporting operations personnel a software engineers perspective

  • 1. NICTA Copyright 2012 From imagination to impact Supporting Operations Personnel: A Software Engineering Perspective Len Bass
  • 2. NICTA Copyright 2012 From imagination to impact 2 About NICTA National ICT Australia • Federal and state funded research company established in 2002 • Largest ICT research resource in Australia • National impact is an important success metric • ~700 staff/students working in 5 labs across major capital cities • 7 university partners • Providing R&D services, knowledge transfer to Australian (and global) ICT industry NICTA technology is in over 1 billion mobile phones
  • 3. NICTA Copyright 2012 From imagination to impact Traditional View from Software Engineers 3 Application Cloud Environment Traditionally, the software engineering community has viewed systems as being developed for users and existing in an environment. The motivating questions have been: With this world view: how can development costs be reduced and run time quality improved? End users Developers
  • 4. NICTA Copyright 2012 From imagination to impact A Broader View 4 Application Cloud Environment Applications are not only affected by the behavior of the end users but also by actions of operators who control the environment for a consumer’s application. Consumer Operator End users Developers
  • 5. NICTA Copyright 2012 From imagination to impact My Message: Consider the Operator in this Picture 5 Application Cloud Environment Consumer Operator End users Developers Computer operations is a domain that impacts every application that operates in an enterprise environment. As such, Software Engineers need to be aware of how actions of operators can affect their application and how actions of their application can simplify life for operators..
  • 6. NICTA Copyright 2012 From imagination to impact Business Context “Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues.” Change/configuration/release integration and hand off are all operations issues. Gartner - http://www.rbiassets.com/getfile.ashx/42112626510 "I&O [Infrastructure and operations] represents approximately 60 percent of total IT spending worldwide, " http://www.gartner.com/it/page.jsp?id=1807615 6
  • 7. NICTA Copyright 2012 From imagination to impact Outline • Overview of operations domain – What do operators do? – What can go wrong with what they do? • Some results NICTA has achieved or activities we have ongoing 7
  • 8. NICTA Copyright 2012 From imagination to impact What Do Operators Do? 8 Akamai’s NOC in Cambridge, Massachusetts • Monitor and control data center/network/system activity – Install new/upgraded applications/middleware/configurations/hardware • Support business continuity through back ups and disaster recovery
  • 9. NICTA Copyright 2012 From imagination to impact Monitor and Control • Data Center – Total number and type of resources (may be virtual) • Processors • Storage • Network • Network – Intrusion detection – Routing – Loading • System – Allocation to resources – Install/uninstall – Configure 9
  • 10. NICTA Copyright 2012 From imagination to impact What can go Wrong with Monitor and Control? Everything that was on previous slide. • Failure • Installations can fail • Resources fail and must be replaced • Overload – Resources are over/under loaded and must be supplemented/removed – Networks get overloaded and routing must be changed • Error – Routing may be incorrectly specified – Allocation of systems to resources may be incorrect – Configurations can be incorrectly specified 10
  • 11. NICTA Copyright 2012 From imagination to impact Install New/Upgraded Applications • Specifying configuration for applications • Synchronizing state for upgraded applications • Testing new/upgraded applications in target environment • Allocating resources for new version 11
  • 12. NICTA Copyright 2012 From imagination to impact What Can go Wrong with Installation? • Again its everything. – Configuration can be misspecified – Cut over to new version may leave inconsistent state – Upgrade to level N of the stack may break software in level >N of the stack – Testing environment may not appropriately mirror real environment – Configuration of one level of the stack may be inconsistent with requirements of another level. 12
  • 13. NICTA Copyright 2012 From imagination to impact Supporting Business Continuity • Disasters happen – natural or human causes • Backing up data provides recovery possibility – Lag between last version backed up and when disaster happens – In the Cloud, backing up large amounts of data to different geographic regions takes time. 13
  • 14. NICTA Copyright 2012 From imagination to impact Hand Offs • Problems can arise when a shift changes – What problems did old shift deal with? – What problems were totally solved? – What problems were partially solved? – What operations activities are currently ongoing? 14
  • 15. Operations is a Target Rich Environment • There are many existing tools. Operation of data centers would not work without them. • Much room for improvement (see Gartner quote) • Some general approaches for improvement – Make software systems, operations, and tools process and incident aware, e.g. make them aware of an upgrade or shift change – Model operations processes and systems using a single model. • Model analysis will provide opportunities for detecting trade-offs between human and automated activities. • Model might also enable smoother error detection
  • 16. Outline • Overview of operations domain • Some results we have achieved or activities we have ongoing – Disaster Recovery product – Upgrade – Operator undo – Installation process.
  • 17. Disaster Recovery • Clouds fail – Amazon had three outages in 2011 that affected whole availability zones or regions. • NICTA has a subsidiary (Yuruware) with a non-intrusive disaster recovery product (Bolt). • Bolt copies data periodically to a backup region. • Bolt utilizes sophisticated data movement techniques to reduce the time required to back up. • This is an insurance policy.
  • 18. Next Problem – Upgrade • Upgrades are a very common occurrence • Some systems have multiple releases per day, driven by developers – continuous deployment • Upgrade frequency of some common systems (application: average release interval): Facebook (platform): < 7 days; Google Docs: < 50 days; MediaWiki: 21 days; Joomla: 30 days
  • 19. Various Upgrade Strategies • How many at once? – One at a time (rolling upgrade) – Groups at a time (staged upgrade, e.g. canaries. This is using production environment for testing) – All at once (big flip) • How long are new versions tested to determine correctness? – Period based – for some period of time – Load based – under some utilization assumptions • What happens to old versions? – Replaced en masse – Maintained for some period for compatibility purposes
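The "how many at once?" choice can be sketched as a batching function. This is only an illustration: the function name `upgrade_batches`, the strategy labels, and the `group_size` parameter are hypothetical and do not come from the slides.

```python
from typing import List


def upgrade_batches(servers: List[str], strategy: str, group_size: int = 2) -> List[List[str]]:
    """Return the batches in which servers would be upgraded, in order.

    Strategy labels are illustrative names for the three options on the slide.
    """
    if strategy == "rolling":      # one server at a time
        return [[s] for s in servers]
    if strategy == "staged":       # groups at a time, e.g. canaries first
        return [servers[i:i + group_size] for i in range(0, len(servers), group_size)]
    if strategy == "big_flip":     # all servers at once
        return [list(servers)] if servers else []
    raise ValueError(f"unknown strategy: {strategy}")


servers = ["s1", "s2", "s3", "s4", "s5"]
rolling = upgrade_batches(servers, "rolling")
staged = upgrade_batches(servers, "staged", group_size=2)
big_flip = upgrade_batches(servers, "big_flip")
```

The sketch leaves out the other two questions on the slide (how long each batch is observed, and what happens to the old versions), which would sit between batches in a real procedure.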
  • 20. Having Multiple Versions Simultaneously Active May Lead to Mixed Version Race Condition [Figure: numbered message sequence between Client (browser), Server 1 (old version), and Server 2 (new version); labels: initial request, HTTP reply with embedded JavaScript, start rolling upgrade, AJAX callback, ERROR]
  • 21. One Method for Preventing Mixed Version Race Condition is to Make Load Balancers Version Aware [Figure: routing hierarchy with an external-facing router (with respect to the cloud) above internal routers, each in front of servers for Version A and Version B; a client may request a particular version of a service] At each level of the routing hierarchy there are two possibilities for each request • Request is neutral with respect to version • Request specifies version Routing must • Be fast to ensure rapid response • Satisfy “goodness” criteria for scheduling • Conform to client request with respect to version. In addition: • Servers are being upgraded to a later version while servicing client requests • Load variation may trigger elasticity rules
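A minimal sketch of a single version-aware routing decision follows. The server records, the `route` function, and the choice of "least loaded" as the scheduling criterion are all assumptions made for illustration; the slides do not specify a scheduling rule.

```python
from typing import Dict, List, Optional


def route(servers: List[Dict], requested_version: Optional[str] = None) -> Dict:
    """Pick a server for one request.

    A version-neutral request (requested_version is None) may go to any server;
    a request that specifies a version may only go to servers of that version.
    Among eligible servers the least loaded one is chosen (an assumed criterion).
    """
    eligible = [
        s for s in servers
        if requested_version is None or s["version"] == requested_version
    ]
    if not eligible:
        raise LookupError(f"no server available for version {requested_version!r}")
    return min(eligible, key=lambda s: s["load"])


# Hypothetical server pool behind one internal router.
pool = [
    {"name": "a1", "version": "A", "load": 0.7},
    {"name": "a2", "version": "A", "load": 0.2},
    {"name": "b1", "version": "B", "load": 0.5},
]
neutral_choice = route(pool)
version_b_choice = route(pool, "B")
```

In a routing hierarchy the same decision would be repeated at each level, with the pool at upper levels consisting of downstream routers rather than servers.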
  • 22. What is the Criterion for Measuring Load Balancer Scheduling? • What is “goodness” with respect to routing decisions within the constraints of scheduling strategy and version awareness? – Uniform distribution of requests? – Keeping utilization within bounds? – Utilizing wide variety of clients? – Other? • Main result so far: version awareness is incompatible with any of the above “goodness” criteria for the staged upgrade strategy.
  • 23. Canary or Staged Strategy • Upgrade one or several servers to new version and leave them for some time. • Formulation: – Staged upgrade • M version A servers (constant number) • N version B servers (constant number) • Fixed number of clients – Version aware • Once a client has had a request serviced by a version B server it cannot subsequently have any requests serviced by any version A server.
  • 24. Bifurcation of Clients • Clients are bifurcated into version A clients and version B clients after some time – Intuitively, each client is either serviced by a server with version B at some point, and consequently never again served by any server with version A, or it is never served by a server with version B. So each client ends up in the Server A class or the Server B class but not both. • We call clients that end up being serviced by servers with version A class A clients, and similarly for class B clients. • Allowing additional clients does not fundamentally change the result.
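The version-aware constraint can be explored with a small simulation. This is a sketch under a loudly stated assumption: clients that have not yet been served by version B are routed uniformly at random over all M + N servers. The slides do not specify this policy, and the function name `simulate_pinning` is hypothetical.

```python
import random
from typing import List


def simulate_pinning(num_clients: int, m_a: int, n_b: int, rounds: int, seed: int = 0) -> List[int]:
    """Return, per round, how many clients are pinned to version B.

    Assumed policy (not from the slides): an unpinned client is routed
    uniformly at random over all servers; once it is served by a version B
    server it may only be routed to version B servers afterwards.
    """
    rng = random.Random(seed)
    servers = ["A"] * m_a + ["B"] * n_b
    pinned = [False] * num_clients
    history = []
    for _ in range(rounds):
        for c in range(num_clients):
            # Each client sends one request per round.
            version = "B" if pinned[c] else rng.choice(servers)
            if version == "B":
                pinned[c] = True
        history.append(sum(pinned))
    return history


history = simulate_pinning(num_clients=100, m_a=8, n_b=2, rounds=20)
```

Under this assumed policy the pinned set can only grow, so the split between class A and class B depends on how long the staged period lasts. A different policy for uncommitted clients would give a different answer to "how long does it take to reach the bifurcated state".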
  • 25. Bifurcation of Clients Implies • Cannot control for utilization unless new instances of version B are created in response to demand – There are a fixed number of clients sending requests to a fixed number of servers with version B. The number of servers cannot be varied to reflect the load generated by the fixed set of clients. Consequently the utilization of servers with version B cannot be controlled. • Cannot control for uniform distribution. – Uniform distribution means that every request has an equal chance of being sent to any server. If a client is in class A, then it has 0% chance of being sent to a server with version B. • Difficult to control for wide variety of clients. – Variations among the clients must be mirrored within class A and class B clients since the classes are fixed after the bifurcation. This is difficult to accomplish since the types of variations that are important are usually not known.
  • 26. Questions to Answer. – How long does it take to reach bifurcated state under what assumptions? – How can the goals of staged upgrade be achieved within the constraints of version awareness?
  • 27. Next Problem • Operators use scripts to perform actions such as update • Scripts may fail – May be result of API failure (more on this later) – May be desire to set up testing environment – May be result of failure of underlying virtual machine. • When a script fails, the operator may wish to return to a known state (undo several operations)
  • 28. Operator Undo • Not always that straightforward: – Attaching a volume is no problem while the instance is running, detaching might be problematic – Creating / changing auto-scaling rules has an effect on the number of running instances • Cannot terminate additional instances, as the rule would create new ones! – Deleted / terminated / released resources are gone!
  • 29. Undo for System Operators [Figure: the administrator issues begin-transaction, a series of do operations, and rollback; additional labels: commit, pseudo-delete]
  • 30. Approach [Figure: the administrator issues begin-transaction, do operations, and rollback; the undo system performs “sense cloud resources states” twice]
  • 31. Approach [Figure: as on the previous slide, with the two sensed states labeled goal state and initial state]
  • 32. Approach [Figure: as on the previous slide, with the undo system adding the steps plan, generate code, and execute, which produce a set of actions from the initial state and goal state]
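The sense-then-plan idea can be sketched as a comparison of two resource snapshots. This is a simplification and an assumption about the figure: the state captured at begin-transaction is treated as the goal state for rollback, the state sensed at rollback time as the initial state, and the planner is replaced by a plain dictionary diff. The names `plan_rollback` and the action labels are hypothetical.

```python
from typing import Dict, List, Tuple

# A snapshot maps a resource id to its sensed attributes.
State = Dict[str, dict]


def plan_rollback(checkpoint: State, current: State) -> List[Tuple[str, str]]:
    """Return (action, resource_id) pairs that move `current` back to `checkpoint`."""
    actions: List[Tuple[str, str]] = []
    # Resources created since the checkpoint have to be removed.
    for rid in sorted(current.keys() - checkpoint.keys()):
        actions.append(("delete", rid))
    # Resources removed since the checkpoint have to come back; this is only
    # possible if they were pseudo-deleted rather than really released.
    for rid in sorted(checkpoint.keys() - current.keys()):
        actions.append(("recreate", rid))
    # Resources present in both snapshots but changed have to be restored.
    for rid in sorted(checkpoint.keys() & current.keys()):
        if checkpoint[rid] != current[rid]:
            actions.append(("restore", rid))
    return actions


checkpoint = {"vm-1": {"state": "running"}, "vol-1": {"attached_to": "vm-1"}}
current = {"vm-1": {"state": "stopped"}, "vm-2": {"state": "running"}}
plan = plan_rollback(checkpoint, current)
```

A plain diff ignores ordering constraints between actions, such as the auto-scaling rule that would recreate terminated instances, which is presumably why the approach on the slides includes an explicit planning step.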
  • 33. What about API Failures? • Operator scripts make heavy use of checking or controlling state of resources – Start/stop VM – Is VM active? • These scripts become calls to the cloud provider’s API. • Calls may fail – Underlying VM has failed – Eventual consistency.
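One common defensive pattern for scripts that check resource state is to poll with a deadline instead of trusting a single call. The sketch below is generic: `get_state` stands in for whatever state-query call a script uses and is a hypothetical callable, not a real cloud provider API; the `sleep` and `clock` parameters exist only so the function can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    expected: str,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns `expected` or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            if get_state() == expected:
                return True
        except Exception:
            # A failed call is treated as "not there yet"; a real script
            # would log the error and distinguish permanent failures.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Simulated eventually consistent resource: reports "pending" twice, then "running".
responses = iter(["pending", "pending", "running"])
became_running = wait_for_state(lambda: next(responses), "running", timeout_s=5.0, sleep=lambda s: None)
never_ready = wait_for_state(lambda: "pending", "running", timeout_s=0.0, sleep=lambda s: None)
```

The deadline turns a stuck or unresponsive check into an explicit failure that the script can react to, for example by returning to a known state.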
  • 34. We Have Performed an Empirical Study of API Failures in EC2 • 922 cases out of 1109 reported API-related cases in the EC2 forum from 2010 to 2012 are API failures (rather than feature requests or general inquiries). • We classified the extracted API failures into four types of failures: – content failures, – late timing failures, – halt failures, and – erratic failures.
  • 35. Results • A majority (60%) of the cases of API failure are related to stuck or unresponsive API calls. • A large portion (12%) of the cases are about slow-responding API calls. • 19% of the cases are related to output issues of API calls, including failed calls with unclear error messages, as well as missing output, wrong output, and unexpected output of API calls. • 9% of the cases reported that calls were pending for a certain time and then returned to the original state without informing the caller properly, or that calls were first reported to be successful but failed later.
  • 36. Next Problem - Operations Processes • We are looking at the process of installing new software – Error Prone – Potential process improvements.
  • 37. Motivating Scenario • You change the operating environment for an application – Configuration change – Version change – Hardware change • Result is degraded performance • When the software stack is deep with portions from different suppliers, the result is frequently:
  • 38. Why is Installation Error Prone? • Installation is complicated. – Installation guides for SAS 9.3 Intelligence, IBM i, Oracle 11g for Linux are ~250 pages each – Apache description of addresses and ports (one out of 16 descriptions) has following elements: • Choosing and specifying ports for the server to listen to • IPv4 and IPv6 • Protocols • Virtual Hosts – The number of configuration options that must be set can be large • Hadoop has 206 options • HBase has 64 – Many dependencies are not visible until execution
  • 39. Installation Processes • Processes may be – Undocumented – Out of date – Insufficiently detailed • Our goal is to build a process model that includes error recovery mechanisms
  • 40. Our Activities • Create up-to-date process models for installation processes. Information sources are – Process discovery from logs – Process formalization from existing written descriptions. • Process descriptions can be used to – Make trade-offs – Make recommendations in real time to operations staff – Recommend setting checkpoints for potential later undo, before a risky part of a process is entered – Assist in the detection of errors
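Process discovery from logs usually starts from a "directly follows" relation: how often one activity is immediately followed by another within the same process instance. The sketch below computes that relation from a toy event log. The event format (case id, timestamp, activity) and the activity names are assumptions for illustration, and this is a standard first step in process mining rather than the specific technique used by the authors.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Event = Tuple[str, int, str]  # (case_id, timestamp, activity)


def directly_follows(events: Iterable[Event]) -> Counter:
    """Count how often activity a is directly followed by activity b within a case."""
    traces = defaultdict(list)
    # Order events per case by timestamp to recover each trace.
    for case_id, _ts, activity in sorted(events, key=lambda e: (e[0], e[1])):
        traces[case_id].append(activity)
    counts: Counter = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))
    return counts


# Hypothetical installation log with two installation runs.
log = [
    ("run-1", 1, "download"),
    ("run-1", 2, "configure"),
    ("run-1", 3, "start"),
    ("run-2", 1, "download"),
    ("run-2", 2, "start"),
]
relation = directly_follows(log)
```

Rare transitions in such a relation, like a run that skips the configure step, are candidates for undocumented shortcuts or errors that a process model should capture.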
  • 41. Hard Problems • Creating accurate process models – Exception handling mechanisms are not well documented – Labor intensive. – Our approach • Top down modeling using process modeling formalism • Bottom up process mining from error logs • Diagnosing errors
  • 42. Why is Error Diagnosis Hard? In a distributed computing environment, when an error occurs during operations, it is difficult and time consuming to diagnose it. Diagnosis involves correlating messages from • different distributed servers • different portions of the software stack and determining the root cause of the error. The root cause, in turn, may be within a portion of the stack that is different from where the error is observed.
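The correlation step can be sketched as merging per-source logs into one timeline and looking at what happened shortly before the observed error. The log format, the source names, and the function `merged_window` are assumptions for illustration; the sketch also assumes the sources have synchronized clocks, which real deployments may not guarantee.

```python
import heapq
from typing import Dict, List, Tuple

LogEntry = Tuple[float, str]  # (timestamp, message), sorted by timestamp per source


def merged_window(
    logs: Dict[str, List[LogEntry]], error_time: float, window_s: float
) -> List[Tuple[float, str, str]]:
    """Merge per-source logs by time and keep entries in the window before the error."""
    tagged = (
        [(ts, source, msg) for ts, msg in entries]
        for source, entries in logs.items()
    )
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    merged = heapq.merge(*tagged, key=lambda e: e[0])
    return [e for e in merged if error_time - window_s <= e[0] <= error_time]


# Hypothetical logs from two layers of a stack.
logs = {
    "hbase": [(10.0, "region server start"), (42.0, "connection refused")],
    "hdfs": [(5.0, "namenode up"), (40.5, "datanode lost")],
}
timeline = merged_window(logs, error_time=42.0, window_s=5.0)
```

In this toy timeline the message from the lower layer precedes the error seen in the upper layer, which matches the point on the slide that the root cause may sit in a different portion of the stack from where the error is observed.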
  • 43. Test Bed • Our current test bed is the HBase stack.
  • 44. Currently Performing Analysis of Configuration Errors • Cross stack errors may take hours to diagnose – Log files are inconsistent – Error messages may not give the context necessary to determine root cause.
  • 45. Where to Find Information about Operations Domain? • Every open source program requires a variety of configuration parameters. • Every modern application depends on a variety of middleware so cross domain examples should be readily available. • Most organizations have extensive processes for their operations personnel. Use these processes as a framework for investigating process/product interactions.
  • 46. Summary • Operations problems will account for the majority of outages and IT costs in the next several years. • The operations space is a rich source of research problems that has been insufficiently mined. • Best way to determine what problems to attack is to monitor or interview operators
  • 47. NICTA Team • Anna Liu • Alan Fekete • Min Fu • Jim Zhanwen Li • Qinghua Lu • Hiroshi Wada • Ingo Weber • Xiwei Xu • Liming Zhu