Executive’s guide to
developing AI at scale
Developing artificial intelligence and analytics applications typically
involves different processes, technology, and talent than those for
traditional software solutions. Executives who possess a solid understanding
of the basics can ensure they’re making the right investments
in their tech stacks and teams to build reliable solutions at scale.
by Nayur Khan, Brian McCarthy, and Adi Pradhan
October 2020
Overview

Lab
Due to its experimental nature, analytics development work—including data exploration, experimentation with predictive models, and development of prototypes through rapid iterations—must be performed in a “lab” environment that’s separate from other systems so that it doesn’t hinder normal business operations. Lab technologies must be flexible and scalable to handle the changing demands of the analytical approach (eg, new data, new modeling techniques) and modular to enable developed solutions to port to the factory through MLOps.

Factory
After development in the lab, analytics models move into the “factory,” which provides an environment for running analytics jobs 24 hours a day, 7 days a week, 52 weeks a year. In order to put a solution into production at scale (ie, making it regularly and reliably accessible to users), it has to be robust (able to handle typical errors, including variances in incoming real-world data), maintainable, executed efficiently through continuous deployment processes, and integrated with core systems, and it must include performance management and risk controls to avoid any detrimental impact on operations.

Ways of working

MLOps
MLOps refers to DevOps as applied to machine learning and artificial intelligence. Short for “software development” and “IT operations,” DevOps is the application of software engineering practices to IT operations, such as packaging and deploying production software.

MLOps aims to shorten the analytics development life cycle and increase model stability by automating repeatable steps in the workflows of software practitioners (including data engineers and data scientists). While MLOps practices vary significantly, they typically involve automating integration (the frequent checking in and testing of code) and deployment (packaging code and using it in a production setting).

Roles
The work in the lab and factory is carried out by cross-functional teams made up of data and software professionals (eg, data scientists, machine learning engineers, cloud architects) as well as business professionals with varying levels of data science expertise (eg, subject-matter experts, translators).
Lab
Stage 1: Assemble data sets
Process

Data ingestion
Definition: The process of assembling, cleaning, and combining data from various source systems to create data sets for analytics. Data sources can be both internal and external, and data are often combined into a single cleaned data set for analytics model development.
Why it’s important: Ensuring that a wide range of data sets can be ingested quickly, easily, and frequently is important for model development. The quality of analytics models is dependent on both the quantity of data available and the variety of relevant data sets. Most labs start out with manual data extracts and rapidly automate the ingestion process so that it can occur regularly (eg, daily) in order to ensure data available for analytics model development are fresh.
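To make the ingestion step concrete, the sketch below shows one hedged way two source extracts might be cleaned and combined into a single analytics data set in Python with pandas. The file names and the customer_id join key are illustrative assumptions, not part of this guide.

```python
# Minimal data-ingestion sketch (hypothetical file and column names).
import pandas as pd

# Extract: load raw exports from two source systems
crm = pd.read_csv("crm_extract.csv")        # eg, customer master data
orders = pd.read_csv("orders_extract.csv")  # eg, transactional data

# Clean: drop duplicates and standardize the join key
crm = crm.drop_duplicates(subset="customer_id")
crm["customer_id"] = crm["customer_id"].astype(str).str.strip()
orders["customer_id"] = orders["customer_id"].astype(str).str.strip()

# Combine: build one cleaned data set for model development
analytics_dataset = orders.merge(crm, on="customer_id", how="left")
analytics_dataset.to_csv("analytics_dataset.csv", index=False)
```

In practice a lab would automate a script like this so the data set is refreshed on a regular schedule (eg, daily) rather than rebuilt by hand.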
Technology

Source systems
Definition: Systems containing data (eg, ERP, CRM, transaction systems) that need to be extracted, cleaned, and consolidated for analytics purposes.
Why it’s important: A deep understanding of the source-data location, storage formats, and data structures is needed to develop data-extraction processes (eg, pipelines) to source and consolidate data for analytics.

Data APIs
Definition: An application programming interface (API) is a software intermediary that allows two applications to talk to each other (REST API, gRPC, and GraphQL are examples of APIs used to access data).
Why it’s important: APIs have become the de facto standard for integrating software components. In this case, they enable access to data in organizational and third-party systems.

Batch
Definition: Batches are extractions of larger amounts of data on a less frequent basis (eg, monthly batches of account openings). Micro batches are extractions of small amounts of data, often frequently (eg, daily batches of hourly transactions).
Why it’s important: Batches allow for efficient processing of large amounts of data. The choice between batch and micro-batch data replication depends on the type of data (eg, master vs transactional) and how frequently the data are updated (eg, daily sales).

Data pipeline
Definition: Manages the sequence of ELT steps (extract-load-transform, the process by which data are moved from source systems to applications) that automate the movement, transformation, and serving of data. It can be used to move data from source systems to the data lake, within the data lake, or from the data lake to other systems.
Why it’s important: Data pipelines manage the sequencing (or “orchestration”) of the different data collection/transformation steps. Automating data pipelines ensures data can be moved in a timely and reliable manner.

Data lake
Definition: Single centralized storage of all enterprise data in raw format. Data are stored as is, without transformation, so they can be transformed after being loaded (ie, moved using ELT).
Why it’s important: Data lakes enable enterprises to pool data from different sources and allow analytics teams to generate data sets for analytics by combining raw data in different formats from a variety of sources. Because data lakes have effectively infinite storage capacity, enterprises can store all the data they generate, even if the data are not immediately applicable to analytics use cases.
Data catalog
Definition: Tool to store metadata (data that explains characteristics of a data set) that enables easy searching through a directory to find what data elements exist across the enterprise, how they are linked, and what the characteristics of the data are (eg, date formats).
Why it’s important: Data catalogs provide a valuable source of information about the data for data scientists, data engineers, and business analysts.
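As a hedged illustration of one data-pipeline step described above, the sketch below pulls a daily micro batch from a source system’s data API and lands it, untransformed, in a data-lake folder (the extract and load of ELT). The URL, payload, and folder layout are hypothetical assumptions for illustration only.

```python
# Minimal extract-and-load sketch for one pipeline step (hypothetical API and paths).
import json
from datetime import date
from pathlib import Path

import requests

API_URL = "https://example.internal/api/v1/transactions"  # hypothetical source API
DATA_LAKE_RAW = Path("data_lake/raw/transactions")        # stand-in for raw storage

def extract_and_land(run_date: date) -> Path:
    """Pull one daily micro batch and store it as-is (the 'E' and 'L' of ELT)."""
    response = requests.get(API_URL, params={"date": run_date.isoformat()}, timeout=30)
    response.raise_for_status()

    DATA_LAKE_RAW.mkdir(parents=True, exist_ok=True)
    target = DATA_LAKE_RAW / f"{run_date.isoformat()}.json"
    target.write_text(json.dumps(response.json()))
    return target

if __name__ == "__main__":
    print("Landed raw extract at:", extract_and_land(date.today()))
```

In a production setting such steps would typically be sequenced and scheduled by an orchestration tool rather than run as standalone scripts.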
Stage 2: Develop models
Process

Build
Definition: Stage of analytics development during which data scientists choose a few algorithms and techniques that might be best suited for the task at hand (eg, regression, classification, reinforcement learning) to test so they can build models based on the type of problem (eg, prediction or optimization) and the type of data available. The initial user interface is also created at this stage.
Why it’s important: Modeling approaches must match the problem being solved as well as the required level of transparency of model outputs (eg, simpler techniques make it easier to understand why the model makes certain predictions).

Train
Definition: Providing an algorithmic model with data so it can learn from the data to perform its task (eg, prediction). Typically, data scientists train models using software frameworks and libraries (eg, scikit-learn, PyTorch) that make it easy to leverage high-performing algorithms in a standardized way.
Why it’s important: A trained machine learning model is required to provide predictions.
Test
Definition: Evaluating performance of the trained model by comparing the predictions the model generates from new data to known values. The data are usually divided in a 60:20:20 ratio: training on 60 percent of the data, validation on 20 percent, and testing on the remaining 20 percent. For the different model options that are trained and tested, validation measures performance on unseen data.
Why it’s important: Testing helps data scientists understand how accurately the model will perform in the real world. It allows them to tune the model to improve its performance. The performance on unseen data is a measure of “generalizability”: how well the model is likely to perform once released and used in a live production setting.

Deploy
Definition: Promoting the trained model into a production environment. The model is typically “packaged” into a unit of software that can be deployed and easily used to generate predictions in a repeatable manner (AWS SageMaker, Azure Machine Learning services, and Kedro-Server are examples of deployment tools).
Why it’s important: Stable, repeatable deployment mechanisms enable businesses to update models with lower lead times after training models on new data sets.
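The sketch below is a minimal, hedged illustration of the build, train, test, and deploy loop using scikit-learn, which this guide cites as a common training library. The synthetic data, the 60:20:20 split, and the choice of logistic regression are illustrative assumptions, not a recommended setup.

```python
# Hedged sketch of build/train/test/deploy with scikit-learn (synthetic data).
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an assembled analytics data set
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 60:20:20 split: train, validation, test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Build and train one candidate model
model = LogisticRegression().fit(X_train, y_train)

# Validate (compare and tune candidates) and test (final check on unseen data)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Package" the trained model so it can be promoted to production
joblib.dump(model, "model.joblib")
```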
Technology

Processing unit
Definition: Performs computations, processes data, and trains machine learning models. Processing units come in three major types: central processing units (CPUs), graphics processing units (GPUs), and custom hardware.
— CPU: The most common and commoditized general-purpose processor
— GPU: Initially built for graphics display cards; used extensively over the past decade by machine learning algorithms for its ability to parallelize and reduce computation time
— Custom hardware: Processing chips specifically useful for neural networks (eg, tensor processing unit, or TPU)
Why it’s important: The type of processing unit can impact model training and performance. GPUs are more expensive but provide better performance (faster to train and predict) than CPUs. Custom hardware (eg, TPUs) can offer superior performance to CPUs and GPUs but is typically relevant only for research or companies at the leading edge of AI development.

Cluster
Definition: Collection of computing processing resources used to train analytics models, which can be autoscaled (meaning the number of processing units can be increased) to fit computing demands.
Why it’s important: Clusters are deployed to leverage multiple machines simultaneously and have them work in parallel, reducing the total time to complete a calculation. Data scientists (or machine learning engineers) will usually have to make a trade-off between the modeling technique they use and its ability to be distributed or parallelized. Clusters can be used only when the workload can be distributed (ie, parallelizable).
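As a small, hedged illustration of how the choice of processing unit surfaces in everyday code, the sketch below uses PyTorch (cited above as a common framework) to prefer a GPU when one is available and fall back to a CPU otherwise. The toy model and batch are placeholders.

```python
# Hedged sketch of processing-unit selection with PyTorch (toy model and data).
import torch

# Prefer a GPU when one is available; fall back to CPU otherwise
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 1).to(device)   # toy model moved onto the device
batch = torch.randn(64, 10, device=device)  # toy batch created on the device

predictions = model(batch)
print(f"Ran a forward pass on: {device}")
```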
Factory
Stage 3: Integrate analytics output
Process

Monitoring
Definition: The process of checking a model’s predictions—as well as the data flowing into it—while it is in use to ensure it is performing as expected.
Why it’s important: Model performance can “drift” over time, whereby the predictions become less accurate due to some real-world complexity not captured in the model. Data scientists also monitor for unexpected bias or fairness issues that the model could exhibit. They can tweak the model architecture or provide new training data or new data sources to correct these issues.
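A minimal, hedged sketch of this kind of monitoring follows: it compares live accuracy against a baseline established during the test stage and flags possible drift. The baseline value, tolerance, and example predictions are illustrative assumptions only.

```python
# Hedged model-monitoring sketch: flag drift when live accuracy falls below a baseline.
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.90   # illustrative accuracy measured on the held-out test set
DRIFT_TOLERANCE = 0.05     # illustrative degradation the team is willing to accept

def check_for_drift(y_true, y_pred) -> bool:
    """Return True if live performance has drifted below the tolerated level."""
    live_accuracy = accuracy_score(y_true, y_pred)
    drifted = live_accuracy < BASELINE_ACCURACY - DRIFT_TOLERANCE
    if drifted:
        print(f"ALERT: live accuracy {live_accuracy:.2f} is below baseline")
    return drifted

# Example: recent predictions scored against outcomes observed later
check_for_drift([1, 0, 1, 1, 0, 1], [1, 0, 0, 0, 0, 1])
```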
Technology

Container
Definition: Standard unit of software that packages code and all its dependencies. Containers (eg, Docker, rkt) are used to package and deploy production applications (eg, a completed model that is ready to perform daily predictions as part of a business process).
Why it’s important: Containers enable a “build once, run anywhere” approach. Similar to shipping containers that can be used regardless of shipping mode (ship, train, truck), containers allow code to be packaged in a standardized format to run on different kinds of underlying operating systems and hardware (eg, on cloud or on-premises).

Kubernetes
Definition: Open-source system that automates deployment, scaling, and management of containers.
Why it’s important: As the number of models in a company increases, managing a high number of containers can be challenging. Kubernetes is designed to simplify this challenge. It is often seen as a “vendor-neutral” deployment platform for containers.

Microservice
Definition: A software architecture style in which code is developed and maintained as independently deployable services, which are often run in containers. Rather than building large software “monoliths” that are tightly integrated, microservices are small and independent software components that integrate with other components using APIs.
Why it’s important: Microservices allow reusability: code can be invoked by other applications or microservices. Microservices are loosely coupled, allowing teams to remain relatively independent and avoid complex dependencies between software. Microservices also enable developers to write code in the language that is most convenient to them. As long as the external interface (API) remains consistent, changes can be made to the underlying software as needed.
API
Definition: An application programming interface (API) is a software intermediary that allows two applications to talk to each other (REST API, gRPC, and GraphQL are examples of APIs used to access data).
Why it’s important: An API can be accessed by many applications, enabling reusability of code. In this case, for example, many analytics applications can access prediction results from a single model through an API. An API also allows technical teams to make “under the hood” changes to analytics without impacting users (eg, the model can change as long as results are provided in the same format).

Consumption layer
Definition: Consists of interfaces that are used to expose the results of models to analytics consumers (either users or other applications), API services that serve or push information to other systems, and process interfaces that trigger business processes.
Why it’s important: Interfaces are important to enable consumers to act upon the data and predictions generated by the underlying model.
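The sketch below ties these ideas together: a small prediction microservice that exposes a packaged model through a REST API. It uses Flask as one possible web framework; the endpoint path, payload shape, and model file name are assumptions for illustration, not a prescribed design.

```python
# Hedged sketch of a prediction microservice exposing a model via a REST API.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # the packaged model from the deploy step

@app.route("/predict", methods=["POST"])
def predict():
    # Callers send feature values as JSON; the API contract stays stable even
    # if the model behind it is retrained or replaced.
    features = request.get_json()["features"]
    prediction = model.predict([features]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In production, a service like this would typically be packaged in a container and run on a platform such as Kubernetes, so the same artifact can be deployed across environments.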
Ways of working
MLOps
Process

Hardening
Definition: A collection of techniques and configurations used to make the software more robust, reducing vulnerabilities to security issues and errors that could happen in production and ensuring that the analytics solution’s results cannot be manipulated.
Why it’s important: As models are released into production and put to use, the inputs cannot be predicted and, in some cases, may come from users or programs with malicious intent. Hardening reduces the potential negative impact by reducing the “surface area” that could be subject to attack. For example, hardening may ensure that all requests and responses to and from an API are encrypted even within an internal network.

Continuous integration
Definition: The process supporting code collaboration. When developers share their code back with the team, testing steps take place to ensure the “build” is not broken. Continuous integration automates the process by which team members share back their working copies of code, enabling it to occur frequently (multiple people sharing multiple times per day). Mature teams leverage test automation and an “integration pipeline” to enable continuous integration.
Why it’s important: Automating integration improves developer productivity and helps teams rapidly identify issues.

Continuous delivery
Definition: The process in which every software “build” that passes the automated tests is deployable and can be released to production at any time. Environments are usually staged, going first through a development environment, then a user-acceptance environment, and then the production environment. Mature teams have “deployment pipelines” that automate delivery.
Why it’s important: Continuous delivery aims to build, test, and release software with greater speed and frequency. This reduces deployment risk, since the changes between each version are small.
Issue tracking
Definition: Software (eg, Jira) used to plan work for software developers, particularly for teams working in an agile manner.
Why it’s important: Planning software-development tasks can be complex and requires collaboration. Work is broken down from user stories to features to tasks/issues that are implemented by software developers.

Logging
Definition: The creation of a detailed record of events that occurred within an application during its operation, which are typically stored in log files.
Why it’s important: Logging allows teams to troubleshoot issues by reviewing changes that happened to applications and systems via their respective logs.

Monitoring
Definition: The observation and management of software components to ensure they are available and performing normally (as opposed to the monitoring of the model’s predictions that takes place while the model is in use).
Why it’s important: For jobs that need to run at particular times or services/applications that need to be available, application monitoring is crucial to ensure that software is performing as intended. This is particularly important in production, when a model is no longer an experiment and is adopted by business users as part of normal business operations.
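To illustrate the test automation behind continuous integration, the sketch below shows one kind of automated check a CI pipeline might run on every code change: a pytest-style test that fails the build if the packaged model drops below a minimum accuracy. The file names and the 0.80 threshold are illustrative assumptions.

```python
# Hedged sketch of an automated CI check for a packaged model (pytest-style).
import joblib
import numpy as np
from sklearn.metrics import accuracy_score

def test_model_meets_minimum_accuracy():
    model = joblib.load("model.joblib")      # packaged model under test
    X_test = np.load("X_test.npy")           # held-out data versioned with the code
    y_test = np.load("y_test.npy")
    accuracy = accuracy_score(y_test, model.predict(X_test))
    assert accuracy >= 0.80, f"Model accuracy {accuracy:.2f} fell below 0.80"
```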
Lab roles
Role Responsibilities
Product owner Typically, a business owner who takes on part-time or full-time responsibility as the
product owner
— Provides the “voice of the customer” to define success criteria for the squad
— Serves as the first point of contact with any external stakeholders
— Reviews and prioritizes the short- and mid-term objectives
— Brings users into the development process (eg, for feedback) as needed
Data scientist — Frames the business problem and identifies analytics techniques to address it
— Collaborates closely with the engineering team to prioritize data transformations for
training data and features for prediction
— Programs advanced analytics algorithms
— Develops visuals to illustrate model mechanisms and performance
Data engineer — Scopes data available and identifies major source systems to consolidate data for
analytics
— Develops data pipelines that simplify and automate data movement
— Sets up data architecture (data storage, data layers)
— Collaborates with data scientists to transform data (eg, create new data features for
prediction) based on model requirements
Translator — Acts as the interface between business and technical stakeholders
— Helps the product owner to make trade-offs between business requirements and
technical complexity
— Prepares materials to support integration activities, such as process maps and
change-management stories
UX/visualization design — Focuses on the interaction between end users and the analytics solution output
— Develops dashboard concepts and the user-interface controls that will ensure users can consume the analytics output
Delivery manager — Responsible for all aspects of delivery of the analytics solution (eg, secures licenses, handles solution access requests) to meet the lab team’s goal
— Can play the role of scrum master (facilitator of the agile-development process)
— In complex settings, the delivery-manager and scrum-master role can be split into two
Factory roles
Role Responsibilities
Product owner — Serves as the first point of contact with external stakeholders
— Defines solution success criteria
— Reviews and prioritizes the short- and mid-term objectives
— Participates in major team activities such as reviews and retrospectives
— Manages budget and resources
Machine learning engineer — Optimizes machine learning models for performance and scalability and deploys them into production to ensure repeatable and dependable operation
— Automates the machine learning pipeline, from data ingestion to prediction generation
MLOps engineer — Automates underlying technology infrastructure
— Develops continuous-integration (CI) and continuous-deployment (CD) pipelines to automate parts of the software-deployment pipeline
Cloud or infrastructure architects — Designs new infrastructure components for analytics in accordance with enterprise guidelines (eg, data lake, streaming platforms)
— Makes necessary technology-platform decisions and advises the teams on what platforms to leverage
— Provides advice on integration patterns with various enterprise-data sources
Change management lead — Drives solution adoption with broad stakeholder group
— Develops communication strategy and adoption plans
— Leads trainings and management processes
— Typically, one of the lead users with both subject-matter and technical expertise
Others Possible additional factory roles include the following:
— Software engineers
— Quality assurance and testing automation experts
— Subject-matter experts
Nayur Khan is a senior expert in McKinsey’s London office, Brian McCarthy is a partner in the Atlanta office,
and Adi Pradhan is a consultant in the Montreal office.
The authors wish to thank Mayur Chougule, Joe Christman, and Maxime Delvaux for their contributions to this
content.
Copyright © 2020 McKinsey & Company. All rights reserved.