Topic Cheatsheet for GCP’s Professional Machine Learning Engineer Beta Exam
Authors/contributors: David Chen, PhD
Credits & disclaimers can be found in the README section of the source repository. Some references are included as %comments or hyperlinks in the source file.
Abbreviations
Common abbreviations: ML, machine learning; DL, deep learning; AI, artificial intelligence; CV, computer vision; GC(P), Google Cloud (Platform); CI/CD, continuous integration / continuous delivery; SDK, software development kit; API, application programming interface; K8s, Kubernetes; GKE, Google Kubernetes Engine; MLE, maximum likelihood estimation; ROC, receiver operating characteristic curve; AU(RO)C, area under the (ROC) curve.

I. Preparation for ML

Understanding the "Data Science Steps for ML"
1. Data extraction
2. Exploratory data analysis
3. Data preparation for the ML task
4. Model training
5. Model evaluation
6. Model validation
7. Model serving
   • Microservices with REST API
   • Deployment on mobile devices
   • Batch predictions
8. Model monitoring

Defining an ML Problem

ML as a Solution to Business Problems
• (Re)define your business problems
• Consider whether the problem could be solved without ML
• Define/anticipate the utility of the ML output
• Identify data sources
• Pre-define "success" for solving the business challenge
  – Metric(s) used to define success
  – Key results (product or deliverables)
  – Incorrect or low-quality output (i.e. "unsuccessful" models)

Components of an ML Solution
• Define the predictive outcome
• Identify the problem type: supervised (classification or regression), unsupervised, reinforcement
• Identify the input feature format
• Feasibility and implementation

Data Preparation

Data Ingestion
• Obtaining & importing data for use or storage
• File input types
• Database maintenance, migration
• Streaming data (from IoT devices, databases, or end users)

Exploratory Data Analysis (EDA)
• Evaluation of data quality (domain- and organization-specific knowledge/information may be needed)
• Data visualization (descriptive statistics)
• Inferential statistics (e.g. t-tests to compare means, KS tests to compare distributions), as needed and at the scale needed

Feature Engineering
• Necessary (e.g. time series) or beneficial in many ML tasks
• Encoding structured data types
• Feature crosses: used to define a synthetic feature (e.g. the cross product x1 × x2) when the data cannot be linearly separated
• Feature selection, e.g. (see the sketch after this list)
  – Univariate statistical methods (e.g. χ² test, t-test/linear model)
  – Recursive Feature Elimination (RFE)
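A minimal sketch of the two feature-selection approaches above, using scikit-learn (the library and the breast-cancer demo dataset are illustrative choices, not part of the cheatsheet):

# Univariate selection (chi-squared) and Recursive Feature Elimination (RFE).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)      # 30 non-negative features

# Univariate: keep the 10 features with the highest chi-squared score vs. the label.
X_uni = SelectKBest(score_func=chi2, k=10).fit_transform(X, y)

# RFE: repeatedly fit a model and drop the weakest features until 10 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

print(X_uni.shape, X_rfe.shape)                 # (569, 10) (569, 10)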
Special considerations
• Imbalanced class distributions
  – Need to be known, at a minimum
  – Affect the metrics to employ (e.g. the F1 score or AUC is superior to crude accuracy in imbalanced binary classification; see the sketch after this list)
  – Can affect optimization choices: modify the objective function; oversample the minority class(es)
• Data leakage
  – Certain features available in your training data might not be available in the unknowns to predict!
  – When training, be careful not to include raw or engineered features that are computed from the classification/regression label
Data Pipelines
• Should be designed & built in advance for at-scale applications
• Batching vs. streaming
  – Batching: use of data stored in data lakes, processed at periodic intervals
  – Streaming (data streams): use of data from live streams; a unique challenge due to the 3 Vs: Volume, Velocity (real-time), Variety (esp. unstructured data) (useful tool: Cloud Dataflow; see the sketch after this list)
• Monitoring
  – "Four Golden Signals" of your cloud-based service: latency, traffic, errors, saturation
  – Dashboards (Stackdriver Cloud Monitoring Dashboards API) can be a powerful tool for displaying multiple metrics
• Privacy, compliance, and legal issues: know what the restrictions are and plan ahead (e.g. privacy-preserving ML/AI, corrupted input, ...) (useful tool: Cloud IAM)
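A minimal Apache Beam sketch of a batch pipeline of the kind Cloud Dataflow runs; the bucket paths are placeholders, and submitting to Dataflow would additionally require runner/project/region options:

# The same Beam code serves batch or streaming sources; only the I/O and
# pipeline options change (e.g. runner="DataflowRunner" to run on Cloud Dataflow).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read"   >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")      # placeholder path
     | "Parse"  >> beam.Map(lambda line: line.split(","))
     | "Filter" >> beam.Filter(lambda row: row[0] != "")
     | "Write"  >> beam.io.WriteToText("gs://my-bucket/clean/part"))     # placeholder path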
II. ML Model Development

Model Development At-a-Glance

Generic ML Workflow
1. Training
   • Choose a model framework
     – Supervised
     – Unsupervised
   • Consider transfer learning (if applicable)
   • Monitoring / tracking metrics
   • Strategies to handle overfitting (e.g. regularization, ensemble learning, dropout) & underfitting (increase model complexity)
   • Interpretability
2. Validation
   • Check overfitting & underfitting
   • Compare the trained model against a pre-defined baseline (e.g. a simple model or benchmark)
   • Unit tests
3. Scale-up & Serving
   • Unit tests
   • Cloud AI model explainability
   • Distributed training
   • Scalable model analysis

ML Models
Gradient descent is used to optimize the objective function of a machine-learning model:

Gradient descent   n           Resolution
Full-batch         all (N)     complete
Mini-batch         1 < n < N   intermediate
Stochastic         1           noisy approximation

An epoch is one full pass through the entire training dataset; the number of epochs is a hyperparameter to be defined/tuned by the user.
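A minimal NumPy sketch of the table above for a linear model with squared loss (the data are synthetic and the learning rate is arbitrary): batch_size = N gives full-batch, batch_size = 1 gives stochastic, and anything in between gives mini-batch gradient descent.

import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=N)

def run_gd(batch_size, epochs=50, lr=0.05):     # epochs: the hyperparameter noted above
    w = np.zeros(d)
    for _ in range(epochs):                     # one epoch = one full pass over the N rows
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

print("full-batch :", run_gd(batch_size=N))     # all estimates approach [2.0, -1.0, 0.5]
print("mini-batch :", run_gd(batch_size=32))
print("stochastic :", run_gd(batch_size=1))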
Supervised Learning (with related concepts)
• Naive Bayes (flavors: Gaussian, Bernoulli, Multinomial)
• Decision trees (concept of entropy)
• Support Vector Machine (SVM)
  – Linearly vs. non-linearly separable
  – Kernels (see the sketch after this list)
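A minimal scikit-learn sketch of the kernel point above (the library and the half-moons toy dataset are illustrative choices): on data that are not linearly separable, an RBF-kernel SVM outperforms a linear-kernel SVM.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, "test accuracy:", round(clf.score(X_te, y_te), 3))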
Unsupervised Learning
• Clustering
  – K-means
  – Hierarchical clustering
  – DBSCAN
• Dimensionality reduction
  – Principal Component Analysis (PCA)
  – t-SNE
• Gaussian Mixture Model (GMM), optimized by Expectation-Maximization (EM; see the sketch after this list):
  1. E step
  2. M step
  Repeat until convergence
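A minimal scikit-learn sketch (library and iris demo data assumed here, not prescribed by the cheatsheet): GaussianMixture runs the E and M steps internally until the log-likelihood converges or max_iter is reached.

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)
gmm = GaussianMixture(n_components=3, max_iter=200, random_state=0).fit(X)

print("converged:", gmm.converged_, "after", gmm.n_iter_, "EM iterations")
print("component means:\n", gmm.means_.round(2))   # one mean vector per mixture component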
Overfitting

Bias-variance trade-off
• Characteristics of loss vs. iteration curves, plotted separately for
  – Training set
  – Validation and/or test set
• Underfitting vs. overfitting patterns

Ways to address overfitting
1. Get more high-quality, well-labeled training data
2. Regularization
   • L2 penalty
   • L1 (LASSO) penalty
   • Elastic net
3. Ensemble learning (see the sketch after this list)
   • Bagging
     – Random Forest: only a randomly chosen subset of 1 ≤ m < M features is considered at each split
     – Bagged trees: all M features are available at each split
   • Boosting (e.g. Gradient Boosted Trees/XGBoost)
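A minimal scikit-learn sketch of the bagging distinction above (one way to see it, using the library's max_features parameter; the dataset is an illustrative choice): max_features=None lets every split see all M features (bagged trees), while max_features="sqrt" restricts each split to a random subset (random forest).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean().round(3))
print("random forest:", cross_val_score(forest, X, y, cv=5).mean().round(3))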
Recommendation Systems

                          User info   Domain knowledge
Content-based                 ✓
Collaborative Filtering       ✓
Knowledge-based                             ✓

A hybrid recommendation system uses more than one of the above; although not always possible, it is generally the preferred solution.
Deep Learning

Subtypes of Neural Networks
• Feed-forward neural network
• Convolutional Neural Network (CNN) & computer vision
• Recurrent Neural Network (RNN)
  – Sequence data (speech/text, time series)
  – Vanishing gradient problem
  – Gated Recurrent Units (GRU)
  – Long short-term memory (LSTM)
  – Application to Natural Language Processing (NLP)
    ∗ Language models
    ∗ Embeddings
    ∗ Architectures (e.g. transformers)
• Autoencoders (see the sketch after this list)
  – General architecture
    ∗ Encoding layers
    ∗ Lower-dimensional representation (returned, or used as input for a subsequent autoencoder in a stack)
    ∗ Decoding layers
  – Flavors to address trivial solutions:
    ∗ Undercomplete autoencoders
    ∗ De-noising autoencoders
    ∗ Sparse autoencoders
  – Applications
    ∗ Data representation (feature engineering)
    ∗ Dimensionality reduction / data compression
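A minimal tf.keras sketch of an undercomplete autoencoder (the framework, layer sizes, and random placeholder data are assumptions for illustration): encoding layers compress a 784-dimensional input to a 32-dimensional code, and decoding layers reconstruct the input.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = tf.keras.Input(shape=(784,))
code = layers.Dense(128, activation="relu")(inputs)         # encoding layers
code = layers.Dense(32, activation="relu")(code)            # lower-dimensional representation
decoded = layers.Dense(128, activation="relu")(code)        # decoding layers
decoded = layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, 784).astype("float32")              # placeholder data
autoencoder.fit(X, X, epochs=2, batch_size=64, verbose=0)   # target = input (reconstruction)

encoder = models.Model(inputs, code)                        # reuse the code as features
print(encoder.predict(X, verbose=0).shape)                  # (256, 32)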
III. Production-level ML with Cloud

MLOps: CI/CD in an ML System

                DevOps       Data Engineering   MLOps
Version ctrl.   Code         Code               Code, data, model
Pipeline        -            Data, ETL          Training, serving
Validation      Unit tests   Unit tests         Model validation
CI/CD           Production   Data pipeline      (both)

Tools for virtualization:
• Virtual Machines (VMs)
• Containers
  – Clusters
  – Pods
• Kubernetes (K8s)

GCP tools:
• BigQuery (see the query sketch after this list)
  – Google-managed data warehouse
  – Highly scalable, fast, optimized
  – Suitable for analysis & storage of structured data
  – Multi-processing enabled
• Cloud Dataprep
  – Managed cloud service for quick data exploration & transformation
  – Auto-scalable; eases the data-preparation process
• Cloud Dataflow: provides a serverless, parallel, distributed infrastructure for both batch & stream data processing, making use of Apache Beam
• Cloud ML APIs
  – Cloud Vision AI
  – Cloud Natural Language
  – Cloud Speech-to-Text
  – Cloud Video Intelligence
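A minimal sketch of querying BigQuery from Python with the google-cloud-bigquery client (authentication, a default project, and the public demo table referenced below are assumed):

from google.cloud import bigquery

client = bigquery.Client()               # uses application-default credentials/project
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(sql).result():   # runs the query job and waits for it
    print(row.name, row.total)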
ML Pipeline Design
The ML code is only a small part of a production-level ML system.
• Identify components, parameters, triggers, compute needs
• Orchestration framework
  – Cloud Composer (based on an Apache Airflow deployment; see the DAG sketch below)
  – GCP App Engine
  – Cloud Storage
  – Cloud Kubernetes Engine
  – Cloud Logging & Monitoring
• Strategies beyond a single cloud:
  – Hybrid cloud: blend of public & private cloud for mixed computing, storage, & services, allowing for agility (i.e. quick adaptation during business digital transformation)
  – Multi-cloud: multiple clouds designated for different tasks (but unlike parallel computing, synchronization across different vendors is NOT essential)

Procedures during Implementing a Training Pipeline
• Perform data validation (e.g. via Cloud Dataprep)
• Decouple components with Cloud Build (fully serverless CI/CD platform supporting any language)
  – Adds a layer of technical abstraction
  – Separates content producers & end users
  – Ensures software components are not tightly dependent on one another
• Construct & test a parametrized pipeline definition in the SDK (e.g. gcloud ml-engine)
• Tune compute performance
• Store data & generated artifacts (e.g. binaries, tarballs) via Cloud Storage

GCP storage options:

                  Type        Transactions?   Complex queries?   Capacity
Cloud Datastore   NoSQL       ✓               ✗                  Terabytes+
Bigtable          NoSQL       (limited)       ✗                  Petabytes+
Cloud Storage     Blobstore   ✗               ✗                  Petabytes+
Cloud SQL         SQL         ✓               ✓                  Terabytes
Cloud Spanner     SQL         ✓               ✓                  Petabytes
BigQuery          SQL         ✗               ✓                  Petabytes+
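A minimal Apache Airflow DAG sketch of the kind of training pipeline Cloud Composer can orchestrate (the DAG id, schedule, and placeholder echo commands are invented for illustration; a real pipeline would call the relevant GCP services for validation, training, and deployment):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",          # retraining cadence: a policy to tune
    catchup=False,
) as dag:
    validate = BashOperator(task_id="validate_data", bash_command="echo validate")
    train = BashOperator(task_id="train_model", bash_command="echo train")
    deploy = BashOperator(task_id="deploy_model", bash_command="echo deploy")

    validate >> train >> deploy           # dependencies define the pipeline order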
Considerations for Implementing the Serving Pipeline
• Model binary options
• Google Cloud serving options
• Testing for target performance
• Setup of trigger & pipeline schedule

Deployment with CI/CD (the final step in MLOps) goes along with:
• A/B testing: Google Optimize
• Canary testing, automated by GKE with Spinnaker

ML Solution Monitoring
Considerations in monitoring ML solutions:
1. Monitor the performance/quality of ML model predictions on an ongoing basis (via Cloud Monitoring (Compute Engine) with a metric model), then debug with Cloud Debugger
2. Use robust logging strategies (e.g. Cloud Logging, especially Stackdriver (aka Cloud Operations) with beautiful dashboards)
3. Establish continuous evaluation metrics

Troubleshoot ML Solutions:
• Permission issues (IAM)
• Training errors
• Serving errors
• ML system failures/biases (in production)

Tune performance of ML solutions in production
• Simplify (optimize) the input pipeline
  – Reduce data redundancy in the NLP model
  – Utilize Cloud Storage (e.g. object storage)
  – Simplification can take place at various points in the pipeline
• Identify an appropriate retraining policy
  – Under what circumstance(s)? How often? (e.g. when significant deviation or drift is identified; periodically)
  – How? (e.g. by batch vs. online learning; see the sketch below)
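A minimal scikit-learn sketch of the batch vs. online retraining choice (the library, SGDClassifier, and the random placeholder data are assumptions for illustration): either refit from scratch on all accumulated data, or update the deployed model incrementally as new data arrive.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)   # historical data
X_new, y_new = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)     # newly arrived data

# Batch retraining: fit a fresh model on everything accumulated so far.
batch_model = SGDClassifier().fit(np.vstack([X_old, X_new]),
                                  np.concatenate([y_old, y_new]))

# Online learning: keep the deployed model and update it with the new batch only.
online_model = SGDClassifier()
online_model.partial_fit(X_old, y_old, classes=np.array([0, 1]))
online_model.partial_fit(X_new, y_new)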