Big Data and Machine Learning
Fundamentals with Google Cloud Platform
An explosion of data
“By 2020, some 50 billion smart
devices will be connected, along with
additional billions of smart sensors,
ensuring that the global supply of data
will continue to more than double
every two years”
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/straight-talk-about-big-data
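Doubling every two years is exponential growth. The sketch below shows how quickly that compounds. It is only an illustration: the starting volume of 1 unit and the helper name `projected_volume` are assumptions, not figures from the cited article, and because the quote says "more than double", the results are a lower bound.

```python
def projected_volume(initial: float, years: float, doubling_period: float = 2.0) -> float:
    """Volume after `years` if it doubles once every `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)


# Relative growth after 2, 4, and 10 years, starting from 1 unit of data.
for years in (2, 4, 10):
    growth = projected_volume(1.0, years)
    print(f"after {years:>2} years: {growth:g}x the starting volume")
```

After a decade, the same rule gives 32 times the starting volume.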
An explosion of data
… and only about 1% of the data generated
today is actually analyzed
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/straight-talk-about-big-data
There is a great demand for data skills
● Data Analyst
● Analyst
● Statistician
● Data Engineer
● Applied ML Engineer
● Data Scientist
● Social Scientist
● Researcher
● Tech Lead
● Ethicist
● Decision Maker
● Analytics Manager
Big Data Challenges
● Migrating existing data workloads (ex: Hadoop, Spark jobs)
● Analyzing large datasets at scale
● Building streaming data pipelines
● Applying machine learning to your data
Agenda
Google Cloud Platform infrastructure
● Compute
● Storage
● Networking
● Security
Big data and ML products
● Google innovation timeline
● Choosing the right approach
Lab: Exploring Public Datasets in BigQuery
Activity: Explore a customer use case
Google’s mission
Organize the world’s information and make it universally accessible and useful.
Seven cloud products with one billion users
Big Data and ML Products
Compute Power, Storage, Networking, Security
Compute Power
Machine Learning Models require significant compute resources
Shown: Automatic Video Stabilization for Google Photos
Data sources:
1. Image frames (stills from video)
2. Phone gyroscope
3. Lens motion
A single high-res image represents millions of data points to learn
8 megapixel resolution: 3264 px wide x 2448 px high
3 “layers” in depth for red, green, and blue (RGB)
3264 (w) x 2448 (h) x 3 (RGB) = 23,970,816 data points per image*
* More data = longer model training times + more storage needed
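The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: the constant names are illustrative, and the values are the ones given on the slide.

```python
WIDTH_PX = 3264
HEIGHT_PX = 2448
CHANNELS = 3  # one value each for red, green, and blue

pixels = WIDTH_PX * HEIGHT_PX
data_points = pixels * CHANNELS

print(f"{pixels:,} pixels (about {pixels / 1e6:.0f} megapixels)")
print(f"{data_points:,} data points per image")
```

The pixel count comes to roughly 8 million, which is where the "8 megapixel" label comes from, and the three color channels triple it.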
Google trains on its infrastructure and
deploys ML to phone hardware
Build on Google infrastructure
“This is what makes Google Google: its physical network … and those many thousands of servers that, in aggregate, add up to the mother of all clouds.”
- Wired
Simply scaling the raw number of servers in Google’s
data centers isn’t enough
“If everyone spoke to their
phone for 3 minutes, we’d
exhaust all available
computing resources”
— Jeff Dean, 2014
Will Moore’s Law save us?
https://cacm.acm.org/magazines/2018/9/230571-a-domain-specific-architecture-for-deep-neural-networks/fulltext
Tensor Processing Units (TPUs) are specialized ML hardware
Cloud TPU v2: 180 teraflops, 64 GB High Bandwidth Memory (HBM)
Cloud TPU v3: 420 teraflops, 128 GB HBM
TPUs enable faster models and more iterations
“Cloud TPU Pods have transformed our approach to visual shopping by delivering a 10X speedup over our previous infrastructure. We used to spend months training a single image recognition model, whereas now we can train much more accurate models in a few days on Cloud TPU Pods.”
- Larry Colagiovanni, VP of New Product Development
Creating a customizable virtual machine on Google Cloud
Create with the Google Cloud Compute Engine Web UI
Or with the command line interface
# Create an instance with 4 vCPUs and 5 GB of memory
gcloud compute instances create my-vm --custom-cpu 4 --custom-memory 5
● Customize for speed and workload type (e.g. include GPUs and TPUs)
● Runs on Google’s private fiber network
Optional Demo: Creating a VM on Google Cloud Platform
Processing and visualizing earthquake data on GCP
Storage
1.2 billion photos and videos are uploaded to Google Photos every day.
Total size of over 13 PB of photo data.
1 PB or 400 hours of video uploaded every minute
Leverage Google’s 99.999999999% durability storage: Cloud Storage
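To get a feel for what eleven nines means, here is a back-of-the-envelope sketch. It assumes the durability figure is an annual, per-object probability and that losses are independent; the count of 10 million stored objects and the helper name `expected_annual_losses` are illustrative, not from the slide.

```python
from fractions import Fraction

# Eleven nines, taken from the slide; treated here as an annual per-object figure (assumption).
ANNUAL_DURABILITY = Fraction("0.99999999999")


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumptions above."""
    return object_count * (1 - ANNUAL_DURABILITY)


losses = expected_annual_losses(10_000_000)  # hypothetical bucket with 10 million objects
print(f"expected objects lost per year: {float(losses)}")
print(f"about one object every {float(1 / losses):,.0f} years")
```

Exact fractions are used instead of floats because subtracting a number this close to 1 in floating point introduces rounding error.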
Creating a Cloud Storage bucket for your data is easy
UI: Cloud Storage
CLI: gsutil mb -p [PROJECT_NAME] -c [STORAGE_CLASS] -l [BUCKET_LOCATION] gs://[BUCKET_NAME]/
Typical big data analytics workloads run in Regional Storage
Cloud Storage buckets are one of many resources of the Google Cloud Platform
You can collaborate with many other teams in your organization across many projects
Resource hierarchy, from top to bottom:
● Organization
● Folders: Team A, Team B; Product 1, Product 2
● Projects: Dev, Test project, Production
● Resources: BigQuery dataset, Cloud Storage bucket, Compute Engine instance, BigQuery dataset
Got data? Quickly migrate your data to the cloud using the gsutil tool
Diagram: a Google Cloud Platform Project contains a Cloud Storage Bucket; the Bucket holds Objects (data and metadata); local files are copied into the Bucket
gsutil cp sales*.csv gs://acme-sales/data/
gsutil = google storage utility, cp = copy
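In the command above, the wildcard `sales*.csv` is expanded to the matching local files, and each one becomes an object under `gs://acme-sales/data/`. The sketch below mimics that name mapping in plain Python without contacting Cloud Storage. The helper `plan_copy` and the sample file names are hypothetical, made up for illustration.

```python
from fnmatch import fnmatchcase


def plan_copy(local_files, pattern, destination):
    """Return (source, destination URL) pairs for the files that match `pattern`."""
    prefix = destination if destination.endswith("/") else destination + "/"
    return [
        (name, prefix + name)
        for name in sorted(local_files)
        if fnmatchcase(name, pattern)
    ]


# Hypothetical directory listing; only the sales CSV files should be selected.
files = ["sales_2018.csv", "sales_2019.csv", "inventory.csv", "sales_notes.txt"]
for src, dst in plan_copy(files, "sales*.csv", "gs://acme-sales/data/"):
    print(f"{src} -> {dst}")
```

Only the two sales CSV files are selected; `inventory.csv` and `sales_notes.txt` do not match the pattern.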
Networking
Google’s private network carries as much as 40% of the
world’s internet traffic every day
Google’s data center network speed enables
the separation of compute and storage
1 Petabit/sec of total bisection bandwidth
● Servers doing compute tasks don’t need to have the data on their disks
● Data can be “shuffled” between compute workers at over 10 GB/s
Google’s cable network spans the globe
FASTER (US, JP, TW) 2016
Havfrue (US, IE, DK) 2019
SJC (JP, HK, SG) 2013
HK-G (HK, GU) 2019
Unity (US, JP) 2010
Curie (CL, US) 2019
PLCN (HK, LA) 2019
Monet (US, BR) 2017
Junior (Rio, Santos) 2017
Tannat (BR, UY, AR) 2017
Indigo (SG, ID, AU) 2019
Google Network
Edge points of presence: >100
Edge node locations: >7500
Security
On-premises → you manage all security layers
Responsibilities, all managed by you on-premises:
● Content
● Access policies
● Usage
● Deployment
● Web app security
● Identity
● Operations
● Access and authentication
● Network security
● OS, data, and content
● Audit logging
● Network
● Storage and encryption
● Hardware
Google Cloud Platform offers fully-managed services
Deployment options compared: On-premises, IaaS, PaaS, Managed services
Legend: You Manage / Google Managed
Moving from on-premises through IaaS and PaaS to managed services, Google manages more of these layers:
● Content
● Access policies
● Usage
● Deployment
● Web app security
● Identity
● Operations
● Access and authentication
● Network security
● OS, data, and content
● Audit logging
● Network
● Storage and encryption
● Hardware
Google invented new data processing methods as it grew
GFS, MapReduce, Bigtable, Dremel, Colossus, Flume, Megastore, Spanner, Millwheel, Pub/Sub, F1, TensorFlow, TPU
http://research.google.com/pubs/papers.html
Google Cloud opens up that innovation and infrastructure to you
Cloud Storage, Dataproc, Bigtable, BigQuery, Dataflow, Cloud Datastore, Cloud Spanner, Pub/Sub, ML Engine, AutoML
The suite of big data products on Google Cloud Platform
Stages: Storage, Ingestion, Analytics, Machine Learning, Serving
Products shown: Cloud Storage, Cloud SQL, Cloud Spanner, Cloud Datastore, Cloud Bigtable, Compute Engine, Kubernetes Engine, Cloud Dataflow, Cloud Composer, Cloud Pub/Sub, Cloud Functions, BigQuery, Cloud Dataproc, Cloud Datalab, Cloud TPU, Cloud ML, TensorFlow, Cloud AutoML, ML APIs, Data Studio, Dashboards/BI, Dialogflow, App Engine, ...
BigQuery has over 130 Public Datasets to explore
Demo: Query 2 billion lines of code in less than 30 seconds (GitHub on BigQuery)
Lab: Exploring Public Datasets using the BigQuery Web UI
● Open BigQuery
● Query a Public Dataset
● Create a custom table
● Load data into a new table
● Querying basics
Open Qwiklabs
1. Open an incognito window (or private/anonymous window)
2. Go to events.qwiklabs.com
3. Sign In with existing account or Join with new account (with email you used to register for the bootcamp)
4. Launch the course from My Learning
Don’t remember what email you used to register with or don’t have access to it?
If so, use the link or QR code to add your email address for access to the course material.
https://goo.gl/xrVBpM
View your lab
Note: You can access the course PDFs under Lecture Notes
Launch the lab and start on Lab 1
● Labs will last for 40 minutes or so (and materials can be accessed for 2 years)
● Tip: Track your progress with Score X/15
● Pro tip: Use the table of contents on the right to quickly navigate
● Do not click End Lab until you are done with that lab (note: each lab is independent)
● Do ask questions (we have a talented team of experts to help!)
What you can do with GCP
Activity: Explore a customer use case
GO-JEK brings goods and services to over
2 million families in 50 cities in Indonesia
GO-JEK’s footprint nationwide
GO-JEK manages 5 TB+ per day for analysis
GO-JEK soon faced data scale and latency challenges
“Most of the reports are Day +1, so
we couldn’t identify the problems
as soon as possible.”
GO-JEK migrated their data pipelines to GCP
● High performance scalability with minimal operational maintenance
● More granular data with high velocity and less latency (stream processing)
● The ability to solve business problems with real time data insights
GO-JEK architecture review
GO-JEK ride-share supply/demand use case
Business question
Which locations have mismatched supply and demand in real time?
Challenge
We ping every one of our drivers every 10 seconds, which means 6
million pings per minute and 8 billion pings per day. How do we stream
and report on such a volume of data?
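The volumes quoted in the challenge follow from the ping interval. A quick sketch of the arithmetic, using only the figures stated above (the implied driver count is derived here and is not given on the slide):

```python
PING_INTERVAL_S = 10          # each driver is pinged every 10 seconds
PINGS_PER_MINUTE = 6_000_000  # figure quoted in the challenge

pings_per_driver_per_minute = 60 // PING_INTERVAL_S
implied_drivers = PINGS_PER_MINUTE // pings_per_driver_per_minute
pings_per_second = PINGS_PER_MINUTE // 60
pings_per_day = PINGS_PER_MINUTE * 60 * 24

print(f"implied drivers:  {implied_drivers:,}")
print(f"pings per second: {pings_per_second:,}")
print(f"pings per day:    {pings_per_day:,}")
```

The daily total works out to 8.64 billion, which the slide rounds to 8 billion, and the streaming pipeline has to sustain about 100,000 events per second.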
Autoscale streaming pipelines with Cloud Dataflow
Dataflow autoscales based on throughput data: it automatically adjusts the number of workers based on demand
Pipeline shown (Driver location ping):
● Read Driver location fr... (Running, 29 sec)
● Parse Json to booking (16,258 elements/s, 56 sec)
● Map Drive location to... (16,258 elements/s, 3 sec)
Autoscaling chart: Current workers vs. Target workers
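The idea of throughput-based autoscaling can be sketched as a simple rule: divide the observed throughput by what one worker can handle, then clamp the result to allowed bounds. This is not Dataflow's actual algorithm, which is not described in these slides. The function `target_workers`, the per-worker capacity of 2,000 elements/s, and the worker bounds are all assumptions for illustration.

```python
import math


def target_workers(
    throughput_eps: float,
    per_worker_eps: float,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Workers needed for `throughput_eps`, clamped to [min_workers, max_workers]."""
    needed = math.ceil(throughput_eps / per_worker_eps)
    return max(min_workers, min(max_workers, needed))


# 16,258 elements/s is the throughput shown in the pipeline; 2,000 is an assumed capacity.
print(target_workers(16_258, per_worker_eps=2_000))
```

A real autoscaler would also smooth the throughput signal and account for backlog, so that short spikes do not cause workers to be added and removed repeatedly.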
Visualize demand/supply mismatches with GIS data
Your turn: Analyze real customer big data use cases
1. Navigate to cloud.google.com/customers/.
2. Filter Products & Solutions for Big Data Analytics.
3. Find an interesting customer use case.
4. Identify the key challenges, how they were solved with the cloud, and impact.