Google Cloud Computing Foundations: Data, ML, and AI in Google Cloud

The Google Cloud Computing Foundations courses help build cloud
literacy for individuals who have little to no background or experience in
cloud computing. They provide an overview of concepts central to cloud
infrastructure, application development, big data, and machine learning,
and where and how Google Cloud fits in.

This is the fourth and final course in a four-course series called Google
Cloud Computing Foundations. The courses should be completed in the
following order:

1. Google Cloud Computing Foundations: Cloud Computing Fundamentals
2. Google Cloud Computing Foundations: Infrastructure in Google Cloud
3. Google Cloud Computing Foundations: Networking and Security in Google Cloud
4. Google Cloud Computing Foundations: Data, ML, and AI in Google Cloud

This final course reviews managed big data services and machine learning, including its value proposition. By the end of this course, learners will be able to articulate these concepts and demonstrate some hands-on skills.

Prerequisites: Google Cloud Computing Foundations: Networking and Security in Google Cloud

Audience: New and introductory-level learners

The course features differ based on whether you're enrolled in the audit (free) or
verified (paid) track of a course.

Audit (Free) track: With this track, you will have access to all course materials
except graded assignments. You will not earn a verified certificate at the end of the
course. You will be able to access the free content for the expected course length
that you see on the Course About Page (4 weeks). After this duration, you will no
longer be able to access that course material.
If you decide that you do want to earn a certificate, you can pay to switch to the
verified track by the upgrade deadline and take the graded assignments required to
earn the certificate. As always, financial assistance is available for learners who need
it.

Verified (Paid) track: With this track, you will have access to all course materials
including graded assignments. You will have unlimited access to this course material
until the course ends. After the course ends, you will still have access to the material,
but you will no longer be able to submit graded assignments or earn a certificate.

You will pay a fee when you select the Verified track. Your fee allows us to fund the
running of the course, grade your work, and award you a certificate upon successful
completion. You can add the certificate to your LinkedIn profile or resume, or stack it
towards the Professional Certificate program.

Complete, pass, and earn a Verified Certificate in all four courses to receive
your Professional Certificate in the Google Cloud Computing Foundations
program. Learners must be verified and attain a pass mark of at least 75%
in each course to earn the Certificate.

 Google Cloud Computing Foundations: Cloud Computing Fundamentals
 Google Cloud Computing Foundations: Infrastructure in Google Cloud
 Google Cloud Computing Foundations: Networking and Security in Google Cloud
 Google Cloud Computing Foundations: Data, ML, and AI in Google Cloud

 If you have any questions or problems while studying this course,
there are a number of places you can go for help.
 edX Demo course
 If this is your first edX course, you may like to enroll in and complete
the short edX DemoX course. This demo course explains how to
navigate the edX platform, how to work with videos, and how to answer
different problem types. You can also find out about your Progress
and Grades, and about some of the different online tools available
to you.
 Technical problems (including login or certificate issues)
 To get help with a technical problem, click on the Help link (top right
hand corner next to your profile picture) to send a message to edX
Student Support. You can choose to search through their existing
help articles or submit a request. Simply follow the prompts and edX
Student Support will be in touch.
 Alternatively, you can access additional help information
via the edX Help Centre.
 Recommended browsers
 For the best user experience, we recommend viewing this course on
the latest versions of browsers such as Chrome, Firefox or Safari.
 Please note - We have encountered an issue when viewing
embedded tweets (from Twitter) within the edX app on Apple
devices. It will exit the app and open a browser window. If this
occurs, simply return to the app by clicking on the (edX) link in the
top left hand corner on your browser.
 Course content queries
 If you have any queries about the content of this course and how it
works, please use the Discussions area to ask the Course Team.
Please remember to post your query into the "General" forum when
you create your new post.
 Email us
 If you experience any other issues or want to contact us directly, feel
free to send a personal email to the Course Team: Contact Qwiklabs
Support.
 Please identify which course you are studying in the subject line.
 As this course is self-paced, please allow up to 5 working days
for us to respond.

Learners will need to gain 75% or above to successfully pass this course.
The assessment breakdown is as follows.
Assessment type    % of Final Grade    Due Date
Labs               100%                By course finish

Please note:

 There are 10 labs in this course and each lab contributes 10%
towards the final score.
 The labs are for verified users only and there is a set time limit for
each lab.
 You can complete the labs at any time while the course is open;
however, we recommend that you complete them sequentially,
after you complete the relevant module.

After completing this course you should be able to:

 Discover a variety of managed big data services in the cloud.
 Explain what machine learning is, the terminology used, and its value
proposition.


Lecture 1

You Have the Data, but What Are You Doing with It?
Welcome to module 9 of the Google Cloud Computing Foundations course: You have the data, but what are you doing with it?

In this module, you’ll look at some managed services that Google offers to process your big data. This means that you’ll:

 Explore big data managed services in the cloud.
 Examine using Dataproc to run Apache Hadoop, Apache Spark, and other big data technologies as a managed service in the cloud.
 Learn about building ETL pipelines as a managed service by using Dataflow.
 Explore BigQuery as a managed data warehouse and analytics engine.

The module agenda follows the objectives, starting with an introduction to big data managed services in the cloud before moving on to how big data operations can be used through Dataproc. You’ll then complete two labs by using the Google Cloud console and then the gcloud CLI to create a Dataproc cluster and perform various tasks. After the labs, you’ll explore the use of Dataflow to perform extract, transform, and load operations.

The following two labs provide an opportunity to learn more about Dataflow. In the first one, you’ll use a Dataflow template to create a streaming pipeline. In the second, you’ll set up a Python development environment, get the Dataflow SDK for Python, and use the Google Cloud console to run an example pipeline.

In the final section, you’ll learn about the role of BigQuery as a data warehouse, before finishing with another hands-on lab, on Dataprep. The module concludes with a final hands-on lab, followed by a short quiz and a recap of the topics covered.
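Before the labs, it can help to see the extract, transform, and load pattern in miniature. The sketch below is plain Python on in-memory data, meant only to illustrate the ETL stages that Dataflow runs at scale; the function names, log format, and "warehouse" list are invented for illustration and are not part of the Dataflow SDK.

```python
# A minimal, in-memory sketch of the extract-transform-load (ETL) pattern.
# All names and the log format here are illustrative, not Dataflow APIs.

def extract(raw_lines):
    """Extract: split raw comma-separated log lines into records."""
    return [line.split(",") for line in raw_lines if line.strip()]

def transform(records):
    """Transform: keep non-error requests and normalize the path field."""
    return [
        {"path": path.strip().lower(), "status": int(status)}
        for path, status in records
        if int(status) < 400
    ]

def load(rows, sink):
    """Load: append the cleaned rows to a destination (here, a list)."""
    sink.extend(rows)
    return sink

warehouse = []
raw = ["/Home,200", "/missing,404", "/About,301"]
load(transform(extract(raw)), warehouse)
print(warehouse)  # only the non-error rows remain, with lowercased paths
```

In Dataflow, each of these stages would become a pipeline transform and the sink would be a real destination such as BigQuery, but the shape of the work is the same.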
Introduction to Big Data Managed Services in the Cloud
In this first section, we'll discuss big data managed services in the cloud. Before we explore this in detail, let's take a moment to conceptualize big data.

Enterprise storage systems are leaving the terabyte behind as a measure of data size, with petabytes becoming the norm. We know that one petabyte is 1 million gigabytes, or 1,000 terabytes, but how big is that? From one perspective, a petabyte of data might seem like more than you'll ever need. For example, you'd need a stack of floppy discs higher than 12 Empire State Buildings to store one petabyte. If you wanted to download one petabyte over a 4G network, you'd have to sit and wait for 27 years. One petabyte would also store every tweet ever tweeted, multiplied by 50. So one petabyte is pretty big. If we look at it from a different perspective, though, one petabyte is only enough to store two micrograms of DNA or one day's worth of video uploaded to YouTube. So for some industries, a petabyte of data might not be much at all.

Every company stores data in some way, and now they're trying to use that data to gain some insight into their business operations. This is where big data comes in. Big data architectures allow companies to analyze their stored data to learn about their business.

In this module, we'll focus on three managed services that Google offers for the processing of data. For companies that have already invested in Apache Hadoop and Apache Spark and want to continue using these tools, Dataproc provides a great way to run open-source software in Google Cloud. However, companies looking for a streaming data solution might be more interested in Dataflow as a managed service. Dataflow is optimized for large-scale batch processing or long-running stream processing of structured and unstructured data. The third managed service that we'll look at is BigQuery, which provides a data analytics solution optimized for getting answers rapidly over petabyte-scale datasets. BigQuery allows for fast SQL queries on structured data.
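The "27 years over 4G" figure can be sanity-checked with back-of-the-envelope arithmetic. The sustained throughput below is an assumption (real 4G speeds vary widely), chosen to show the claim is plausible:

```python
# Back-of-the-envelope check of the "27 years to download a petabyte" claim.
# The 9.5 Mbps sustained rate is an assumption; real 4G speeds vary widely.
PETABYTE_BITS = 10**15 * 8          # 1 PB = 10^15 bytes = 8 * 10^15 bits
RATE_BPS = 9.5e6                    # assumed sustained 4G throughput, bits/s
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = PETABYTE_BITS / RATE_BPS / SECONDS_PER_YEAR
print(f"{years:.1f} years")  # on the order of 27 years at this rate
```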
Leverage Big Data Operations with Cloud Dataproc
In this section, we’ll learn how Dataproc provides a fast, easy, cost-effective way to run Apache Hadoop and Apache Spark.
Apache Hadoop and Apache Spark are open source technologies that
often are the foundation
of big data processing.
Apache Hadoop is a set of tools and technologies which enables a
cluster of computers to store
and process large volumes of data.
It intelligently ties individual computers together in a cluster to
distribute the storage
and processing of data.
Apache Spark is a unified analytics engine for large-scale data
processing and achieves
high performance for both batch and stream data.
Dataproc is a managed Spark and Hadoop service that lets you use
open source data tools for
batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them
easily, and because clusters
are typically run ephemerally, you save money as they are turned off
when you don't need
them.
Let’s look at the key features of Dataproc.
Cost effective: Dataproc is priced at 1 cent per virtual CPU per cluster
per hour, on top
of any other Google Cloud resources you use.
In addition, Dataproc clusters can include preemptible instances that
have lower compute
prices.
You use and pay for things only when you need them.
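Using the 1-cent-per-vCPU-per-hour rate quoted above (list prices change and vary by region, so treat this as an estimate of the service fee only, on top of the underlying VM cost), the Dataproc premium for an ephemeral cluster is easy to work out:

```python
# Estimate the Dataproc service fee for an ephemeral cluster, using the
# 1-cent-per-vCPU-per-hour rate quoted in the text. This covers only the
# Dataproc premium; the underlying Compute Engine VMs are billed separately.
def dataproc_fee(num_workers, vcpus_per_worker, hours, rate=0.01):
    return num_workers * vcpus_per_worker * hours * rate

# For example, a 10-node cluster of 4-vCPU workers running for 2 hours:
fee = dataproc_fee(num_workers=10, vcpus_per_worker=4, hours=2)
print(f"${fee:.2f}")  # $0.80 on top of the VM cost
```

Because clusters are quick to create and delete, paying only for those two hours, rather than a cluster idling around the clock, is where the savings described above come from.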
Fast and scalable: Dataproc clusters are quick to start, scale, and shut
down, and each of
these operations takes 90 seconds or less, on average.
Clusters can be created and scaled quickly with many virtual machine
types, disk sizes,
number of nodes, and networking options.
Open source ecosystem: You can use Spark and Hadoop tools,
libraries, and documentation
with Dataproc.
Dataproc provides frequent updates to native versions of Spark,
Hadoop, Pig, and Hive,
so learning new tools or APIs is not necessary, and you can move
existing projects or ETL
pipelines without redevelopment.
Fully managed: You can easily interact with clusters and Spark or
Hadoop jobs, without
the assistance of an administrator or special software, through the
Cloud Console, the Google
Cloud SDK, or the Dataproc REST API.
When you're done with a cluster, simply turn it off, so money isn’t
spent on an idle
cluster.
Image versioning: Dataproc’s image versioning feature lets you
switch between different
versions of Apache Spark, Apache Hadoop, and other tools.
Built-in integration: The built-in integration with Cloud Storage,
BigQuery, and Cloud Bigtable
ensures that data will not be lost.
This, together with Cloud Logging and Cloud Monitoring, provides a
complete data platform
and not just a Spark or Hadoop cluster.
For example, you can use Dataproc to effortlessly extract, transform,
and load terabytes of
raw log data directly into BigQuery for business reporting.
Let’s look at a few Dataproc use cases.
In this first example, a customer processes 50 gigabytes of text log
data per day from
several sources.
The objective is to produce aggregated data that is then loaded into
databases from which
metrics are gathered for daily reporting, management dashboards, and
analysis.
Until now, they have used a dedicated on-premises cluster to store and
process the logs with
MapReduce.
So what’s the solution?
First, Cloud Storage can act as a landing zone for the log data at a low
cost.
A Dataproc cluster can then be created in less than 2 minutes to
process this data with
their existing MapReduce.
Once completed, the Dataproc cluster can be removed immediately.
In terms of value, instead of running all the time and incurring costs
even when not
used, Dataproc only runs to process the logs, which reduces cost and
complexity.
Now, let’s analyze a second example.
In this organization, analysts rely on—and are comfortable using—
Spark Shell.
However, their IT department is concerned about the increase in
usage, and how to scale
their cluster, which is running in standalone mode.
The solution is for Dataproc to create clusters that scale for speed and
mitigate any single
point of failure.
Since Dataproc supports Spark, Spark SQL, and PySpark, they could
use the web interface,
Cloud SDK, or the native Spark Shell through SSH.
The value is Dataproc’s ability to quickly unlock the power of the
cloud for anyone without
added technical complexity.
Running complex computations would take seconds instead of
minutes or hours.
In this third example, a customer uses the Spark machine learning
libraries (MLlib) to
run classification algorithms on very large datasets.
They rely on cloud-based machines where they install and customize
Spark.
Because Spark and the MLlib can be installed on any Dataproc
cluster, the customer can
save time by quickly creating Dataproc clusters.
Any additional customizations can be applied easily to the entire
cluster through initialization
actions.
To monitor workflows, they can use the built-in Cloud Logging and
Cloud Monitoring.
In terms of value, resources can be focused on the data with Dataproc,
not spent on cluster
creation and management.
Integrations with new Google Cloud products also unlock new
features for Spark clusters.
Lab Intro: Dataproc: Qwik Start – Console
This is the first of two labs where we’ll work with Dataproc. In this lab, titled “Dataproc: Qwik Start – Console,” you’ll use the Cloud Console to create a Dataproc cluster, run a simple Apache Spark job in the cluster, and modify the number of workers in the cluster. During the lab you’ll create the cluster, submit a job, and then view the job output.

The Qwiklabs feature is provided by Qwiklabs, a Google company. By proceeding, you will be leaving edX and entering a new environment hosted by Qwiklabs.

To use the Qwiklabs feature, you will need a Qwiklabs account. To facilitate account creation, edX will send your first name, last name, and email address; we will always use your primary email address as registered in the platform. Use of the Qwiklabs product is subject to additional terms and conditions and the Qwiklabs privacy policy; by proceeding to use this service, you are accepting the Qwiklabs Terms of Use and Privacy Policy. Your access to Qwiklabs is made possible by a partnership between edX and Google. By continuing, certain individual usage statistics generated from your viewing of the Google Cloud authorized courses and completions of related courses may be shared with Google. Please contact learner support at https://courses.edx.org/support/contact_us if you have any questions.
