
SECTION 5

What is BigQuery?

BigQuery is a fully managed, serverless data warehouse that allows for super-fast SQL queries using Google's infrastructure. It lets you analyze large datasets in real time using familiar SQL syntax, making it ideal for quick querying, real-time analytics, and machine learning.

How BigQuery Works

• Data Storage: Data is stored in tables with schemas that define field names and types.
• SQL Queries: You can run SQL queries to join tables, aggregate data, and perform complex calculations (see the sketch after this list).
• Resource Management: BigQuery automatically provisions resources based on load, so you don't need to manage infrastructure.
• Distributed Architecture: Queries run in parallel across many machines, which lets BigQuery process large datasets quickly.
• Data Loading: You can load data from sources like Google Cloud Storage and export data to various destinations.
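To make the "SQL Queries" point above concrete, here is a minimal sketch using the google-cloud-bigquery Python client. It is illustrative only: the project, dataset, and table names are hypothetical, and it assumes application-default credentials are configured.

    from google.cloud import bigquery

    # Hypothetical project name; assumes application-default credentials.
    client = bigquery.Client(project="my-project")

    query = """
        SELECT first_name, COUNT(*) AS employee_count
        FROM `my-project.my_dataset.employees`
        GROUP BY first_name
        ORDER BY employee_count DESC
    """

    # BigQuery executes the query on its distributed backend; the client
    # just submits the job and iterates over the results.
    for row in client.query(query).result():
        print(row.first_name, row.employee_count)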

Getting Started with BigQuery

1. Accessing BigQuery:
   o Click on the navigation menu and select BigQuery.
2. Creating a Dataset:
   o Click on the three dots next to your project ID, select "Create dataset," name it (e.g., "my_dataset"), choose the location (e.g., us-central1), and optionally enable table expiration.
3. Creating a Table:
   o Within your dataset, click on the three dots, select "Create table," choose to create an empty table, and name it (e.g., "employees").
   o Add fields to the schema, such as "employee_id" (integer), "first_name" (string), and "last_name" (string).
   o Click "Create table." (The same steps are scripted in the sketch after this list.)
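If you prefer to script this setup instead of clicking through the console, here is a minimal sketch with the google-cloud-bigquery Python client. The project ID is hypothetical; the dataset name, location, table name, and schema mirror the steps above.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project ID

    # Step 2: create the dataset in the chosen location.
    dataset = bigquery.Dataset(f"{client.project}.my_dataset")
    dataset.location = "us-central1"
    client.create_dataset(dataset, exists_ok=True)

    # Step 3: create an empty "employees" table with the schema above.
    schema = [
        bigquery.SchemaField("employee_id", "INTEGER"),
        bigquery.SchemaField("first_name", "STRING"),
        bigquery.SchemaField("last_name", "STRING"),
    ]
    table = bigquery.Table(f"{client.project}.my_dataset.employees", schema=schema)
    client.create_table(table, exists_ok=True)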


What is Cloud Composer?

Cloud Composer acts like a conductor for your data workflows. It helps automate and
manage tasks such as data pipelines, ETL processes, and data transformations across
different services in Google Cloud. Imagine it as a tool that ensures your data
operations run smoothly and on time, much like how a conductor manages various
instruments in a musical performance.

Key Features:

• Workflow Definition: Workflows are defined using Directed Acyclic Graphs (DAGs), which outline tasks and their dependencies.
• Integration with Google Cloud: Found under the Data Analytics section in Google Cloud Platform, it seamlessly integrates with other Google services.
• Environment Setup: You can create environments (like Composer 2) to configure machine types, locations, and networking specifics for optimal performance.

Example Workflow:

• Task 1 - Extract Customer Data: Queries customer data from Google BigQuery using SQL to fetch details like customer IDs and emails of recent purchasers.
• Task 2 - Generate Email Content: Processes the retrieved data to create personalized email content thanking customers for their purchases.
• Task 3 - Send Email: Sends out personalized emails using SMTP server settings, ensuring each email reaches the correct customer.
Creating a DAG:

To automate this workflow, you'd define a DAG in a Python file named "send_emails_DAG.py". This script specifies tasks, their sequence, and scheduling intervals (e.g., daily execution). Tasks are executed in the order defined, ensuring dependencies are met.
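Here is a minimal sketch of what send_emails_DAG.py might look like, assuming Airflow 2 (the version used by Composer 2 environments). The task bodies are hypothetical placeholders; only the structure, ordering, and daily schedule come from the example above.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder task logic: in a real DAG these would call BigQuery,
    # build the message bodies, and talk to an SMTP server.
    def extract_customer_data():
        ...  # Task 1: fetch customer IDs and emails of recent purchasers

    def generate_email_content():
        ...  # Task 2: create a personalized thank-you email per customer

    def send_email():
        ...  # Task 3: send each email through the configured SMTP server

    with DAG(
        dag_id="send_emails_DAG",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # daily execution, as described above
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_customer_data",
                                 python_callable=extract_customer_data)
        generate = PythonOperator(task_id="generate_email_content",
                                  python_callable=generate_email_content)
        send = PythonOperator(task_id="send_email", python_callable=send_email)

        # Dependencies: each task runs only after the previous one succeeds.
        extract >> generate >> send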

Monitoring and Execution:

Once configured, Cloud Composer handles the execution of these workflows. It provides monitoring and logging capabilities to track task completion, errors, and overall workflow performance.

What is Google Cloud Data Fusion?

Google Cloud Data Fusion is a managed service that helps organizations integrate data without needing extensive coding. It provides a graphical interface to design, deploy, and manage ETL (Extract, Transform, Load) pipelines. These pipelines move and transform data across Google Cloud services efficiently.

Example Use Case:

Imagine you have customer data stored in a MySQL database and want to move it to Google BigQuery for analysis. With Data Fusion, you can extract the data, transform it to fit BigQuery's format, and then load it into BigQuery seamlessly.

Using Data Fusion:

1. Creating an Instance: Start by creating an instance in Data Fusion, which provides an environment to work on your data integration tasks.
2. Using the Interface: Data Fusion offers tools like "Wrangler" for transforming data without coding. You can upload data from sources like Google Cloud Storage, apply transformations (e.g., masking sensitive information), and prepare it for loading into BigQuery.
3. Deploying Pipelines: Once transformations are defined, you can create and deploy ETL pipelines to automate these processes. Data Fusion handles the execution and scaling of these pipelines.

What is Pub/Sub?
Google Cloud Pub/Sub is a messaging service designed for asynchronous
communication between different parts of an application or between services. It
enables scalable and decoupled communication through a publish-subscribe model.
Imagine it like a bulletin board where friends share updates: anyone interested can
read messages without waiting for direct communication.

How Pub/Sub Works

In Pub/Sub, messages are published to topics by publishers (like applications or services) without knowing who subscribes or how many subscribers there are. Subscribers express interest by creating subscriptions to topics, receiving all messages sent to that topic. This setup allows for efficient, real-time communication across systems.

Publishers: Send messages to topics without needing to know about subscribers.

Subscribers: Receive and process messages from topics they subscribe to.

Topics: Named resources where messages are sent by publishers and received by
subscribers.

Subscriptions: Linked to topics, subscriptions receive messages from topics and have
unique identifiers.
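To make the publisher side concrete, here is a minimal sketch using the google-cloud-pubsub Python client. The project ID and topic name are hypothetical, and the topic is assumed to already exist.

    from google.cloud import pubsub_v1

    # Hypothetical project and topic names; the topic must already exist.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "news-updates")

    # Payloads are bytes; attributes (such as a category) are optional
    # strings that subscribers can use to filter or route messages.
    future = publisher.publish(topic_path, b"Local team wins the cup!",
                               category="sports")
    print(f"Published message ID: {future.result()}")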

Pub/Sub is ideal for:

Event-Driven Systems: Like microservices, where services publish events and others
react.

Real-Time Data Processing: Ingesting high volumes of data from diverse sources for
analysis or storage.

Messaging Applications: Where users or systems publish to their own topics and
others subscribe to receive messages.

1. Creating Topics and Subscriptions:
   o Create a topic for "news updates".
   o Establish subscriptions like "sports", "tech", and "general news" to receive specific updates.
2. Configuring Subscriptions:
   o Choose between push (real-time delivery to an endpoint) or pull (subscribers fetch messages when ready).
   o Specify endpoints like URLs for push subscriptions to receive messages immediately.
3. Writing Subscriber Scripts:
   o Develop scripts for each subscription (e.g., sports, tech, general) to process and respond to news updates (see the sketch after this list).
4. Running the System:
   o Execute scripts to see how messages flow from publishers (sending news updates to the "news updates" topic) to subscribers (receiving and processing updates based on their interests).
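Here is a minimal sketch of a subscriber script for step 3, using the google-cloud-pubsub Python client with a pull subscription. The project and subscription names are hypothetical; the "sports" subscription is assumed to already exist on the "news updates" topic (step 1).

    from concurrent.futures import TimeoutError

    from google.cloud import pubsub_v1

    # Hypothetical names; the pull subscription must already exist.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "sports")

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        # Process the update, then acknowledge it so Pub/Sub
        # stops redelivering the message.
        print(f"Sports update: {message.data.decode('utf-8')}")
        message.ack()

    # Pull delivery: the client opens a streaming pull and invokes the
    # callback for every message that arrives on the subscription.
    streaming_pull_future = subscriber.subscribe(subscription_path,
                                                 callback=callback)

    with subscriber:
        try:
            # Block for a while; a real service would run indefinitely.
            streaming_pull_future.result(timeout=30)
        except TimeoutError:
            streaming_pull_future.cancel()
            streaming_pull_future.result()  # wait for the shutdown to finish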
