Session 5
What is BigQuery?
BigQuery is a fully managed, serverless data warehouse that runs fast SQL queries
on Google's infrastructure. It lets you analyze large datasets in near real time
using SQL, which makes it well suited to ad hoc querying, real-time analytics, and
machine learning.
Data Storage: Data is stored in tables with schemas that define field names
and types.
SQL Queries: You can run SQL-like queries to join tables, aggregate data,
and perform complex calculations.
Resource Management: BigQuery automatically handles resources based on
load, so you don't need to manage infrastructure.
Distributed Architecture: This allows it to process large datasets quickly.
Data Loading: You can load data from sources like Google Cloud Storage
and export data to various destinations.
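To make the "join tables, aggregate data" point concrete, here is a minimal sketch of that kind of query. It uses Python's built-in sqlite3 module as a local stand-in, not BigQuery itself, and the tables, columns, and rows are invented for illustration. BigQuery's SQL dialect and client differ in detail, but the shape of the query is similar.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in; BigQuery is not involved.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, invented for this example.
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, dept_id INTEGER, salary REAL);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES
        (1, 'Asha', 10, 50000), (2, 'Ben', 10, 70000), (3, 'Chen', 20, 60000);
    INSERT INTO departments VALUES (10, 'Engineering'), (20, 'Sales');
""")

# Join the two tables, then aggregate per department.
query = """
    SELECT d.dept_name, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
    FROM employees AS e
    JOIN departments AS d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each result row holds a department name, its headcount, and its average salary.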
1. Accessing BigQuery:
o Click on the navigation menu and select BigQuery.
2. Creating a Dataset:
o Click on the three dots next to your project ID, select "Create dataset,"
name it (e.g., "my_dataset"), choose the location (e.g., us-central1),
and optionally enable table expiration.
3. Creating a Table:
o Within your dataset, click on the three dots, select "Create table,"
choose to create an empty table, and name it (e.g., "employees").
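The console steps above can also be scripted. The following is a rough sketch that assumes the google-cloud-bigquery Python client is installed and credentials are configured. The project ID is a placeholder you supply, the location is written as the region ID us-central1, and the two schema fields are hypothetical, since the notes only say the table starts empty.

```python
def full_table_id(project: str, dataset: str, table: str) -> str:
    """Build the 'project.dataset.table' identifier that BigQuery uses."""
    return f"{project}.{dataset}.{table}"


def create_dataset_and_table(project: str) -> None:
    """Create 'my_dataset' and an empty 'employees' table in the given project."""
    # Imported here so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)

    # Step 2: create the dataset in a chosen location.
    dataset = bigquery.Dataset(f"{project}.my_dataset")
    dataset.location = "us-central1"
    client.create_dataset(dataset, exists_ok=True)

    # Step 3: create an empty table. These fields are assumptions for illustration.
    schema = [
        bigquery.SchemaField("id", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("name", "STRING"),
    ]
    table = bigquery.Table(
        full_table_id(project, "my_dataset", "employees"), schema=schema
    )
    client.create_table(table, exists_ok=True)
```

Calling create_dataset_and_table("your-project-id") would perform both steps. With exists_ok=True, rerunning it does not fail if the dataset or table already exists.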
What is Cloud Composer?
Cloud Composer acts like a conductor for your data workflows. It helps automate and
manage tasks such as data pipelines, ETL processes, and data transformations across
different services in Google Cloud. Imagine it as a tool that ensures your data
operations run smoothly and on time, much like how a conductor manages various
instruments in a musical performance.
Key Features:
Example Workflow:
Task 1 Query Customer Data:
Queries customer data from Google BigQuery using SQL to fetch details like
customer IDs and emails of recent purchasers.
Task 2 Generate Email Content:
Builds a personalized email for each customer from the queried details.
Task 3 Send Emails:
Sends out personalized emails using SMTP server settings, ensuring each
email reaches the correct customer.
Creating a DAG:
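A DAG (directed acyclic graph) declares tasks and the order they depend on each other. The sketch below models the example workflow (query, generate, send) in plain Python with the standard-library graphlib module. It is not the Airflow API that Cloud Composer runs, and the task functions, customer records, and email text are stand-ins invented for illustration.

```python
from graphlib import TopologicalSorter


def query_customers():
    # Stand-in for the BigQuery query; returns hypothetical rows.
    return [
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 2, "email": "b@example.com"},
    ]


def generate_emails(customers):
    # Build one personalized message per customer.
    return [
        {"to": c["email"], "body": f"Hello customer {c['customer_id']}, thank you!"}
        for c in customers
    ]


def send_emails(messages):
    # Stand-in for SMTP delivery; just reports the recipients.
    return [m["to"] for m in messages]


# Each task maps to (function, list of upstream tasks it depends on).
tasks = {
    "query_customers": (query_customers, []),
    "generate_emails": (generate_emails, ["query_customers"]),
    "send_emails": (send_emails, ["generate_emails"]),
}

# Resolve a valid execution order from the dependencies.
graph = {name: set(deps) for name, (_, deps) in tasks.items()}
order = list(TopologicalSorter(graph).static_order())

# Run each task, passing in the results of its upstream tasks.
results = {}
for name in order:
    func, deps = tasks[name]
    results[name] = func(*(results[d] for d in deps))

print(order)
print(results["send_emails"])
```

In a real Composer environment the same dependencies would be declared between Airflow operators, and the scheduler would decide when each task runs.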
What is Cloud Data Fusion?
Google Cloud Data Fusion is a managed service that helps organizations integrate
data without needing extensive coding. It provides a graphical interface to design,
deploy, and manage ETL (Extract, Transform, Load) pipelines. These pipelines move
and transform data across Google Cloud services efficiently.
What is Pub/Sub?
Google Cloud Pub/Sub is a messaging service designed for asynchronous
communication between different parts of an application or between services. It
enables scalable and decoupled communication through a publish-subscribe model.
Imagine it like a bulletin board where friends share updates: anyone interested can
read messages without waiting for direct communication.
Publishers: Send messages to topics.
Subscribers: Receive and process messages from topics they subscribe to.
Topics: Named resources where messages are sent by publishers and received by
subscribers.
Subscriptions: Linked to topics, subscriptions receive messages from topics and have
unique identifiers.
Event-Driven Systems: For example, microservices in which some services publish
events and others react to them.
Real-Time Data Processing: Ingesting high volumes of data from diverse sources for
analysis or storage.
Messaging Applications: Users or systems publish to their own topics, and others
subscribe to receive the messages.
2. Configuring Subscriptions:
Develop scripts for each subscription (e.g., sports, tech, general) to process
and respond to news updates.
4. Running the System:
Execute scripts to see how messages flow from publishers (sending news
updates to the "news updates" topic) to subscribers (receiving and processing
updates based on their interests).
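The message flow described above can be sketched with a small in-memory model. This is not the Google Cloud Pub/Sub client library. The InMemoryPubSub class, the "news-updates" topic name, the message fields, and the category filter applied by each subscriber are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class InMemoryPubSub:
    """Toy broker: each subscription on a topic gets its own copy of every message."""

    def __init__(self):
        # topic name -> {subscription name: queue of pending messages}
        self._subscriptions = defaultdict(dict)

    def create_subscription(self, topic, name):
        self._subscriptions[topic][name] = deque()

    def publish(self, topic, message):
        # Fan out: deliver a separate copy to every subscription on the topic.
        for queue in self._subscriptions[topic].values():
            queue.append(dict(message))

    def pull(self, topic, name):
        # Return and clear everything pending for one subscription.
        queue = self._subscriptions[topic][name]
        messages = list(queue)
        queue.clear()
        return messages


TOPIC = "news-updates"
broker = InMemoryPubSub()
for name in ("sports", "tech", "general"):
    broker.create_subscription(TOPIC, name)

# Publisher side: send two news updates to the topic.
broker.publish(TOPIC, {"category": "sports", "headline": "Local team wins final"})
broker.publish(TOPIC, {"category": "tech", "headline": "New phone released"})


def process(subscription):
    """Subscriber script: keep only headlines matching the subscriber's interest."""
    messages = broker.pull(TOPIC, subscription)
    if subscription == "general":
        return [m["headline"] for m in messages]
    return [m["headline"] for m in messages if m["category"] == subscription]


received = {name: process(name) for name in ("sports", "tech", "general")}
print(received)
```

The publisher never refers to any subscriber directly. It only knows the topic, which is what keeps the two sides decoupled.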