Pythons Basics
Course Introduction
OBJECTIVES
● Data-maturity model
● dbt and data architectures
● Data warehouses, data lakes, and lakehouses
● ETL and ELT procedures
● dbt fundamentals
● Analytics Engineering
TOP-DOWN
1 Theory
2 Practice
3 Project
BOTTOM-UP
1 Practice
2 Project
3 Theory
Data Maturity Model
Maslow's Hierarchy of Needs
5 Self-actualization
4 Esteem Needs
3 Love and Belonging Needs
2 Safety Needs
1 Physiological Needs
Data-Maturity Model
5 Artificial Intelligence
4 BI and Analytics
3 Data Integration
2 Data Wrangling
1 Data Collection
Typical Data Architecture
4 ML/AI/BI
3 Data Integration
2 Data Wrangling
1 Data Collection
Data Collection
1 Data Collection
Data Wrangling
2 Data Wrangling
1 Data Collection
Data Integration
3 Data Integration
2 Data Wrangling
1 Data Collection
ETL - ELT
ETL
ELT
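The difference between the two procedures is the order of the steps. A minimal sketch, assuming an in-memory SQLite database as a stand-in for the warehouse and made-up host rows: ETL transforms the data in application code before loading it, while ELT loads the raw data first and transforms it inside the warehouse with SQL. The table names and the `clean` helper are hypothetical.

```python
import sqlite3

# Hypothetical raw records; sqlite3 stands in for the warehouse.
raw_rows = [(1, " Alice "), (2, None), (3, "bob")]


def clean(name):
    """Transformation step: trim whitespace, default missing names."""
    return name.strip() if name is not None else "Anonymous"


con = sqlite3.connect(":memory:")

# ETL: transform in application code first, then load the finished table.
con.execute("CREATE TABLE hosts_etl (id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO hosts_etl VALUES (?, ?)",
    [(row_id, clean(name)) for row_id, name in raw_rows],
)

# ELT: load the raw data untouched, then transform inside the warehouse with SQL.
con.execute("CREATE TABLE raw_hosts (id INTEGER, name TEXT)")
con.executemany("INSERT INTO raw_hosts VALUES (?, ?)", raw_rows)
con.execute(
    """
    CREATE TABLE hosts_elt AS
    SELECT id, COALESCE(TRIM(name), 'Anonymous') AS name
    FROM raw_hosts
    """
)

etl_result = con.execute("SELECT name FROM hosts_etl ORDER BY id").fetchall()
elt_result = con.execute("SELECT name FROM hosts_elt ORDER BY id").fetchall()
print(etl_result == elt_result)
```

Both paths end with the same cleaned table; in the ELT path the raw table also stays available in the warehouse for later transformations.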
Normalization
1NF
2NF
3NF
Data Warehouse
On-Premise / Cloud
External Tables
Data Lake
SCD Type 1
Update the DWH table when a dimension changes, overwriting the original data.
Example: the source value changes from "No Air-conditioning" to "Installed Air-conditioning", and the DWH row is updated to match.
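The overwrite can be sketched as a plain UPDATE. This is a minimal illustration, assuming an in-memory SQLite database in place of the DWH; the `dim_listing` table and its columns are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical DWH dimension table; column names are illustrative.
con.execute(
    "CREATE TABLE dim_listing (listing_id INTEGER PRIMARY KEY, air_conditioning TEXT)"
)
con.execute("INSERT INTO dim_listing VALUES (1, 'No Air-conditioning')")

# The source changes: SCD Type 1 overwrites the value in place.
con.execute(
    "UPDATE dim_listing SET air_conditioning = ? WHERE listing_id = ?",
    ("Installed Air-conditioning", 1),
)

scd1_rows = con.execute("SELECT * FROM dim_listing").fetchall()
print(scd1_rows)
```

After the update there is still exactly one row per listing, and the previous value is no longer recoverable from the DWH.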
SCD Type 2
Keep full history: add an additional (historical) row for each dimension change.
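A Type 2 change can be sketched as "close the current row, then append a new one". The example below is an illustration only, assuming SQLite as a stand-in for the DWH; the `valid_from` / `valid_to` columns, the dates, and the `apply_change` helper are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical layout: valid_to IS NULL marks the current version of a row.
con.execute(
    "CREATE TABLE dim_listing ("
    "listing_id INTEGER, air_conditioning TEXT, valid_from TEXT, valid_to TEXT)"
)
con.execute(
    "INSERT INTO dim_listing VALUES (1, 'No Air-conditioning', '2024-01-01', NULL)"
)


def apply_change(listing_id, new_value, changed_on):
    """Close the current row, then append a new current row."""
    con.execute(
        "UPDATE dim_listing SET valid_to = ? "
        "WHERE listing_id = ? AND valid_to IS NULL",
        (changed_on, listing_id),
    )
    con.execute(
        "INSERT INTO dim_listing VALUES (?, ?, ?, NULL)",
        (listing_id, new_value, changed_on),
    )


apply_change(1, "Installed Air-conditioning", "2024-06-01")
scd2_rows = con.execute(
    "SELECT * FROM dim_listing ORDER BY valid_from"
).fetchall()
print(scd2_rows)
```

The table now holds two rows for the same listing: the closed historical version and the current one.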
SCD Type 3
Keep limited history: add separate columns for the original and the current value.
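A Type 3 dimension can be sketched with two columns per tracked attribute. This is a minimal illustration on SQLite; the column names `original_room_type` / `current_room_type` and the values are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical layout: one column for the original value, one for the current value.
con.execute(
    "CREATE TABLE dim_listing ("
    "listing_id INTEGER PRIMARY KEY, original_room_type TEXT, current_room_type TEXT)"
)
con.execute("INSERT INTO dim_listing VALUES (1, 'Private', 'Private')")

# On a change only the "current" column is overwritten; the original is kept.
con.execute(
    "UPDATE dim_listing SET current_room_type = ? WHERE listing_id = ?",
    ("Entire home", 1),
)

scd3_rows = con.execute("SELECT * FROM dim_listing").fetchall()
print(scd3_rows)
```

Only the original and the latest value survive; any values in between are lost, which is the "limited history" trade-off.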
CTE (Common Table Expression)
Syntax
Example
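A CTE is a named result set defined with `WITH` and visible only inside the query that follows it. The sketch below runs one against SQLite; the `reviews` table and its rows are made up for the illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (listing_id INTEGER, sentiment TEXT)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, "positive"), (1, "negative"), (2, "positive")],
)

query = """
WITH positive_reviews AS (      -- the CTE: a named, query-scoped result set
    SELECT listing_id
    FROM reviews
    WHERE sentiment = 'positive'
)
SELECT listing_id, COUNT(*) AS positive_count
FROM positive_reviews
GROUP BY listing_id
ORDER BY listing_id
"""
cte_rows = con.execute(query).fetchall()
print(cte_rows)
```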
PROJECT OVERVIEW
Analytics Engineering with Airbnb
● Simulating the life of an Analytics Engineer at Airbnb
That’s it :)
SNOWFLAKE
Registering a Trial account
DATA FLOW
Overview
INPUT DATA MODEL
DATA FLOW OVERVIEW
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer (basic checks): src_listings, src_hosts, src_reviews
● Core layer (cleansed): Hosts, Listings, Reviews
● Presentation layer: Listings mart table, Reviews mart table
DBT SETUP
Mac
VIRTUALENV SETUP
● Install Python 3.11 and the Python virtualenv package
● Create a virtualenv
● Activate virtualenv
DBT SETUP
Windows
INSTALLING DBT
● Install Python 3
● Create a virtualenv
● Activate virtualenv
● raw_listings → src_listings (basic checks)
● raw_hosts → src_hosts (basic checks)
● raw_reviews → src_reviews (basic checks)
GUIDED EXERCISE
src_reviews.sql
Create a new model in the `models/src/` folder called
`src_reviews.sql`.
● Use a CTE to reference the AIRBNB.RAW.RAW_REVIEWS table
● SELECT every column and every record, and rename the following columns:
○ date to review_date
○ comments to review_text
○ sentiment to review_sentiment
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
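The query shape the exercise asks for can be tried out locally before writing the dbt model. This is a sketch, not the course solution: SQLite stands in for Snowflake, `airbnb_raw_reviews` stands in for AIRBNB.RAW.RAW_REVIEWS, and `listing_id` is an assumed extra column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stand-in for AIRBNB.RAW.RAW_REVIEWS; listing_id is an assumed extra column.
con.execute(
    "CREATE TABLE airbnb_raw_reviews ("
    "listing_id INTEGER, date TEXT, comments TEXT, sentiment TEXT)"
)
con.execute(
    "INSERT INTO airbnb_raw_reviews VALUES (1, '2024-01-15', 'Great stay', 'positive')"
)

src_reviews_sql = """
WITH raw_reviews AS (
    SELECT * FROM airbnb_raw_reviews
)
SELECT
    listing_id,
    date AS review_date,
    comments AS review_text,
    sentiment AS review_sentiment
FROM raw_reviews
"""
cursor = con.execute(src_reviews_sql)
src_columns = [col[0] for col in cursor.description]
src_rows = cursor.fetchall()
print(src_columns)
```

In the dbt model only the SELECT statement goes into `src_reviews.sql`; dbt creates the resulting object when `dbt run` is executed.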
MATERIALIZATIONS
LEARNING OBJECTIVES
● Understand how models can be connected
project level
View
● Use it:
○ You want a lightweight representation
○ You don’t reuse data too often
● Don’t use it:
○ You read from the same model several times
Table
● Use it:
○ You read from this model repeatedly
● Don’t use it:
○ Building single-use models
○ Your model is populated incrementally
Incremental
● Use it:
○ Fact tables
○ Appends to tables
● Don’t use it:
○ You want to update historical records
Ephemeral
● Use it:
○ You merely want an alias to your data
● Don’t use it:
○ You read from the same model several times
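The trade-offs above follow from what each of dbt's four materializations (view, table, incremental, ephemeral) leaves behind in the database. The sketch below imitates them by hand on SQLite to show the difference in behaviour; it is not how dbt implements them, and all relation names and rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_reviews (review_date TEXT, review_text TEXT)")
con.execute("INSERT INTO src_reviews VALUES ('2024-01-01', 'first')")

# View: stores only the query, so every read re-runs it (lightweight, always fresh).
con.execute("CREATE VIEW reviews_view AS SELECT * FROM src_reviews")

# Table: stores the result, so repeated reads are cheap but stale until rebuilt.
con.execute("CREATE TABLE reviews_table AS SELECT * FROM src_reviews")

# Incremental: an existing table that only receives rows newer than what it holds.
con.execute("CREATE TABLE reviews_incremental AS SELECT * FROM src_reviews")


def incremental_run():
    con.execute(
        """
        INSERT INTO reviews_incremental
        SELECT * FROM src_reviews
        WHERE review_date > (SELECT MAX(review_date) FROM reviews_incremental)
        """
    )


def count(relation):
    return con.execute(f"SELECT COUNT(*) FROM {relation}").fetchone()[0]


# A new review arrives in the source, followed by one incremental run.
con.execute("INSERT INTO src_reviews VALUES ('2024-01-02', 'second')")
incremental_run()

# Ephemeral: nothing is created in the database; the model is inlined as a CTE.
ephemeral_count = con.execute(
    "WITH reviews_ephemeral AS (SELECT * FROM src_reviews) "
    "SELECT COUNT(*) FROM reviews_ephemeral"
).fetchone()[0]

counts = {
    name: count(name)
    for name in ("reviews_view", "reviews_table", "reviews_incremental")
}
print(counts, ephemeral_count)
```

The view and the incremental table see the new row, while the plain table keeps its old contents until it is rebuilt. The incremental run only appends, which is why it is a poor fit when historical records need to be updated.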
DATA FLOW PROGRESS
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer (basic checks): src_listings, src_hosts, src_reviews
● Core layer:
○ dim_listings_cleansed (cleansing)
○ dim_hosts_cleansed (cleansing)
○ dim_listings_with_hosts (final dimension table)
○ fct_reviews (incremental)
GUIDED EXERCISE
dim_hosts_cleansed.sql
Create a new model in the `models/dim/` folder called
`dim_hosts_cleansed.sql`.
● Use a CTE to reference the `src_hosts` model
● SELECT every column and every record, and add a cleansing step to
host_name:
○ If host_name is not null, keep the original value
○ If host_name is null, replace it with the value 'Anonymous'
○ Use the NVL(column_name, default_null_value) function
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
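The cleansing step can be tried out locally as follows. This is a sketch under assumptions, not the course solution: SQLite stands in for Snowflake, `NVL` is registered by hand so the example does not depend on the SQLite build providing it, and the sample hosts are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Register a stand-in for NVL(value, default) so the query below can use it.
con.create_function(
    "NVL", 2, lambda value, default: default if value is None else value
)

con.execute("CREATE TABLE src_hosts (host_id INTEGER, host_name TEXT)")
con.executemany("INSERT INTO src_hosts VALUES (?, ?)", [(1, "Alice"), (2, None)])

dim_hosts_cleansed_sql = """
WITH src_hosts_cte AS (
    SELECT * FROM src_hosts
)
SELECT
    host_id,
    NVL(host_name, 'Anonymous') AS host_name
FROM src_hosts_cte
ORDER BY host_id
"""
dim_hosts_rows = con.execute(dim_hosts_cleansed_sql).fetchall()
print(dim_hosts_rows)
```

Non-null names pass through unchanged and only the missing name is replaced, which matches the two rules in the exercise.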
SOURCES & SEEDS
LEARNING OBJECTIVES
● Understand the difference between seeds and sources
● Understand source-freshness
● Learn how to create snapshots on top of our listings and hosts models
SNAPSHOTS
Overview
TYPE-2 SLOWLY CHANGING DIMENSIONS
host_id host_name email dbt_valid_from dbt_valid_to
1 Alice [email protected]
2 Bob [email protected]
● Strategies:
○ Check: Any change in a set of columns (or all columns) will be picked
up as an update.
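The check strategy can be imitated in a few lines: compare the checked columns of each source row with the current snapshot row, close the old version if they differ, and append the new one. This is a simplified simulation, not dbt's implementation; the `snapshot_check` helper, the e-mail addresses, and the dates are hypothetical.

```python
def snapshot_check(snapshot, source, key, check_cols, now):
    """Apply one snapshot run using a 'check'-style comparison.

    snapshot: list of dicts with dbt_valid_from / dbt_valid_to columns
    source:   list of dicts holding the current source rows
    """
    current = {row[key]: row for row in snapshot if row["dbt_valid_to"] is None}
    for src in source:
        old = current.get(src[key])
        changed = old is not None and any(old[c] != src[c] for c in check_cols)
        if changed:
            old["dbt_valid_to"] = now  # close the outdated version
        if old is None or changed:
            snapshot.append({**src, "dbt_valid_from": now, "dbt_valid_to": None})
    return snapshot


# Hypothetical host rows; the e-mail addresses are made up for the example.
source_v1 = [
    {"host_id": 1, "host_name": "Alice", "email": "alice@example.com"},
    {"host_id": 2, "host_name": "Bob", "email": "bob@example.com"},
]
source_v2 = [
    {"host_id": 1, "host_name": "Alice", "email": "alice@example.com"},
    {"host_id": 2, "host_name": "Bob", "email": "bob@example.org"},
]

check_cols = ["host_name", "email"]
snapshot = snapshot_check([], source_v1, "host_id", check_cols, "2024-01-01")
snapshot = snapshot_check(snapshot, source_v2, "host_id", check_cols, "2024-02-01")

for row in snapshot:
    print(row)
```

After the second run Bob has two rows: the old one closed with a `dbt_valid_to` value and the new one still open. Alice is untouched because none of her checked columns changed.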
GUIDED EXERCISE
scd_raw_hosts.sql
Create a new snapshot in the `snapshots/` folder called
`scd_raw_hosts.sql`.
● Singular tests are SQL queries stored in the `tests/` folder that are expected to return an empty result set
● Built-in generic tests:
○ unique
○ not_null
○ accepted_values
○ relationships
● You can define your own custom generic tests or import tests from dbt packages (discussed later)
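Both kinds of test come down to the same idea: a query that selects the offending rows, so an empty result means the test passes. The sketch below imitates that on SQLite. The helper functions are hypothetical stand-ins that only borrow the names `unique` and `not_null`; they are not dbt's own test definitions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_hosts_cleansed (host_id INTEGER, host_name TEXT)")
con.executemany(
    "INSERT INTO dim_hosts_cleansed VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (2, "Anonymous")],  # host_id 2 duplicated on purpose
)


def failing_rows(test_sql):
    """A test passes when its query returns no rows."""
    return con.execute(test_sql).fetchall()


# Singular-style test: a hand-written query that selects offending rows.
singular_failures = failing_rows(
    "SELECT * FROM dim_hosts_cleansed WHERE host_name = ''"
)


# Generic-style tests: the same query shape, parameterised by table and column.
def not_null(table, column):
    return failing_rows(f"SELECT * FROM {table} WHERE {column} IS NULL")


def unique(table, column):
    return failing_rows(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    )


not_null_failures = not_null("dim_hosts_cleansed", "host_id")
unique_failures = unique("dim_hosts_cleansed", "host_id")
print(singular_failures, not_null_failures, unique_failures)
```

Here the singular-style and not-null checks pass with empty results, while the uniqueness check fails and reports the duplicated `host_id`.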
GUIDED EXERCISE
TEST dim_hosts_cleansed
Create generic tests for the `dim_hosts_cleansed` model.
● A special macro, called test, can be used for implementing your own generic tests
● dbt packages can be installed easily to get access to a plethora of macros and tests
DOCUMENTATION
LEARNING OBJECTIVES
● Understand how to document models
● You can add your own assets (like images) to a special project folder
ANALYSES, HOOKS AND EXPOSURES
LEARNING OBJECTIVES
● Understand how to store ad-hoc analytical queries in dbt
● Hook types:
Cloud
DAGSTER - SIMILAR DATA CONCEPTS