

Course Introduction
OBJECTIVES
● Data-maturity model
● dbt and data architectures
● Data warehouses, data lakes, and lakehouses
● ETL and ELT procedures
● dbt fundamentals
● Analytics Engineering
TOP-DOWN
1. Theory
2. Practice
3. Project

BOTTOM-UP
1. Practice
2. Project
3. Theory
Data Maturity Model

Maslow's Hierarchy of Needs
1. Physiological Needs
2. Safety Needs
3. Belongingness & Love Needs
4. Esteem Needs
5. Self-actualization

Data-Maturity Model
1. Data Collection
2. Data Wrangling
3. Data Integration
4. BI and Analytics
5. Artificial Intelligence
Typical Data Architecture
1. Data Collection
2. Data Wrangling
3. Data Integration
4. ML/AI/BI

Data Collection

Data Wrangling

Data Integration
ETL and ELT

ETL (Extract, Transform, Load): data is transformed before it is loaded into the warehouse.

ELT (Extract, Load, Transform): raw data is loaded into the warehouse first and transformed there; this is the approach dbt is built around.
Normalization

1NF
First normal form: every column holds atomic values and there are no repeating groups.

2NF
Second normal form: 1NF, plus no non-key column depends on only part of a composite key.

3NF
Third normal form: 2NF, plus no non-key column depends on another non-key column (no transitive dependencies).
Data Warehouse

On-premise and cloud deployments

PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png


External Tables

External tables let the warehouse query files that live outside of it, typically in a data lake.
Data Lake

Stores unstructured, semi-structured, and structured data

PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png


Data Lakehouse

Combines a data lake's flexible storage with data-warehouse-style management and querying

PNG Credits: https://databricks.com/wp-content/uploads/2020/01/data-lakehouse-new-1024x538.png


Slowly Changing Dimensions

SCD Type 0
Not updating the DWH table when a dimension changes.

SCD Type 1
Updating the DWH table when a dimension changes, overwriting the original data.
Example: a listing has no air conditioning; when air conditioning is installed, the DWH record is simply overwritten.

SCD Type 2
Keeping full history: adding additional (historic) rows for each dimension change.
Example: the current rental price is $300; when it changes to $450, a new row is added to the DWH and the old row is kept as history.

SCD Type 3
Keeping limited data history: adding separate columns for the original and current value.
Example: a listing is first listed as Private, the host later changes it to Entire and then to Shared; only the original and the current value are kept.

dbt Overview

dbt handles the T in ELT: it transforms data that is already loaded into the warehouse using SQL models, and manages the dependencies, tests, and documentation around those transformations.

PNG Credits: https://data.solita.fi/wp-content/uploads/2021/06/dbt-transform.png


Analytics Engineering

Common Table Expression (CTE)

Syntax and example (see the sketch below)
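The syntax and example slides were images in this export; a minimal sketch of both, with illustrative table and column names:

```sql
-- Syntax: name one or more subqueries up front, then select from them
WITH cte_name AS (
    SELECT 1 AS example_column
)
SELECT example_column
FROM cte_name;

-- Example: reference a raw table through a CTE
-- (listing_id is an assumed column; the renames follow the src_reviews exercise later)
WITH raw_reviews AS (
    SELECT * FROM AIRBNB.RAW.RAW_REVIEWS
)
SELECT
    listing_id,
    date     AS review_date,
    comments AS review_text
FROM raw_reviews;
```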
PROJECT OVERVIEW
Analytics Engineering with Airbnb
ANALYTICS ENGINEERING WITH AIRBNB
● Simulating the life of an Analytics Engineer at Airbnb

● Loading, Cleansing, Exposing data

● Writing tests, automations, and documentation

● Data source: Inside Airbnb: Berlin


TECH STACK
REQUIREMENTS
● Modeling changes are easy to follow and revert
● Explicit dependencies between models
● Explore dependencies between models
● Data quality tests
● Error reporting
● Incremental load of fact tables
● Track history of dimension tables
● Easy-to-access documentation
NEXT STEPS - SETUP
● Snowflake registration
● Dataset import
● dbt installation
● dbt setup, Snowflake connection

That’s it :)
SNOWFLAKE
Registering a Trial account
DATA FLOW
Overview
INPUT DATA MODEL
DATA FLOW OVERVIEW
● Raw layer: raw_listings, raw_hosts, raw_reviews
● Staging layer: src_listings, src_hosts, src_reviews (basic checks)
● Core layer: Listings, Hosts, Reviews (cleansed)
● Presentation layer: Listings mart table, Reviews mart table
DBT SETUP
Mac
VIRTUALENV SETUP
● Install Python 3.11 and the Python virtualenv package

● Create a virtualenv

● Activate virtualenv
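
A minimal sketch of these steps on macOS (the environment name dbt-env is illustrative):

```bash
pip3 install virtualenv          # install the virtualenv package
virtualenv dbt-env               # create a virtualenv
source dbt-env/bin/activate      # activate it
```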
DBT SETUP
Windows
INSTALLING DBT
● Install Python3

● Create a virtualenv

● Activate virtualenv

● Install dbt and the dbt-snowflake connector
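
A sketch of the equivalent commands on Windows (environment name illustrative):

```bash
python -m venv dbt-env           # create a virtualenv
dbt-env\Scripts\activate         # activate it
pip install dbt-snowflake        # installs dbt and the Snowflake adapter
```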


VSCODE SETUP
Installing the dbt power user extension
DBT SETUP
dbt init and connecting to Snowflake
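A sketch of the flow, assuming a project name of dbtlearn; `dbt init` prompts for the Snowflake account, user, role, warehouse, database, and schema and writes them to your profiles.yml:

```bash
dbt init dbtlearn    # scaffold a new project and configure the Snowflake connection
cd dbtlearn
dbt debug            # verify that the connection works
```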
MODELS
LEARNING OBJECTIVES
● Understand the data flow of our project
● Understand the concept of Models in dbt
● Create three basic models:
○ src_listings
○ src_reviews: guided exercises
○ src_hosts: individual lab
MODELS OVERVIEW
● Models are the basic building block of your business logic
● Materialized as tables, views, etc…
● They live in SQL files in the `models` folder
● Models can reference each other and use templates and macros
DATA FLOW PROGRESS
Raw layer → Staging layer
● raw_listings → src_listings (basic checks)
● raw_hosts → src_hosts (basic checks)
● raw_reviews → src_reviews (basic checks)
GUIDED EXERCISE

src_reviews.sql
Create a new model in the `models/src/` folder called
`src_reviews.sql`.
● Use a CTE to reference the AIRBNB.RAW.RAW_REVIEWS table
● SELECT every column and every record, and rename the following columns:
○ date to review_date
○ comments to review_text
○ sentiment to review_sentiment
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
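
One possible shape of the model; the exact column list comes from the raw table, and listing_id and reviewer_name are assumed here:

```sql
-- models/src/src_reviews.sql (a sketch, not the official solution)
WITH raw_reviews AS (
    SELECT * FROM AIRBNB.RAW.RAW_REVIEWS
)
SELECT
    listing_id,                      -- assumed column
    date      AS review_date,
    reviewer_name,                   -- assumed column
    comments  AS review_text,
    sentiment AS review_sentiment
FROM raw_reviews
```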
MATERIALIZATIONS
LEARNING OBJECTIVES
● Understand how models can be connected

● Understand the four built-in materializations

● Understand how materializations can be configured on the file and project level

● Use dbt run with extra parameters


MATERIALIZATIONS
MATERIALIZATIONS OVERVIEW

View
● Use it: you want a lightweight representation; you don't reuse the data too often
● Don't use it: you read from the same model several times

Table
● Use it: you read from this model repeatedly
● Don't use it: for single-use models, or when your model is populated incrementally

Incremental (table appends)
● Use it: fact tables; appends to tables
● Don't use it: when you want to update historical records

Ephemeral (CTEs)
● Use it: you merely want an alias to your data
● Don't use it: you read from the same model several times
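
A sketch of how a materialization can be set at the file level and at the project level (the project name dbtlearn is an assumption):

```sql
-- at the top of a model file, e.g. models/dim/dim_listings_cleansed.sql
{{ config(materialized='view') }}
```

```yaml
# dbt_project.yml
models:
  dbtlearn:
    +materialized: view
    dim:
      +materialized: table
```

With these in place, `dbt run --select dim_listings_cleansed` rebuilds a single model and `dbt run --full-refresh` rebuilds incremental models from scratch.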
DATA FLOW PROGRESS
Raw layer → Staging layer → Core layer
● raw_listings → src_listings (basic checks) → dim_listings_cleansed (cleansing)
● raw_hosts → src_hosts (basic checks) → dim_hosts_cleansed (cleansing)
● raw_reviews → src_reviews (basic checks) → fct_reviews (incremental)
● dim_listings_cleansed + dim_hosts_cleansed → dim_listings_with_hosts (final dimension table)
GUIDED EXERCISE

dim_hosts_cleansed.sql
Create a new model in the `models/dim/` folder called
`dim_hosts_cleansed.sql`.
● Use a CTE to reference the `src_hosts` model
● SELECT every column and every record, and add a cleansing step to
host_name:
○ If host_name is not null, keep the original value
○ If host_name is null, replace it with the value ‘Anonymous’
○ Use the NVL(column_name, default_null_value) function
● Execute `dbt run` and verify that your model has been created
(You can find the solution among the resources)
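
A sketch of the cleansing model; created_at and updated_at are assumed columns of src_hosts:

```sql
-- models/dim/dim_hosts_cleansed.sql (a sketch, not the official solution)
WITH src_hosts AS (
    SELECT * FROM {{ ref('src_hosts') }}
)
SELECT
    host_id,
    NVL(host_name, 'Anonymous') AS host_name,  -- replace NULL host names
    is_superhost,
    created_at,                                -- assumed column
    updated_at                                 -- assumed column
FROM src_hosts
```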
SOURCES & SEEDS
LEARNING OBJECTIVES
● Understand the difference between seeds and sources

● Understand source-freshness

● Integrate sources into our project


SOURCES AND SEEDS OVERVIEW
● Seeds are local files that you upload to the data warehouse from dbt

● Sources are an abstraction layer on top of your input tables

● Source freshness can be checked automatically
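
A sketch of a source definition with a freshness check; the source name, table names, and thresholds are illustrative:

```yaml
# models/sources.yml
version: 2

sources:
  - name: airbnb
    schema: raw
    tables:
      - name: listings
        identifier: raw_listings
      - name: reviews
        identifier: raw_reviews
        loaded_at_field: date
        freshness:
          warn_after: {count: 1, period: hour}
          error_after: {count: 24, period: hour}
```

`dbt source freshness` runs the configured checks.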


SNAPSHOTS
LEARNING OBJECTIVES
● Understand how dbt handles type-2 slowly changing dimensions

● Understand snapshot strategies

● Learn how to create snapshots on top of our listings and hosts models
SNAPSHOTS
Overview
TYPE-2 SLOWLY CHANGING DIMENSIONS

Source table:

| host_id | host_name | email |
|---|---|---|
| 1 | Alice | [email protected] |
| 2 | Bob | [email protected] |

Snapshot table (dbt adds dbt_valid_from / dbt_valid_to; the superseded row gets its dbt_valid_to filled in):

| host_id | host_name | email | dbt_valid_from | dbt_valid_to |
|---|---|---|---|---|
| 1 | Alice | [email protected] | 2022-01-01 00:00:00 | null |
| 2 | Bob | [email protected] | 2022-01-01 00:00:00 | 2022-03-01 12:53:20 |
| 3 | Bob | [email protected] | 2022-03-01 12:53:20 | null |



CONFIGURATION AND STRATEGIES
● Snapshots live in the snapshots folder
● Strategies:
○ Timestamp: a unique key and an updated_at field are defined on the source model; these columns are used to detect changes.
○ Check: any change in a set of columns (or all columns) is picked up as an update.
GUIDED EXERCISE

scd_raw_hosts.sql
Create a new snapshot in the `snapshots/` folder called
`scd_raw_hosts.sql`.

● Set the target table name to scd_raw_hosts
● Set the output schema to dev
● Use the timestamp strategy; figure out which unique key and updated_at column to use
● Execute `dbt snapshot` and verify that your snapshot has been created

(You can find the solution among the resources)
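
A sketch of what the snapshot could look like; the unique key, updated_at column, and the raw table reference are assumptions to verify against the raw hosts table:

```sql
-- snapshots/scd_raw_hosts.sql (a sketch, not the official solution)
{% snapshot scd_raw_hosts %}

{{
    config(
        target_schema='dev',
        unique_key='id',          -- assumed primary key of the raw hosts table
        strategy='timestamp',
        updated_at='updated_at'   -- assumed timestamp column
    )
}}

SELECT * FROM AIRBNB.RAW.RAW_HOSTS

{% endsnapshot %}
```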


TESTS
LEARNING OBJECTIVES
● Understand how tests can be defined

● Configure built-in generic tests

● Create your own singular tests


TESTS OVERVIEW
● There are two types of tests: singular and generic
● Singular tests are SQL queries stored in the tests folder which are expected to return an empty resultset
● There are four built-in generic tests:
○ unique
○ not_null
○ accepted_values
○ relationships
● You can define your own custom generic tests or import tests from dbt packages (discussed later)
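
A sketch of how the built-in generic tests are attached to a model in a yaml file; listing_id, room_type, and the accepted values are assumptions:

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_listings_cleansed
    columns:
      - name: listing_id           # assumed column
        tests:
          - unique
          - not_null
      - name: host_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_hosts_cleansed')
              field: host_id
      - name: room_type            # assumed column
        tests:
          - accepted_values:
              values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
```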
GUIDED EXERCISE

TEST dim_hosts_cleansed
Create generic tests for the `dim_hosts_cleansed` model.

● host_id: unique values, no nulls
● host_name shouldn't contain any null values
● is_superhost should only contain the values t and f
● Execute `dbt test` to verify that your tests are passing
● Bonus: figure out which tests to write for `fct_reviews` and implement them

(You can find the solution among the resources)
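
Singular tests are plain SELECT statements; as an illustration, a test like the one below would fail if fct_reviews ever contained a review dated in the future (review_date comes from the rename in src_reviews; the check itself is a hypothetical example, not from the course):

```sql
-- tests/no_future_review_dates.sql (illustrative singular test)
-- A singular test fails when it returns one or more rows.
SELECT *
FROM {{ ref('fct_reviews') }}
WHERE review_date > CURRENT_TIMESTAMP()
```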


MACROS, CUSTOM
TESTS AND
PACKAGES
LEARNING OBJECTIVES
● Understand how macros are created

● Use macros to implement your own generic tests

● Find and install third-party dbt packages


MACROS, CUSTOM TESTS AND PACKAGES
● Macros are Jinja templates created in the macros folder
● There are many built-in macros in dbt
● You can use macros in model definitions and tests
● A special macro, called test, can be used for implementing your own generic tests
● dbt packages can be installed easily to get access to a plethora of macros and tests
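
A sketch of a custom generic test implemented as a macro; the test name and threshold are illustrative:

```sql
-- macros/positive_value.sql
{% test positive_value(model, column_name) %}
-- Fails for every row where the column is smaller than 1
SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 1
{% endtest %}
```

It can then be listed under a column's `tests:` just like a built-in generic test, and third-party packages such as dbt_utils are pulled in through a `packages.yml` file followed by `dbt deps`.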
DOCUMENTATION
LEARNING OBJECTIVES
● Understand how to document models

● Use the documentation generator and server

● Add assets and markdown to the documentation

● Discuss dev vs. production documentation serving


DOCUMENTATION OVERVIEW
● Documentation can be defined in two ways:
○ In yaml files (like schema.yml)
○ In standalone markdown files
● dbt ships with a lightweight documentation web server
● For customizing the landing page, a special file, overview.md, is used
● You can add your own assets (like images) to a special project folder
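
A sketch of both styles; the doc block name and the descriptions are illustrative:

```yaml
# models/schema.yml
version: 2

models:
  - name: dim_hosts_cleansed
    description: Cleansed hosts dimension built on top of src_hosts.
    columns:
      - name: host_name
        description: "{{ doc('host_name_doc') }}"   # pulls in the markdown block below
```

```
# models/docs.md
{% docs host_name_doc %}
The host's display name; NULL values are replaced with 'Anonymous'.
{% enddocs %}
```

`dbt docs generate` builds the documentation and `dbt docs serve` starts the local web server.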
ANALYSES, HOOKS
AND EXPOSURES
LEARNING OBJECTIVES
● Understand how to store ad-hoc analytical queries in dbt

● Work with dbt hooks to manage table permissions

● Build a dashboard in Preset

● Create a dbt exposure to document the dashboard


HOOKS
● Hooks are SQL statements that are executed at predefined times

● Hooks can be configured on the project, subfolder, or model level

● Hook types:

○ on_run_start: executed at the start of dbt {run, seed, snapshot}

○ on_run_end: executed at the end of dbt {run, seed, snapshot}

○ pre-hook: executed before a model/seed/snapshot is built

○ post-hook: executed after a model/seed/snapshot is built
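
A sketch of a post-hook used for permission management, the use case named in the objectives; the project name and role are assumptions:

```yaml
# dbt_project.yml
models:
  dbtlearn:
    +post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTER"
```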


ORCHESTRATION
THE ORCHESTRATION LANDSCAPE

DAGSTER - SIMILAR DATA CONCEPTS

Image source: https://docs.dagster.io/integrations/dbt


GREAT UI

Image source: https://docs.dagster.io/integrations/dbt


EASY TO DEBUG

Image source: https://docs.dagster.io/integrations/dbt
