19.1 - Data Pipelines

Hello!

Data Pipelines
Agenda:
● Motivation
● Data Systems for Storage: DB, Data Lake, Data
Warehouse
● ELT Pipeline
● ETL Pipeline

Motivation

So far we've been using local CSV files as data sources, but in reality our
data can come from many disparate sources.

Most companies deal with massive amounts of data. To analyze all of that data,
you need a single view of the entire data set. When that data resides in
multiple systems and services, it needs to be combined in ways that make sense
for in-depth analysis.

It is also common for new data to be generated every day, hour, minute, or second,
and for a lot of use cases we want to quickly store and process it and feed it to
other applications or Machine Learning models.

Motivation
For example, say you work as a Machine Learning Engineer for a Food Delivery App. They ask
you for help with the release of a new feature that shows the most popular products.

To implement this, you will need to:

● Access sales data and product information, which may be stored in different databases.
● Write a script that aggregates the thousands or millions of sales by product and gives
you the best-selling products.
● Store this information in a new database to be consumed by the Back-end engineering
team and shown in the App.
● Run this job again every day, week, or month, because users' interests and
interactions change over time.

The request seems simple, but if we look at the steps needed to implement it, and take into
account the massive volume of data we must handle, it would be tough to build and maintain
without the proper data engineering tools and processes.

During the rest of the lessons, we're going to learn about the fundamentals of data
engineering and modern data stacks.

Data Systems for Storage

Before we start working with our data, it's important to understand the different
types of systems that our data can live in. So far we've worked with files, APIs, etc.,
but there are several types of data systems that are widely adopted in industry for
different purposes.

We will briefly introduce the concepts of:

● Database
● Data lake
● Data warehouse

Data Systems for Storage
A popular storage option is a database, which is an organized
collection of structured data that adheres to either:

● A relational schema (tables with rows and columns), often
referred to as a SQL database.
● A non-relational schema (key/value, graph, etc.), often
referred to as a NoSQL database.

A database is an online transaction processing (OLTP) system
because it's typically used for day-to-day CRUD (create, read,
update, delete) operations, where information is typically
accessed by rows. However, databases are generally used to store
data from one application and are not designed to hold data from
across many sources for the purpose of analytics.

Popular database options include PostgreSQL, MySQL,
MongoDB, Cassandra, etc.
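
To make the row-oriented, transactional nature of OLTP concrete, here is a minimal sketch of
CRUD operations using Python's built-in sqlite3 module. The "orders" table and its columns are
hypothetical, chosen only for illustration.

import sqlite3

# A minimal sketch of row-oriented CRUD on an OLTP-style table.
# The "orders" table and its columns are hypothetical.
conn = sqlite3.connect("app.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER)"
)

# Create: insert a single row (a typical transactional write)
conn.execute("INSERT INTO orders (product, quantity) VALUES (?, ?)", ("pizza", 2))

# Read: fetch one specific row by its key
row = conn.execute("SELECT * FROM orders WHERE id = ?", (1,)).fetchone()

# Update: modify that row
conn.execute("UPDATE orders SET quantity = ? WHERE id = ?", (3, 1))

# Delete: remove it
conn.execute("DELETE FROM orders WHERE id = ?", (1,))

conn.commit()
conn.close()

Notice that every operation touches individual rows identified by a key, which is exactly the
access pattern OLTP databases are optimized for.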

Data Systems for Storage
A data lake is a flat data management system that stores raw
objects. It's a great option for inexpensive storage and has the
capability to hold all types of data (unstructured, semi-structured
and structured).

Object stores are becoming the standard for data lakes with
default options across the popular cloud providers.
Unfortunately, because data is stored as objects in a data lake,
it's not designed for operating on structured data.

Popular data lake options include Amazon S3, Azure Blob
Storage, Google Cloud Storage, etc.
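
As a rough sketch of what "storing raw objects" looks like in practice, a file could be pushed
as-is into an S3-based data lake with boto3. The bucket name and object key below are made-up
placeholders, not part of the lesson's project.

import boto3

# A minimal sketch of loading a raw object into an S3-based data lake.
# The bucket name and key are hypothetical placeholders.
s3 = boto3.client("s3")

# Raw files of any format (CSV, JSON, images, logs) can be stored as-is,
# without defining a schema up front.
s3.upload_file("sales_2023-01-01.json", "my-company-data-lake", "raw/sales/2023-01-01.json")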

Data Systems for Storage
A data warehouse (DWH) is a type of database that's designed
for storing structured data from many different sources for
downstream analytics and data science. It's an online analytical
processing (OLAP) system that's optimized for performing
aggregation operations across column values rather than
accessing specific rows.

Popular data warehouse options include Snowflake, Google
BigQuery, Amazon Redshift, Hive, etc.
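
The kind of query a warehouse is optimized for scans a whole column across many rows and
aggregates it, rather than looking up a single record. A small sketch, using SQLite as a
stand-in for a real warehouse and a hypothetical "sales" table:

import sqlite3

# A sketch of the kind of aggregation an OLAP system is optimized for:
# summing one column over millions of rows, grouped by another column.
# The "sales" table is hypothetical; SQLite stands in for a real warehouse.
conn = sqlite3.connect("warehouse.db")
monthly_revenue = conn.execute(
    """
    SELECT strftime('%Y-%m', sold_at) AS month, SUM(amount) AS revenue
    FROM sales
    GROUP BY month
    ORDER BY month
    """
).fetchall()
conn.close()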

Data Lake vs Data Warehouse

Characteristic | Data lake                                                              | Data warehouse
Type           | Structured, semi-structured, unstructured; relational, non-relational | Structured; relational
Schema         | Written at the time of analysis (schema on read)                      | Designed prior to the DW implementation (schema on write)
Format         | Raw, unfiltered                                                        | Processed, vetted
Sources        | Big data, IoT, social media, streaming data                            | Application, business, transactional data, batch reporting
Scalability    | Easy to scale at a low cost                                            | Difficult and expensive to scale
ELT/ETL Pipelines

A Data Pipeline is a system for transporting data from one location (the source)
to another (the destination, such as a data warehouse). Data is transformed
and optimized along the way, and it eventually reaches a state that can be
analyzed and used to develop business insights.

The two most popular data pipelines today are ELT and ETL.

ELT Pipeline

ELT Pipeline refers to the process of extracting data from source systems,
loading it into a storage environment, and then transforming it using
in-database operations like SQL.

It requires the ability to store large amounts of raw data in the beginning.

Extract and Load

The first step in our data pipeline is to
extract data from a source and load it into
the appropriate destination.

While we could construct custom scripts
to do this manually or on a schedule, an
ecosystem of data ingestion tools has
already standardized the entire process.

They all come equipped with connectors
that allow for extraction, normalization,
cleaning and loading between sources and
destinations. These pipelines can be
scaled, monitored, etc., all with very little
to no code. Popular data ingestion
tools include Fivetran, Airbyte, Stitch, etc.
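
To see what such a custom extract-and-load script might look like without any ingestion tool,
here is a minimal sketch in Python: it reads raw data from a CSV file and a REST API and loads
it unchanged into a destination database. The file name, URL, and table names are assumptions
for illustration only.

import csv
import sqlite3
import requests

# A minimal extract-and-load sketch: pull raw data from a CSV file and an API,
# then load it unchanged into a destination database.
# The file name, URL, and table names are hypothetical.
conn = sqlite3.connect("destination.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (product_id TEXT, quantity INTEGER)")
conn.execute("CREATE TABLE IF NOT EXISTS raw_products (id TEXT, name TEXT)")

# Extract from a CSV file and load it row by row
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):
        conn.execute(
            "INSERT INTO raw_sales (product_id, quantity) VALUES (?, ?)",
            (row["product_id"], int(row["quantity"])),
        )

# Extract from a (hypothetical) REST API and load the response
products = requests.get("https://example.com/api/products").json()
for p in products:
    conn.execute("INSERT INTO raw_products (id, name) VALUES (?, ?)", (p["id"], p["name"]))

conn.commit()
conn.close()

Even this toy version shows why ingestion tools are popular: scheduling, retries, schema
changes, and monitoring would all have to be handled by hand.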

Extract and Load
Data Sources

The data sources we want to extract from
can be anywhere.

Regardless of the source of our data, the
type of data should fit into one of these
categories:
● Structured: organized data stored in
an explicit structure (ex. tables)
● Semi-structured: data with some
structure but no formal schema or
data types (web pages, CSV, JSON,
etc.)
● Unstructured: qualitative data with
no formal structure (text, images,
audio, etc.)

Transform
Once our data is extracted and loaded, we need to
transform it so that it's ready for downstream
applications.

Common transformations include defining schemas,
filtering, cleaning and joining data across tables, etc.

While we could do all of these things with SQL in our
data warehouse (saving queries as tables or views),
there's a tool that is gaining a lot of popularity these
days called dbt (data build tool).

dbt delivers production functionality around version
control, testing, documentation, packaging, etc. out of
the box. This becomes crucial for maintaining
observability and high-quality data workflows.
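
Tying this back to the best-selling-products example, a transform step can be as simple as a
SQL query whose result is saved as a new table inside the database. A sketch, run with
sqlite3 and reusing the hypothetical raw tables loaded earlier (dbt would wrap queries like
this one in versioned, tested models):

import sqlite3

# A sketch of a transform step run inside the database: aggregate the raw sales
# loaded earlier into a "popular_products" table ready for downstream use.
# Table and column names are hypothetical.
conn = sqlite3.connect("destination.db")
conn.executescript(
    """
    DROP TABLE IF EXISTS popular_products;
    CREATE TABLE popular_products AS
    SELECT p.name, SUM(s.quantity) AS total_sold
    FROM raw_sales s
    JOIN raw_products p ON p.id = s.product_id
    GROUP BY p.name
    ORDER BY total_sold DESC;
    """
)
conn.commit()
conn.close()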

Transform
Learning dbt is out of the scope of the program at the moment, but if you're
interested, you can create an account and take the official dbt Fundamentals
course for free here: https://courses.getdbt.com/courses/fundamentals

ETL Pipeline
ETL is the traditional processing framework, involving extracting data from
source systems, applying some transformations (cleaning) to the data, and finally
loading the data into a data warehouse.

Analysts must predict the data models and insights they will use ahead of time
when transforming before loading. This means that data engineers and analysts
must frequently design and build complex processes and workflows ahead of
time to use data, then redesign and rebuild them as reporting requirements
change.

Given these known issues, you may ask why people built their data pipelines that
way. The reason is that data storage used to be expensive and mostly hosted on
on-premise servers, which introduced the need to clean our data as much as
possible before storing it.

With the advent of powerful data warehouses like Snowflake, Google BigQuery
and AWS Redshift, it has become economical to store raw data in the data
warehouse and then transform it as required.
Always keep in mind…
It can be enticing to set up a modern data stack in your organization, especially
with all the hype. But it's very important to justify the utility before adding
additional complexity:

● Start with a use case that we already have data sources for and that has a
direct impact on the business's bottom line (ex. user churn).

● Start with the simplest infrastructure (source → database → report) and add
complexity (in infrastructure, performance, and team) as needed.

Key takeaways for the Sprint project
● Data sources will come from CSV files and APIs.
● The database for storage will be just a SQLite database, for simplicity.
● The Extract and Load process will be written as custom Python scripts.
● The Transform step will be handled with custom SQL queries.
● A Jupyter notebook will consume the final results to produce the analytics
reports, as in the sketch below.
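
The final consumption step referenced above could look like this rough sketch: the notebook
reads the transformed results back out of the SQLite database with pandas. The database file
and table name are assumptions carried over from the earlier sketches, not the exact project
schema.

import sqlite3
import pandas as pd

# A sketch of the final step: a notebook reading the transformed results
# from the SQLite database to build the analytics report.
# The database file and table name are hypothetical.
conn = sqlite3.connect("destination.db")
top_products = pd.read_sql("SELECT * FROM popular_products LIMIT 10", conn)
conn.close()

# Plot the top products directly in the notebook
top_products.plot.bar(x="name", y="total_sold")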

