19.1 - Data Pipelines
Data Pipelines
Agenda:
● Motivation
● Data Systems for Storage: DB, Data Lake, Data Warehouse
● ELT Pipeline
● ETL Pipeline
Motivation
So far we have been using local CSV files as data sources, but in reality our data can come from many disparate sources.
Most of the companies out there deal with massive amounts of data. To analyze all of that data, you need a single view of the entire data set. When that data resides in multiple systems and services, it needs to be combined in ways that make sense for in-depth analysis.
It is also common for new data to be generated every day, hour, minute or second, and for a lot of use cases we want to quickly store and process it, and feed it to other applications or Machine Learning models.
Motivation
For example, you work as a Machine Learning Engineer for a Food Delivery App. They ask you for help with the release of a new feature that shows the most popular products.
The request seems simple, but if we look at the steps needed to implement it and take into account the massive volume of data we must handle, it would be tough to build and maintain without the proper data engineering tools and processes.
During the rest of the lessons, we're going to learn about the fundamentals of data
engineering and modern data stacks.
Data Systems for Storage
Before we start working with our data, it's important to understand the different
types of systems that our data can live in. So far we've worked with files, APIs, etc., but there are several types of data systems that are widely adopted in industry for different purposes.
Data Systems for Storage
A popular storage option is a database, which is an organized collection of structured data that adheres to either a relational schema (tables with rows and columns) or a non-relational schema (key/value, document, graph, etc.).
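As a minimal illustration of a relational schema, here is a sketch using the SQLite database that the sprint project relies on; the table and column names are just made-up examples:

import sqlite3

# Connect to (or create) a local SQLite database file.
conn = sqlite3.connect("example.db")

# A relational schema: a table with typed columns; each record is a row.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY,
        product  TEXT NOT NULL,
        quantity INTEGER NOT NULL
    )
    """
)
conn.execute("INSERT INTO orders (product, quantity) VALUES (?, ?)", ("pizza", 2))
conn.commit()
conn.close()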
Data Systems for Storage
A data lake is a flat data management system that stores raw
objects. It's a great option for inexpensive storage and has the
capability to hold all types of data (unstructured, semi-structured
and structured).
Object stores are becoming the standard for data lakes with
default options across the popular cloud providers.
Unfortunately, because data is stored as objects in a data lake,
it's not designed for operating on structured data.
Data Systems for Storage
A data warehouse (DWH) is a type of database that's designed
for storing structured data from many different sources for
downstream analytics and data science. It's an online analytical
processing (OLAP) system that's optimized for aggregation operations across column values rather than for accessing specific rows.
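To make the OLAP idea concrete, here is a sketch of an aggregation query, reusing the hypothetical orders table from the earlier example and SQLite only as a stand-in for a real warehouse engine: the query sums a column across many rows instead of fetching one specific record.

import sqlite3

conn = sqlite3.connect("example.db")

# OLAP-style query: aggregate a column across many rows
# (e.g. total quantity sold per product) instead of
# looking up one specific row by its key.
rows = conn.execute(
    """
    SELECT product, SUM(quantity) AS total_sold
    FROM orders
    GROUP BY product
    ORDER BY total_sold DESC
    """
).fetchall()

for product, total_sold in rows:
    print(product, total_sold)

conn.close()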
Data Lake vs Data Warehouse

Characteristics | Data lake | Data warehouse
Data | Structured, semi-structured, unstructured | Structured
Type | Relational, non-relational | Relational
ELT/ETL Pipelines
A Data Pipeline is a system for transporting data from one location (the source) to another (the destination, such as a data warehouse). Data is transformed and optimized along the way, and it eventually reaches a state that can be analyzed and used to develop business insights.
The two most popular data pipeline patterns today are ELT and ETL.
ELT Pipeline
An ELT pipeline refers to the process of extracting data from source systems, loading it into a storage environment, and then transforming it using in-database operations like SQL.
It requires the ability to store large amounts of raw data up front.
Extract and Load
Data Sources
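A minimal sketch of the Extract and Load step, assuming a hypothetical CSV file, a hypothetical API endpoint, and the SQLite database used in the sprint project (all names here are made up): the raw data is loaded as-is, leaving the transformation for later.

import sqlite3

import pandas as pd
import requests

DB_PATH = "warehouse.db"                        # hypothetical SQLite file acting as storage
CSV_PATH = "orders.csv"                         # hypothetical CSV data source
API_URL = "https://example.com/api/products"   # hypothetical API data source


def extract_and_load() -> None:
    conn = sqlite3.connect(DB_PATH)

    # Extract from a CSV file and load it as-is (raw) into the database.
    orders = pd.read_csv(CSV_PATH)
    orders.to_sql("raw_orders", conn, if_exists="replace", index=False)

    # Extract from an API and load the raw JSON records the same way.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    products = pd.DataFrame(response.json())
    products.to_sql("raw_products", conn, if_exists="replace", index=False)

    conn.close()


if __name__ == "__main__":
    extract_and_load()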
Transform
Once our data has been extracted and loaded, we need to transform it so that it's ready for downstream applications.
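Since the sprint project handles the Transform step with custom SQL queries (dbt, mentioned on the next slide, plays the same role in larger stacks), a sketch of an in-database transformation could look like this; the raw table and column names continue the hypothetical example above:

import sqlite3

conn = sqlite3.connect("warehouse.db")  # same hypothetical database as the load step

# Transform inside the database: build an analytics-ready table
# (e.g. most popular products) from the raw tables loaded earlier.
conn.executescript(
    """
    DROP TABLE IF EXISTS popular_products;

    CREATE TABLE popular_products AS
    SELECT p.name   AS product_name,
           COUNT(*) AS times_ordered
    FROM raw_orders o
    JOIN raw_products p ON p.id = o.product_id
    GROUP BY p.name
    ORDER BY times_ordered DESC;
    """
)
conn.commit()
conn.close()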
Transform
Learning about dbt is out of the scope of the career at the moment, but if you're interested, you can create an account and take the official dbt Fundamentals course for free here: https://courses.getdbt.com/courses/fundamentals
ETL Pipeline
ETL is the traditional processing framework: data is extracted from source systems, transformed (cleaned), and finally loaded into a data warehouse.
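For contrast with ELT, here is a minimal ETL sketch under the same hypothetical sources and table names: the data is cleaned and aggregated in memory first, and only the transformed result is loaded into the warehouse.

import sqlite3

import pandas as pd

# Extract.
orders = pd.read_csv("orders.csv")  # hypothetical CSV source

# Transform before loading: clean and aggregate in memory,
# so only the final shape reaches the warehouse.
orders = orders.dropna(subset=["product_id"])
popular = (
    orders.groupby("product_id")
    .size()
    .reset_index(name="times_ordered")
    .sort_values("times_ordered", ascending=False)
)

# Load only the transformed result.
conn = sqlite3.connect("warehouse.db")
popular.to_sql("popular_products", conn, if_exists="replace", index=False)
conn.close()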
Analysts must predict the data models and insights they will use ahead of time
when transforming before loading. This means that data engineers and analysts
must frequently design and build complex processes and workflows ahead of
time to use data, then redesign and rebuild them as reporting requirements
change.
Given these known issues, you may ask why people built data pipelines that way. The reason is that data storage used to be expensive and mostly lived on on-premise servers, which introduced the need to clean our data as much as possible before storing it.
With the advent of powerful data warehouses like Snowflake, Google BigQuery and AWS Redshift, it has become economical to store data in the data warehouse and then transform it as required.
Always keep in mind…
It can be enticing to set up a modern data stack in your organization, especially with all the hype. But it's very important to motivate utility before adding additional complexity:
● Start with a use case that we already have data sources for and that has a
direct impact on the business's bottom line (ex. user churn).
● Start with the simplest infrastructure (source → database → report) and add
complexity (in infrastructure, performance, and team) as needed.
Key takeaways for the Sprint project
● Data sources will come from CSV files and APIs.
● The database for storage will be just a SQLite database, for simplicity.
● The Extract and Load process will be written as custom Python scripts.
● The Transform step will be handled with custom SQL queries.
● A Jupyter notebook will consume the final results to produce the analytics reports (a minimal sketch follows this list).
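As a sketch of that last step (assuming the hypothetical database and table names from the earlier examples), the notebook can read the transformed table straight into a DataFrame:

import sqlite3

import pandas as pd

# Inside the analytics notebook: read the transformed table
# produced by the SQL Transform step into a DataFrame.
conn = sqlite3.connect("warehouse.db")
report = pd.read_sql("SELECT * FROM popular_products", conn)
conn.close()

report.head()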