
What Is ETL, ETL Vs ELT

- ETL stands for Extract, Transform, Load and is a process that extracts data from different sources, transforms the data, and loads it into a data warehouse. ELT loads and transforms data directly in the data warehouse.
- In ETL, data is extracted from sources into a staging area, transformed, and then loaded into the target. In ELT, data is extracted and loaded into the target where transformations are performed.
- ETL is generally used for smaller amounts of data and on-premises systems, while ELT is better suited for large datasets and cloud-based data warehouses as it provides data lake support.

Data Transformation Process

Functions
Lesson 7:
>>>What is ETL?
>>>ETL vs ELT
What is ETL?
• ETL stands for Extract, Transform and Load.
• ETL is a process that extracts data from different source systems, then transforms the data (applying calculations, concatenations, etc.) and finally loads the data into the Data Warehouse system.
• In its simplest form, it means extracting data from multiple sources and loading it into the database of a Data Warehouse.
• The process requires active input from various stakeholders.
• It is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.

Source: https://www.guru99.com
Why do you need ETL?
• It helps companies analyze their business data to make critical business decisions.
• Transactional databases cannot answer the complex business questions that ETL can help answer.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into a data warehouse.
• As data sources change, the Data Warehouse will automatically update.
• ETL is almost essential to the success of a Data Warehouse project.
• It allows verification of data transformation, aggregation and calculation rules.
Why do you need ETL? (continued)
• ETL allows sample data comparison between the source and the target system.
• It can perform complex transformations, which requires an extra area to store the data.
• It helps to migrate data into a Data Warehouse and convert it to the various formats and types required.
• It is a process for accessing source data and manipulating it into the target database.
• It offers deep historical context for the business.
• It improves productivity because it codifies and reuses processes without a need for technical skills.

ETL Process in Data Warehouses
Step 1: Extraction
• Data is extracted from the source system into the staging area.
Step 2: Transformation
• Data is cleansed, mapped and transformed.
Step 3: Loading
• Loading data into the target data warehouse database is the last step.
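
The three steps above can be sketched end to end as follows. This is a minimal illustration only: the in-memory source rows, the `sales` table and the cleansing rule are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

def extract(source_rows):
    """Step 1: copy rows from the source system into a staging list."""
    return [dict(row) for row in source_rows]

def transform(staged_rows):
    """Step 2: cleanse and map the staged rows (hypothetical rules)."""
    cleaned = []
    for row in staged_rows:
        name = row["name"].strip()
        if not name:  # drop rows that fail a basic cleansing check
            continue
        cleaned.append((name.title(), float(row["amount"])))
    return cleaned

def load(conn, rows):
    """Step 3: load the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

# Hypothetical source data; the second row has a blank name and is dropped.
source = [{"name": " alice ", "amount": "10.5"}, {"name": "", "amount": "3"}]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
```

Keeping each step as its own function mirrors the slide: the staging list can be inspected or validated before anything reaches the warehouse.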
Step 1: Extraction
• Data is extracted from the source system into the staging area.
• Transformations, if any, are done in the staging area.
• If corrupted data is copied directly from the source into the Data Warehouse, rollback will be a challenge.
• The staging area gives an opportunity to validate extracted data.
• The Data Warehouse needs to integrate several different source systems.
• Hence one needs a logical data map before data is extracted and loaded physically.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction, without update notification
3. Partial Extraction, with update notification
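
The slide only names the three methods, so the sketch below is one plausible reading rather than a definitive one. It assumes each source row carries a hypothetical `id` and `updated_at` field, and that "update notification" means the source system reports which keys have changed.

```python
def full_extraction(source_rows):
    """Method 1: copy every row, regardless of what changed."""
    return list(source_rows)

def partial_extraction_without_notification(source_rows, last_extracted_at):
    """Method 2: the source does not announce changes, so the extractor
    finds them itself by comparing a per-row modification timestamp."""
    return [r for r in source_rows if r["updated_at"] > last_extracted_at]

def partial_extraction_with_notification(source_rows, changed_ids):
    """Method 3: the source reports which keys changed; fetch only those."""
    wanted = set(changed_ids)
    return [r for r in source_rows if r["id"] in wanted]

# Hypothetical source rows with integer timestamps.
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
```

For example, `partial_extraction_without_notification(source, 200)` picks up rows 2 and 3, while `partial_extraction_with_notification(source, [3])` picks up row 3 only.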

Some validations are done during Extraction:
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type check
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place or not
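
A few of the checks above could look like this in the staging area. The function name, the `id` key and the numeric `amount` field are assumptions made for illustration; a real pipeline would check against its own logical data map.

```python
def validate_extract(source_rows, staged_rows, key="id"):
    """Return (clean_rows, problems) for a staged extract."""
    problems = []

    # Reconcile records with the source data (here: a simple row count).
    if len(source_rows) != len(staged_rows):
        problems.append("row count does not reconcile with the source")

    clean, seen = [], set()
    for row in staged_rows:
        if row.get(key) is None:
            # Check whether the key is in place.
            problems.append("missing key")
        elif not isinstance(row.get("amount"), (int, float)):
            # Data type check on a hypothetical numeric field.
            problems.append(f"bad data type for amount in row {row[key]}")
        elif row[key] in seen:
            # Remove duplicate data.
            problems.append(f"duplicate row {row[key]}")
        else:
            seen.add(row[key])
            clean.append(row)
    return clean, problems
```

Returning the problems alongside the clean rows, rather than raising on the first failure, lets the whole batch be reviewed before anything is loaded.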

Step 2: Transformation
• Extracted data needs to be cleansed, mapped and transformed.
• In this step, you apply a set of functions to the extracted data.
• Data that does not require any transformation is called direct move or pass-through data.
• You can perform customized operations on data.
• For example, if the user wants sum-of-sales revenue, which is not in the database, it can be calculated in this step.
• Similarly, if the first name and the last name in a table are in different columns, it is possible to concatenate them before loading.
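
The two examples on this slide, a derived sum of sales and a concatenated name, might be written as below. The field names (`first_name`, `last_name`, `customer_id`, `amount`) are hypothetical, and summing per customer is an assumption about what "sum-of-sales" is grouped by.

```python
from collections import defaultdict

def add_full_name(rows):
    """Concatenate first and last name into one column before loading."""
    return [
        {**r, "full_name": f"{r['first_name']} {r['last_name']}"}
        for r in rows
    ]

def sum_of_sales(rows):
    """Derive a sum-of-sales value that is not stored in the source."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer_id"]] += r["amount"]
    return dict(totals)
```

Both functions return new data and leave the extracted rows untouched, so pass-through columns survive unchanged.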
Data Integration Issues
The following are Data Integrity Problems:
1. Different spellings of the same name
2. Multiple ways to denote a company name
3. Use of different names for the same thing
4. Different account numbers for the same customer
5. Required fields left blank
6. Invalid product data collected

Validations are done during this stage
• Filtering
• Using rules and lookup tables
• Conversion and encoding handling
• Conversion of Units of Measurements
• Data threshold validation check
• Data flow validation
• Required fields should not be left blank.
• Cleaning
• Splitting a column into multiple columns and merging multiple columns into a single column
• Transposing rows and columns
• Using lookups to merge data
• Using any complex data validation
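
Several of these validations can be combined into a single per-row function, as sketched here. The lookup table, the field names, the grams-to-kilograms conversion and the 0 to 120 age threshold are all hypothetical choices made for the example.

```python
# Hypothetical lookup table used to standardize country values.
COUNTRY_LOOKUP = {"usa": "US", "united states": "US", "u.k.": "GB"}

def transform_row(row):
    """Return (clean_row, None) on success or (None, reason) on rejection."""
    # Required fields should not be left blank.
    for field in ("customer", "country", "weight_g"):
        if row.get(field) in (None, ""):
            return None, f"required field '{field}' is blank"

    # Using rules and lookup tables.
    country = COUNTRY_LOOKUP.get(row["country"].strip().lower())
    if country is None:
        return None, "unknown country"

    # Cleaning: map a missing age to 0.
    age = row.get("age") or 0

    # Data threshold validation check.
    if not 0 <= age <= 120:
        return None, "age outside threshold"

    # Conversion of units of measurement: grams to kilograms.
    return {
        "customer": row["customer"],
        "country": country,
        "age": age,
        "weight_kg": row["weight_g"] / 1000,
    }, None
```

Rejected rows come back with a reason instead of an exception, which makes it easy to route them to a review table.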
Step 3: Loading
• Loading is the last step of the ETL process.
• A huge volume of data needs to be loaded in a relatively short period.
• The load process should therefore be optimized for performance.
• In case of load failure, recovery mechanisms should be configured.
• Administrators need to monitor, resume and cancel loads as per prevailing server performance.
Types of Loading:
1. Initial Load: populating all the Data Warehouse tables.
2. Incremental Load: applying ongoing changes as and when needed, periodically.
3. Full Refresh: erasing the contents of one or more tables and reloading with fresh data.
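
The three load types can be contrasted with a small SQLite sketch. The `customers` table is hypothetical, and treating an incremental load as an upsert keyed on `id` is an assumption; other change-application strategies are possible.

```python
import sqlite3

def initial_load(conn, rows):
    """Initial Load: create and populate the warehouse table."""
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

def incremental_load(conn, changes):
    """Incremental Load: apply only new or changed rows (upsert by id)."""
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", changes)

def full_refresh(conn, rows):
    """Full Refresh: erase the table contents and reload fresh data."""
    conn.execute("DELETE FROM customers")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

def snapshot(conn):
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

conn = sqlite3.connect(":memory:")
initial_load(conn, [(1, "Ann"), (2, "Bob")])
incremental_load(conn, [(2, "Robert"), (3, "Cy")])  # one update, one insert
after_incremental = snapshot(conn)
full_refresh(conn, [(9, "Zed")])
after_refresh = snapshot(conn)
```

After the incremental load the table holds three rows with row 2 updated; after the full refresh only the freshly loaded row remains.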

Load verification
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check the combined values and calculated measures.
• Run data checks on the dimension tables as well as the history tables.
• Check the BI reports on the loaded fact and dimension tables.
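
Some of these checks translate directly into queries against the target. The sketch below assumes a hypothetical `fact_sales` fact table and `dim_customer` dimension table; the orphan-key check is one interpretation of checking fact data against its dimension, not something the slide spells out.

```python
import sqlite3

def verify_load(conn):
    """Count rows that fail each verification rule."""
    def count(sql):
        return conn.execute(sql).fetchone()[0]

    return {
        # Key field data must be neither missing nor null.
        "null_keys": count(
            "SELECT COUNT(*) FROM fact_sales WHERE customer_id IS NULL"
        ),
        # Every fact key should exist in the dimension table.
        "orphan_keys": count(
            "SELECT COUNT(*) FROM fact_sales f "
            "LEFT JOIN dim_customer d ON f.customer_id = d.id "
            "WHERE f.customer_id IS NOT NULL AND d.id IS NULL"
        ),
        # Calculated measures should match their inputs.
        "bad_totals": count(
            "SELECT COUNT(*) FROM fact_sales "
            "WHERE total_cents <> quantity * unit_price_cents"
        ),
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE fact_sales (customer_id INTEGER, quantity INTEGER, "
    "unit_price_cents INTEGER, total_cents INTEGER)"
)
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ann')")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 2, 500, 1000), (None, 1, 300, 300), (7, 1, 200, 250)],
)
report = verify_load(conn)
```

Amounts are stored as integer cents so the calculated-measure comparison is exact rather than subject to floating-point rounding.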
ETL tools

Best practices for the ETL process
• Never try to cleanse all the data.
• Never skip cleansing altogether; always plan to clean something.
• Determine the cost of cleansing the data.
• Determine the cleansing cost for every dirty data element.
• Have auxiliary views and indexes.

Summary:
• ETL stands for Extract, Transform and Load.
• ETL provides a method of moving the data from various
sources into a data warehouse.
• In the first step extraction, data is extracted from the source
system into the staging area.
• In the transformation step, the data extracted from source is
cleansed and transformed.
• Loading data into the target data warehouse is the last step
of the ETL process.
ETL vs ELT
• What is ETL?
• ETL is an abbreviation of Extract, Transform and Load.
• It extracts the data from different RDBMS source systems, then transforms the data (applying calculations, concatenations, etc.) and then loads the data into the Data Warehouse system.
• Data flows from the source to the target.
• A transformation engine takes care of any data changes.
• What is ELT?
• ELT (Extract, Load, Transform) is a different way of approaching data movement.
• Instead of transforming the data before it's written, ELT lets the target system do the transformation.
• The data is first copied to the target and then transformed in place.
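
To contrast with ETL, the ELT flow can be sketched by loading raw rows unchanged and letting the target run the transformation as SQL. An in-memory SQLite database stands in for the target system here, and the `raw_orders` and `orders` tables are hypothetical.

```python
import sqlite3

target = sqlite3.connect(":memory:")

# Extract + Load: raw rows are copied into the target as-is (all text).
raw = [("ann", "lee", "10.50"), ("bob", "ray", "4.25")]
target.execute(
    "CREATE TABLE raw_orders (first_name TEXT, last_name TEXT, amount TEXT)"
)
target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)

# Transform: the target system does the work, in place, using its own engine.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT first_name || ' ' || last_name AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    """
)
rows = target.execute(
    "SELECT customer, amount FROM orders ORDER BY customer"
).fetchall()
```

No staging server or separate transformation engine is involved: the raw table stays available in the target, and new transformations can be added later as further queries.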

KEY DIFFERENCE
ETL | ELT
Extract, Transform and Load | Extract, Load, Transform
Loads data first into the staging server and then into the target system | Loads data directly into the target system
Is used for on-premises, relational and structured data | Is used for scalable cloud structured and unstructured data sources
Is mainly used for a small amount of data | Is used for large amounts of data
Doesn't provide data lake support | Provides data lake support
Is easy to implement | Requires niche skills to implement and maintain
Difference between ETL vs. ELT (Parameters)
Parameter | ETL | ELT
Process | Data is transformed at the staging server and then transferred to the Data Warehouse DB. | Data remains in the DB of the Data Warehouse.
Code Usage | Used for compute-intensive transformations and small amounts of data. | Used for high amounts of data.
Transformation | Transformations are done in the ETL server/staging area. | Transformations are performed in the target system.
Time-Load | Data is first loaded into staging and later loaded into the target system. Time intensive. | Data is loaded into the target system only once. Faster.
Time-Transformation | The ETL process needs to wait for transformation to complete. As data size grows, transformation time increases. | In the ELT process, speed is never dependent on the size of the data.
Time-Maintenance | It needs high maintenance as you need to select data to load and transform. | Low maintenance as data is always available.
Implementation Complexity | At an early stage, easier to implement. | To implement the ELT process, an organization should have deep knowledge of tools and expert skills.
Support for Data Warehouse | ETL model used for on-premises, relational and structured data. | Used in scalable cloud infrastructure which supports structured and unstructured data sources.
Data Lake Support | Does not support. | Allows use of a data lake with unstructured data.
Complexity | The ETL process loads only the important data, as identified at design time. | This process involves development from the output backward and loading only relevant data.
Cost | High costs for small and medium businesses. | Low entry costs using online Software as a Service platforms.
Lookups | In the ETL process, both facts and dimensions need to be available in the staging area. | All data will be available because extract and load occur in one single action.
Aggregations | Complexity increases with the additional amount of data in the dataset. | The power of the target platform can process a significant amount of data quickly.
Calculations | Overwrites the existing column, or needs to append the dataset and push it to the target platform. | Easily adds the calculated column to the existing table.
Maturity | The process has been used for over two decades. It is well documented and best practices are easily available. | Relatively new concept and complex to implement.
Hardware | Most tools have unique hardware requirements that are expensive. | Being SaaS, hardware cost is not an issue.
Support for Unstructured Data | Mostly supports relational data. | Support for unstructured data is readily available.
