DATA ENGINEERING
BASICS & GETTING STARTED
DEFINITIONS
Data Engineer
 They build and scale the platforms that enable data collection, processing and
storage for data science/business analytics use.
Data Scientist
 They use linear algebra and multivariable calculus to create new insight from
existing data.
DATA ENGINEERING
Designing, building and scaling systems that organize
data for analytics
ETL (EXTRACT, TRANSFORM, LOAD)
Basic architecture of ETL
Scaling factor
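As a rough illustration of the extract, transform, load flow, here is a minimal Python sketch using only the standard library. The sample records, the `events` table, and the function names are assumptions made for this example and do not come from the slides.

```python
import json
import sqlite3

# Hypothetical raw input: one JSON document per line, as a source system might emit it.
RAW_LINES = [
    '{"user_name": "ana", "amount": "12.50"}',
    '{"user_name": "bob", "amount": "7.25"}',
]

def extract(lines):
    """Extract: parse each raw JSON line into a dict."""
    return [json.loads(line) for line in lines]

def transform(records):
    """Transform: apply a simple schema (name as text, amount as float)."""
    return [(str(r["user_name"]), float(r["amount"])) for r in records]

def load(rows, conn):
    """Load: write the typed rows into a destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_name TEXT, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database stands in for a real destination
load(transform(extract(RAW_LINES)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)
```

Keeping the three stages as separate functions is one possible design; it lets each stage be tested or scaled on its own.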
DATA CLASSIFICATION
Raw data
 Unprocessed data in the format used at the source, e.g. JSON
 No schema applied
Processed data
 Raw data with schema applied
 Stored in event tables/destinations in pipelines
Cooked data
 Processed data that has been summarized.
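The three classes can be sketched in Python as successive stages of the same data. The page-view records and the `PageView` schema below are hypothetical, chosen only to make the stages visible.

```python
import json
from collections import defaultdict
from dataclasses import dataclass

# Raw data: JSON strings exactly as received, no schema applied (hypothetical sample).
raw = [
    '{"page": "/home", "ms": "120"}',
    '{"page": "/home", "ms": "80"}',
    '{"page": "/docs", "ms": "200"}',
]

@dataclass
class PageView:
    """Schema for processed data: typed fields instead of free-form JSON."""
    page: str
    ms: int

# Processed data: raw data with the schema applied.
processed = [PageView(page=d["page"], ms=int(d["ms"])) for d in map(json.loads, raw)]

# Cooked data: processed data summarized, here as the average load time per page.
totals = defaultdict(lambda: [0, 0])  # page -> [sum of ms, count]
for view in processed:
    totals[view.page][0] += view.ms
    totals[view.page][1] += 1
cooked = {page: total / count for page, (total, count) in totals.items()}
print(cooked)
```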
BIG DATA PROPERTIES
Volume
 How much data you have
Velocity
 How fast data is getting to you
Variety
 How different your data is
Veracity
 How reliable your data is
DATA PROCESSING METHODS
BATCH PROCESSING
STREAM PROCESSING
Process data on the fly, as it comes in
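A small Python sketch of the difference, assuming a toy stream of integer events: the batch function waits for the complete data set, while the streaming function updates and emits its result as each event arrives. Both function names are made up for this illustration.

```python
from typing import Iterable, Iterator, List

def batch_total(events: List[int]) -> int:
    """Batch: the whole data set is available, so process it in one pass."""
    return sum(events)

def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream: update the result as each event arrives and emit it immediately."""
    running = 0
    for value in events:
        running += value
        yield running

events = [3, 5, 2]
batch_result = batch_total(events)                  # one answer, after all data is in
stream_results = list(stream_totals(iter(events)))  # one answer per incoming event
print(batch_result, stream_results)
```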
STREAMING METHODS
At Least Once
At Most Once
Exactly Once
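The three guarantees can be compared with a toy simulation. This is only a sketch under stated assumptions: the consumer crashes once while handling one message, "at most once" is modeled as committing the offset before processing, "at least once" as committing after processing, and "exactly once" as at-least-once delivery plus deduplication by message id. Real streaming systems implement these guarantees in more involved ways; `deliver` and its parameters are hypothetical.

```python
def deliver(messages, ack_before_processing, crash_on, dedupe=False):
    """Simulate a consumer that crashes once while handling message id `crash_on`.

    Returns the payloads that were actually applied to the sink.
    """
    sink = []
    seen_ids = set()
    offset = 0        # committed position of the next message to read
    crashed = False
    while offset < len(messages):
        msg_id, payload = messages[offset]
        if ack_before_processing:
            offset += 1                  # commit first
            if msg_id == crash_on and not crashed:
                crashed = True           # crash before processing: message is lost
                continue
            sink.append(payload)
        else:
            if not (dedupe and msg_id in seen_ids):
                sink.append(payload)     # process first
                seen_ids.add(msg_id)
            if msg_id == crash_on and not crashed:
                crashed = True           # crash before commit: message is redelivered
                continue
            offset += 1
    return sink

messages = [(1, "a"), (2, "b"), (3, "c")]
at_most_once = deliver(messages, ack_before_processing=True, crash_on=2)
at_least_once = deliver(messages, ack_before_processing=False, crash_on=2)
exactly_once = deliver(messages, ack_before_processing=False, crash_on=2, dedupe=True)
print(at_most_once, at_least_once, exactly_once)
```

In this simulation the first mode drops the message that was in flight during the crash, the second applies it twice, and the third applies every message once.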
PROCESSING FRAMEWORKS
MAP REDUCE
Key-value pairing.
Organize the data into keys and values.
Sort by the key.
Combine the data with matching keys.
Repeat until you have the final key-value outcome.
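The steps above can be sketched as a single-machine word count in Python. This is an in-memory illustration of the idea, not a distributed MapReduce framework; the phase function names and the sample lines are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (key, value) pair for every word."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    """Sort by key so that pairs with matching keys end up next to each other."""
    return sorted(pairs, key=itemgetter(0))

def reduce_phase(sorted_pairs):
    """Reduce: combine the values of each key into a single result."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }

lines = ["big data", "data engineering", "big data basics"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

In a real cluster the map and reduce phases would run in parallel on many machines, with the sort step moving matching keys to the same reducer.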
DATA STORAGE
Relational Database (SQL)
Document Store (NoSQL)
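To make the contrast concrete, here is a small Python sketch. The relational side uses the standard-library `sqlite3` module; the document side is only mimicked with a dict of JSON strings, which is an assumption for illustration and not a real NoSQL database. The customer records are invented.

```python
import json
import sqlite3

# Relational: columns are declared up front and rows are queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ana', 'Lisbon')")
sql_city = conn.execute("SELECT city FROM customers WHERE id = 1").fetchone()[0]

# Document store (mimicked): each document carries its own structure,
# so fields can be nested or differ from one document to the next.
documents = {
    "1": json.dumps({"name": "Ana", "address": {"city": "Lisbon"}, "tags": ["vip"]}),
    "2": json.dumps({"name": "Bob"}),  # this document has no address field
}
doc = json.loads(documents["1"])
doc_city = doc["address"]["city"]
print(sql_city, doc_city)
```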
THANK YOU
REFERENCES
The Data Engineering Cookbook
https://github.com/andkret/Cookbook
