Data Engineer
Intro - What is Data Engineering?
Enter the data engineer
Data scattered
Not optimized for analyses
Legacy code is causing corrupt data
Data Engineer to the rescue!
Data engineers: making your life easier
Gather data from different sources
Optimize the database for analyses
Remove corrupt data
Data scientist’s life got way easier!
Definition of the job
An engineer that develops, constructs, tests, and maintains architectures such as databases and large-scale
processing systems.
Processing large amounts of data
Use clusters of machines
Data engineer vs Data scientist
Data Engineer:
    Develop scalable data architecture
    Streamline data acquisition
    Set up processes to bring together data
    Clean up corrupt data
    Well versed in cloud technology
Data Scientist:
    Mining data for patterns
    Statistical modelling
    Predictive models using machine learning
    Monitor business processes
    Clean outliers in data
Intro - Tools of the data engineer
Databases
Hold large amounts of data
Support applications
Other databases are used for analyses
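To make the database side concrete, here is a minimal sketch using SQLite as a stand-in for a production database like MySQL or PostgreSQL (the table and data are invented for illustration): we store rows, then run the kind of aggregate query an analyst would use.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers (id, country) VALUES (?, ?)",
    [(1, "BE"), (2, "NL"), (3, "BE")],
)

# An analytical query: customers per country.
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('BE', 2), ('NL', 1)]
```

The same SQL runs largely unchanged against the application database or a separate analytical database.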
Processing
Clean data
Aggregate data
Join data
Data engineers understand the abstractions, not necessarily the implementations
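The three processing steps can be sketched in plain Python on toy records (the data and field names are invented; at scale a framework like Spark would do the same operations over a cluster):

```python
# Toy records; a real pipeline would use Spark DataFrames instead of lists.
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": None},   # corrupt record
    {"customer_id": 1, "amount": 5.0},
]
customers = {1: "Alice", 2: "Bob"}

# Clean: drop records with missing amounts.
clean = [o for o in orders if o["amount"] is not None]

# Aggregate: total amount per customer.
totals = {}
for o in clean:
    totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]

# Join: attach customer names from the second dataset.
report = {customers[cid]: total for cid, total in totals.items()}
print(report)  # {'Alice': 15.0}
```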
Scheduling
Plan jobs with specific intervals
Resolve dependency requirements of jobs
Existing tools: example
Databases: MySQL, PostgreSQL, etc.
Processing: Spark, Hive, etc.
Scheduling: Apache Airflow, Oozie, etc., or simple Bash tools like cron
Intro - A data pipeline
To sum everything up, you can think of the data engineering pipeline through this diagram. It extracts all data
through connections with several databases, transforms it using a cluster computing framework like Spark, and loads
it into an analytical database. Also, everything is scheduled to run in a specific order through a scheduling framework
like Airflow. A small side note here is that the sources can be external APIs or other file formats too. We'll see this in
the exercises.
----------------------------------------> Scheduling (Apache Airflow) ---------------------------------------->
SQL (Accounting)   -----------------\
SQL (Online Store) -----------------> Processing (Apache Spark) -----------------> SQL (Analytics)
NoSQL (Catalog)    -----------------/
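The extract-transform-load flow in the diagram can be sketched end to end with SQLite standing in for both the source and the analytical database (table names and data are invented; Spark would replace the transform step at scale):

```python
import sqlite3

def extract(conn):
    """Pull raw rows out of a source database."""
    return conn.execute("SELECT product, amount FROM sales").fetchall()

def transform(rows):
    """Aggregate sales per product (a cluster framework would do this at scale)."""
    totals = {}
    for product, amount in rows:
        totals[product] = totals.get(product, 0) + amount
    return sorted(totals.items())

def load(conn, rows):
    """Write the results into the analytical database."""
    conn.execute("CREATE TABLE sales_per_product (product TEXT, total REAL)")
    conn.executemany("INSERT INTO sales_per_product VALUES (?, ?)", rows)

# Source database (stand-in for the accounting/store databases above).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (product TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("book", 10.0), ("pen", 2.0), ("book", 5.0)])

# Target analytical database.
target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
print(target.execute("SELECT * FROM sales_per_product").fetchall())
# [('book', 15.0), ('pen', 2.0)]
```

A scheduler like Airflow would then run extract, transform, and load as dependent jobs on a fixed interval.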
Intro - Cloud Providers
Data processing in the cloud
Clusters of machines required
Problem: self-hosting a data center
Cover electrical and maintenance costs
Peaks vs Quiet moments: hard to optimize
Solution: use the cloud
Data storage in the cloud
Reliability is required
Problem: self-hosting a data center
Disaster will strike
Need different geographical locations
Solution: use the cloud
The big three: AWS, Azure & Google
AWS: 32% market share in 2018
Azure: 17% market share in 2018
Google: 10% market share in 2018
Storage
Upload files, e.g. storing product images
Services
AWS S3
Azure Blob Storage
Google Cloud Storage
Computation
Perform calculations, e.g. hosting a web server
Services
AWS EC2
Azure Virtual Machines
Google Compute Engine
Databases
Hold structured information
Services
AWS RDS
Azure SQL Database
Google Cloud SQL