So, you want to build a
Data Lake?
The Basics of Data Lakes, Key Considerations, and Lessons Learned
David P. Moore
12/15/2020
Agenda
• Introduction
• What is a Data Lake?
• Architecture and Design
• Governance and Support
• Lessons Learned
• What’s Next?
About Me…
• Sr. Software Developer at CarMax since 2019
 Consultant with CapTech for 3+ years
 Before that, worked at Capital One in a variety of roles, including
Developer, Data Modeler, and Tech Lead
• Have worked on 3 data lake implementations
at 3 different companies using 3 different
technologies
• 20+ years in data and software dev, with a
passion for continuous improvement
• Two fun facts:
 I have a black belt in Silkisondan Karate
 I love to play guitar and listen to music
What is a Data Lake?
First a little data history lesson…
Data warehouse and proprietary ETL and database tools
• 1990s to mid-2000s – Data Warehouse popularized
 Ralph Kimball – Star Schema, Data Marts
 Bill Inmon - EDW
• SMP Database Systems (Oracle, SQL Server, Sybase)
• ETL Tools (Informatica, Ab Initio, Talend, etc)
• MPP Database Systems (Teradata, Netezza, Greenplum, etc)
 ELT, 3NF
Open-source, big data and the
cloud…
• 2003, 2004 – Google File System, and Google MapReduce Papers
published
• 2006 – Hadoop started by Doug Cutting and Mike Cafarella
• 2008 - Companies like Cloudera, Hortonworks, MapR form to
package and distribute open-source Hadoop
• 2006 – AWS launched, followed by Google in 2008 and Azure in 2010
• 2010 – Apache Spark started by Matei Zaharia
• 2013 – Databricks launched offering Spark as a Service
• 2019 – Delta Lake released by Databricks
What is Big Data?
• Big Data is a term used to describe massive volumes of data that can
flood a business daily
• This data can be either structured or unstructured, but ultimately
the datasets are so large that they cannot be processed on a single
machine in a reasonable amount of time
• 3 V’s, popularized by Doug Laney from Gartner:
Volume Variety Velocity
What is a Data Lake?
• “A data lake is a system or repository of data stored in its natural/raw
format, usually object blobs or files.”
“A data lake is usually a single store of all enterprise data including raw copies
of source system data and transformed data used for tasks such as reporting,
visualization, advanced analytics and machine learning. A data lake can
include structured data from relational databases (rows and columns), semi-
structured data (CSV, logs, XML, JSON), unstructured data (emails,
documents, PDFs) and binary data (images, audio, video).”
Source: https://en.wikipedia.org/wiki/Data_lake
James Dixon of Pentaho:
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy consumption –
the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the
lake, and various users of the lake can come to examine, dive
in, or take samples.”
Data Warehouse vs. Data Lake

|                        | Data Warehouse                 | Data Lake                                  |
|------------------------|--------------------------------|--------------------------------------------|
| Data Format            | Structured                     | Structured, semi-structured, unstructured  |
| Data Schema / Modeling | Schema-on-write                | Schema-on-read                             |
| Relative Cost          | $$$                            | $                                          |
| Flexibility            | Less agile                     | Highly agile                               |
| Performance            | Tuned for fast query response  | General-purpose access, slower responses   |
| Data Quality           | High-quality, curated data     | Lower-quality, raw data                    |
| Target Users           | Business analysts              | Data scientists                            |
| Typical Use Cases      | Reporting, visualizations      | Predictive analytics, machine learning     |
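The schema-on-write vs. schema-on-read row is the key behavioral difference, and it can be sketched in a few lines of Python. This is a toy illustration; the field names and functions are invented for the example:

```python
import json

# Schema-on-write (warehouse style): validate/conform before storing.
def write_validated(record, required=("customer_id", "amount")):
    missing = [f for f in required if f not in record]
    if missing:
        raise ValueError(f"rejected at write time, missing: {missing}")
    return json.dumps(record)  # stored only if it conforms

# Schema-on-read (lake style): store as-is, apply a schema when querying.
def read_with_schema(raw_line, schema=("customer_id", "amount")):
    record = json.loads(raw_line)
    return {field: record.get(field) for field in schema}  # gaps -> None

stored = write_validated({"customer_id": 1, "amount": 9.99})
lake_line = json.dumps({"customer_id": 2, "note": "no amount captured"})
print(read_with_schema(lake_line))  # schema gaps surface only at read time
```

The lake accepts anything and defers conformance to query time, which is exactly why it is more agile but lower quality than the warehouse.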
What is Delta Lake?
“Delta Lake is an open source storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies
streaming and batch data processing. Delta Lake runs on top of your existing data
lake and is fully compatible with Apache Spark APIs.”
https://docs.delta.io/latest/delta-faq.html
Created by Databricks, then open-sourced and contributed to the Linux Foundation as an
open standard, Delta Lake is a technology layer compatible with Apache Spark that
adds some database-like features to a data lake.
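The "ACID transactions on files" idea can be sketched with a toy transaction log, loosely modeled on Delta's _delta_log of JSON add/remove actions. This is a heavily simplified illustration of the concept, not the real Delta Lake format or API:

```python
import json

# Toy model of a Delta-style transaction log: each commit is an ordered
# list of JSON actions, and the current table state is derived by
# replaying commits in order. Readers always see a consistent version.
log = []

def commit(actions):
    log.append([json.dumps(a) for a in actions])  # one atomic commit

def live_files():
    files = set()
    for version in log:  # replay commits in order
        for line in version:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

commit([{"add": "part-000.parquet"}, {"add": "part-001.parquet"}])
commit([{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}])  # rewrite/update
print(sorted(live_files()))  # ['part-001.parquet', 'part-002.parquet']
```

Because a commit either lands in the log or does not, a failed job never leaves half-written files visible — the property plain data lakes lack.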
The cloud has enabled a massive
transformation in data capabilities
• Going from on-premises data centers, where provisioning new
hardware took weeks or months, to being able to scale up within
minutes
• Decoupling compute from storage allows for flexible scaling and cost
optimization
Architecture and
Design
Data Lake Architectural
Considerations
Scalability Flexibility Security
Availability Supportability
Cloud vs. On-Premises?
Cloud:
• Flexibility & agility
• Scalability
• Op-ex cost model
• No data center to manage
• Less direct control over data
• Depending on workload, costs can be higher
On-Premises:
• Slower time to market
• Limits to scalability
• Cap-ex cost model
• Full control over data
• Depending on workload, costs could be lower
Data Lake Architecture
Primary Components
Storage
Processing Engine
Orchestration Engine
User Access Tools
Data Catalog
Data Lake Environments
DEV
TEST
PRODUCTION
• As in any traditional systems development, having multiple environments for
developing and testing code is necessary.
• Changes to each subsequent environment should be made via automation
• Pre-prod environments need to be kept in sync with prod
Refresh
Process
Data Lake Zones
Landing
Raw (Bronze)
Clean/Valid (Silver)
Refined (Gold)
Secure
Sandbox
Data Lakes are typically divided into separate zones, with data going through a
refining process as it progresses from one zone to the next.
Progression
Data Lake Storage paradigms
The Data Lake has two primary storage paradigms for accessing and dealing
with its data:
Hierarchical File System
 Typically based on HDFS
 Data organized into Files and Folders
 N-levels deep
 Based on the POSIX file system standard
Database
 Typically based on Hive
 Data is organized into Databases and Tables
 2-levels deep
 Compatible with SQL-based access
Most Data Lake systems use both at the same time, where the Database layer sits
on top of the File System. This can cause confusion for users.
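For example, the same dataset can be addressed through both paradigms at once — an N-level file-system path and a 2-level database.table name layered on top of it. The paths and names below are illustrative, not a prescribed standard:

```python
# File-system paradigm: N-level folder hierarchy, often partitioned.
def fs_path(zone, domain, dataset, partition):
    return f"/lake/{zone}/{domain}/{dataset}/{partition}"

# Database paradigm: 2-level database.table name, as in a Hive metastore,
# pointing at the same underlying folder.
def table_name(zone, dataset):
    return f"{zone}.{dataset}"

print(fs_path("raw", "sales", "orders", "ingest_date=2020-12-15"))
print(table_name("raw", "orders"))
```

The confusion mentioned above usually comes from users not realizing that `raw.orders` and `/lake/raw/sales/orders/...` are two views of the same files.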
Storage Design Decisions
• Datasets in a data lake are typically defined at a folder level
instead of at the file level.
• At the top level there is typically a folder structure that aligns
with the zones
• There are two primary types of data to consider:
 Event/Fact data (Clicks, Transactions, Sensor readings, etc)
 Reference/Master/Dimension data (Customer, Product, etc)
• Reference/Dimension data requires thinking about how to store
history of changes:
1. Snapshots
2. Deltas
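As a sketch of the trade-off: snapshots store the full table on every load, while deltas store only what changed. One way to derive a delta is to diff two consecutive snapshots on the business key (a toy example; the field names are invented):

```python
# Diff two snapshots of a dimension to produce a delta of changes.
def compute_delta(previous, current, key="customer_id"):
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserts = [r for k, r in curr.items() if k not in prev]
    updates = [r for k, r in curr.items() if k in prev and r != prev[k]]
    deletes = [r for k, r in prev.items() if k not in curr]
    return {"insert": inserts, "update": updates, "delete": deletes}

day1 = [{"customer_id": 1, "tier": "gold"}, {"customer_id": 2, "tier": "silver"}]
day2 = [{"customer_id": 1, "tier": "platinum"}, {"customer_id": 3, "tier": "silver"}]
delta = compute_delta(day1, day2)
print(delta["update"])  # [{'customer_id': 1, 'tier': 'platinum'}]
```

Snapshots are simpler to query (pick a date folder) but cost more storage; deltas are compact but require replaying changes to reconstruct a point-in-time view.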
File formats and compression
An important design choice is what file format to use in the Lake as
well as whether to compress the data
 For the Landing/Raw zone, the convention is to preserve the data in
whatever format it arrived in.
 For subsequent zones, it makes sense to conform to a standard
format designed for data lakes that includes schema
information
 Parquet is popular for analytics (Columnar) with Snappy
Compression
 Delta Lake uses Parquet with additional metadata
 ORC is an alternative columnar format popular on Hadoop
 Avro is a row-based format popular for streaming (Kafka)
 Avoid CSV or plain text formats where possible
 Consider whether the format is splittable for parallel processing
 CSV and Gzip may not be splittable formats
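The splittability point can be demonstrated with the standard library: plain data can be read from any byte offset, but a single gzip stream must be decompressed from its header, so one large .gz file cannot be divided among parallel workers:

```python
import gzip

data = b"\n".join(b"row-%04d" % i for i in range(1000))
gz = gzip.compress(data)

# Plain (uncompressed) data is splittable: a worker can start reading at
# any byte offset without touching the earlier bytes.
chunk = data[4000:4020]
assert chunk

# A gzip stream is not: decoding must begin at the gzip header. Skipping
# the 10-byte header lands us in raw deflate data, which fails to decode.
try:
    gzip.decompress(gz[10:])
    splittable = True
except Exception:
    splittable = False
print(splittable)  # False
```

This is why a multi-gigabyte gzipped CSV is processed by a single task, while the same data as Parquet (or as gzipped *blocks inside* a splittable container) can be fanned out across a cluster.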
Data Ingestion Choices
Ingestion is the process of getting data into the lake. When designing ingestion
systems, there are many options and choices to be made, such as:
ETL frameworks:
• GUI-based
• Code-based
• Notebooks
• Metadata-driven
Frequency:
• Batch
 Weekly
 Daily
 Hourly
• Micro-batch
 Every N minutes
• Streaming / Real time
Push vs. Pull:
• Push – source systems send
their data to the lake
• Pull – the lake
initiates extracts
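The "metadata-driven" style deserves a sketch: instead of hand-coding one pipeline per source, a table of source definitions drives a generic ingestion loop. The field names and actions below are illustrative, not from any particular framework:

```python
# Source registry: in practice this would live in a config table or file.
sources = [
    {"name": "orders",      "mode": "pull", "frequency": "hourly"},
    {"name": "clickstream", "mode": "push", "frequency": "streaming"},
]

def plan_ingestion(sources):
    plan = []
    for s in sources:
        # Pull: the lake schedules an extract from the source system.
        # Push: the source sends data, and the lake just lands what arrives.
        action = "schedule_extract" if s["mode"] == "pull" else "await_landing"
        plan.append((s["name"], action, s["frequency"]))
    return plan

for step in plan_ingestion(sources):
    print(step)
```

Onboarding a new source then becomes a metadata change rather than new pipeline code, which is what makes this style scale to hundreds of feeds.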
Data Catalog
The data catalog is a central part of managing the lake
and should have features such as:
• Dataset definitions
• Fields/column definitions
• Tags: Owner, Classification, PII
• Subject Matter Experts (SMEs)
Modern catalog tools also provide features such as:
• Crowdsourcing of metadata and gamification
• Automated annotation
Some examples:
Alation
Lumada Data Catalog
IBM Watson Knowledge
Catalog
AWS Glue
Azure Data Catalog /
Purview
Hive Metastore
• Most data lakes that are Hadoop-based or Spark-based rely on a
metadata catalog called the Hive Metastore
• It is important to consider how this should be provisioned and
managed
• The metastore is a relational database and supports a variety of
DBMS types including both open source (PostgreSQL, MySQL) and
closed (Oracle, MS SQL Server)
• Some configurations allow for an external metastore that can be
shared by workspaces (e.g., Databricks)
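As a sketch, on Spark-based platforms an external metastore is typically wired up through the standard Hive JDO connection properties. The property names below are the standard ones; the version, driver, host, database, and credentials are placeholders to be replaced for your environment:

```properties
# Point Spark's embedded Hive client at an external metastore database.
# Values are placeholders; keep credentials in a secret store, not plain text.
spark.sql.hive.metastore.version                      2.3.7
spark.hadoop.javax.jdo.option.ConnectionURL           jdbc:mysql://<host>:3306/metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName    org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName      <user>
spark.hadoop.javax.jdo.option.ConnectionPassword      <password>
```

Sharing one such metastore across workspaces means table definitions created in one environment are visible in the others, which is often the point of externalizing it.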
Data Lake Consuming Systems
The lake will most likely host
multiple consuming systems
including:
• Data Warehouses
• Data Marts
• Operational Data Stores
• Feature Stores
• Data products or applications
 Dashboards
 Alerts/Notifications
 Automated Actions
 Datasets
Designing and architecting for data
consumption will require answering
questions such as:
• Will systems pull data from the
lake, or will data be pushed?
• How will these systems access the
data?
• How will systems be notified that
data is available?
• What environments will these
systems use for developing and
testing?
• What APIs will be used? (JDBC,
ODBC, REST, SFTP)
Example: Modern Data Warehouse
in Azure
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
Example: Data Lake on AWS
https://aws.amazon.com/solutions/implementations/data-lake-solution/
Governance and
Support
Keeping the Lake Secure
• Network security controls
• Role-Based Access Controls (RBAC)
• Encryption
 Transparent Data Encryption
 Explicit Encryption
• Row level and column level access
Keeping the Lake Available
• Service Level Agreements
 RPO and RTO
• Backups
 Data
 Configuration
 Secrets
• Version Control
• Resource Locks
• Geo-Redundancy
• Automation
What’s your disaster recovery plan?
Access Patterns and Roles
The Lake needs to support several different types of access patterns:
1. System Access
 Platform systems
 Applications
2. Business User Access
 Data Analysts
 Data Scientists
3. Technology User Access
 Support Access
 Developer Access
Each of these groups needs different access rights appropriate to its
role.
Regulations and Policies
impacting the Lake
External Regulations
• GDPR
• CCPA
• HIPAA
• PCI
Internal Policies
• PII and Privacy
• Information Classification
Some regulations such as GDPR and CCPA require customer data to
be disclosed and/or deleted. This requires careful design.
User Support
• Data Catalog
• Access to Data and Tools
• Training
• Sandbox Provisioning
• Help & Support
Technical Exploration and Tool
Selection
• Explore and select tools and technologies
• Minimize number of tools
• Choose best of breed
• Consider Total Cost of Ownership (TCO)
• Select compatible technologies
Performance Tuning
• CPUs/Cores
• Memory
• Parallelism
• Skew
• Caching
Lessons Learned
1. Managing Environments is Hard
2. Automate Everything
3. Don’t rush to fill the lake, you might wind up with a swamp
4. Know your data
5. Pick a high value use case and demonstrate value quickly
6. Minimize complexity
7. Make sure you have backups
8. Enable self-service
9. But set limits and controls on user space
10. Try out different options, but settle on a single solution
What’s Next?
Machine Learning and AI
The Data Lake should not be an end in itself, but instead
should be an enabler of new ways of using data for the benefit
of the business and its customers.
Machine Learning and Artificial intelligence hold much
promise and potential to leverage big data to create
innovative data products.
Some newer capabilities that are critical to this include:
• Feature Stores – Systems for storing and managing
“features” used by machine learning pipelines or models
• Model Registries – Systems for storing, managing and
operationalizing predictive models
The Lakehouse
https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Technologies like Delta Lake have enabled the combining of the data lake
and data warehouse, simplifying the data architecture
• Lakehouse concept introduced by Databricks
The Event Streaming Platform
Championed by Confluent (creators of Kafka)
this enterprise architecture pattern uses a
hub-and-spoke model where systems stream
events to a hub, which can be read by other
systems.
• Enables real-time, event-driven systems
• Simplifies point-to-point dependencies
• Complements Data Lakes, Data Warehouses,
and other systems
Further reading
The Enterprise Big Data Lake
Alex Gorelik
Questions?