So, you want to build a
Data Lake?
The Basics of Data Lakes, Key Considerations, and Lessons Learned
David P. Moore
12/15/2020
Agenda
• Introduction
• What is a Data Lake?
• Architecture and Design
• Governance and Support
• Lessons Learned
• What’s Next?
About Me…
• Sr. Software Developer at CarMax since 2019
 Consultant with CapTech for 3+ years
 Before that, worked at Capital One in a variety of roles, including
Developer, Data Modeler, and Tech Lead
• Have worked on 3 data lake implementations
at 3 different companies using 3 different
technologies
• 20+ years in data and software dev, with a
passion for continuous improvement
• Two fun facts:
 I have a black belt in Silkisondan Karate
 I love to play guitar and listen to music
What is a Data Lake?
First a little data history lesson…
Data warehouse and proprietary ETL and database tools
• 1990s to mid-2000s – Data Warehouse popularized
 Ralph Kimball – Star Schema, Data Marts
 Bill Inmon - EDW
• SMP Database Systems (Oracle, SQL Server, Sybase)
• ETL Tools (Informatica, Ab Initio, Talend, etc)
• MPP Database Systems (Teradata, Netezza, Greenplum, etc)
 ELT, 3NF
Open-source, big data and the
cloud…
• 2003, 2004 – Google File System, and Google MapReduce Papers
published
• 2006 – Hadoop started by Doug Cutting and Mike Cafarella
• 2008 - Companies like Cloudera, Hortonworks, MapR form to
package and distribute open-source Hadoop
• 2006 – AWS launched, followed by Google in 2008 and Azure in 2010
• 2010 – Apache Spark started by Matei Zaharia
• 2013 – Databricks launched offering Spark as a Service
• 2019 – Delta Lake released by Databricks
What is Big Data?
• Big Data is a term used to describe massive volumes of data that can
flood a business daily
• This data can be either structured or unstructured, but ultimately
the datasets are so large that they cannot be processed on a single
machine in a reasonable amount of time
• 3 V’s, popularized by Doug Laney from Gartner:
Volume Variety Velocity
What is a Data Lake?
• “A data lake is a system or repository of data stored in its natural/raw
format, usually object blobs or files.”
“A data lake is usually a single store of all enterprise data including raw copies
of source system data and transformed data used for tasks such as reporting,
visualization, advanced analytics and machine learning. A data lake can
include structured data from relational databases (rows and columns), semi-
structured data (CSV, logs, XML, JSON), unstructured data (emails,
documents, PDFs) and binary data (images, audio, video).”
Source: https://en.wikipedia.org/wiki/Data_lake
James Dixon of Pentaho:
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy consumption –
the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the
lake, and various users of the lake can come to examine, dive
in, or take samples.”
Data Warehouse vs. Data Lake

|                        | Data Warehouse                 | Data Lake                                  |
|------------------------|--------------------------------|--------------------------------------------|
| Data Format            | Structured                     | Structured, semi-structured, unstructured  |
| Data Schema / Modeling | Schema-on-write                | Schema-on-read                             |
| Relative Cost          | $$$                            | $                                          |
| Flexibility            | Less agile                     | Highly agile                               |
| Performance            | Tuned for fast query response  | General-purpose access, slower responses   |
| Data Quality           | High-quality, curated data     | Lower-quality, raw data                    |
| Target Users           | Business analysts              | Data scientists                            |
| Typical Use Cases      | Reporting, visualizations      | Predictive analytics, machine learning     |
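The schema-on-write vs. schema-on-read row is the key behavioral difference, and it can be sketched in a few lines of Python. This is a toy illustration; the field names and functions are invented for the example:

```python
import json

# Schema-on-write (warehouse style): validate/conform before storing.
def write_validated(record, required=("customer_id", "amount")):
    missing = [f for f in required if f not in record]
    if missing:
        raise ValueError(f"rejected at write time, missing: {missing}")
    return json.dumps(record)  # stored only if it conforms

# Schema-on-read (lake style): store as-is, apply a schema when querying.
def read_with_schema(raw_line, schema=("customer_id", "amount")):
    record = json.loads(raw_line)
    return {field: record.get(field) for field in schema}  # gaps -> None

stored = write_validated({"customer_id": 1, "amount": 9.99})
lake_line = json.dumps({"customer_id": 2, "note": "no amount captured"})
print(read_with_schema(lake_line))  # schema gaps surface only at read time
```

The lake accepts anything and defers conformance to query time, which is exactly why it is more agile but lower quality than the warehouse.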
What is Delta Lake?
“Delta Lake is an open source storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies
streaming and batch data processing. Delta Lake runs on top of your existing data
lake and is fully compatible with Apache Spark APIs.”
https://docs.delta.io/latest/delta-faq.html
Created by Databricks, then open-sourced and contributed to the Linux Foundation as an
open standard, Delta Lake is a technology layer compatible with Apache Spark that
adds some database-like features to a data lake.
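The "ACID transactions on files" idea can be sketched with a toy transaction log, loosely modeled on Delta's _delta_log of JSON add/remove actions. This is a heavily simplified illustration of the concept, not the real Delta Lake format or API:

```python
import json

# Toy model of a Delta-style transaction log: each commit is an ordered
# list of JSON actions, and the current table state is derived by
# replaying commits in order. Readers always see a consistent version.
log = []

def commit(actions):
    log.append([json.dumps(a) for a in actions])  # one atomic commit

def live_files():
    files = set()
    for version in log:  # replay commits in order
        for line in version:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

commit([{"add": "part-000.parquet"}, {"add": "part-001.parquet"}])
commit([{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}])  # rewrite/update
print(sorted(live_files()))  # ['part-001.parquet', 'part-002.parquet']
```

Because a commit either lands in the log or does not, a failed job never leaves half-written files visible — the property plain data lakes lack.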
The cloud has enabled a massive
transformation in data capabilities
• Going from on-premises data centers, where provisioning new
hardware took weeks or months, to being able to scale up within
minutes
• Decoupling compute from storage allows for flexible scaling and cost
optimization
Architecture and
Design
Data Lake Architectural
Considerations
Scalability Flexibility Security
Availability Supportability
Cloud vs. On-Premises?
Cloud:
• Flexibility & agility
• Scalability
• Op-ex cost model
• No data center to manage
• Less direct control over data
• Depending on workload, costs can be higher
On-Premises:
• Slower time to market
• Limits to scalability
• Cap-ex cost model
• Full control over data
• Depending on workload, costs could be lower
Data Lake Architecture
Primary Components
Storage
Processing Engine
Orchestration Engine
User Access Tools
Data Catalog
Data Lake Environments
DEV
TEST
PRODUCTION
• As in any traditional systems development, having multiple environments for
developing and testing code is necessary.
• Changes to each subsequent environment should be made via automation
• Pre-prod environments need to be kept in sync with prod
Refresh
Process
Data Lake Zones
Landing
Raw (Bronze)
Clean/Valid (Silver)
Refined (Gold)
Secure
Sandbox
Data Lakes are typically divided into separate zones, with data going through a
refining process as it progresses from one zone to the next.
Progression
Data Lake Storage paradigms
The Data Lake has two primary storage paradigms for accessing and dealing
with its data:
Hierarchical File System
 Typically based on HDFS
 Data organized into Files and Folders
 N-levels deep
 Based on the POSIX file system standard
Database
 Typically based on Hive
 Data is organized into Databases and Tables
 2-levels deep
 Compatible with SQL-based access
Most Data Lake systems use both at the same time, where the Database layer sits
on top of the File System. This can cause confusion for users.
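For example, the same dataset can be addressed through both paradigms at once — an N-level file-system path and a 2-level database.table name layered on top of it. The paths and names below are illustrative, not a prescribed standard:

```python
# File-system paradigm: N-level folder hierarchy, often partitioned.
def fs_path(zone, domain, dataset, partition):
    return f"/lake/{zone}/{domain}/{dataset}/{partition}"

# Database paradigm: 2-level database.table name, as in a Hive metastore,
# pointing at the same underlying folder.
def table_name(zone, dataset):
    return f"{zone}.{dataset}"

print(fs_path("raw", "sales", "orders", "ingest_date=2020-12-15"))
print(table_name("raw", "orders"))
```

The confusion mentioned above usually comes from users not realizing that `raw.orders` and `/lake/raw/sales/orders/...` are two views of the same files.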
Storage Design Decisions
• Datasets in a data lake are typically defined at a folder level
instead of at the file level.
• At the top level there is typically a folder structure that aligns
with the zones
• There are two primary types of data to consider:
 Event/Fact data (Clicks, Transactions, Sensor readings, etc)
 Reference/Master/Dimension data (Customer, Product, etc)
• Reference/Dimension data requires thinking about how to store
history of changes:
1. Snapshots
2. Deltas
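As a sketch of the trade-off: snapshots store the full table on every load, while deltas store only what changed. One way to derive a delta is to diff two consecutive snapshots on the business key (a toy example; the field names are invented):

```python
# Diff two snapshots of a dimension to produce a delta of changes.
def compute_delta(previous, current, key="customer_id"):
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserts = [r for k, r in curr.items() if k not in prev]
    updates = [r for k, r in curr.items() if k in prev and r != prev[k]]
    deletes = [r for k, r in prev.items() if k not in curr]
    return {"insert": inserts, "update": updates, "delete": deletes}

day1 = [{"customer_id": 1, "tier": "gold"}, {"customer_id": 2, "tier": "silver"}]
day2 = [{"customer_id": 1, "tier": "platinum"}, {"customer_id": 3, "tier": "silver"}]
delta = compute_delta(day1, day2)
print(delta["update"])  # [{'customer_id': 1, 'tier': 'platinum'}]
```

Snapshots are simpler to query (pick a date folder) but cost more storage; deltas are compact but require replaying changes to reconstruct a point-in-time view.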
File formats and compression
An important design choice is what file format to use in the Lake as
well as whether to compress the data
 For the Landing/Raw zone, the convention is to preserve the data in
whatever format it arrived in.
 For subsequent zones, it makes sense to conform to a standard
format designed for data lakes that includes schema
information
 Parquet is popular for analytics (Columnar) with Snappy
Compression
 Delta Lake uses Parquet with additional metadata
 ORC is an alternative columnar format popular on Hadoop
 Avro is a row-based format popular for streaming (Kafka)
 Avoid CSV or plain text formats where possible
 Consider whether the format is splittable for parallel processing
 CSV and Gzip may not be splittable formats
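The splittability point can be demonstrated with the standard library: plain data can be read from any byte offset, but a single gzip stream must be decompressed from its header, so one large .gz file cannot be divided among parallel workers:

```python
import gzip

data = b"\n".join(b"row-%04d" % i for i in range(1000))
gz = gzip.compress(data)

# Plain (uncompressed) data is splittable: a worker can start reading at
# any byte offset without touching the earlier bytes.
chunk = data[4000:4020]
assert chunk

# A gzip stream is not: decoding must begin at the gzip header. Skipping
# the 10-byte header lands us in raw deflate data, which fails to decode.
try:
    gzip.decompress(gz[10:])
    splittable = True
except Exception:
    splittable = False
print(splittable)  # False
```

This is why a multi-gigabyte gzipped CSV is processed by a single task, while the same data as Parquet (or as gzipped *blocks inside* a splittable container) can be fanned out across a cluster.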
Data Ingestion Choices
Ingestion is the process of getting data into the lake. When designing ingestion
systems, there are many options and choices to be made, such as:
ETL frameworks:
• GUI-based
• Code-based
• Notebooks
• Metadata-driven
Frequency:
• Batch
 Weekly
 Daily
 Hourly
• Micro-batch
 Every N minutes
• Streaming / Real time
Push vs. Pull:
• Push – source systems send
their data to the lake
• Pull – the lake
initiates extracts
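The "metadata-driven" style deserves a sketch: instead of hand-coding one pipeline per source, a table of source definitions drives a generic ingestion loop. The field names and actions below are illustrative, not from any particular framework:

```python
# Source registry: in practice this would live in a config table or file.
sources = [
    {"name": "orders",      "mode": "pull", "frequency": "hourly"},
    {"name": "clickstream", "mode": "push", "frequency": "streaming"},
]

def plan_ingestion(sources):
    plan = []
    for s in sources:
        # Pull: the lake schedules an extract from the source system.
        # Push: the source sends data, and the lake just lands what arrives.
        action = "schedule_extract" if s["mode"] == "pull" else "await_landing"
        plan.append((s["name"], action, s["frequency"]))
    return plan

for step in plan_ingestion(sources):
    print(step)
```

Onboarding a new source then becomes a metadata change rather than new pipeline code, which is what makes this style scale to hundreds of feeds.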
Data Catalog
The data catalog is a central part of managing the lake
and should have features such as:
• Dataset definitions
• Fields/column definitions
• Tags: Owner, Classification, PII
• Subject Matter Experts (SMEs)
Modern catalog tools also provide features such as:
• Crowdsourcing of metadata and gamification
• Automated annotation
Some examples:
Alation
Lumada Data Catalog
IBM Watson Knowledge
Catalog
AWS Glue
Azure Data Catalog /
Purview
Hive Metastore
• Most data lakes that are Hadoop-based or Spark-based rely on a
metadata catalog called the Hive Metastore
• It is important to consider how this should be provisioned and
managed
• The metastore is a relational database and supports a variety of
DBMS types including both open source (PostgreSQL, MySQL) and
closed (Oracle, MS SQL Server)
• Some configurations allow for an external metastore that can be
shared by workspaces (e.g., Databricks)
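As a sketch, on Spark-based platforms an external metastore is typically wired up through the standard Hive JDO connection properties. The property names below are the standard ones; the version, driver, host, database, and credentials are placeholders to be replaced for your environment:

```properties
# Point Spark's embedded Hive client at an external metastore database.
# Values are placeholders; keep credentials in a secret store, not plain text.
spark.sql.hive.metastore.version                      2.3.7
spark.hadoop.javax.jdo.option.ConnectionURL           jdbc:mysql://<host>:3306/metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName    org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName      <user>
spark.hadoop.javax.jdo.option.ConnectionPassword      <password>
```

Sharing one such metastore across workspaces means table definitions created in one environment are visible in the others, which is often the point of externalizing it.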
Data Lake Consuming Systems
The lake will most likely host
multiple consuming systems
including:
• Data Warehouses
• Data Marts
• Operational Data Stores
• Feature Stores
• Data products or applications
 Dashboards
 Alerts/Notifications
 Automated Actions
 Datasets
Designing and architecting for data
consumption will require answering
questions such as:
• Will systems pull data from the
lake, or will data be pushed?
• How will these systems access the
data?
• How will systems be notified that
data is available?
• What environments will these
systems use for developing and
testing?
• What APIs will be used? (JDBC,
ODBC, REST, SFTP)
Example: Modern Data Warehouse
in Azure
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
Example: Data Lake on AWS
https://aws.amazon.com/solutions/implementations/data-lake-solution/
Governance and
Support
Keeping the Lake Secure
• Network security controls
• Role-Based Access Controls (RBAC)
• Encryption
 Transparent Data Encryption
 Explicit Encryption
• Row level and column level access
Keeping the Lake Available
• Service Level Agreements
 RPO and RTO
• Backups
 Data
 Configuration
 Secrets
• Version Control
• Resource Locks
• Geo-Redundancy
• Automation
What’s your disaster recovery plan?
Access Patterns and Roles
The Lake needs to support several different types of access patterns:
1. System Access
 Platform systems
 Applications
2. Business User Access
 Data Analysts
 Data Scientists
3. Technology User Access
 Support Access
 Developer Access
Each of these groups needs different access rights appropriate to its
role.
Regulations and Policies
impacting the Lake
External Regulations
• GDPR
• CCPA
• HIPAA
• PCI
Internal Policies
• PII and Privacy
• Information Classification
Some regulations such as GDPR and CCPA require customer data to
be disclosed and/or deleted. This requires careful design.
User Support
• Data Catalog
• Access to Data and Tools
• Training
• Sandbox Provisioning
• Help & Support
Technical Exploration and Tool
Selection
• Explore and select tools and technologies
• Minimize number of tools
• Choose best of breed
• Consider Total Cost of Ownership (TCO)
• Select compatible technologies
Performance Tuning
• CPUs/Cores
• Memory
• Parallelism
• Skew
• Caching
Lessons Learned
1. Managing Environments is Hard
2. Automate Everything
3. Don’t rush to fill the lake, you might wind up with a swamp
4. Know your data
5. Pick a high value use case and demonstrate value quickly
6. Minimize complexity
7. Make sure you have backups
8. Enable self-service
9. But set limits and controls on user space
10. Try out different options, but settle on a single solution
What’s Next?
Machine Learning and AI
The Data Lake should not be an end in itself, but instead
should be an enabler of new ways of using data for the benefit
of the business and its customers.
Machine Learning and Artificial intelligence hold much
promise and potential to leverage big data to create
innovative data products.
Some newer capabilities that are critical to this include:
• Feature Stores – Systems for storing and managing
“features” used by machine learning pipelines or models
• Model Registries – Systems for storing, managing and
operationalizing predictive models
The Lakehouse
https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Technologies like Delta Lake have enabled the combining of the data lake
and data warehouse, simplifying the data architecture
• Lakehouse concept introduced by Databricks
The Event Streaming Platform
Championed by Confluent (creators of Kafka)
this enterprise architecture pattern uses a
hub-and-spoke model where systems stream
events to a hub, which can be read by other
systems.
• Enables real-time, event-driven systems
• Simplifies point-to-point dependencies
• Complements Data Lakes, Data Warehouses,
and other systems
Further reading
The Enterprise Big Data Lake
Alex Gorelik
Questions?