Best Practices for building Robust Data Platform
with Apache Spark & Delta
Vini Jaiswal
Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
Agenda
▪ Data Strategy
▪ Optimizing the cost to drive Business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
▪ Bringing it all together - A reference architecture
Data Strategy
Data Challenges
Data Warehouse limits the potential of intelligence
Data volume is growing rapidly
More variety of data -> different applications
Need for faster processing and scalability
Data silos limit innovation
Promise of the Data Lake
1. Collect Everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning
Usual Data Lake
Garbage In -> Garbage Stored -> Garbage Out
✗ No atomicity
✗ No quality enforcement
✗ No consistency / isolation
Ideal Data Lake with
Reliability - High Quality Data
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by
● Unifies Streaming / Batch
Reference: https://youtu.be/qtCxNSmTejk
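As a rough illustration of what schema enforcement on write means, the sketch below rejects a batch when any record does not match an expected schema. It is a plain-Python toy and not how Delta Lake implements the feature; `EXPECTED_SCHEMA`, `validate_batch`, and the sample records are hypothetical.

```python
# Toy schema enforcement: a batch is accepted only if every record matches the schema.
# EXPECTED_SCHEMA and validate_batch are illustrative names, not a Delta Lake API.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}


def validate_batch(records, schema=EXPECTED_SCHEMA):
    """Return the batch unchanged if all records conform, else raise ValueError."""
    for i, record in enumerate(records):
        if set(record) != set(schema):
            raise ValueError(f"record {i}: columns {sorted(record)} do not match schema")
        for column, expected_type in schema.items():
            if not isinstance(record[column], expected_type):
                raise ValueError(f"record {i}: column {column!r} is not {expected_type.__name__}")
    return records


good = [{"user_id": 1, "event": "click", "amount": 0.5}]
bad = [{"user_id": "1", "event": "click", "amount": 0.5}]  # user_id has the wrong type

validate_batch(good)  # accepted
try:
    validate_batch(bad)
except ValueError as err:
    print("rejected:", err)
```

Rejecting the whole batch, rather than silently dropping bad rows, mirrors the "no garbage stored" idea: bad data fails loudly at write time.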
Getting the Data Right
Data Types: CSV, JSON, TXT…
Table Categorization
● Bronze: Raw Ingestion
● Silver: Filtered, Cleaned, Augmented
● Gold: Business-level Aggregates
Align with Business Outcomes
● Audience Segmentation
● Is my data use case worthy?
● Is my data ready for Analytics / ML?
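The Bronze, Silver, and Gold categories can be pictured as three successive passes over the same data. The sketch below is a minimal in-memory version, assuming made-up rows and hypothetical helper names (`to_silver`, `to_gold`); a real pipeline would use Spark and Delta tables instead of Python lists.

```python
# Toy Bronze -> Silver -> Gold flow over in-memory rows (hypothetical data and names).
from collections import defaultdict

bronze = [  # raw ingestion: everything is kept as it arrived
    {"user": "a", "country": "US", "amount": "10.5"},
    {"user": "b", "country": "us", "amount": "4"},
    {"user": "", "country": "DE", "amount": "7"},       # missing user
    {"user": "c", "country": "DE", "amount": "oops"},   # unparseable amount
]


def to_silver(rows):
    """Filter and clean: drop rows without a user or with a bad amount, normalize country."""
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        if not row["user"]:
            continue
        silver.append({"user": row["user"], "country": row["country"].upper(), "amount": amount})
    return silver


def to_gold(rows):
    """Business-level aggregate: total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping the raw Bronze rows untouched means the Silver and Gold steps can be re-run if the cleaning rules change.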
Optimizing the Cost to Drive Business Value
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Workload Type | AWS Type | Azure Type | Recommended Use Case
Memory Optimized | r5 | Dsv2 | Memory-intensive applications. Use Case: ML workload with data caching
Compute Optimized | c5 | Fsv2 | Structured Streaming, Distributed Analytics, Data Science Applications. Use Case: ETL with full file scans and no data reuse
Storage Optimized | i3 | Lsv2 | Use cases that require higher disk throughput and IO. Use Case: Analytics - Storage Optimized i3 class with Delta IO Cache
Selection of Instance Types
Reference for Azure Type: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS Type: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
Selection of node size
Rule of thumb
1. Fewer big instances > more small instances
a. (larger heap = larger GC)
b. Multiple executors per machine
2. Size based on the number of tasks initially, tweak later
a. Run the job on a small cluster to get an idea of the number of tasks
b. Observe cluster metrics for CPU, memory and network utilization
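One way to turn "size based on the number of tasks" into a first guess is simple arithmetic. The sketch below assumes one task per core and a chosen number of task waves; both the helper `workers_needed` and the example numbers are hypothetical, and the result is only a starting point to be tweaked after observing metrics.

```python
import math


def workers_needed(num_tasks, cores_per_worker, waves=1):
    """Workers required so that num_tasks finish in `waves` rounds, one task per core."""
    if num_tasks <= 0 or cores_per_worker <= 0 or waves <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(num_tasks / (cores_per_worker * waves))


# Hypothetical numbers: a trial run on a small cluster showed 400 tasks in the widest stage.
print(workers_needed(400, cores_per_worker=16))           # one wave
print(workers_needed(400, cores_per_worker=16, waves=4))  # four waves
```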
Observe Spark UI & tweak the workloads
● Fully cached with room to spare? > Decrease instances
● Almost completely cached? > Increase cluster size
● Not even close to cached? > Consider instances with SSD instead of EBS, or use R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
Observe Ganglia Metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
Performance and Tuning with
Delta Lake & Apache Spark
Performance Symptoms
Look for these 4 symptoms
Shuffle
Spill
Skew
Small Files
Can I make my Spark application run faster?
I found Shuffle, now what?
● Use broadcast join
● Review join order
Before (Sort Merge Join): query completion time 28 minutes, rows output: 2,509,189,313
After: query completion time 1.8 minutes, rows output: 1023
Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
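To see why a broadcast join avoids shuffling the large side, the sketch below imitates the idea in plain Python: the small table is turned into a hash map that every partition of the large table probes locally. The function `broadcast_hash_join` and the sample tables are hypothetical; in Spark the behavior is requested with a broadcast hint, as described in the reference above.

```python
# Toy broadcast hash join: the small table is copied to every partition of the large
# table, so the large table is never shuffled. All data and names are hypothetical.

def broadcast_hash_join(large_partitions, small_rows, key):
    lookup = {}
    for row in small_rows:                       # build side: hash the small table once
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for partition in large_partitions:           # probe side: each partition works locally
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


orders = [  # large table, already split into partitions
    [{"order_id": 1, "country": "US"}, {"order_id": 2, "country": "DE"}],
    [{"order_id": 3, "country": "US"}, {"order_id": 4, "country": "FR"}],
]
countries = [{"country": "US", "region": "AMER"}, {"country": "DE", "region": "EMEA"}]

result = broadcast_hash_join(orders, countries, key="country")
print(result)
```

This only pays off when the broadcast side is small enough to fit in memory on every executor.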
I found Spill, now what?
● Increase shuffle partitions (for this example: 48)
  Example run shown: Shuffle Partitions = 16
  set spark.sql.shuffle.partitions=48
● Reduce the number of cores: spark.executor.cores < total cores per worker
● Larger cluster, faster disks (SSDs)
The more spill you can remove, the larger the impact!
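A common way to choose a higher shuffle partition count is to aim for a target partition size and round up to a multiple of the cluster's cores. The sketch below is one such heuristic, not a rule from the talk: the 128 MiB target, the helper `suggest_shuffle_partitions`, and the example stage size are all assumptions chosen for illustration.

```python
import math

MIB = 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, total_cores, target_partition_bytes=128 * MIB):
    """Smallest multiple of total_cores that keeps each shuffle partition at or under the target size."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


# Hypothetical stage: 6 GiB of shuffle data on a cluster with 16 cores in total.
n = suggest_shuffle_partitions(6 * 1024 * MIB, total_cores=16)
print(f"set spark.sql.shuffle.partitions={n}")
```

Smaller partitions are less likely to exceed executor memory, which is what causes spill in the first place.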
I found Skew, now what?
Symptom
● Ganglia CPU usage stays low for a long time after initial high usage
● Task duration: the max is significantly higher than the 75th and 25th percentile values
● Input Size/Records differs widely across tasks
What to do?
● Use broadcast join
● Use Skew Join
● Filter out large keys or salt keys, and set up multiple reduce steps
● Explicitly repartition the data on a different field
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
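Key salting can be sketched without Spark. In the toy example below, a salt is appended to every key on the large side and the small side is replicated once per salt value, so the rows of one hot key are spread over several groups. `NUM_SALTS`, the rotating-salt scheme, and the data are assumptions made for illustration.

```python
# Toy key salting for a skewed join (hypothetical data; NUM_SALTS is an assumption).
import itertools
from collections import Counter

NUM_SALTS = 4

# Large side: key "hot" dominates, so one reducer would receive almost every row.
facts = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
# Small side: one row per key.
dims = [("hot", "H"), ("cold", "C")]

# 1. Append a rotating salt to each key on the large side.
counter = itertools.count()
salted_facts = [((key, next(counter) % NUM_SALTS), value) for key, value in facts]

# 2. Replicate each row of the small side once per salt value so every salted key finds a match.
salted_dims = {(key, salt): label for key, label in dims for salt in range(NUM_SALTS)}

# 3. Join on the salted key; the hot key is now spread over NUM_SALTS groups.
joined = [(key, value, salted_dims[(key, salt)]) for (key, salt), value in salted_facts]

rows_per_salted_key = Counter(k for k, _ in salted_facts)
print(max(rows_per_salted_key.values()))
```

The cost is that the small side grows by a factor of `NUM_SALTS`, so the salt count is a trade-off between balance and replication.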
Adaptive Query Execution
▪ Reduces the manual effort of tuning spark.sql.shuffle.partitions
▪ Turned off by default; set spark.sql.adaptive.enabled=true
▪ Dynamically changes sort-merge join into broadcast-hash join
▪ Dynamically optimizes skew joins
*Available in DBR 7.x/Spark 3.0
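For jobs launched from the command line, the setting can be passed as `--conf` arguments to spark-submit. The sketch below only builds that argument list; the helper `as_submit_flags` is hypothetical. The two sub-feature keys are Spark 3.0 SQL properties that, to the best of my knowledge, already default to true once AQE is enabled, so they are listed only for visibility.

```python
# Render Adaptive Query Execution settings as spark-submit --conf flags.
# The three keys are Spark 3.0 SQL configuration properties; the helper name is hypothetical.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def as_submit_flags(settings):
    """Turn a settings dict into a flat list of spark-submit arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


print(" ".join(as_submit_flags(AQE_SETTINGS)))
```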
I found a lot of small files, now what?
Upstream
● Fix the upstream application that is producing tons of files
● Use a separate tool to compact them before processing with Spark
Changes in Spark Application
● Write your own compaction job
● Delta solves this problem!
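For the "write your own compaction job" option, the planning step amounts to grouping small files into batches close to a target output size. The sketch below uses a greedy first-fit-decreasing grouping; the file names, sizes, target, and `plan_compaction` helper are hypothetical, and this is not the algorithm Delta Lake uses.

```python
# Toy compaction planner: group small files into batches of roughly target_bytes each.
# File names, sizes, and the target are hypothetical; this is not Delta Lake's algorithm.

def plan_compaction(file_sizes, target_bytes):
    """Greedy first-fit-decreasing: returns a list of batches, each a list of file names."""
    batches = []  # each entry: [remaining_capacity, [names]]
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        for batch in batches:
            if size <= batch[0]:
                batch[0] -= size
                batch[1].append(name)
                break
        else:
            batches.append([target_bytes - size, [name]])
    return [names for _, names in batches]


MB = 1_000_000
small_files = {"part-0": 60 * MB, "part-1": 50 * MB, "part-2": 40 * MB,
               "part-3": 30 * MB, "part-4": 20 * MB}
plan = plan_compaction(small_files, target_bytes=100 * MB)
print(plan)
```

Each batch would then be read and rewritten as a single file, turning five small files into two in this example.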
Achieving Performance with Compaction
● Improves read performance
● Solves the small files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
Auto Optimize
Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction.
● Optimizes Apache Spark partition
● Maximizes the throughput of data being written
● Compacts files for partitions
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
Z-Ordering
A technique to colocate related information in the same set of files
● Safely skips more data
● Faster queries
[Figure: Z-order Sorting, an 8 x 8 grid with both axes labeled 0 to 7]
Reference: https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering
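The Z-order curve on an 8 x 8 grid can be reproduced by interleaving the bits of the two coordinates. The sketch below computes that value and sorts the grid cells by it; it is a toy illustration of the curve itself and does not reproduce how Delta Lake implements Z-Ordering. The helper name `z_value` is hypothetical.

```python
# Z-order (Morton) value for a point on the 8 x 8 grid: interleave the bits of x and y.
# Toy illustration only; it does not reproduce how Delta Lake implements Z-Ordering.

def z_value(x, y, bits=3):
    """Interleave the low `bits` bits of x and y, with y in the higher position of each pair."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


cells = sorted(((x, y) for x in range(8) for y in range(8)), key=lambda c: z_value(*c))
print(cells[:8])
```

Cells that are close in both dimensions end up close in the sorted order, so a file holding a contiguous run of the order covers a narrow range of both columns, which is what lets a reader skip it.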
Governance & Security Controls
Data Governance with Delta Lake
Create a retention policy to age out and erase raw data that may contain personal information
High Level Aggregates (e.g. # of users that took an action)
Historical Data Repository
● Easy to navigate
● Pseudonymization
Data Lake
Satisfy compliance requests using UPDATE / DELETE commands
Create tables that don't contain personal data
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
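Pseudonymization can be sketched as replacing a direct identifier with a keyed hash, so downstream tables can still join on the person without storing the identifier itself. The example below is a minimal sketch using Python's standard `hmac` and `hashlib` modules; `SECRET_KEY`, the record layout, and the `pseudonymize` helper are hypothetical, and real deployments need proper key management.

```python
# Toy pseudonymization: replace a direct identifier with a keyed hash.
# SECRET_KEY and the record layout are hypothetical; key management is out of scope here.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Deterministic keyed hash so the same person maps to the same token across tables."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


raw = [{"email": "ana@example.com", "action": "login"},
       {"email": "ana@example.com", "action": "purchase"}]

safe = [{"user_token": pseudonymize(r["email"]), "action": r["action"]} for r in raw]
print(safe[0]["user_token"] == safe[1]["user_token"])
```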
Audit & Monitoring
▪ Use cluster tags for chargeback
▪ Audit logs
▪ Monitor Databricks DBU usage
▪ Delta Transactional Logs
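Chargeback by cluster tag comes down to grouping usage by a tag value. The sketch below sums DBUs per value of a `team` tag; the usage records, the tag name, and the `dbus_by_tag` helper are made up for illustration and do not reflect any Databricks export format.

```python
# Toy chargeback report: sum DBU usage per cluster tag value.
# The usage records and the "team" tag are hypothetical, not a Databricks export format.
from collections import defaultdict

usage = [
    {"cluster": "etl-nightly", "tags": {"team": "data-eng"}, "dbus": 120.0},
    {"cluster": "ml-training", "tags": {"team": "data-science"}, "dbus": 300.0},
    {"cluster": "etl-hourly", "tags": {"team": "data-eng"}, "dbus": 45.5},
    {"cluster": "adhoc", "tags": {}, "dbus": 10.0},  # untagged cluster
]


def dbus_by_tag(records, tag):
    """Total DBUs per tag value; clusters without the tag are grouped as 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        totals[record["tags"].get(tag, "untagged")] += record["dbus"]
    return dict(totals)


print(dbus_by_tag(usage, "team"))
```

Surfacing the untagged bucket explicitly makes it easy to spot clusters that escape chargeback.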
Governance - The Who/What/Where
Many players in the Org. Managing access, roles and responsibilities, as well as managing usage, is a must.
● Data Engineer: Performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices including source control, unit tests, and automation
● Data Scientist: Drives product innovation with state-of-the-art Machine Learning models applied to big data
● Data/Business Analyst: Improves business processes by providing standardized and ad-hoc business analysis. Acts as intermediary between the Analytics and Business teams
● Automated Jobs: Perform automated jobs based on Data Engineering configs
Bringing it together - A reference pipeline
[Figure: reference pipeline diagram. Labels: Ingest (BLOB, DB/DW, Streaming); ETL/Data Processing (Massively scalable data cleansing & transformation; Delta Pipelines: Bronze, Silver, Gold); Data Science & ML (ML Runtime); Serving; Business Unit; APIs, Jobs, Models, Notebooks, Dashboards; Orchestration; CI/CD; Operations & Security; Execution (Databricks Runtime: Reliability & Performance; Optimized Spark Clusters); Storage]
Data Strategy
Cost Optimization &
Performance Tuning
Business Value
Security
THANK YOU!!!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays
 
Али махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptxАли махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptx
palr19411
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays
 
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays
 
Math arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is hereMath arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is here
rdarshankumar84
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
Report_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdfReport_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdf
OlhaTatokhina1
 
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays
 
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptxBE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
AaronBaluyut
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdfMedia_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdf
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdfGen AI futfyfufufufufuytfyctrwaeq3A435.pdf
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdf
divyanshuM3
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
Alcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptxAlcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptx
DrShashank7
 
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays
 
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays New York 2025 - Spring Modulith Design for Microservices by Renjith R...
apidays
 
Али махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptxАли махмуд to The teacm of ghsbh to fortune .pptx
Али махмуд to The teacm of ghsbh to fortune .pptx
palr19411
 
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...
apidays
 
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays Singapore 2025 - Enhancing Developer Productivity with UX (Government...
apidays
 
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays New York 2025 - Open Source and disrupting the travel distribution ec...
apidays
 
Math arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is hereMath arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is here
rdarshankumar84
 
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays New York 2025 - Unifying OpenAPI & AsyncAPI by Naresh Jain & Hari Kri...
apidays
 
Report_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdfReport_Government Authorities_Index_ENG_FIN.pdf
Report_Government Authorities_Index_ENG_FIN.pdf
OlhaTatokhina1
 
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays New York 2025 - Why I Built Another Carbon Measurement Tool for LLMs ...
apidays
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdfBODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
BODMAS-Rule-&-Unit-Digit-Concept-pdf.pdf
SiddharthSean
 
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays Singapore 2025 - 4 Identity Essentials for Scaling SaaS in Large Orgs...
apidays
 
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptxBE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
BE PROGRAMjwjwjwjsjsjsjsME TEMPLATE.pptx
AaronBaluyut
 
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...
apidays
 
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays New York 2025 - The FINOS Common Domain Model for Capital Markets by ...
apidays
 
Media_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdfMedia_Literacy_Index_of_Media_Sector_Employees.pdf
Media_Literacy_Index_of_Media_Sector_Employees.pdf
OlhaTatokhina1
 
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdf
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdfGen AI futfyfufufufufuytfyctrwaeq3A435.pdf
Gen AI futfyfufufufufuytfyctrwaeq3A435.pdf
divyanshuM3
 
THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)THE FRIEDMAN TEST ( Biostatics B. Pharm)
THE FRIEDMAN TEST ( Biostatics B. Pharm)
JishuHaldar
 
Alcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptxAlcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptx
DrShashank7
 
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays New York 2025 - Breaking Barriers: Lessons Learned from API Integrati...
apidays
 

Best Practices for Building Robust Data Platform with Apache Spark and Delta

  • 1. Best Practices for building Robust Data Platform with Apache Spark & Delta Vini Jaiswal Spark+AI Summit - June 2020 https://www.linkedin.com/in/vinijaiswal/
  • 2. Agenda ▪ Data Strategy ▪ Optimizing the cost to drive business value ▪ Performance and tuning with Delta Lake & Apache Spark ▪ Governance and security controls ▪ Bringing it all together: a reference architecture
  • 4. Data Challenges: The data warehouse limits the potential of intelligence. Data volume is growing rapidly. More variety of data -> different applications. Need for faster processing and scalability. Data silos limit innovation. Promise of the Data Lake: 1. Collect everything 2. Store it all in the data lake 3. Data science & machine learning
  • 7. Usual data lake: ✗ No atomicity ✗ No quality enforcement ✗ No consistency / isolation. Ideal data lake: Reliability - High Quality Data ● Schema Enforcement ● ACID Transactions ● Time Travel ● Open Standards, Open Source ● Powered by ● Unifies Streaming / Batch References: https://youtu.be/qtCxNSmTejk
  • 8. Getting the Data Right. Data types: CSV, JSON, TXT… Table categorization: Bronze (raw ingestion), Silver (filtered, cleaned, augmented), Gold (business-level aggregates). Align with business outcomes (e.g. audience segmentation): Is my data use case worthy? Is my data ready for Analytics / ML?
  • 9. Optimizing the Cost to Drive Business Value
  • 10. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 12. Selection of Instance Types
      Workload Type | AWS Type | Azure Type | Recommended Use Case
      Memory Optimized | r5 | Dsv2 | Memory-intensive applications. Use case: ML workload with data caching
      Compute Optimized | c5 | Fsv2 | Structured Streaming, distributed analytics, data science applications. Use case: ETL with full file scans and no data reuse
      Storage Optimized | i3 | Lsv2 | Use cases that require higher disk throughput and IO. Use case: analytics - Storage Optimized i3 class with Delta IO Cache
      Reference for Azure Type: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
      Reference for AWS Type: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
  • 14. Selection of node size: rule of thumb 1. Fewer big instances > more small instances a. (larger heap = larger GC) b. Multiple executors per machine 2. Size based on the number of tasks initially, tweak later a. Run the job with a small cluster to get an idea of the # of tasks b. Observe cluster metrics for CPU, memory and network utilization
  • 15. Observe Spark UI & tweak the workloads Fully cached with room to spare? > decrease instances Almost completely cached? > Increase cluster size Not even close to cached? > Consider instance with SSD instead of EBS or use R class Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
  • 16. Observe Ganglia Metrics & tweak the workloads ○ Are we compute bound? ○ Are we network bound? ○ Are we spilling a ton?
  • 17. Performance and Tuning with Delta Lake & Apache Spark
  • 18. Performance Symptoms: Can I make my Spark application run faster? Look for these 4 symptoms: Shuffle, Spill, Skew, Small Files
  • 19. I found Shuffle, now what? ● Use broadcast join ● Review join order. Before: Sort Merge Join, query completion time 28 minutes, rows output: 2,509,189,313. After: 1.8 minutes, rows output: 1023. Reference: https://spark.apache.org/docs/latest/sql-performance-tuning.html#broadcast-hint-for-sql-queries
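
To make the broadcast join idea on this slide concrete, here is a minimal plain-Python sketch of a broadcast hash join. It is not Spark's implementation; the function name `broadcast_hash_join` and the sample `orders` and `countries` rows are hypothetical. It only illustrates why the large side needs no sort or shuffle when the small side fits in memory on every worker.

```python
from collections import defaultdict

def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` by hashing the small side."""
    # Build phase: index the small table once, as each worker would do
    # after receiving its broadcast copy.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    # Probe phase: stream the large table row by row; the large side is
    # never sorted or re-partitioned.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical sample data: a large fact table and a small dimension table.
orders = [
    {"country": "US", "amount": 10},
    {"country": "IN", "amount": 7},
    {"country": "US", "amount": 3},
    {"country": "BR", "amount": 5},
]
countries = [
    {"country": "US", "name": "United States"},
    {"country": "IN", "name": "India"},
]
result = broadcast_hash_join(orders, countries, "country")
```

In PySpark the same request is typically expressed with the `broadcast()` function from `pyspark.sql.functions` or with a `BROADCAST` SQL hint, as described in the reference linked on the slide.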
  • 20. I found Spill, now what? ● Increase shuffle partitions (for this example: 48; example shown with Shuffle Partitions = 16): set spark.sql.shuffle.partitions=48 ● Reduce the number of cores: spark.executor.cores < total cores per worker ● Larger cluster - faster disk SSDs. The more spill you can remove, the larger the impact!
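
One way to arrive at a partition count like the 48 used on this slide is a back-of-the-envelope calculation. The sketch below is an assumption, not something stated in the talk: it targets roughly 128 MiB of shuffle data per partition (a common rule of thumb) and rounds up to a multiple of the total core count so that each wave of tasks keeps every core busy. The function name and the input values are hypothetical.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Suggest a value for spark.sql.shuffle.partitions (heuristic only)."""
    # Partitions needed so that each one holds at most the target size.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to a whole number of task waves across all cores.
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores

# Hypothetical stage: 5 GiB of shuffle data on a cluster with 16 cores.
suggested = suggest_shuffle_partitions(5 * 1024**3, total_cores=16)
print(f"set spark.sql.shuffle.partitions={suggested}")
```

The result is a starting point only; the spill metrics in the Spark UI should confirm whether the new value actually removes the spill.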
  • 21. I found Skew, now what? Symptom ● Ganglia CPU usage stays low for a long time after initial high usage ● Task duration: the max is significantly larger than the 75th and 25th percentile values ● Input Size/Records. What to do? ● Use broadcast join ● Use Skew Join ● Filter out large keys/salt keys and set up multiple reduce steps ● Explicitly repartition the data on a different field Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
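
The "salt keys" remedy on this slide can be sketched in plain Python. This is an illustration under assumptions, not code from the talk: the helper names, the `#` separator, the choice of 4 salts and the sample key distribution are all hypothetical. Hot keys on the large side get a random suffix so they spread over several partitions, and the matching keys on the small side are replicated once per salt so every salted key still finds its partner.

```python
import random
from collections import Counter

def salt_key(key, hot_keys, num_salts, rng):
    """Append a random salt to hot keys on the large side of the join."""
    if key in hot_keys:
        return f"{key}#{rng.randrange(num_salts)}"
    return key

def explode_small_side(key, hot_keys, num_salts):
    """Replicate a small-side key once per salt value."""
    if key in hot_keys:
        return [f"{key}#{i}" for i in range(num_salts)]
    return [key]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
large_keys = ["US"] * 1000 + ["IN"] * 10 + ["BR"] * 5  # "US" is the hot key
salted = [salt_key(k, {"US"}, 4, rng) for k in large_keys]
counts = Counter(salted)  # rows per join key after salting
small_side = explode_small_side("US", {"US"}, 4)
```

After the salted join or aggregation, a second reduce step strips the suffix and combines the partial results per original key, which is the "multiple reduce steps" part of the bullet.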
  • 22. Adaptive Query Execution ▪ Reduces the manual effort of tuning spark.sql.shuffle.partitions ▪ Turned off by default; set spark.sql.adaptive.enabled=true ▪ Dynamically changes sort-merge join into broadcast-hash join ▪ Dynamically optimizes skew joins *Available in DBR 7.x/Spark 3.0
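
As a small configuration sketch, the settings below switch on Adaptive Query Execution and two of its sub-features and render them as `spark-submit` arguments. `spark.sql.adaptive.enabled` comes from the slide; the other two keys are Spark 3.x configuration names added here for illustration, and the helper `to_submit_args` is hypothetical. Note that from Spark 3.2 onward AQE is enabled by default, so the first setting matters mainly on the 3.0 and 3.1 releases the slide refers to.

```python
# Spark SQL settings related to Adaptive Query Execution (Spark 3.x names).
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}

def to_submit_args(settings):
    """Render a settings dict as a list of --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args

submit_args = to_submit_args(AQE_SETTINGS)
print(" ".join(submit_args))
```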
  • 23. I found a lot of small files, now what? Upstream ● Fix the upstream application building tons of files ● Use a separate tool to compact them before processing with Spark. Changes in Spark application ● Write your own compaction job ● Delta solves this problem!
  • 25. Compaction ● Improves the Read Performance ● Solves Small Files problem Reference: https://docs.delta.io/latest/best-practices.html#compact-files
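
To show what a compaction job has to decide, the sketch below groups many small files into batches that each stay under a target output size, using a first-fit-decreasing heuristic. This is an assumed illustration and not how Delta Lake implements compaction; the function name, the 128 MB target and the file sizes are hypothetical.

```python
def plan_compaction(file_sizes, target_size):
    """Group file sizes into batches whose totals stay within target_size."""
    bins = []  # each entry is [running_total, [sizes in this batch]]
    for size in sorted(file_sizes, reverse=True):
        for current in bins:
            # First fit: reuse the first batch that still has room.
            if current[0] + size <= target_size:
                current[0] += size
                current[1].append(size)
                break
        else:
            # No batch has room (or the file alone exceeds the target).
            bins.append([size, [size]])
    return [sizes for _, sizes in bins]

# Hypothetical partition with eight files, sizes in MB, target 128 MB.
small_files = [60, 10, 50, 30, 20, 40, 5, 120]
plan = plan_compaction(small_files, target_size=128)
print(f"{len(small_files)} files -> {len(plan)} files")
```

Each batch in `plan` would be rewritten as one larger file, which is what reduces the per-file overhead on reads.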
  • 26. Auto Optimize: consists of two complementary features, Optimized Writes and Auto Compaction ● Optimizes Apache Spark partition sizes ● Maximizes the throughput of data being written ● Compacts files for partitions Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
  • 27. Z-Ordering: a technique to colocate related information in the same set of files ● Safely skips more data ● Faster queries [Figure: Z-order sorting on a grid with both axes numbered 0 to 7] Reference: https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering
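
The 0 to 7 grid on this slide can be reproduced with a few lines of Python. The sketch below computes a Z-order (Morton) value by interleaving the bits of two coordinates and then sorts the 8 x 8 grid by it. It illustrates the general space-filling-curve idea only; the internals of the Delta Lake implementation are not described in the talk, and the function name and the "16 cells per file" grouping are assumptions.

```python
def z_value(x, y, bits=3):
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits land on odd positions
    return z

# All cells of an 8 x 8 grid, sorted along the Z-order curve.
cells = [(x, y) for y in range(8) for x in range(8)]
ordered = sorted(cells, key=lambda cell: z_value(*cell))

# Hypothetical layout: write 16 consecutive cells to each file.
files = [ordered[i:i + 16] for i in range(0, len(ordered), 16)]
```

With this layout each file covers one 4 x 4 block of the grid, so a filter on either column can skip files whose min/max range does not overlap the predicate.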
  • 29. Data Governance with Delta Lake. Create a retention policy to age out and erase raw data that may contain personal information. High Level Aggregates (e.g. # of users that took an action). Historical Data Repository ● Easy to navigate ● Pseudonymization. Data Lake. Satisfy compliance requests using UPDATE / DELETE commands. Create tables that don't contain personal data. Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
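
Pseudonymization, mentioned on this slide, is often done by replacing an identifier with a keyed hash. The sketch below is one possible approach under stated assumptions, not the method from the talk: it uses HMAC-SHA256 from the Python standard library, and the function name, the sample e-mail address and the key are hypothetical. In practice the key would come from a secret store rather than from source code.

```python
import hashlib
import hmac

def pseudonymize(user_id, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical key; load it from a secret manager in a real pipeline.
key = b"example-secret-key"
token = pseudonymize("alice@example.com", key)
```

The same input and key always yield the same token, so joins and aggregates on the pseudonymized column still work, while the raw identifier never has to be stored in the downstream tables.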
  • 30. Audit & Monitoring ▪ Use cluster tags for chargeback ▪ Audit logs ▪ Monitor Databricks DBU usage ▪ Delta Transactional Logs
  • 31. Governance - The Who/What/Where. Many players in the org: managing access, roles and responsibilities, as well as managing usage, is a must. ● Data Engineer: performs standard extraction, transformation and loading (ETL) tasks and applies best coding practices including source control, unit tests, and automation ● Data Scientist: drives product innovation with state-of-the-art Machine Learning models applied to big data ● Data/Business Analyst: improves business processes by providing standardized and ad-hoc business analysis; acts as intermediary between the Analytics and Business teams ● Automated Jobs: perform automated jobs based on Data Engineering configs
  • 32. Bringing it together - A reference pipeline: Business Unit Serving Operations & Security Data Science & ML Ingest Orchestration CI/CD APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage
  • 33. Bringing it together - A reference pipeline: Business Unit Serving Operations & Security Data Science & ML Ingest Orchestration CI/CD APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage. Overlaid themes: Data Strategy, Cost Optimization & Performance Tuning, Business Value, Security
  • 35. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.