Building Data Quality pipelines with Apache Spark and Delta Lake
Sandy May & Darren Fuller
Lead Data Engineers
Elastacloud
Speaker Bio
Sandy May - @spark_spartan
Databricks Champion
Data Science London Co-Organizer
Tech speaker across the UK
Passionate about Apache Spark,
Databricks, AI, Data Security and
Reporting platforms in Microsoft
Azure
Speaker Bio
Darren Fuller - @dazfuller
Databricks Champion
Tech speaker across the UK
Passions include Apache Spark,
Microsoft Azure, Raspberry Pi
Agenda
Sandy May
What is the problem? What do we
need? How can we make it easy to
use?
Darren Fuller
How can we investigate? Where
do we go from here? What have
we learnt?
Data Quality Overview
What is the problem?
• Harvard Business Review suggested Dirty Data cost US companies $3
trillion in 2017
• Business data is hard to clean generically because it often requires
domain knowledge
• Dirty Data can be frustrating for Data Scientists and BI Engineers
• In the worst case, Dirty Data can produce incorrect reports and predictions,
leading to potentially significant losses
@dazfuller @spark_spartan
Should we Build or Buy?
Build
§ Own the IP
§ Prioritise the features you want
§ Built for your use case
§ No licence fees
§ Use your core technology
Buy
§ May have track record
§ Bugs fixed by vendor
§ Features not thought about by business
§ Service Level Agreements
Key Design Decisions
▪ Support to Run Cross Cloud
▪ Use Native tools in Azure and AWS
▪ Easy for SQL Devs to write rules
▪ Single Reporting Platform
▪ Capability to reuse custom business rules
▪ Run as part of our Data Ingestion Pipelines with Delta Lake
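One way to picture "easy for SQL Devs to write rules" and "reuse custom business rules" is to store each rule as a named SQL predicate and run the whole list against a table. The sketch below is an illustration only, not the speakers' implementation: the rule names, the `orders` table and the `run_rules` helper are all hypothetical, and it uses Python's built-in `sqlite3` as a stand-in SQL engine so that it runs anywhere. In a Spark pipeline the same predicate strings could be applied to a DataFrame read from Delta Lake with `DataFrame.filter`, which accepts SQL expression strings.

```python
import sqlite3

# Hypothetical rule set: each rule is a name plus a SQL predicate that is
# TRUE for a valid row. Rules are plain data, so they can be reused across tables.
RULES = [
    ("customer_id_present", "customer_id IS NOT NULL"),
    ("amount_non_negative", "amount >= 0"),
    ("country_is_iso2", "length(country) = 2"),
]


def run_rules(conn, table, rules):
    """Return {rule_name: failing_row_count} for every rule.

    The table name and predicates are interpolated into SQL, so they are
    assumed to come from trusted configuration, not from user input.
    """
    results = {}
    for name, predicate in rules:
        # CASE treats a NULL predicate result as a failure, which a plain
        # WHERE NOT (predicate) would silently skip.
        sql = (
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE CASE WHEN ({predicate}) THEN 0 ELSE 1 END = 1"
        )
        results[name] = conn.execute(sql).fetchone()[0]
    return results


# Illustrative sample data (assumption): four orders, three of them dirty.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "GB"),      # clean row
        (2, -5.0, "US"),      # negative amount
        (None, 3.0, "GBR"),   # missing customer, 3-letter country
        (4, None, "FR"),      # missing amount
    ],
)

report = run_rules(conn, "orders", RULES)
for rule_name, failures in report.items():
    print(f"{rule_name}: {failures} failing row(s)")
```

Because every rule produces the same shape of result (a name and a failure count), the output of many tables and many runs can be appended to one table and surfaced in a single reporting platform.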
Enterprise Data Warehouse
Let’s Build it!
Summing Up
Conclusions
▪ Building can be quick and effective
▪ Prioritise your own business needs; you know your data best
▪ Can be used as a stop-gap while you create a service for an off-the-shelf
product
▪ Easy to run as part of ingestion pipelines
▪ Business value in reports and reuse of rules
▪ Use Delta Lake for Schema Evolution
Thanks for listening!
Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
