1
© 2020 The MathWorks, Inc.
Processing Large Datasets for ADAS applications
using Spark
Arvind Hosagrahara (ahosagra@mathworks.com)
Lead Solutions Architect
May 2021
2
Outline
§ The Big Picture
– Problem Statement / Challenges
§ Architecture of the system (ETL, Training, Inference)
§ Deep-Dive into Spark Workflows
– Cluster Management
– Interactive exploration
– Pushdown of analytics to Spark Clusters
– Experiment and metadata management
§ Design considerations
– Scaling, Accuracy and Interoperability
– Security, Governance, Model Lifecycle management
§ Conclusions / Key Takeaways
§ Q&A
3
The Big Picture – Automated Driving
How can I analyze & simulate scenarios?
How can I design & deploy algorithms?
How can I integrate & test systems?
Control / Planning / Perception
4
Semantic Segmentation
What?
§ Semantic segmentation is the classification
of every pixel in an image/video
Why?
§ Semantic segmentation: higher-quality
perception outcomes, and easier analysis and
validation of localization modules for automated driving (AD).
§ Automation: Faster design iterations
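As a minimal illustration of "classification of every pixel" (a sketch, not the network used in the talk), the snippet below assigns each pixel the class with the highest score. The class names and the `scores[class][row][col]` layout are assumptions made for this example.

```python
# Hypothetical label set; the slides do not list the classes used.
CLASSES = ["road", "vehicle", "sky"]


def segment(scores):
    """Return a label map holding one class index per pixel.

    scores[c][y][x] is the score of class c at pixel (y, x).
    """
    height = len(scores[0])
    width = len(scores[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pick the class whose score is largest at this pixel.
            best = max(range(len(scores)), key=lambda c: scores[c][y][x])
            row.append(best)
        labels.append(row)
    return labels


# A 2x2 "image" with one made-up score map per class.
scores = [
    [[0.90, 0.20], [0.80, 0.10]],  # road
    [[0.05, 0.70], [0.10, 0.20]],  # vehicle
    [[0.05, 0.10], [0.10, 0.70]],  # sky
]
label_map = segment(scores)
named = [[CLASSES[i] for i in row] for row in label_map]
print(named)  # [['road', 'vehicle'], ['road', 'sky']]
```

A real network produces the score maps; the per-pixel argmax step is the part that turns them into a segmentation.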
5
“…autonomous vehicles would have
to be driven hundreds of millions of
miles and sometimes hundreds of
billions of miles to demonstrate their
reliability in terms of fatalities and
injuries.” [1]
[1] Driving to Safety, Rand Corporation
https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf
6
Problem Statement
How can we:
1. Accelerate development of semantic segmentation
models from prototype to production in order to build better
perception software
2. Scale the workflows, capabilities and throughput of
engineering processes by leveraging automation, cloud
storage, compute and other infrastructure
7
Challenges
Business Challenges
§ Prototype to production: establish system & software engineering processes
§ Verification and validation: connect data, embedded software and simulation
§ Workforce development: hire & integrate engineers with disparate
backgrounds
Technical Challenges
§ Handling large datasets and specialized formats
§ Leveraging the cloud to scale computational requirements
§ Enabling self-serve analytics for the data scientist / engineer
8
Demonstration / Results
(Image-level features with batch normalization)
9
10
Demo / Results
3-D simulation data to train a semantic segmentation
network and fine-tune it to real-world data using
generative adversarial networks (GANs)
11
12
Big Picture: Architecture
13
Architecture of the demo system
[Diagram, built up over slides 13 to 15. Components shown:]
§ Recorded and Simulated Data (Batch and Streaming Ingest)
§ Storage (Large Datasets / Models / Metadata)
§ Development Environment (Desktop / Cloud)
§ Cloud-based training system (Spark Clusters / GPU support / Fast networks)
§ Model Inference Engine for ETL and Performance Monitoring
§ Monitoring / Visualization Front-ends
§ Experiment Management / Governance / Security
[Connector labels: Ingest, Burst, Data access]
16
A simplified view of the system architecture
[Diagram: simplified layout of the system. Components: Recorded and Simulated Data (Batch and Streaming Ingest); Storage (Large Datasets / Models / Metadata); Development Environment (Desktop / Cloud); Cloud-based training system (Spark Clusters / GPU support / Fast networks); Model Inference Engine for ETL and Performance Monitoring; Monitoring / Visualization Front-ends; Experiment Management / Governance / Security. Connector labels: Ingest, Burst, Data access]
17
The ETL (Extract Transform Load) Pipeline
[Diagram: the ETL portion of the architecture. Components: Recorded and Simulated Data (Batch and Streaming Ingest); Model Inference Engine for ETL and Performance Monitoring; Storage (Large Datasets / Models / Metadata); Experiment Management / Governance / Security. Connector label: Ingest]
18
The Training Pipeline
[Diagram: the training portion of the architecture. Components: Development Environment (Desktop / Cloud); Cloud-based training system (Spark Clusters / GPU support / Fast networks); Storage (Large Datasets / Models / Metadata); Experiment Management / Governance / Security. Connector labels: Burst, Ingest, Data access]
19
The Inference Pipeline
[Diagram: the inference portion of the architecture. Components: Model Inference Engine for ETL and Performance Monitoring; Monitoring / Visualization Front-ends; Storage (Large Datasets / Models / Metadata); Experiment Management / Governance / Security. Connector labels: Burst, Ingest, Data access]
20
The Software Stack
[Diagram, slides 20 and 21: the system architecture annotated with the products used]
§ MATLAB® and Simulink®
§ MATLAB Production Server™
§ MATLAB Web App Server™
§ Third-party (3P) dashboards
§ Amazon S3™
§ AWS Identity and Access Management (IAM)
§ AWS™
§ MLflow™
22
The DevOps perspective
23
Workflows supporting DevOps
[Diagram: DevOps loop spanning the Development Environment and the Operational Environment, with the phases PLAN, DESIGN, BUILD, TEST, DEPLOY, OPERATE, MONITOR and FEEDBACK]
Workflows shown around the loop:
§ Cluster Management
§ JDBC based connectivity to data sources
§ Databricks-connect and interactive exploration
§ CI/CD integration
§ Jobs and Spark-Submit
§ Model Lifecycle Management
§ Experiment Management
§ Coder based acceleration
§ Embedded Applications
§ Inference engine
§ Central result, config, data management
24
Deep-Dive: Spark Workflows
25
Workflow: Cluster management and Data Engineering via REST
[Diagram: Engineers, Developers, Scientists and Data Analysts use jobsubmission.m and the MATLAB HTTP Client to reach a Compute Cluster (Cloud/On-Prem)]
MATLAB Integration (REST 2.0 API)
• Create Clusters
• CRUD with DBFS
• Create Jobs
• Authentication / Tokens
• Secrets
• Notifications
• Libraries
• Workspace
[Cluster-side labels: Databricks specific runtime image, MATLAB Runtime, init_script, Dockerfile, dockerized runtime, analytics.jar]
26
Video 1 Outline
§ Planning capacity and performance
§ Cluster creation
§ Data and application transfer via the DBFS API
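The cluster creation and DBFS transfer steps can be sketched as plain REST calls. The talk drives them from MATLAB; the Python below is only an illustration of the request shape, and it builds the requests without sending them. The workspace URL, token, cluster name, node type, runtime version and file path are placeholders chosen for this example.

```python
import base64
import json
import urllib.request


def databricks_request(host, token, endpoint, payload):
    """Build (but do not send) a POST request for the Databricks REST API 2.0."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace URL and personal access token (assumptions).
HOST = "https://example.cloud.databricks.com"
TOKEN = "dapi-placeholder"

# Cluster creation: name, runtime version, node type and size.
create_cluster = databricks_request(HOST, TOKEN, "clusters/create", {
    "cluster_name": "adas-etl",            # hypothetical name
    "spark_version": "7.3.x-scala2.12",    # pick one offered by the workspace
    "node_type_id": "i3.xlarge",           # placeholder instance type
    "num_workers": 4,
})

# DBFS transfer: small files can be sent inline as base64 content.
upload_app = databricks_request(HOST, TOKEN, "dbfs/put", {
    "path": "/FileStore/jars/analytics.jar",  # hypothetical target path
    "contents": base64.b64encode(b"placeholder bytes").decode("ascii"),
    "overwrite": True,
})

# To actually send a request (requires a real workspace and token):
# with urllib.request.urlopen(create_cluster) as response:
#     print(json.load(response))
```

Inline `dbfs/put` uploads are size-limited, so larger artifacts would typically go through the streaming `dbfs/create`, `dbfs/add-block` and `dbfs/close` calls instead.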
27
Workflow: JDBC and Database toolbox (a)
[Diagram: Engineers, Developers, Scientists and Data Analysts issue SQL from Database Toolbox through the ODBC 2.6.x / JDBC 2.6.x driver to an Existing Compute Cluster (Cloud/On-Prem)]
28
Workflow: JDBC and Database toolbox (b)
[Diagram: Engineers, Developers, Scientists and Data Analysts connect through the ODBC 2.6.x / JDBC 2.6.x driver to an Existing Compute Cluster (Cloud/On-Prem). Cluster-side labels: SparkSession, SQL, Databases / Delta Tables]
29
Workflow: Databricks-connect
[Diagram: Engineers, Developers, Scientists and Data Analysts use Databricks-connect and the Spark API against an Existing Compute Cluster (Cloud/On-Prem). Cluster-side labels: Spark DAG, Spark Master]
30
Video 2 Outline
§ Databricks-Connect workflows
§ Read/Write delta tables
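Reading and writing Delta tables through a remote Spark session might look like the sketch below. It uses the Spark DataFrame reader and writer calls from PySpark for illustration (the talk itself works from MATLAB); the table paths and the `confidence` column are hypothetical.

```python
def copy_confident_rows(spark, source_path, target_path, min_confidence):
    """Read a Delta table, keep confident rows, and write them as a new Delta table.

    `spark` is a SparkSession, for example one obtained through
    Databricks Connect so that the work runs on the remote cluster.
    """
    frames = spark.read.format("delta").load(source_path)
    # `confidence` is a hypothetical column name used for illustration.
    confident = frames.filter(f"confidence >= {min_confidence}")
    confident.write.format("delta").mode("overwrite").save(target_path)
    return confident


# Usage sketch (needs PySpark and a configured cluster connection):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# copy_confident_rows(spark, "/mnt/adas/labels", "/mnt/adas/labels_clean", 0.5)
```

Taking the session as an argument keeps the function independent of how the connection to the cluster is configured.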
31
The Training Pipeline
32
The Training Pipeline (Local Experimentation)
§ Optimized for GPU
§ High-level API (“write less and do
more”)
§ Language abstractions to improve
accuracy (eg: imageDataAugmenter)
§ Workflows for ease-of-use and
debug
– apps for model selection
– monitor accuracy, loss and training
performance
– many quick-start examples
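For segmentation data, augmentation has to transform the image and its label mask together. The sketch below shows that idea with a random horizontal flip; it is a generic illustration in Python, not the behaviour of `imageDataAugmenter`, and the nested-list image layout is an assumption.

```python
import random


def random_flip(image, mask, rng):
    """Flip an image and its label mask left-to-right with probability 0.5.

    Both are lists of rows; applying the same flip to each keeps every
    pixel aligned with its label.
    """
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    return image, mask


image = [[10, 20, 30], [40, 50, 60]]   # made-up pixel values
mask = [[0, 0, 1], [0, 1, 1]]          # made-up class indices
augmented_image, augmented_mask = random_flip(image, mask, random.Random())
```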
33
Workflow: Push-down via MATLAB Compiler
[Diagram: Engineers, Developers, Scientists and Data Analysts compile analytics.m with MATLAB Compiler/SDK into analytics.jar, which runs on a Compute Cluster (Cloud/On-Prem). Labels: spark-submit job for New Clusters; Libraries for Existing Clusters; MATLAB API for Spark; Tall Arrays; Spark DAG; Spark Master]
34
Video 3 Outline
§ SparkSession and RDD abstractions
§ Tall-based workflows
§ Deploytool and app based workflows
– Compiler SDK for Java
§ mcc
– Spark Submit
§ DBFS API to push application to Databricks
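Once the compiled application is on DBFS, a spark-submit run on a new cluster can be requested through the Jobs API. The sketch below only builds the JSON body for a one-time run (`POST /api/2.0/jobs/runs/submit`); the main class, paths, run name and cluster sizing are hypothetical values chosen for illustration.

```python
import json


def spark_submit_run(jar_dbfs_path, main_class, app_args, num_workers):
    """Build the JSON body for a one-time spark-submit run on a new cluster."""
    return {
        "run_name": "matlab-analytics",          # hypothetical run name
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # pick one offered by the workspace
            "node_type_id": "i3.xlarge",         # placeholder instance type
            "num_workers": num_workers,
        },
        "spark_submit_task": {
            # Same argument order as the spark-submit command line.
            "parameters": ["--class", main_class, jar_dbfs_path, *app_args],
        },
    }


body = spark_submit_run(
    jar_dbfs_path="dbfs:/FileStore/jars/analytics.jar",  # hypothetical path
    main_class="com.example.Analytics",                  # hypothetical class
    app_args=["s3://example-bucket/drive-logs/"],        # hypothetical input
    num_workers=8,
)
print(json.dumps(body, indent=2))
```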
35
Build and Test
§ Improve quality of code and
models by testing on CI Servers
(self-hosted or Cloud-hosted)
§ Package libraries for deployment
§ Adhere to agile development
practices
https://www.mathworks.com/solutions/continuous-integration.html
36
The Training Pipeline (Cloud-based training)
§ Infrastructure defined as code
§ Leverage the latest hardware,
optimized drivers and toolkits
§ APIs for command and control of
Jobs, artifacts, notifications,
execution, etc.
§ Capital Expenses (CapEx) →
Operating Expenses (OpEx)
37
Management of Experiments and tracking of performance
38
Video 4 Outline
§ MLflow Demo
– Creation of Experiments / Runs
– Logging of Parameters and Metrics
– Integration with Experiment Management features
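The experiment, run, parameter and metric steps map onto four tracking calls. The sketch below writes them against a `tracker` object so it can be tried offline; passing the `mlflow` Python module as the tracker should work the same way, since it exposes `set_experiment`, `start_run`, `log_param` and `log_metric`. The experiment name and all numbers are made up for illustration and are not results from the talk.

```python
from contextlib import contextmanager


def log_training_run(tracker, experiment, params, history):
    """Log one training run: parameters once, metrics once per epoch."""
    tracker.set_experiment(experiment)
    with tracker.start_run():
        for name, value in params.items():
            tracker.log_param(name, value)
        for epoch, metrics in enumerate(history):
            for name, value in metrics.items():
                tracker.log_metric(name, value, step=epoch)


class DryRunTracker:
    """Stand-in with the same four calls, for trying the flow without a server."""

    def __init__(self):
        self.records = []

    def set_experiment(self, name):
        self.records.append(("experiment", name))

    @contextmanager
    def start_run(self):
        self.records.append(("run", "start"))
        yield self
        self.records.append(("run", "end"))

    def log_param(self, name, value):
        self.records.append(("param", name, value))

    def log_metric(self, name, value, step=None):
        self.records.append(("metric", name, value, step))


tracker = DryRunTracker()
log_training_run(
    tracker,
    experiment="semantic-segmentation",  # hypothetical experiment name
    params={"network": "vgg16", "learning_rate": 0.001},
    history=[  # made-up metric values
        {"loss": 1.2, "accuracy": 0.61},
        {"loss": 0.8, "accuracy": 0.74},
    ],
)

# With MLflow installed and a tracking server configured:
# import mlflow
# log_training_run(mlflow, "semantic-segmentation", params, history)
```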
39
Training Benchmarks and Performance
VGG-16 based network on
NVIDIA Quadro GV100
40
The Inference Pipeline
41
The Inference Pipeline
§ Directed Acyclic Graphs (DAG)
used to resolve changes to:
– Training models, ETL changes
– The training data
– New data / changes to the cache
§ Trained Models can be deployed
as cloud services
§ Cloud Reference Architectures
available on github.com to
integrate with existing systems
42
Scaling Inference on the MATLAB Production Server
§ Scales gracefully
§ Optimal usage of
available CPU and GPU
resources
§ Can increase throughput
as desired
§ PAYG models available
on the cloud
43
The ETL (Extract-Transform-Load) Pipeline
44
The ETL Pipeline
§ Wide variety of Automotive and Engineering
Formats (eg: ROSBag, MP4, MDF, BLF, PCAP)
§ RESTful endpoints or Client based access
§ Connectivity to Cloud Storage (S3) and distributed
streaming systems (Kafka)
§ Horizontally scalable and designed to run
24x7x365 on headless cloud systems
§ Provides data proximity when executing on the
cloud
§ Accelerated CUDA execution on NVIDIA® via GPU
Coder™
45
Putting it all together
46
Build Graph (DAG)
§ Engine code
§ Trained models
§ New/Cached Data
§ Operates on S3 version metadata
§ Can be triggered manually or
automatically (eg: via CI/CD
pipelines)
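One way to picture the build graph: each derived artifact lists its inputs, and a change in any input's version marks everything downstream for rebuild. The sketch below is an assumption about how such a check could work, not the implementation shown in the talk; the node names and version strings are hypothetical stand-ins for engine code, models, data and their S3 version metadata.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def stale_targets(deps, built_versions, current_versions):
    """Return the derived nodes to rebuild, in dependency order.

    deps maps each derived node to the set of nodes it depends on.
    The two version dicts map source nodes to version identifiers
    (for example, object version IDs reported by the storage layer).
    """
    changed = {
        name
        for name, version in current_versions.items()
        if built_versions.get(name) != version
    }
    dirty = set(changed)
    rebuild = []
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(deps).static_order():
        if any(dep in dirty for dep in deps.get(node, ())):
            dirty.add(node)
            rebuild.append(node)
    return rebuild


deps = {
    "etl_cache": {"engine_code", "raw_data"},
    "trained_model": {"etl_cache", "training_code"},
    "inference_results": {"trained_model", "new_data"},
}
built = {"engine_code": "v1", "raw_data": "v1", "training_code": "v1", "new_data": "v1"}

# Only new data arrived: just the inference step is stale.
print(stale_targets(deps, built, {**built, "new_data": "v2"}))
# -> ['inference_results']

# The raw data changed: the cache, the model and the results are all stale.
print(stale_targets(deps, built, {**built, "raw_data": "v2"}))
# -> ['etl_cache', 'trained_model', 'inference_results']
```

A CI/CD pipeline or a manual trigger would then run only the returned steps.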
47
Design Considerations
48
Scaling considerations
Storage
§ Cloud based storage allows
ingestion of larger datasets
Compute
§ Quicker training iterations both
locally and on the cloud
§ Faster inference pipeline
Cost
§ CapEx → OpEx
Workflows
§ Self-serve analytics optimized for
agility along a maturity framework
“Glue code and pipeline jungles are symptomatic of
integration issues that may have a root cause in
overly separated “research” and “engineering”
roles…
… [a hybrid research approach where] engineers and researchers are embedded together
on the same teams (and indeed, are often the same
people) can help reduce this source of friction
significantly.” [1]
[1] The Hidden Technical Debt of Machine Learning Systems
https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
49
Accuracy considerations
§ More Training Data
§ Simulation / Scene generation
– RoadRunner
[Image label: Simulation]
50
Interop considerations
§ ONNX interop
§ Transfer learning techniques
§ Import/Export to
– OpenDRIVE®,
– FBX®,
– glTF™,
– OpenFlight,
– OpenSceneGraph,
– OBJ, and USD formats.
§ Automated driving simulators
and game engines (eg: CARLA, VIRES VTD, NVIDIA DRIVE Sim®,
Metamoto®, LGSVL, Baidu Apollo®, Unity®, and Unreal Engine®)
51
Embedded Targeting
§ GPU Coder™ Support Package
for NVIDIA GPUs
§ NVIDIA® DRIVE™ and Jetson
hardware platforms
52
Governance and Lifecycle management
§ Local Experimentation
§ Centralized Tracking
53
Security Considerations
§ Cloud Storage Best Practices
§ ACL based permissions
§ IAM based identity and access
54
Conclusions
§ Well-architected systems [1] accelerate development of automated
semantic segmentation against large datasets
§ Local development provides smooth workflows for the development and
refinement of deep learning models
§ Cloud-based scaling of compute and storage can be leveraged on
Databricks to enable self-service analytics
§ Techniques that are relevant for ADAS development are equally applicable
across other domains such as medical, geo-exploration, etc.
[1] AWS Well-Architected
https://aws.amazon.com/architecture/well-architected/
55
Key Takeaways
§ Build upon proven, safety compliant, and DevOps-ready development tools
§ Leveraging a best-in-class simulation integration platform and modeling tools
leads to faster time-to-market for ADAS development
§ Domain specific tooling allows practitioners to write less and do more
§ If this is interesting, please contact us at databricks@mathworks.com for
more details or to get started.
56
Q&A
Contact us at: databricks@mathworks.com
58

More Related Content

What's hot (20)

Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Sandesh Rao
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGate
Jeffrey T. Pollock
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
宇 傅
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
Edureka!
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
Databricks
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
Dat Tran
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
Sandesh Rao
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGate
Jeffrey T. Pollock
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
宇 傅
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
PySpark Training | PySpark Tutorial for Beginners | Apache Spark with Python ...
Edureka!
 

Similar to Processing Large Datasets for ADAS Applications using Apache Spark (20)

A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
DataWorks Summit
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Databricks
 
Power Software Development with Apache Spark
Power Software Development with Apache SparkPower Software Development with Apache Spark
Power Software Development with Apache Spark
OpenPOWERorg
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
gmalouf678
 
03_aiops-1.pptx
03_aiops-1.pptx03_aiops-1.pptx
03_aiops-1.pptx
FarazulHoda2
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
DESMOND YUEN
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
Jesus Rodriguez
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
GoDataDriven
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...
Indrajit Poddar
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
DataBench
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
Andy Petrella
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
David Talby
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Databricks
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
Andy Petrella
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
DataWorks Summit
 
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache SparkProject Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
Databricks
 
Power Software Development with Apache Spark
Power Software Development with Apache SparkPower Software Development with Apache Spark
Power Software Development with Apache Spark
OpenPOWERorg
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
gmalouf678
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
DESMOND YUEN
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
Jesus Rodriguez
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
Durga Gadiraju
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
GoDataDriven
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...
Indrajit Poddar
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
Jan Wiegelmann
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
DataBench
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
Venkata Naga Ravi
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
Andy Petrella
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
David Talby
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Databricks
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
Andy Petrella
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Ad

Processing Large Datasets for ADAS Applications using Apache Spark

  • 1. 1 © 2020 The MathWorks, Inc. Processing Large Datasets for ADAS applications using Spark Arvind Hosagrahara (ahosagra@mathworks.com) Lead Solutions Architect May 2021
  • 2. 2 Outline § The Big Picture – Problem Statement / Challenges § Architecture of the system (ETL, Training, Inference) § Deep-Dive into Spark Workflows – Cluster Management – Interactive exploration – Pushdown of analytics to Spark Clusters – Experiment and metadata management § Design considerations – Scaling, Accuracy and Interoperability – Security, Governance, Model Lifecycle management § Conclusions / Key Takeaways § Q&A
  • 3. 3 The Big Picture – Automated Driving How can I analyze & simulate scenarios? How can I design & deploy algorithms? How can I integrate & test systems? Control Planning Perception
  • 4. 4 Semantic Segmentation What? § Semantic segmentation is the classification of every pixel in an image/video Why? § Semantic segmentation: Higher quality perception outcomes, easier analysis and validation of localization modules for AD. § Automation: Faster design iterations
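The "classification of every pixel" idea on slide 4 can be sketched in a few lines. This is a minimal illustration in Python, not the MATLAB workflow the talk uses; the class names and the per-pixel scores are made up for the example, and a real network would produce the scores.

```python
from typing import List

# Hypothetical class names, chosen only for illustration.
CLASS_NAMES = ["road", "vehicle", "sky"]


def segment(scores: List[List[List[float]]]) -> List[List[int]]:
    """Return a label map: for each pixel, the index of its highest class score.

    scores[r][c] holds one score per class for the pixel at row r, column c.
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


def pixel_accuracy(pred: List[List[int]], truth: List[List[int]]) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    total = sum(len(row) for row in truth)
    correct = sum(p == t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return correct / total


# A made-up 2x2 "image" with three class scores per pixel.
scores = [
    [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]],
    [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]],
]
labels = segment(scores)
named = [[CLASS_NAMES[i] for i in row] for row in labels]
```

The output has the same height and width as the input, with one class index per pixel, which is what distinguishes segmentation from whole-image classification.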
  • 5. 5 “…autonomous vehicles would have to be driven hundreds of millions of miles and sometimes hundreds of billions of miles to demonstrate their reliability in terms of fatalities and injuries.” [1] [1] Driving to Safety, Rand Corporation https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf
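The order of magnitude in the quote on slide 5 can be reproduced with a standard zero-failure reliability argument. The sketch below is an illustration under stated assumptions, not a restatement of the report's method: it assumes fatalities follow a Poisson process, and the benchmark rate of 1.09 fatalities per 100 million miles is an assumed input. If the true rate is R per mile, the probability of seeing no fatalities in n miles is exp(-R·n), so demonstrating a rate no worse than R with confidence C needs n ≥ -ln(1 - C) / R.

```python
import math


def failure_free_miles(rate_per_mile: float, confidence: float) -> float:
    """Miles that must be driven with zero events to show, at the given
    confidence, that the true event rate is no higher than rate_per_mile."""
    return -math.log(1.0 - confidence) / rate_per_mile


# Assumed benchmark rate for illustration: 1.09 fatalities per 100 million miles.
ASSUMED_FATALITY_RATE = 1.09 / 100_000_000

miles = failure_free_miles(ASSUMED_FATALITY_RATE, confidence=0.95)
print(f"{miles / 1e6:.0f} million miles")
```

Under these assumptions the result is roughly 275 million failure-free miles, which is consistent with the "hundreds of millions of miles" in the quote and motivates the use of simulated data later in the deck.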
  • 6. 6 Problem Statement How can we: 1. Accelerate development of semantic segmentation models from prototype to production in order to build better perception software 2. Scale the workflows, capabilities and throughput of engineering processes by leveraging automation, cloud storage, compute and other infrastructure
  • 7. 7 Challenges Business Challenges § Prototype to production: establish system & software engineering processes § Verification and validation: connect data, embedded software and simulation § Workforce development: hire & integrate engineers with disparate backgrounds Technical Challenges § Handling large datasets and specialized formats § Leveraging the cloud to scale computational requirements § Enabling self-serve analytics for the data scientist / engineer
  • 8. 8 Demonstration / Results (Image-level features with batch normalization)
  • 10. 10 Demo / Results 3-D simulation data to train a semantic segmentation network and fine-tune it to real-world data using generative adversarial networks (GANs)
  • 13. 13 Architecture of the demo system Experiment Management / Governance / Security Development Environment (Desktop / Cloud) Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Monitoring / Visualization Front-ends Cloud-based training system (Spark Clusters / GPU support / Fast networks) Recorded and Simulated Data (Batch and Streaming Ingest) Burst Ingest Data access
  • 16. 16 A simplified view of the system architecture Experiment Management / Governance / Security Development Environment (Desktop / Cloud) Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Monitoring / Visualization Front-ends Cloud-based training system (Spark Clusters / GPU support / Fast networks) Recorded and Simulated Data (Batch and Streaming Ingest) Burst Ingest Data access
  • 17. 17 The ETL (Extract Transform Load) Pipeline Experiment Management / Governance / Security Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Recorded and Simulated Data (Batch and Streaming Ingest) Ingest
  • 18. 18 The Training Pipeline Experiment Management / Governance / Security Development Environment (Desktop / Cloud) Storage (Large Datasets / Models / Metadata) Cloud-based training system (Spark Clusters / GPU support / Fast networks) Burst Ingest Data access
  • 19. 19 The Inference Pipeline Experiment Management / Governance / Security Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Monitoring / Visualization Front-ends Burst Ingest Data access
  • 20. 20 The Software Stack Experiment Management / Governance / Security Development Environment (Desktop / Cloud) Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Monitoring / Visualization Front-ends Cloud-based training system (Spark Clusters / GPU support / Fast networks) Recorded and Simulated Data (Batch and Streaming Ingest) Burst Ingest Data access
  • 21. 21 The Software Stack Experiment Management / Governance / Security Development Environment (Desktop / Cloud) Storage (Large Datasets / Models / Metadata) Model Inference Engine for ETL and Performance Monitoring Monitoring / Visualization Front-ends Cloud-based training system (Spark Clusters / GPU support / Fast networks) Recorded and Simulated Data (Batch and Streaming Ingest) Burst Ingest Data access MATLAB® and Simulink® MATLAB Production Server™ MATLAB Web App Server™ Third-party (3P) Dashboards Amazon S3™ AWS Identity and Access Management (IAM) AWS™ MLflow™
  • 23. 23 Workflows supporting DevOps Development Environment Operational Environment TEST OPERATE DEPLOY MONITOR DESIGN BUILD PLAN FEEDBACK Cluster Management JDBC based connectivity to data sources Databricks-connect and interactive exploration CI/CD integration Jobs and Spark-Submit Model Lifecycle Management Experiment Management Coder based acceleration Embedded Applications Inference engine Central result, config, data management
  • 25. 25 Workflow: Cluster management and Data Engineering via REST Engineers Developers Scientists Data Analyst Compute Cluster (Cloud/On-Prem) jobsubmission.m MATLAB HTTP Client MATLAB Integration (REST 2.0 API) • Create Clusters • CRUD with DBFS • Create Jobs • Authentication / Tokens • Secrets • Notifications • Libraries • Workspace Databricks specific runtime image MATLAB Runtime init_script Dockerfile dockerized runtime analytics.jar
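The "Create Clusters" step on slide 25 goes through the Databricks REST 2.0 API. The talk drives this from a MATLAB HTTP client; the sketch below uses Python only for illustration and is not the MathWorks integration itself. It builds, without sending, a request to the `clusters/create` endpoint. The workspace URL, token, runtime version and node type are placeholders, and real values depend on the workspace and cloud provider.

```python
import json
import urllib.request


def build_create_cluster_request(
    host: str,
    token: str,
    cluster_name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int,
) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks Databricks to create a cluster."""
    payload = {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # personal access token
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are placeholders for illustration.
request = build_create_cluster_request(
    host="https://example-workspace.cloud.databricks.com",
    token="dapi-EXAMPLE-TOKEN",
    cluster_name="adas-etl",
    spark_version="7.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
)
# To actually submit it: urllib.request.urlopen(request)
```

Keeping request construction separate from submission makes the payload easy to inspect and test before any cluster, and its cost, is created.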
  • 26. 26 Video 1 Outline § Planning capacity and performance § Cluster creation § Data and application transfer via the DBFS API
  • 27. 27 Workflow: JDBC and Database toolbox (a) Engineers Developers Scientists Data Analyst Existing Compute Cluster (Cloud/On-Prem) DatabaseToolbox ODBC 2.6.x JDBC 2.6.x Driver SQL
  • 28. 28 Workflow: JDBC and Database toolbox (b) Engineers Developers Scientists Data Analyst Existing Compute Cluster (Cloud/On-Prem) ODBC 2.6.x JDBC 2.6.x Driver SparkSession SQL Databases / Delta Tables
  • 29. 29 Workflow: Databricks-connect Engineers Developers Scientists Data Analyst Existing Compute Cluster (Cloud/On-Prem) Databricks-connect Spark DAG Spark Master Spark API
  • 30. 30 Video 2 Outline § Databricks-Connect workflows § Read/Write delta tables
  • 32. 32 The Training Pipeline (Local Experimentation) § Optimized for GPU § High-level API (“write less and do more”) § Language abstractions to improve accuracy (eg: imageDataAugmenter) § Workflows for ease-of-use and debug – apps for model selection – monitor accuracy, loss and training performance – many quick-start examples
  • 33. 33 Workflow: Push-down via MATLAB Compiler Engineers Developers Scientists Data Analyst Compute Cluster (Cloud/On-Prem) spark-submit job for New Clusters analytics.jar Spark DAG Spark Master analytics.m MATLAB Compiler/SDK MATLAB API for Spark Tall Arrays Libraries for Existing Clusters
  • 34. 34 Video 3 Outline § SparkSession and RDD abstractions § Tall-based workflows § Deploytool and app based workflows – Compiler SDK for Java § mcc – Spark Submit § DBFS API to push application to Databricks
  • 35. 35 Build and Test § Improve quality of code and models by testing on CI Servers (self-hosted or Cloud-hosted) § Package libraries for deployment § Adhere to agile development practices https://www.mathworks.com/solutions/continuous-integration.html
  • 36. 36 The Training Pipeline (Cloud-based training) § Infrastructure defined as code § Leverage the latest hardware, optimized drivers and toolkits § APIs for command and control of Jobs, artifacts, notifications, execution, etc. § Capital Expenses (CapEx) → Operating Expenses (OpEx)
  • 37. 37 Management of Experiments and tracking of performance
  • 38. 38 Video 4 Outline § MLflow Demo – Creation of Experiments / Runs – Logging of Parameters and Metrics – Integration with Experiment Management features
  • 39. 39 Training Benchmarks and Performance VGG-16 based network on NVIDIA Quadro GV100
  • 41. 41 The Inference Pipeline § Directed Acyclic Graphs (DAG) used to resolve changes to: – Training models, ETL changes – The training data – New data / changes to the cache § Trained Models can be deployed as cloud services § Cloud Reference Architectures available on github.com to integrate with existing systems
  • 42. 42 Scaling Inference on the MATLAB Production Server § Scales gracefully § Optimal usage of available CPU and GPU resources § Can increase throughput as desired § PAYG models available on the cloud
  • 44. 44 The ETL Pipeline § Wide variety of Automotive and Engineering Formats (eg: ROSBag, MP4, MDF, BLF, PCAP) § RESTful endpoints or Client based access § Connectivity to Cloud Storage (S3) and distributed streaming systems (Kafka) § Horizontally scalable and designed to run 24x7x365 on headless cloud systems § Provides data proximity when executing on the cloud § Accelerated CUDA execution on NVIDIA® via GPU Coder™
  • 45. 45 Putting it all together
  • 46. 46 Build Graph (DAG) § Engine code § Trained models § New/Cached Data § Operates on S3 version metadata § Can be triggered manually or automatically (eg: via CI/CD pipelines)
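The build graph on slide 46 can be pictured as a dependency graph in which a change to one input marks everything downstream for re-execution. The sketch below is a minimal illustration using Python's standard `graphlib` module (Python 3.9+); the step names and dependency structure are hypothetical, and in the system described the set of changed inputs would come from comparing S3 version metadata rather than being passed in by hand.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "engine_code": set(),
    "new_data": set(),
    "etl": {"engine_code", "new_data"},
    "trained_model": {"etl"},
    "inference": {"trained_model", "etl"},
    "dashboard": {"inference"},
}


def steps_to_rebuild(
    dependencies: Dict[str, Set[str]], changed: Iterable[str]
) -> List[str]:
    """Return the changed steps plus everything downstream, in execution order."""
    dirty = set(changed)
    # static_order() yields every step after all of its dependencies.
    order = list(TopologicalSorter(dependencies).static_order())
    for step in order:
        if dependencies[step] & dirty:
            dirty.add(step)  # a dependency changed, so this step is stale too
    return [step for step in order if step in dirty]


after_retraining = steps_to_rebuild(DEPENDENCIES, {"trained_model"})
after_new_data = steps_to_rebuild(DEPENDENCIES, {"new_data"})
```

A retrained model only invalidates inference and the dashboard, while new data invalidates the whole chain from ETL onward. Steps outside the affected set can reuse cached results, which is the point of resolving changes through a DAG.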
  • 48. 48 Scaling considerations Storage § Cloud based storage allows ingestion of larger datasets Compute § Quicker training iterations both locally and on the cloud § Faster inference pipeline Cost § CapEx → OpEx Workflows § Self-serve analytics optimized for agility along a maturity framework “Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated ‘research’ and ‘engineering’ roles… … engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly” [1]. [1] The Hidden Technical Debt of Machine Learning Systems https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
  • 49. 49 Accuracy considerations § More Training Data § Simulation / Scene generation – RoadRunner Simulation
  • 50. 50 Interop considerations § ONNX interop § Transfer learning techniques § Import/Export to OpenDRIVE®, FBX®, glTF™, OpenFlight, OpenSceneGraph, OBJ, and USD formats § Automated driving simulators and game engines (eg: CARLA, VIRES VTD, NVIDIA DRIVE Sim®, Metamoto®, LGSVL, Baidu Apollo®, Unity®, and Unreal Engine®)
  • 51. 51 Embedded Targeting § GPU Coder™ Support Package for NVIDIA GPUs § NVIDIA® DRIVE™ and Jetson hardware platforms
  • 52. 52 Governance and Lifecycle management § Local Experimentation § Centralized Tracking
  • 53. 53 Security Considerations § Cloud Storage Best Practices § ACL based permissions § IAM based identity and access
  • 54. 54 Conclusions § Well-architected systems [1] accelerate development of automated semantic segmentation against large datasets § Local development provides smooth workflows for the development and refinement of deep learning models § Cloud-based scaling of compute and storage can be leveraged on Databricks to enable self-service analytics § Techniques that are relevant for ADAS development are equally applicable across other domains such as medical, geo-exploration, etc. [1] AWS Well-Architected https://aws.amazon.com/architecture/well-architected/
  • 55. 55 Key Takeaways § Build upon proven, safety compliant, and DevOps-ready development tools § Leveraging a best-in-class simulation integration platform and modeling tools leads to faster time-to-market for ADAS development § Domain specific tooling allows practitioners to write less and do more § If this is interesting, please contact us at [email protected] for more details or to get started.