Democratizing AI with
Apache Spark
Ali Ghodsi
Co-Founder and CEO
AI is changing the world
Why now?
AlphaGo
Siri/assistants
Self-driving cars
Data is the catalyst
AI hasn’t been democratized
Better training, tuning,
validation
More data
Clickstreams
Sensor data (IoT)
Video
Speech
Handwriting
…
The hardest part of AI isn’t AI
“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015
How do we democratize AI?
“Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015
+ AI
FLEXIBLE FAST BIG DATA
Some gaps remain
1. Manage data infrastructure
• Create, configure, and monitor resilient big data clusters.
• Securely access silos of disparate data sources.
• Enforce proper data governance.
2. Empower teams to be productive
• Interactively explore data and prototype ideas.
• Securely share big data clusters among analysts.
• Debug, troubleshoot, and version-control big data applications.
3. Establish production-ready applications
• Set up robust ML data pipelines for ETL/ELT.
• Productionize real-time applications with high availability (HA) and fault tolerance (FT).
• Build, serve, and maintain advanced machine learning models.
Databricks: Closing the gap
1. Just-in-Time Data Platform
Agile + Low TCO
• Separate compute & storage
• Integrate existing data stores
• Efficient cache on first access
2. Integrated Workspace
Accelerate Time to Value
• Interactive notebooks, dashboards, reports
• Real-time exploration, machine learning, graph use cases
3. Automated Spark Management
Performance
• Workflow scheduler for ML, streaming, SQL, ETL
• Performance-optimized, high availability, fault-tolerant
Enterprise AI use-cases
Predict credit score, credit limit, anomalies
Predict energy demand based on massive weather data
Natural language processing to extract author graph
Predict player churn, predict network outages
Predict machine equipment failure
New Frontier of AI: Deep Learning
Detect cancer
Understand speech
Infer location
Identify landmarks in photos
Recognize Mandarin and English
Improve cancer detection
Faster and easier deep learning with Databricks
GPUs
• TensorFlow: The most popular deep learning framework.
• TensorFrames: Makes TensorFlow computations faster and easier to program on Spark.
TensorFlow on
TensorFrames and GPU support out of the box
Massive parallelism
Deep Learning on Databricks
Data ingest
Feature extraction
Model training
Productionize
Clusters
Jobs & Workflows
TensorFrames + GPUs
Interactive exploration
Just-in-time data platform
Automated management
Thank you.

Spark Summit Europe 2016 Keynote - Databricks CEO
