Visualizing big data in the browser
using Spark
Hossein Falaki @mhfalaki
Spark Summit East – March 18, 2015
Exploratory Visualization
Put visualization back in the normal workflow of data analysis
regardless of data size.
“Critical part of data analysis”
—William S. Cleveland
• Interactive
• Collaborative
• Reproducible
Expository Visualization
Communication is often the bottleneck in data science, and a graph is worth a thousand words.
• Control over details
• Shareable
Requirements
• Interactive
• Collaborative
• Shareable
• Reproducible
• Control over details
Use visualization libraries
Use the browser
Visualization as programming
• For complex tasks point and click may not be enough
• Best expressed with a grammar (API)
• Scripts are reproducible
• Control over all details
• Data scientists are already familiar with these tools
D3.js, Three.js, matplotlib, ggplot, Bokeh, Vincent, …
Do it in the browser
• Output of these tools can be readily used on the web (PNG, SVG, Canvas, WebGL)
• No need to transfer data and results
• Browser is conducive to collaboration (e.g., Notebooks)
• Separating data manipulation from rendering enables users
to freely choose the best tool for each job
Challenges with big data visualization
1. Manipulating large data can take a long time
2. We have more data points than pixels
Apache Spark can help solve both problems
Challenges
1. Manipulating large data can take a long time
> Memory
> CPU
Reducing latency: caching
Take advantage of memory and storage hierarchy
• Serialized storage levels (for memory)
• Memory & GC tuning
Reducing latency: parallelism
Increase number of CPUs
> Get more executors with Mesos or YARN
> Click a button to increase cluster size in Databricks Cloud (DBC)
• Control level of parallelism for map and reduce tasks
• Configure Spark locality if needed
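
As a rough illustration of controlling the level of parallelism for map and reduce tasks, the sketch below splits a list into a chosen number of partitions, computes a partial result per partition on a thread pool (the map step), and then combines the partial results (the reduce step). It is plain Python, not Spark code: `partition`, `parallel_sum_of_squares`, and `num_partitions` are hypothetical names chosen for illustration, and Python threads only show the structure here; real speed-up comes from spreading partitions across executors.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, num_partitions):
    """Split data into num_partitions roughly equal contiguous slices."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def parallel_sum_of_squares(data, num_partitions=4):
    """Map: sum of squares per partition. Reduce: add the partial sums."""
    parts = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(lambda p: sum(x * x for x in p), parts))
    return reduce(lambda a, b: a + b, partials, 0)


total = parallel_sum_of_squares(list(range(1000)), num_partitions=8)
```

Changing `num_partitions` changes how many map tasks run, which is the knob the slide refers to; the right value depends on data size and available cores.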
Challenges
1. Manipulating large data can take a long time
2. We have more data points than possible pixels
> Summarize
> Model
> Sample
More data than pixels? Summarize
• Extensively used by BI tools
> Aggregation
> Pivoting
• Most data scientists’ nightly jobs
summarize data
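
A minimal sketch of summarizing for display, assuming a plot that is `width_px` pixels wide: every (x, y) point is assigned to one pixel column, and only a count and a mean per column are kept. This is plain Python rather than Spark; on a cluster the same idea would be expressed as a distributed aggregation. The function name `summarize_to_pixels` and the synthetic data are assumptions for illustration.

```python
from collections import defaultdict


def summarize_to_pixels(points, x_min, x_max, width_px):
    """Aggregate (x, y) points into at most width_px bins, one per pixel column.

    Returns a sorted list of (bin_index, count, mean_y).
    """
    if x_max <= x_min or width_px <= 0:
        raise ValueError("need x_max > x_min and width_px > 0")
    sums = defaultdict(lambda: [0, 0.0])  # bin -> [count, sum of y]
    scale = width_px / (x_max - x_min)
    for x, y in points:
        # Clamp into [0, width_px - 1] so boundary values land in a valid bin.
        b = max(0, min(int((x - x_min) * scale), width_px - 1))
        sums[b][0] += 1
        sums[b][1] += y
    return [(b, c, s / c) for b, (c, s) in sorted(sums.items())]


# 100,000 synthetic points reduced to at most 800 plotted values.
points = [(i / 1000.0, float(i % 7)) for i in range(100_000)]
summary = summarize_to_pixels(points, 0.0, 100.0, 800)
```

The browser then receives at most `width_px` rows regardless of how many input points there were.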
More data than pixels? Model
MLlib supports a large (and growing)
set of distributed algorithms
• Clustering: k-means, GMM, LDA
• Classification and regression: linear models (LM), decision trees (DT), naive Bayes (NB)
• Dimensionality reduction: SVD, PCA
• Collaborative filtering: ALS
• Correlation, hypothesis testing
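
To make the modeling idea concrete, here is a small single-machine k-means (Lloyd's algorithm) in plain Python: instead of drawing every point, one would draw only the fitted centroids. This is a sketch under stated assumptions, not MLlib's implementation; the `kmeans` function, its evenly spaced initialisation, and the synthetic blobs are all hypothetical choices made for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm on 2-D points; returns k centroids."""
    # Deterministic initialisation: k points at evenly spaced input positions.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centroids[j]  # keep the old centroid for an empty group
            for j, g in enumerate(groups)
        ]
    return centroids


# Two synthetic blobs of 500 points each; only the 2 centroids need drawing.
blob_a = [(0.0 + (i % 10) * 0.01, 0.0 + (i % 7) * 0.01) for i in range(500)]
blob_b = [(5.0 + (i % 10) * 0.01, 5.0 + (i % 7) * 0.01) for i in range(500)]
centroids = kmeans(blob_a + blob_b, k=2)
```

The plot then carries `k` marks (optionally sized by cluster membership) rather than one mark per data point.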
More data than pixels? Sample
Extensively used in statistics
Spark offers native support for:
• Approximate and exact sampling
• Approximate and exact stratified
sampling
Approximate sampling is faster and is good enough in most cases
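
The difference between approximate and exact sampling can be sketched in plain Python. The first function keeps each item independently with a fixed probability, so the sample size is only approximately `fraction * n`; the second uses reservoir sampling to return exactly `k` items. Neither is Spark's implementation, and the sketch does not reproduce Spark's cost difference between the two; `approximate_sample` and `exact_sample` are hypothetical names used for illustration.

```python
import random


def approximate_sample(items, fraction, seed=None):
    """Bernoulli sampling: keep each item independently with probability
    `fraction`. One pass, no coordination, but the output size varies."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def exact_sample(items, k, seed=None):
    """Reservoir sampling (Algorithm R): exactly min(k, n) items, one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


data = range(100_000)
approx = approximate_sample(data, 0.01, seed=42)  # roughly 1,000 items
exact = exact_sample(data, 1000, seed=42)         # exactly 1,000 items
```

For a scatter plot, a sample whose size is off by a few percent usually looks the same, which is why the approximate variant is often sufficient.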
Demo
Summary
Using Spark, we can extend interactive visualization to large data
Reduce interaction latency to seconds
> Cache data in memory
> Increase parallelism
To visualize millions of points in the browser
> Summarize
> Model
> Sample
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Visualizing big data in the browser using Spark

  • 1. Visualizing big data in the browser using Spark Hossein Falaki @mhfalaki Spark Summit East – March 18, 2015
  • 2. Exploratory Visualization: Put visualization back in the normal workflow of data analysis regardless of data size. “Critical part of data analysis” (William S. Cleveland) • Interactive • Collaborative • Reproducible
  • 3. Expository Visualization: Communication is often the bottleneck in data science, and a graph is worth a thousand words. • Control over details • Shareable
  • 4. Requirements • Interactive • Collaborative • Shareable • Reproducible • Control over details. Use visualization libraries and use the browser.
  • 5. Visualization as programming • For complex tasks, point and click may not be enough • Best expressed with a grammar (API) • Scripts are reproducible • Control over all details • Data scientists are already familiar with these tools: D3.js, Three.js, matplotlib, ggplot, Bokeh, Vincent, …
  • 6. Do it in the browser • Output of these tools can be readily used on the web (PNG, SVG, Canvas, WebGL) • No need to transfer data and results • Browser is conducive to collaboration (e.g., Notebooks) • Separating data manipulation from rendering enables users to freely choose the best tool for each job
  • 7. Challenges with big data visualization 1. Manipulating large data can take a long time 2. We have more data points than pixels. Apache Spark can help solve both problems
  • 8. Challenges 1. Manipulating large data can take a long time > Memory > CPU
  • 9. Reducing latency: caching. Take advantage of memory and storage hierarchy • Serialized storage levels (for memory) • Memory & GC tuning
  • 10. Reducing latency: parallelism. Increase number of CPUs > Get more executors with Mesos or YARN > Click a button to increase cluster size in DBC • Control level of parallelism for map and reduce tasks • Configure Spark locality if needed
  • 11. Challenges 1. Manipulating large data can take a long time 2. We have more data points than possible pixels > Summarize > Model > Sample
  • 12. More data than pixels? Summarize • Extensively used by BI tools > Aggregation > Pivoting • Most data scientists’ nightly jobs summarize data
  • 13. More data than pixels? Model. MLlib supports a large (and growing) set of distributed algorithms • Clustering: k-means, GMM, LDA • Classification and regression: LM, DT, NB • Dimensionality reduction: SVD, PCA • Collaborative filtering: ALS • Correlation, hypothesis testing
  • 14. More data than pixels? Sample. Extensively used in statistics. Spark offers native support for: • Approximate and exact sampling • Approximate and exact stratified sampling. Approximate sampling is faster and is good enough in most cases
  • 16. Summary: Using Spark we can extend interactive visualization of large data. Reduce interaction latency to seconds > Cache data in memory > Increase parallelism. To visualize millions of points in the browser > Summarize > Model > Sample
  • 17. Visualizing big data in the browser using Spark
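
The "summarize" and "sample" strategies from slides 12 and 14 can be illustrated with a minimal sketch in plain Python. This is not code from the talk and it does not use Spark; the function names `summarize_to_grid` and `bernoulli_sample`, the 40×30 grid, and the 1% fraction are all assumptions chosen for illustration. On a cluster, the same ideas would presumably map to a distributed aggregation and to an approximate sampling call such as `RDD.sample`.

```python
import random
from collections import Counter


def summarize_to_grid(points, width, height, x_range, y_range):
    """Count points per pixel-sized bin (hypothetical helper).

    At most width * height counts are produced, no matter how many
    input points there are, so the result is small enough to ship
    to a browser-side plotting library.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted area
        # Scale each coordinate to a bin index; clamp the upper edge.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        counts[(col, row)] += 1
    return counts


def bernoulli_sample(points, fraction, seed):
    """Approximate sampling (hypothetical helper).

    Each point is kept independently with probability `fraction`,
    so the sample size is only approximately fraction * len(points).
    """
    rng = random.Random(seed)
    return [p for p in points if rng.random() < fraction]


# Synthetic data standing in for a large dataset (an assumption).
data_rng = random.Random(0)
points = [(data_rng.random(), data_rng.random()) for _ in range(100_000)]

grid = summarize_to_grid(points, 40, 30, (0.0, 1.0), (0.0, 1.0))
sample = bernoulli_sample(points, 0.01, seed=1)

print(len(points), "points summarized into", len(grid), "bins")
print("approximate 1% sample size:", len(sample))
```

Binning bounds the payload by the number of pixels rather than the number of records, while sampling keeps individual points at the cost of an approximate sample size; which one fits better depends on whether the plot needs densities or raw observations.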