Performant data processing
with PySpark, SparkR and
DataFrame API
Ryuji Tamagawa from Osaka
Many Thanks to Holden Karau,
for the discussion we had about this talk.
Agenda
Who am I?
Spark
Spark and non-JVM languages
DataFrame APIs come to the rescue
Examples
Who am I?
Software engineer working for
Sky, from architecture design to
troubleshooting in the field
Translator working with O’Reilly
Japan
‘Learning Spark’ is the 27th book I have translated
Won the Rakuten tech award (Silver, 2010) for translating
‘Hadoop: The Definitive Guide’
A bed for 6 cats
Works of 2015
Available
Jan, 2016 ?
Works of past
Motivation for
today’s talk
I want to deal with my ‘Big’ data,
WITH PYTHON!!
Apache Spark
Apache Spark
You may already
have heard a lot
Fast, distributed
data processing
framework with
high-level APIs
Written in Scala,
runs on the JVM
[Diagram: where Spark sits in the Hadoop stack. Components shown: OS; HDFS; YARN; HBase; MapReduce; Hive etc.; Impala etc. (in-memory SQL engines); Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
Why it’s fast
No need to write intermediate data to storage at every step
No need to launch a new JVM process at every step
[Diagram: MapReduce vs. Spark. MapReduce: every map and every reduce step involves a JVM invocation and I/O against HDFS. Spark: one executor (JVM) invocation runs f1 (read data to RDD) through f7, which access RDDs in memory; I/O occurs when reading from HDFS, at f4 (persist to storage), and at f5 (does shuffle).]
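The contrast in the diagram can be mimicked in a few lines of plain Python. This is only a toy model, not Spark or Hadoop code: the three step functions, the `FakeStorage` class, and its counters are all made up for illustration. The first runner writes every intermediate result to "storage" and reads it back, as a chain of MapReduce jobs would; the second keeps intermediates in memory and touches storage only once.

```python
import json


class FakeStorage:
    """Stand-in for HDFS that counts reads and writes (hypothetical helper)."""

    def __init__(self):
        self.blobs = {}
        self.reads = 0
        self.writes = 0

    def write(self, name, records):
        self.writes += 1
        self.blobs[name] = json.dumps(records)

    def read(self, name):
        self.reads += 1
        return json.loads(self.blobs[name])


def run_mapreduce_style(storage, source, steps):
    """Each step reads its input from storage and writes its output back."""
    current = source
    for i, step in enumerate(steps):
        records = storage.read(current)
        current = "tmp-%d" % i
        storage.write(current, step(records))
    return storage.read(current)


def run_spark_style(storage, source, steps):
    """Read once, keep intermediate results in memory, return the result."""
    records = storage.read(source)
    for step in steps:
        records = step(records)
    return records


# Three illustrative steps: parse, filter, transform.
steps = [
    lambda rows: [int(r) for r in rows],
    lambda rows: [r for r in rows if r % 2 == 0],
    lambda rows: [r * 10 for r in rows],
]

mr_storage = FakeStorage()
mr_storage.write("input", ["1", "2", "3", "4"])
mr_result = run_mapreduce_style(mr_storage, "input", steps)

spark_storage = FakeStorage()
spark_storage.write("input", ["1", "2", "3", "4"])
spark_result = run_spark_style(spark_storage, "input", steps)

print(mr_result, mr_storage.reads, mr_storage.writes)
print(spark_result, spark_storage.reads, spark_storage.writes)
```

Both runners return the same records; only the number of storage round trips differs, and that difference grows with the number of steps.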
Apache Spark
and
non-JVM languages
Spark supports
non-JVM languages
Shells
PySpark, for Python users
SparkR, for R users
GUI environments: Jupyter, RStudio
You can write application code in
these languages
The Web UI tells us a lot
http://<address>:4040
Performance problems
with those languages
Data processing in those languages may be several times slower than in JVM languages
The reason lies in the architecture:
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
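To get a feel for why crossing the language boundary costs something, the sketch below imitates it in plain Python. It is not PySpark code: `ship_to_worker` is a hypothetical stand-in for the hop between the JVM and a Python worker, and it uses `pickle` only because that serializer ships with the standard library. Every row is serialized, copied, and deserialized on the way in, and matching rows make the same trip back, so the extra work grows with the volume of data rather than with the amount of code.

```python
import pickle

bytes_moved = 0


def ship_to_worker(obj):
    """Serialize and deserialize obj, as if it crossed a process boundary."""
    global bytes_moved
    payload = pickle.dumps(obj)
    bytes_moved += len(payload)
    return pickle.loads(payload)


def filter_across_boundary(rows, predicate):
    """Send every row to the 'worker', evaluate there, ship matches back."""
    kept = []
    for row in rows:
        row_in_worker = ship_to_worker(row)
        if predicate(row_in_worker):
            kept.append(ship_to_worker(row_in_worker))
    return kept


def filter_in_place(rows, predicate):
    """Evaluate the predicate where the data already lives."""
    return [row for row in rows if predicate(row)]


rows = [("abc", 1), ("xyz", 2), ("abc", 3)]
is_abc = lambda items: items[0] == "abc"

across = filter_across_boundary(rows, is_abc)
in_place = filter_in_place(rows, is_abc)

print(across == in_place)   # the answer is the same either way
print(bytes_moved)          # bytes serialized on the cross-boundary path only
```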
The choices you
have had
Learn Scala
Write (more lines of) code in Java
Use non-JVM languages with more
CPU cores to make up the
performance gap
DataFrame APIs
come to the rescue !
DataFrame
Tabular data with a schema, built on RDDs
Successor of SchemaRDD (since 1.3; available in SparkR since 1.4)
Has a rich set of APIs for data operations
Or, you can simply use SQL!
Do it within the JVM
When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime
As a result, performance is almost the same as with JVM languages
Only code goes through
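DataFrame APIs compared to
RDD APIs by Examples
[Diagram: RDD API. In the executor, the cached DataFrame held in the JVM is transferred to Python, where lambda items: items[0] == 'abc' is evaluated; the result DataFrame is transferred back before it reaches the driver.]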
DataFrame APIs compared to
RDD APIs by Examples
[Diagram: DataFrame API. In the executor, filter(df["_1"] == "abc") is applied to the cached DataFrame inside the JVM; only the result DataFrame is transferred to the driver.]
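Why would `df["_1"] == "abc"` avoid the transfer when the lambda does not? One way to picture it is that the comparison never runs against data in Python: it builds a small description of the filter, and only that description has to reach the JVM. The toy classes below (`Col`, `Expr`) are invented for illustration and are not PySpark's actual implementation; they only show how operator overloading can turn a Python expression into something that can be shipped as text.

```python
class Expr:
    """A description of a computation, not its result (hypothetical class)."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr("(%s AND %s)" % (self.text, other.text))

    def __repr__(self):
        return "Expr(%r)" % self.text


class Col:
    """A named column; comparing it yields an Expr instead of a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Expr("%s = %r" % (self.name, value))

    def __gt__(self, value):
        return Expr("%s > %r" % (self.name, value))


# The comparison is never evaluated against data in Python;
# it only produces text that an engine could interpret elsewhere.
condition = (Col("_1") == "abc") & (Col("_2") > 10)
print(condition.text)

# A lambda, by contrast, is opaque: the only way to apply it is to call it
# with actual rows inside a Python process.
predicate = lambda items: items[0] == "abc"
print(predicate(("abc", 42)))
```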
Watch out for UDFs
You can write UDFs
in Python
You can use
lambdas in Python,
too
Once you use them,
data flows between
the two worlds
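from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A Python UDF: each value is sent to a Python worker to be evaluated
slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name)).collect()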
Make it small first,
then use UDFs
Filter or sample your
‘big’ data with
DataFrame APIs
Then use UDFs
The SQL optimizer does not take UDFs into account when making plans (so far)
[Diagram: ‘BIG’ data in DataFrame, then filtering with ‘native APIs’, then ‘Small’ data in DataFrame, then whatever operation with UDFs]
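The effect of the ordering can be illustrated without Spark. In the plain-Python sketch below, `slow_udf` stands in for a Python UDF and a counter records how many rows it touches; the data, the prefix filter, and the function names are made up for the example. Filtering first with a cheap predicate (the role played by the native DataFrame APIs) means the expensive function sees only the rows that survive.

```python
udf_calls = 0


def slow_udf(name):
    """Stand-in for a Python UDF; counts how often it is invoked."""
    global udf_calls
    udf_calls += 1
    return len(name)


def udf_first(rows):
    """Apply the UDF to every row, then filter."""
    with_len = [(fname, name, slow_udf(name)) for fname, name in rows]
    return [r for r in with_len if r[0].startswith("tama")]


def filter_first(rows):
    """Shrink the data with a cheap predicate, then apply the UDF."""
    small = [(fname, name) for fname, name in rows if fname.startswith("tama")]
    return [(fname, name, slow_udf(name)) for fname, name in small]


# 1,000 rows, of which only 10 match the cheap filter.
rows = [("tamagawa", "ryuji")] * 10 + [("other", "someone")] * 990

udf_calls = 0
result_a = udf_first(rows)
calls_udf_first = udf_calls

udf_calls = 0
result_b = filter_first(rows)
calls_filter_first = udf_calls

print(result_a == result_b)
print(calls_udf_first, calls_filter_first)
```

The two functions return identical rows, but the UDF stand-in runs once per input row in the first case and once per surviving row in the second.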
Make it small first,
then use UDFs
# Register the UDF by name so that it can be called from SQL text
sqc.udf.register("slen", lambda s: len(s), IntegerType())
sqc.sql("""
    select…
    from df
    where fname like "tama%"
    and slen(name)
""").collect()
processed first !
Ingesting Data
Dealing with files like CSVs in a non-JVM driver is slow
So first convert raw data to a ‘DataFrame-native’ format such as Parquet
Such files can be processed directly by JVM processes (executors), even when you use a non-JVM language
[Diagram: driver machine holding local data, the driver, and Py4J; executor JVM holding the DataFrame; HDFS (Parquet)]
Ingesting Data
[Diagram: the driver sends code only through Py4J; the executor JVM holds the DataFrame and works with HDFS (Parquet)]
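A one-off conversion from CSV to Parquet could look like the sketch below. It assumes that a `SparkSession` is passed in as `spark` and relies on `DataFrameReader.csv`, which is part of Spark 2.0 and later (on the 1.x releases current at the time of this talk, reading CSV needed the external spark-csv package). The function names, column name, and paths are placeholders.

```python
def csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file with Spark and rewrite it as Parquet.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    The read and the write are carried out on the JVM side;
    the Python driver only sends the instructions.
    """
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)


def count_by_prefix(spark, parquet_path, column, prefix):
    """Query the Parquet copy using DataFrame APIs only (no Python UDFs)."""
    df = spark.read.parquet(parquet_path)
    return df.filter(df[column].startswith(prefix)).count()
```

With PySpark installed, `SparkSession.builder.getOrCreate()` provides the session. The conversion would typically be run once, after which later jobs read only the Parquet copy, for example `count_by_prefix(spark, "raw.parquet", "fname", "tama")` with a hypothetical path and column.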
Appendix : Parquet
Parquet: a general-purpose file format for analytic workloads
Columnar storage: reduces I/O significantly
High compression ratio
Projection pushdown
Today, workloads are becoming CPU-intensive: very fast reads, aware of CPU internals
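The I/O saving from columnar storage and projection pushdown can be shown with a toy layout in plain Python. Nothing here is Parquet's real on-disk format: the two "files" are ordinary Python structures with made-up records, and the counter simply tallies how many values a reader has to touch to answer a query on one column.

```python
records = [
    {"name": "a", "age": 30, "city": "Osaka"},
    {"name": "b", "age": 41, "city": "Tokyo"},
    {"name": "c", "age": 25, "city": "Kyoto"},
]

# Row-oriented layout: the values of one record are stored together.
row_file = [tuple(r.values()) for r in records]

# Column-oriented layout: the values of one column are stored together.
column_file = {key: [r[key] for r in records] for key in records[0]}


def average_age_from_rows(rows):
    """Has to scan every value of every row to reach the 'age' field."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)
        total += row[1]          # 'age' is the second field
    return total / len(rows), touched


def average_age_from_columns(columns):
    """Projection pushdown: read only the 'age' column, skip the rest."""
    ages = columns["age"]
    return sum(ages) / len(ages), len(ages)


print(average_age_from_rows(row_file))
print(average_age_from_columns(column_file))
```

Both readers compute the same average, but the row-oriented reader touches every field while the column-oriented reader touches one value per record.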

20160127三木会 RDB経験者のためのspark
Ryuji Tamagawa
 
20151205 Japan.R SparkRとParquet
20151205 Japan.R SparkRとParquet20151205 Japan.R SparkRとParquet
20151205 Japan.R SparkRとParquet
Ryuji Tamagawa
 
足を地に着け落ち着いて考える
足を地に着け落ち着いて考える足を地に着け落ち着いて考える
足を地に着け落ち着いて考える
Ryuji Tamagawa
 
BigQueryの課金、節約しませんか
BigQueryの課金、節約しませんかBigQueryの課金、節約しませんか
BigQueryの課金、節約しませんか
Ryuji Tamagawa
 
Seleniumをもっと知るための本の話
Seleniumをもっと知るための本の話Seleniumをもっと知るための本の話
Seleniumをもっと知るための本の話
Ryuji Tamagawa
 
データベース勉強会 In 広島 mongodb
データベース勉強会 In 広島  mongodbデータベース勉強会 In 広島  mongodb
データベース勉強会 In 広島 mongodb
Ryuji Tamagawa
 
Invitation to mongo db @ Rakuten TechTalk
Invitation to mongo db @ Rakuten TechTalkInvitation to mongo db @ Rakuten TechTalk
Invitation to mongo db @ Rakuten TechTalk
Ryuji Tamagawa
 
RDB経験者に送るMongoDBの勘所(db tech showcase tokyo 2013)
RDB経験者に送るMongoDBの勘所(db tech showcase tokyo 2013)RDB経験者に送るMongoDBの勘所(db tech showcase tokyo 2013)
RDB経験者に送るMongoDBの勘所(db tech showcase tokyo 2013)
Ryuji Tamagawa
 
初めてのAws elastic map reduce
初めてのAws elastic map reduce初めてのAws elastic map reduce
初めてのAws elastic map reduce
Ryuji Tamagawa
 
初めてのAws rds for sql server
初めてのAws   rds for sql server初めてのAws   rds for sql server
初めてのAws rds for sql server
Ryuji Tamagawa
 


Performant data processing with PySpark, SparkR and DataFrame API

  • 1. Performant data processing with PySpark, SparkR and DataFrame API. Ryuji Tamagawa from Osaka. Many thanks to Holden Karau for the discussion we had about this talk.
  • 2. Agenda: Who am I?; Spark; Spark and non-JVM languages; DataFrame APIs come to the rescue; Examples
  • 3. Who am I? Software engineer working for Sky, from architecture design to troubleshooting in the field. Translator working with O’Reilly Japan; ‘Learning Spark’ is the 27th book. Awarded the Rakuten tech award Silver 2010 for translating ‘Hadoop: The Definitive Guide’. A bed for 6 cats.
  • 6. Motivation for today’s talk: I want to deal with my ‘Big’ data, WITH PYTHON!!
  • 8. Apache Spark: You may already have heard a lot about it. A fast, distributed data processing framework with high-level APIs. Written in Scala, runs on the JVM. [Diagram: the surrounding stack: OS, HDFS, YARN, HBase, MapReduce, Hive etc., Impala etc. (in-memory SQL engines), and Spark (Spark Streaming, MLlib, GraphX, Spark SQL)]
  • 9. Why it’s fast: No need to write temporary data to storage every time. No need to invoke a JVM process every time. [Diagram, MapReduce: every map and reduce step involves a JVM invocation and I/O against HDFS. Diagram, Spark: a single executor (JVM) invocation runs f1 (read data to RDD), f2, f3, f4 (persist to storage), f5 (does shuffle), f6 and f7; the steps access RDDs in memory, and I/O happens only at a few of them.]
  • 11. Spark supports non-JVM languages. Shells: PySpark for Python users, SparkR for R users. GUI environments: Jupyter, RStudio. You can write application code in these languages.
  • 12. The Web UI tells us a lot http://<address>:4040
  • 13. Performance problems with those languages: Data processing with those languages may be several times slower than with JVM languages. The reason lies in the architecture: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
  • 14. The choices you have had: Learn Scala; Write (more lines of) code in Java; Use non-JVM languages with more CPU cores to make up the performance gap
  • 15. DataFrame APIs come to the rescue !
  • 16. DataFrame: Tabular data with a schema, based on RDD. Successor of SchemaRDD (DataFrame since Spark 1.3; available in SparkR since 1.4). Has a rich set of APIs for data operations. Or, you can simply use SQL!
  • 17. Do it within the JVM: When you call DataFrame APIs from non-JVM languages, data is not transferred between the JVM and the language runtime; only code goes through. As a result, performance is almost the same as with JVM languages.
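  • 18. DataFrame APIs compared to RDD APIs by examples. [Diagram, RDD API: the cached DataFrame lives in the executor JVM; the Python lambda, lambda items: items[0] == 'abc', runs in Python, so the data is transferred to Python and the result is transferred back before it goes to the driver.]
  • 19. DataFrame APIs compared to RDD APIs by examples. [Diagram, DataFrame API: filter(df["_1"] == "abc") is evaluated inside the executor JVM against the cached DataFrame; only the result is transferred to the driver.]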
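
The contrast in slides 17 to 19 can be sketched with a toy model in plain Python. This is not Spark code: `pickle` merely stands in for the serialization between the JVM and the Python worker, and the names `rows`, `rdd_style_filter` and `dataframe_style_filter` are hypothetical, introduced only for illustration.

```python
import pickle

# Hypothetical stand-in for a cached partition living inside the executor JVM.
rows = [("abc", i) if i % 100 == 0 else ("xyz", i) for i in range(10_000)]


def rdd_style_filter(partition):
    """Toy model of the RDD path: every row is serialized to the Python
    worker, the lambda runs there, and matching rows are serialized back."""
    predicate = lambda items: items[0] == "abc"
    sent = pickle.dumps(partition)                      # JVM -> Python
    kept = [r for r in pickle.loads(sent) if predicate(r)]
    returned = pickle.dumps(kept)                       # Python -> JVM
    return kept, len(sent) + len(returned)


def dataframe_style_filter(partition):
    """Toy model of the DataFrame path: only a description of the
    expression crosses the boundary; rows are filtered where they live."""
    expression = {"op": "==", "column": 0, "literal": "abc"}
    shipped = pickle.dumps(expression)                  # Python -> JVM, code only
    kept = [r for r in partition
            if r[expression["column"]] == expression["literal"]]
    return kept, len(shipped)


rdd_rows, rdd_bytes = rdd_style_filter(rows)
df_rows, df_bytes = dataframe_style_filter(rows)
print("same result:", rdd_rows == df_rows)
print("bytes crossing the boundary:", rdd_bytes, "vs", df_bytes)
```

Both paths return the same rows; what differs is how many bytes have to cross the boundary, which is the cost the slides attribute to the RDD API in non-JVM languages.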
  • 20. Watch out for UDFs: You can write UDFs in Python, and you can use lambdas in Python, too. Once you use them, data flows between the two worlds. Example: slen = udf(lambda s: len(s), IntegerType()); df.select(slen(df.name)).collect()
  • 21. Make it small first, then use UDFs: Filter or sample your ‘big’ data with DataFrame APIs, then use UDFs. The SQL optimizer does not take the cost of UDFs into account when making plans (so far). [Diagram: ‘BIG’ data in DataFrame -> filtering with ‘native APIs’ -> ‘Small’ data in DataFrame -> whatever operation with UDFs]
  • 22. Make it small first, then use UDFs: Filter or sample your ‘big’ data with DataFrame APIs, then use UDFs. The SQL optimizer does not take the cost of UDFs into account when making plans (so far). Example: slen = udf(lambda s: len(s), IntegerType()); sqc.sql('select… from df where fname like "tama%" and slen(name)').collect() [Annotation on the query: processed first!]
  • 23. Ingesting data: Dealing with files like CSVs through a non-JVM driver is slow. Convert raw data to ‘DataFrame-native’ formats like Parquet first. You can then process such files directly from JVM processes (executors), even when using non-JVM languages. [Diagram: Driver Machine with Local Data and Driver; Py4J; Executor JVM with DataFrame; HDFS (Parquet)]
  • 24. Ingesting data: Dealing with files like CSVs through a non-JVM driver is slow. Convert raw data to ‘DataFrame-native’ formats like Parquet first. You can then process such files directly from JVM processes (executors), even when using non-JVM languages. [Diagram: Driver Machine with Driver; Py4J, code only; Executor JVM with DataFrame; HDFS (Parquet); code only]
  • 26. Parquet: a general-purpose file format for analytic workloads. Columnar storage: reduces I/O significantly; high compression rate; projection pushdown. Today, workloads are becoming CPU-intensive: very fast reads, CPU-internal-aware.
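
The I/O saving from columnar storage and projection can be illustrated with a toy layout in plain Python. This is not the Parquet format: JSON stands in for the on-disk encoding, and `table`, `row_file`, `column_file` and the two reader functions are hypothetical names used only for this sketch.

```python
import json

# A small hypothetical table with one wide column ("bio").
table = [{"name": f"user{i}", "age": 20 + i % 50, "bio": "x" * 200}
         for i in range(1000)]

# Row-oriented layout: one serialized record after another.
row_file = [json.dumps(record).encode() for record in table]

# Column-oriented layout: the values of each column are stored together.
column_file = {column: json.dumps([record[column] for record in table]).encode()
               for column in table[0]}


def ages_from_rows():
    """Reading one column from the row layout touches every record's bytes."""
    bytes_read = sum(len(blob) for blob in row_file)
    return [json.loads(blob)["age"] for blob in row_file], bytes_read


def ages_from_columns():
    """Reading one column from the columnar layout touches only that column."""
    chunk = column_file["age"]
    return json.loads(chunk), len(chunk)


row_ages, row_bytes = ages_from_rows()
col_ages, col_bytes = ages_from_columns()
print(row_ages == col_ages, row_bytes, col_bytes)
```

A query that needs only `age` reads a small fraction of the bytes in the columnar layout, which is the effect the slide attributes to columnar storage with projection pushdown.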