df: Dataframe on Spark
Mohit Jaggi 
Code Ninja and Troublemaker 
at
Agenda 
• About Ayasdi 
• df 
• Brief Demo (if time allows) 
• Conclusion
About Ayasdi
Traditional Analytics 
CODE 
? 
Hypothesis
Automated Insights
Ayasdi Solution 
UX 
Ayasdi Platform 
Distributed Computing Algorithmic Reach 
ETL
df
Day in a data scientist’s life 
• Get data 
• Need more/something else 
• Data wrangling 
• Rinse, repeat 
• Load into analysis software like Ayasdi Core 
• Actual data analysis, model-building etc
Data Wrangling Tools 
• grep, cut, wc -l, head, tail 
• Python Pandas 
• Most useful construct: the pandas data frame, à la Excel with a CLI
Challenges 
• Applying data science techniques to data larger than a single machine’s memory
• Easier to procure a cluster of small machines than one big machine
• Processing takes too long
Solution: Distribute 
• Hadoop ecosystem: Spark is great 
• Learning curve: what is this RDD thing? Where is my familiar data frame?
• There is PySpark, but to get the best out of Spark you use Scala: another learning curve
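For readers asking what an RDD is: roughly, it is a dataset split into partitions that are processed independently and then combined. The sketch below illustrates that idea in a single Python process. It is not Spark code, and `partition`, `partial_total`, and `distributed_total` are hypothetical names used only for illustration.

```python
from functools import reduce


def partition(rows, num_partitions):
    """Split rows into num_partitions chunks (round-robin assignment)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_total(chunk):
    """Work done on one partition; on a cluster each chunk would live on one machine."""
    return sum(chunk)


def distributed_total(rows, num_partitions=4):
    """Map over the partitions, then combine the per-partition results."""
    partials = map(partial_total, partition(rows, num_partitions))
    return reduce(lambda a, b: a + b, partials, 0)


# Usage: the total is the same no matter how many partitions are used.
total = distributed_total(list(range(1, 101)), num_partitions=4)
```

Because each partition is handled independently, the same structure can in principle spread across many small machines, which is the appeal described above.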
df: Gentle Incline 
“I want to put my projects on hold, and learn several new things simultaneously” 
- No One Ever 
• Attempts to provide an API on Spark that looks and feels like a pandas data frame
e.g. in pandas: df["a"]
in df: df("a")
• Also intuitive for R programmers
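The two spellings are close because both are sugar for "get or set a column by name". In Python, `frame["a"]` calls `__getitem__` and `frame["a"] = ...` calls `__setitem__`; in Scala, `x("a")` desugars to `x.apply("a")` and `x("a") = v` to `x.update("a", v)`. The toy below sketches the Python side only. `MiniFrame` and `Column` are hypothetical classes written for this illustration and say nothing about how df or pandas are actually implemented.

```python
class Column:
    """A named-free list of values supporting element-wise operations."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, other):
        # Element-wise product of two equally long columns.
        return Column(a * b for a, b in zip(self.values, other.values))

    def map(self, fn):
        # Apply fn to every value, producing a new column.
        return Column(fn(v) for v in self.values)


class MiniFrame:
    """A minimal column store addressed by column name."""

    def __init__(self, **columns):
        self._columns = {name: Column(vals) for name, vals in columns.items()}

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, column):
        self._columns[name] = column


# Usage: derive a new column from two existing ones.
frame = MiniFrame(avg=[2.0, 3.0], count=[10, 20])
frame["total"] = frame["avg"] * frame["count"]
```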
Advantages 
• Quite transparently runs on Spark: Distributed processing 
• Is in Scala: No layering overhead 
• Is in Scala: Can directly call cutting-edge Spark libraries like MLlib [PySpark wrappers are usually a bit behind]
• Is an “internal DSL”: Advanced users can augment with arbitrary Scala code. [Python wrapper still possible]
• Is an “internal DSL”: Fast without resorting to code-generation 
• Fully open sourced, Apache license
Real Life Examples 
Snippets of data scientist code that was “converted” from pandas to df to make it scale to larger data
Add a column with total
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]
->
mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
Remove $ and , from numbers representing money
mppu["de-comma"] = mppu["dollar"].str.replace('$', '')
mppu["de-dollar"] = mppu["de-comma"].str.replace(',', '').astype(float)
->
mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
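Both versions of the money cleanup apply the same per-value steps: drop the `$`, drop the `,`, and parse what remains as a floating-point number. A minimal plain-Python sketch of that per-value function follows. `parse_money` and the sample strings are assumptions made up for illustration and do not come from the original data set.

```python
def parse_money(text):
    """Strip '$' and ',' from a money string and convert the rest to float."""
    return float(text.replace("$", "").replace(",", ""))


# Hypothetical sample values standing in for the "dollar" column.
amounts = ["$1,234.50", "$99", "1,000,000"]
cleaned = [parse_money(a) for a in amounts]
```

A function like this is what gets mapped over the column; the only difference between the pandas and df versions is where the mapping runs.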
Demo
Future 
• PySpark wrapper
• more data sources like SQL, Parquet, HDF5, etc.
• charts and graphs 
• contributors welcome!
Conclusion
Summary 
• pandas is awesome 
• df scales to bigger data, looks and feels like pandas 
• fully open source 
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/
Acknowledgements 
• Max Song for introducing me to Pandas 
• Jean-Ezra Young for insurance claims example 
• Ayasdi for open-sourcing this work 
• Hadoop and Spark communities for the awesome 
platform 
• Pandas team for the awesome tool
