df: Dataframe on Spark

Mohit Jaggi
Code Ninja and Troublemaker at Ayasdi

Agenda
• About Ayasdi
• df
• Brief Demo (if time allows)
• Conclusion

Traditional Analytics
[Figure: diagram with the labels "Hypothesis", "CODE", and "?"]

Ayasdi Solution
[Figure: platform diagram with the labels "UX", "Ayasdi Platform", "Distributed Computing", "Algorithmic Reach", and "ETL"]

Day in a data scientist’s life
• Get data
• Need more/something else
• Data wrangling
• Rinse, repeat
• Load into analysis software like Ayasdi Core
• Actual data analysis, model-building etc

Data Wrangling Tools
• grep, cut, wc -l, head, tail
• Python Pandas
• Most useful construct: the pandas data frame, à la Excel with a CLI

Challenges
• Applying data science techniques to data larger than a single machine’s memory
• Easier to procure a cluster of small machines than one big machine
• Processing takes too long

Solution: Distribute
• Hadoop ecosystem: Spark is great
• Learning curve: what is this RDD thing? Where is my familiar data frame?
• There is pyspark, but to get the best out of Spark you use Scala, which is another learning curve
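
For anyone asking what a partitioned dataset buys you, here is a toy sketch in plain Python. It is not Spark's API: `partition`, `map_partitions`, and the sample data are hypothetical names made up for illustration. It only shows the idea that each chunk is processed independently and the partial results are then combined.

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition(rows: List[T], n_parts: int) -> List[List[T]]:
    """Split rows into n_parts chunks (hypothetical helper, round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_partitions(parts: List[List[T]], fn: Callable[[List[T]], R]) -> List[R]:
    """Apply fn to every chunk on its own.

    On a cluster each chunk would sit on a different machine;
    here the chunks are simply processed one after another.
    """
    return [fn(part) for part in parts]


values = list(range(1, 101))               # made-up data: 1..100
parts = partition(values, 4)               # four independent chunks
partial_sums = map_partitions(parts, sum)  # one partial result per chunk
total = reduce(lambda a, b: a + b, partial_sums)  # combine the partials
```

No chunk needs to see the others until the final combine step, which is what lets the work spread over several small machines.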

df: Gentle Incline
“I want to put my projects on hold, and learn several new things simultaneously”
- No One Ever
• Attempts to provide an API on Spark that looks and feels like a pandas data frame
e.g. in pandas: df["a"]
in df: df("a")
• Also intuitive for R programmers
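
To make the look-and-feel point concrete, here is a minimal in-memory sketch of how column access and column arithmetic can be built from operator overloading. It is neither pandas nor df code: `Frame`, `Column`, and the sample columns are hypothetical. In Python the hooks are `__getitem__` and `__setitem__`; in Scala, `x("a")` and `x("a") = v` are sugar for `x.apply("a")` and `x.update("a", v)`, which is what makes the parenthesis syntax possible there.

```python
from typing import Any, Callable, Dict, Iterable, List


class Column:
    """A toy column: a list of values with element-wise operations."""

    def __init__(self, values: Iterable[Any]) -> None:
        self.values: List[Any] = list(values)

    def __mul__(self, other: "Column") -> "Column":
        if len(self.values) != len(other.values):
            raise ValueError("columns must have the same length")
        return Column(a * b for a, b in zip(self.values, other.values))

    def map(self, fn: Callable[[Any], Any]) -> "Column":
        """Apply fn to every element and return a new column."""
        return Column(fn(v) for v in self.values)


class Frame:
    """A toy data frame: named columns, accessed with frame["name"]."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self._cols = {name: Column(vals) for name, vals in columns.items()}

    def __getitem__(self, name: str) -> Column:
        return self._cols[name]

    def __setitem__(self, name: str, column: Column) -> None:
        self._cols[name] = column


frame = Frame({"price": [10.0, 2.5], "qty": [3, 4]})  # made-up data
frame["total"] = frame["price"] * frame["qty"]        # adds a new column
```

A distributed version would keep the same surface and swap the Python lists for partitioned data.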

Advantages
• Quite transparently runs on Spark: Distributed processing
• Is in Scala: No layering overhead
• Is in Scala: Can directly call cutting-edge Spark libraries like MLlib [pyspark wrappers are usually a bit behind]
• Is an “internal DSL”: Advanced users can augment with arbitrary Scala code. [A Python wrapper is still possible]
• Is an “internal DSL”: Fast without resorting to code-generation
• Fully open sourced, Apache license

Real Life Examples
Snippets of data scientist code that were “converted” from pandas to df to make them scale to larger data
Add a column with total
mppu["total"] = mppu["avg"] * mppu["c_line_srvc_cnt"]
->
mppu("total") = mppu("avg") * mppu("c_line_srvc_cnt")
Remove $ and , from numbers representing money
# regex=False treats "$" as a literal character, not as an end-of-string anchor
mppu["de-comma"] = mppu["dollar"].str.replace("$", "", regex=False)
mppu["de-dollar"] = mppu["de-comma"].str.replace(",", "", regex=False).astype(float)
->
mppu("de-dollar") = mppu("dollar").map { x: String => x.replace("$", "").replace(",", "").toDouble }
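
The per-element cleaning step can be checked on its own before it is mapped over a whole column. Below is a plain-Python sketch of the same string logic; `parse_money` and the sample strings are made up for illustration and are not part of the original claims data.

```python
from typing import List


def parse_money(text: str) -> float:
    """Strip "$" and "," from a money string and convert it to a float."""
    return float(text.replace("$", "").replace(",", ""))


dollars: List[str] = ["$1,234.50", "$99", "$12,000,000.00"]  # made-up samples
amounts: List[float] = [parse_money(d) for d in dollars]
```

Because the function works on one value at a time, the same logic can be passed to a column-level map, whether the column lives in memory or is spread across a cluster.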

Future
• pyspark wrapper
• more data sources like SQL, Parquet, HDF5, etc.
• charts and graphs
• contributors welcome!

Summary
• pandas is awesome
• df scales to bigger data, looks and feels like pandas
• fully open source
https://github.com/AyasdiOpenSource/df
• Check out our website. We are hiring!
http://engineering.ayasdi.com/
http://www.ayasdi.com/careers/

Acknowledgements
• Max Song for introducing me to Pandas
• Jean-Ezra Young for insurance claims example
• Ayasdi for open-sourcing this work
• Hadoop and Spark communities for the awesome platform
• Pandas team for the awesome tool
