Statistics and Data Analysis in Python with pandas and statsmodels
                          Wes McKinney @wesmckinn

                NYC Open Statistical Programming Meetup
                              9/14/2011

Thursday, September 15,
Talk Overview
                 • Statistical Computing Big Picture
                 • Scientific Python Stack
                 • pandas
                 • statsmodels
                 • Ideas for the (near) future
Who am I?

                 • MIT Math
                 • AQR: Quant Finance
                 • Back to NYC
                 • Statistics

The Big Picture

                 • Building the “next generation”
                          statistical computing environment
                 • Making data analysis / statistics more
                          intuitive, flexible, powerful
                 • Closing the “research-production” gap

Application areas

                 • General data munging, manipulation
                 • Financial modeling and analytics
                 • Statistical modeling and econometrics
                 • “Enterprise” / “Big Data” analytics?

R, the solution?
      Hadley Wickham (ggplot2, plyr, reshape, ...)


                     “R is the most powerful statistical
                     computing language on the planet”




Easy to miss the point




R, the solution?
      Ross Ihaka (one of the creators of R)

                “I have been worried for some time that R isn’t going
                to provide the base that we’re going to need for
                statistical computation in the future. (It may well be
                that the future is already upon us.) ... I have come to
                the conclusion that rather than ‘fixing’ R, it would
                be much more productive to simply start
                over and build something better”



Some of my gripes about R
                 • Wonky, highly idiosyncratic programming
                          language*
                 • Poor speed and memory usage
                 • General purpose libraries and software
                          development tools lacking
                 • The GPL
                             * But yes, really great libraries

R: great libraries and deep connections to academia

                 Example R superstars:
                 • Jeff Ryan: xts, quantmod
                 • Hadley Wickham: ggplot2, plyr, reshape

Uniting against common enemies




“Research-Production” Gap

                 • Best data analysis / statistics tools: often
                          least well-suited for building production
                          systems
                 • The “Black Box”: embedding or RPC
                 • High productivity <=> Low productivity

“Research-Production” Gap

                 • Production: much more than crunching data
                          and making pretty plots
                 • Code readability, debuggability,
                          maintainability matter a lot in the long run
                 • Integration with other systems

“Research-Production” Gap




My assertion

                   Python is the best (only?)
                     viable solution to the
                   Research-Production gap


Scientific Python Stack
                 • Incredible growth in libraries and tools
                          over the last 5 years
                      • NumPy: the cornerstone
                      • Killer app: IPython
                      • Cython: C speedups, 80+% less dev time
                 • Other exciting high-profile projects:
                          scikit-learn, theano, sympy


Uniting the Python Community
                 • Fragmentation is a (big) problem / risk
                 • Statistical libraries need to be able to talk
                          to each other easily
                 • R’s success: S-Plus legacy + quality CRAN
                          packages built around cohesive base R /
                          data structures



pandas
                 • Foundational rich data structures and data
                          analysis tools
                 • Arrays with labeled axes and support for
                          heterogeneous data
                 • Similar to R data.frame, but with many more
                          built-in features
                 • Missing data, time series support
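
A minimal sketch of the ideas on this slide, written against the current pandas API (which may differ from the 0.4 release discussed in the talk). The column names, row labels, and values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: names and numbers are illustrative, not from the talk.
df = pd.DataFrame(
    {
        "ticker": ["AAPL", "GOOG", "MSFT"],  # string column
        "price": [380.5, np.nan, 26.0],      # float column with one missing value
        "shares": [100, 50, 200],            # integer column
    },
    index=["a", "b", "c"],                   # row labels (the labeled axis)
)

# Heterogeneous data: each column keeps its own dtype.
column_types = df.dtypes

# Label-based selection of a single cell; this one is missing (NaN).
price_b_missing = pd.isna(df.loc["b", "price"])

# Missing data support: reductions skip NaN by default.
mean_price = df["price"].mean()
```

Here `mean_price` averages only the two observed prices, which is the kind of built-in missing-data handling the slide refers to.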
pandas

                 • Milestone: 0.4 release 9/12/2011
                 • Dozens of new features and enhancements
                 • Completely rewritten docs: pandas.sf.net
                 • Many more new features planned for the
                          future



The sleeping dragon




Little did I know...




pandas: some key features

                 • Automatic and explicit data alignment
                 • Label-based (incl. hierarchical) indexing
                 • GroupBy, pivoting, and reshaping
                 • Missing data support
                 • Time series functionality
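
The features listed above can be sketched roughly as follows. This uses the current pandas API rather than the 0.4 release from the talk, and all labels, values, and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Automatic alignment: arithmetic matches on labels, not on position.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
total = s1 + s2                      # "a" and "d" have no partner, so they become NaN

# Explicit alignment: conform s1 to the labels of s2.
s1_on_s2 = s1.reindex(s2.index)

# Missing data support: replace NaN with a chosen value.
filled = total.fillna(0.0)

# Hierarchical (multi-level) label-based indexing.
idx = pd.MultiIndex.from_product([["x", "y"], [1, 2]], names=["group", "item"])
h = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
x_only = h.loc["x"]                  # all items in group "x"

# GroupBy: aggregate within each outer-level group.
group_sums = h.groupby(level="group").sum()

# Reshaping / pivoting: move the inner level into columns.
wide = h.unstack("item")

# Time series: a date-indexed Series resampled into 2-day sums.
dates = pd.date_range("2011-09-12", periods=4, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)
two_day = ts.resample("2D").sum()
```

Note that `total` keeps the union of both indexes, so nothing is silently dropped when labels do not match.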

Demo time



statsmodels
                 • Statistics and econometrics in Python
                 • Focused on estimation of statistical models
                      • Regression models (GLS, Robust LM, ...)
                      • Time series models (AR/ARMA, VAR,
                          Kalman Filter, ...)
                      • Non-parametric models (e.g. KDE)

statsmodels
                 • Development has been largely focused on
                          computation
                      • Correct, tested results
                 • In progress: better user interface
                      • Formula frameworks (e.g. similar to R)
                      • pandas integration
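
The slide describes formulas and pandas integration as work in progress. In current statsmodels releases, one form this takes is the `statsmodels.formula.api` module; the sketch below assumes that module is available, and the column names and coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic DataFrame: y = 0.5 - 1.5*x + small noise (illustrative values).
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.5 - 1.5 * df["x"] + rng.normal(scale=0.1, size=200)

# R-style formula; the intercept is added implicitly.
res = smf.ols("y ~ x", data=df).fit()

# pandas integration: coefficients come back as a labeled Series.
coef_names = list(res.params.index)
slope = res.params["x"]
```

Because the model is built from a DataFrame, the fitted parameters carry the column names instead of bare positions.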

Demo time



Ideas for the future

                 • ggpy: ggplot2 for Python
                 • Statistical Python Distribution / Umbrella
                          project
                 • Interactive GUI widgets to visualize /
                          explore data and statsmodels results



Thanks

                 • pandas: http://pandas.sf.net
                 • statsmodels: http://statsmodels.sf.net
                 • Twitter: @wesmckinn
                 • E-mail: wesmckinn (at) gmail (dot) com
                 • Blog: http://blog.wesmckinney.com

