SlideShare a Scribd company logo
pandas: a Foundational Python library for Data Analysis
                    and Statistics

                                   Wes McKinney


                            PyHPC 2011, 18 November 2011




Wes McKinney (@wesmckinn)          Data analysis with pandas   PyHPC 2011   1 / 25
An alternate title




       High Performance Structured Data
            Manipulation in Python




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   2 / 25
My background



     Former quant hacker at AQR Capital, now entrepreneur
     Background: math, statistics, computer science, quant finance.
     Shaken, not stirred
     Active in scientific Python community
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly




 Wes McKinney (@wesmckinn)   Data analysis with pandas       PyHPC 2011   3 / 25
Structured data



      cname              year   agefrom      ageto             ls     lsc    pop   ccode
0     Australia          1950   15           19                64.3   15.4   558   AUS
1     Australia          1950   20           24                48.4   26.4   645   AUS
2     Australia          1950   25           29                47.9   26.2   681   AUS
3     Australia          1950   30           34                44     23.8   614   AUS
4     Australia          1950   35           39                42.1   21.9   625   AUS
5     Australia          1950   40           44                38.9   20.1   555   AUS
6     Australia          1950   45           49                34     16.9   491   AUS
7     Australia          1950   50           54                29.6   14.6   439   AUS
8     Australia          1950   55           59                28     12.9   408   AUS
9     Australia          1950   60           64                26.3   12.1   356   AUS




    Wes McKinney (@wesmckinn)      Data analysis with pandas                 PyHPC 2011   4 / 25
Structured data



     A familiar data model
           Heterogeneous columns or hyperslabs
           Each column/hyperslab is homogeneously typed
           Relational databases (SQL, etc.) are just a special case
     Need good performance in row- and column-oriented operations
     Support for axis metadata
     Data alignment is critical
     Seamless integration with Python data structures and NumPy




 Wes McKinney (@wesmckinn)       Data analysis with pandas            PyHPC 2011   5 / 25
Structured data challenges



     Table modification: column insertion/deletion
     Axis indexing and data alignment
     Aggregation and transformation by group (“group by”)
     Missing data handling
     Pivoting and reshaping
     Merging and joining
     Time series-specific manipulations
     Fast IO: flat files, databases, HDF5, ...




 Wes McKinney (@wesmckinn)    Data analysis with pandas     PyHPC 2011   6 / 25
Not all fun and games




     We care nearly equally about
           Performance
           Ease-of-use (syntax / API fits your mental model)
           Expressiveness
     Clean, consistent API design is hard and underappreciated




 Wes McKinney (@wesmckinn)      Data analysis with pandas     PyHPC 2011   7 / 25
The big picture



     Build a foundation for data analysis and statistical computing
     Craft the most expressive / flexible in-memory data manipulation tool
     in any language
           Preferably also one of the fastest, too
     Vastly simplify the data preparation, munging, and integration process
     Comfortable abstractions: master data-fu without needing to be a
     computer scientist
     Later: extend API with distributed computing backend for
     larger-than-memory datasets




 Wes McKinney (@wesmckinn)       Data analysis with pandas   PyHPC 2011   8 / 25
pandas: a brief history




     Starting building April 2008 back at AQR
     Open-sourced (BSD license) mid-2009
     29075 lines of Python/Cython code as of yesterday, and growing fast
     Heavily tested, being used by many companies (inc. lots of financial
     firms) in production




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   9 / 25
Cython: getting good performance



     My choice tool for writing performant code
     High level access to NumPy C API internals
     Buffer syntax/protocol abstracts away striding details of
     non-contiguous arrays, very low overhead vs. working with raw C
     pointers
     Reduce/remove interpreter overhead associated with working with
     Python data structures
     Interface directly with C/C++ code when necessary




 Wes McKinney (@wesmckinn)   Data analysis with pandas     PyHPC 2011   10 / 25
Axis indexing




     Key pandas feature
     The axis index is a data structure itself, which can be customized to
     support things like:
           1-1 O(1) indexing with hashable Python objects
           Datetime indexing for time series data
           Hierarchical (multi-level) indexing
     Use Python dict to support O(1) lookups and O(n) realignment ops.
     Can specialize to get better performance and memory usage




 Wes McKinney (@wesmckinn)      Data analysis with pandas    PyHPC 2011   11 / 25
Axis indexing



     Every axis has an index
     Automatic alignment between differently-indexed objects: makes it
     nearly impossible to accidentally combine misaligned data
     Hierarchical indexing provides an intuitive way of structuring and
     working with higher-dimensional data
     Natural way of expressing “group by” and join-type operations
     As good or in many cases much more integrated/flexible than
     commercial or open-source alternatives to pandas/Python




 Wes McKinney (@wesmckinn)     Data analysis with pandas     PyHPC 2011   12 / 25
The trouble with Python dicts...




     Python dict memory footprint can be quite large
           1MM key-value pairs: something like 70mb on a 64-bit system
           Even though sizeof(PyObject*) == 8
     Python dict is great, but should use a faster, threadsafe hash table for
     primitive C types (like 64-bit integer)
     BUT: using a hash table only necessary in the general case. With
     monotonic indexes you don’t need one for realignment ops




 Wes McKinney (@wesmckinn)     Data analysis with pandas       PyHPC 2011   13 / 25
Some alignment numbers



     Hardware: Macbook Pro Core i7 laptop, Python 2.7.2
     Outer-join 500k-length indexes chosen from 1MM elements
           Dict-based with random strings: 2.2 seconds
           Sorted strings: 400ms (5.5x faster)
           Sorted int64: 19ms (115x faster)
     Fortunately, time series data falls into this last category
     Alignment ops with C primitives could be fairly easily parallelized with
     OpenMP in Cython




 Wes McKinney (@wesmckinn)      Data analysis with pandas          PyHPC 2011   14 / 25
DataFrame, the pandas workhorse




     A 2D tabular data structure with row and column indexes
     Hierarchical indexing one way to support higher-dimensional data in a
     lower-dimensional structure
     Simplified NumPy type system: float, int, boolean, object
     Rich indexing operations, SQL-like join/merges, etc.
     Support heterogeneous columns WITHOUT sacrificing performance in
     the homogeneous (e.g. floating point only) case




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   15 / 25
DataFrame, under the hood




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   16 / 25
Supporting size mutability



     In order to have good row-oriented performance, need to store
     like-typed columns in a single ndarray
     “Column” insertion: accumulate 1 × N × . . . homogeneous columns,
     later consolidate with other like-typed into a single block
     I.e. avoid reallocate-copy or array concatenation steps as long as
     possible
     Column deletions can be no-copy events (since ndarrays support
     views)




 Wes McKinney (@wesmckinn)    Data analysis with pandas       PyHPC 2011   17 / 25
Hierarchical indexing




     New this year, but really should have done long ago
     Natural result of multi-key groupby
     An intuitive way to work with higher-dimensional data
     Much less ad hoc way of expressing reshaping operations
     Once you have it, things like Excel-style pivot tables just “fall out”




 Wes McKinney (@wesmckinn)     Data analysis with pandas       PyHPC 2011   18 / 25
Reshaping




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   19 / 25
Reshaping

In [5]: df.unstack(’agefrom’).stack(’year’)




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   20 / 25
Reshaping implementation nuances




     Must deal with unbalanced group sizes / missing data
     Play vectorization tricks with the NumPy C-contiguous memory
     layout: no Python for loops allowed
     Care must be taken to handle heterogeneous and homogeneous data
     cases




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   21 / 25
GroupBy




     High level process
           split data set into groups
           apply function to each group (an aggregation or a transformation)
           combine results intelligently into a result data structure
     Can be used to emulate SQL GROUP BY operations




 Wes McKinney (@wesmckinn)      Data analysis with pandas        PyHPC 2011    22 / 25
GroupBy



     Grouping closely related to indexing
     Create correspondence between axis labels and group labels using one
     of:
           Array of group labels (like a DataFrame column)
           Python function to be applied to each axis tick
     Can group by multiple keys
     For a hierarchically indexed axis, can select a level and group by that
     (or some transformation thereof)




 Wes McKinney (@wesmckinn)      Data analysis with pandas     PyHPC 2011   23 / 25
GroupBy implementation challenges


     Computing the group labels from arbitrary Python objects is very
     expensive
           77ms for 1MM strings with 1K groups
           107ms for 1MM strings with 10K groups
           350ms for 1MM strings with 100K groups
     To sort or not to sort (for iteration)?
           Once you have the labels, can reorder the data set in O(n) (with a
           much smaller constant than computing the labels)
           Roughly 35ms to reorder 1MM float64 data points given the labels
     (By contrast, computing the mean of 1MM elements takes 1.4ms)
     Python function call overhead is significant in cases with lots of small
     groups; much better (orders of magnitude speedup) to write
     specialized Cython routines


 Wes McKinney (@wesmckinn)      Data analysis with pandas         PyHPC 2011    24 / 25
Demo, time permitting




Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   25 / 25

More Related Content

PPTX
Basics of Object Oriented Programming in Python
PPTX
Pandas
PDF
pandas - Python Data Analysis
PDF
Introduction to Python Pandas for Data Analytics
PPTX
Introduction to pandas
PPTX
Data Analysis with Python Pandas
PDF
Data Analysis and Statistics in Python using pandas and statsmodels
ODP
Data Analysis in Python
Basics of Object Oriented Programming in Python
Pandas
pandas - Python Data Analysis
Introduction to Python Pandas for Data Analytics
Introduction to pandas
Data Analysis with Python Pandas
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis in Python

What's hot (20)

PDF
NoSQL
PPTX
Python Seaborn Data Visualization
PDF
Python for Data Science
PPTX
PPT on Data Science Using Python
PPTX
DataFrame in Python Pandas
PDF
Java Linked List Tutorial | Edureka
PDF
Data Streaming For Big Data
PPTX
Introduction to OOP in Python
PDF
Data Analytics with Pandas and Numpy - Python
PPTX
Apache HBase™
PDF
Data Analysis and Visualization using Python
PDF
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
PPTX
Scikit Learn intro
KEY
Testing Hadoop jobs with MRUnit
PPTX
Introduction to numpy Session 1
PDF
Python NumPy Tutorial | NumPy Array | Edureka
PDF
Data visualization in Python
PPTX
Datastructures in python
PDF
Learn Python Programming | Python Programming - Step by Step | Python for Beg...
NoSQL
Python Seaborn Data Visualization
Python for Data Science
PPT on Data Science Using Python
DataFrame in Python Pandas
Java Linked List Tutorial | Edureka
Data Streaming For Big Data
Introduction to OOP in Python
Data Analytics with Pandas and Numpy - Python
Apache HBase™
Data Analysis and Visualization using Python
Python For Data Analysis | Python Pandas Tutorial | Learn Python | Python Tra...
Scikit Learn intro
Testing Hadoop jobs with MRUnit
Introduction to numpy Session 1
Python NumPy Tutorial | NumPy Array | Edureka
Data visualization in Python
Datastructures in python
Learn Python Programming | Python Programming - Step by Step | Python for Beg...
Ad

Similar to pandas: a Foundational Python Library for Data Analysis and Statistics (20)

PDF
Python for Financial Data Analysis with pandas
PDF
Slides 111017220255-phpapp01
PDF
pandas: Powerful data analysis tools for Python
PDF
A look inside pandas design and development
PDF
What's new in pandas and the SciPy stack for financial users
PDF
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
PPTX
Dc python meetup
PDF
Data Wrangling and Visualization Using Python
PDF
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PDF
Structured Data Challenges in Finance and Statistics
PDF
Getting started with pandas
PPTX
Unit 3_Numpy_VP.pptx
PPTX
Meetup Junio Data Analysis with python 2018
PDF
Panda data structures and its importance in Python.pdf
PPTX
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
PPTX
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
PDF
Pandas: a high-level, data-centric, Python extension and plotting library
PDF
Slide presentation pycassa_upload
PDF
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PDF
Python pandas I .pdf gugugigg88iggigigih
Python for Financial Data Analysis with pandas
Slides 111017220255-phpapp01
pandas: Powerful data analysis tools for Python
A look inside pandas design and development
What's new in pandas and the SciPy stack for financial users
ytuiuyuyuyryuuytryuryruyrjgjhgfnyfpug.pdf
Dc python meetup
Data Wrangling and Visualization Using Python
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Structured Data Challenges in Finance and Statistics
Getting started with pandas
Unit 3_Numpy_VP.pptx
Meetup Junio Data Analysis with python 2018
Panda data structures and its importance in Python.pdf
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python (3).pptx
Q-Step_WS_06112019_Data_Analysis_and_visualisation_with_Python.pptx
Pandas: a high-level, data-centric, Python extension and plotting library
Slide presentation pycassa_upload
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Python pandas I .pdf gugugigg88iggigigih
Ad

More from Wes McKinney (20)

PDF
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
PDF
Solving Enterprise Data Challenges with Apache Arrow
PDF
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
PDF
Apache Arrow: High Performance Columnar Data Framework
PDF
New Directions for Apache Arrow
PDF
Apache Arrow Flight: A New Gold Standard for Data Transport
PDF
ACM TechTalks : Apache Arrow and the Future of Data Frames
PDF
Apache Arrow: Present and Future @ ScaledML 2020
PDF
Apache Arrow: Leveling Up the Analytics Stack
PDF
Apache Arrow Workshop at VLDB 2019 / BOSS Session
PDF
Apache Arrow: Leveling Up the Data Science Stack
PDF
Ursa Labs and Apache Arrow in 2019
PDF
Apache Arrow at DataEngConf Barcelona 2018
PDF
Apache Arrow: Cross-language Development Platform for In-memory Data
PDF
Apache Arrow -- Cross-language development platform for in-memory data
PPTX
Shared Infrastructure for Data Science
PDF
Data Science Without Borders (JupyterCon 2017)
PPTX
Memory Interoperability in Analytics and Machine Learning
PPTX
Raising the Tides: Open Source Analytics for Data Science
PDF
Improving Python and Spark (PySpark) Performance and Interoperability
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Solving Enterprise Data Challenges with Apache Arrow
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: High Performance Columnar Data Framework
New Directions for Apache Arrow
Apache Arrow Flight: A New Gold Standard for Data Transport
ACM TechTalks : Apache Arrow and the Future of Data Frames
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow: Leveling Up the Data Science Stack
Ursa Labs and Apache Arrow in 2019
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow -- Cross-language development platform for in-memory data
Shared Infrastructure for Data Science
Data Science Without Borders (JupyterCon 2017)
Memory Interoperability in Analytics and Machine Learning
Raising the Tides: Open Source Analytics for Data Science
Improving Python and Spark (PySpark) Performance and Interoperability

Recently uploaded (20)

PPTX
Cloud computing and distributed systems.
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
Big Data Technologies - Introduction.pptx
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Encapsulation theory and applications.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Electronic commerce courselecture one. Pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
KodekX | Application Modernization Development
Cloud computing and distributed systems.
Chapter 3 Spatial Domain Image Processing.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Big Data Technologies - Introduction.pptx
Review of recent advances in non-invasive hemoglobin estimation
Encapsulation theory and applications.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The AUB Centre for AI in Media Proposal.docx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Building Integrated photovoltaic BIPV_UPV.pdf
Machine learning based COVID-19 study performance prediction
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Electronic commerce courselecture one. Pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
MYSQL Presentation for SQL database connectivity
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
KodekX | Application Modernization Development

pandas: a Foundational Python Library for Data Analysis and Statistics

  • 1. pandas: a Foundational Python library for Data Analysis and Statistics Wes McKinney PyHPC 2011, 18 November 2011 Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
  • 2. An alternate title High Performance Structured Data Manipulation in Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 2 / 25
  • 3. My background Former quant hacker at AQR Capital, now entrepreneur Background: math, statistics, computer science, quant finance. Shaken, not stirred Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 3 / 25
  • 4. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 4 / 25
  • 5. Structured data A familiar data model Heterogeneous columns or hyperslabs Each column/hyperslab is homogeneously typed Relational databases (SQL, etc.) are just a special case Need good performance in row- and column-oriented operations Support for axis metadata Data alignment is critical Seamless integration with Python data structures and NumPy Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 5 / 25
  • 6. Structured data challenges Table modification: column insertion/deletion Axis indexing and data alignment Aggregation and transformation by group (“group by”) Missing data handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast IO: flat files, databases, HDF5, ... Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 6 / 25
  • 7. Not all fun and games We care nearly equally about Performance Ease-of-use (syntax / API fits your mental model) Expressiveness Clean, consistent API design is hard and underappreciated Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 7 / 25
  • 8. The big picture Build a foundation for data analysis and statistical computing Craft the most expressive / flexible in-memory data manipulation tool in any language Preferably also one of the fastest, too Vastly simplify the data preparation, munging, and integration process Comfortable abstractions: master data-fu without needing to be a computer scientist Later: extend API with distributed computing backend for larger-than-memory datasets Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 8 / 25
  • 9. pandas: a brief history Starting building April 2008 back at AQR Open-sourced (BSD license) mid-2009 29075 lines of Python/Cython code as of yesterday, and growing fast Heavily tested, being used by many companies (inc. lots of financial firms) in production Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 9 / 25
  • 10. Cython: getting good performance My choice tool for writing performant code High level access to NumPy C API internals Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers Reduce/remove interpreter overhead associated with working with Python data structures Interface directly with C/C++ code when necessary Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 10 / 25
  • 11. Axis indexing Key pandas feature The axis index is a data structure itself, which can be customized to support things like: 1-1 O(1) indexing with hashable Python objects Datetime indexing for time series data Hierarchical (multi-level) indexing Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 11 / 25
  • 12. Axis indexing Every axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations As good or in many cases much more integrated/flexible than commercial or open-source alternatives to pandas/Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 12 / 25
  • 13. The trouble with Python dicts... Python dict memory footprint can be quite large 1MM key-value pairs: something like 70mb on a 64-bit system Even though sizeof(PyObject*) == 8 Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer) BUT: using a hash table only necessary in the general case. With monotonic indexes you don’t need one for realignment ops Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 13 / 25
  • 14. Some alignment numbers Hardware: Macbook Pro Core i7 laptop, Python 2.7.2 Outer-join 500k-length indexes chosen from 1MM elements Dict-based with random strings: 2.2 seconds Sorted strings: 400ms (5.5x faster) Sorted int64: 19ms (115x faster) Fortunately, time series data falls into this last category Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 14 / 25
  • 15. DataFrame, the pandas workhorse A 2D tabular data structure with row and column indexes Hierarchical indexing one way to support higher-dimensional data in a lower-dimensional structure Simplified NumPy type system: float, int, boolean, object Rich indexing operations, SQL-like join/merges, etc. Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 15 / 25
  • 16. DataFrame, under the hood Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 16 / 25
  • 17. Supporting size mutability In order to have good row-oriented performance, need to store like-typed columns in a single ndarray “Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block I.e. avoid reallocate-copy or array concatenation steps as long as possible Column deletions can be no-copy events (since ndarrays support views) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 17 / 25
  • 18. Hierarchical indexing New this year, but really should have done long ago Natural result of multi-key groupby An intuitive way to work with higher-dimensional data Much less ad hoc way of expressing reshaping operations Once you have it, things like Excel-style pivot tables just “fall out” Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 18 / 25
  • 19. Reshaping Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 19 / 25
  • 20. Reshaping In [5]: df.unstack(’agefrom’).stack(’year’) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 20 / 25
  • 21. Reshaping implementation nuances Must deal with unbalanced group sizes / missing data Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed Care must be taken to handle heterogeneous and homogeneous data cases Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 21 / 25
  • 22. GroupBy High level process split data set into groups apply function to each group (an aggregation or a transformation) combine results intelligently into a result data structure Can be used to emulate SQL GROUP BY operations Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 22 / 25
  • 23. GroupBy Grouping closely related to indexing Create correspondence between axis labels and group labels using one of: Array of group labels (like a DataFrame column) Python function to be applied to each axis tick Can group by multiple keys For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 23 / 25
  • 24. GroupBy implementation challenges Computing the group labels from arbitrary Python objects is very expensive 77ms for 1MM strings with 1K groups 107ms for 1MM strings with 10K groups 350ms for 1MM strings with 100K groups To sort or not to sort (for iteration)? Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels) Roughly 35ms to reorder 1MM float64 data points given the labels (By contrast, computing the mean of 1MM elements takes 1.4ms) Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 24 / 25
  • 25. Demo, time permitting Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 25 / 25