pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney

PyHPC 2011, 18 November 2011

An alternate title

High Performance Structured Data
Manipulation in Python


My background

Former quant hacker at AQR Capital, now entrepreneur
Background: math, statistics, computer science, quant finance.
Shaken, not stirred
Active in scientific Python community
My blog: http://blog.wesmckinney.com
Twitter: @wesmckinn
Book! “Python for Data Analysis”, to hit the shelves later next year
from O’Reilly


Structured data

   cname      year  agefrom  ageto    ls   lsc  pop  ccode
0  Australia  1950       15     19  64.3  15.4  558  AUS
1  Australia  1950       20     24  48.4  26.4  645  AUS
2  Australia  1950       25     29  47.9  26.2  681  AUS
3  Australia  1950       30     34    44  23.8  614  AUS
4  Australia  1950       35     39  42.1  21.9  625  AUS
5  Australia  1950       40     44  38.9  20.1  555  AUS
6  Australia  1950       45     49    34  16.9  491  AUS
7  Australia  1950       50     54  29.6  14.6  439  AUS
8  Australia  1950       55     59    28  12.9  408  AUS
9  Australia  1950       60     64  26.3  12.1  356  AUS


Structured data

A familiar data model
Heterogeneous columns or hyperslabs
Each column/hyperslab is homogeneously typed
Relational databases (SQL, etc.) are just a special case
Need good performance in row- and column-oriented operations
Support for axis metadata
Data alignment is critical
Seamless integration with Python data structures and NumPy


Structured data challenges

Table modification: column insertion/deletion
Axis indexing and data alignment
Aggregation and transformation by group (“group by”)
Missing data handling
Pivoting and reshaping
Merging and joining
Time series-specific manipulations
Fast IO: flat files, databases, HDF5, ...


Not all fun and games

We care nearly equally about
Performance
Ease-of-use (syntax / API fits your mental model)
Expressiveness
Clean, consistent API design is hard and underappreciated


The big picture

Build a foundation for data analysis and statistical computing
Craft the most expressive / flexible in-memory data manipulation tool
in any language
Preferably one of the fastest, too
Vastly simplify the data preparation, munging, and integration process
Comfortable abstractions: master data-fu without needing to be a
computer scientist
Later: extend API with distributed computing backend for
larger-than-memory datasets


pandas: a brief history

Started building in April 2008, back at AQR
Open-sourced (BSD license) mid-2009
29075 lines of Python/Cython code as of yesterday, and growing fast
Heavily tested, being used by many companies (including lots of financial
firms) in production


Cython: getting good performance

My tool of choice for writing performant code
High level access to NumPy C API internals
Buffer syntax/protocol abstracts away striding details of
non-contiguous arrays, very low overhead vs. working with raw C
pointers
Reduce/remove interpreter overhead associated with working with
Python data structures
Interface directly with C/C++ code when necessary


Axis indexing

Key pandas feature
The axis index is a data structure itself, which can be customized to
support things like:
1-1 O(1) indexing with hashable Python objects
Datetime indexing for time series data
Hierarchical (multi-level) indexing
Use Python dict to support O(1) lookups and O(n) realignment ops.
Can specialize to get better performance and memory usage
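
A minimal sketch of label-based lookup and realignment through an index, assuming a current pandas release (the API has changed since this talk). The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; any hashable Python objects can serve as index entries.
index = pd.Index(["AUS", "BRA", "CAN", "DEU"])

# Label -> integer position.
pos = index.get_loc("CAN")

# Realignment: position of each label of `other` in `index`, -1 if absent.
other = pd.Index(["DEU", "AUS", "ZZZ"])
indexer = index.get_indexer(other)

# The same realignment applied to data: a missing label becomes NaN.
s = pd.Series(np.array([10.0, 20.0, 30.0, 40.0]), index=index)
realigned = s.reindex(other)
```

Here `get_indexer` does the O(n) realignment step, and `reindex` uses the resulting positions to move the data.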


Axis indexing

Every axis has an index
Automatic alignment between differently-indexed objects: makes it
nearly impossible to accidentally combine misaligned data
Hierarchical indexing provides an intuitive way of structuring and
working with higher-dimensional data
Natural way of expressing “group by” and join-type operations
As good as, or in many cases much more integrated/flexible than,
commercial or open-source alternatives to pandas/Python
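
A small illustration of automatic alignment, assuming a current pandas release. The two Series carry made-up values and deliberately different label orders.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Values are matched by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b
```

The result is indexed by the union of the two sets of labels, so "z" pairs 3.0 with 10.0 even though they sit at different positions.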


The trouble with Python dicts...

Python dict memory footprint can be quite large
1MM key-value pairs: something like 70 MB on a 64-bit system
Even though sizeof(PyObject*) == 8
Python dict is great, but should use a faster, threadsafe hash table for
primitive C types (like 64-bit integer)
BUT: a hash table is only necessary in the general case. With
monotonic indexes you don’t need one for realignment ops
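
To illustrate why sorted indexes need no hash table, here is a sketch of a one-pass merge-style outer join. It is written in plain Python for readability and is not the pandas implementation; the function name `outer_join_sorted` is hypothetical, and both inputs are assumed sorted and free of duplicates.

```python
import numpy as np

def outer_join_sorted(left, right):
    """Outer-join two sorted, duplicate-free integer arrays in one pass.

    Returns the sorted union and, for each union entry, its position in
    `left` and in `right` (-1 where the entry is absent).
    """
    i = j = 0
    joined, lidx, ridx = [], [], []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            joined.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            joined.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        else:
            joined.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
    # Drain whichever input still has entries left.
    while i < len(left):
        joined.append(left[i]); lidx.append(i); ridx.append(-1)
        i += 1
    while j < len(right):
        joined.append(right[j]); lidx.append(-1); ridx.append(j)
        j += 1
    return (np.array(joined, dtype=np.int64),
            np.array(lidx, dtype=np.int64),
            np.array(ridx, dtype=np.int64))

left = np.array([1, 3, 5, 7], dtype=np.int64)
right = np.array([3, 4, 7, 9], dtype=np.int64)
joined, lidx, ridx = outer_join_sorted(left, right)
```

Each element is visited once, so the join is linear in the combined length and only compares neighbouring values.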


Some alignment numbers

Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
Outer-join 500k-length indexes chosen from 1MM elements
Dict-based with random strings: 2.2 seconds
Sorted strings: 400ms (5.5x faster)
Sorted int64: 19ms (115x faster)
Fortunately, time series data falls into this last category
Alignment ops with C primitives could be fairly easily parallelized with
OpenMP in Cython


DataFrame, the pandas workhorse

A 2D tabular data structure with row and column indexes
Hierarchical indexing is one way to support higher-dimensional data in a
lower-dimensional structure
Simplified NumPy type system: float, int, boolean, object
Rich indexing operations, SQL-like join/merges, etc.
Support heterogeneous columns WITHOUT sacrificing performance in
the homogeneous (e.g. floating point only) case
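
A short sketch of a DataFrame with heterogeneous columns, using the first three rows of the table shown earlier and assuming a current pandas release. The derived column name `agespan` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "cname":   ["Australia", "Australia", "Australia"],
    "year":    [1950, 1950, 1950],
    "agefrom": [15, 20, 25],
    "ageto":   [19, 24, 29],
    "ls":      [64.3, 48.4, 47.9],
    "pop":     [558, 645, 681],
})

# Column insertion: a new integer column computed from two existing ones.
df["agespan"] = df["ageto"] - df["agefrom"]

# Column deletion.
del df["year"]

# Boolean row selection on one column.
older = df[df["agefrom"] >= 20]
```

Each column keeps its own type: strings, integers and floats sit side by side in one table.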


DataFrame, under the hood


Supporting size mutability

In order to have good row-oriented performance, need to store
like-typed columns in a single ndarray
“Column” insertion: accumulate 1 × N × . . . homogeneous columns,
later consolidate with other like-typed into a single block
I.e. avoid reallocate-copy or array concatenation steps as long as
possible
Column deletions can be no-copy events (since ndarrays support
views)
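
The deferred-consolidation idea can be sketched with plain NumPy. The `FloatColumns` class below is a hypothetical toy, not pandas' internal data structure: it handles only float64 columns and only deletion of the trailing column.

```python
import numpy as np

class FloatColumns:
    """Toy store for float64 columns (hypothetical, not pandas internals)."""

    def __init__(self):
        self.block = None    # consolidated 2D array, one row per column
        self.names = []      # names of the columns held in `block`
        self.pending = {}    # inserted columns that are not yet consolidated

    def insert(self, name, values):
        # Cheap: keep the new column as its own 1 x N piece for now.
        self.pending[name] = np.asarray(values, dtype=np.float64)

    def consolidate(self):
        # One allocation and one copy for all pending columns together.
        if not self.pending:
            return
        pieces = [] if self.block is None else [self.block]
        pieces.extend(arr[np.newaxis, :] for arr in self.pending.values())
        self.block = np.concatenate(pieces, axis=0)
        self.names.extend(self.pending)
        self.pending = {}

    def drop_last(self):
        # Dropping the trailing column only takes a view of the block.
        self.consolidate()
        self.names.pop()
        self.block = self.block[:-1]

store = FloatColumns()
store.insert("a", [1.0, 2.0, 3.0])
store.insert("b", [4.0, 5.0, 6.0])
store.consolidate()
full = store.block
store.drop_last()
```

After `drop_last`, `store.block` still shares memory with `full`, which is the no-copy deletion the slide refers to.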


Hierarchical indexing

New this year, but really should have been done long ago
Natural result of multi-key groupby
An intuitive way to work with higher-dimensional data
Much less ad hoc way of expressing reshaping operations
Once you have it, things like Excel-style pivot tables just “fall out”
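
A sketch of how a pivot table can "fall out" of a multi-key groupby, assuming a current pandas release. The records and column names are made up for illustration.

```python
import pandas as pd

# Made-up records.
df = pd.DataFrame({
    "country": ["AUS", "AUS", "AUS", "AUS", "BRA", "BRA"],
    "year":    [1950, 1950, 1955, 1955, 1950, 1955],
    "value":   [1, 2, 3, 4, 5, 6],
})

# Grouping by two keys gives a result indexed by a two-level MultiIndex.
totals = df.groupby(["country", "year"])["value"].sum()

# Moving one index level to the columns produces the pivot-table layout.
table = totals.unstack("year")
```

`table` has one row per country and one column per year.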


Reshaping


Reshaping

In [5]: df.unstack('agefrom').stack('year')
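
The frame used on the slide is not reproduced here, so the following sketch assumes a layout on which the call makes sense: rows labeled by (cname, agefrom) and one column per year. The values are made up, and a current pandas release is assumed.

```python
import pandas as pd

# Made-up values; only the layout matters here.
long = pd.DataFrame({
    "cname":   ["Australia"] * 4,
    "year":    [1950, 1950, 1955, 1955],
    "agefrom": [15, 20, 15, 20],
    "value":   [1.0, 2.0, 3.0, 4.0],
})

# Rows labeled by (cname, agefrom), one column per year.
df = long.set_index(["cname", "agefrom", "year"])["value"].unstack("year")

# Move 'agefrom' from the rows to the columns,
# then move 'year' from the columns to the rows.
result = df.unstack("agefrom").stack("year")
```

`result` is labeled by (cname, year) on the rows and by agefrom on the columns, so the two levels have swapped axes.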


Reshaping implementation nuances

Must deal with unbalanced group sizes / missing data
Play vectorization tricks with the NumPy C-contiguous memory
layout: no Python for loops allowed
Care must be taken to handle heterogeneous and homogeneous data
cases
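
One such vectorization trick can be sketched in NumPy: compute the flat offset of every observation in the C-contiguous output and scatter all values in a single assignment. This is an illustration rather than the pandas code; the function name `unstack_values` is hypothetical, and it assumes float values and no duplicate (row, column) pairs.

```python
import numpy as np

def unstack_values(row_codes, col_codes, values, n_rows, n_cols):
    """Scatter long-format values into an (n_rows, n_cols) table.

    row_codes and col_codes hold the integer position of each
    observation's row and column label. Cells with no observation
    stay NaN, which covers unbalanced groups.
    """
    out = np.full(n_rows * n_cols, np.nan)
    # In C-contiguous layout, cell (r, c) lives at flat offset r * n_cols + c.
    flat = row_codes * n_cols + col_codes
    out[flat] = values
    return out.reshape(n_rows, n_cols)

row_codes = np.array([0, 0, 1])
col_codes = np.array([0, 1, 1])
values = np.array([1.0, 2.0, 3.0])
table = unstack_values(row_codes, col_codes, values, n_rows=2, n_cols=2)
```

The second row has no observation for the first column, so that cell remains NaN.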


GroupBy

High level process
split data set into groups
apply function to each group (an aggregation or a transformation)
combine results intelligently into a result data structure
Can be used to emulate SQL GROUP BY operations
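
A minimal split-apply-combine sketch, assuming a current pandas release and made-up data. It shows one aggregation and one transformation on the same grouping.

```python
import pandas as pd

df = pd.DataFrame({
    "key":   ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Split: group the 'value' column by 'key'.
grouped = df.groupby("key")["value"]

# Aggregation: one result per group, comparable to
# SELECT key, AVG(value) FROM t GROUP BY key.
means = grouped.mean()

# Transformation: the result has the same length as the input.
demeaned = grouped.transform(lambda x: x - x.mean())
```

The aggregation is combined into a Series indexed by group key, while the transformation keeps the original row labels.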


GroupBy

Grouping closely related to indexing
Create correspondence between axis labels and group labels using one
of:
Array of group labels (like a DataFrame column)
Python function to be applied to each axis tick
Can group by multiple keys
For a hierarchically indexed axis, can select a level and group by that
(or some transformation thereof)
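
The grouping options listed above can be sketched as follows, assuming a current pandas release. All labels and values are made up.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("AUS", 1950), ("AUS", 1955), ("BRA", 1950), ("BRA", 1955)],
    names=["country", "year"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Array of group labels, one per row.
by_array = s.groupby(np.array(["x", "y", "x", "y"])).sum()

# Multiple keys: a list of label arrays.
by_keys = s.groupby(
    [np.array(["x", "y", "x", "y"]), np.array(["p", "p", "q", "q"])]
).sum()

# One level of a hierarchically indexed axis.
by_level = s.groupby(level="country").sum()

# Python function applied to each axis label (here: year -> decade).
t = pd.Series([1.0, 2.0, 3.0, 4.0], index=[1950, 1955, 1960, 1965])
by_function = t.groupby(lambda year: year // 10 * 10).sum()
```

In every case the grouping reduces to a correspondence between axis labels and group labels.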


GroupBy implementation challenges

Computing the group labels from arbitrary Python objects is very
expensive
77ms for 1MM strings with 1K groups
To sort or not to sort (for iteration)?
Once you have the labels, can reorder the data set in O(n) (with a
much smaller constant than computing the labels)
Roughly 35ms to reorder 1MM float64 data points given the labels
(By contrast, computing the mean of 1MM elements takes 1.4ms)
Python function call overhead is significant in cases with lots of small
groups; much better (orders of magnitude speedup) to write
specialized Cython routines
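
The steps above can be sketched as follows: compute integer group labels, reorder the data in O(n) with a counting sort, then aggregate contiguous runs. This is an illustration, not the pandas implementation. The helper `group_order` is hypothetical and uses a plain Python loop for clarity, which is exactly the kind of loop a specialized Cython routine would replace.

```python
import numpy as np
import pandas as pd

def group_order(labels, n_groups):
    """Counting sort: row positions that make each group contiguous, in O(n)."""
    counts = np.bincount(labels, minlength=n_groups)
    # starts[g] is where group g begins in the reordered data.
    starts = np.zeros(n_groups, dtype=np.int64)
    starts[1:] = np.cumsum(counts)[:-1]
    order = np.empty(len(labels), dtype=np.int64)
    next_slot = starts.copy()
    for i, g in enumerate(labels):  # plain loop for clarity; slow in Python
        order[next_slot[g]] = i
        next_slot[g] += 1
    return order, starts

keys = np.array(["b", "a", "b", "c", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Step 1 (the expensive part): map arbitrary objects to integer group labels.
labels, uniques = pd.factorize(keys)

# Step 2: reorder the values so that each group is contiguous.
order, starts = group_order(labels, len(uniques))
sorted_values = values[order]

# Step 3: aggregate each contiguous run in one vectorized call.
sums = np.add.reduceat(sorted_values, starts)
```

`pd.factorize` numbers the groups in order of first appearance, so `sums` lines up with `uniques`.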


Demo, time permitting

