SciPy 2011 pandas lightning talk

What’s new and awesome
in pandas

pandas?
In [13]: foo
Out[13]:
   methyl1  age     edu        something  indic
0  38.36    30to39  geCollege  1          False
1  37.85    lt30    geCollege  1          False
2  38.57    30to39  geCollege  1          False
3  39.75    30to39  geCollege  1          True
4  43.83    30to39  geCollege  1          True
5  39.08    30to39  ltHS       1          True

Size-mutable “labeled arrays” that
can handle heterogeneous data

Kinda like a structured array??

•  Automatic data alignment with lots of
reshaping and indexing methods

•  Implicit and explicit handling of missing
data

•  Easy time series functionality
–  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations
(group by, join, etc.)
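
A minimal sketch of three of the bullets above (label alignment, missing data, group by), assuming a current pandas release. The data and the variable names are made up for illustration; only the column names edu and methyl1 come from the frame shown earlier.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels; a label missing from either side gives NaN.
total = s1 + s2

# Missing data can then be handled explicitly, here by filling with zero.
filled = total.fillna(0.0)

# SQL-like group by on a small frame with mixed column types.
df = pd.DataFrame({
    "edu": ["geCollege", "geCollege", "ltHS", "ltHS"],
    "methyl1": [38.0, 40.0, 39.0, 41.0],
})
means = df.groupby("edu")["methyl1"].mean()
```

Here total carries the union of both indexes, so nothing is silently dropped when the labels do not match.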

pandas?
•  Extremely good for financial data
–  StackOverflow: “this is a beast of a financial
analysis tool”

•  One of the better relational data
munging tools in any language?

•  But also has maybe 60+% of what R
users expect when they come to
Python

1. Heavily redesigned
internals
•  Merged old DataFrame and DataMatrix
into a single DataFrame: retain
optimal performance where possible

•  Internal BlockManager class manages
homogeneous ndarrays for optimal
performance and reshaping

1. Heavily redesigned
internals
•  Better handling of missing data for
non-floating point dtypes

•  Soon: DataFrame variant with N-dim
“hyperslabs”
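
To make the missing-data bullet concrete, here is a small sketch of what happens to integer data when a label is missing. It assumes a current pandas release; whether the 2011 internals behaved identically is an assumption, and the labels are made up.

```python
import pandas as pd

# An integer Series with three labels (made-up data).
ints = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reindexing to a label that does not exist introduces a missing value.
# Integers cannot hold NaN, so the result is upcast to float64.
reindexed = ints.reindex(["a", "b", "z"])
```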

2. Fancier indexing
Mix boolean / integer / label /
slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']

Setting works too

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
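
The .ix indexer shown above was removed from later pandas releases. A rough equivalent of the same four operations, assuming a current pandas release and using .loc (labels) and .iloc (positions), might look like this. The frame, its dates, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: six daily rows, columns A through F.
dates = pd.date_range("2011-01-03", periods=6, freq="D")
df = pd.DataFrame(
    np.arange(36, dtype=float).reshape(6, 6) - 10.0,
    index=dates,
    columns=list("ABCDEF"),
)

first_row = df.iloc[0]                      # integer position
window = df.loc["2011-01-04":"2011-01-06"]  # label slice, end point included
block = df.iloc[:5].loc[:, "A":"F"]         # positional rows, label columns

# Setting through a boolean row mask and a list of column labels.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Splitting the work between .loc and .iloc makes it explicit whether a slice is by label or by position, which .ix had to guess.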

3. More robust IO
data_frame = read_csv('mydata.csv')

data_frame2 = read_table('mydata.txt', sep='\t',
                         skiprows=[1,2],
                         na_values=['#N/A NA'])

store = HDFStore('pytables.h5')
store['a'] = data_frame
store['b'] = data_frame2
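
A self-contained sketch of the text-parsing options above, assuming a current pandas release and a tab separator. It reads from an in-memory string instead of a file, and the column names and values are made up. HDFStore is left out because it needs the optional PyTables dependency.

```python
import io

import pandas as pd

# Made-up tab-separated text with one junk line and two missing markers.
text = (
    "name\tvalue\n"
    "junk line\n"
    "a\t1.5\n"
    "b\t#N/A\n"
    "c\tNA\n"
)

# skiprows uses 0-based line numbers, so [1] drops the junk line
# while keeping the header on line 0.
frame = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    skiprows=[1],
    na_values=["#N/A", "NA"],
)
```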

4. Better pivoting / reshaping

   foo  bar  A        B        C
0  one  a    -0.0524  1.664    1.171
1  one  b    0.2514   0.8306   -1.396
2  one  c    0.1256   0.3897   0.5227
3  one  d    -0.9301  0.6513   -0.2313
4  one  e    2.037    1.938    -0.3454
5  two  a    0.2073   0.7857   0.9051
6  two  b    -1.032   -0.8615  1.028
7  two  c    -0.7319  -1.846   0.9294
8  two  d    0.1004   -1.19    0.6043
9  two  e    -1.008   -0.3339  0.09522


In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
   one     two
a  1.664   0.7857
b  0.8306  -0.8615
c  0.3897  -1.846
d  0.6513  -1.19
e  1.938   -0.3339

In [31]: pivoted.major_xs('a')
Out[31]:
     A        B       C
one  -0.0524  1.664   1.171
two  0.2073   0.7857  0.9051

In [32]: pivoted.minor_xs('one')
Out[32]:
   A        B       C
a  -0.0524  1.664   1.171
b  0.2514   0.8306  -1.396
c  0.1256   0.3897  0.5227
d  -0.9301  0.6513  -0.2313
e  2.037    1.938   -0.3454
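
The session above can be approximated in a current pandas release, where pivot without a values argument returns a DataFrame with hierarchical columns and the major_xs / minor_xs methods are no longer available. This is a sketch under that assumption, with a smaller made-up frame; xs and unstack stand in for the two cross-section calls.

```python
import pandas as pd

# Made-up long-format frame: two "foo" groups, three "bar" labels each.
df = pd.DataFrame({
    "foo": ["one"] * 3 + ["two"] * 3,
    "bar": ["a", "b", "c"] * 2,
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Rows become "bar" labels; columns become (value column, foo) pairs.
pivoted = df.pivot(index="bar", columns="foo")

# One value column as a bar-by-foo table.
b_table = pivoted["B"]

# Cross-section for row label "a", reshaped to foo rows by value columns.
row_a = pivoted.xs("a").unstack(level=0)

# Cross-section for foo == "one" across all value columns.
col_one = pivoted.xs("one", axis=1, level="foo")
```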



5. Some other things
•  “Sparse” (mostly NA) versions of
data structures
•  Time zone support in DateRange
•  Generic moving window function
rolling_apply
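
A sketch of the generic moving window idea, assuming a current pandas release, where the same functionality is reached through the rolling method instead of a top-level rolling_apply function. The data and the helper name value_range are made up for illustration.

```python
import pandas as pd

# Made-up series of five observations.
s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])


def value_range(window):
    """Spread (max minus min) of one window, passed in as a NumPy array."""
    return window.max() - window.min()


# Apply the custom function over a sliding window of three observations.
# The first two results are NaN because the window is not yet full.
spread = s.rolling(window=3).apply(value_range, raw=True)
```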

Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels

Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com
