What's new and awesome in pandas
pandas?
In [13]: foo
Out[13]:
    methyl1    age      edu         something   indic
0   38.36      30to39   geCollege   1           False
1   37.85      lt30     geCollege   1           False
2   38.57      30to39   geCollege   1           False
3   39.75      30to39   geCollege   1           True
4   43.83      30to39   geCollege   1           True
5   39.08      30to39   ltHS        1           True


  Size-mutable “labeled arrays” that
     can handle heterogeneous data
Kinda like a structured array??

•  Automatic data alignment with lots of
   reshaping and indexing methods

•  Implicit and explicit handling of missing
   data

•  Easy time series functionality
    –  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations
   (group by, join, etc.)
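
The bullets above can be sketched in a few lines. This is only an illustration written against the present-day pandas API, not the 2011 release the slides describe; all data and variable names here are made up for the example.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Automatic alignment: arithmetic matches on labels, and labels present
# on only one side come out as NaN.
total = s1 + s2

# Missing data: reductions skip NaN implicitly; fillna handles it explicitly.
mean_skipping_na = total.mean()
filled = total.fillna(0.0)

# SQL-like operations: group by and join on a small hypothetical table.
people = pd.DataFrame({
    "name": ["ann", "bob", "cy"],
    "age": ["lt30", "30to39", "30to39"],
    "methyl": [37.85, 38.36, 39.75],
})
by_age = people.groupby("age")["methyl"].mean()
labels = pd.DataFrame({"age": ["lt30", "30to39"],
                       "label": ["under 30", "30 to 39"]})
joined = people.merge(labels, on="age", how="left")
```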
pandas?
•  Extremely good for financial data
  –  StackOverflow: “this is a beast of a financial
     analysis tool”



•  One of the better relational data
   munging tools in any language?

•  But also has maybe 60+% of what R
   users expect when they come to
   Python
1. Heavily redesigned internals
•  Merged old DataFrame and DataMatrix
   into a single DataFrame: retain
   optimal performance where possible

•  Internal BlockManager class manages
   homogeneous ndarrays for optimal
   performance and reshaping
1. Heavily redesigned internals
•  Better handling of missing data for
   non-floating point dtypes

•  Soon: DataFrame variant with N-dim
   “hyperslabs”
2. Fancier indexing
Mix boolean / integer / label /
slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']


Setting works too

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
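
The `.ix` indexer shown on this slide was later deprecated and removed from pandas. The sketch below is a rough modern equivalent that uses `.iloc` for positions and `.loc` for labels and boolean masks; the frame, its dates, and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x6 frame indexed by date, with columns A..F.
dates = pd.date_range("2011-07-01", periods=6, freq="D")
values = np.arange(36, dtype=float).reshape(6, 6) - 10.0
df = pd.DataFrame(values, index=dates, columns=list("ABCDEF"))

first_row = df.iloc[0]                       # integer position
window = df.loc["2011-07-02":"2011-07-04"]   # label slice, both ends included
block = df.iloc[:5].loc[:, "A":"F"]          # first five rows, columns A through F

# Setting works too: boolean row mask plus a list of column labels.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Splitting positional and label-based access across two indexers removes the ambiguity `.ix` had when an index held integers.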
3. More robust IO
from pandas import read_csv, read_table, HDFStore

data_frame = read_csv('mydata.csv')

data_frame2 = read_table('mydata.txt', sep='\t',
                         skiprows=[1,2],
                         na_values=['#N/A NA'])



store = HDFStore('pytables.h5')
store['a'] = data_frame
store['b'] = data_frame2
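
To try the parsing options without files on disk, the same keywords can be applied to an in-memory buffer. This is a minimal sketch against current pandas: the text content is invented, and passing `'#N/A'` and `'NA'` as two separate markers is an assumption about the intent of the slide. `HDFStore` is left out because it needs the optional PyTables dependency.

```python
import io
import pandas as pd

# Hypothetical tab-separated text: a header, two junk lines, three data rows.
raw = (
    "name\tvalue\n"
    "skip me\t0\n"
    "skip me too\t0\n"
    "x\t1.5\n"
    "y\t#N/A\n"
    "z\tNA\n"
)

# skiprows counts lines from 0, so [1, 2] drops the two junk lines
# and keeps line 0 as the header.
frame = pd.read_csv(io.StringIO(raw), sep="\t",
                    skiprows=[1, 2],
                    na_values=["#N/A", "NA"])
```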
4. Better pivoting / reshaping

    foo   bar    A         B         C
0   one   a     -0.0524    1.664     1.171
1   one   b      0.2514    0.8306   -1.396
2   one   c      0.1256    0.3897    0.5227
3   one   d     -0.9301    0.6513   -0.2313
4   one   e      2.037     1.938    -0.3454
5   two   a      0.2073    0.7857    0.9051
6   two   b     -1.032    -0.8615    1.028
7   two   c     -0.7319   -1.846     0.9294
8   two   d      0.1004   -1.19      0.6043
9   two   e     -1.008    -0.3339    0.09522
4. Better pivoting / reshaping

In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
    one      two
a   1.664    0.7857
b   0.8306  -0.8615
c   0.3897  -1.846
d   0.6513  -1.19
e   1.938   -0.3339
4. Better pivoting / reshaping

In [31]: pivoted.major_xs('a')
Out[31]:
      A         B        C
one  -0.0524    1.664    1.171
two   0.2073    0.7857   0.9051


In [32]: pivoted.minor_xs('one')
Out[32]:
    A         B        C
a  -0.0524    1.664    1.171
b   0.2514    0.8306  -1.396
c   0.1256    0.3897   0.5227
d  -0.9301    0.6513  -0.2313
e   2.037     1.938   -0.3454
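
For readers on a current pandas release, the pivot and both cross-sections can be approximated as below. This is a sketch, not the 2011 API: `major_xs` and `minor_xs` belonged to the panel-style object that `pivot` returned at the time, and that object no longer exists. It assumes the `bar` column runs a through e within each `foo` group, as the pivoted output implies.

```python
import pandas as pd

# The example table from the slides (bar assumed to be a..e per foo group).
df = pd.DataFrame({
    "foo": ["one"] * 5 + ["two"] * 5,
    "bar": list("abcde") * 2,
    "A": [-0.0524, 0.2514, 0.1256, -0.9301, 2.037,
          0.2073, -1.032, -0.7319, 0.1004, -1.008],
    "B": [1.664, 0.8306, 0.3897, 0.6513, 1.938,
          0.7857, -0.8615, -1.846, -1.19, -0.3339],
    "C": [1.171, -1.396, 0.5227, -0.2313, -0.3454,
          0.9051, 1.028, 0.9294, 0.6043, 0.09522],
})

# Today pivot returns a DataFrame whose columns are a two-level index:
# (value column, foo).
pivoted = df.pivot(index="bar", columns="foo")
b_table = pivoted["B"]              # rows a..e, columns one/two

# Rough stand-in for major_xs('a'): one bar label, reshaped to foo x value.
row_a = pivoted.loc["a"].unstack(level=0)

# Rough stand-in for minor_xs('one'): one foo label across all value columns.
col_one = pivoted.xs("one", axis=1, level="foo")
```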
5. Some other things
•  “Sparse” (mostly NA) versions of
   data structures
•  Time zone support in DateRange
•  Generic moving window function
   rolling_apply
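
As an illustration of the generic moving-window idea, here is a small sketch using the current spelling, `Series.rolling(...).apply(...)`, which replaced the top-level `rolling_apply` function in later pandas releases. The series and the `value_range` helper are hypothetical.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])  # made-up data

def value_range(window):
    """Spread between the largest and smallest value in one window."""
    return window.max() - window.min()

# raw=True hands each window to the function as a NumPy array.
# The first two results are NaN because their windows are incomplete.
rolled = prices.rolling(window=3).apply(value_range, raw=True)
```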
Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels
Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com

SciPy 2011 pandas lightning talk
