Pandas

                                Maik Röder
                         Python Barcelona Meetup
                             7. February 2013

                           Python Consultant
                         maikroeder@gmail.com


Friday, February 8, 13
Pandas
                         • Powerful and productive Python data
                           analysis and management library
                         • Panel Data System
                         • Open Sourced by AQR Capital
                           Management, LLC in late 2009
                         • 30,000 lines of tested Python/Cython code
                         • Used in production in many companies
Pandas
                         • Rich data structures and functions to make
                           working with structured data fast, easy, and
                           expressive
                         • Built on top of NumPy with its high
                           performance array-computing features
                         • Flexible data manipulation capabilities of
                           spreadsheets and relational databases
                         • Sophisticated indexing functionality
                           • Slice, dice, perform aggregations, select
                             subsets of data
The ideal tool for data scientists
                         • Munging data
                         • Cleaning data
                         • Analyzing data
                         • Modeling data
                         • Organizing the results of the analysis into a
                           form suitable for plotting or tabular display


Series
                 • one-dimensional array-like object
                         >>> s = Series((1,2,3,4,5))
                 • Contains an array of data (of any NumPy
                         data type)
                         >>> s.values
                 • Has an associated array of data labels, the
                         index (Default index from 0 to N - 1)
                         >>> s.index
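
The three bullets above can be tried out in a few lines. This is a minimal sketch, assuming a current pandas installation imported as `pd` (the slides use `from pandas import *` instead); the variable names `values` and `index_labels` are illustrative only.

```python
import pandas as pd

# A Series built from a plain tuple gets the default index 0 .. N-1.
s = pd.Series((1, 2, 3, 4, 5))

# .values exposes the underlying NumPy array of data.
values = s.values

# .index holds the labels; converted to a list here for easy inspection.
index_labels = list(s.index)

print(values)
print(index_labels)
```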
Series data structure
      >>> import numpy
      >>> randn = numpy.random.randn
      >>> from pandas import *
      >>> s = Series(randn(3), index=('a','b','c'))
      >>> s
      a    -0.889880
      b     1.102135
      c    -2.187296
      >>> s.mean()
      -0.65834710697853194

Series to/from dict
          • Series to Python dict - key order is not guaranteed (before Python 3.7)
          >>> dict(s)
          {'a': -0.88988001423312313,
           'c': -2.1872960440695666,
           'b': 1.1021347373670938}
          • Back to a Series with a new Index from sorted
                   dictionary keys
          >>> Series(dict(s))
          a   -0.889880
          b    1.102135
          c   -2.187296
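
The round trip between a Series and a dict might look like the sketch below. It assumes pandas imported as `pd` and uses `Series.to_dict()` rather than `dict(s)`; the sample values are made up for illustration. Because key order in the rebuilt Series depends on the Python and pandas versions in use, the check compares values by label rather than by position.

```python
import pandas as pd

# Illustrative data, deliberately not in alphabetical label order.
s = pd.Series([-0.5, 1.0, -2.0], index=['c', 'a', 'b'])

# Series -> dict: labels become keys, data become values.
as_dict = s.to_dict()

# dict -> Series: keys become the new index.
back = pd.Series(as_dict)

# Compare by label, so the result does not depend on ordering.
same_values = all(back[label] == s[label] for label in s.index)
print(same_values)
```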
Reindexing labels
                 >>> s
                 a   -0.496848
                 b    0.607173
                 c   -1.570596
                 >>> s.index
                 Index([a, b, c], dtype=object)
                 >>> s.reindex(['c','b','a'])
                 c   -1.570596
                 b    0.607173
                 a   -0.496848
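
One detail the slide does not show is what happens when the new index contains a label the Series lacks. The sketch below assumes pandas imported as `pd`; the label `'d'` and the sample values are hypothetical. It also illustrates that `reindex` returns a new object and leaves the original untouched.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Reorder the labels and ask for one ('d') that does not exist in s.
r = s.reindex(['c', 'b', 'a', 'd'])

# Existing labels keep their values; the missing label is filled with NaN.
print(r)

# The original Series still has its original order.
print(list(s.index))
```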
Vectorization
                 >>> s + s
                 a   -1.779760
                 b    2.204269
                 c   -4.374592
                 • Series work with NumPy functions
                 >>> numpy.exp(s)
                 a       0.410705
                 b       3.010586
                 c       0.112220
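
Vectorized arithmetic between two Series matches elements by label, not by position. The sketch below assumes pandas as `pd` and NumPy as `np`; the two Series and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
# Same labels as s1, but stored in a different order.
s2 = pd.Series([10.0, 20.0, 30.0], index=['c', 'b', 'a'])

# Addition aligns on labels: a = 1 + 30, b = 2 + 20, c = 3 + 10.
total = s1 + s2

# A NumPy ufunc applied to a Series returns a Series with the same index.
exponentials = np.exp(s1)

print(total)
print(exponentials)
```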
DataFrame
                         • Like data.frame in the statistical
                           language/package R
                         • 2-dimensional tabular data structure
                         • Data manipulation with integrated
                           indexing
                         • Supports heterogeneous column types
                           across the table
                         • Each column is itself homogeneous
                           (a single data type)
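
The difference between heterogeneous tables and homogeneous columns can be seen by inspecting `dtypes`. This is a sketch assuming pandas imported as `pd`; the column names and values are hypothetical.

```python
import pandas as pd

# Three columns with three different kinds of data.
df = pd.DataFrame({
    'name': ['x', 'y', 'z'],      # text
    'count': [1, 2, 3],           # integers
    'score': [0.5, 1.5, 2.5],     # floats
})

# One dtype per column: the table is mixed, each column is uniform.
dtypes = df.dtypes
print(dtypes)
```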
DataFrame
                     >>> d = {'one': s*s, 'two': s+s}
                     >>> df = DataFrame(d)
                     >>> df
                             one       two
                     a  0.791886 -1.779760
                     b  1.214701  2.204269
                     c  4.784264 -4.374592
                     >>> df.index
                     Index([a, b, c], dtype=object)
                     >>> df.columns
                     Index([one, two], dtype=object)

DataFrame: add a column
                 • Add a third column
                 >>> df['three'] = s * 3
                 • It will share the existing index
                 >>> df
                         one       two     three
                 a  0.791886 -1.779760 -2.669640
                 b  1.214701  2.204269  3.306404
                 c  4.784264 -4.374592 -6.561888
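
"Sharing the existing index" means the new column is aligned to the row labels already in the frame. The sketch below assumes pandas imported as `pd`; the frame and the partial Series are made-up examples, chosen so that one row label has no matching value.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# The new column comes from a Series that is out of order and lacks 'b'.
df['three'] = pd.Series([30.0, 10.0], index=['c', 'a'])

# Values land on the matching row labels; 'b' has no match and becomes NaN.
print(df)
```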

Access to columns

                 • Access by attribute
                 >>> df.one
                 a    0.791886
                 b    1.214701
                 c    4.784264
                 Name: one
                 • Access by dict-like notation
                 >>> df['one']
                 a    0.791886
                 b    1.214701
                 c    4.784264
                 Name: one
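
The two access styles are interchangeable only when the column name does not clash with an existing DataFrame attribute. The sketch below assumes pandas imported as `pd`; the column named `count` is a hypothetical example picked because `DataFrame.count` is already a method.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'count': [3, 4]})

# For an ordinary name, attribute and bracket access return equal Series.
same = df.one.equals(df['one'])

# 'count' collides with the DataFrame.count method, so the attribute
# form yields the method, not the column.
is_method = callable(df.count)

# Bracket notation always reaches the column.
col = df['count']

print(same, is_method, col.tolist())
```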


Reindexing

                 >>> df = df.reindex(['c','b','a'])
                 >>> df
                         one       two     three
                 c  4.784264 -4.374592 -6.561888
                 b  1.214701  2.204269  3.306404
                 a  0.791886 -1.779760 -2.669640



Drop entries from an axis

            >>> df.drop('c')
                    one       two     three
            b  1.214701  2.204269  3.306404
            a  0.791886 -1.779760 -2.669640
            >>> df.drop(['b','a'])
                    one       two     three
            c  4.784264 -4.374592 -6.561888
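
The slide drops rows; the same method can drop columns by naming the other axis. A small sketch, assuming pandas imported as `pd` and an invented two-column frame. Each call returns a new DataFrame and leaves `df` unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
    index=['a', 'b', 'c'],
)

# Drop a single row label (rows are the default axis).
no_c = df.drop('c')

# Drop several row labels at once.
only_c = df.drop(['a', 'b'])

# Drop a column by passing axis=1.
no_two = df.drop('two', axis=1)

print(no_c, only_c, no_two, sep='\n\n')
```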


Descriptive statistics
                 >>> df.mean()
                 one      2.263617
                 two     -1.316694
                 three   -1.975041
                 • Also: count, sum, median, min,
                         max, abs, prod, std, var,
                         skew, kurt, quantile, cumsum,
                         cumprod, cummax, cummin
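
A few of the statistics listed above, on a frame small enough to check by hand. This sketch assumes pandas imported as `pd`; the data are invented so that the results are round numbers.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [10.0, 20.0, 60.0]})

# Reductions run down each column by default.
col_means = df.mean()          # one -> 2.0, two -> 30.0

# axis=1 reduces across each row instead.
row_means = df.mean(axis=1)    # 5.5, 11.0, 31.5

# Cumulative methods keep the original length.
running = df['one'].cumsum()   # 1.0, 3.0, 6.0

print(col_means, row_means, running, sep='\n\n')
```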


Computational Tools
                 • Covariance
                         >>> s1 = Series(randn(1000))
                         >>> s2 = Series(randn(1000))
                         >>> s1.cov(s2)
                         0.013973709323221539
                 • Also correlation (corr): pearson, kendall, spearman
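
Covariance and correlation are easier to sanity-check on deterministic data than on random draws. The sketch below assumes pandas imported as `pd`; `s2` is constructed as an exact linear function of `s1`, so the expected results follow from the definitions. Only the default Pearson method is shown; `corr` also accepts `method='kendall'` or `method='spearman'`, which may require SciPy depending on the pandas version.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0, 4.0])
# An exact linear transform of s1 (illustrative choice).
s2 = 2 * s1 + 1

# cov(s1, 2*s1 + 1) equals 2 * var(s1).
covariance = s1.cov(s2)

# A perfect positive linear relationship gives a Pearson correlation of 1.
pearson = s1.corr(s2)

print(covariance, pearson)
```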

This and much more...
                         • Group by: split-apply-combine
                         • Merge, join and aggregate
                         • Reshaping and Pivot Tables
                         • Time Series / Date functionality
                         • Plotting with matplotlib
                         • IO Tools (Text, CSV, HDF5, ...)
                         • Sparse data structures
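
As a taste of the first item in the list, split-apply-combine with `groupby` might look like the sketch below. It assumes pandas imported as `pd`; the column names `key` and `value` and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b'],
    'value': [1, 2, 3, 4],
})

# Split rows by 'key', apply sum to each group's 'value',
# and combine the results into one Series indexed by key.
sums = df.groupby('key')['value'].sum()

print(sums)
```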
Resources


                         • http://pypi.python.org/pypi/pandas
                         • http://code.google.com/p/pandas


Out now...




