Python for Data Analysis
Andrew Henshaw
Georgia Tech Research Institute
• PowerPoint
• IPython (ipython --pylab=inline)
• Custom bridge (ipython2powerpoint)
• Python for Data Analysis
• Wes McKinney
• Lead developer of pandas
• Quantitative Financial Analyst
• Python library to provide data analysis features, similar to:
• R
• MATLAB
• SAS
• Built on NumPy, SciPy, and to some extent, matplotlib
• Key components provided by pandas:
• Series
• DataFrame
• One-dimensional array-like object containing data and labels (or index)
• Lots of ways to build a Series
>>> import pandas as pd
>>> s = pd.Series(list('abcdef'))
>>> s
0 a
1 b
2 c
3 d
4 e
5 f
>>> s = pd.Series([2, 4, 6, 8])
>>> s
0 2
1 4
2 6
3 8
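A minimal sketch of a few of the "lots of ways" to build a Series. The variable names and sample values are illustrative assumptions, not taken from the talk.

```python
import numpy as np
import pandas as pd

# From a list: a default integer index 0..n-1 is generated.
from_list = pd.Series([2, 4, 6, 8])

# From a dict: the keys become the index labels.
from_dict = pd.Series({'b': 100, 'c': 150, 'd': 200})

# From a NumPy array with an explicit index.
from_array = pd.Series(np.arange(3), index=['x', 'y', 'z'])

# From a scalar, repeated for every label in the index.
from_scalar = pd.Series(1.5, index=['p', 'q'])

print(from_list.index.tolist())   # [0, 1, 2, 3]
print(from_dict.index.tolist())   # ['b', 'c', 'd']
```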
• A Series index can be specified
• Single values can be selected by index
• Multiple values can be selected with multiple indexes
>>> s = pd.Series([2, 4, 6, 8],
...               index=['f', 'a', 'c', 'e'])
>>> s
f 2
a 4
c 6
e 8
>>> s['a']
4
>>> s[['a', 'c']]
a 4
c 6
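The same selections can be written with the explicit `loc` (label) and `iloc` (position) accessors, which avoids ambiguity between labels and positions. This is a sketch, not part of the original slides.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['f', 'a', 'c', 'e'])

by_label = s.loc['a']            # single value by label
by_labels = s.loc[['a', 'c']]    # several values by a list of labels
by_position = s.iloc[0]          # first value, regardless of its label
label_slice = s.loc['a':'c']     # label slices include both endpoints

print(by_label, by_position)     # 4 2
```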
• Think of a Series as a fixed-length, ordered dict
• However, unlike a dict, index items don't have to be unique
>>> s2 = pd.Series(range(4),
...                index=list('abab'))
>>> s2
a 0
b 1
a 2
b 3
>>> s['a']
4
>>> s2['a']
a 0
a 2
>>> s2['a'].iloc[0]
0
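A short sketch of working with a non-unique index. Checking `is_unique` and aggregating with `groupby(level=0)` are suggestions added here, not steps shown in the talk.

```python
import pandas as pd

s2 = pd.Series(range(4), index=list('abab'))

# A duplicated label returns a Series rather than a scalar.
unique = s2.index.is_unique           # False
matches = s2.loc['a']                 # two rows, both labeled 'a'
first_match = s2.loc['a'].iloc[0]     # first of the duplicates, by position

# One way to collapse duplicates: aggregate per label.
totals = s2.groupby(level=0).sum()    # a -> 0 + 2, b -> 1 + 3

print(unique, first_match, totals.tolist())   # False 0 [2, 4]
```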
• Filtering
• NumPy-type operations on data
>>> s
f 2
a 4
c 6
e 8
>>> s[s > 4]
c 6
e 8
>>> s>4
f False
a False
c True
e True
>>> s*2
f 4
a 8
c 12
e 16
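The filtering above extends to combined conditions and to NumPy functions applied directly to a Series. A minimal sketch, with the thresholds chosen only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['f', 'a', 'c', 'e'])

mask = s > 4                       # boolean Series with the same index
filtered = s[mask]                 # keeps only the rows where mask is True

# Combine conditions with & and |, wrapping each comparison in parentheses.
combined = s[(s > 2) & (s < 8)]

# NumPy ufuncs return a Series and keep the index labels.
roots = np.sqrt(s)

print(filtered.index.tolist())     # ['c', 'e']
print(combined.tolist())           # [4, 6]
```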
• pandas can accommodate incomplete data
>>> sdata = {'b':100, 'c':150, 'd':200}
>>> s = pd.Series(sdata)
>>> s
b 100
c 150
d 200
>>> s = pd.Series(sdata, list('abcd'))
>>> s
a NaN
b 100
c 150
d 200
>>> s*2
a NaN
b 200
c 300
d 400
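Once a Series contains NaN, the usual next step is to detect, drop, or fill the gaps. The sketch below uses `isna`, `dropna`, and `fillna`; the fill value of 0 is an arbitrary assumption.

```python
import pandas as pd

sdata = {'b': 100, 'c': 150, 'd': 200}
s = pd.Series(sdata, index=list('abcd'))   # 'a' has no data, so it becomes NaN

missing = s.isna()        # True where the value is NaN
dropped = s.dropna()      # new Series without the NaN rows
filled = s.fillna(0)      # new Series with NaN replaced by 0
total = s.sum()           # reductions skip NaN by default

print(missing.tolist())   # [True, False, False, False]
print(total)              # 450.0
```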
• Unlike in a NumPy ndarray, data is automatically aligned
>>> s2 = pd.Series([1, 2, 3],
...                index=['c', 'b', 'a'])
>>> s2
c 1
b 2
a 3
>>> s
a NaN
b 100
c 150
d 200
>>> s*s2
a NaN
b 200
c 150
d NaN
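Alignment produces NaN wherever a label is missing on either side. If that is not wanted, the method form `mul` accepts a `fill_value`; using 1 as the neutral value here is an assumption made for this sketch.

```python
import pandas as pd

s = pd.Series({'b': 100, 'c': 150, 'd': 200}, index=list('abcd'))
s2 = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Operator form: values are matched by label, not by position.
product = s * s2

# Method form: a value missing on one side is treated as 1 before multiplying.
product_filled = s.mul(s2, fill_value=1)

print(product.isna().tolist())    # [True, False, False, True]
print(product_filled.tolist())    # [3.0, 200.0, 150.0, 200.0]
```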
• Spreadsheet-like data structure containing an ordered collection of columns
• Has both a row and column index
• Consider as dict of Series (with shared index)
>>> data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
...         'year': [2010, 2011, 2008, 2010, 2011],
...         'pop': [18.8, 19.1, 9.7, 9.7, 9.8]}
>>> frame = pd.DataFrame(data)
>>> frame
pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011
>>> pop_data = {'FL': {2010: 18.8, 2011: 19.1},
...             'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}}
>>> pop = pd.DataFrame(pop_data)
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
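The "dict of Series with a shared index" view can be made literal by building the same table from two Series. A sketch; the names `fl` and `ga` are illustrative, and `sort_index` is added so the row order does not depend on the pandas version.

```python
import pandas as pd

fl = pd.Series({2010: 18.8, 2011: 19.1})
ga = pd.Series({2008: 9.7, 2010: 9.7, 2011: 9.8})

# Each Series becomes a column; the row index is the union of both indexes.
pop = pd.DataFrame({'FL': fl, 'GA': ga}).sort_index()

# FL has no entry for 2008, so that cell is NaN.
print(pop.columns.tolist())           # ['FL', 'GA']
print(pd.isna(pop.loc[2008, 'FL']))   # True
```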
• Columns can be retrieved as a Series
• dict notation
• attribute notation
• Rows can be retrieved by position or by name (using the ix attribute, which current pandas replaces with iloc and loc)
>>> frame['state']
0 FL
1 FL
2 GA
3 GA
4 GA
Name: state
>>> frame.describe
<bound method DataFrame.describe
of pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011>
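A sketch of both column notations and of row retrieval, assuming a pandas version where `loc` and `iloc` are the row accessors. It also shows why attribute notation needs care: a column name that matches a DataFrame method (here `pop`) resolves to the method, just as `frame.describe` does above.

```python
import pandas as pd

data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
        'year': [2010, 2011, 2008, 2010, 2011],
        'pop': [18.8, 19.1, 9.7, 9.7, 9.8]}
frame = pd.DataFrame(data)

by_key = frame['state']       # dict notation, always works
by_attr = frame.state         # attribute notation, same Series

# 'pop' is also a DataFrame method, so frame.pop is the method, not the column.
pop_is_method = callable(frame.pop)

row_by_label = frame.loc[2]    # row whose index label is 2
row_by_pos = frame.iloc[-1]    # last row by position

print(pop_is_method)                                   # True
print(row_by_label['state'], row_by_label['year'])     # GA 2008
```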
• New columns can be added (by computation or direct assignment)
>>> frame['other'] = np.nan
>>> frame
pop state year other
0 18.8 FL 2010 NaN
1 19.1 FL 2011 NaN
2 9.7 GA 2008 NaN
3 9.7 GA 2010 NaN
4 9.8 GA 2011 NaN
>>> frame['calc'] = frame['pop'] * 2
>>> frame
pop state year other calc
0 18.8 FL 2010 NaN 37.6
1 19.1 FL 2011 NaN 38.2
2 9.7 GA 2008 NaN 19.4
3 9.7 GA 2010 NaN 19.4
4 9.8 GA 2011 NaN 19.6
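A compact sketch of adding and removing columns. The two-row frame and the column names `is_ga` and `half` are made up for illustration; `assign` is shown as a non-mutating alternative to direct assignment.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'state': ['FL', 'GA'], 'pop': [18.8, 9.7]})

frame['other'] = np.nan                    # a scalar is repeated for every row
frame['calc'] = frame['pop'] * 2           # computed from an existing column
frame['is_ga'] = frame['state'] == 'GA'    # boolean column from a comparison

# assign returns a new DataFrame and leaves the original untouched.
with_half = frame.assign(half=frame['pop'] / 2)

del frame['other']                         # columns can be deleted like dict keys

print(frame.columns.tolist())   # ['state', 'pop', 'calc', 'is_ga']
```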
>>> obj = pd.Series(['blue', 'purple', 'red'],
...                 index=[0, 2, 4])
>>> obj
0 blue
2 purple
4 red
>>> obj.reindex(range(4))
0 blue
1 NaN
2 purple
3 NaN
>>> obj.reindex(range(5), fill_value='black')
0 blue
1 black
2 purple
3 black
4 red
>>> obj.reindex(range(5), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 red
• Reindexing: creation of a new object with the data conformed to a new index
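A sketch confirming that `reindex` returns a new object and leaves the original alone. The `bfill` variant is an addition not shown on the slide.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'red'], index=[0, 2, 4])

padded = obj.reindex(range(5))                        # new labels get NaN
filled = obj.reindex(range(5), fill_value='black')    # new labels get a constant
forward = obj.reindex(range(5), method='ffill')       # carry the last value forward
backward = obj.reindex(range(5), method='bfill')      # pull the next value backward

print(obj.index.tolist())    # [0, 2, 4]  (original is unchanged)
print(backward.tolist())     # ['blue', 'purple', 'purple', 'red', 'red']
```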
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop.sum()
FL 37.9
GA 29.2
>>> pop.mean()
FL 18.950000
GA 9.733333
>>> pop.describe()
FL GA
count 2.000000 3.000000
mean 18.950000 9.733333
std 0.212132 0.057735
min 18.800000 9.700000
25% 18.875000 9.700000
50% 18.950000 9.700000
75% 19.025000 9.750000
max 19.100000 9.800000
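The statistics above skip NaN by default, which is why FL has a count of 2. A sketch of the related options (`axis`, `skipna`, `count`); `sort_index` is added here only to fix the row order across pandas versions.

```python
import pandas as pd

pop_data = {'FL': {2010: 18.8, 2011: 19.1},
            'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}}
pop = pd.DataFrame(pop_data).sort_index()

col_sums = pop.sum()                 # per column, NaN skipped
row_sums = pop.sum(axis=1)           # per row (per year)
strict = pop.sum(skipna=False)       # NaN propagates instead of being skipped
counts = pop.count()                 # number of non-NaN values per column

print(counts.tolist())               # [2, 3]
print(pd.isna(strict['FL']))         # True
```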
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop < 9.8
FL GA
2008 False True
2010 False True
2011 False False
>>> pop[pop < 9.8] = 0
>>> pop
FL GA
2008 NaN 0.0
2010 18.8 0.0
2011 19.1 9.8
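Boolean-mask assignment changes `pop` in place. When the original should be kept, `mask` and `where` give the same kind of result as a new object. A sketch under that assumption:

```python
import pandas as pd

pop_data = {'FL': {2010: 18.8, 2011: 19.1},
            'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}}
pop = pd.DataFrame(pop_data).sort_index()

below = pop < 9.8                # comparisons with NaN evaluate to False

zeroed = pop.mask(below, 0)      # replace values where the condition is True
kept = pop.where(pop >= 9.8)     # keep values where True, NaN elsewhere

print(zeroed['GA'].tolist())     # [0.0, 0.0, 9.8]
print(pop.loc[2008, 'GA'])       # 9.7  (original is unchanged)
```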
• pandas supports several ways to handle data loading
• Text file data
• read_csv
• read_table
• Structured data (JSON, XML, HTML)
• works well with existing libraries
• Excel (depends upon xlrd and openpyxl packages)
• Database
• pandas.io.sql module (read_frame; current pandas exposes this as pd.read_sql)
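A self-contained sketch of two of the loading paths: CSV text and a database. The in-memory CSV string and SQLite table are stand-ins invented for this example, and `pd.read_sql` is used as the current pandas entry point rather than the `read_frame` named on the slide.

```python
import io
import sqlite3

import pandas as pd

# Text data: read_csv accepts a path or any file-like object.
csv_text = "total_bill,tip,day\n16.99,1.01,Sun\n10.34,1.66,Sun\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Database: build a throwaway SQLite table, then query it into a DataFrame.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tips (total_bill REAL, tip REAL, day TEXT)')
conn.executemany('INSERT INTO tips VALUES (?, ?, ?)',
                 [(16.99, 1.01, 'Sun'), (10.34, 1.66, 'Sun')])
from_sql = pd.read_sql('SELECT * FROM tips', conn)
conn.close()

print(from_csv.shape)               # (2, 3)
print(from_sql.columns.tolist())    # ['total_bill', 'tip', 'day']
```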
>>> tips = pd.read_csv('/users/ah6/Desktop/pandas talk/data/tips.csv')
>>> tips.loc[:2]
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
>>> party_counts = pd.crosstab(tips['day'], tips['size'])
>>> party_counts
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
>>> sum_by_day = party_counts.sum(1).astype(float)
>>> party_pcts = party_counts.div(sum_by_day, axis=0)
>>> party_pcts
size 1 2 3 4 5 6
day
Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000
Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000
Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158
Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387
>>> party_pcts.plot(kind='bar', stacked=True)
<matplotlib.axes.AxesSubplot at 0x6bf2ed0>
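The count-then-normalize steps can be tried without the tips file. The six-row frame below is made-up data, and `normalize='index'` is shown as a one-call alternative to the `sum` and `div` pair. Plotting is left out so the sketch runs without matplotlib.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set.
tips = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat', 'Sun'],
                     'size': [2, 2, 2, 3, 2, 4]})

# Use tips['size']: tips.size is the DataFrame's element count, not the column.
party_counts = pd.crosstab(tips['day'], tips['size'])

# Divide each row by its total so every day sums to 1.
sum_by_day = party_counts.sum(axis=1)
party_pcts = party_counts.div(sum_by_day, axis=0)

# Same result in a single call.
normalized = pd.crosstab(tips['day'], tips['size'], normalize='index')

print(party_counts.loc['Sat', 2])           # 2
print(party_pcts.sum(axis=1).tolist())      # [1.0, 1.0, 1.0]
```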
>>> tips['tip_pct'] = tips['tip'] / tips['total_bill']
>>> tips['tip_pct'].hist(bins=50)
<matplotlib.axes.AxesSubplot at 0x6c10d30>
>>> tips['tip_pct'].describe()
count 244.000000
mean 0.160803
std 0.061072
min 0.035638
25% 0.129127
50% 0.154770
75% 0.191475
max 0.710345
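A small sketch of the derived-column step, using only the three rows of the tips data shown earlier. The 0.15 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# First three rows of the tips data, as printed on the earlier slide.
tips = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01],
                     'tip': [1.01, 1.66, 3.50]})

# Element-wise division of two columns gives the tip as a fraction of the bill.
tips['tip_pct'] = tips['tip'] / tips['total_bill']

largest = tips['tip_pct'].idxmax()          # index label of the largest fraction
generous = tips[tips['tip_pct'] > 0.15]     # rows above the threshold

print(largest)                     # 2
print(generous.index.tolist())     # [1, 2]
```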
• Data Aggregation
• GroupBy
• Pivot Tables
• Time Series
• Periods/Frequencies
• Operations with Time Series with Different Frequencies
• Downsampling/Upsampling
• Plotting with TimeSeries (auto-adjust scale)
• Advanced Analysis
• Decile and Quartile Analysis
• Signal Frontier Analysis
• Future Contract Rolling
• Rolling Correlation and Linear Regression
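A brief sketch of a few of the topics listed above: GroupBy, pivot tables, downsampling, and a rolling window. All data here is invented for illustration, and the calls follow current pandas, which may differ from the version used in the talk.

```python
import pandas as pd

tips = pd.DataFrame({'day': ['Sat', 'Sat', 'Sun', 'Sun'],
                     'smoker': ['No', 'Yes', 'No', 'No'],
                     'tip': [1.0, 3.0, 2.0, 4.0]})

# GroupBy: split by day, then average the tips in each group.
mean_tip = tips.groupby('day')['tip'].mean()

# Pivot table: days as rows, smoker as columns, mean tip in each cell.
pivot = tips.pivot_table(values='tip', index='day',
                         columns='smoker', aggfunc='mean')

# Time series: four daily values downsampled into two-day bins.
ts = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.date_range('2012-01-01', periods=4, freq='D'))
two_day = ts.resample('2D').sum()

# Rolling window: mean over each pair of consecutive days.
rolling = ts.rolling(window=2).mean()

print(mean_tip.tolist())    # [2.0, 3.0]
print(two_day.tolist())     # [3.0, 7.0]
```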
