Statistics in Python Pandas Exercise
Big Data and Automated Content Analysis
Week 5 – Wednesday
»Statistics with Python«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
7 March 2018
Big Data and Automated Content Analysis Damian Trilling
Today
1 Statistics in Python
General considerations
Useful packages
2 Pandas
Working with dataframes
Plotting and calculating with Pandas
3 Exercise
Statistics in Python
General considerations
After having done all your nice text processing (and got numbers instead of text!), you probably want to analyse them further.
You can always export to .csv and use R or Stata or SPSS or whatever...
BUT:
Reasons for not exporting and analyzing somewhere else
• the dataset might be too big
• it’s cumbersome and wastes your time
• it may introduce errors and makes your analysis harder to reproduce
What statistics capabilities does Python have?
• Basically all the standard stuff (bivariate and multivariate statistics) you know from SPSS
• Some advanced stuff (e.g., time series analysis)
• However, for some fancy statistical modelling (e.g., structural equation modelling), you are better off looking elsewhere (e.g., R)
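As a hedged illustration of the "standard stuff", the sketch below runs an independent-samples t-test with scipy.stats.ttest_ind. The two groups and their values are invented for this example and are not part of the course material.

```python
from scipy import stats

# Hypothetical data: scores for two made-up groups
group_a = [4.1, 5.0, 4.8, 5.3, 4.6]
group_b = [5.9, 6.2, 5.4, 6.8, 6.1]

# Independent-samples t-test (assumes equal variances by default)
t_statistic, p_value = stats.ttest_ind(group_a, group_b)
print(t_statistic, p_value)
```

The function returns the test statistic and the two-sided p-value, so both can be stored and reused like any other number.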
Useful packages
numpy (numerical Python): provides a lot of frequently used functions, like mean, standard deviation, correlation, ...
scipy (scientific Python): more of that ;-)
statsmodels: statistical models (e.g., regression or time series)
matplotlib: plotting
seaborn: even nicer plotting
Example 1: basic numpy
import numpy as np
x = [1, 2, 3, 4, 3, 2]
y = [2, 2, 4, 3, 4, 2]
z = [9.7, 10.2, 1.2, 3.3, 2.2, 55.6]
np.mean(x)
# 2.5
np.std(x)
# 0.9574271077563381
np.corrcoef([x, y, z])
# array([[ 1.        ,  0.67883359, -0.37256219],
#        [ 0.67883359,  1.        , -0.56886529],
#        [-0.37256219, -0.56886529,  1.        ]])
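One detail worth knowing when comparing with SPSS or R: np.std() divides by n by default (the population standard deviation), whereas the sample standard deviation divides by n - 1. A minimal sketch using the ddof parameter, with the same x as in Example 1:

```python
import numpy as np

x = [1, 2, 3, 4, 3, 2]

population_sd = np.std(x)        # divides by n (default, ddof=0)
sample_sd = np.std(x, ddof=1)    # divides by n - 1
print(population_sd, sample_sd)
```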
Characteristics
• Operates (also) on simple lists
• Returns output in standard datatypes (you can print it, store it, calculate with it, ...)
• It's fast, especially on large NumPy arrays, where np.mean(x) typically beats sum(x)/len(x)
• It can be more accurate (fewer rounding errors when summing many floats)
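A rough sketch of these characteristics. The test data are made up, and the timing numbers depend on your machine and data size, so no specific speed-up is claimed here.

```python
import timeit
import numpy as np

values = list(range(1_000_000))   # hypothetical test data
array = np.array(values)

# Both calls accept a plain list or an array and return a number
mean_from_list = np.mean(values)
mean_from_array = np.mean(array)
doubled = mean_from_array * 2     # the result can be used in calculations

# Rough timing comparison; results vary by machine and data size
t_numpy = timeit.timeit(lambda: np.mean(array), number=10)
t_python = timeit.timeit(lambda: sum(values) / len(values), number=10)
print(f"np.mean on array: {t_numpy:.4f}s, sum/len on list: {t_python:.4f}s")
```

Passing a plain list works, but NumPy first has to convert it to an array, so the speed advantage mainly shows up when the data already live in an array.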
Example 2: basic plotting
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 3, 2]
y = [2, 2, 4, 3, 4, 2]
plt.hist(x)        # histogram of x
plt.show()
plt.plot(x, y)     # line plot
plt.show()
plt.scatter(x, y)  # scatter plot
plt.show()
Figure: Examples of plots generated with matplotlib
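A hedged sketch of putting the three plot types side by side in one figure with plt.subplots. The output file name example_plots.png and the non-interactive Agg backend are assumptions, made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no display needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 3, 2]
y = [2, 2, 4, 3, 4, 2]

# One row with three panels, one per plot type
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x)
axes[0].set_title("Histogram of x")
axes[1].plot(x, y)
axes[1].set_title("Line plot")
axes[2].scatter(x, y)
axes[2].set_title("Scatter plot")
fig.tight_layout()
fig.savefig("example_plots.png")  # hypothetical file name
```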
Pandas
Working with dataframes
When to use dataframes
Native Python data structures (lists, dicts, generators)
pro:
• flexible (especially dicts!)
• fast
• straightforward and easy to understand
con:
• if your data is a table, modeling it as, e.g., a list of lists feels unintuitive
• very low-level: you need to do a lot of stuff ‘by hand’
Pandas dataframes
pro:
• like an R dataframe or a Stata or SPSS dataset
• many convenience functions (descriptive statistics, plotting over time, grouping and subsetting, ...)
con:
• not always necessary (‘overkill’)
• if you deal with really large datasets, you don’t want to load them fully into memory (which pandas does by default)
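The contrast between the two approaches can be sketched as follows. The outlet names, topics, and word counts are invented for illustration.

```python
import pandas as pd

# Hypothetical table: one row per article
rows = [
    ["nrc", "politics", 120],
    ["nrc", "sports", 80],
    ["volkskrant", "politics", 200],
    ["volkskrant", "sports", 40],
]

# 'By hand' with a list of lists: total words per outlet
totals = {}
for outlet, topic, words in rows:
    totals[outlet] = totals.get(outlet, 0) + words

# The same with a dataframe: grouping and subsetting are one-liners
df = pd.DataFrame(rows, columns=["outlet", "topic", "words"])
totals_df = df.groupby("outlet")["words"].sum()
politics_only = df[df["topic"] == "politics"]
print(totals_df)
print(politics_only)
```

For a table this small, the plain-Python version is perfectly fine. The dataframe version pays off once you need several such operations on the same data.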
Plotting and calculating with Pandas
More examples here: https://github.com/damian0604/bdaca/blob/master/ipynb/basic_statistics.ipynb
OLS regression with pandas and statsmodels
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'income': [10, 20, 30, 40, 50],
                   'age': [20, 30, 10, 40, 50],
                   'facebooklikes': [32, 234, 23, 23, 42523]})

# alternative: read from CSV file (or stata...):
# df = pd.read_csv('mydata.csv')

# regress income on age and facebooklikes, then print the results table
myfittedregression = smf.ols(formula='income ~ age + facebooklikes', data=df).fit()
print(myfittedregression.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 income   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Mon, 05 Mar 2018   Prob (F-statistic):              0.421
Time:                        18:07:29   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        14.9525     17.764      0.842      0.489       -61.481    91.386
age               0.4012      0.650      0.617      0.600        -2.394     3.197
facebooklikes     0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================
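The printed table is meant for reading. If you want to keep working with individual numbers, the fitted results object exposes them as attributes. A minimal sketch, refitting the same model on the same data (the variable name fitted is chosen for this example):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'income': [10, 20, 30, 40, 50],
                   'age': [20, 30, 10, 40, 50],
                   'facebooklikes': [32, 234, 23, 23, 42523]})
fitted = smf.ols(formula='income ~ age + facebooklikes', data=df).fit()

r_squared = fitted.rsquared      # the R-squared value from the table
coefficients = fitted.params     # pandas Series indexed by term name
p_values = fitted.pvalues        # p-value per coefficient
print(round(r_squared, 3))
print(coefficients.round(4))
print(p_values.round(3))
```

Because params and pvalues are pandas Series, a single value can be looked up by name, e.g. coefficients['age'].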
Other cool df operations
df['age'].plot() to plot a column
df['age'].describe() to get descriptive statistics
df['age'].value_counts() to get a frequency table
and MUCH more...
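A short sketch of two of these operations. The age values are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 30, 10, 40, 50, 30]})  # hypothetical values

summary = df['age'].describe()          # count, mean, std, min, quartiles, max
frequencies = df['age'].value_counts()  # how often each value occurs
print(summary)
print(frequencies)
```

Note that describe() reports the sample standard deviation (dividing by n - 1), unlike the default of np.std().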
Joanna will introduce you to the exercise
... and of course you can also ask questions about the past weeks if you still have some!
