Scipy 2011 Time Series Analysis in Python

Time Series Analysis in Python with statsmodels

Wes McKinney (1), Josef Perktold (2), Skipper Seabold (3)

(1) Department of Statistical Science, Duke University
(2) Department of Economics, University of North Carolina at Chapel Hill
(3) Department of Economics, American University

10th Python in Science Conference, 13 July 2011


What is statsmodels?

A library for statistical modeling, implementing standard statistical
models in Python using NumPy and SciPy
Includes:
Linear (regression) models of many forms
Descriptive statistics
Statistical tests
Time series analysis
...and much more


What is Time Series Analysis?

Statistical modeling of time-ordered data observations
Inferring structure, forecasting and simulation, and testing
distributional assumptions about the data
Modeling dynamic relationships among multiple time series
Broad applications, e.g. in economics, finance, neuroscience, signal
processing...


Talk Overview

Brief update on statsmodels development
Aside: user interface and data structures
Descriptive statistics and tests
Auto-regressive moving average models (ARMA)
Vector autoregression (VAR) models
Filtering tools (Hodrick-Prescott and others)
Near future: Bayesian dynamic linear models (DLMs), ARCH /
GARCH volatility models and beyond


Statsmodels development update

We’re now on GitHub! Join us:

http://github.com/statsmodels/statsmodels

Check out the slick Sphinx docs:

http://statsmodels.sourceforge.net

Development focus has been largely computational, i.e. writing
correct, tested implementations of all the common classes of
statistical models


Statsmodels development update

Major work to be done on providing a nice integrated user interface
We must work together to close the gap between R and Python!
Some important areas:
Formula framework, for specifying model design matrices
Need integrated rich statistical data structures (pandas)
Data visualization of results should always be a few keystrokes away
Write a “Statsmodels for R users” guide


Aside: statistical data structures and user interface

While I have a captive audience...
Controversial fact: pandas is the only Python library currently
providing data structures matching (and in many places exceeding)
the richness of R’s data structures (for statistics)
Let’s have a BoF session so I can justify this statement
Feedback I hear is that end users find the fragmented, incohesive set
of Python tools for data analysis and statistics to be confusing,
frustrating, and certainly not compelling them to use Python...
(Not to mention the packaging headaches)


Aside: statistical data structures and user interface

We need to “commit” ASAP (not 12 months from now) to one or more
high-level data structures as the “primary data structure(s) for
statistical data analysis” and communicate that clearly to end users
Or we might as well all start programming in R...


Example data: EEG trace data

[Figure: EEG trace time series plot]


Example data: Macroeconomic data

[Figure: three stacked panels showing the macroeconomic series cpi, m1, and realgdp; x-axis in years, 1960 to 2008]


Example data: Stock data

[Figure: stock price series for AAPL, GOOG, MSFT, and YHOO; x-axis in years, 2001 to 2009]


Descriptive statistics
Autocorrelation, partial autocorrelation plots
Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
models
acf = tsa.acf(eeg, 50)
pacf = tsa.pacf(eeg, 50)

[Figure: two panels, Autocorrelation and Partial Autocorrelation of the EEG data, for lags 0 to 50]
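
To make the autocorrelation plot concrete, here is a minimal pure-Python sketch of the sample autocorrelation function. The function name `sample_acf` and the toy series are assumptions for illustration only; this is not the statsmodels implementation, and the partial autocorrelation is not covered.

```python
def sample_acf(x, nlags):
    """Sample autocorrelations r_0, ..., r_nlags of the sequence x.

    Uses the common estimator that divides every autocovariance by n.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n  # lag-0 autocovariance (variance)
    acf = []
    for k in range(nlags + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


# Hypothetical toy series with period 4 (not the EEG data from the slides)
series = [1.0, 2.0, 3.0, 2.0] * 10
acf_values = sample_acf(series, 8)
```

For a periodic series like this one, the autocorrelation is strongly negative at half the period (lag 2) and strongly positive at the full period (lag 4).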


Statistical tests

Ljung-Box test for zero autocorrelation
Unit root test for cointegration (Augmented Dickey-Fuller test)
Granger-causality
Whiteness (iid-ness) and normality
See our conference paper (when the proceedings get published!)
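
As an illustration of the first test in the list, the Ljung-Box statistic is Q = n (n + 2) Σ_{k=1..h} r_k² / (n − k), where r_k is the lag-k sample autocorrelation. The sketch below computes only the statistic, in pure Python; the name `ljung_box_q` and the toy series are assumptions, and it is not the statsmodels routine. Under the null of zero autocorrelation, Q is compared against a chi-square distribution with h degrees of freedom.

```python
def ljung_box_q(x, h):
    """Ljung-Box Q statistic using the first h sample autocorrelations."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, h + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


# Hypothetical, strongly autocorrelated series: a linear ramp
ramp = [float(t) for t in range(50)]
q_stat = ljung_box_q(ramp, 5)

# 5% critical value of the chi-square distribution with 5 degrees of freedom
CHI2_5DF_95 = 11.07
reject_no_autocorrelation = q_stat > CHI2_5DF_95
```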


Autoregressive moving average (ARMA) models
One of the most common univariate time series models:

$y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$

where $E(\epsilon_t, \epsilon_s) = 0$ for $t \neq s$ and $\epsilon_t \sim N(0, \sigma^2)$

Exact log-likelihood can be evaluated via the Kalman filter, but the
“conditional” likelihood is easier and commonly used
statsmodels has tools for simulating ARMA processes with known
coefficients $a_i$, $b_i$ and also estimation given specified lag orders
import scikits.statsmodels.tsa.arima_process as ap
ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
nobs = 100
y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
y += 4 # add in constant
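
For readers who want to see the recursion itself, here is a minimal pure-Python sketch that simulates the ARMA equation directly. The function `simulate_arma`, its parameters, and the coefficient values are assumptions for illustration. It follows the sign convention of the equation on this slide, which may differ from the convention a library routine expects for its coefficient arguments.

```python
import random


def simulate_arma(mu, ar, ma, nobs, sigma=1.0, burnin=100, seed=0):
    """Simulate y_t = mu + sum_i ar[i] y_{t-i} + e_t + sum_j ma[j] e_{t-j}.

    ar and ma hold the coefficients for lags 1, 2, ... (no leading 1).
    The first `burnin` values are discarded to reduce start-up effects.
    """
    rng = random.Random(seed)
    total = nobs + burnin
    eps = [rng.gauss(0.0, sigma) for _ in range(total)]
    y = []
    for t in range(total):
        value = mu + eps[t]
        for i, a in enumerate(ar, start=1):
            if t - i >= 0:
                value += a * y[t - i]
        for j, b in enumerate(ma, start=1):
            if t - j >= 0:
                value += b * eps[t - j]
        y.append(value)
    return y[burnin:]


# Hypothetical stationary ARMA(2, 1); implied mean is mu / (1 - 0.75 + 0.25) = 2
y_sim = simulate_arma(mu=1.0, ar=[0.75, -0.25], ma=[0.5], nobs=500)
sample_mean = sum(y_sim) / len(y_sim)
```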


ARMA Estimation

Several likelihood-based estimators implemented (see docs)
model = tsa.ARMA(y)
result = model.fit(order=(2, 1), trend='c',
                   method='css-mle', disp=-1)
result.params
# array([ 3.97, -0.97, -0.05, -0.13])

Standard model diagnostics, standard errors, information criteria
(AIC, BIC, ...), etc. are available in the returned ARMAResults object
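
To show what a "conditional" estimator looks like in the simplest case, the sketch below fits an AR(1) model with a constant by conditioning on the first observation. With Gaussian errors, maximizing that conditional likelihood is the same as ordinary least squares of y_t on (1, y_{t-1}). The function `fit_ar1_conditional` and the simulated data are assumptions for illustration; this is not the `css-mle` method used above.

```python
import random


def fit_ar1_conditional(y):
    """Least squares of y_t on (1, y_{t-1}).

    Returns (intercept, ar_coefficient, residual_variance).
    """
    x = y[:-1]  # lagged values
    z = y[1:]   # current values
    n = len(z)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    a = sxz / sxx
    c = z_bar - a * x_bar
    resid = [zi - c - a * xi for xi, zi in zip(x, z)]
    sigma2 = sum(r * r for r in resid) / n
    return c, a, sigma2


# Hypothetical data: y_t = 1.0 + 0.6 y_{t-1} + e_t with unit-variance noise
rng = random.Random(42)
y_ar1 = [0.0]
for _ in range(2000):
    y_ar1.append(1.0 + 0.6 * y_ar1[-1] + rng.gauss(0.0, 1.0))

const, ar_coef, sigma2 = fit_ar1_conditional(y_ar1)
```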


Vector Autoregression (VAR) models

Widely used model for modeling multiple (K-variate) time series,
especially in macroeconomics:

$Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma)$

Matrices $A_i$ are $K \times K$.
$Y_t$ must be a stationary process (sometimes achieved by
differencing). Related class of models (VECM) for modeling
nonstationary (including cointegrated) processes
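
The stationarity requirement can be checked directly in the simplest case. A VAR(1) process is stable when all eigenvalues of A_1 lie strictly inside the unit circle; for higher orders the same check applies to the companion matrix. The sketch below handles only a 2 × 2 coefficient matrix, and the function names and matrices are assumptions for illustration.

```python
import cmath


def eigenvalues_2x2(a):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    tr = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2


def is_stable_var1(a):
    """True if every eigenvalue of A_1 has modulus strictly below 1."""
    return all(abs(lam) < 1 for lam in eigenvalues_2x2(a))


# Hypothetical coefficient matrices
A1_STABLE = [[0.5, 0.1], [0.2, 0.3]]
A1_UNIT_ROOT = [[1.0, 0.0], [0.0, 0.5]]  # first variable is a random walk

stable_ok = is_stable_var1(A1_STABLE)
unit_root_ok = is_stable_var1(A1_UNIT_ROOT)
```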



>>> model = VAR(data); model.select_order(8)
                 VAR Order Selection
=====================================================
            aic          bic          fpe        hqic
-----------------------------------------------------
0        -27.83       -27.78    8.214e-13      -27.81
1        -28.77       -28.57    3.189e-13      -28.69
2        -29.00      -28.64*    2.556e-13      -28.85
3        -29.10       -28.60    2.304e-13     -28.90*
4        -29.09       -28.43    2.330e-13      -28.82
5        -29.13       -28.33    2.228e-13      -28.81
6       -29.14*       -28.18   2.213e-13*      -28.75
7        -29.07       -27.96    2.387e-13      -28.62
=====================================================
* Minimum



>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Results for equation m1
====================================================
             coefficient  std. error   t-stat   prob
----------------------------------------------------
const           0.004968    0.001850    2.685  0.008
L1.m1           0.363636    0.071307    5.100  0.000
L1.realgdp     -0.077460    0.092975   -0.833  0.406
L1.cpi         -0.052387    0.128161   -0.409  0.683
L2.m1           0.250589    0.072050    3.478  0.001
L2.realgdp     -0.085874    0.092032   -0.933  0.352
L2.cpi          0.169803    0.128376    1.323  0.188
====================================================
<snip>



>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Correlation matrix of residuals
                 m1    realgdp        cpi
m1         1.000000  -0.055690  -0.297494
realgdp   -0.055690   1.000000   0.115597
cpi       -0.297494   0.115597   1.000000


VAR: Impulse Response analysis
Analyze systematic impact of unit “shock” to a single variable

irf = result.irf(10)
irf.plot()

[Figure: Impulse responses, 3 × 3 grid of panels over horizons 0 to 10. Row 1: m1 → m1, realgdp → m1, cpi → m1. Row 2: m1 → realgdp, realgdp → realgdp, cpi → realgdp. Row 3: m1 → cpi, realgdp → cpi, cpi → cpi]
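
For intuition about what the impulse response plot shows: in a VAR(1), the (non-orthogonalized) response at horizon h to a unit shock at time 0 is the matrix power A_1^h. The pure-Python sketch below computes these matrices for a small example; the helper names and the coefficient matrix are assumptions for illustration, not the fitted model from the slides.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def impulse_responses_var1(a1, horizon):
    """Return [Phi_0, ..., Phi_horizon] with Phi_h = A_1 ** h.

    Phi_h[i][j] is the response of variable i, h periods after a
    unit shock to variable j.
    """
    k = len(a1)
    phi = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    responses = [phi]
    for _ in range(horizon):
        phi = matmul(a1, phi)
        responses.append(phi)
    return responses


# Hypothetical stable coefficient matrix
A1 = [[0.5, 0.1], [0.2, 0.3]]
irfs = impulse_responses_var1(A1, 10)
```

Because this A_1 is stable, the responses decay toward zero as the horizon grows.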


VAR: Forecast Error Variance Decomposition
Analyze contribution of each variable to forecasting error

fevd = result.fevd(20)
fevd.plot()

[Figure: Forecast error variance decomposition (FEVD), three stacked panels (m1, realgdp, cpi), each showing the shares attributed to m1, realgdp, and cpi over horizons 0 to 20]
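
A rough sketch of how such a decomposition can be computed for a two-variable VAR(1): orthogonalize the shocks with the Cholesky factor P of Σ, form Θ_i = A_1^i P, and accumulate squared entries over the forecast horizon. The function names, coefficient matrix, and covariance matrix are assumptions for illustration; the result depends on the variable ordering used in the Cholesky factorization.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cholesky_2x2(s):
    """Lower-triangular Cholesky factor of a 2 x 2 covariance matrix."""
    l11 = math.sqrt(s[0][0])
    l21 = s[1][0] / l11
    l22 = math.sqrt(s[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def fevd_var1(a1, sigma, horizon):
    """shares[i][j]: fraction of the horizon-step forecast error variance
    of variable i attributed to orthogonalized shock j."""
    p = cholesky_2x2(sigma)
    k = len(a1)
    phi = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    contrib = [[0.0] * k for _ in range(k)]
    for _ in range(horizon):
        theta = matmul(phi, p)  # orthogonalized response at this step
        for i in range(k):
            for j in range(k):
                contrib[i][j] += theta[i][j] ** 2
        phi = matmul(a1, phi)
    return [[c / sum(row) for c in row] for row in contrib]


# Hypothetical model
A1 = [[0.5, 0.1], [0.2, 0.3]]
SIGMA = [[1.0, 0.3], [0.3, 0.5]]
shares_h1 = fevd_var1(A1, SIGMA, 1)
shares_h10 = fevd_var1(A1, SIGMA, 10)
```

At horizon 1 the first variable's forecast error is attributed entirely to its own shock, which is a consequence of the triangular ordering.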


VAR: Statistical tests

In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
Granger causality f-test
=========================================================
Test statistic   Critical Value     p-value            df
---------------------------------------------------------
      1.248787         2.387325       0.289      (4, 579)
=========================================================
H_0: ['cpi', 'realgdp'] do not Granger-cause m1
Conclusion: fail to reject H_0 at 5.00% significance level


Filtering

Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$
and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$.

[Figure: inflation together with its HP-filtered cyclical and trend components; x-axis in years, 1962 to 2006]
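
A minimal sketch of the computation behind the HP filter: the trend minimizes Σ (y_t − τ_t)² + λ Σ (τ_{t+1} − 2τ_t + τ_{t−1})², which gives the linear system (I + λ D'D) τ = y, where D is the second-difference matrix. The pure-Python code below builds and solves that system densely, so it is only suitable for short series. The function names and data are assumptions for illustration, not the statsmodels implementation; λ = 1600 is the conventional choice for quarterly data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            if f != 0.0:
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) with trend solving (I + lamb * D'D) trend = y."""
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in range(n - 2):
        # Row r of D has entries (1, -2, 1) at columns r, r+1, r+2
        d = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, di in d.items():
            for j, dj in d.items():
                a[i][j] += lamb * di * dj
    trend = solve(a, list(y))
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return cycle, trend


# Hypothetical data: a pure linear trend has zero second differences,
# so the filter should return it unchanged as the trend.
line = [2.0 + 0.5 * t for t in range(30)]
cycle, trend = hp_filter(line)
```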


Filtering

In addition to the HP filter, two other filters popular in finance and
economics, Baxter-King (BK) and Christiano-Fitzgerald (CF), are available
We refer you to our paper and the documentation for details on these:

[Figure: two panels, "Inflation and Unemployment: BK Filtered" and "Inflation and Unemployment: CF Filtered", each showing the INFL and UNEMP series; x-axis in years from the 1960s to the 2000s]


Preview: Bayesian dynamic linear models (DLM)

A state space model by another name:

$y_t = F_t \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t)$
$\theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t)$

Estimation of the basic model by Kalman filter recursions. Provides an
elegant way to do time-varying linear regressions for forecasting
Extensions: multivariate DLMs, stochastic volatility (SV) models,
MCMC-based posterior sampling, mixtures of DLMs
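
To illustrate the Kalman filter recursions in the simplest case, the sketch below filters a local level model: a scalar DLM with F_t = 1, G = 1, and constant known variances V and W. The function `local_level_filter`, its parameter names, and the data are assumptions for illustration; this is not the DLM code previewed in this talk.

```python
def local_level_filter(y, v, w, m0=0.0, c0=1e6):
    """Kalman filter for y_t = theta_t + nu_t, theta_t = theta_{t-1} + omega_t.

    v: observation variance, w: state variance.
    m0, c0: prior mean and variance of the initial state.
    Returns the filtered state means and variances.
    """
    m, c = m0, c0
    means, variances = [], []
    for obs in y:
        # Prior for theta_t given data through t-1
        a = m
        r = c + w
        # One-step-ahead forecast of y_t and its variance
        f = a
        q = r + v
        # Update with the observed forecast error
        gain = r / q
        m = a + gain * (obs - f)
        c = r * (1.0 - gain)
        means.append(m)
        variances.append(c)
    return means, variances


# Hypothetical data: a constant signal observed without error in this toy case
observations = [5.0] * 50
means, variances = local_level_filter(observations, v=1.0, w=0.1)
```

With a diffuse prior (large `c0`), the first filtered mean jumps almost all the way to the first observation, and the filtered variance settles quickly to a steady-state value below the observation variance.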


Preview: DLM Example (Constant+Trend model)

model = Polynomial(2)
dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
          m0=m0, C0=C0, n0=n0, s0=s0,            # priors
          state_discount=.95)                    # discount factor
[Figure: Constant + Trend DLM for the AAPL price series; x-axis from Nov 2008 to Nov 2009]


Preview: Stochastic volatility models

[Figure: JPY-USD Exchange Rate Volatility Process, observations 0 to 1000]


Future: sandbox and beyond

ARCH / GARCH models for volatility
Structural VAR and error correction models (ECM) for cointegrated
processes
Models with non-normally distributed errors
Better data description, visualization, and interactive research tools
More sophisticated Bayesian time series models


Conclusions

We’ve implemented many foundational models for time series
analysis, but the field is very broad
User interface can and should be much improved
Repo: http://github.com/statsmodels/statsmodels
Docs: http://statsmodels.sourceforge.net
Contact: pystatsmodels@googlegroups.com

