Time Series Analysis in Python with statsmodels

                   Wes McKinney (1)             Josef Perktold (2)           Skipper Seabold (3)

                                            (1) Department of Statistical Science
                                                    Duke University
                                            (2) Department of Economics
                                    University of North Carolina at Chapel Hill
                                            (3) Department of Economics
                                                  American University


                       10th Python in Science Conference, 13 July 2011



McKinney, Perktold, Seabold (statsmodels)        Python Time Series Analysis          SciPy Conference 2011   1 / 29
What is statsmodels?




          A library for statistical modeling, implementing standard statistical
          models in Python using NumPy and SciPy
          Includes:
                  Linear (regression) models of many forms
                  Descriptive statistics
                  Statistical tests
                  Time series analysis
                  ...and much more




What is Time Series Analysis?




          Statistical modeling of time-ordered data observations
          Inferring structure, forecasting and simulation, and testing
          distributional assumptions about the data
          Modeling dynamic relationships among multiple time series
          Broad applications e.g. in economics, finance, neuroscience, signal
          processing...




Talk Overview



          Brief update on statsmodels development
          Aside: user interface and data structures
          Descriptive statistics and tests
          Auto-regressive moving average models (ARMA)
          Vector autoregression (VAR) models
          Filtering tools (Hodrick-Prescott and others)
          Near future: Bayesian dynamic linear models (DLMs), ARCH /
          GARCH volatility models and beyond




Statsmodels development update



          We’re now on GitHub! Join us:

                         http://github.com/statsmodels/statsmodels

          Check out the slick Sphinx docs:

                                http://statsmodels.sourceforge.net

          Development focus has been largely computational, i.e. writing
          correct, tested implementations of all the common classes of
          statistical models




Statsmodels development update




          Major work to be done on providing a nice integrated user interface
          We must work together to close the gap between R and Python!
          Some important areas:
                  Formula framework, for specifying model design matrices
                  Need integrated rich statistical data structures (pandas)
                  Data visualization of results should always be a few keystrokes away
                  Write a “Statsmodels for R users” guide




Aside: statistical data structures and user interface



          While I have a captive audience...
          Controversial fact: pandas is the only Python library currently
          providing data structures matching (and in many places exceeding)
          the richness of R’s data structures (for statistics)
                  Let’s have a BoF session so I can justify this statement
          Feedback I hear is that end users find the fragmented, incohesive set
          of Python tools for data analysis and statistics to be confusing,
          frustrating, and certainly not compelling them to use Python...
                  (Not to mention the packaging headaches)




Aside: statistical data structures and user interface




          We need to “commit” ASAP (not 12 months from now) to a high
          level data structure(s) as the “primary data structure(s) for statistical
          data analysis” and communicate that clearly to end users
                  Or we might as well all start programming in R...




Example data: EEG trace data


          [Figure: EEG trace plotted against observation index, 0 to 4000]




Example data: Macroeconomic data


          [Figure: stacked time series panels for cpi, m1, and realgdp, 1960 to 2008]




Example data: Stock data


          [Figure: stock prices for AAPL, GOOG, MSFT, and YHOO, 2001 to 2009]




Descriptive statistics
            Autocorrelation, partial autocorrelation plots
            Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
            models
            acf = tsa.acf(eeg, nlags=50)
            pacf = tsa.pacf(eeg, nlags=50)

          [Figure: two panels, "Autocorrelation" and "Partial Autocorrelation", for lags 0 to 50]
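The autocorrelation plotted above is a simple ratio of lagged cross-products. A minimal NumPy sketch of that computation follows; `sample_acf` is a hypothetical helper written for illustration (not the statsmodels implementation), and the simulated AR(1) series is only a stand-in for the EEG data.

```python
import numpy as np


def sample_acf(x, nlags):
    """Return the sample autocorrelations r_0, ..., r_nlags of a 1-D series."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    n = len(centered)
    return np.array(
        [np.dot(centered[: n - k], centered[k:]) / denom for k in range(nlags + 1)]
    )


# Hypothetical stand-in for the EEG trace: an AR(1) series with coefficient 0.8.
rng = np.random.default_rng(0)
shocks = rng.standard_normal(2000)
series = np.zeros(2000)
for t in range(1, 2000):
    series[t] = 0.8 * series[t - 1] + shocks[t]

acf_values = sample_acf(series, 50)
print(acf_values[:3])
```

For an AR(1) process the autocorrelations decay geometrically, which is the kind of pattern used to pick candidate ARMA orders.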

Statistical tests




          Ljung-Box test for zero autocorrelation
          Unit root test (Augmented Dickey-Fuller test), a building block for cointegration testing
          Granger-causality
          Whiteness (iid-ness) and normality
          See our conference paper (when the proceedings get published!)
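To make the Ljung-Box test concrete, here is a minimal NumPy sketch of its Q statistic, Q = n(n+2) Σ r_k² / (n−k) summed over lags k = 1..h, which is compared against a chi-square distribution with h degrees of freedom under the null of zero autocorrelation. The helper name `ljung_box_q` and the simulated series are assumptions for illustration, not the statsmodels API.

```python
import numpy as np


def ljung_box_q(x, nlags):
    """Ljung-Box Q statistic for the first `nlags` autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    q = 0.0
    for k in range(1, nlags + 1):
        r_k = np.dot(centered[: n - k], centered[k:]) / denom
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


rng = np.random.default_rng(1)
noise = rng.standard_normal(500)

# A strongly autocorrelated series built from the same shocks.
persistent = np.zeros(500)
for t in range(1, 500):
    persistent[t] = 0.8 * persistent[t - 1] + noise[t]

CHI2_95_DF10 = 18.307  # 95% quantile of the chi-square distribution with 10 df
q_noise = ljung_box_q(noise, 10)
q_persistent = ljung_box_q(persistent, 10)
print(q_noise, q_persistent, CHI2_95_DF10)
```

A Q value above the critical value rejects the hypothesis of zero autocorrelation at the 5% level.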




Autoregressive moving average (ARMA) models
          One of the most common univariate time series models:

                   $y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$

                   where $E(\epsilon_t \epsilon_s) = 0$ for $t \neq s$, and $\epsilon_t \sim N(0, \sigma^2)$


          Exact log-likelihood can be evaluated via the Kalman filter, but the
          “conditional” likelihood is easier and commonly used
          statsmodels has tools for simulating ARMA processes with known
          coefficients $a_i$, $b_i$ and also estimation given specified lag orders
              import scikits.statsmodels.tsa.arima_process as ap
              ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
              nobs = 100
              y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
              y += 4 # add in constant
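The ARMA equation can also be simulated directly by recursion, which shows what the coefficients mean. The sketch below is a plain NumPy illustration, not the library routine: `simulate_arma` and its parameters are hypothetical, and the coefficients are given in the sign convention of the equation on this slide. Note that `arma_generate_sample` instead takes lag-polynomial coefficients, with a leading 1 and the AR signs flipped.

```python
import numpy as np


def simulate_arma(mu, a, b, nobs, sigma=1.0, burn=200, seed=0):
    """Simulate y_t = mu + sum_i a_i y_{t-i} + eps_t + sum_j b_j eps_{t-j}."""
    rng = np.random.default_rng(seed)
    total = nobs + burn
    eps = rng.normal(0.0, sigma, size=total)
    y = np.zeros(total)
    for t in range(total):
        ar_part = sum(a_i * y[t - i] for i, a_i in enumerate(a, start=1) if t - i >= 0)
        ma_part = sum(b_j * eps[t - j] for j, b_j in enumerate(b, start=1) if t - j >= 0)
        y[t] = mu + ar_part + eps[t] + ma_part
    return y[burn:]  # drop the burn-in so start-up effects fade


# ARMA(1, 1) with intercept 1.0; the implied mean is mu / (1 - a_1) = 2.0.
y = simulate_arma(mu=1.0, a=[0.5], b=[0.3], nobs=5000)
print(y.mean())
```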

ARMA Estimation



          Several likelihood-based estimators implemented (see docs)
              model = tsa.ARMA(y)
              result = model.fit(order=(2, 1), trend='c',
                                 method='css-mle', disp=-1)
              result.params
              # array([ 3.97, -0.97, -0.05, -0.13])


          Standard model diagnostics, standard errors, information criteria
          (AIC, BIC, ...), etc. are available in the returned ARMAResults object
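As a small illustration of the "conditional" approach: for a pure AR(p) model with Gaussian errors, maximizing the likelihood conditional on the first p observations reduces to ordinary least squares on lagged values. The sketch below covers only that special case (no MA terms); `fit_ar_ols` is a hypothetical helper and is not how the estimators in statsmodels are implemented.

```python
import numpy as np


def fit_ar_ols(y, p):
    """Conditional least squares for an AR(p) model with a constant."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column i holds y_{t-1-i} for t = p, ..., n-1.
    lagged = np.column_stack([y[p - 1 - i : n - 1 - i] for i in range(p)])
    design = np.column_stack([np.ones(n - p), lagged])
    target = y[p:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sigma2 = resid @ resid / len(target)
    return coef, sigma2


# Simulate y_t = 1.0 + 0.5 y_{t-1} - 0.3 y_{t-2} + eps_t
rng = np.random.default_rng(2)
eps = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(2, 5000):
    y[t] = 1.0 + 0.5 * y[t - 1] - 0.3 * y[t - 2] + eps[t]

coef, sigma2 = fit_ar_ols(y, p=2)
print(coef, sigma2)
```

The returned vector is ordered as (constant, a_1, ..., a_p). Models with MA terms need an iterative optimizer because the errors are not observed directly.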




Vector Autoregression (VAR) models



          Widely used model for modeling multiple ($K$-variate) time series,
          especially in macroeconomics:

                           $Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma)$

          Matrices $A_i$ are $K \times K$.
          $Y_t$ must be a stationary process (sometimes achieved by
          differencing). Related class of models (VECM) for modeling
          nonstationary (including cointegrated) processes
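A VAR(p) can be estimated equation by equation with least squares on stacked lags. The following NumPy sketch illustrates this under stated assumptions: `fit_var_ols` is a hypothetical helper, an intercept is included even though the equation above omits one, and the simulated bivariate data are made up for the example.

```python
import numpy as np


def fit_var_ols(Y, p):
    """Equation-by-equation OLS for a VAR(p) with an intercept."""
    Y = np.asarray(Y, dtype=float)
    T, K = Y.shape
    # Row for time t holds [1, Y_{t-1}, ..., Y_{t-p}].
    design = np.column_stack(
        [np.ones(T - p)] + [Y[p - i : T - i] for i in range(1, p + 1)]
    )
    target = Y[p:]
    B, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ B
    sigma = resid.T @ resid / (T - p)
    intercept = B[0]
    # Rows 1 + i*K ... 1 + (i+1)*K of B hold the transpose of A_{i+1}.
    A = [B[1 + i * K : 1 + (i + 1) * K].T for i in range(p)]
    return intercept, A, sigma


# Simulate a stable bivariate VAR(1) with a known coefficient matrix.
A1_true = np.array([[0.5, 0.1], [0.0, 0.3]])
rng = np.random.default_rng(3)
Y = np.zeros((5000, 2))
for t in range(1, 5000):
    Y[t] = A1_true @ Y[t - 1] + rng.standard_normal(2)

intercept, A, sigma = fit_var_ols(Y, p=1)
print(A[0])
```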




Vector Autoregression (VAR) models

   >>> model = VAR(data); model.select_order(8)
                    VAR Order Selection
   =====================================================
              aic          bic          fpe         hqic
   -----------------------------------------------------
   0       -27.83       -27.78    8.214e-13       -27.81
   1       -28.77       -28.57    3.189e-13       -28.69
   2       -29.00      -28.64*    2.556e-13       -28.85
   3       -29.10       -28.60    2.304e-13      -28.90*
   4       -29.09       -28.43    2.330e-13       -28.82
   5       -29.13       -28.33    2.228e-13       -28.81
   6      -29.14*       -28.18   2.213e-13*       -28.75
   7       -29.07       -27.96    2.387e-13       -28.62
   =====================================================
   * Minimum
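Order selection tables like this one trade off the log-determinant of the residual covariance against a penalty on the number of lag coefficients. The sketch below uses one common textbook form of the criteria, with m·K² lag parameters and a common estimation sample for every order; the exact parameter count behind the table on the slide is not shown, so treat this as an assumption. `var_order_criteria` is a hypothetical helper.

```python
import numpy as np


def var_order_criteria(Y, max_lag):
    """AIC, BIC and HQIC for VAR(m), m = 0..max_lag, on a common sample."""
    Y = np.asarray(Y, dtype=float)
    T_full, K = Y.shape
    T = T_full - max_lag          # every order uses the same T observations
    target = Y[max_lag:]
    table = {}
    for m in range(max_lag + 1):
        blocks = [np.ones((T, 1))]
        blocks += [Y[max_lag - i : T_full - i] for i in range(1, m + 1)]
        design = np.hstack(blocks)
        B, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ B
        log_det = np.linalg.slogdet(resid.T @ resid / T)[1]
        n_lag_params = m * K * K
        table[m] = {
            "aic": log_det + 2.0 * n_lag_params / T,
            "bic": log_det + np.log(T) * n_lag_params / T,
            "hqic": log_det + 2.0 * np.log(np.log(T)) * n_lag_params / T,
        }
    return table


# Simulated bivariate VAR(1), so the true order is 1.
A1_true = np.array([[0.5, 0.1], [0.0, 0.3]])
rng = np.random.default_rng(4)
Y = np.zeros((2000, 2))
for t in range(1, 2000):
    Y[t] = A1_true @ Y[t - 1] + rng.standard_normal(2)

table = var_order_criteria(Y, max_lag=4)
best_bic = min(table, key=lambda m: table[m]["bic"])
print(best_bic)
```

BIC penalizes extra lags more heavily than AIC, which is why the starred minima in the table can fall at different orders.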

Vector Autoregression (VAR) models

   >>> result = model.fit(2)
   >>> result.summary() # print summary for each variable
   <snip>
   Results for equation m1
   ====================================================
               coefficient    std. error t-stat    prob
   ----------------------------------------------------
   const          0.004968      0.001850   2.685 0.008
   L1.m1          0.363636      0.071307   5.100 0.000
   L1.realgdp    -0.077460      0.092975 -0.833 0.406
   L1.cpi        -0.052387      0.128161 -0.409 0.683
   L2.m1          0.250589      0.072050   3.478 0.001
   L2.realgdp    -0.085874      0.092032 -0.933 0.352
   L2.cpi         0.169803      0.128376   1.323 0.188
   ====================================================
   <snip>


Vector Autoregression (VAR) models




   >>> result = model.fit(2)
   >>> result.summary() # print summary for each variable
   <snip>
   Correlation matrix of residuals
                    m1   realgdp       cpi
   m1         1.000000 -0.055690 -0.297494
   realgdp   -0.055690  1.000000  0.115597
   cpi       -0.297494  0.115597  1.000000




VAR: Impulse Response analysis
          Analyze systematic impact of unit “shock” to a single variable

   irf = result.irf(10)
   irf.plot()

          [Figure: "Impulse responses", a 3 x 3 grid of panels over horizons 0 to 10:
           m1 → m1, realgdp → m1, cpi → m1;
           m1 → realgdp, realgdp → realgdp, cpi → realgdp;
           m1 → cpi, realgdp → cpi, cpi → cpi]
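Impulse responses follow from the VAR coefficient matrices by a short recursion: Φ_0 = I and Φ_i = Σ_j Φ_{i−j} A_j for j = 1..min(i, p). A minimal NumPy sketch of the non-orthogonalized responses is given here; `impulse_responses` and the coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np


def impulse_responses(A, steps):
    """Non-orthogonalized responses Phi_0..Phi_steps for a VAR with lag matrices A."""
    K = A[0].shape[0]
    phi = [np.eye(K)]
    for i in range(1, steps + 1):
        phi.append(sum(phi[i - j] @ A[j - 1] for j in range(1, min(i, len(A)) + 1)))
    return np.array(phi)


# Hypothetical VAR(2) coefficients, chosen only for illustration.
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
A2 = np.array([[0.1, 0.0], [0.05, 0.1]])
phi = impulse_responses([A1, A2], steps=10)
print(phi[1])
```

Entry `phi[i, j, k]` is the response of variable j, i periods after a unit shock to variable k.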



VAR: Forecast Error Variance Decomposition
          Analyze contribution of each variable to forecasting error

   fevd = result.fevd(20)
   fevd.plot()

          [Figure: "Forecast error variance decomposition (FEVD)", one panel each for m1, realgdp,
           and cpi, showing the shares attributed to m1, realgdp, and cpi over horizons 0 to 20]
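The decomposition can be sketched from orthogonalized impulse responses: with Σ = P Pᵀ (Cholesky) and Θ_i = Φ_i P, the share of variable j's h-step forecast error variance due to shock k is the cumulated squared θ_{jk,i} divided by the cumulated sum over all k. The NumPy code below is an illustration under those assumptions; the function name `fevd` and the input values are hypothetical, and the result depends on the variable ordering used for the Cholesky factor.

```python
import numpy as np


def fevd(A, sigma, horizon):
    """Orthogonalized forecast error variance decomposition for a VAR."""
    K = sigma.shape[0]
    P = np.linalg.cholesky(sigma)      # lower-triangular, sigma = P @ P.T
    phi = [np.eye(K)]
    for i in range(1, horizon):
        phi.append(sum(phi[i - j] @ A[j - 1] for j in range(1, min(i, len(A)) + 1)))
    theta_sq = np.array([(ph @ P) ** 2 for ph in phi])   # shape (horizon, K, K)
    cumulative = theta_sq.cumsum(axis=0)
    # Entry [h-1, j, k]: share of variable j's h-step error variance due to shock k.
    return cumulative / cumulative.sum(axis=2, keepdims=True)


# Hypothetical VAR(1) coefficients and residual covariance.
A1 = np.array([[0.5, 0.1], [0.2, 0.3]])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
shares = fevd([A1], sigma, horizon=20)
print(shares[-1])
```

Each row of shares sums to one, and at horizon 1 the first variable in the ordering is explained entirely by its own shock.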



VAR: Statistical tests



   In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
   Granger causality f-test
   =========================================================
      Test statistic   Critical Value      p-value        df
   ---------------------------------------------------------
            1.248787         2.387325        0.289 (4, 579)
   =========================================================
   H_0: ['cpi', 'realgdp'] do not Granger-cause m1
   Conclusion: fail to reject H_0 at 5.00% significance level




Filtering

          Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$
          and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$.

          [Figure: Inflation with its cyclical component and trend component, 1962 to 2006]
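The HP trend solves a penalized least squares problem: minimize Σ (y_t − τ_t)² + λ Σ (τ_{t+1} − 2τ_t + τ_{t−1})², whose solution is τ = (I + λ DᵀD)⁻¹ y with D the second-difference operator. Below is a minimal dense NumPy sketch of that closed form; `hp_filter` is a hypothetical helper (not the statsmodels routine), and λ = 1600 is the conventional choice for quarterly data.

```python
import numpy as np


def hp_filter(y, lamb=1600.0):
    """Split y into (cycle, trend) with the Hodrick-Prescott filter."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator D with shape (n - 2, n).
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i : i + 3] = [1.0, -2.0, 1.0]
    trend = np.linalg.solve(np.eye(n) + lamb * D.T @ D, y)
    return y - trend, trend


# A straight line has zero second differences, so it is returned as pure trend.
y_lin = np.linspace(0.0, 10.0, 120)
cycle_lin, trend_lin = hp_filter(y_lin)

# A noisy series (made up for illustration) splits into trend plus cycle.
rng = np.random.default_rng(5)
y_noisy = y_lin + rng.standard_normal(120)
cycle, trend = hp_filter(y_noisy)
print(np.abs(cycle_lin).max())
```

A dense solve is fine for short series; long series would call for a sparse or banded solver.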

Filtering

          In addition to the HP filter, two other filters popular in finance and
          economics, Baxter-King and Christiano-Fitzgerald, are available
          We refer you to our paper and the documentation for details on these:

          [Figure: two panels, "Inflation and Unemployment: BK Filtered" and
           "Inflation and Unemployment: CF Filtered", each showing INFL and UNEMP]

Preview: Bayesian dynamic linear models (DLM)



          A state space model by another name:

                                      $y_t = F_t \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t)$
                                      $\theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t)$

          Estimation of basic model by Kalman filter recursions. Provides
          elegant way to do time-varying linear regressions for forecasting
          Extensions: multivariate DLMs, stochastic volatility (SV) models,
          MCMC-based posterior sampling, mixtures of DLMs
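The Kalman filter recursions for the basic model can be sketched in a few lines. This illustration assumes a univariate observation, a constant F and G, and known, constant variances V and W. The DLM previewed in this talk instead uses discount factors and priors on the observation variance, so this is a simplified stand-in. `kalman_filter` and all numeric values are hypothetical.

```python
import numpy as np


def kalman_filter(y, F, G, V, W, m0, C0):
    """Filtered state means for y_t = F theta_t + nu_t, theta_t = G theta_{t-1} + omega_t."""
    m = np.asarray(m0, dtype=float)
    C = np.asarray(C0, dtype=float)
    filtered = []
    for obs in y:
        a = G @ m                      # prior state mean
        R = G @ C @ G.T + W            # prior state covariance
        f = F @ a                      # one-step forecast
        Q = F @ R @ F + V              # forecast variance
        gain = R @ F / Q               # adaptive (Kalman) gain
        m = a + gain * (obs - f)       # posterior mean
        C = R - np.outer(gain, gain) * Q
        filtered.append(m)
    return np.array(filtered)


# Constant + trend (second-order polynomial) model: state = (level, slope).
F = np.array([1.0, 0.0])
G = np.array([[1.0, 1.0], [0.0, 1.0]])
y = 2.0 + 0.5 * np.arange(1, 51)       # noiseless linear series for illustration
states = kalman_filter(y, F, G, V=1.0, W=0.01 * np.eye(2),
                       m0=np.zeros(2), C0=1e4 * np.eye(2))
print(states[-1])
```

With a diffuse prior, the filtered level and slope lock onto the linear series after a couple of observations, which is the time-varying regression behavior described above.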




Preview: DLM Example (Constant+Trend model)

   model = Polynomial(2)
   dlm = DLM(close_px['AAPL'], model.F, G=model.G, # model
             m0=m0, C0=C0, n0=n0, s0=s0, # priors
             state_discount=.95) # discount factor
          [Figure: "Constant + Trend DLM", Nov 2008 to Nov 2009]

Preview: Stochastic volatility models


          [Figure: "JPY-USD Exchange Rate Volatility Process", observations 0 to 1000]



Future: sandbox and beyond




          ARCH / GARCH models for volatility
          Structural VAR and error correction models (ECM) for cointegrated
          processes
          Models with non-normally distributed errors
          Better data description, visualization, and interactive research tools
          More sophisticated Bayesian time series models




Conclusions




          We’ve implemented many foundational models for time series
          analysis, but the field is very broad
          User interface can and should be much improved
          Repo: http://github.com/statsmodels/statsmodels
          Docs: http://statsmodels.sourceforge.net
          Contact: pystatsmodels@googlegroups.com




McKinney, Perktold, Seabold (statsmodels)   Python Time Series Analysis   SciPy Conference 2011   29 / 29

More Related Content

PDF
Data science
PDF
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
PDF
Data Analysis and Visualization using Python
PDF
Future of Data Engineering
PDF
Model selection and cross validation techniques
ODP
Data Analysis in Python
PPTX
Predictive Analytics - An Overview
PPTX
Feature selection
Data science
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analysis and Visualization using Python
Future of Data Engineering
Model selection and cross validation techniques
Data Analysis in Python
Predictive Analytics - An Overview
Feature selection

What's hot (20)

PPTX
Bibliometrix Seminar
PDF
Feature Engineering
PPTX
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
PPTX
Telecom Churn Prediction Presentation
PDF
Introduction to Knowledge Graphs and Semantic AI
PPT
Graph Analytics for big data
PPTX
Statistics for data science
PPTX
Prediction of customer propensity to churn - Telecom Industry
PPTX
Introduction to data science
PPTX
BIG MART SALES PRIDICTION PROJECT.pptx
PDF
Naive Bayes
PDF
Decision tree
PDF
Big Data in FinTech
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Data visualization
PDF
Webinar: Real-time Business Intelligence
PPTX
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
PDF
Data Analytics PowerPoint Presentation Slides
PDF
Data platform architecture
PDF
Stock Price Trend Forecasting using Supervised Learning
Bibliometrix Seminar
Feature Engineering
Machine Learning Tutorial Part - 1 | Machine Learning Tutorial For Beginners ...
Telecom Churn Prediction Presentation
Introduction to Knowledge Graphs and Semantic AI
Graph Analytics for big data
Statistics for data science
Prediction of customer propensity to churn - Telecom Industry
Introduction to data science
BIG MART SALES PRIDICTION PROJECT.pptx
Naive Bayes
Decision tree
Big Data in FinTech
Federated Learning: ML with Privacy on the Edge 11.15.18
Data visualization
Webinar: Real-time Business Intelligence
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Data Analytics PowerPoint Presentation Slides
Data platform architecture
Stock Price Trend Forecasting using Supervised Learning
Ad

Viewers also liked (20)

PDF
Python for Financial Data Analysis with pandas
PDF
pandas: a Foundational Python Library for Data Analysis and Statistics
PDF
Data Structures for Statistical Computing in Python
PDF
Time travel and time series analysis with pandas + statsmodels
PPTX
Revenue Growth through Machine Learning
PDF
SciPy 2011 pandas lightning talk
PPTX
PyDataDC- Forecasting critical food violations at restaurants using open data
PDF
ET_with_EEG
PPTX
How Chile used social media during the Earthquake
PDF
Laughing Squid Opportunity Analysis Project
PDF
Structured Data Challenges in Finance and Statistics
PDF
Multivariate time series
PDF
What's new in pandas and the SciPy stack for financial users
PDF
Productive Data Tools for Quants
PDF
Analysis of EEG data Using ICA and Algorithm Development for Energy Comparison
PPT
Time series Forecasting using svm
PDF
Predicting Stock Market Price Using Support Vector Regression
PDF
Time series database, InfluxDB & PHP
PPTX
ForecastIT 4. Holt's Exponential Smoothing
Python for Financial Data Analysis with pandas
pandas: a Foundational Python Library for Data Analysis and Statistics
Data Structures for Statistical Computing in Python
Time travel and time series analysis with pandas + statsmodels
Revenue Growth through Machine Learning
SciPy 2011 pandas lightning talk
PyDataDC- Forecasting critical food violations at restaurants using open data
ET_with_EEG
How Chile used social media during the Earthquake
Laughing Squid Opportunity Analysis Project
Structured Data Challenges in Finance and Statistics
Multivariate time series
What's new in pandas and the SciPy stack for financial users
Productive Data Tools for Quants
Analysis of EEG data Using ICA and Algorithm Development for Energy Comparison
Time series Forecasting using svm
Predicting Stock Market Price Using Support Vector Regression
Time series database, InfluxDB & PHP
ForecastIT 4. Holt's Exponential Smoothing
Ad

Scipy 2011 Time Series Analysis in Python

  • 1. Time Series Analysis in Python with statsmodels
      Wes McKinney (1), Josef Perktold (2), Skipper Seabold (3)
      (1) Department of Statistical Science, Duke University
      (2) Department of Economics, University of North Carolina at Chapel Hill
      (3) Department of Economics, American University
      10th Python in Science Conference, 13 July 2011
      McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29
  • 2. What is statsmodels?
      A library for statistical modeling, implementing standard statistical models in Python using NumPy and SciPy
      Includes:
        Linear (regression) models of many forms
        Descriptive statistics
        Statistical tests
        Time series analysis
        ...and much more
  • 3. What is Time Series Analysis?
      Statistical modeling of time-ordered data observations
      Inferring structure, forecasting and simulation, and testing distributional assumptions about the data
      Modeling dynamic relationships among multiple time series
      Broad applications, e.g. in economics, finance, neuroscience, signal processing...
  • 4. Talk Overview
      Brief update on statsmodels development
      Aside: user interface and data structures
      Descriptive statistics and tests
      Auto-regressive moving average models (ARMA)
      Vector autoregression (VAR) models
      Filtering tools (Hodrick-Prescott and others)
      Near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models and beyond
  • 5. Statsmodels development update
      We’re now on GitHub! Join us: http://github.com/statsmodels/statsmodels
      Check out the slick Sphinx docs: http://statsmodels.sourceforge.net
      Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models
  • 6. Statsmodels development update
      Major work to be done on providing a nice integrated user interface
      We must work together to close the gap between R and Python!
      Some important areas:
        Formula framework, for specifying model design matrices
        Need integrated rich statistical data structures (pandas)
        Data visualization of results should always be a few keystrokes away
        Write a “Statsmodels for R users” guide
  • 7. Aside: statistical data structures and user interface
      While I have a captive audience...
      Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics)
        Let’s have a BoF session so I can justify this statement
      Feedback I hear is that end users find the fragmented, incohesive set of Python tools for data analysis and statistics to be confusing, frustrating, and certainly not compelling them to use Python...
        (Not to mention the packaging headaches)
  • 8. Aside: statistical data structures and user interface
      We need to “commit” ASAP (not 12 months from now) to a high level data structure(s) as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users
      Or we might as well all start programming in R...
  • 9. Example data: EEG trace data
      [Figure: line plot of the EEG trace]
  • 10. Example data: Macroeconomic data
      [Figure: three stacked panels, cpi, m1, and realgdp, plotted over 1960 to 2008]
  • 11. Example data: Stock data
      [Figure: price series for AAPL, GOOG, MSFT, and YHOO, plotted over 2001 to 2009]
  • 12. Descriptive statistics
      Autocorrelation, partial autocorrelation plots
      Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models
        acf = tsa.acf(eeg, 50)
        pacf = tsa.pacf(eeg, 50)
      [Figure: two panels, "Autocorrelation" and "Partial Autocorrelation", for lags 0 to 50]
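As a rough illustration of what an autocorrelation plot displays, the sample autocorrelation at lag k can be computed by hand. This is a pure-Python sketch, not the statsmodels implementation; `sample_acf` and the toy `series` are hypothetical names introduced here.

```python
def sample_acf(x, nlags):
    """Sample autocorrelations r_0 .. r_nlags of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    acf = []
    for k in range(nlags + 1):
        # Sum of products of deviations that are k steps apart.
        ck = sum(dev[t] * dev[t - k] for t in range(k, n))
        acf.append(ck / c0)
    return acf


# Toy data standing in for a real series such as the EEG trace.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
r = sample_acf(series, 3)
```

The lag-0 value is always 1, and every other value lies in [-1, 1], which is why the plots share that vertical range.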
  • 13. Statistical tests
      Ljung-Box test for zero autocorrelation
      Unit root test for cointegration (Augmented Dickey-Fuller test)
      Granger-causality
      Whiteness (iid-ness) and normality
      See our conference paper (when the proceedings get published!)
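To make the first of these tests concrete, here is a minimal sketch of the Ljung-Box statistic, Q = n(n+2) Σ_{k=1..h} r_k² / (n − k), where r_k is the lag-k sample autocorrelation. It computes only the statistic (no p-value) and is not the statsmodels routine; `autocorr` and `ljung_box_q` are hypothetical helper names.

```python
def autocorr(x, k):
    """Lag-k sample autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    num = sum(dev[t] * dev[t - k] for t in range(k, n))
    return num / sum(d * d for d in dev)


def ljung_box_q(x, h):
    """Ljung-Box Q statistic using lags 1..h."""
    n = len(x)
    total = sum(autocorr(x, k) ** 2 / (n - k) for k in range(1, h + 1))
    return n * (n + 2) * total


# A strongly alternating toy series gives a large Q at lag 1.
q = ljung_box_q([1.0, -1.0, 1.0, -1.0], 1)
```

Under the null of no autocorrelation, Q is compared against a chi-square distribution with h degrees of freedom; large values are evidence against whiteness.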
  • 14. Autoregressive moving average (ARMA) models
      One of the most common univariate time series models:
        $y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$
      where $E(\epsilon_t \epsilon_s) = 0$ for $t \neq s$ and $\epsilon_t \sim N(0, \sigma^2)$
      Exact log-likelihood can be evaluated via the Kalman filter, but the “conditional” likelihood is easier and commonly used
      statsmodels has tools for simulating ARMA processes with known coefficients $a_i$, $b_i$ and also estimation given specified lag orders
        import scikits.statsmodels.tsa.arima_process as ap
        ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
        nobs = 100
        y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
        y += 4  # add in constant
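The ARMA recursion itself is short enough to simulate directly. The sketch below follows the equation as written on the slide (coefficients applied with a plus sign, constant inside the recursion), which is not necessarily the convention `arma_generate_sample` uses for its lag-polynomial arguments. `simulate_arma`, its parameters, and the coefficient values are all assumptions chosen for illustration.

```python
import random


def simulate_arma(mu, ar, ma, nobs, sigma=1.0, burnin=200, seed=0):
    """Simulate y_t = mu + sum_i ar[i]*y_{t-1-i} + e_t + sum_j ma[j]*e_{t-1-j}."""
    rng = random.Random(seed)
    p, q = len(ar), len(ma)
    y = [0.0] * p    # zero pre-sample values, washed out by the burn-in
    eps = [0.0] * q  # zero pre-sample shocks
    for _ in range(nobs + burnin):
        e = rng.gauss(0.0, sigma)
        ar_part = sum(ar[i] * y[-1 - i] for i in range(p))
        ma_part = sum(ma[j] * eps[-1 - j] for j in range(q))
        y.append(mu + ar_part + e + ma_part)
        eps.append(e)
    return y[-nobs:]  # keep only the post-burn-in observations


# Hypothetical stationary ARMA(2, 1) with a constant of 1.0.
y = simulate_arma(mu=1.0, ar=[0.75, -0.25], ma=[0.5], nobs=100)
```

The burn-in period is a design choice: starting from zeros biases the first few draws, so they are discarded before returning the sample.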
  • 15. ARMA Estimation
      Several likelihood-based estimators implemented (see docs)
        model = tsa.ARMA(y)
        result = model.fit(order=(2, 1), trend='c',
                           method='css-mle', disp=-1)
        result.params
        # array([ 3.97, -0.97, -0.05, -0.13])
      Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc. available in the returned ARMAResults object
  • 16. Vector Autoregression (VAR) models
      Widely used model for modeling multiple (K-variate) time series, especially in macroeconomics:
        $Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma)$
      Matrices $A_i$ are $K \times K$.
      $Y_t$ must be a stationary process (sometimes achieved by differencing). Related class of models (VECM) for modeling nonstationary (including cointegrated) processes
  • 17. Vector Autoregression (VAR) models
        >>> model = VAR(data); model.select_order(8)
                         VAR Order Selection
        =====================================================
                  aic          bic          fpe         hqic
        -----------------------------------------------------
        0      -27.83       -27.78    8.214e-13       -27.81
        1      -28.77       -28.57    3.189e-13       -28.69
        2      -29.00       -28.64*   2.556e-13       -28.85
        3      -29.10       -28.60    2.304e-13       -28.90*
        4      -29.09       -28.43    2.330e-13       -28.82
        5      -29.13       -28.33    2.228e-13       -28.81
        6      -29.14*      -28.18    2.213e-13*      -28.75
        7      -29.07       -27.96    2.387e-13       -28.62
        =====================================================
        * Minimum
  • 18. Vector Autoregression (VAR) models
        >>> result = model.fit(2)
        >>> result.summary()  # print summary for each variable
        <snip>
        Results for equation m1
        ====================================================
                     coefficient  std. error  t-stat   prob
        ----------------------------------------------------
        const           0.004968    0.001850   2.685  0.008
        L1.m1           0.363636    0.071307   5.100  0.000
        L1.realgdp     -0.077460    0.092975  -0.833  0.406
        L1.cpi         -0.052387    0.128161  -0.409  0.683
        L2.m1           0.250589    0.072050   3.478  0.001
        L2.realgdp     -0.085874    0.092032  -0.933  0.352
        L2.cpi          0.169803    0.128376   1.323  0.188
        ====================================================
        <snip>
  • 19. Vector Autoregression (VAR) models
        >>> result = model.fit(2)
        >>> result.summary()  # print summary for each variable
        <snip>
        Correlation matrix of residuals
                        m1    realgdp        cpi
        m1        1.000000  -0.055690  -0.297494
        realgdp  -0.055690   1.000000   0.115597
        cpi      -0.297494   0.115597   1.000000
  • 20. VAR: Impulse Response analysis
      Analyze systematic impact of unit “shock” to a single variable
        irf = result.irf(10)
        irf.plot()
      [Figure: "Impulse responses", a 3 × 3 grid of panels, one for each shock/response pair among m1, realgdp, and cpi (m1 → m1, realgdp → m1, cpi → m1, and so on), over horizons 0 to 10]
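For intuition, the (non-orthogonalized) impulse responses of a VAR(p) follow from the coefficient matrices by the textbook recursion Φ_0 = I, Φ_i = Σ_{j=1..min(i,p)} Φ_{i−j} A_j. The sketch below is plain Python and only illustrates that recursion; it is not what `result.irf` does internally, and `impulse_responses` plus the example matrix are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def impulse_responses(coefs, horizon):
    """Return [Phi_0, ..., Phi_horizon] for lag matrices coefs = [A_1, ..., A_p]."""
    k = len(coefs[0])
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    phis = [identity]
    for i in range(1, horizon + 1):
        total = [[0.0] * k for _ in range(k)]
        for j in range(1, min(i, len(coefs)) + 1):
            total = matadd(total, matmul(phis[i - j], coefs[j - 1]))
        phis.append(total)
    return phis


# Hypothetical bivariate VAR(1); phis[h][r][c] is the response of
# variable r, h periods after a unit shock to variable c.
a1 = [[0.5, 0.1],
      [0.0, 0.4]]
phis = impulse_responses([a1], 10)
```

For a stationary system the entries of Φ_h decay toward zero as h grows, which is the pattern the impulse-response panels visualize.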
  • 21. VAR: Forecast Error Variance Decomposition
      Analyze contribution of each variable to forecasting error
        fevd = result.fevd(20)
        fevd.plot()
      [Figure: "Forecast error variance decomposition (FEVD)", one panel each for m1, realgdp, and cpi, showing the contributions of m1, realgdp, and cpi over horizons 0 to 20]
  • 22. VAR: Statistical tests
        In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
        Granger causality f-test
        =========================================================
        Test statistic   Critical Value   p-value          df
        ---------------------------------------------------------
              1.248787         2.387325     0.289    (4, 579)
        =========================================================
        H_0: ['cpi', 'realgdp'] do not Granger-cause m1
        Conclusion: fail to reject H_0 at 5.00% significance level
  • 23. Filtering
      Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$.
      [Figure: inflation series with its HP cyclical component and trend component, plotted from the early 1960s to the mid 2000s]
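A minimal sketch of the decomposition, assuming the usual penalized least-squares form: the trend minimizes Σ (y_t − τ_t)² + λ Σ (τ_{t+1} − 2τ_t + τ_{t−1})², which leads to the linear system (I + λ K'K) τ = y with K the second-difference matrix. The dense solver below is for illustration only and is not how statsmodels implements the filter; `solve` and `hp_filter` are hypothetical names, and λ = 1600 is the conventional choice for quarterly data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) such that y[t] = trend[t] + cycle[t]."""
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    stencil = (1.0, -2.0, 1.0)  # second-difference weights
    for r in range(n - 2):      # accumulate lamb * K'K one row of K at a time
        for i in range(3):
            for j in range(3):
                a[r + i][r + j] += lamb * stencil[i] * stencil[j]
    trend = solve(a, list(y))
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return cycle, trend


# A perfectly linear series has zero second differences, so it is all trend.
line = [2.0 * t + 1.0 for t in range(10)]
cycle, trend = hp_filter(line)
```

Larger values of λ penalize curvature more heavily and push the trend toward a straight line; smaller values let the trend follow the data more closely.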
  • 24. Filtering
      In addition to the HP filter, 2 other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available
      We refer you to our paper and the documentation for details on these:
      [Figure: two panels, "Inflation and Unemployment: BK Filtered" and "Inflation and Unemployment: CF Filtered", each showing the INFL and UNEMP series from the 1960s to the 2000s]
  • 25. Preview: Bayesian dynamic linear models (DLM)
      A state space model by another name:
        $y_t = F_t \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t)$
        $\theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t)$
      Estimation of basic model by Kalman filter recursions. Provides elegant way to do time-varying linear regressions for forecasting
      Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs
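The Kalman filter recursions are easiest to see in the simplest special case: a scalar "local level" model with F = G = 1 and known, constant variances V and W. The sketch below makes those simplifying assumptions, whereas the DLM previewed in the talk uses discount factors and priors on an unknown variance. `local_level_filter` and its arguments are hypothetical names.

```python
def local_level_filter(ys, m0, c0, v, w):
    """Kalman filter for y_t = theta_t + nu_t, theta_t = theta_{t-1} + omega_t.

    m0, c0: prior mean and variance of the state; v, w: observation and
    state noise variances. Returns the filtered means and variances.
    """
    m, c = m0, c0
    means, variances = [], []
    for y in ys:
        a, r = m, c + w          # prior for theta_t given data up to t-1
        f, q = a, r + v          # one-step-ahead forecast of y_t
        gain = r / q             # Kalman gain
        m = a + gain * (y - f)   # posterior mean after observing y_t
        c = r - gain * gain * q  # posterior variance
        means.append(m)
        variances.append(c)
    return means, variances


# With w = 0 the state is a fixed unknown level, so filtering reduces to
# ordinary Bayesian updating of a normal mean.
means, variances = local_level_filter([1.0, 1.0, 1.0], m0=0.0, c0=1.0, v=1.0, w=0.0)
```

Setting w > 0 lets the level drift over time, so older observations are gradually down-weighted; this is the mechanism behind time-varying regressions in the general DLM.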
  • 26. Preview: DLM Example (Constant+Trend model)
        model = Polynomial(2)
        dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
                  m0=m0, C0=C0, n0=n0, s0=s0,            # priors
                  state_discount=.95)                    # discount factor
      [Figure: "Constant + Trend DLM", plotted from Nov 2008 to Nov 2009]
  • 27. Preview: Stochastic volatility models
      [Figure: "JPY-USD Exchange Rate Volatility Process", plotted over observations 0 to 1000]
  • 28. Future: sandbox and beyond
      ARCH / GARCH models for volatility
      Structural VAR and error correction models (ECM) for cointegrated processes
      Models with non-normally distributed errors
      Better data description, visualization, and interactive research tools
      More sophisticated Bayesian time series models
  • 29. Conclusions
      We’ve implemented many foundational models for time series analysis, but the field is very broad
      User interface can and should be much improved
      Repo: http://github.com/statsmodels/statsmodels
      Docs: http://statsmodels.sourceforge.net
      Contact: pystatsmodels@googlegroups.com