📰 📈 Edition #46 — TIP 1: Introduction to Time Series Analysis with statsmodels


✨ Introduction

Time series analysis is essential for uncovering temporal patterns and making data-driven forecasts. Python's statsmodels library lets you model trends and seasonality and quickly generate forecasts for future scenarios.

🎯 Objective

Show how to prepare a time series in Python, fit a basic ARIMA model, and interpret its output.

📚 Theoretical Concept

A time series is an ordered sequence of observations over time. ARIMA models combine three parts:

  • AR (AutoRegressive): linear dependence on past values.
  • I (Integrated): differencing to achieve stationarity.
  • MA (Moving Average): linear dependence on past forecast errors, capturing short-term noise.
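
In compact textbook notation (not specific to statsmodels), the ARIMA(p, d, q) model can be written with the lag operator L as:

% ARIMA(p, d, q): phi(L) is the AR polynomial, theta(L) the MA polynomial,
% and (1 - L)^d applies d rounds of differencing to y_t.
\Big(1 - \sum_{i=1}^{p} \phi_i L^i\Big)(1 - L)^d \, y_t = \Big(1 + \sum_{j=1}^{q} \theta_j L^j\Big)\,\varepsilon_t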

💡 Python Example with statsmodels

#!/usr/bin/env python3
"""
TIP 1 – Introduction to Time Series Analysis with statsmodels
"""

import argparse

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm


def load_data(path: str = None) -> pd.Series:
    """
    Load a time series from a CSV file, or generate synthetic data if no path is provided.
    """
    if path:
        df = pd.read_csv(path, parse_dates=True, index_col=0)
        # Expects a single column of values; 'ME' is the month-end alias
        # (the old 'M' alias is deprecated in recent pandas).
        ts = df.iloc[:, 0].asfreq('ME').ffill()
        print(f"Loaded {len(ts)} records from {path}.")
    else:
        # Generate 3 years of monthly synthetic data ('ME' = month-end frequency).
        dates = pd.date_range(start='2020-01-01', periods=36, freq='ME')
        np.random.seed(42)
        data = np.random.normal(loc=100, scale=10, size=len(dates)).cumsum()
        ts = pd.Series(data, index=dates, name='value')
        print("Generated synthetic time series for demonstration (36 monthly points).")
    return ts


def fit_arima(ts: pd.Series, order=(1, 1, 1)):
    """
    Fit an ARIMA model to the time series and return the fitted results.
    """
    model = sm.tsa.ARIMA(ts, order=order)
    results = model.fit()
    return results


def plot_diagnostics(results, output_file: str = "diagnostics.png"):
    """
    Generate and save diagnostic plots for the fitted model.
    """
    # statsmodels builds the four standard diagnostic panels here.
    fig = results.plot_diagnostics(figsize=(10, 8))
    fig.suptitle("ARIMA Fit Diagnostic Plots", y=1.02)
    fig.tight_layout()
    plt.savefig(output_file, bbox_inches='tight')
    print(f"Diagnostic plots saved to {output_file}.")


def main():
    parser = argparse.ArgumentParser(
        description="TIP 1: Basic ARIMA with statsmodels"
    )
    parser.add_argument(
        "-i", "--input",
        dest="csv_path",
        help="Path to a CSV file containing the time series (single column)."
    )
    parser.add_argument(
        "-p", "--order",
        nargs=3,
        type=int,
        default=[1, 1, 1],
        metavar=('P', 'D', 'Q'),
        help="ARIMA order: p d q (default: 1 1 1)."
    )
    parser.add_argument(
        "-o", "--output",
        dest="output_file",
        default="diagnostics.png",
        help="Filename for saving diagnostic plots."
    )
    args = parser.parse_args()

    ts = load_data(args.csv_path)
    print("\n--- Series Summary ---")
    print(ts.describe())

    print(f"\nFitting ARIMA{tuple(args.order)} model...")
    results = fit_arima(ts, order=tuple(args.order))
    print("\n--- Model Summary ---")
    print(results.summary())

    plot_diagnostics(results, output_file=args.output_file)
    print("\nDone! Use python tip1_timeseries.py -h for help.")


if __name__ == "__main__":
    main()

🖥️ How to Use in VSCode

1. Install dependencies

pip install pandas numpy statsmodels matplotlib

2. Run the script

2.1 To use the synthetic series:

python tip1_timeseries.py

2.2 To load your own CSV and set ARIMA order:

python tip1_timeseries.py --input my_data.csv --order 2 1 2 --output diag.png

3. View output and plots

  • The output from print() will appear in the VSCode terminal.
  • The diagnostic plot will be saved as diagnostics.png (or the name you choose).
  • To view the plot directly in VSCode, use the Plots panel (when available) or open the PNG file by clicking it in the Explorer. 

🖼️ Generated Images

[Diagnostic plots produced by the script; see diagnostics.png]

🔍 Results Interpretation

  • Coefficients: the estimated AR and MA terms that capture temporal dependencies.
  • AIC/BIC: information criteria for comparing candidate models (lower is better).
  • Residuals: should behave like white noise (no remaining pattern); see the snippet below.
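
A minimal sketch of how to read these quantities off the fitted object (assuming results is the value returned by fit_arima above):

# Minimal sketch (assumes `results` from fit_arima above).
print(results.params)               # AR/MA coefficients and sigma2
print(results.aic, results.bic)     # information criteria for model comparison
print(results.resid.describe())     # residuals: should behave like white noise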

🛠️ Practical Applications

Monthly sales forecasting, anomaly detection in financial data, demand planning.

💬 Extra Tip

Run results.plot_diagnostics() to visually assess model fit and residual autocorrelation.

📅 Next Steps

Explore SARIMA for seasonal behavior or include exogenous variables with SARIMAX.

📣 Conclusion & CTA

You’ve just taken your first step into time series with statsmodels! Comment below what series you’d like to model next and share with your network.

💼 LinkedIn & Newsletters:

👉 https://www.linkedin.com/in/izairton-oliveira-de-vasconcelos-a1916351/

👉 https://www.linkedin.com/newsletters/scripts-em-python-produtividad-7287106727202742273

👉 https://www.linkedin.com/build-relation/newsletter-follow?entityUrn=7319069038595268608

💼 Company Page:

👉 https://www.linkedin.com/company/106356348/

💻 GitHub:

👉 https://github.com/IOVASCON

🏷️ Hashtags

#TimeSeries #Python #Statsmodels #DataScience #Forecasting

 

ANNEXES:


1. Model Summary Interpretation

[Model summary table produced by results.summary()]

1. AR coefficient (ar.L1 ≈ 0.9999)

  • This is extremely close to 1, which suggests the series is nearly a random walk (i.e., very strong persistence). Because we used an I=1 difference, the differenced series might still be almost non-stationary.
  • In practice, an AR coefficient this close to 1 is a signal to revisit the differencing order or try an alternative (e.g., seasonal differencing, or a transformation after which the series is stationary without differencing).

2. MA coefficient (ma.L1 ≈ -0.8760)

  • This negative moving‐average coefficient captures some of the short‐term noise structure. Its magnitude (~0.88) indicates a fairly strong single‐lag moving‐average effect in the differenced series.

3. σ² (sigma2 ≈ 93.6)

  • This is the estimated variance of the residuals. Lower values are generally “better” (less unexplained noise), but of course it depends on the scale of your data.

4. Information Criteria (AIC = 270.285, BIC = 274.951)

  • These allow model comparison: if you try ARIMA(1,1,0) or ARIMA(0,1,1), you can check which has a lower AIC or BIC. Lower is generally preferred.

5. Ljung‐Box, Jarque‐Bera, Heteroskedasticity tests

  • Ljung‐Box (Q) p-value ≈ 0.68 – A large p-value means we do not reject the null hypothesis of “no autocorrelation in residuals” at lag 1. In other words, at least at lag 1, residuals look like white noise.
  • Jarque-Bera p-value ≈ 0.68 – Fails to reject normality of residuals, so the residuals roughly follow a normal distribution.
  • Heteroskedasticity test (H) p-value ≈ 0.64 – No strong evidence of heteroskedasticity (i.e., the residual variance seems stable over time).

All three of these diagnostic tests having high p-values is a good sign: it means your residuals, at least by these tests, appear to be behaving like well‐behaved white noise (zero autocorrelation, approximately normal, homoscedastic).
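
If you prefer to get these numbers programmatically instead of reading the summary table, the fitted results object exposes the same tests (a minimal sketch using methods available on statsmodels' state-space results):

# Re-run the summary-table diagnostics directly on the fitted results.
print(results.test_serial_correlation(method='ljungbox'))   # Ljung-Box Q
print(results.test_normality(method='jarquebera'))          # Jarque-Bera
print(results.test_heteroskedasticity(method='breakvar'))   # variance stability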


2. Diagnostic Plot Overview

Your four diagnostic panels (saved in diagnostics.png) should look roughly like this:

  1. Standardized Residuals vs. Time: you want this to look like a horizontal "cloud" around 0, without obvious trends or large clusters. Prolonged runs above or below zero, or a trend, indicate under-differencing or missing structure.
  2. Histogram + Estimated Density (with Normal(0,1) overlay): the histogram bars show the empirical distribution of residuals, and the orange curve is a kernel density estimate. The green curve is the theoretical standard normal density. Ideally, the orange KDE aligns reasonably with the green Gaussian curve, implying roughly normal residuals.
  3. Normal Q-Q Plot: this plots your standardized residual quantiles against theoretical normal quantiles. Points lying close to the straight red line mean residuals ~ Normal(0,1); deviations at the tails (very top or bottom) can indicate heavy tails or outliers.
  4. Correlogram (ACF of Residuals): each blue bar at lag k shows the autocorrelation of residuals at that lag, and the shaded region is the 95% confidence band. You want most (if not all) of these bars to lie within the band, suggesting no significant autocorrelation remains.

Because your Ljung‐Box test was not significant and the Q-Q/Jarque-Bera also did not reject normality, you should see:

  • Residual points clustered around zero without obvious patterns.
  • A histogram roughly bell‐shaped.
  • A Q-Q scatter hugging the diagonal.
  • A correlogram with bars mostly inside the shaded region.


3. Next Steps & Recommendations

A) Check Stationarity in a Different Way

  1. Even though you differenced once, the AR coefficient ~1 suggests borderline nonstationarity.
  2. Run an Augmented Dickey‐Fuller (ADF) or KPSS test on the raw series and on the differenced series to confirm stationarity.

from statsmodels.tsa.stattools import adfuller, kpss

# Note the opposite null hypotheses: ADF's null is a unit root (non-stationary),
# while KPSS's null is stationarity.
raw_adf = adfuller(ts, autolag='AIC')
diff_adf = adfuller(ts.diff().dropna(), autolag='AIC')
raw_kpss = kpss(ts, regression='c', nlags='auto')
diff_kpss = kpss(ts.diff().dropna(), regression='c', nlags='auto')

print("Raw ADF p-value:  ", raw_adf[1])
print("Diff ADF p-value: ", diff_adf[1])
print("Raw KPSS p-value: ", raw_kpss[1])
print("Diff KPSS p-value:", diff_kpss[1])

  • If the differenced series still appears nonstationary, you might need to double-difference or consider a different transformation (e.g., a log-difference; a quick sketch follows).
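
A minimal sketch of that log-difference transformation (it assumes every value in ts is strictly positive, e.g. sales or prices):

# Log-difference: dampens a multiplicative trend and stabilizes variance.
# Only valid when all values in ts are strictly positive.
ts_log_diff = np.log(ts).diff().dropna()
print(ts_log_diff.describe())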

B) Compare Alternative ARIMA Orders

  • Try ARIMA(1,1,0), ARIMA(0,1,1), or perhaps ARIMA(2,1,1). Examine which yields a lower AIC/BIC and at the same time passes residual diagnostics.

for order in [(1, 1, 0), (0, 1, 1), (2, 1, 1)]:
    m = sm.tsa.ARIMA(ts, order=order).fit()
    print(f"Order={order} → AIC={m.aic:.3f}, BIC={m.bic:.3f}")

C) Try a SARIMA Model if Seasonality Exists

  • If your data has a clear yearly or quarterly pattern, you may need a Seasonal ARIMA (SARIMA). For example,

seasonal_order=(1, 1, 1, 12)  # (P, D, Q, s) for monthly seasonality
sarima_mod = sm.tsa.statespace.SARIMAX(ts, order=(1,1,1), seasonal_order=seasonal_order).fit()
print(sarima_mod.summary())
sarima_mod.plot_diagnostics(figsize=(10,8))        

D) Forecasting & Confidence Intervals

  • Once you’re satisfied with the in‐sample diagnostics, generate out‐of‐sample forecasts:

forecast_steps = 12  # e.g., the next 12 months
fc = results.get_forecast(steps=forecast_steps)
mean_forecast = fc.predicted_mean
conf_int = fc.conf_int(alpha=0.05)

# Plot observed values, the forecast, and its 95% confidence band:
ax = ts.plot(label='Observed', figsize=(10, 6))
mean_forecast.plot(ax=ax, label='Forecast', color='red')
ax.fill_between(conf_int.index,
                conf_int.iloc[:, 0],
                conf_int.iloc[:, 1],
                color='pink', alpha=0.3)
ax.legend()
plt.show()

E) Validate on Real Data

If you switch from synthetic data to a real CSV of actual time-series observations (e.g., monthly sales), be sure to check:

  • Missing values: use asfreq('ME').ffill() (the fillna(method='ffill') form is deprecated in recent pandas) or another imputation method.
  • Outliers: visually inspect, or use robust scaling if large spikes exist.
  • Exogenous variables: if you have external regressors (e.g., promotions, holiday flags), consider SARIMAX with exog= (see the sketch below).
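
A sketch of the exogenous-regressor idea (the promo column is hypothetical; it assumes your DataFrame df carries an extra regressor aligned on the same index as the series):

# Hypothetical sketch: `promo` is an external regressor (e.g., a promotion flag)
# aligned on the same monthly DatetimeIndex as ts.
exog = df[['promo']]

sarimax_mod = sm.tsa.statespace.SARIMAX(
    ts,
    exog=exog,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
).fit()
print(sarimax_mod.summary())

# Forecasting then also needs future values of the regressor:
# fc = sarimax_mod.get_forecast(steps=12, exog=future_promo)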
