Time Series Analysis: Challenge Kaggle with TensorFlow

Taegyun Jeon and Seunghyun Jeon
Developer of Satrec Initiative
Time Series Analysis: Build It with TensorFlow and Take On Kaggle

Contents
● Time Series Analysis
● Introduction to Kaggle
● KaggleZeroToAll

After completing this codelab, you will:
1. Understand time series problems
2. Be able to solve problems on Kaggle
3. Have uploaded your own model to the Kaggle Leaderboard

Time Series Analysis
● Time Series Analysis
● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN
● TensorFlow TimeSeries API (TFTS)

Time Series Data
● Stock values
● Economic variables
● Weather
● Sensor: Internet-of-Things
● Energy demand
● Signal processing
● Sales forecasting

The Problem
● Standard Supervised Learning
○ IID assumption
○ Same distribution for training and test data
○ Distributions fixed over time (stationarity)
● Time Series
○ None of these assumptions hold!

Autoregressive (AR) Models
● AR(p) model
: Linear generative model based on the pth-order Markov assumption
$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$
○ $\epsilon_t$: zero-mean uncorrelated random variables with variance $\sigma^2$
○ $a_1, \dots, a_p$: autoregressive coefficients
○ $Y_t$: observed stochastic process
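
To make the AR(p) recursion $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$ concrete, here is a minimal NumPy sketch that simulates such a process and recovers the coefficients by ordinary least squares on lagged values. The function names `simulate_ar` and `fit_ar`, the Gaussian noise, and the coefficient values are assumptions chosen for illustration; they are not part of the original slides.

```python
import numpy as np

def simulate_ar(coeffs, n, sigma=1.0, seed=0):
    """Simulate Y_t = sum_i coeffs[i-1] * Y_{t-i} + eps_t from zero initial values."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n + p)  # the first p entries are the zero initial conditions
    eps = rng.normal(0.0, sigma, size=n + p)
    for t in range(p, n + p):
        # y[t-p:t][::-1] is (Y_{t-1}, ..., Y_{t-p})
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
    return y[p:]

def fit_ar(y, p):
    """Estimate AR(p) coefficients by least squares on lagged values."""
    # Column i-1 holds Y_{t-i} for t = p, ..., len(y) - 1
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coeffs

true_coeffs = np.array([0.6, -0.3])  # illustrative values for a stationary AR(2)
series = simulate_ar(true_coeffs, n=5000)
estimated = fit_ar(series, p=2)
print(estimated)  # should land close to true_coeffs
```

With a few thousand samples the estimates typically fall within a few hundredths of the true coefficients; shorter series give noisier estimates.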

Moving Average (MA)
● MA(q) model
: Linear generative model for the noise term based on the qth-order Markov assumption
$Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
○ $b_1, \dots, b_q$: moving average coefficients
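
As a rough illustration of the MA(q) model $Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$, the sketch below simulates an MA(2) series and checks a well-known property: the autocorrelation vanishes beyond lag q. The helper names, the Gaussian noise, and the coefficients are assumptions for this example only.

```python
import numpy as np

def simulate_ma(coeffs, n, sigma=1.0, seed=0):
    """Simulate Y_t = eps_t + sum_j coeffs[j-1] * eps_{t-j}."""
    rng = np.random.default_rng(seed)
    q = len(coeffs)
    eps = rng.normal(0.0, sigma, size=n + q)
    weights = np.concatenate(([1.0], coeffs))  # (1, b_1, ..., b_q)
    # 'valid' convolution keeps only the time steps with a full noise window
    return np.convolve(eps, weights, mode="valid")

def sample_autocorr(y, lag):
    """Sample autocorrelation of y at a positive lag."""
    y = y - y.mean()
    return np.dot(y[:-lag], y[lag:]) / np.dot(y, y)

series = simulate_ma(np.array([0.8, 0.4]), n=20000)
acf = [sample_autocorr(series, k) for k in range(1, 6)]
print(np.round(acf, 3))  # lags 1 and 2 are clearly non-zero, lags 3+ are near zero
```

For these coefficients the theoretical autocorrelations are 1.12/1.8 at lag 1 and 0.4/1.8 at lag 2, and exactly zero afterwards.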

ARMA Model
● ARMA(p,q) model
: Generative linear model that combines the AR(p) and MA(q) models
$Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$

Stationarity
● Definition: a sequence of random variables is stationary if its
distribution is invariant to shifting in time.
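
One way to see what this definition rules out is to compare white noise with a random walk. The sketch below, using made-up simulation sizes, estimates the variance across many independent realizations at an early and a late time step: for white noise it stays the same, while for a random walk it grows with time, so the distribution is not invariant to time shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 2000, 200  # illustrative sizes

noise = rng.normal(size=(n_paths, n_steps))  # white noise: stationary
walk = np.cumsum(noise, axis=1)              # random walk: not stationary

# Variance across realizations at time step 10 and at the final time step
noise_var_early, noise_var_late = noise[:, 9].var(), noise[:, -1].var()
walk_var_early, walk_var_late = walk[:, 9].var(), walk[:, -1].var()

print(noise_var_early, noise_var_late)  # both close to 1
print(walk_var_early, walk_var_late)    # roughly 10 versus roughly 200
```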

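Lag Operator
● Definition: the lag operator $\mathcal{L}$ is defined by $\mathcal{L} Y_t = Y_{t-1}$
● ARMA model in terms of the lag operator:
$\left(1 - \sum_{i=1}^{p} a_i \mathcal{L}^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j \mathcal{L}^j\right) \epsilon_t$
● The characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$
can be used to study properties of this stochastic process.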
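
One property commonly read off the characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ is stationarity: an AR process is stationary when every root of $P$ lies strictly outside the unit circle. The following sketch checks this numerically with `np.roots`; the helper names and example coefficients are assumptions for illustration.

```python
import numpy as np

def ar_characteristic_roots(coeffs):
    """Roots of P(z) = 1 - a_1 z - ... - a_p z^p."""
    # np.roots expects coefficients ordered from the highest degree to the constant
    poly = np.concatenate((-np.asarray(coeffs, dtype=float)[::-1], [1.0]))
    return np.roots(poly)

def is_stationary_ar(coeffs):
    """True if every root of P(z) lies strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_characteristic_roots(coeffs)) > 1.0))

stable = is_stationary_ar([0.6, -0.3])  # complex roots with modulus about 1.83
explosive = is_stationary_ar([1.1])     # single root at z = 1 / 1.1, inside the circle
print(stable, explosive)
```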

ARIMA Model
● Definition: Non-stationary processes can be modeled using processes
whose characteristic polynomial has unit roots.
● A characteristic polynomial with unit roots can be factored:
$P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots
● The ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - \mathcal{L})^D Y_t$
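
In practice, applying $(1 - \mathcal{L})^D$ means differencing the series D times. The sketch below, with made-up data, differences a once-integrated series with `np.diff`, confirms that the stationary increments come back, and then undoes the differencing with a cumulative sum (the step needed to map forecasts back to the original scale). The helper name `difference` is hypothetical.

```python
import numpy as np

def difference(y, d):
    """Apply (1 - L)^d, i.e. difference the series d times."""
    return np.diff(y, n=d)

rng = np.random.default_rng(0)
increments = rng.normal(loc=0.5, scale=1.0, size=500)  # stationary increments
y = np.cumsum(increments)                              # integrated once: unit root

dy = difference(y, d=1)
recovered = np.allclose(dy, increments[1:])  # differencing recovers the increments

# Values on the differenced scale are mapped back by cumulative summation
restored = np.concatenate(([y[0]], y[0] + np.cumsum(dy)))
round_trip = np.allclose(restored, y)
print(recovered, round_trip)
```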

Other Extensions
● Further variants:
○ Models with seasonal components (SARIMA)
○ Models with side information (ARIMAX)
○ Models with long-memory (ARFIMA)
○ Multi-variate time series model (VAR)
○ Models with time-varying coefficients
○ other non-linear models

Is there an easy way to implement all this?

TensorFlow TimeSeries
● tf.contrib.timeseries
○ Classic model (state space, autoregressive)
○ Flexible infrastructure
○ Data management
■ Chunking
■ Batching
■ Saving model
■ Truncated backpropagation

https://www.kaggle.com/

A data playground where you can play with data to your heart's content

How to Play on Kaggle
1. Pick a competition
2. Review and analyze the problem and the data
3. See how other people approach it
4. Build your own solution

Types of Competitions
1. Featured: companies and institutions put up prize money
2. Research: competitions for research purposes
3. Playground: practice problems
4. Getting Started: practice problems

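Some Common Competition Rules
1. The number of submissions per day is limited
2. Only a fixed fraction of the test set counts toward the public score
3. Final scores are revealed when the competition ends
4. The dataset stays accessible even after the competition ends!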

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

Forecasting unit sales for brick-and-mortar grocery stores

If it looks complicated...
use an analysis someone else has already done well:
https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda

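In most competitions, the most upvoted kernels are EDA (exploratory data analysis) kernels.
When you first join a competition, start by reading the EDA kernels.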

https://www.kaggle.com/towever/devfest

Prepare
# -*- coding: utf-8 -*-
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Read Dataset
dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1,2,3,4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip the earlier dates
                    )
train.loc[(train.unit_sales < 0),'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = np.log1p(train['unit_sales'])  # log(1 + x) transform
train['dow'] = train['date'].dt.dayofweek  # day of week, Monday = 0
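
The log transform in this step is undone later with `expm1` before submission. A minimal sketch of that round trip, using made-up sales values (including a negative one standing in for a return), might look like this:

```python
import numpy as np
import pandas as pd

sales = pd.Series([-2.0, 0.0, 1.0, 9.0, 99.0])  # made-up unit sales
sales = sales.clip(lower=0)                      # same effect as zeroing negatives
logged = np.log1p(sales)                         # log(1 + x), well defined at x = 0
restored = np.expm1(logged)                      # inverse transform, exp(x) - 1

print(logged.round(3).tolist())
print(restored.round(3).tolist())
```

Using `log1p` rather than a plain logarithm keeps zero sales at zero instead of producing negative infinity.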

Preprocess data
# create records for all items, in all stores, on all dates
# so that daily unit sales averages are calculated correctly
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date','store_nbr','item_nbr']
    )
)
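
Why the reindexing matters: days on which an item did not sell are simply absent from the data, so a plain mean would overstate sales. The toy example below, with made-up rows, builds the full date × store × item grid the same way and shows how the mean changes once the missing combinations are filled with zeros.

```python
import pandas as pd

# Made-up sparse sales: store 1 has no row for 2017-08-02
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02"]),
    "store_nbr": [1, 2, 2],
    "item_nbr": [10, 10, 10],
    "unit_sales": [3.0, 5.0, 4.0],
})

full_index = pd.MultiIndex.from_product(
    (sales["date"].unique(), sales["store_nbr"].unique(), sales["item_nbr"].unique()),
    names=["date", "store_nbr", "item_nbr"],
)
dense = (
    sales.set_index(["date", "store_nbr", "item_nbr"])
    .reindex(full_index)          # missing combinations appear as NaN rows
    .fillna({"unit_sales": 0})    # treat them as zero sales
    .reset_index()
)

# Store 1 now averages (3 + 0) / 2 instead of 3
store_means = dense.groupby("store_nbr")["unit_sales"].mean()
print(store_means)
```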

Preprocess data
train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs
train.reset_index(inplace=True)  # restore the index levels as regular columns
lastdate = train.iloc[train.shape[0]-1].date  # last day in the data
train.head()

Preprocess data
# madw: mean sales per item, store, and day of week
tmp = train[['item_nbr','store_nbr','dow','unit_sales']]
ma_dw = tmp.groupby(['item_nbr','store_nbr','dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()

Preprocess data
# mawk: mean of the day-of-week averages per item and store
tmp = ma_dw[['item_nbr','store_nbr','madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()
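
The two tables `ma_dw` and `ma_wk` are later combined as the ratio `madw / mawk`, which acts as a day-of-week adjustment factor. The following toy example, using made-up sales for a single item and store, shows how that factor comes out; the column name `dow_factor` is introduced here for illustration only.

```python
import pandas as pd

# Made-up sales for one item in one store over two weeks (weekend-heavy)
df = pd.DataFrame({
    "item_nbr": 10,
    "store_nbr": 1,
    "dow": [0, 1, 2, 3, 4, 5, 6] * 2,
    "unit_sales": [1, 1, 1, 1, 2, 4, 4, 1, 1, 1, 1, 2, 4, 4],
})

keys = ["item_nbr", "store_nbr"]
ma_dw = df.groupby(keys + ["dow"])["unit_sales"].mean().to_frame("madw").reset_index()
ma_wk = ma_dw.groupby(keys)["madw"].mean().to_frame("mawk").reset_index()

factors = ma_dw.merge(ma_wk, on=keys)
factors["dow_factor"] = factors["madw"] / factors["mawk"]
print(factors[["dow", "madw", "mawk", "dow_factor"]])
```

Here the weekly mean is 2, so Saturday (dow 5) gets a factor of 2 and Monday (dow 0) a factor of 0.5; the factors average to 1 across the week.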

Moving Average using Pandas
tmp = train[['item_nbr','store_nbr','unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')

# averages over progressively shorter trailing windows (in days)
for i in [112,56,28,14,7,3,1]:
    tmp = train[train.date>lastdate-timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr','store_nbr'])['unit_sales'].mean().to_frame('mais'+str(i))
    ma_is = ma_is.join(tmpg, how='left')
    del tmp,tmpg

ma_is['mais']=ma_is.median(axis=1)  # median across all window averages
ma_is.reset_index(inplace=True)
ma_is.head()
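
The idea behind `mais` is to average sales over several trailing windows and take the median of those averages, which is less sensitive to a recent spike than any single short window. A small sketch on a made-up series (flat at 1.0 with a jump to 5.0 on the last three days) illustrates this; the window lengths and the column name `mais_all` are chosen for the example.

```python
from datetime import timedelta

import numpy as np
import pandas as pd

dates = pd.date_range("2017-07-01", periods=28, freq="D")
# Made-up series: flat at 1.0, then 5.0 on the final 3 days
df = pd.DataFrame({
    "date": dates,
    "item_nbr": 10,
    "store_nbr": 1,
    "unit_sales": np.where(np.arange(28) >= 25, 5.0, 1.0),
})
lastdate = df["date"].max()
keys = ["item_nbr", "store_nbr"]

ma = df.groupby(keys)["unit_sales"].mean().to_frame("mais_all")
for days in [14, 7, 3, 1]:
    recent = df[df["date"] > lastdate - timedelta(days)]
    ma = ma.join(recent.groupby(keys)["unit_sales"].mean().to_frame(f"mais{days}"))

ma["mais"] = ma.median(axis=1)  # median over the five window averages
print(ma.round(3))
```

The 3-day and 1-day averages are both 5.0, but the median lands on the 7-day average (19/7, about 2.71), damping the effect of the jump.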

Make data trainable
def data_to_npreader(store_nbr: int, item_nbr: int) -> tuple:
    """Return (times, values, NumpyReader) for one store-item series."""
    unit_sales = train[np.logical_and(train["store_nbr"] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.asarray(range(len(unit_sales)))  # integer time steps 0..n-1
    y = np.asarray(unit_sales)
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader

TensorFlow TimeSeries - ARRegressor
x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)  # 40 = input_window_size + output_window_size
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)
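
The training setup here draws random windows of 40 time steps, which matches the 30 input steps plus 10 output steps of the regressor. The NumPy sketch below only illustrates that windowing idea; it is not how TFTS implements `RandomWindowInputFn`, and the function name `sample_windows` and the stand-in data are assumptions.

```python
import numpy as np

def sample_windows(values, batch_size, input_size, output_size, seed=0):
    """Draw random contiguous windows and split each into (inputs, targets)."""
    rng = np.random.default_rng(seed)
    window = input_size + output_size
    # Valid start positions leave room for a full window
    starts = rng.integers(0, len(values) - window + 1, size=batch_size)
    batch = np.stack([values[s:s + window] for s in starts])
    return batch[:, :input_size], batch[:, input_size:]

values = np.arange(100, dtype=float)  # stand-in for one store-item sales series
inputs, targets = sample_windows(values, batch_size=32, input_size=30, output_size=10)
print(inputs.shape, targets.shape)
```

Each row pairs 30 consecutive observations with the 10 that immediately follow, which is the supervised signal an autoregressive model trains on.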

evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
# forecast 16 steps past the end of the evaluated series
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))

plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()

TensorFlow TimeSeries - LSTM
Get the LSTM model class (_LSTMModel) from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py

x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)

(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))

plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()

Forecasting test data
# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek

# Moving Average
test = pd.merge(test, ma_is, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr','store_nbr','dow'])
test['unit_sales'] = test.mais
# Autoregressive: clip negatives, then overwrite the moving-average forecast
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574), 'unit_sales'] = \
    ar_predictions['mean'].reshape(-1)  # flatten (steps, 1) to (steps,)
# LSTM: same treatment for the second store
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574), 'unit_sales'] = \
    lstm_predictions['mean'].reshape(-1)

pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
# scale by the day-of-week factor madw / mawk
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = np.expm1(test['unit_sales'])  # undo log1p to restore unit values

holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not moved
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # dates without a holiday
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # holiday uplift
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # promotion uplift
test[['id','unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')
