Taegyun Jeon and Seunghyun Jeon
Developer of Satrec Initiative
Time Series Analysis: Build It in TensorFlow and Take On Kaggle
Time Series Analysis
Introduction to Kaggle
KaggleZeroToAll
Contents
After this codelab, you will be able to:
1. Understand time series problems
2. Solve problems on Kaggle
3. Submit your own model to the Kaggle Leaderboard
Time Series Analysis
● Time Series Analysis
● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN
● TensorFlow TimeSeries API (TFTS)
Time Series Data
● Stock values
● Economic variables
● Weather
● Sensor: Internet-of-Things
● Energy demand
● Signal processing
● Sales forecasting
Time Series Analysis: Challenge Kaggle with TensorFlow
The Problem
● Standard Supervised Learning
○ IID assumption
○ Same distribution for training and test data
○ Distributions fixed over time (stationarity)
● Time Series
○ None of these assumptions hold!
Autoregressive (AR) Models
● AR(p) model: linear generative model based on the pth-order Markov assumption
  $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$
○ $\epsilon_t$: zero-mean uncorrelated random variables with variance $\sigma^2$
○ $a_1, \dots, a_p$: autoregressive coefficients
○ $Y_t$: observed stochastic process
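To make the AR(p) recursion concrete, here is a minimal sketch of multi-step forecasting with known coefficients. It assumes the usual form Y_t = a_1 Y_{t-1} + ... + a_p Y_{t-p} + noise and replaces each future noise term with its mean of zero. The function name `ar_forecast` and the sample coefficients are illustrative only; they are not part of the slides.

```python
def ar_forecast(history, coeffs, steps):
    """Iterate the AR(p) recursion forward, with future noise set to zero.

    history: observed values, oldest first (needs at least p entries)
    coeffs:  [a_1, ..., a_p], where a_i multiplies Y_{t-i}
    """
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p observations")
    values = list(history)
    for _ in range(steps):
        # values[-1] is Y_{t-1}, values[-2] is Y_{t-2}, and so on.
        nxt = sum(a * values[-i] for i, a in enumerate(coeffs, start=1))
        values.append(nxt)
    return values[len(history):]


# Two-step forecast from an AR(2) model with hypothetical coefficients.
print(ar_forecast([1.0, 2.0], [0.5, 0.25], steps=2))  # [1.25, 1.125]
```

Each forecast is fed back in as if it had been observed, so forecast errors compound as the horizon grows.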
Moving Average (MA)
● MA(q) model: linear generative model for the noise term based on the qth-order Markov assumption
  $Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
○ $b_1, \dots, b_q$: moving average coefficients
ARMA Model
● ARMA(p,q) model: generative linear model that combines the AR(p) and MA(q) models
  $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$
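One way to get a feel for how the AR and MA parts combine is to simulate a series. The sketch below assumes Gaussian noise and zero initial conditions, neither of which the slides specify. `simulate_arma` is a hypothetical helper name. Passing an empty AR list gives a pure MA(q) process, and an empty MA list gives a pure AR(p) process.

```python
import random


def simulate_arma(ar_coeffs, ma_coeffs, n_steps, sigma=1.0, seed=0):
    """Simulate Y_t = sum_i a_i*Y_{t-i} + eps_t + sum_j b_j*eps_{t-j}."""
    rng = random.Random(seed)
    p, q = len(ar_coeffs), len(ma_coeffs)
    y = [0.0] * p      # past values, zero initial conditions (assumption)
    eps = [0.0] * q    # past noise terms, also started at zero
    out = []
    for _ in range(n_steps):
        e = rng.gauss(0.0, sigma)
        ar_part = sum(a * y[-i] for i, a in enumerate(ar_coeffs, start=1))
        ma_part = sum(b * eps[-j] for j, b in enumerate(ma_coeffs, start=1))
        value = ar_part + e + ma_part
        y.append(value)
        eps.append(e)
        out.append(value)
    return out


series = simulate_arma([0.6, -0.2], [0.4], n_steps=200)
print(len(series), series[:3])
```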
Stationarity
● Definition: a sequence of random variables is stationary if its
distribution is invariant to shifting in time.
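A quick empirical check can illustrate the definition. The sketch below, under the assumption of Gaussian noise and a start at zero, compares an AR(1) process with coefficient 0.5 against a random walk (coefficient 1.0). Across many simulated paths, the variance of the AR(1) value settles near a constant, while the variance of the random walk keeps growing with time, so its distribution is not invariant to time shifts. `final_values` is an illustrative helper name.

```python
import random
import statistics


def final_values(a, n_steps, n_paths, seed=0):
    """Return Y_{n_steps} for many independent paths of Y_t = a*Y_{t-1} + eps_t."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        y = 0.0
        for _ in range(n_steps):
            y = a * y + rng.gauss(0.0, 1.0)
        finals.append(y)
    return finals


for n_steps in (10, 100):
    var_ar = statistics.pvariance(final_values(0.5, n_steps, 1000))
    var_walk = statistics.pvariance(final_values(1.0, n_steps, 1000))
    print(n_steps, round(var_ar, 2), round(var_walk, 2))
```

For the random walk the theoretical variance after t steps is t times the noise variance, which is why the second column grows roughly tenfold between the two rows.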
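Lag Operator
● Definition: the lag operator $\mathcal{L}$ is defined by $\mathcal{L} Y_t = Y_{t-1}$.
● ARMA model in terms of the lag operator:
  $\left(1 - \sum_{i=1}^{p} a_i \mathcal{L}^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j \mathcal{L}^j\right) \epsilon_t$
● The characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ can be used to study properties of this stochastic process.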
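ARIMA Model
● Definition: non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.
● A characteristic polynomial with unit roots can be factored: $P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots.
● The ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - \mathcal{L})^D Y_t$:
  $\left(1 - \sum_{i=1}^{p} a_i \mathcal{L}^i\right) (1 - \mathcal{L})^D Y_t = \left(1 + \sum_{j=1}^{q} b_j \mathcal{L}^j\right) \epsilon_t$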
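The "I" in ARIMA amounts to differencing the series D times before fitting an ARMA model. A minimal sketch of that differencing step, with a hypothetical helper name `difference`, is shown below. A series with a quadratic trend becomes constant after two rounds of differencing.

```python
def difference(series, d=1):
    """Apply the first-difference operator (Y_t - Y_{t-1}) d times."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


quadratic = [t * t for t in range(6)]   # 0, 1, 4, 9, 16, 25
print(difference(quadratic, d=1))       # [1, 3, 5, 7, 9]
print(difference(quadratic, d=2))       # [2, 2, 2, 2]
```

Each round of differencing shortens the series by one observation, so the differenced series has D fewer points than the original.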
Other Extensions
● Further variants:
○ Models with seasonal components (SARIMA)
○ Models with side information (ARIMAX)
○ Models with long-memory (ARFIMA)
○ Multi-variate time series model (VAR)
○ Models with time-varying coefficients
○ Other non-linear models
Recurrent Neural Networks
Is there an easy way to implement this?
TensorFlow TimeSeries
● tf.contrib.timeseries
○ Classic model (state space, autoregressive)
○ Flexible infrastructure
○ Data management
■ Chunking
■ Batching
■ Saving model
■ Truncated backpropagation
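Chunking and batching can be pictured as drawing random fixed-length windows from one long series and grouping them into a batch. The sketch below is a plain-Python illustration of that idea only. It is not the tf.contrib.timeseries implementation, and `sample_windows` with its parameters is a hypothetical name chosen for this example.

```python
import random


def sample_windows(times, values, window_size, batch_size, seed=0):
    """Draw batch_size random contiguous windows of length window_size."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if window_size > len(values):
        raise ValueError("window_size is longer than the series")
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        # randint is inclusive on both ends, so the last valid start is covered.
        start = rng.randint(0, len(values) - window_size)
        stop = start + window_size
        batch.append((times[start:stop], values[start:stop]))
    return batch


times = list(range(100))
values = [0.1 * t for t in times]
batch = sample_windows(times, values, window_size=5, batch_size=4)
for window_times, window_values in batch:
    print(window_times)
```

Training on many short windows instead of the whole series keeps each gradient step cheap and bounds how far backpropagation has to reach through time.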
Is it really that easy?
Let's start with an example
Introduction to Kaggle
https://www.kaggle.com/
What is Kaggle?
A data playground where you can play with data as much as you like
How to play on Kaggle
1. Pick a competition
2. Review and analyze the problem and the data
3. See how other people approach it
4. Build your own solution
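Types of competitions
1. Featured: companies and organizations put up prize money for the competition
2. Research: competitions for research purposes
3. Playground: practice problems
4. Getting Started: practice problems
Some common competition rules
1. The number of submissions per day is limited
2. Only a fixed fraction of the test set counts toward the Public Score
3. Final scores are revealed when the competition ends
4. The dataset remains accessible after the competition ends!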
https://www.kaggle.com/c/favorita-grocery-sales-forecasting
Predicting sales for brick-and-mortar grocery stores
If it looks complicated…
Build on someone else's good analysis:
https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda
In most competitions, the most upvoted kernel is an EDA
When you first join a competition, start by reading the EDA kernels
https://www.kaggle.com/towever/devfest
KaggleZeroToAll
# -*- coding: utf-8 -*-
# Note: tf.contrib (including tf.contrib.timeseries) exists only in TensorFlow 1.x.
import datetime
from datetime import timedelta
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model
import matplotlib
import matplotlib.pyplot as plt
# Jupyter notebook magic: render plots inline
%matplotlib inline
Prepare
dtypes = {'id':'int64', 'item_nbr':'int32', 'store_nbr':'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1,2,3,4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip initial dates
                    )
train.loc[(train.unit_sales < 0), 'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = np.log1p(train['unit_sales'])  # log(1 + x) transform; pd.np is gone from recent pandas
train['dow'] = train['date'].dt.dayofweek  # day of week, Monday = 0
Read Dataset
# creating records for all items, in all markets on all dates
# for correct calculation of daily unit sales averages.
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date','store_nbr','item_nbr']
    )
)
Preprocess data
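The reindexing step is easier to follow on a toy table. The sketch below uses made-up sales rows (not the Favorita data) to show how `pd.MultiIndex.from_product` creates one row per date, store, and item combination, so that days with no recorded sale appear explicitly and can be filled with zero.

```python
import pandas as pd

# Hypothetical toy data: store 2 has no record on 2017-08-02.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-01", "2017-08-01", "2017-08-02"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [10, 10, 10],
    "unit_sales": [3.0, 5.0, 4.0],
})

# Every (date, store, item) combination, whether or not it was observed.
full_index = pd.MultiIndex.from_product(
    (sales["date"].unique(), sales["store_nbr"].unique(), sales["item_nbr"].unique()),
    names=["date", "store_nbr", "item_nbr"],
)

dense = (
    sales.set_index(["date", "store_nbr", "item_nbr"])
    .reindex(full_index)             # missing combinations become NaN rows
    .fillna({"unit_sales": 0.0})     # treat "no record" as zero sales
    .reset_index()
)
print(dense)
```

Without this step, a per-item mean would be taken only over days on which the item actually sold, which overstates average daily sales.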
train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs (assignment avoids in-place edits on a copy)
train.reset_index(inplace=True)  # reset index, restoring date/store_nbr/item_nbr as columns
lastdate = train['date'].max()  # last day in the data
train.head()
Preprocess data
tmp = train[['item_nbr','store_nbr','dow','unit_sales']]
ma_dw = tmp.groupby(['item_nbr','store_nbr','dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()
Preprocess data
tmp = ma_dw[['item_nbr','store_nbr','madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()
Preprocess data
tmp = train[['item_nbr','store_nbr','unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')
Moving Average using Pandas
for i in [112,56,28,14,7,3,1]:
    tmp = train[train.date>lastdate-timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr','store_nbr'])['unit_sales'].mean().to_frame('mais'+str(i))
    ma_is = ma_is.join(tmpg, how='left')
del tmp,tmpg
Moving Average using Pandas
ma_is['mais']=ma_is.median(axis=1)
ma_is.reset_index(inplace=True)
ma_is.head()
Moving Average using Pandas
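The baseline built above is the median of several trailing means: one over the whole history and one for each of the last 112, 56, 28, 14, 7, 3, and 1 days. A small plain-Python sketch of the same idea for a single series follows. The helper name `median_of_trailing_means` and the toy series are illustrative assumptions.

```python
import statistics


def median_of_trailing_means(values, windows=(112, 56, 28, 14, 7, 3, 1)):
    """Median of the full-history mean and the means of the last w values."""
    means = [statistics.mean(values)]  # whole-history mean
    for w in windows:
        # If w exceeds the series length, the slice is just the whole series.
        means.append(statistics.mean(values[-w:]))
    return statistics.median(means)


# 100 quiet days followed by a 3-day spike.
series = [0.0] * 100 + [10.0] * 3
print(median_of_trailing_means(series))
```

Taking the median across window lengths damps both extremes: the short windows react to the recent spike, the long windows barely notice it, and the result lands in between.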
def data_to_npreader(store_nbr: int, item_nbr: int):
    """Return (times, values, NumpyReader) for one store/item series."""
    unit_sales = train[np.logical_and(train["store_nbr"] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.asarray(range(len(unit_sales)))  # integer time steps 0..N-1
    y = np.asarray(unit_sales)              # log-transformed unit sales
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader
Make data trainable
x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
# each training window holds 30 input steps plus 10 output steps
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)
TensorFlow TimeSeries - ARRegressor
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
# forecast 16 steps past the end of the evaluated series
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
TensorFlow TimeSeries - ARRegressor
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
TensorFlow TimeSeries - ARRegressor
TensorFlow TimeSeries - LSTM
Get the LSTM model class (_LSTMModel) from: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py
TensorFlow TimeSeries - LSTM
# _LSTMModel must be defined first, using the lstm.py example linked above
x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)
TensorFlow TimeSeries - LSTM
(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))
TensorFlow TimeSeries - LSTM
plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()
Forecasting test data
# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek
Forecasting test data
# Moving Average
test = pd.merge(test, ma_is, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr','store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr','store_nbr','dow'])
test['unit_sales'] = test.mais
# Autoregressive
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0  # clip negative forecasts
# flatten the predictions so they line up with the selected rows
test.loc[np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574), 'unit_sales'] = \
    ar_predictions['mean'].reshape(-1)
# LSTM
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0  # clip negative forecasts
test.loc[np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574), 'unit_sales'] = \
    lstm_predictions['mean'].reshape(-1)
Forecasting test data
pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
# scale the baseline by the day-of-week ratio madw / mawk
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = np.expm1(test['unit_sales'])  # restoring unit values (inverse of log1p)
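The day-of-week adjustment and the log-scale round trip can be checked on toy numbers. In the sketch below the raw sales figures are invented, and the weekly average `mawk` stands in for the moving-average baseline (`mais`). All averaging happens on log1p-transformed values, the baseline is multiplied by the ratio madw / mawk, and `expm1` converts the result back to units.

```python
import math

# Hypothetical raw unit sales, keyed by day of week (0 = Monday, 5 = Saturday, 6 = Sunday).
daily_sales = {0: [10, 12], 5: [30, 28], 6: [20, 22]}

# Work on the log1p scale, as the training code does.
log_sales = {dow: [math.log1p(v) for v in vals] for dow, vals in daily_sales.items()}

madw = {dow: sum(vals) / len(vals) for dow, vals in log_sales.items()}  # mean per day of week
mawk = sum(madw.values()) / len(madw)                                   # mean across days of week

baseline = mawk  # stand-in for the 'mais' baseline, also on the log scale
forecast = {dow: math.expm1(baseline * madw[dow] / mawk) for dow in madw}

for dow, value in sorted(forecast.items()):
    print(dow, round(value, 2))
```

Days whose log-scale average sits above the weekly average get a ratio above 1 and are scaled up, and quieter days are scaled down.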
Forecasting test data
holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not moved
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # rows with no holiday match
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # scale up holiday dates by 20%
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # scale up promoted items by 15%
test[['id','unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')
Thank You!
