Exploring Support Vector Regression
for Predictive Data Analysis
Daniel Kuntz∗, Surya Chandra† and Jon Pritchard‡
Department of Electrical Engineering and Computer Science
Colorado School of Mines: Golden, CO
Email: ∗dkuntz@mines.edu, †schandra@mines.edu, ‡jpritcha@mines.edu
Abstract—The purpose of this paper is to demonstrate the
use of Support Vector Regression (SVR) in the context of
predicting the hourly use of bikes in Washington D.C.’s bike
share program. An abridged derivation of the SVR scheme is
given along with an explanation of kernel functions which are
vital to the performance of this method. Bike share data is
provided as part of a Kaggle™ competition, giving us a firm quantitative benchmark of its predictive performance against an array of other competitors. We also show a direct comparison between SVR and a naive linear regression to build intuition for the underlying concepts. Our results indicate a substantial improvement over linear regression and competitive performance in the overall contest.
I. INTRODUCTION
Advances in predictive modelling are providing new in-
sights into critical data for businesses, governments and indi-
viduals. One of the most popular of these methods is SVR. It
is an efficient, highly configurable, and mathematically sound
solution for gaining this insight. In short, it is designed to fit a non-linear function that approximates an outcome (e.g. the number of bikes rented) from the data the outcome is perceived to depend on (e.g. time, season, weather), which are usually called "explanatory variables".
As a test case, our team will compete in the Washington
D.C. Bike Share competition hosted by Kaggle. In this contest,
it is of interest to the city to determine when and why people
are using their bike share program. This information will allow
them to properly plan for future growth as well as provide an analysis of customer use patterns. Our team decided to use SVR modelling to compete in the competition, and our approach is documented herein.
A. How The Competition Works
Kaggle provides two sets of data. One set, generally referred to as the "training" set, provides a set of explanatory variables along with the outcome for each. For this particular competition, the given variables and outcomes are listed in TABLES I and II respectively. This set of data is used to train a prediction algorithm. The second set of data, referred to as the "test" set, provides explanatory variables but not their outcomes, which are hidden from contestants, whose job it is to predict them. Once a prediction is made, its accuracy is scored with equation (1).
TABLE I. EXPLANATORY VARIABLES [1]

Name        Description
datetime    Date and time (YYYY-MM-DD HH:MM:SS)
season      Season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
holiday     Whether the day is considered a holiday
weather     1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp        Temperature in Celsius
atemp       "Feels like" temperature in Celsius
humidity    Relative humidity
windspeed   Wind speed
TABLE II. OUTCOMES
Name Description
casual Number of non-registered user rentals initiated
registered Number of registered user rentals initiated
count Number of total rentals
\[ \mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2} \quad (1) \]

Where:
RMSLE : Root Mean Squared Logarithmic Error
n : the number of explanatory vectors in the test data set
p_i : the prediction for vector i
a_i : the actual value for vector i
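As a concrete reference, a minimal NumPy implementation of the scoring rule (1) is sketched below; Python is our illustrative choice here and is not prescribed by the competition.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error, as defined in (1)."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.sqrt(np.mean((np.log(predicted + 1) - np.log(actual + 1)) ** 2))

# Small worked example: three hourly predictions against the actual counts.
print(rmsle([10, 0, 250], [12, 1, 240]))
```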
B. Discussion of Parameters
Some of the parameters in TABLE I had to be modified, and some care had to be taken that redundant and unimportant variables were not used. The "datetime" variable was divided into four different variables: year, month, day and hour. This allowed our model to account for variations by hour and month, as one would intuitively expect a strongly correlated cyclical pattern in these variables. Variables such as "season", which is entirely dependent on the month and day, were generally taken out of the model so as not to "over-train" it.

Some experimentation was needed to determine which variables affected the outcome most strongly; one way to do this, which we will not discuss, is Principal Component Analysis (PCA). When results are discussed, we provide a full list of the explanatory variables used in the model.
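A minimal sketch of this preprocessing step is shown below, assuming the competition CSV is loaded with pandas; the file name and the exact set of dropped columns are illustrative rather than a record of our exact pipeline.

```python
import pandas as pd

# Load the training data and split "datetime" into year, month, day and hour.
train = pd.read_csv("train.csv", parse_dates=["datetime"])
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.day
train["hour"] = train["datetime"].dt.hour

# Drop redundant variables (e.g. "season" follows from month/day) and the outcomes.
X = train.drop(columns=["datetime", "season", "casual", "registered", "count"])
y = train["count"]
```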
II. LINEAR REGRESSION
To demonstrate some of the underlying concepts of SVR
we take as inspiration a very simple linear regression for
creating a predictive model. Here we assume that each explanatory variable has a weight, and that the sum of each variable times its weight, plus an offset, is a good model of the system.
A. Problem Formulation
We assume that the predictive function that we would like to find takes the form of (2):

\[ f(\mathbf{x}_i) = w_0 + \sum_{j=1}^{m} w_j x_{i,j} \quad (2) \]

Where:
x_i : the ith explanatory vector
w_0 : an offset weight
w_j : the weight for each component of x_i
m : the number of variables in x_i
So, in this case, if we determine the weights w = [w_0 ... w_m]^T, we have found a predictive model. Since we have m + 1 weights, we can use a system of equations of the form (2) to find them. For the training set of data we have the system of equations (3):

\[
\begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \vdots &        & \vdots \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,m}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\quad (3)
\]

Where:
n : the number of training explanatory vectors
y_i : the outcome for each explanatory vector i
Using matrix notation, we rewrite the system (3) as (4). We recognize this as a standard over-determined least-squares problem, the solution of which is given by (5), where X^+ denotes the pseudo-inverse of the data matrix X:

\[ X\mathbf{w} = \mathbf{y} \quad (4) \]

\[ \mathbf{w} = X^{+}\mathbf{y} \quad (5) \]
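A minimal NumPy sketch of (4)-(5) on synthetic data is given below; in practice np.linalg.lstsq is numerically preferable to forming the pseudo-inverse explicitly, but both recover the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.random((100, 3))                      # 100 training vectors, m = 3
y = X_raw @ np.array([2.0, -1.0, 0.5]) + 3.0      # synthetic outcomes, offset w0 = 3
X = np.hstack([np.ones((100, 1)), X_raw])         # prepend the column of ones from (3)

w_pinv = np.linalg.pinv(X) @ y                    # w = X^+ y, as in (5)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # equivalent least-squares solution
print(np.round(w_pinv, 3), np.round(w_lstsq, 3))  # both recover [3, 2, -1, 0.5]
```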
B. Results
Using this naive linear regression method with the explanatory variables "year", "month", "day", "hour", "holiday", "workingday", "weather", "temp", "humidity" and "windspeed", we achieved the competition results in TABLE III.
TABLE III. KAGGLE SCORE FOR LINEAR REGRESSION PREDICTION
Score (RMSLE) Rank (of approx. 1500)
1.30542 1275
C. Analysis of Linear Regression Results
As we might suspect, the linear regression did not perform very well. The reason is that many variables do not affect the outcome in a linear way. The variable "weather" may reduce the number of riders in proportion to how bad the weather is, and as such is a good candidate for linear regression, but what about a variable like "hour"? Intuitively, this variable should create spikes in the outcome during rush hours. Fig. 1 confirms this: it shows the average bike rentals for each hour over the whole data set compared to a best-fit line. We can easily see that the linear regression is not a good representation of this variable. Hence, we need a non-linear representation of the data.
Fig. 1. Linear Regression Fit to Hourly Average
III. HIGHER DIMENSIONAL MAPPING
AND KERNEL FUNCTIONS
Since linear regression fails to accurately model the data,
it is obvious that we need to use a non-linear model to achieve
a better approximation. However, non-linear models are much
more complex than linear models. One strategy that could work would be to map the low-dimensional data into a higher-dimensional space where it is linear. In a simplistic manner, linear regression performs this kind of mapping by adding an offset term, i.e. the mapping \( \Phi(\mathbf{x}): \mathbb{R}^m \to \mathbb{R}^{m+1} \). This idea can be expanded to include higher-order terms as well; consider the mapping \( \Phi \) such that

\[ \Phi(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}^6, \qquad \Phi([x_1\ \ x_2]) = [\,1\ \ x_1\ \ x_2\ \ x_1^2\ \ x_2^2\ \ x_1 x_2\,] \quad (6) \]
The problem with these kinds of mappings is that the linear regression model becomes extremely inefficient, because we could be mapping into a space with a very large number of dimensions. For example, under a simple quadratic mapping an m-dimensional vector is transformed into a vector in an O(m^2)-dimensional space. This becomes computationally expensive very quickly.
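The growth is easy to verify numerically; the sketch below, assuming scikit-learn's PolynomialFeatures, counts the dimensions of an explicit degree-2 mapping, which come to 1 + m + m(m+1)/2, i.e. O(m^2).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

for m in (2, 10, 100):
    X = np.zeros((1, m))                                  # a single m-dimensional vector
    mapped = PolynomialFeatures(degree=2).fit_transform(X)
    print(m, "->", mapped.shape[1])                       # 2 -> 6, 10 -> 66, 100 -> 5151
```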
A. Definition
A solution to the problem of mapping to a higher-dimensional space is the use of kernel functions. Kernel functions allow us to find the inner products of high-dimensional vectors while working in the lower-dimensional space, for a very specific set of functions. This means that if we can formulate our minimization problem to depend only on these inner products, we can use kernel functions to drastically improve the performance of our algorithm.

The definition of a kernel function is simply any function that satisfies the following:

\[ K(\mathbf{x}_1, \mathbf{x}_2) = \langle \Phi(\mathbf{x}_1), \Phi(\mathbf{x}_2) \rangle \quad (7) \]

Where:
K : the kernel function
\( \Phi \) : a mapping to a higher-dimensional space
B. Example Kernel Function
To illustrate the relationship between mapping functions and kernel functions, a simple kernel function is derived below. Given the column vectors

\[ \mathbf{x} = [x_1\ \ x_2]^T, \qquad \mathbf{z} = [z_1\ \ z_2]^T, \qquad \Phi(\mathbf{x}) = [\,x_1^2\ \ \sqrt{2}\,x_1 x_2\ \ x_2^2\,]^T, \]

it follows that

\[
\Phi(\mathbf{x})^T \Phi(\mathbf{z})
= x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
= (x_1 z_1 + x_2 z_2)^2
= (\mathbf{x}^T \mathbf{z})^2
= K(\mathbf{x}, \mathbf{z}).
\]
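A quick numerical check of this identity, assuming NumPy:

```python
import numpy as np

def phi(v):
    """Explicit mapping Phi(x) = [x1^2, sqrt(2) x1 x2, x2^2]."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, z = np.array([1.5, -0.7]), np.array([2.0, 0.3])
print(phi(x) @ phi(z))   # inner product computed in the mapped (3-D) space
print((x @ z) ** 2)      # the same value from the kernel K(x, z) = (x^T z)^2
```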
C. Other Types of Kernel Functions
Two of the most commonly used kernel functions are the Gaussian Radial Basis Function (RBF) (8) and the polynomial kernel (9):

\[ K(\mathbf{x}_1, \mathbf{x}_2) = \exp\!\left( -\frac{\lVert \mathbf{x}_1 - \mathbf{x}_2 \rVert^2}{2\sigma^2} \right), \qquad \sigma \in \mathbb{R} \quad (8) \]

\[ K(\mathbf{x}_1, \mathbf{x}_2) = \bigl( \langle \mathbf{x}_1, \mathbf{x}_2 \rangle + c \bigr)^p, \qquad c \ge 0,\ p \in \mathbb{N} \quad (9) \]
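For reference, both kernels can be written directly in NumPy as below; the parameter values in the example call are arbitrary.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian RBF kernel (8): exp(-||x1 - x2||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x1, x2, c=1.0, p=2):
    """Polynomial kernel (9): (<x1, x2> + c)^p with c >= 0 and integer p >= 1."""
    return (np.dot(x1, x2) + c) ** p

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf_kernel(x1, x2, sigma=1.0), poly_kernel(x1, x2, c=1.0, p=2))
```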
D. Discussion
We have defined kernel functions and shown how they can be used to calculate high-dimensional inner products using
lower dimensional vectors. With this knowledge we can move
forward to define the formulation of support vector regression,
using kernel functions to simplify calculations.
IV. DERIVATION OF THE SUPPORT VECTOR REGRESSION
METHOD
A. Primal Formulation
In order to use the efficient properties of kernel functions
we now need a regression formulation that can be expressed
in terms of the inner product of explanatory vectors xi. To this
end, we consider the minimization problem (10).

Minimize:
\[ \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) \quad (10) \]

Subject to:
\[ y_i - \langle \mathbf{w}, \mathbf{x}_i \rangle - w_0 \le \epsilon + \zeta_i \quad (11) \]

\[ \langle \mathbf{w}, \mathbf{x}_i \rangle + w_0 - y_i \le \epsilon + \zeta_i^* \quad (12) \]

\[ \zeta_i \ge 0, \qquad \zeta_i^* \ge 0 \]
In this formulation, \( \zeta_i \) and \( \zeta_i^* \) are slack variables; they allow the data to vary outside of the band \( \pm\epsilon \), but any point that does go outside this band penalizes the minimization term. C > 0 determines how strongly deviations larger than \( \epsilon \) are penalized. As shown in Fig. 2, only the points outside the \( \epsilon \)-band contribute to the cost, and their deviations are penalized linearly. These penalized data vectors are the support vectors.
Fig. 2. Visualization of Support Vectors [2]
B. Lagrangian Minimization
The minimization problem described by (10) has the Lagrangian representation (13):

\[
\begin{aligned}
L := \;& \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*) - \sum_{i=1}^{n} (\eta_i \zeta_i + \eta_i^* \zeta_i^*) \\
& - \sum_{i=1}^{n} \alpha_i \bigl( \epsilon + \zeta_i + w_0 + \langle \mathbf{w}, \mathbf{x}_i \rangle - y_i \bigr) \\
& - \sum_{i=1}^{n} \alpha_i^* \bigl( \epsilon + \zeta_i^* - w_0 - \langle \mathbf{w}, \mathbf{x}_i \rangle + y_i \bigr)
\end{aligned}
\quad (13)
\]
Taking the derivative of L with respect to each of the variables \( \{\mathbf{w}, w_0, \zeta_i, \zeta_i^*\} \) yields the following expressions:

\[ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \mathbf{x}_i \quad (14) \]

\[ \frac{\partial L}{\partial w_0} = -\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \quad (15) \]

\[ \frac{\partial L}{\partial \zeta_i} = C - (\eta_i + \alpha_i) \quad (16) \]

\[ \frac{\partial L}{\partial \zeta_i^*} = C - (\eta_i^* + \alpha_i^*) \quad (17) \]
Setting each derivative equal to zero gives the following expressions:

\[ (14) = 0 \implies \mathbf{w} = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \mathbf{x}_i \quad (18) \]

\[ (15) = 0 \implies \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0 \quad (19) \]

\[ (16) = 0 \implies \eta_i = C - \alpha_i \quad (20) \]

\[ (17) = 0 \implies \eta_i^* = C - \alpha_i^* \quad (21) \]
C. Dual Formulation
Plugging expressions (18), (19), (20) and (21) back into (13) then yields the dual formulation of the minimization problem (10). This formulation is given by (22).

Maximize:
\[
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i \;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
\;-\; \frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \langle \mathbf{x}_i, \mathbf{x}_j \rangle \quad (22)
\]

Subject to:
\[ \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i, \alpha_i^* \in [0, C] \]
Notice that the dual formulation is written in terms of the inner products of the \( \mathbf{x}_i \). This means that we can use the kernel functions described in SECTION III to avoid explicitly computing a higher-order mapping \( \mathbf{z}_i = \Phi(\mathbf{x}_i) \). This allows us to write (22) as (23):

\[
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i \;-\; \epsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
\;-\; \frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(\mathbf{x}_i, \mathbf{x}_j) \quad (23)
\]
D. Solving for α(∗)
Now the only unknowns left are the variables α and α*. Solving for these variables can be accomplished numerically; one such numerical scheme is an interior-point algorithm referred to as primal-dual path-following [3], described in [4]. It should also be noted that a very nice property of the SVR formulation is that it is convex [3], so a numerical solver converges to a single global solution.
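To make the dual concrete, the sketch below solves (23) on a tiny synthetic problem with a general-purpose constrained optimizer (SciPy's SLSQP). This is only an illustration that (23) is an ordinary convex quadratic program; it is not the primal-dual path-following method of [3], [4], and production SVR implementations use specialized solvers.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny synthetic 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

C, eps, sigma = 10.0, 0.1, 1.0
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
n = len(y)

def neg_dual(v):
    """Negative of the dual objective (23); v stacks [alpha, alpha_star]."""
    a, a_star = v[:n], v[n:]
    d = a - a_star
    return -(y @ d - eps * np.sum(a + a_star) - 0.5 * d @ K @ d)

cons = {"type": "eq", "fun": lambda v: np.sum(v[:n] - v[n:])}   # constraint (19)
res = minimize(neg_dual, np.zeros(2 * n), bounds=[(0, C)] * (2 * n), constraints=cons)
d = res.x[:n] - res.x[n:]                                       # alpha_i - alpha_i^*

# Bias w0 from a point whose multiplier lies strictly inside (0, C); see Section IV-E.
i = np.flatnonzero((res.x[:n] > 1e-6) & (res.x[:n] < C - 1e-6))[0]
w0 = y[i] - K[i] @ d - eps

f = K @ d + w0                                                  # in-sample predictions
print(np.round(f[:5], 2), np.round(y[:5], 2))
```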
E. Final Solution
Once we have solved for α and α*, all that is left is to compute the prediction function. Plugging (18) into (2) and applying the kernel substitution gives (24):

\[ f(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(\mathbf{x}_i, \mathbf{x}) + w_0 \quad (24) \]

Similarly, the offset term w_0 can be solved for by plugging (18) into (11) or (12) at any point whose multiplier lies strictly inside the box, giving (25) and (26):

\[ w_0 = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(\mathbf{x}_j, \mathbf{x}_i) - \epsilon \quad \text{for } \alpha_i \in (0, C) \quad (25) \]

\[ w_0 = y_i - \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, K(\mathbf{x}_j, \mathbf{x}_i) + \epsilon \quad \text{for } \alpha_i^* \in (0, C) \quad (26) \]
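As a cross-check of (24)-(26), the sketch below assumes scikit-learn's SVR: its fitted attributes dual_coef_ and intercept_ correspond to (α_i − α_i^*) and w_0, so the prediction can be reproduced by hand from the support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

gamma = 0.5                                            # RBF parameter, gamma = 1 / (2 sigma^2)
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

x_new = np.array([[0.3]])
K = np.exp(-gamma * ((svr.support_vectors_ - x_new) ** 2).sum(axis=1))
f_manual = svr.dual_coef_[0] @ K + svr.intercept_[0]   # sum_i (a_i - a_i^*) K(x_i, x) + w0

print(f_manual, svr.predict(x_new)[0])                 # the two values should agree
```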
F. Selection of Parameters
When selecting the parameters C and ε, it helps to understand how they affect the regression. The primal minimization problem (10) holds some clues: C penalizes the function being minimized any time a vector falls outside the error-insensitive tube of width ±ε.

We can see from Fig. 3 that a small C favors a smoother function, while a larger C puts more emphasis on getting as close to every point as possible. Thus, the C parameter is a good way to deal with "over-fitting" the data; it can be thought of as a gain applied to the slack variables.
Fig. 3. Effect of the C Parameter
As shown by Fig. 4, the size of ε controls how much small errors in the predictive function are ignored. A small ε will penalize most errors, while a larger value will not penalize errors that fall close enough to the function. Thus ε determines the number of support vectors used to calculate f(x).
Fig. 4. Effect of the ε Parameter
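In practice we tuned these parameters empirically; a hedged sketch of one common way to do this, cross-validated grid search over C and ε with scikit-learn, is shown below. The placeholder data, the grid values, and the log1p target transform are illustrative assumptions, not our exact procedure.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Placeholder feature matrix and rental counts; substitute the engineered features.
rng = np.random.default_rng(0)
X_train = rng.random((500, 4))
y_train = rng.integers(0, 500, size=500).astype(float)

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = {"svr__C": [1, 10, 30, 100], "svr__epsilon": [0.01, 0.1, 1.0]}

# Squared error on log1p(count) approximates the RMSLE metric in (1).
search = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, np.log1p(y_train))
print(search.best_params_)
```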
G. Results
Using the parameters in TABLE IV, we achieved the
highest Kaggle score for our team. The score is provided in
TABLE V. These results show a drastic improvement over the naive linear regression method.
TABLE IV. SVR PARAMETERS

Parameter               Value
explanatory variables   month, hour, weather, workingday
kernel                  Gaussian Radial Basis Function (RBF)
ε                       0.1
C                       30
TABLE V. KAGGLE SCORE FOR SVR PREDICTION
Score (RMSLE) Rank (of approx. 1500)
0.55815 847
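For completeness, a sketch of fitting SVR with the TABLE IV parameter choices and writing a Kaggle-style submission is given below; the feature files, the scaling step and the log1p/expm1 target transform are illustrative assumptions rather than a record of our exact code.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cols = ["month", "hour", "weather", "workingday"]
train = pd.read_csv("train_features.csv")   # assumed to hold cols plus "count"
test = pd.read_csv("test_features.csv")     # assumed to hold cols plus "datetime"

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=30, epsilon=0.1))
model.fit(train[cols], np.log1p(train["count"]))        # train on log1p counts

pred = np.expm1(model.predict(test[cols])).clip(min=0)  # back-transform, no negatives
pd.DataFrame({"datetime": test["datetime"], "count": pred}).to_csv(
    "submission.csv", index=False
)
```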
H. Analysis of SVR Results
We have seen that SVR substantially boosts predictive power over the baseline linear regression. To see why, we again show the plot for a model run with just the hour as an explanatory variable (Fig. 5). Now the non-linear function fit to the data adheres much more closely to the hourly average. As time of day is one of the principal variables, we can easily imagine the fit in higher dimensions conforming much more closely to the actual data.
Fig. 5. SVR Fit to Hourly Average
V. CONCLUSION
This project has demonstrated how Support Vector Re-
gression can be used to find a functional approximation to
a nonlinear dataset. It extends the idea of linear regression
to higher dimensional spaces, and artfully utilizes kernel
functions in order to reduce the complexity of computing the
result. As our results in the Kaggle competition have shown,
SVR is a far more robust method of prediction than the naive
linear regression.
ACKNOWLEDGMENT
Special thanks to Professor Gongguo Tang for a very well
taught and interesting class this semester.
REFERENCES
[1] "Data - Bike Sharing Demand," https://www.kaggle.com/c/bike-sharing-demand/data, accessed Dec. 10, 2014.
[2] P. S. Yu et al., "Support vector regression for real-time flood stage forecasting," Journal of Hydrology, vol. 328, no. 3-4, pp. 704-716, Sep. 2006.
[3] A. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," Sep. 30, 2003.
[4] R. J. Vanderbei, "LOQO: An interior point code for quadratic programming," TR SOR-94-15, Statistics and Operations Research, Princeton Univ., NJ, 1994.
