Linear Regression: Comparison of Gradient Descent and Normal Equations
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This study explores linear regression on two data sources, comparing the Gradient Descent (GD) and
Normal Equations techniques and adjusting their parameters to avoid overfitting. The first dataset,
FRIED, was proposed by Friedman [1] and Breiman [2]; it comprises 40,768 cases with 10 attributes
(0 nominal, 10 continuous). The second dataset, ABALONE, comprises 4,177 cases with 8 attributes
(1 categorical, 7 continuous).
2. Activities
There is a long-standing discussion about the performance difference between batch Gradient Descent
and on-line GD; see Wilson and Martinez [3]. To overcome some of the disadvantages of batch training,
such as slow progress when close to the minimum, one of the state-of-the-art approaches in the
literature for solving linear regression problems is mini-batch Gradient Descent. Another widely used
algorithm is stochastic GD; a discussion is available in Gardner [4]. Last but not least, least mean
squares is a class of adaptive filter described in the book by Widrow and Stearns [5].
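As an illustration only (this is not part of the implementation used in this report), a mini-batch update for linear regression could be sketched in Octave as follows. It assumes X (with a bias column), y, theta, and the learning rate alpha are already defined, and the batch size is an arbitrary choice:

% Sketch of one epoch of mini-batch gradient descent (illustrative only).
% X: m x n design matrix with bias column, y: m x 1, theta: n x 1.
batch_size = 32;
m = rows (X);
idx = randperm (m);                       % shuffle once per epoch
for s = 1:batch_size:m
  b = idx(s : min (s + batch_size - 1, m));   % indices of the current mini-batch
  grad = (X(b,:)' * (X(b,:) * theta - y(b))) / numel (b);
  theta = theta - alpha * grad;           % alpha is the learning rate
end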
3. Proposed Solutions
Two algorithms were implemented to deal with the problem. Both were developed in the Octave
language, applying vectorized implementations whenever possible to achieve the best performance.
As the first approach, Gradient Descent with regularization (to avoid overfitting) was implemented,
using the following cost function:
J(θ) = (1/(2m)) [ Σ_{i=1}^{m} (X^{(i)}θ − y^{(i)})² + λ Σ_{j=1}^{n} θ_j² ]   (1)
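A minimal vectorized Octave sketch of this cost (an illustration, not the exact code used in the experiments; it assumes X includes the intercept column and theta(1) is the unregularized intercept) could look like:

function J = costFunctionReg (X, y, theta, lambda)
  % Regularized linear-regression cost, a sketch consistent with Eq. (1).
  % X: m x (n+1) design matrix with bias column, y: m x 1, theta: (n+1) x 1.
  m = rows (X);
  err = X * theta - y;                      % residuals
  reg = lambda * sum (theta(2:end) .^ 2);   % intercept theta(1) not regularized
  J = (err' * err + reg) / (2 * m);
end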
∗Is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br

The gradient for θ at index 0 was updated without the regularization term, while the other θ indexes
were updated according to the formula below:
θ_j := θ_j − α [ (1/m) Σ_{i=1}^{m} (X^{(i)}θ − y^{(i)}) X_j^{(i)} + (λ/m) θ_j ],   j = 1, …, n   (2)
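A vectorized Octave sketch of one such update step (again an illustration; X includes the bias column, and alpha and lambda are assumed to be set already) might be:

% One regularized gradient-descent step, consistent with Eq. (2).
% X: m x (n+1) with bias column, y: m x 1, theta: (n+1) x 1.
m = rows (X);
grad = (X' * (X * theta - y)) / m;             % unregularized gradient
grad(2:end) += (lambda / m) * theta(2:end);    % skip the intercept theta(1)
theta = theta - alpha * grad;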
As a second approach, the Normal Equations algorithm
was implemented to compute the closed-form solution for
linear regression with the following code:
function [theta] = normalEqn (X, y)
  theta = zeros (size (X, 2), 1);
  theta = pinv (X' * X) * X' * y;   % closed-form solution via the pseudo-inverse
end
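For instance, an illustrative call (the variable names Xdata and y are assumed here) prepends the intercept column before solving:

% Illustrative usage of normalEqn; Xdata and y are assumed to be loaded already.
X = [ones(rows (Xdata), 1), Xdata];   % add intercept column
theta = normalEqn (X, y);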
In order to select the best model, code was developed to plot a learning curve (error vs. number of
instances used by the algorithm) and a validation curve (error vs. λ). With these graphs, we can
select the best GD parameters, such as the learning rate (α) and the regularization parameter (λ).
The λ values tested were: 0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10.
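A sketch of the validation-curve loop over these λ values is shown below. The helper trainGD (a gradient-descent fitting routine) and the variable names Xtrain, ytrain, Xval, yval, alpha, and num_iters are assumptions for illustration; costFunctionReg is the cost sketched after Eq. (1), called here with lambda = 0 so errors are comparable across regularization settings:

% Validation curve: training and validation error versus lambda.
lambdas = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10];
err_train = zeros (size (lambdas));
err_val   = zeros (size (lambdas));
for t = 1:numel (lambdas)
  theta = trainGD (Xtrain, ytrain, alpha, num_iters, lambdas(t));
  err_train(t) = costFunctionReg (Xtrain, ytrain, theta, 0);
  err_val(t)   = costFunctionReg (Xval, yval, theta, 0);
end
plot (lambdas, err_train, lambdas, err_val);   % validation curve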
If the algorithm presents biased behaviour, a simple routine was developed to add polynomial terms
of the input data (x², x³, and so on) and the products of each pair of different variables
(nchoosek(1:k, 2) lists all combinations of two distinct variables), and to plot the residuals,
as in the snippet below:
for z = 1:m
  for i = 1:k
    for j = 1:p                       % p is the maximum polynomial degree
      column = ((i - 1) * p) + j;     % one block of p columns per feature
      X_poly(z, column) = X(z, i) ^ j;
    end
  end
end

c = nchoosek (1:k, 2);                % all pairs of distinct features
for z = 1:m
  for j = 1:size (c, 1)
    column = (k * p) + j;             % product columns follow the power columns
    X_poly(z, column) = X(z, c(j, 1)) * X(z, c(j, 2));
  end
end
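As an illustrative note on the layout (sizes assumed for this example): with k = 10 features expanded up to degree p = 3, the first k*p columns of X_poly hold the powers and the next nchoosek(k, 2) columns hold the pairwise products, so the matrix can be preallocated as:

% Illustrative preallocation for the expanded design matrix (sizes assumed).
[m, k] = size (X);
p = 3;                                       % maximum polynomial degree
X_poly = zeros (m, k*p + nchoosek (k, 2));   % e.g. 30 + 45 = 75 columns for k = 10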
4. Experiments and Discussion
We describe the experiments on each dataset in a different sub-section.
4.1. FRIED dataset
Firstly, the FRIED dataset was provided in two pieces: a training set (87.5 percent, or 35,000 cases)
and a test set (12.5 percent, or 5,000 instances). GD was run with a simple polynomial (degree 1 for
all input variables), and the θ found was the following:
         intercept   x1        x2        x3      x4       x5       x6     x7     x8    x9     x10
Theta    12964.21    1412.12   1472.49   -19.51  2193.17  1081.09  -9.42  36.58  7.3   12.13  -43.92
Table 1. Theta found for Gradient Descent in FRIED
The cost was monitored for two scenarios below:
Cost                     Parameters                                      J
Without Regularization   alpha=0.03, nbr. iterations=400                 3.45
With Regularization      alpha=0.03, nbr. iterations=400, lambda=0.03    115.94
Table 2. Cost with and without regularization for FRIED
As can be seen from the table of costs, regularization did not improve the result of Gradient Descent
for FRIED; in fact, it made the result considerably worse. The following λ values were tried: 0.001,
0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10.
The cost function vs. number of iterations for GD without regularization is shown below:
Figure 1. GD function cost without regularization for FRIED.
Looking at the error, it can be noticed that it remained very large even for a reasonable number of
iterations. So, we used the R lm function to inspect the residuals and the R-squared coefficient,
looking for linear correlations between the variables and the output. The following table summarizes
the findings:
lm formula                            Resid. std error   R-squared
y ∼ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10    5533               0.2507
Table 3. R lm result for FRIED
4.1.1 Adding multiplication and higher degree polynomials
As the error was still high, we tried to increase model complexity, in this case by adding new
polynomial degrees and the products of variables taken two-by-two. The training error was plotted
in the following graph:
Figure 2. Increasing model complexity for FRIED.
The "best" model found used a 5th degree polynomial. The training error was J=1.39 with lambda=0,
while the cross-validation error was 18.49. However, this leads to an overfit model, since the
cross-validation error increases, so we can stay with the model of degree 1 or 3. After that, several
lambda values were plotted for degree=5. The best value found was lambda=0 (although all values give
very similar training and cross-validation errors, so lambda does not seem to have much influence).
4.2. ABALONE dataset
The ABALONE dataset was provided in two pieces: a training set (84 percent, or 3,500 cases) and a
test set (16 percent, or 677 instances). As linear regression requires continuous input variables,
the categorical attribute sex was converted into 3 different columns (one for each possible value:
"M", "F", and "I"), each set to 1.0 if the category matches and 0.0 otherwise. By applying this
procedure, the variable s was split into 3 numerical features called s_M, s_F, and s_I, corresponding
to the 3 possible values of s.
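A compact Octave sketch of this conversion (an illustration; the names sex_raw and Xnum are assumed here) could be:

% One-hot encoding of the categorical variable sex into s_M, s_F, s_I.
% sex_raw: m x 1 cell array of strings 'M', 'F', 'I'; Xnum: the 7 continuous columns.
s_M = double (strcmp (sex_raw, 'M'));
s_F = double (strcmp (sex_raw, 'F'));
s_I = double (strcmp (sex_raw, 'I'));
X = [s_M, s_F, s_I, Xnum];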
         intercept   s_M    s_F    s_I     l      d      h      ww     shw    vw     sw
Theta    10.02       0.28   0.46   -0.69   0.07   0.28   0.35   0.38   0.09   0.28   0.44
Table 4. Theta found for Gradient Descent in ABALONE
Again, the associated cost was calculated as shown below:

Cost                     Parameters                                      J
Without Regularization   alpha=0.03, nbr. iterations=400                 4.1
With Regularization      alpha=0.03, nbr. iterations=400, lambda=0.03    55.75
Table 5. Cost with and without regularization for ABALONE
Running GD gave the following cost vs. number of iterations graph without regularization:
Figure 3. GD function cost without regularization for ABALONE.
Using the R lm function to inspect the residuals and the R-squared coefficient, looking for linear
correlations between the variables and the output, gave the following result:
lm formula                          Resid. std error   R-squared
y ∼ s_M+s_F+s_I+l+d+h+ww+shw+vw+sw  2865               0.2641
Table 6. R lm result for ABALONE
The corresponding validation curve suggests that the best lambda value for this regression was 0.03.
The learning curve (error versus number of training examples) is shown below:
Figure 4. Learning curve (error versus nbr. training examples) for ABALONE.
The curve indicates that with 100 examples the variance of the error was already acceptable (because
J(train) and J(test) are very close to each other).
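A sketch of how such a learning curve could be produced (the helpers trainGD and costFunctionReg, and the names Xtrain, ytrain, Xtest, ytest, are assumptions, as above) is:

% Learning curve: training and test error versus training-set size.
sizes = 100:100:rows (Xtrain);
err_train = zeros (size (sizes));
err_test  = zeros (size (sizes));
for t = 1:numel (sizes)
  s = sizes(t);
  theta = trainGD (Xtrain(1:s,:), ytrain(1:s), alpha, num_iters, lambda);
  err_train(t) = costFunctionReg (Xtrain(1:s,:), ytrain(1:s), theta, 0);
  err_test(t)  = costFunctionReg (Xtest, ytest, theta, 0);
end
plot (sizes, err_train, sizes, err_test);   % learning curve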
4.2.1 Adding multiplication and higher degree polynomials
Trying to increase model complexity, the code above was used to generate new variables, that is,
new polynomial degrees and products of variables taken two-by-two. The training error was plotted
in the following graph:
Figure 5. Increasing model complexity for ABALONE.
The "best" model found used a 4th degree polynomial. The training error was J=3.64 with lambda=0.
After that, several lambda values were plotted for degree=4. The best value found was lambda=0.001
(although all values are very similar).
Figure 6. Lambda values for polynomial degree=4 for ABALONE.
4.3. Gradient Descent (GD) vs. Normal Equations (NE)
The machine used for the runs has a 2-core Intel processor at 2.26 GHz and 4 GB of RAM. The next
table summarizes the performance on the ABALONE and FRIED datasets for both the Gradient Descent and
Normal Equations algorithms. The θ values found by Normal Equations for the FRIED and ABALONE
datasets are shown in Tables 7 and 8.
Algorithm          Parameters                                     nbr. training instances   numerical attributes   Datasource   System CPU time (seconds)
Gradient Descent   alpha=0.03, nbr. iterations=400, lambda=0.03   35000                     10                     FRIED        20.95
Normal Equations   -                                              35000                     10                     FRIED        100.34
Gradient Descent   alpha=0.03, nbr. iterations=400, lambda=0.03   3500                      10                     ABALONE      1.23
Normal Equations   -                                              3500                      10                     ABALONE      1.12
         intercept   x1     x2     x3      x4     x5     x6      x7     x8     x9     x10
Theta    4101.59     4.54   4.72   -0.06   7.03   3.46   -0.03   0.12   0.02   0.03   -0.14
Table 7. Theta found for Normal Equations in FRIED

         intercept   s_M    s_F     s_I     l       d      h      ww     shw    vw     sw
Theta    0           0      78.51   73.55   27.08   0.02   0.05   0.16   0.03   0.05   0.18
Table 8. Theta found for Normal Equations in ABALONE
Theoretically, the Normal Equations approach runs in O(n³), so it is suitable for cases where n is not
huge. For these two examples, Gradient Descent runs much faster for FRIED (at least 5 times), while
both algorithms took almost the same system time for the ABALONE datasource.
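For reference, the system CPU time can be measured in Octave around each training routine roughly as follows (a sketch; the actual measurement code used in this report is not shown):

% Measuring CPU seconds around a training routine (illustrative only).
t0 = cputime;
theta = normalEqn (X, y);        % or the gradient-descent training loop
elapsed = cputime - t0;          % system CPU time in seconds
printf ("CPU time: %.2f s\n", elapsed);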
5. Conclusions and Future Work
Concerning the FRIED dataset, very small values of alpha (0.001 or 0.003) did not work, probably
because the resulting gradient steps made the algorithm stall before reaching the global minimum or
miss it altogether. The values reported in this study are the ones that kept the algorithm from
getting stuck and yielded a reasonable cost. The FRIED "best"-cost model was found with a polynomial
of degree 5 (cost=1.39, lambda=0, alpha=0.03), but it turns out to be overfit because the test set
gave an error of 18.49. Hence, the recommended model is degree 1 (the minimum error on the test set,
giving an error of 12.02).
Regarding the ABALONE dataset, it was easier to study from different perspectives, and these tools
were of great help in assuring quality. Firstly, the cost over number of iterations showed that this
dataset took fewer iterations to converge. Secondly, the learning curves (error versus number of
instances) made it easier to see that the error was not so high, indicating low bias, and the
convergence guarantees that the variance is not so high, avoiding overfitting. Finally, the
validation curves (error of the training and cross-validation/test sets versus lambda) gave a
pragmatic way to select the "best" lambda. The "best" model found from a cost perspective was a
polynomial of degree 4, also using variable multiplication, with lambda=0.001 and alpha=0.03
(training cost of 3.64).
Strangely, in both cases regularization did not improve the result. This looks like a bug in the
implementation, as there is no grounded theoretical reason for it, at least for ABALONE; we will
further study what went wrong with the Octave implementation. For FRIED there could be some
explanation in the fact that the dataset is generated artificially, with a scattered set of points
in the dimensional space.
As the datasets are not huge, from a performance perspective Normal Equations gave the response in a
reasonable time, which justifies its application given the exactness of the results. It is also
easier to use than Gradient Descent because it does not require tuning several parameters or using
cross-validation to avoid bias or overfitting.
However, as theory predicts that Normal Equations runs in O(n³), it starts to degrade very quickly:
a 10-fold increase in dataset size led to close to a 100-fold increase in system CPU time.
That is why, as future work, we recommend using a larger dataset (at least 10 to 100 times larger)
to compare the performance results and effectively demonstrate the use of Gradient Descent in these
large-dataset cases.
References
[1] J. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141, 1991.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16:1429–1451, 2003.
[4] W. A. Gardner. Learning characteristics of stochastic-gradient-descent algorithms: A general study, analysis, and critique. Signal Processing, 6(2):113–133, 1984.
[5] B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice Hall, 1985.
