International Journal of Artificial Intelligence and Applications (IJAIA), Vol.8, No.6, November 2017
DOI: 10.5121/ijaia.2017.8603
ON THE PREDICTION ACCURACIES OF THREE
MOST KNOWN REGULARIZERS : RIDGE REGRESSION,
THE LASSO ESTIMATE, AND ELASTIC NET
REGULARIZATION METHODS
Adel Aloraini
Computer Science Department, Qassim University, Saudi Arabia.
ABSTRACT
This paper presents intensive empirical experiments using 13 datasets to understand the regularization
effectiveness of ridge regression, the lasso estimate, and elastic net regularization methods.
Given the diversity of the datasets used, the study offers a deeper understanding of how the datasets
affect the prediction accuracy of each regularization method for a given problem. The results show
that datasets play a crucial role in the performance of a regularization method and that the
prediction accuracy depends heavily on the nature of the sampled datasets.
KEYWORDS
ridge regression, regularization, the lasso estimate, elastic net.
1. INTRODUCTION
Penalization of regression parameters has attracted much attention in the literature recently. This
is largely due to the improved model selection such methods provide over non-penalized regression
(such as ordinary linear regression). The way penalized regression methods shrink the parameters
associated with the features in the model is key to providing better predictions when a model must be
selected among the models in the search space. The best-known penalized regression methods are ridge
regression [1], the lasso estimate [2], and, more recently, the elastic net regularization method [3].
In ridge regression, the penalization is not severe: the regression parameters are shrunk towards zero
but rarely set exactly to zero, so the selected model has little sparsity and usually retains all of the
chosen subset of features from the search space. The lasso estimate, in contrast, penalizes much more
severely, and the penalization term (λ) usually discards irrelevant features from the chosen model [2].
A shortcoming of the lasso estimate, however, arises when the search space includes highly correlated
features. In a highly correlated feature space, the lasso estimate tends to choose one of the correlated
features and ignore the others, which, had they been chosen, might have led to better prediction accuracy.
In addition, the lasso estimate selects at most (n) features when the number of features (p) is greater
than the number of samples (n), meaning the number of selected features is bounded by the number of
samples [3]. The elastic net regularization method, on the other hand, can handle these limitations of
the lasso estimate by (1) selecting the group of features that together give better prediction, and
(2) not bounding the number of selected features by the number of samples in the (p >> n) setting.
Hence, given these differences between the regularization methods, we present in this work empirical
experiments that take a further step in analyzing model selection behavior, in terms of prediction
accuracy, for ridge regression, the lasso estimate, and elastic net regularization methods. To carry out
model selection, we used the Bayesian information criterion (BIC) and cross validation as score functions
for all fitted models in the search space, using a set of different (λ) values as well as different (α)
values for the elastic net regularization method.
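For reference, the three penalties can be written as follows. This is a standard textbook formulation added
here for clarity rather than one quoted from the paper, and the exact scaling of the squared-error term
varies between implementations (glmnet, for instance, divides it by 2n):

    \hat{\beta}_{ridge} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
    \hat{\beta}_{lasso} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
    \hat{\beta}_{enet}  = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + \tfrac{1-\alpha}{2} \|\beta\|_2^2 \right)

In glmnet's parameterization, α = 1 recovers the lasso penalty and α = 0 recovers the ridge penalty, with
intermediate values mixing the two; this convex combination of the two penalties is what allows the elastic
net to select correlated features as a group.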
In the next section we review the related work; we then proceed to the methodology, the results, and the
conclusion.
2. RELATED WORK
Ridge regression, the lasso estimate, and, more recently, elastic net regularization methods have been
used extensively for model selection and feature reduction in the machine learning literature and its
applications. In [4], ridge regression was applied in an approach combining homomorphic encryption and
Yao garbled circuits, which outperformed using homomorphic encryption or Yao circuits alone. Ridge
regression has also shown interesting results when multicollinearity exists in the model selection and
the associated parameters, especially when ridge regression is combined with Variance Inflation Factors
(VIF) [5], [6]. However, ridge regression is typically used when all of the features in the model selection
are important enough to be included in the resulting classifiers or models. When the search space instead
contains many features irrelevant to the problem being modeled, the lasso estimate can return sparser
models that contain the most important features. This is because the lasso estimate shrinks many of the
associated parameters that tend to be less relevant towards zero. The work in [7] experimented with the
lasso estimate and its variants, and the results showed that when the lasso estimate is relaxed with
filtering methods, prediction is improved. The lasso estimate has also been applied in [8] to visual object
tracking, and the results showed promising performance for the computer vision field.
Moreover, for network modeling, the work in [9] addressed how useful the lasso estimate is for estimating
psychopathological networks, especially since the number of estimated parameters grows quickly compared
to the number of data samples. The work showed that the lasso estimate can yield sparse models that capture
interesting results even though the number of parameters in the search space under investigation grows
exponentially. However, when there exists a group of features that are naturally correlated and work
together, such as genes in cellular systems, the lasso estimate often tends to choose one member of the
group and ignore the others [3], in addition to the bound imposed on the number of chosen features by the
sample size [3]. These shortcomings of the lasso estimate have been addressed and solved by the elastic
net regularization method, which treats correlated features so that they are either all in or all out of
the selected model. The elastic net regularization method also shows more reliable feature selection in
p >> n datasets, and, due to the scalability it provides in model selection and in the estimated
parameters, it has drawn much attention recently. The elastic net regularization method in [10] was able
to deliver reliable performance with high accuracy in assessing the uncertainties in node voltage and
determining the influential factors, along with calculating their voltage influence parameters. In the
study by [11], a hybrid probabilistic prediction methodology based on the elastic net regularization
method with a probabilistic Bayesian belief network was applied to predict the probability that a patient
shows up at the clinic, using demographics, socioeconomic status, and the available appointment history,
which provided notably comparable predictive results among many approaches in the literature.
In our current study we aim to provide an intensive comparison between the lasso estimate, ridge
regression, and elastic net regularization methods in order to further analyze the prediction accuracy of
these methods. We used 13 datasets that differ in sample size and in the systems they were sampled from,
in order to bring diversity to the behavioral analysis of each penalized method as well as of the score
functions used for model selection. In the next section we describe in detail the methodology and the
datasets used in the study.
3. DATASETS
The datasets used in the study were obtained from different systems. We used gene expression datasets
from microarray experiments that differ in sample and feature size. The datasets are annotated by
different signaling pathways: (1) the cell-cycle signaling pathway, which has 5 samples and 98 genes and
is considered hard to learn from, as the feature space far exceeds the sample space; and (2) the MAPK
signaling pathway, which has the same gene size as the cell-cycle signaling pathway but with 100 samples.
We also used two different microarray datasets sampled from prostate cancer gene expression signaling
pathways: (1) the JAKSTAT signaling pathway with 86 genes across 13 samples, and (2) the JAKSTAT1
signaling pathway with 35 genes vs. 62 samples. Much harder-to-learn-from datasets come from
breast-cancer tissues: (1) breast cancer dataset1 contains 209 genes vs. 14 samples, and (2) breast
cancer dataset2 contains 209 genes vs. 10 samples. In order to add diversity to the experiments, we used
7 datasets from the equities market, which allow for a large sample size compared to the feature space;
the features are banks in the market. The banks in the equities market were monitored over 7 intervals
as follows: (1) equities-dataset1 has 11 features vs. 80 samples, (2) equities-dataset2 has 11 features
vs. 59 samples, (3) equities-dataset3 has 11 features vs. 39 samples, (4) equities-dataset4 has 11
features vs. 19 samples, (5) equities-dataset5 has 11 features vs. 7 samples, (6) equities-dataset6 has
11 features vs. 250 samples, and (7) equities-dataset7 has 11 features vs. 250 samples.
4. METHODOLOGY
All 13 datasets described above were fed to the three penalized regression methods: ridge regression,
the lasso estimate, and elastic net regularization methods. For ridge regression, the lasso estimate, and
elastic net regularization methods, the set of (λ) values was computed by the glmnet function in R, which
produces 100 values [12], [13]. Thus, each dataset generates a search space of 100 models corresponding
to the 100 (λ) values. The generated models were then scored by two scoring functions, namely cross
validation (Algorithm 1) and the Bayesian information criterion (BIC) (Algorithm 2). When using the BIC
score function, all models were generated from the corresponding (λ) values, and BIC then scores all of
these models and returns from the search space the model with the smallest BIC score. For cross
validation we set (nfolds=3), as some of the datasets used in the experiments have a small sample size.
For the elastic net regularization method, we experimented with 8 different (α) values between [0.1, 0.9],
since α=1.0 corresponds to the lasso penalty and α=0.0 gives the ridge regression penalty. After defining
the necessary parameters (λ, α, and nfolds), we executed the experiments over the 13 datasets.
Algorithms 1 and 2 describe the framework of applying cross validation and the BIC score function,
respectively, to ridge regression, the lasso estimate, and elastic net regularization methods.
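As a concrete illustration of this setup, the sketch below shows how the 100-value λ path and the 3-fold
cross validation score could be obtained with glmnet in R. This is a minimal sketch written for this
summary, not the authors' code; the data objects (x, y) and the α value are illustrative.

    library(glmnet)

    # x: numeric predictor matrix, y: numeric response (illustrative names)
    fit <- glmnet(x, y, family = "gaussian", alpha = 0.5, nlambda = 100)
    lambda.grid <- fit$lambda      # lambda values computed by glmnet
    length(lambda.grid)            # up to 100 models in the search space
                                   # (glmnet may return fewer if the path ends early)

    # Cross validation with 3 folds, as used for the small-sample datasets
    cvfit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 3)
    cvfit$lambda.min               # lambda with the smallest cross-validation error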
In the next sections we explain the components of Algorithm 1, which shows how cross validation was used
with the pre-defined (λ, α) values to choose the optimal model in the search space of
each dataset used in the experiment; Algorithm 2 explains the same procedure but for the BIC score
function.
A. Algorithm 1
Algorithm (1) starts in step (1) by defining the vector of (α) values used in the experiments, which
ranges over [0.0, 1.0], where (0.0) penalizes for the lasso estimate and (1.0) penalizes for ridge
regression. Then, in step (2), the algorithm iterates over each value of the α vector for the dataset
under investigation in step (3). The algorithm iterates over the length of the feature space to experiment
with each feature (step (6)) in order to find the optimal subset of features from the remaining features
in the dataset (step (5)). To do that, first, in step (7), the cross validation score function with the
pre-defined (nfolds) is used to score all possible models generated from the set of (λ) values (100 values
between [0, 1]) for the candidate subset of features from step (5). Second, the algorithm in step (8)
returns the optimal value of (λ), the one corresponding to the smallest cross validation error, which in
turn is used in step (9) to return the best-fitting model in the search space for the current (αi).
Finally, in steps (10) and (11), the best-fitting model is used as a predictive model to estimate the
goodness-of-fit for the chosen optimal (λ).
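The following R sketch illustrates the cross validation branch just described. It is an approximation
written for exposition, not the authors' implementation; the feature-subset loop is omitted, the data
objects (x, y) are assumed to exist, and for simplicity the error is computed in-sample.

    library(glmnet)

    alpha.vector <- seq(0.0, 1.0, by = 0.1)   # step (1): grid of alpha values
    cv.results <- list()

    for (a in alpha.vector) {                 # step (2): iterate over alpha
      # steps (7)-(8): 3-fold cross validation over the lambda path for this alpha
      cvfit <- cv.glmnet(x, y, alpha = a, nfolds = 3)
      best.lambda <- cvfit$lambda.min         # lambda with the smallest CV error
      # steps (9)-(11): predictions from the best-fitting model and its error
      pred <- predict(cvfit, newx = x, s = "lambda.min")
      cv.results[[as.character(a)]] <- mean((y - pred)^2)
    }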
B. Algorithm 2
Algorithm (2) starts in step (1) by defining the vector of (α) values used in the experiments, which
ranges over [0.0, 1.0], where (0.0) penalizes for the lasso estimate and (1.0) penalizes for ridge
regression. Then, in step (2), the algorithm iterates over each value of the α vector for the dataset
under investigation in step (3). After that, the sample size parameter is determined in step (4) in order
to be used in the BIC score function later.
The algorithm in step (5) iterates over the length of the feature space to experiment with each feature
(step (7)) in order to find the optimal subset of features from the remaining features in the dataset.
To do that, first, in step (8), all pre-defined (λ) values are used to fit all possible models in the
search space for a particular (α). Then, the algorithm in step (9) extracts the exact (λ) values used to
generate all models in the search space. After that, in step (10), the algorithm iterates over each value
of (λ.vector) to generate a model (step (11)), predict from the fitted model (step (12)), calculate the
prediction error (step (13)), and then calculate the number of features found in the model (step (14)),
which is determined by the number of nonzero parameters (s) in the model. After that, the algorithm
calculates the BIC score as in step (15). The BIC scores of all models are stored in a vector (step (16))
so that the best BIC score can be returned as in step (18). The best BIC score is then used to return the
best model as in step (19), and finally the prediction accuracy of the chosen model is calculated in
steps (20) and (21).
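A sketch of this BIC branch in R is given below. The exact BIC formula used by the authors is not stated
in the text, so the sketch assumes the usual Gaussian form BIC = n·log(RSS/n) + s·log(n), where s is the
number of nonzero coefficients; all object names (x, y, a) are illustrative and assumed to exist.

    library(glmnet)

    n <- nrow(x)                                   # step (4): sample size for BIC
    fit <- glmnet(x, y, alpha = a, nlambda = 100)  # step (8): fit the whole lambda path
    lambda.vector <- fit$lambda                    # step (9): lambda values actually used

    bic.vector <- numeric(length(lambda.vector))
    for (j in seq_along(lambda.vector)) {          # step (10): iterate over lambda
      pred <- predict(fit, newx = x, s = lambda.vector[j])   # steps (11)-(12)
      rss  <- sum((y - pred)^2)                               # step (13): prediction error
      s    <- sum(coef(fit, s = lambda.vector[j])[-1] != 0)   # step (14): nonzero features
      bic.vector[j] <- n * log(rss / n) + s * log(n)          # step (15): assumed BIC form
    }

    best.lambda <- lambda.vector[which.min(bic.vector)]       # steps (16)-(19)

The model attaining the smallest BIC would then be used for the prediction accuracy computation of
steps (20) and (21).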
5. RESULTS AND DISCUSSION
The algorithms described in the previous section were applied to the aforementioned 13 datasets. Table I
shows the results of applying the methodology described in Algorithm 1, and Table II shows the results of
applying the methodology described in Algorithm 2. The range of (α) values from 0.0 to 1.0 was used in
order to experiment with the lasso estimate (when α=0.0), ridge regression (when α=1.0), and elastic net
regularization methods. When the datasets were used with the methodology in Algorithm 1, the cross
validation score function did not show one (α) to be significantly better than another in terms of
prediction accuracy, except for the cell-cycle dataset when α = {0.7, 0.8, 0.9, 1.0} and for the MAPK
dataset when α = {0.6, 0.7, 0.8, 0.9, 1.0}, for which the prediction accuracy was the worst among the α
values. Hence the lasso estimate worked better than ridge regression and the elastic net regularization
methods for these particular datasets. When the datasets were used with the methodology in Algorithm 2
(Table II), the BIC score function showed prediction accuracies similar to the cross validation score
function of Algorithm 1 across the different values of (α), except for 6 datasets for which the BIC score
function gave a better prediction accuracy than cross validation. These datasets are: equities-dataset5
when α = {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, the cell-cycle dataset when α = {0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
1.0}, the MAPK dataset when α = {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, the JAKSTAT1 dataset when α = {0.3,
0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, breast cancer dataset1 when α = {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
and breast cancer dataset2 when α = {0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, where the prediction accuracy was
almost (0.0). From Table I, the lasso estimate, ridge regression, and elastic net regularization methods
tend to work similarly except for the cell-cycle and MAPK datasets, for which ridge regression
outperformed the lasso estimate and elastic net regularization methods. In Table II, the lasso estimate
and elastic net regularization methods show better results than ridge regression in 47% of the datasets
but still work similarly in the other datasets. Looking more closely at the score functions used in the
comparison, the average prediction accuracies for all datasets across all values of α were considered, and
it can be seen that the BIC score function outperformed the cross validation score function, as shown in
Figure 1.
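The aggregation behind Figure 1 can be expressed compactly. The sketch below assumes matrices of
prediction accuracies with one row per dataset and one column per α value; these object names are
illustrative, not the authors' code.

    # accuracy.bic, accuracy.cv: 13 x 11 matrices (datasets x alpha values), assumed to exist
    alphas  <- seq(0, 1, by = 0.1)
    avg.bic <- colMeans(accuracy.bic)   # average accuracy per alpha under the BIC score function
    avg.cv  <- colMeans(accuracy.cv)    # average accuracy per alpha under cross validation
    plot(alphas, avg.bic, type = "b", xlab = "alpha", ylab = "average prediction accuracy")
    lines(alphas, avg.cv, type = "b", lty = 2)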
Fig. 1. Final prediction accuracy for all datasets vs. alpha values: the average prediction accuracy for
all datasets across all values of α. The figure shows that the BIC score function outperformed the cross
validation score function.
6. CONCLUSION
The study in this paper focused on how ridge regression, the lasso estimate, and elastic net
regularization methods behave in terms of prediction accuracy when wrapped with the BIC and cross
validation score functions on 13 datasets of different dimensionality.
The results clearly show that the performance of a single regularizer is subject to the dataset under
investigation, which makes the prediction accuracy differ accordingly. The results also show that the
lasso estimate and elastic net regularization methods perform better compared with ridge regression; a
likely explanation is that ridge regression includes more irrelevant features in the chosen model than the
lasso estimate and elastic net, which decreases prediction accuracy.
REFERENCES
[1] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Wiley, New York, 1977.
[2] Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267–288, 1996.
[3] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[4] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft.
Privacy-preserving ridge regression on hundreds of millions of records. In IEEE Symposium on
Security and Privacy, pages 334–348. IEEE Computer Society, 2013.
[5] Bonsang Koo and Byungjin Shin. Using ridge regression to improve the accuracy and interpretation of
the hedonic pricing model: Focusing on apartments in Guro-gu, Seoul. Volume 16, pages 77–85. Korean
Institute of Construction Engineering and Management, 2015.
[6] C. B. García, J. García, M. M. López Martín, and R. Salmerón. Collinearity: revisiting the variance
inflation factor in ridge regression. Journal of Applied Statistics, 42:648–661, 2015.
[7] Adel Aloraini. Ensemble feature selection methods for a better regularization of the lasso estimate
in p>>n gene expression datasets. In Proceedings of the 12th International Conference on Machine Learning
and Applications, pages 122–126, 2013.
[8] Qiao Liu, Xiao Ma, Weihua Ou, and Quan Zhou. Visual object tracking with online sample selection
via lasso regularization. Signal, Image and Video Processing, 11(5):881–888, 2017.
[9] Sacha Epskamp, Joost Kruis, and Maarten Marsman. Estimating psychopathological networks: Be careful
what you wish for. PLoS ONE, 12, 2017.
[10] Pengwei Chen, Shun Tao, Xiangning Xiao, and Lu Li. Uncertainty level of voltage in distribution
network: an analysis model with elastic net and application in storage configuration. IEEE Transactions
on Smart Grid, 2016.
[11] Kazim Topuz, Hasmet Uner, Asil Oztekin, and Mehmet Bayram Yildirim. Predicting pediatric clinic
no-shows: a decision analytic framework using elastic net and Bayesian belief network. Annals of
Operations Research, 2017.
[12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
[13] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for Cox's
proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5):1–13, 2011.
