Model Selection Techniques
- Swati (19013030)
Let’s start !
MODEL SELECTION TECHNIQUES
● Multiple models are fitted and evaluated, and the best one is chosen. The best approach to model selection requires
“sufficient” data, which may be nearly infinite depending on the complexity of the problem.
● There are three broad ways of selecting your ML model, two of which (resampling and probabilistic measures) come from the fields of sampling and probability.
1. Random Train/Test Split
● Random splits are used to randomly sample a percentage of the data into training, testing, and preferably validation sets.
● The data passed to the model is divided into train and test sets according to a chosen ratio. This can be achieved with the train_test_split function of the scikit-learn Python library (see the sketch below).
● Re-running the train_test_split code yields a different split on each run (unless a random seed is fixed), so you cannot be sure exactly how your model will perform on unseen data.
● The advantage of this method is that there is a good chance that the original population is well represented in all three sets. In more formal terms, random splitting helps prevent biased sampling of the data.
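A minimal sketch of such a split with scikit-learn's train_test_split; the synthetic dataset, the 60/20/20 train/validation/test ratio, and the fixed random_state (which makes the split reproducible) are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset; substitute your own X and y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then carve a validation set out of the remaining 80%
# (0.25 * 0.8 = 0.2), giving a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

# Without a fixed random_state, each run would produce a different split.
```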
2. Resampling
● In the resampling technique of model selection, the data is repeatedly resampled into train/test splits for a set of iterations, with training on the train set and evaluation on the test set in each iteration (a minimal loop is sketched below).
● A model chosen with this technique is assessed on its performance, not its complexity.
● Performance is computed on out-of-sample data: resampling techniques estimate the error by evaluating the model on out-of-sample, i.e. unseen, data.
● Resampling strategies include K-Fold, Stratified K-Fold, etc.
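One way to implement this repeated train/test resampling is scikit-learn's ShuffleSplit; the logistic-regression model, the 10 iterations, and the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, random_state=0)

# 10 random train/test resamples, each holding out 20% for evaluation.
rs = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = []
for train_idx, test_idx in rs.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # out-of-sample accuracy

print(f"mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```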
Types of Resampling
1. Time-Based Split
2. K-Fold Cross-Validation
3. Stratified K-Fold
4. Bootstrap
Time-Based Split
● There are some types of data for which random splits are not possible.
● For example, if we have to train a model for weather forecasting, we cannot randomly divide the data into training and testing sets: that would jumble up the seasonal pattern! Such data is often referred to as a time series.
● In such cases, a time-wise split is used (see the sketch below). The training set can have, say, the data for the last three years plus 10 months of the present year, and the last two months can be reserved for the testing or validation set.
● However, a drawback of time-series data is that the events or data points are not mutually independent: one event might affect every data point that follows it.
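scikit-learn's TimeSeriesSplit implements such a time-wise split: every training fold contains only observations that precede its test fold. The monthly dummy series below is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Pretend each row is one month of data, already in chronological order.
X = np.arange(48).reshape(-1, 1)  # four years of monthly observations
y = np.arange(48)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training always uses the past; testing uses the months that follow.
    print(f"train: months 0-{train_idx[-1]}, "
          f"test: months {test_idx[0]}-{test_idx[-1]}")
```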
K-Fold Cross-Validation
● The cross-validation technique works by randomly shuffling the dataset and then splitting it into k groups. Thereafter, iterating over the groups, each group in turn is treated as the test set while all the other groups are clubbed together into the training set.
● The model is tested on the test group, and the process continues for all k groups.
● Thus, by the end of the process, one has k different results on k different test groups (see the sketch below). The best model can then be selected easily by choosing the one with the highest average score.
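A sketch of k-fold cross-validation with scikit-learn, where shuffle=True performs the random shuffling described above; the classifier and the synthetic data are assumed stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Shuffle once, then split into k = 5 groups; each group serves as the
# test set exactly once while the other four form the training set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("per-fold scores:", np.round(scores, 3))
print("mean score:", round(scores.mean(), 3))
```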
Stratified K-Fold
● The process for stratified k-fold is similar to that of k-fold cross-validation, with a single point of difference: unlike k-fold cross-validation, stratified k-fold takes the values of the target variable into consideration.
● If, for instance, the target variable is a categorical variable with 2 classes, stratified k-fold ensures that each test fold contains roughly the same ratio of the two classes as the full dataset (see the sketch below).
● This makes the model evaluation more accurate and the model training less biased.
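A sketch with scikit-learn's StratifiedKFold; the imbalanced 90/10 binary target is an assumed example chosen to make the stratification visible:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary target: 90% class 0, 10% class 1 (illustrative).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold keeps the same 90/10 class mix as the full dataset.
    print("test fold class counts:", np.bincount(y[test_idx]))
```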
Bootstrap
● Bootstrap is one of the most powerful ways to obtain a stabilized model.
● The first step is to select a sample size (which is usually equal to the size of the original dataset). Then a data point is randomly selected from the original dataset and added to the bootstrap sample; after the addition, the point is put back into the original dataset. This process is repeated N times, where N is the sample size.
● It is therefore a resampling technique that creates the bootstrap sample by sampling data points from the original dataset with replacement.
● The model is trained on the bootstrap sample and then evaluated on all the data points that did not make it into the bootstrap sample; these are called the out-of-bag samples (see the sketch below).
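A sketch of one bootstrap round using scikit-learn's resample helper: draw N points with replacement, train on them, and evaluate on the out-of-bag points. The model and synthetic data are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=200, random_state=0)
n = len(X)

# Draw a bootstrap sample of size N with replacement.
idx = resample(np.arange(n), replace=True, n_samples=n, random_state=0)
oob = np.setdiff1d(np.arange(n), idx)  # points never drawn: out-of-bag

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
print(f"{len(oob)} out-of-bag points, "
      f"OOB accuracy: {model.score(X[oob], y[oob]):.3f}")
```

Averaging the out-of-bag score over many such rounds gives a stabilized estimate of model performance.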
3. Probabilistic Measures
● Probabilistic measures take into account not just the model performance but also the model complexity.
● Model complexity is a measure of the model’s ability to capture the variance in the data.
● For example, a highly biased model like the linear regression algorithm is less complex; a neural network, on the other hand, is very high in complexity.
● A fair disadvantage, however, lies in the fact that probabilistic measures do not consider the uncertainty of the models and have a tendency to select simpler models over more complex ones.
Types of Probabilistic Measures
1. Akaike Information Criterion (AIC): AIC is a measure of information loss (a worked sketch of AIC and BIC follows this list).
2. Bayesian Information Criterion (BIC): BIC penalizes the model for its complexity and is preferably used when the size of the dataset is not very small.
3. Minimum Description Length (MDL): MDL is the minimum number of bits required to represent the model.
4. Structural Risk Minimization (SRM): SRM tries to balance out the model’s complexity against its success at fitting the data.
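A minimal sketch comparing polynomial regression models by AIC and BIC, using the standard formulas AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L with the Gaussian maximum-likelihood fit; the synthetic data, the candidate degrees, and the parameter count are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)  # illustrative data

def aic_bic(degree):
    """Fit a polynomial of the given degree and score it with AIC/BIC."""
    X = PolynomialFeatures(degree, include_bias=False).fit_transform(x)
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    n = len(y)
    k = X.shape[1] + 2  # parameters: coefficients + intercept + noise variance
    log_l = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)  # Gaussian MLE
    return 2 * k - 2 * log_l, k * np.log(n) - 2 * log_l

for d in (1, 3, 9):
    aic, bic = aic_bic(d)
    print(f"degree {d}: AIC = {aic:.1f}, BIC = {bic:.1f}")  # lower is better
```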
Thank You !!
For more information, visit:
https://neptune.ai/blog/the-ultimate-guide-to-evaluation-and-selection-of-models-in-machine-learning