Demystifying the Bias-Variance Tradeoff
Ashwin Rao
August 1, 2017
1 Motivation and Overview
The Bias-Variance Tradeoff is perhaps the most important concept to learn for any student
getting initiated in Machine Learning. Unfortunately, it is not appreciated adequately
by many students who get caught up in the mechanics of advanced Machine Learning
models/algorithms and don’t realize that many of the pitfalls in the models they build are
due to either too much Bias or too much Variance. In this note, I will explain the tradeoff
by highlighting the probabilistic elements in the derivation of the formula that governs it,
explain how to interpret it, and finally introduce the concept of Capacity, which plays a
key role in actually “playing the tradeoff”.
2 Understanding the Probabilistic Aspects
I think the crux of the Bias-Variance tradeoff is lost on many students because the prob-
abilistic aspects of the setting under which the tradeoff operates are not explained properly
in most textbooks or by teachers. Let us consider a supervised Machine Learning problem
where the set of features is denoted by X and the supervisory variable is denoted by Y .
To understand the setting intuitively, let us look at a simple example: Say X consists of
the Age, Gender and Country of a person and Y is the Height of the person. Here, we
are in the business of predicting the Height from the value of the (Age, Gender, Country)
3-tuple, but we need to understand that for a fixed (Age, Gender, Country) tuple, the Height
(as seen in the data) will be spread over a range. Hence, we talk about Height as a probability
distribution conditional on the value of the (Age, Gender, Country) tuple. This is really
important to understand: Y given X (denoted Y | X) is a random variable. Note
that since this is conditional on X, we need to treat the conditional probability distribution of Y | X
as a function of X (meaning the probability distribution of Y depends on X).
In this setting, we denote the Expectation and Variance of the conditional random
variable Y | X as $\mu(X)$ and $\sigma^2(X)$ respectively. Now let’s say we build a model with
training data set T whose predicted value (of the supervisory variable) is denoted as $\hat{Y} = \hat{f}_T(X)$.
The subscript T is important - it means that the model prediction function $\hat{f}$
depends on the training data set T. The intuition here is that we aim to get $\hat{f}_T(X)$
reasonably close to $\mu(X)$, i.e., we hope that our model’s prediction for a given X will be
close to the conditional expectation of Y given X.
The Bias-Variance tradeoff is a statement about expectations under two different (and
independent) sources of randomness:
• The randomness associated with Y conditioned on X. We will refer to this source of
randomness by subscripting with the notation Y | X.
• The randomness associated with the choice of training data set T, which in turn results
in randomness in the model-prediction function $\hat{f}_T$. We will refer to this source of
randomness by subscripting with the notation T.
Note that these two sources of randomness, Y | X and T, are independent.
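To make these two sources of randomness concrete, here is a minimal Python sketch of the setting. Everything in it - the one-dimensional feature, the choices of $\mu(x)$ and $\sigma(x)$, the training set size - is a made-up illustration, not part of the formal setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    # Assumed conditional expectation E[Y | X = x] (illustrative choice).
    return np.exp(x)

def sigma(x):
    # Assumed conditional standard deviation of Y | X = x (constant here).
    return 0.5 * np.ones_like(x, dtype=float)

def sample_y(x):
    # Source of randomness #1: a draw of Y | X = x.
    return mu(x) + sigma(x) * rng.normal(size=np.shape(x))

def sample_training_set(n=30):
    # Source of randomness #2: a draw of the training data set T.
    x = np.sort(rng.uniform(0.0, 2.0, size=n))
    return x, sample_y(x)
```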
3 Expected Prediction Error for a Test Data Point
The Expected Prediction Error (EPE) of the model on a test data point (x, y) is defined
as:
$$\begin{aligned}
EPE_{(Y|X),T}(x) &= E_{(Y|X),T}\big[(\hat{f}_T(x) - y)^2\big] \\
&= E_{(Y|X),T}\big[\hat{f}_T^2(x) + y^2 - 2 \cdot \hat{f}_T(x) \cdot y\big] \\
&= E_T\big[\hat{f}_T^2(x)\big] + E_{Y|X}\big[y^2\big] - 2 \cdot E_{(Y|X),T}\big[\hat{f}_T(x) \cdot y\big]
\end{aligned}$$
Note that:
$$E_{Y|X}[y^2] = \mu^2(x) + \sigma^2(x)$$
(by the definition of variance), and
$$E_{(Y|X),T}\big[\hat{f}_T(x) \cdot y\big] = E_T\big[\hat{f}_T(x)\big] \cdot E_{Y|X}[y] = E_T\big[\hat{f}_T(x)\big] \cdot \mu(x)$$
(because of independence of the two sources of randomness)
Hence,
$$EPE_{(Y|X),T}(x) = E_T\big[\hat{f}_T^2(x)\big] + \mu^2(x) + \sigma^2(x) - 2 \cdot E_T\big[\hat{f}_T(x)\big] \cdot \mu(x)$$
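Continuing the sketch above, we can sanity-check this formula by Monte Carlo at a single test point: draw many training sets T (fitting a simple straight-line model, an assumed choice), draw many values of y | x, and compare the directly estimated EPE with the closed-form right-hand side.

```python
x0 = 1.5          # a fixed test point
n_trials = 5000

preds = np.empty(n_trials)   # f_T(x0) across random training sets T
ys = np.empty(n_trials)      # draws of y | x0
for i in range(n_trials):
    x_tr, y_tr = sample_training_set()
    coeffs = np.polyfit(x_tr, y_tr, deg=1)   # the model: a straight line
    preds[i] = np.polyval(coeffs, x0)
    ys[i] = sample_y(x0)

epe_direct = np.mean((preds - ys) ** 2)
epe_formula = (np.mean(preds ** 2) + mu(x0) ** 2 + sigma(x0) ** 2
               - 2.0 * np.mean(preds) * mu(x0))
print(epe_direct, epe_formula)   # the two estimates should be close
```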
4 Bias and Variance
Before we state the definitions of Bias and Variance in precise equational language, let us
understand them intuitively. Both Bias and Variance refer to the probabilistic nature of
the model’s forecast for a given test data point x, with the probabilities governed by the
random choices in selecting the training data set T.
Bias of a model refers to the “gap” between:
• The model’s expected prediction of the supervisory variable (corresponding to the
given test data point x). Note that this expectation is over probabilistic choices of
training data set T.
• The expected value of the supervisory variable (corresponding to the given test data
point x) that actually manifests in the data. Note that this expectation is over the
probability distribution of Y given X (notationally, Y | X).
Variance of a model refers to the “expected squared deviation” of the model’s predic-
tion of the supervisory variable (corresponding to the given test data point x) around the
expected prediction of the supervisory variable. Note that this “expected squared devia-
tion” is over probabilistic choices of training data set T and is meant to measure the degree
of “fluctuation” (or you may want to call it “instability”) in the model’s prediction (due
to variations in the choice of the training data set T).
Now we precisely define the Bias and Variance of the model $\hat{f}_T$ when it makes a pre-
diction for the test data point x.
$$Bias_T(x) = E_T\big[\hat{f}_T(x)\big] - \mu(x)$$

$$Variance_T(x) = E_T\Big[\big(\hat{f}_T(x) - E_T[\hat{f}_T(x)]\big)^2\Big] = E_T\big[\hat{f}_T^2(x)\big] - \big(E_T[\hat{f}_T(x)]\big)^2$$

$$\begin{aligned}
Bias_T^2(x) + Variance_T(x) &= \big(E_T[\hat{f}_T(x)]\big)^2 + \mu^2(x) - 2 \cdot E_T\big[\hat{f}_T(x)\big] \cdot \mu(x) + E_T\big[\hat{f}_T^2(x)\big] - \big(E_T[\hat{f}_T(x)]\big)^2 \\
&= \mu^2(x) - 2 \cdot E_T\big[\hat{f}_T(x)\big] \cdot \mu(x) + E_T\big[\hat{f}_T^2(x)\big] \\
&= EPE_{(Y|X),T}(x) - \sigma^2(x)
\end{aligned}$$

In other words,

$$EPE_{(Y|X),T}(x) = Bias_T^2(x) + Variance_T(x) + \sigma^2(x)$$
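Using the same simulated predictions from the sketch above, we can verify numerically that this decomposition recovers the EPE (up to Monte Carlo error).

```python
bias = np.mean(preds) - mu(x0)     # E_T[f_T(x0)] - mu(x0)
variance = np.var(preds)           # E_T[(f_T(x0) - E_T[f_T(x0)])^2]
print(bias ** 2 + variance + sigma(x0) ** 2)   # close to epe_direct
```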
5 Interpreting the Tradeoff
So we can see that the Expected Prediction Error of a model (built from random training
data) on a test data point (x, y) (that is governed by the conditional randomness Y | X)
is composed of 3 parts:
• $\sigma^2(x)$: Remember that models are in the business of predicting $E[y | x] = \mu(x)$
whereas the EPE compares the model prediction to $y | x$ (not to $E[y | x]$), so the
conditional (on x) variance around $E[y | x]$ (i.e., $\sigma^2(x)$) will always be present in the EPE
and is not reducible by any model. So we will focus the rest of the discussion in this
note on how to interpret the remaining two terms, and finally how to control them
in “playing the tradeoff”.
• $Bias_T^2(x)$: This term has to do with the fact that the model $\hat{f}_T$ might not adequately
capture the complexity of the function defined by the conditional expectation of Y
given X. Under-parameterized models are too simple to capture the richer structure
of the data and suffer from high $Bias_T$. An example would be a simple linear regression
trying to capture the structure of data that has, say, an exponential relationship between
X and Y . On the other hand, a model that captures the essential complexity in the
relationship between X and Y will have low/no bias.
• $Variance_T(x)$: This term has to do with the fact that the randomness of (variability
in) the choice of the training data set T results in variability in the predictions of
the model $\hat{f}_T$. Over-parameterized models are essentially unstable models and suffer
from high $Variance_T$. An example would be a nearest neighbors model that has too
many degrees of freedom (too many parameters) trying to fit perfectly to the training
data. On the other hand, a model that doesn’t fit the training data too tightly will
have low variance.
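To make these two failure modes tangible, here is a continuation of the earlier sketch comparing an under-parameterized straight-line fit against a deliberately over-fitting 1-nearest-neighbor rule; both model choices are illustrative assumptions, not prescriptions.

```python
def predict_linear(x_tr, y_tr, x):
    # Under-parameterized: a straight-line fit.
    return np.polyval(np.polyfit(x_tr, y_tr, deg=1), x)

def predict_1nn(x_tr, y_tr, x):
    # Over-parameterized: 1-nearest-neighbor, one "parameter" per data point.
    return y_tr[np.argmin(np.abs(x_tr - x))]

for name, predict in [("linear", predict_linear), ("1-NN", predict_1nn)]:
    p = np.array([predict(*sample_training_set(), x0) for _ in range(2000)])
    bias2 = (np.mean(p) - mu(x0)) ** 2
    var = np.var(p)
    print(f"{name}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
# Expect the linear fit to show the larger bias^2, 1-NN the larger variance.
```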
6 Capacity
The key in building a good machine learning model is to find the right effective level of
parameterization that balances the two effects of Model Bias and Model Variance (trading
one against the other). Thankfully, we have a precise technical concept called Capacity
of a model that intuitively refers to the effective level of parameterization. You can also
think of Capacity as the “span” of functions that can be captured by the model (hence,
the term “Capacity”), or you can simply think of it as the “Model Complexity”. A linear
regression model’s “span” would be the set of all linear functions. A polynomial model’s
“span” would be the set of all polynomials (up to its specified degree). A nearest neighbor
model would have as much capacity as the set of training data points (since there is a
parameter for each data point), and hence it would typically have high capacity (assuming
we have plenty of training data points). Regularization is a technique which tones down
the Capacity.
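As a rough numerical illustration of regularization toning down Capacity (continuing the earlier sketch), here is a ridge-penalized polynomial fit; the degree and the penalty strengths are arbitrary choices for this toy example.

```python
def fit_ridge_poly(x_tr, y_tr, degree, lam):
    # Least-squares polynomial fit with an L2 penalty lam on the coefficients.
    A = np.vander(x_tr, degree + 1)   # polynomial design matrix
    return np.linalg.solve(A.T @ A + lam * np.eye(degree + 1), A.T @ y_tr)

degree = 9
for lam in [0.0, 0.1, 10.0]:
    # lam = 0.0 is the unregularized (ill-conditioned, high-variance) fit.
    p = np.array([
        (np.vander(np.array([x0]), degree + 1)
         @ fit_ridge_poly(*sample_training_set(), degree, lam))[0]
        for _ in range(2000)
    ])
    print(f"lam = {lam}: bias^2 = {(np.mean(p) - mu(x0)) ** 2:.3f}, "
          f"variance = {np.var(p):.3f}")
# Expect variance to fall (and bias^2 eventually to rise) as lam grows.
```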
But how does Capacity serve as a mechanism to actually play the Bias-Variance Trade-
off? Let’s say you start with a simple linear regression model. We know this has low
Capacity. Training data error would be large if the data itself is fairly non-linear. Now
let’s keep increasing the Capacity. We will find that training data error will keep reducing
because the richer structure of permitted functions will aim to fit the training data more
and more precisely. But there is no free lunch - the test data error (EPE) will increase
if you increase the Capacity too much. There will be an Optimal Capacity somewhere in
between where the EPE is the lowest. This is where you have found the right balance
between the Model Bias and Model Variance. If the Capacity is lower than this Optimal
Capacity, you run the risk of underfitting - high Model Bias and low Model Variance. If the
Capacity is higher than this Optimal Capacity, you run the risk of overfitting - low Model
Bias and high Model Variance. So, Capacity is a mechanism to play Model Bias against
Model Variance in an attempt to find the right balance.
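Here is a sketch of that sweep under the toy assumptions of the earlier code: the polynomial degree plays the role of the Capacity knob, and we compare the error on the training data with the error on freshly drawn test data.

```python
x_tr, y_tr = sample_training_set(n=30)    # one fixed training set
x_te = rng.uniform(0.0, 2.0, size=2000)   # fresh test data
y_te = sample_y(x_te)

for degree in range(1, 10):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train {train_err:.3f}, test {test_err:.3f}")
# Training error falls monotonically with degree; test error typically
# falls, bottoms out near the Optimal Capacity, then rises (overfitting).
```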
The other point to note is that Capacity has a connection with the number of training
data points. If you have a larger number of training data points, the Optimal Capacity
will tend to be larger (until the Optimal Capacity plateaus, as it eventually achieves
sufficient complexity to solve the problem).
This note did not go into the mathematical specification of the technical term Capacity
as I wanted to keep this note introductory, but mathematically advanced readers
are encouraged to understand the technical term Capacity by looking up the definition of
the Vapnik-Chervonenkis dimension and how it is used to bound the gap between the test data
error and the training data error. Sadly, the VC dimension is not very useful for practical
Machine Learning models, but it serves as a great metric to understand and appreciate the
significance of Capacity and its connection with the Bias-Variance Tradeoff.