Francesco Casalegno
Ordinal Regression
→ Learning to Rank
→ Ordinal Binary Decompositions
→ Threshold Models
→ Metrics for Ordinal Regression
Examples
Francesco Casalegno – Ordinal Regression
Likert Scale
Given
Metadata (age, sex, nationality, …) and user’s past behavior (tweets, Facebook posts, ...).
Predict
User’s agreement with a given statement.
Disease Staging
Given
X-ray imaging of a patient’s tooth.
Predict
Caries stage from I to IV.
Stage I
early enamel lesion
Stage II
advanced enamel lesion
Stage III
dentin lesion
Stage IV
pulp lesion
*image from https://royaloakdentalclinic.com
Recommender Systems
Given
Metadata (age, sex, nationality, …) and the user’s feedback (ratings given to other movies, watched movies, ...).
Predict
User’s rating (in stars, from 1 to 5) of a movie the user has not seen or rated yet.
Sovereign Credit Ranking
Given
Time series measuring the economy of a country (GDP % change, public debt, …).
Predict
Sovereign credit rating, e.g. on the Moody’s, Standard & Poor’s, or Fitch Ratings scale.
Fundamental Concepts
Levels of Measurement
Stevens’ scale for levels of measurement
Understanding the nature of the target variable y is essential to choose which regression model to use.
1. Nominal (Binary / Multinomial Classification)
Items can be differentiated based on their name
Mushroom species: “portobello” ≠ “porcini” ≠ “shiitake” ≠ “morels”
2. Ordinal (Ordinal Regression)
Items can be compared and sorted
Likert scale: “strongly disagree” ≺ “disagree” ≺ “agree” ≺ “strongly agree”
3. Interval (Standard Regression)
Items are values on an affine scale, and they can be subtracted
Dates: “23-Oct-2019” - “16-Apr-2018” = “25-Dec-2004” - “19-Jun-2003” (= 555 days)
4. Ratio (Standard Regression)
Items represent amounts, and their ratio can be computed
Length: “300 m” / “150 m” = “1000 m” / “500 m” (= 2)
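The interval and ratio examples can be checked directly. The sketch below uses only Python's standard library; the variable names are illustrative and not part of the slides.

```python
from datetime import date

# Interval scale: dates have no natural zero, but differences are meaningful.
gap_recent = date(2019, 10, 23) - date(2018, 4, 16)
gap_older = date(2004, 12, 25) - date(2003, 6, 19)
print(gap_recent.days, gap_older.days)  # both gaps are 555 days

# Ratio scale: lengths have a natural zero, so ratios are meaningful too.
ratio_a = 300 / 150
ratio_b = 1000 / 500
print(ratio_a == ratio_b)
```

Note that no such operation is available for ordinal items: “agree” minus “disagree” has no defined value.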
Beyond Stevens’ scale
Stevens's typology is widely adopted, but more refined classifications are possible.
● categorical VS continuous
● unordered VS cyclical VS linear
● bounded VS unbounded
● …
Regression Types
● Nominal data (Classification) → y ∊ {c1, …, cK}
Facebook reactions: possible values are 👍, ❤, 😂, 😯, 😢, 😡
● Ordinal data (Ordinal Regression) → y ∊ {r1 ≺ r2 ≺ … ≺ rK}
User satisfaction: possible values are 😠 ≺ 😒 ≺ 😐 ≺ 😊 ≺ 😀
● Count data (Poisson Regression) → y ∊ ℕ
Items sold per month: possible values are 0, 1, 2, 3, ...
● Continuous data (Standard Regression) → y ∊ ℝ
GDP % variation: possible values are -10, 12.52, -0.74, ...
● Periodic data (Circular Regression) → y ∊ … → r1 → r2 → … → rK → r1 → r2 → ...
Day of the week: possible values are … → Mon → Tue → Wed → … → Sun → Mon → ...
Characteristics of Ordinal Regression
● The target variable y can take only 1 value in a discrete set of K ranks
Caries progression stage is a value in the set {I, II, III, IV}.
● There is an intrinsic linear strict order in the set of possible ranks
Caries of stage I is less severe than stage II, which is less severe than stage III etc.
● Distance between ranks is ill-defined, so we cannot map ranks to real values
We cannot compare the difference between stage IV and III to that between II and I.
● We want accurate (distance from truth) predictions
If the true stage is II, a prediction of stage I is better than a prediction of stage IV.
● We want consistent (rank order wrt truth) predictions
A model is not consistent if it predicts a stage I caries as stage III and a stage IV caries as stage II, since the predicted order is reversed.
Learning to Rank
A general term for methods that construct a ranking model from training data.
Applications are everywhere: recommender systems, information retrieval, ...
● Pointwise Ranking
Given x ∊ X, predict its relevance using ordinal labels.
● Pairwise Ranking
Given x1, x2 ∊ X, predict their relative order (i.e. x1 ≺ x2 or x2 ≺ x1).
● Listwise Ranking
Given a list {x1, …, xn} ⊂ X, predict a permutation that sorts it.
→ Ordinal regression models solve a pointwise ranking task.
→ Pointwise ranking models are much more scalable than pairwise and listwise models.
→ Pointwise ranking models can also solve pairwise and listwise ranking tasks (sort by predicted rank!).
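The last point can be illustrated with a minimal sketch. The item names and predicted ranks below are made up; in practice they would come from any pointwise ordinal regression model.

```python
# Hypothetical ranks (1 = least relevant, 5 = most relevant) predicted
# by some pointwise ordinal regression model; names and values are made up.
predicted_rank = {"movie_a": 2, "movie_b": 5, "movie_c": 4, "movie_d": 1}


def prefers(item_1, item_2):
    """Pairwise task: True if item_1 is predicted as more relevant than item_2."""
    return predicted_rank[item_1] > predicted_rank[item_2]


def sort_by_relevance(items):
    """Listwise task: order items from most to least relevant."""
    return sorted(items, key=lambda item: predicted_rank[item], reverse=True)


ranking = sort_by_relevance(["movie_a", "movie_b", "movie_c", "movie_d"])
print(ranking)
```

Items with equal predicted rank keep their input order here, so ties are not resolved by the pointwise model itself.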
Ordinal Regression Methods
Methods Taxonomy
Naïve Methods
Main Idea
Ordinal Regression (aka Ordinal Classification) sits between Regression and Classification tasks:
● As in classification tasks, the target is discrete
● As in regression tasks, the domain of the target is ordered
Naïve approaches treat the problem using models for one of these tasks.
These methods simplify the problem, but they can still give good results because they inherit the performance of very well-tuned models.
Regression-based approach
Convert classes into numerical values corresponding to their rank (e.g. stage III → 3.0).
Drawbacks
● Metric distance between ranks is undefined, so the numerical mapping is completely arbitrary.
● Model’s predictions are real numbers, so mapping back to classes is also arbitrary (clip + round?)
Classification-based approach
Treat the classes as nominal, unrelated classes (i.e. ignore their order).
Drawbacks
● The model treats all errors equally: predicting stage III for a true stage IV is penalized exactly as much as predicting stage I.
● Consistency between the ranks of the predicted classes and those of the true classes is not taken into account.
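A minimal sketch of the regression-based mapping and of one possible "clip + round" inverse. The helper names are hypothetical, and both the equally spaced encoding and the round-half-up rule are arbitrary choices, which is exactly the drawback listed above.

```python
import math

STAGES = ["I", "II", "III", "IV"]  # ordered ranks r_1 < ... < r_K


def stage_to_value(stage):
    """Arbitrary numerical encoding: the k-th rank becomes the real value k."""
    return float(STAGES.index(stage) + 1)


def value_to_stage(value):
    """Map a real-valued prediction back to a rank: round, then clip to [1, K]."""
    k = math.floor(value + 0.5)        # round half up (an arbitrary choice)
    k = min(max(k, 1), len(STAGES))    # clip predictions that fall outside the scale
    return STAGES[k - 1]


print(stage_to_value("III"))
print(value_to_stage(2.6), value_to_stage(0.2), value_to_stage(7.3))
```

Any standard regressor can be trained on the encoded targets; only the two mapping functions are specific to the ordinal setting.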
Ordinal Binary Decomposition Methods
Main Idea
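Convert the original ordinal regression problem with label $y \in \{r_1 \prec \dots \prec r_K\}$ into K - 1 binary classification problems, where the k-th problem is to predict the target binary label $y_k = \mathbb{1}[y \succ r_k]$.
From ordinal labels to binary vectors
Convert each original label y into the binary vector $(y_1, \dots, y_{K-1})$ of length K - 1 and train models for the K - 1 binary classification tasks.
Example. Assume K = 4 (e.g. caries stages):
stage I → (0, 0, 0)
stage II → (1, 0, 0)
stage III → (1, 1, 0)
stage IV → (1, 1, 1)
From predicted probabilities to predicted rank
Once we have trained the K - 1 models, we obtain the vector of probabilities $(p_1, \dots, p_{K-1})$, where $p_k \approx P(y \succ r_k \mid x)$ is the predicted probability computed by the k-th model.
Then, the rank-wise probabilities are easily recovered as
$P(y = r_1 \mid x) = 1 - p_1$, $P(y = r_k \mid x) = p_{k-1} - p_k$ for $1 < k < K$, $P(y = r_K \mid x) = p_{K-1}$,
and we predict the rank $\hat{y} = r_q$ with $q = \arg\max_k P(y = r_k \mid x)$.
Notice that the K - 1 tasks are treated independently, so for some k we may have $p_k > p_{k-1}$ (which makes the recovered $P(y = r_k \mid x)$ negative): in any case we can still compute the arg max and the predicted rank.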
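The encoding and decoding steps can be sketched as follows. This assumes the k-th binary label is 1 when the true rank is above r_k and that the k-th model estimates P(y ≻ r_k | x); the function names and the probability values are illustrative only.

```python
def encode(rank, K):
    """Rank index in 1..K -> K-1 binary targets, the k-th being 1 if rank > k."""
    return [1 if rank > k else 0 for k in range(1, K)]


def rank_probabilities(p):
    """p[k-1] estimates P(y > r_k); returns [P(y = r_1), ..., P(y = r_K)]."""
    K = len(p) + 1
    probs = [1.0 - p[0]]
    probs += [p[k - 1] - p[k] for k in range(1, K - 1)]
    probs.append(p[-1])
    return probs


def predict_rank(p):
    """Index (1..K) of the rank with the largest recovered probability."""
    probs = rank_probabilities(p)
    return 1 + max(range(len(probs)), key=probs.__getitem__)


print(encode(3, K=4))                 # binary targets for stage III
print(predict_rank([0.9, 0.6, 0.1]))  # monotone outputs of the 3 models
print(predict_rank([0.4, 0.7, 0.2]))  # non-monotone outputs still give a rank
```

With non-monotone inputs some recovered "probabilities" are negative, so they should only be used for the arg max and not interpreted as a distribution.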
Two different approaches are possible for ordinal binary decomposition methods.
Multiple Model Approach
This is the simplest approach: a separate model (decision tree, neural network, SVM, …) is trained for each of the K - 1 binary classification tasks.
Multitask Learning
This approach uses a single neural network with K - 1 independent output units with sigmoid activations, i.e.
$p_k = \sigma(w_k \cdot h_\theta(x) - b_k)$ for $k = 1, \dots, K - 1$,
where $h_\theta(x)$ is the output of the second-to-last layer of the neural network.
The network is trained on the K - 1 tasks at the same time, using as loss function the sum of the K - 1 individual binary cross-entropy losses between $p_k$ and $y_k$.
This approach is not only faster than training K - 1 separate models, but it may also give better predictions, because learning the K - 1 probabilities jointly gives the model a more global view of the problem.
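A rough sketch of the multitask output layer and its loss for a single example, in plain Python. It assumes each output unit has its own weight vector w_k and bias b_k (written here with a minus sign) on top of shared hidden features h; all names and numbers are illustrative, and a real implementation would use a deep learning framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_probabilities(h, W, b):
    """K-1 sigmoid units on shared hidden features h (one weight vector per unit)."""
    return [
        sigmoid(sum(w_i * h_i for w_i, h_i in zip(w_k, h)) - b_k)
        for w_k, b_k in zip(W, b)
    ]


def multitask_loss(p, y_bin):
    """Sum of the K-1 binary cross-entropy losses between p_k and y_k."""
    return -sum(
        y * math.log(q) + (1 - y) * math.log(1 - q) for q, y in zip(p, y_bin)
    )


h = [0.0, 0.0]                               # hidden features of one input (made up)
W = [[1.0, -2.0], [0.5, 0.5], [2.0, 1.0]]    # K - 1 = 3 weight vectors
b = [0.0, 0.0, 0.0]
p = output_probabilities(h, W, b)
loss = multitask_loss(p, [1, 1, 0])          # binary targets of a rank-3 example
print(p, loss)
```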
Threshold Methods
Main Idea
In many cases the ordinal labels come from the discretization of a continuous latent variable. Threshold methods learn to predict the value of this latent variable together with the thresholds b1 < … < bK-1 that discretize it into ordered classes.
Proportional Odds Model
The simplest assumption is that the latent variable is a linear function of the input, $y^* = w \cdot x$. This can be seen as a simple extension of logistic regression, where instead of estimating
$P(y = 1 \mid x)$ by $\sigma(w \cdot x - b)$
we want to estimate
$P(y \succ r_k \mid x)$ by $\sigma(w \cdot x - b_k)$ for $k = 1, \dots, K - 1$.
Notice that as long as b1 < … < bK-1, threshold methods are guaranteed to be consistent, i.e. $P(y \succ r_1 \mid x) \ge \dots \ge P(y \succ r_{K-1} \mid x)$.
The name of this approach comes from the fact that the odds satisfy the proportional ratio
$\frac{\mathrm{odds}(y \succ r_k \mid x_1)}{\mathrm{odds}(y \succ r_k \mid x_2)} = \exp(w \cdot (x_1 - x_2))$, which does not depend on k.
Cumulative Link Model
The previous approach can be generalized by modelling the latent variable with a neural network:
$P(y \succ r_k \mid x) = \sigma(w \cdot h_\theta(x) - b_k)$,
where $h_\theta(x)$ is the output of a neural network. This model is very similar to multitask learning and can be trained in the same way; notice, however, that here the weights w are shared across tasks and only the threshold $b_k$ varies.
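A small numerical check of the two properties above, assuming the model P(y ≻ r_k | x) = σ(w·x − b_k) with ordered thresholds. The weights, thresholds and inputs are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cumulative_probs(x, w, b):
    """P(y > r_k | x) for k = 1..K-1 under the assumed proportional odds model."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return [sigmoid(score - b_k) for b_k in b]


def odds(p):
    return p / (1.0 - p)


w = [0.8, -0.5]         # shared weights (made up)
b = [-1.0, 0.0, 1.5]    # ordered thresholds b_1 < b_2 < b_3
x1, x2 = [1.0, 2.0], [3.0, 1.0]

p1 = cumulative_probs(x1, w, b)
p2 = cumulative_probs(x2, w, b)

# Consistency: the cumulative probabilities decrease with k.
is_decreasing = all(p1[k] > p1[k + 1] for k in range(len(p1) - 1))

# Proportional odds: the odds ratio between x1 and x2 is the same for every k.
odds_ratios = [odds(a) / odds(c) for a, c in zip(p1, p2)]
print(is_decreasing, odds_ratios)
```

Since odds(σ(z)) = exp(z), each ratio equals exp(w·x1 − w·x2) and the threshold b_k cancels out.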
From predicted probabilities to predicted rank
Once we have trained our model, we obtain the vector of probabilities $(p_1, \dots, p_{K-1})$ with $p_k = P(y \succ r_k \mid x)$, where the values are guaranteed to be decreasing, assuming b1 < … < bK-1.
The predicted rank can be computed in a way that generalizes binary classification.
For binary classification, we first estimate the probability $p = P(y = 1 \mid x)$ and then we predict the label $\hat{y} = \mathbb{1}[p > \tau]$ for some threshold τ (usually τ = 0.5).
For threshold methods we first estimate the probabilities $p_1, \dots, p_{K-1}$ and then we predict the rank $\hat{y} = r_q$, where $q = 1 + \sum_{k=1}^{K-1} f_k$ and $f_k = \mathbb{1}[p_k > \tau]$ for some threshold τ (usually τ = 0.5).
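This decision rule is short enough to sketch directly. It assumes the predicted rank index is one plus the number of cumulative probabilities above τ; the function name and the example values are illustrative.

```python
def rank_from_cumulative_probs(p, tau=0.5):
    """Rank index in 1..K: one plus the number of probabilities p_k above tau."""
    return 1 + sum(1 for p_k in p if p_k > tau)


print(rank_from_cumulative_probs([0.9, 0.6, 0.1]))           # K = 4
print(rank_from_cumulative_probs([0.9, 0.6, 0.1], tau=0.7))  # stricter threshold
print(rank_from_cumulative_probs([0.7]))                     # K = 2: binary case
```

With K = 2 the rule reduces to ordinary binary classification: the single probability is compared with τ and the higher rank is predicted only when it exceeds the threshold.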
Ordinal Regression Metrics
Ordinal Regression Metrics: Ideal Properties
Identity of indiscernibles
A metric should have the same value for two confusion matrices M1, M2 iff M1 = M2.
Nominal Errors
A metric should capture the nominal classification error.
Quantification of Divergence from Truth
A metric should capture how much the predictions diverge from the truth.
Ranking Consistency
A metric should capture how inconsistent the classifier is w.r.t. the relative order of the ranks.
Metrics for Ordinal Regression
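Classification Metrics
● Accuracy = $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
Rank Correlation Metrics
● Kendall τb
● Spearman ρS
Regression Metrics (with $k_i$ and $\hat{k}_i$ the indices of the true and predicted ranks)
● Mean Absolute Error = $\frac{1}{N} \sum_{i=1}^{N} |\hat{k}_i - k_i|$
● Mean Square Error = $\frac{1}{N} \sum_{i=1}^{N} (\hat{k}_i - k_i)^2$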
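A minimal pure-Python sketch of these metrics, with ranks encoded by their index 1..K. The Kendall τb implementation uses the usual tie-corrected definition, and the toy labels are made up to show how the metrics disagree.

```python
import math


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def kendall_tau_b(x, y):
    """Kendall tau-b; undefined (division by zero) if x or y is constant."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: counted in neither term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + ties_x) * (untied + ties_y))


y_true = [1, 2, 3, 4]
near_miss = [2, 2, 3, 4]   # one error, off by one rank
far_miss = [4, 2, 3, 4]    # one error, off by three ranks

for name, y_pred in [("near", near_miss), ("far", far_miss)]:
    print(name, accuracy(y_true, y_pred), mean_absolute_error(y_true, y_pred),
          mean_squared_error(y_true, y_pred), kendall_tau_b(y_true, y_pred))
```

Both predictions have the same accuracy, while MAE, MSE and τb all penalize the far miss more heavily.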
Ordinal Classification Index
● $OC_\beta^\gamma$, computed from the confusion matrix M
- path = sequence of neighboring entries $M_{mn}$, from $M_{11}$ to $M_{KK}$
- consistent path = each step $M_{mn} \to M_{m'n'}$ in the path has m ≤ m’ and n ≤ n’
- OC index = minimum, over consistent paths, of [1 - (normalized sum of the entries $M_{mn}$ in the path) + β · (deviation of the path from the main diagonal)]
Note: If β ≥ 1, the optimal path is the diagonal, and $OC_\beta^\gamma$ = 1 - Acc!
Metrics for Ordinal Regression — Example
[Figure: confusion matrices of four example classifiers A, B, C, D]
● A is the perfect classifier
● B and C are more consistent than D w.r.t. the relative order of predicted VS true ranks
● B and D are more accurate than C w.r.t. the distance between predicted and true ranks
● Acc only measures correct/wrong classes → cannot distinguish B, C, D
● MAE and MSE only measure the distance from the correct rank → cannot distinguish B, D
● ρS and τb only measure relative ranking → cannot distinguish A, B
● $OC_\beta^\gamma$ consistently ranks A > B > {C, D} and satisfies the identity of indiscernibles for any β
○ small β → more weight to ranking consistency → C > D
○ large β → more weight to absolute classification → D > C
Conclusions
1. Understanding the nature of the target y is essential to build a good estimator for any predictive model: regression, classification, ordinal regression, Poisson regression, circular regression, ...
2. Ordinal regression is similar to classification in that the labels {r1, …, rK} are categorical, and similar to regression in that the labels are ordered, r1 ≺ … ≺ rK (although the distance between ranks is undefined!).
3. Three main approaches are possible when solving ordinal regression problems:
a. naïve methods: convert the problem into standard regression or classification
b. ordinal binary decomposition: convert the problem into K - 1 binary classification tasks
c. threshold methods: fit a continuous latent variable and K - 1 ordered thresholds
4. An ideal ordinal regression model should give predictions that are both accurate (MSE, MAE, …) and consistent (Spearman's ρ, Kendall’s τ, ...).
Predictions should be evaluated using properly chosen metrics, e.g. the Ordinal Classification Index.

Ordinal Regression and Machine Learning: Applications, Methods, Metrics
