Ordinal Regression and Machine Learning: Applications, Methods, Metrics

Francesco Casalegno
Ordinal Regression
→ Learning to Rank
→ Ordinal Binary Decompositions
→ Threshold Models
→ Metrics for Ordinal Regression

Examples
Likert Scale
Given
Metadata (age, sex, nationality, …) and user’s past behavior (tweets, Facebook posts, ...).
Predict
User’s agreement with a given statement.

Examples
Disease Staging
Given
X-ray imaging of a patient’s tooth.
Predict
Caries stage from I to IV.
Stage I: early enamel lesion
Stage II: advanced enamel lesion
Stage III: dentin lesion
Stage IV: pulp lesion
*image from https://royaloakdentalclinic.com

Examples
Recommender Systems
Given
Metadata (age, sex, nationality, …) and user’s feedback (ratings given to other movies, watched movies, ...).
Predict
User’s rating (in stars, from 1 to 5) to a movie the user has not seen or rated yet.

Examples
Sovereign Credit Ranking
Given
Time series measuring the economy of a country (GDP % change, public debt, …).
Predict
Sovereign credit rating, e.g. on the Moody’s, Standard & Poor’s, or Fitch Ratings scale.

Levels of Measurements
Stevens’ scale for levels of measurements
Understanding the nature of the target variable y is essential for choosing which regression model to use.
1. Nominal → Binary / Multinomial Classification
Items can be differentiated based on their name
Mushroom species: “portobello” ≠ “porcini” ≠ “shiitake” ≠ “morels”
2. Ordinal → Ordinal Regression
Items can be compared and sorted
Likert scale: “strongly disagree” ≺ “disagree” ≺ “agree” ≺ “strongly agree”
3. Interval → Standard Regression
Items are values on an affine scale, and they can be subtracted
Dates: “23-Oct-2019” - “16-Apr-2018” = “25-Dec-2004” - “19-Jun-2003” (= 555 days)
4. Ratio → Standard Regression
Items represent amounts, and their ratio can be computed
Length: “300 m” / “150 m” = “1000 m” / “500 m” (= 2)
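The interval and ratio examples on this slide can be checked with a few lines of Python. This is only an illustrative sketch: it uses the standard `datetime` module, and the variable names are made up for the example.

```python
from datetime import date

# Interval scale: dates have no meaningful zero, but their differences do.
gap_a = date(2019, 10, 23) - date(2018, 4, 16)
gap_b = date(2004, 12, 25) - date(2003, 6, 19)
print(gap_a.days, gap_b.days)  # 555 555

# Ratio scale: lengths (in metres) have a true zero, so ratios are meaningful.
ratio_a = 300 / 150
ratio_b = 1000 / 500
print(ratio_a, ratio_b)  # 2.0 2.0
```

Ordinal items support neither operation: only comparisons such as “disagree” ≺ “agree” are meaningful.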

Levels of Measurements
Beyond Stevens’ scale
Stevens's typology is widely adopted, but more refined classifications are possible.
● categorical VS continuous
● unordered VS cyclical VS linear
● bounded VS unbounded
● …
Regression Types
● Nominal data → $y \in \{c_1, \dots, c_K\}$ (Classification)
Facebook reactions: possible values are 👍, ❤, 😂, 😯, 😢, 😡
● Ordinal data → $y \in \{r_1 \prec r_2 \prec \dots \prec r_K\}$ (Ordinal Regression)
User satisfaction: possible values are 😠 ≺ 😒 ≺ 😐 ≺ 😊 ≺ 😀
● Count data → $y \in \mathbb{N}$ (Poisson Regression)
Items sold per month: possible values are 0, 1, 2, 3, ...
● Continuous data → $y \in \mathbb{R}$ (Standard Regression)
GDP % variation: possible values are -10, 12.52, -0.74, ...
● Periodic data → $y \in \{\dots \to r_1 \to r_2 \to \dots \to r_K \to r_1 \to r_2 \to \dots\}$ (Circular Regression)
Day of the week: possible values are … → Mon → Tue → Wed → … → Sun → Mon → ...

Characteristics of Ordinal Regression
● The target variable y can take only 1 value in a discrete set of K ranks
Caries progression stage is a value in the set {I, II, III, IV}.
● There is an intrinsic linear strict order in the set of possible ranks
Caries of stage I is less severe than stage II, which is less severe than stage III etc.
● Distance between ranks is ill-defined, so we cannot map ranks to real values
We cannot compare the difference between stage IV and III to that between II and I.
● We want accurate (distance from truth) predictions
If the true stage is II, a prediction of stage I is better than a prediction of stage IV.
● We want consistent (rank order wrt truth) predictions
A model is not consistent if it reverses the true order, e.g. it predicts a stage I caries as stage III and a stage IV caries as stage II.

Learning to Rank
Learning to rank is a general term for methods that build a ranking model from training data.
Applications are everywhere: recommender systems, information retrieval, ...
● Pointwise Ranking
Given $x \in X$, predict its relevance using ordinal labels.
● Pairwise Ranking
Given $x_1, x_2 \in X$, predict the relative order (i.e. $x_1 \prec x_2$ or $x_2 \prec x_1$).
● Listwise Ranking
Given a list $\{x_1, \dots, x_n\} \subset X$, predict a permutation that sorts it.
→ Ordinal regression models solve a pointwise ranking task
→ Pointwise ranking models are much more scalable than pairwise and listwise models.
→ Pointwise ranking models can also solve pairwise and listwise ranking tasks (sort by predicted rank!)
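As a rough illustration of the last point, the sketch below reuses pointwise predictions to answer pairwise and listwise queries. The item names and predicted ranks are hypothetical, and so are the helper names `pairwise_order` and `listwise_order`.

```python
# Hypothetical pointwise predictions: item -> predicted rank (1 = worst, 5 = best).
predicted_rank = {"movie_a": 2, "movie_b": 5, "movie_c": 3, "movie_d": 4}

def pairwise_order(item_1, item_2):
    """Return the pair sorted so that the item with the lower predicted rank comes first."""
    return tuple(sorted((item_1, item_2), key=predicted_rank.get))

def listwise_order(items):
    """Return the items sorted from the highest to the lowest predicted rank."""
    return sorted(items, key=predicted_rank.get, reverse=True)

print(pairwise_order("movie_b", "movie_c"))
print(listwise_order(["movie_a", "movie_b", "movie_c", "movie_d"]))
```

Items with the same predicted rank stay in their input order, so a pointwise model alone cannot break such ties.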

Methods Taxonomy

Naïve Methods
Main Idea
Ordinal Regression (aka Ordinal Classification) sits between Regression and Classification tasks:
● As in classification tasks, the target is discrete
● As in regression tasks, the domain of the target is ordered
Naïve approaches treat the problem using models for one of these tasks.
These methods simplify the problem, but they can still give good results because they reuse mature, well-tuned models.
Regression-based approach
Convert classes into numerical values corresponding to their rank (e.g. stage III → 3.0).
Drawbacks
● Metric distance between ranks is undefined, so the numerical mapping is completely arbitrary.
● Model’s predictions are real numbers, so mapping back to classes is also arbitrary (clip + round?)
Classification-based approach
Treat the classes as nominal, unrelated classes (i.e. ignore their order).
Drawbacks
● The model treats all errors equally: predicting stage III for a true stage IV is penalized as much as predicting stage I.
● Consistency of the predicted ranks wrt the true ranks is not taken into account.
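A minimal sketch of both naïve approaches, assuming the four caries stages as ranks. The mapping stage III → 3.0 and the clip-and-round rule come from the slide; the function names and the round-half-up choice are illustrative assumptions.

```python
RANKS = ["I", "II", "III", "IV"]  # caries stages, in increasing severity

def rank_to_value(rank):
    """Regression-based approach: arbitrary numerical mapping, e.g. stage III -> 3.0."""
    return float(RANKS.index(rank) + 1)

def value_to_rank(value):
    """Map a real-valued prediction back to a stage by clipping, then rounding half up."""
    clipped = min(max(value, 1.0), float(len(RANKS)))
    return RANKS[int(clipped + 0.5) - 1]

def zero_one_loss(true_rank, predicted_rank):
    """Classification-based approach: every wrong class costs the same."""
    return 0 if true_rank == predicted_rank else 1

print(value_to_rank(2.7), value_to_rank(-3.2), value_to_rank(9.0))
print(zero_one_loss("IV", "III"), zero_one_loss("IV", "I"))
```

The second line of output shows the drawback of the classification view: a near miss (III for IV) and a far miss (I for IV) get the same loss.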

Ordinal Binary Decomposition Methods
Main Idea
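Convert the original ordinal regression problem with label $y \in \{r_1 \prec \dots \prec r_K\}$ into K - 1 binary classification problems, where the k-th problem is to predict the target binary label $y_k = \mathbb{1}[y \succ r_k]$.
From ordinal labels to binary vectors
Convert each original label y into the K - 1 binary vector $(y_1, \dots, y_{K-1})$ and train models for the K - 1 binary classification tasks.
Example. Assume K = 4 (e.g. caries stages):
$r_1 \mapsto (0, 0, 0)$, $r_2 \mapsto (1, 0, 0)$, $r_3 \mapsto (1, 1, 0)$, $r_4 \mapsto (1, 1, 1)$
From predicted probabilities to predicted rank
Once the K - 1 models are trained, we obtain the vector of probabilities $(p_1, \dots, p_{K-1})$, where $p_k = P(y \succ r_k \mid x)$ is the predicted probability computed by the k-th model.
Then, the rank-wise probabilities are easily recovered as
$P(y = r_1 \mid x) = 1 - p_1$, $P(y = r_k \mid x) = p_{k-1} - p_k$ for $1 < k < K$, $P(y = r_K \mid x) = p_{K-1}$,
and we predict the rank $\hat{y} = r_{\hat{k}}$ with $\hat{k} = \arg\max_k P(y = r_k \mid x)$.
Notice that the K - 1 tasks are treated independently, so for some k we may have $p_k > p_{k-1}$, i.e. a negative rank-wise probability: in any case we can still compute the arg max and the predicted rank.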
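The encoding and decoding steps can be sketched as follows, assuming the k-th binary task asks whether the label lies above rank k and that ranks are indexed 1..K. The probability vectors are hypothetical outputs of the K - 1 binary models, not results from the slides.

```python
def encode(rank_index, num_ranks):
    """Binary targets for a 1-based rank index: the k-th entry is 1 if the rank is above k."""
    return [1 if rank_index > k else 0 for k in range(1, num_ranks)]

def rank_probabilities(p):
    """Recover the K rank-wise probabilities from the K - 1 'above rank k' probabilities."""
    probs = [1.0 - p[0]]
    probs += [p[k - 1] - p[k] for k in range(1, len(p))]
    probs.append(p[-1])
    return probs

def predict_rank(p):
    """Return the 1-based index of the rank with the largest recovered probability."""
    probs = rank_probabilities(p)
    return 1 + max(range(len(probs)), key=probs.__getitem__)

print([encode(k, 4) for k in range(1, 5)])

monotone = [0.9, 0.6, 0.2]        # hypothetical, decreasing in k
non_monotone = [0.5, 0.8, 0.25]   # hypothetical, second model disagrees with the first
print(predict_rank(monotone), predict_rank(non_monotone))
print(rank_probabilities(non_monotone))  # contains a negative entry
```

With the non-monotone vector one recovered "probability" is negative, yet the arg max is still well defined.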

Ordinal Binary Decomposition Methods
Two different approaches are possible for ordinal binary decomposition methods.
Multiple Model Approach
This is the simplest approach: a separate model (decision tree, neural network, SVM, …) is trained for each of the K - 1 binary classification tasks.
Multitask Learning
This approach uses a neural network with K - 1 independent output units with sigmoid activations, i.e.
$p_k = \sigma(w_k \cdot h_\theta(x) + b_k)$ for $k = 1, \dots, K - 1$,
where $h_\theta(x)$ is the output of the second-last layer of the neural network.
This network is trained for the K - 1 tasks at the same time, using as a loss function the sum of the K - 1 individual binary cross-entropy losses between the binary targets $y_k$ and the predicted probabilities $p_k$.
This approach is not only faster than separately training K - 1 models, but it may also give better predictions, since learning the K - 1 probabilities jointly gives the network a more global view of the problem.
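The multitask loss described above can be sketched in plain Python. The network itself is not modelled here: the probability vectors stand in for hypothetical sigmoid outputs, and the clipping constant `eps` is an assumption added for numerical safety.

```python
import math

def binary_cross_entropy(target, prob, eps=1e-12):
    """Binary cross-entropy for one task; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def multitask_loss(targets, probs):
    """Sum of the K - 1 binary cross-entropy losses for one sample."""
    return sum(binary_cross_entropy(t, p) for t, p in zip(targets, probs))

targets = [1, 1, 0]                # binary encoding of rank 3 out of K = 4
good_outputs = [0.9, 0.6, 0.2]     # hypothetical sigmoid outputs
bad_outputs = [0.2, 0.1, 0.9]      # hypothetical sigmoid outputs

print(multitask_loss(targets, good_outputs))
print(multitask_loss(targets, bad_outputs))
```

In a real network this per-sample loss would be averaged over a batch and minimized with respect to all weights at once.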

Threshold Methods
Main Idea
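In many cases the ordinal labels come from the discretization of a continuous latent variable $y^*$. Threshold methods learn to predict the value of the latent variable, together with the thresholds $b_1 < \dots < b_{K-1}$ that discretize it into ordered classes.
Proportional Odds Model
The simplest assumption is that the latent variable is linear in the input up to noise, $y^* = w \cdot x + \varepsilon$. This can be seen as a simple extension of logistic regression, where instead of estimating
$P(y = 1 \mid x)$ by $\sigma(w \cdot x + b)$
we want to estimate
$P(y \succ r_k \mid x)$ by $\sigma(w \cdot x - b_k)$ for $k = 1, \dots, K - 1$.
Notice that as long as $b_1 < \dots < b_{K-1}$, threshold methods are guaranteed to be consistent, i.e. $P(y \succ r_1 \mid x) > \dots > P(y \succ r_{K-1} \mid x)$.
The name of this approach comes from the fact that the odds satisfy the proportional ratio
$\frac{P(y \succ r_k \mid x_1) / P(y \preceq r_k \mid x_1)}{P(y \succ r_k \mid x_2) / P(y \preceq r_k \mid x_2)} = \exp(w \cdot (x_1 - x_2))$, which is the same for every k.
Cumulative Link Model
The previous approach can be generalized by modelling the latent variable with a neural network:
$P(y \succ r_k \mid x) = \sigma(w \cdot h_\theta(x) - b_k)$,
where $h_\theta(x)$ is the output of a neural network. This model is very similar to multitask learning and can be trained in the same way; notice, however, that here the weights w are shared across tasks and only the thresholds $b_k$ vary.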
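A small numerical sketch of the proportional odds model, assuming the k-th probability is the sigmoid of the linear score minus the k-th threshold. The weights, thresholds and inputs are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def exceedance_probabilities(x, w, thresholds):
    """Probability of lying above each rank: sigmoid(w.x - b_k) for every threshold b_k."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return [sigmoid(score - b_k) for b_k in thresholds]

def odds(p):
    return p / (1.0 - p)

w = [0.8, -0.5]                 # hypothetical shared weights
thresholds = [-1.0, 0.5, 2.0]   # hypothetical ordered thresholds, K = 4
x_1, x_2 = [1.0, 2.0], [3.0, 0.5]

probs_1 = exceedance_probabilities(x_1, w, thresholds)
probs_2 = exceedance_probabilities(x_2, w, thresholds)
odds_ratios = [odds(p) / odds(q) for p, q in zip(probs_1, probs_2)]

print(probs_1)       # decreasing in k because the thresholds increase
print(odds_ratios)   # the same value for every k
```

Because only the thresholds differ between tasks, the K - 1 probabilities can never cross, unlike in the independent binary decomposition.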

Threshold Methods
From predicted probabilities to predicted rank
Once the model is trained, we obtain the vector of probabilities $(p_1, \dots, p_{K-1})$, where the values $p_k = P(y \succ r_k \mid x)$ are guaranteed to be decreasing, assuming $b_1 < \dots < b_{K-1}$.
The predicted rank can be computed in a way that generalizes binary classification.
For binary classification, we first estimate the probability $p = P(y = 1 \mid x)$ and then we predict the label $\hat{y} = \mathbb{1}[p > \tau]$ for some threshold τ (usually τ = 0.5).
For threshold methods, we first estimate the probabilities $p_1, \dots, p_{K-1}$ and then we predict the rank $\hat{y} = r_q$, where $q = 1 + \sum_{k=1}^{K-1} \hat{y}_k$ and $\hat{y}_k = \mathbb{1}[p_k > \tau]$, for some threshold τ (usually τ = 0.5).
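The decision rule can be sketched in a couple of lines, assuming the predicted rank index is one plus the number of probabilities above τ. The input vectors are hypothetical.

```python
def predict_rank(probs, tau=0.5):
    """Return the 1-based rank index: 1 + number of probabilities exceeding tau."""
    return 1 + sum(1 for p in probs if p > tau)

print(predict_rank([0.9, 0.6, 0.2]))     # 3
print(predict_rank([0.3, 0.1, 0.05]))    # 1
print(predict_rank([0.7]))               # 2 (K = 2 reduces to binary classification)
```

With K = 2 there is a single probability, and the rule reduces to the usual thresholding of a binary classifier.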

Ordinal Regression Metrics: Ideal Properties
Identity of indiscernibles
A metric should take the same value for two confusion matrices $M_1$, $M_2$ iff $M_1 = M_2$.
Nominal Errors
A metric should take into account the nominal classification error.
Quantification of Divergence from Truth
A metric should capture how much the predictions diverge from the truth.
Ranking Consistency
A metric should capture how inconsistent the classifier is wrt the relative order of the ranks.

Metrics for Ordinal Regression
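Classification Metrics
● Accuracy = $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
Rank Correlation Metrics
● Kendall $\tau_b$
● Spearman $\rho_S$
Regression Metrics
● Mean Absolute Error = $\frac{1}{N} \sum_{i=1}^{N} |\hat{k}_i - k_i|$
● Mean Square Error = $\frac{1}{N} \sum_{i=1}^{N} (\hat{k}_i - k_i)^2$
Here $k_i$ and $\hat{k}_i$ are the indices of the true and predicted ranks of sample i, i.e. $y_i = r_{k_i}$ and $\hat{y}_i = r_{\hat{k}_i}$.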
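The sketch below computes these metrics on rank indices in plain Python. The labels and predictions are hypothetical, and Kendall's τb is implemented by brute force over all pairs, assuming neither vector is constant (otherwise the denominator is zero).

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def kendall_tau_b(y_true, y_pred):
    """Kendall's tau-b with tie correction, computed over all pairs in O(N^2)."""
    concordant = discordant = ties_true = ties_pred = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            ties_true += d_true == 0
            ties_pred += d_pred == 0
            if d_true * d_pred > 0:
                concordant += 1
            elif d_true * d_pred < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt((n_pairs - ties_true) * (n_pairs - ties_pred))

y_true = [1, 2, 3, 4]      # hypothetical true rank indices
pred_near = [2, 2, 3, 4]   # one error, off by one rank
pred_far = [4, 2, 3, 4]    # one error, off by three ranks

for name, pred in [("near", pred_near), ("far", pred_far)]:
    print(name, accuracy(y_true, pred), mean_absolute_error(y_true, pred),
          mean_squared_error(y_true, pred), kendall_tau_b(y_true, pred))
```

Both hypothetical classifiers have the same accuracy, while the distance-based and rank-correlation metrics separate them.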

Metrics for Ordinal Regression
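Ordinal Classification Index
● $OC_\beta^\gamma$, computed from the K × K confusion matrix M
- path = sequence of neighboring entries $M_{mn}$, from $M_{11}$ to $M_{KK}$
- consistent path = each step $M_{mn} \to M_{m'n'}$ in the path has m ≤ m' and n ≤ n'
- OC index = value of the optimal path, which trades off the normalized sum of the entries $M_{mn}$ in the path against the path's deviation from the main diagonal, weighted by β
Note: If β ≥ 1, the optimal path is the diagonal, and $OC_\beta^\gamma$ = 1 - Acc!
[Figure: a generic path and a consistent path through the confusion matrix]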

Metrics for Ordinal Regression — Example
[Figure: confusion matrices of four classifiers A, B, C, D]
● A is the perfect classifier
● B and C are more consistent than D w.r.t. the relative order of predicted vs. true ranks
● B and D are more accurate than C w.r.t. the distance between predicted and true ranks
● Acc only measures correct/wrong classes → cannot distinguish B, C, D
● MAE and MSE only measure the distance from the correct rank → cannot distinguish B, D
● $\rho_S$ and $\tau_b$ only measure relative ranking → cannot distinguish A, B
● $OC_\beta^\gamma$ consistently ranks A > B > {C, D} + satisfies the identity of indiscernibles for any β
○ small β → more weight to ranking consistency → C > D
○ large β → more weight to absolute classification → D > C

Conclusions
1. Understanding the nature of the targets y of any predictive model is essential to build a good estimator: regression, classification, ordinal regression, Poisson regression, circular regression, ...
2. Ordinal regression is similar to classification in that the labels $\{r_1, \dots, r_K\}$ are categorical, and similar to regression in that the labels are ordered, $r_1 \prec \dots \prec r_K$ (although the distance between ranks is undefined!)
3. Three main approaches are possible when solving ordinal regression problems
a. naïve methods: convert the problem into standard regression or classification
b. ordinal binary decomposition: convert the problem into K - 1 binary classification tasks
c. threshold methods: try to fit a continuous latent variable and K - 1 ordered thresholds
4. An ideal ordinal regression model should give predictions that are both accurate (MSE, MAE, …) and consistent (Spearman's ρ, Kendall’s τ, ...).
Predictions should be evaluated using properly chosen metrics, e.g. the Ordinal Classification Index.

References
● Cao, Wenzhi et al. “Rank-Consistent Ordinal Regression for Neural Networks”, arXiv (2019).
● Cardoso, Jaime S. et al. "Measuring the performance of ordinal classification." International Journal of
Pattern Recognition and Artificial Intelligence 25.08 (2011): 1173-1195.
● Cheng, Jianlin et al. "A neural network approach to ordinal regression." 2008 IEEE International Joint
Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008.
● Frank, Eibe et al. "A simple approach to ordinal classification." European Conference on Machine
Learning. Springer, Berlin, Heidelberg, 2001.
● Gutierrez, Pedro Antonio, et al. "Ordinal regression methods: survey and experimental study." IEEE
Transactions on Knowledge and Data Engineering 28.1 (2015): 127-146.
