Weakly Supervised Regression on Uncertain Datasets

Alexander Litvinenko^1, joint work with Vladimir Berikov^2, Roman Kozinets^2, Kirill Kalmutskiy and Layla Cherikbaeva^4

^1 RWTH Aachen, Germany, ^2 Sobolev Institute of Mathematics, ^3 Novosibirsk State University, Russia, ^4 Al-Farabi Kazakh National University, Kazakhstan
Plan
After a short review of ML tasks, we
▶ introduce weakly-supervised regression problem
▶ propose a method which combines graph Laplacian
regularization and cluster ensemble (collective decision
making) methodologies
▶ solve an auxiliary minimisation problem
▶ apply our method to two regression problems.
Machine learning (ML) problems
ML problems can be classified as
▶ fully supervised,
▶ unsupervised,
▶ semi-supervised,
▶ weakly-supervised.
Motivation: Recognition of acute stroke from CT scans
Manual annotation of a large number of computed tomography
(CT) digital images
Figure: Example of CT image: (a) initial image, (b) stroke area
annotated by a radiologist, (c) inaccurate mask.
V. Berikov, A. Litvinenko, I. Pestunov, and Y.Sinyavskiy, On a Weakly
Supervised Classification Problem, AIST 2021 Proceedings, accepted to
Springer Nature
Machine learning (ML) problems
Consider a dataset X = {x1, . . . , xn}, where xi ∈ Rd is a feature
vector, d the dimensionality of feature space, and n the sample
size.
Fully supervised learning assumes we are
given a set Y = {y1, . . . , yn}, yi ∈ DY ,
of target feature labels for each data point.
GOAL: To find a decision function y = f (x) (classifier, regression
model) for predicting target feature values for any new data point
x ∈ Rd from the same statistical population.
The function should be optimal in some sense, e.g., minimize the expected loss.
Unsupervised learning setting
In an unsupervised learning problem, the target feature is not
specified.
It is necessary to find a meaningful representation of the data, i.e., a partition P = {C1, . . . , CK} of X into a relatively small number K of homogeneous clusters describing the structure of the data.
The desired number of clusters is either a predefined parameter or has to be determined by the algorithm.
The obtained cluster partition can be uncertain due to:
1. a lack of knowledge about the data structure,
2. uncertainty in setting optional parameters of the learning algorithm,
3. dependence on random initializations.
Construction of the consensus partition
Semi-supervised learning problems
The target feature labels are known only for a part of the data set
X1 ⊂ X.
We assume that X1 = {x1, . . . , xn1 }, and the unlabeled part is
X0 = {xn1+1, . . . , xn}.
The set of labels for points from X1 is denoted by
Y1 = {y1, . . . , yn1 }.
GOAL: Predict target features Y0 = (yn1+1, . . . , yn) either for
given unlabeled data X0 (i.e., perform transductive learning) or for
arbitrary new observations from the same statistical population
(inductive learning).
Note: To improve prediction accuracy, the information contained in
both labeled and unlabeled datasets is used.
Weakly-supervised learning problems
Def.: For some data points the labels are known, for some they are unknown, and for others they are uncertain
(due to a lack of resources, random distortions in label identification, etc.).
To model uncertainty in the label identification, we suppose that
for each i-th point, i = 1, . . . , n1, the value yi of the target feature
is a realization of a random variable Yi with cumulative
distribution function (cdf) Fi (y) defined on DY .
We suppose that Fi (y) belongs to a given distribution family.
Weakly-supervised learning problems
Further assume:
Yi ∼ N(ai , σi ), (1)
where ai and σi are the mean and the standard deviation.
Assume ai = yi and σi = si are known for each (weakly) labeled
observation, i = 1, . . . , n1.
For an exactly known observation yi, we postulate a normal uncertainty model with ai = yi and a small standard deviation σi ≈ 0.
We aim at finding a weak labeling of the points from X0, i.e., determining the cdf Fi(y) for i = n1 + 1, . . . , n according to an objective criterion.
Assumptions:
Semi- and weakly-supervised learning assume:
1. cluster assumption (points from the same cluster often have
the same labels or labels close to each other)
2. manifold assumption (points with similar labels belong to a
smooth manifold).
Optimization problem
Let F = {F1, . . . , Fn} denote the set of cdfs for n data points; each
cdf Fi is represented by a pair of parameters (ai , σi ).
We solve
find F∗ = arg min_F J(F), where
J(F) = Σ_{xi∈X1} D(Yi, Fi) + γ Σ_{xi,xj∈X} D(Fi, Fj) Wij.   (2)
Here D is a statistical distance between two distributions (such as
the Wasserstein distance, KLD,...)1.
1st term: penalizes dissimilarity to the observed labels on the labeled data;
2nd term (smoothing): if two points xi, xj are similar, their labeling distributions should not differ much.
1 Litvinenko, A., Marzouk, Y., Matthies, H.G., Scavino, M., Spantini, A. Computing f-divergences and distances of high-dimensional probability density functions. Numer. Linear Algebra Appl. 2022;e2467. https://doi.org/10.1002/nla.2467
Objective functional
Use the Wasserstein distance wp between distributions P and Q
over a set DY as a measure of their dissimilarity:
wp(P, Q) := ( inf_{γ∈Γ(P,Q)} ∫_{DY×DY} ρ(y1, y2)^p dγ(y1, y2) )^{1/p},
where Γ(P, Q) is the set of all probability distributions on DY × DY with marginal distributions P and Q, ρ is a distance metric, and p ≥ 1.
For normal distributions Pi = N(ai, σi), Qj = N(aj, σj) and the Euclidean metric, the squared w2 distance equals
w2²(Pi, Qj) = (ai − aj)² + (σi − σj)².
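For illustration, a minimal sketch of this closed form (a hypothetical helper; it assumes the parametrization N(ai, σi) from (1), with the second argument a standard deviation):

```python
def w2_squared_gauss(a_i, s_i, a_j, s_j):
    """Squared 2-Wasserstein distance between N(a_i, s_i) and N(a_j, s_j),
    where the second parameter is the standard deviation."""
    return (a_i - a_j) ** 2 + (s_i - s_j) ** 2

# Two weak labels with the same mean but different uncertainty:
print(w2_squared_gauss(1.0, 0.1, 1.0, 0.5))   # ≈ 0.16
```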
Objective functional
Add an L2 regularizer:
J(a, σ) = Σ_{i=1}^{n1} [ (yi − ai)² + (si − σi)² ] + γ Σ_{i,j=1}^{n} [ (ai − aj)² + (σi − σj)² ] Wij + β(∥a∥² + ∥σ∥²),   (3)
where β > 0 is a regularization parameter, a = (a1, . . . , an)⊤, σ = (σ1, . . . , σn)⊤, and Wij describes the degree of similarity between two points.
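A minimal sketch of how (3) could be evaluated (hypothetical names; the labeled points are assumed to occupy the first n1 positions):

```python
import numpy as np

def objective_J(a, sigma, y, s, W, gamma, beta):
    """Evaluate the regularized functional (3).
    a, sigma: length-n parameter vectors for all points;
    y, s: length-n1 observed means / st. devs (labeled part first);
    W: n x n similarity matrix."""
    n1 = len(y)
    data_fit = np.sum((y - a[:n1]) ** 2 + (s - sigma[:n1]) ** 2)
    smooth = np.sum(W * ((a[:, None] - a[None, :]) ** 2
                         + (sigma[:, None] - sigma[None, :]) ** 2))
    ridge = beta * (a @ a + sigma @ sigma)
    return data_fit + gamma * smooth + ridge
```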
Optimal solution
To find the optimal solution, we differentiate (3) and get:
∂J/∂ai = 2(ai − yi) + 4γ Σ_{j=1}^{n} (ai − aj) Wij + 2βai = 0,  i = 1, . . . , n1,   (4)
∂J/∂ai = 4γ Σ_{j=1}^{n} (ai − aj) Wij + 2βai = 0,  i = n1 + 1, . . . , n.   (5)
By L := D − W we denote the standard graph Laplacian, where D is the diagonal matrix with elements Dii = Σ_j Wij.
Optimal solution
Denote Y1,0 = (y1, . . . , yn1, 0, . . . , 0)⊤ (with n − n1 zeros) and let B be a diagonal matrix with elements
Bii = β + 1 for i = 1, . . . , n1, and Bii = β for i = n1 + 1, . . . , n.
Combining (4), (5) and using vector-matrix notation, we finally get
(B + 2γL)a = Y1,0,
thus the optimal solution is
a∗ = (B + 2γL)−1 Y1,0.   (6)
Similarly, one can obtain the optimal value of σ:
σ∗ = (B + 2γL)−1 S1,0,   (7)
where S1,0 = (s1, . . . , sn1, 0, . . . , 0)⊤ (again with n − n1 zeros).
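A dense-matrix sketch of (6)-(7) (a hypothetical helper; it assumes a symmetric similarity matrix W and labeled points listed first):

```python
import numpy as np

def wsr_dense(W, y, s, gamma, beta):
    """Closed-form estimates (6)-(7) with a dense similarity matrix W."""
    n = W.shape[0]
    n1 = len(y)
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian D - W
    B = np.diag(np.r_[np.full(n1, beta + 1.0), np.full(n - n1, beta)])
    Y10 = np.r_[y, np.zeros(n - n1)]
    S10 = np.r_[s, np.zeros(n - n1)]
    M = B + 2.0 * gamma * L
    return np.linalg.solve(M, Y10), np.linalg.solve(M, S10)   # a*, sigma*
```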
Speeding up linear algebra operations
Let the similarity matrix be given in a low-rank form
W = AA⊤,   (8)
where A ∈ Rn×m, m ≪ n. Further, we have
B + 2γL = B + 2γD − 2γAA⊤ = G − 2γAA⊤,   (9)
where G := B + 2γD.
The following Woodbury matrix identity is well known in linear algebra:
(S + UV)−1 = S−1 − S−1U(I + VS−1U)−1VS−1,   (10)
where S ∈ Rn×n is an invertible matrix, U ∈ Rn×m and V ∈ Rm×n.
Speeding up linear algebra operations
Let S = G, U = −2γA and V = A⊤. One can see that
G−1 = diag(1/(B11 + 2γD11), . . . , 1/(Bnn + 2γDnn)).   (11)
From (6), (9), (10) and (11) we obtain:
a∗ = (G−1 + 2γG−1A(I − 2γA⊤G−1A)−1A⊤G−1) Y1,0.   (12)
Similarly, from (7), (9), (10) and (11) we have:
σ∗ = (G−1 + 2γG−1A(I − 2γA⊤G−1A)−1A⊤G−1) S1,0.   (13)
The computational complexity of (12), (13) is reduced and can be estimated as O(nm + m3) instead of O(n3).
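A sketch of the low-rank formulas (12)-(13) under the same assumptions (W = AA⊤; the diagonal of D is passed as a vector; helper name is hypothetical):

```python
import numpy as np

def wsr_lowrank(A, Dii, y, s, gamma, beta):
    """Estimates (12)-(13) via the Woodbury identity; A is the n x m factor
    of W = A A^T and Dii holds the row sums of W."""
    n, m = A.shape
    n1 = len(y)
    Bii = np.r_[np.full(n1, beta + 1.0), np.full(n - n1, beta)]
    g_inv = 1.0 / (Bii + 2.0 * gamma * Dii)              # diag of G^{-1}, eq. (11)
    core = np.eye(m) - 2.0 * gamma * (A.T * g_inv) @ A   # I - 2γ A^T G^{-1} A

    def apply(rhs):                                      # (G - 2γ A A^T)^{-1} rhs
        t = g_inv * rhs
        return t + 2.0 * gamma * g_inv * (A @ np.linalg.solve(core, A.T @ t))

    Y10 = np.r_[y, np.zeros(n - n1)]
    S10 = np.r_[s, np.zeros(n - n1)]
    return apply(Y10), apply(S10)                        # a*, sigma*
```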
An example of co-association matrix of cluster ensemble
W1,2 shows how often a pair {x1, x2} belongs to the same cluster.
An example of co-association matrix of cluster ensemble
W = (1/L) A · A⊤, where A = [A1, A2, . . . , AL] is a block matrix.
Co-association matrix of cluster ensemble
Use a co-association matrix of cluster ensemble as a similarity
matrix in (3).
Consider partitions {Pℓ}, ℓ = 1, . . . , r, where Pℓ = {Cℓ,1, . . . , Cℓ,Kℓ}, Cℓ,k ⊂ X, Cℓ,k ∩ Cℓ,k′ = ∅, and Kℓ is the number of clusters in the ℓ-th partition.
For each partition Pℓ determine the matrix Hℓ = (hℓ(i, j)), i, j = 1, . . . , n, whose elements indicate whether a pair xi, xj belongs to the same cluster in the ℓ-th variant or not:
hℓ(i, j) = I[cℓ(xi) = cℓ(xj)],
where cℓ(x) is the cluster label assigned to x.
The weighted averaged co-association matrix is
H = Σ_{ℓ=1}^{r} ωℓ Hℓ,   (14)
where ω1, . . . , ωr are weights, ωℓ ≥ 0, Σ ωℓ = 1.
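A sketch of how the low-rank factor of (14) could be assembled from the base clusterings (hypothetical helper; since Hℓ = Zℓ Zℓ⊤ for the cluster-indicator matrix Zℓ, stacking scaled indicator blocks gives H = AA⊤):

```python
import numpy as np

def coassociation_factor(labels_list, weights=None):
    """Return A with H = A @ A.T for the weighted co-association matrix (14);
    labels_list[l] is the length-n vector of cluster labels c_l(x_i)."""
    r = len(labels_list)
    n = len(labels_list[0])
    w = np.full(r, 1.0 / r) if weights is None else np.asarray(weights)
    blocks = []
    for wl, lab in zip(w, labels_list):
        lab = np.asarray(lab, dtype=int)
        Z = np.zeros((n, lab.max() + 1))
        Z[np.arange(n), lab] = 1.0            # cluster-indicator matrix Z_l
        blocks.append(np.sqrt(wl) * Z)
    return np.hstack(blocks)                  # A in R^{n x m}, m = sum_l K_l
```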
Graph Laplacian
The graph Laplacian matrix for H can be written in the form
L = D′ − H,
where D′ = diag(D′11, . . . , D′nn), D′ii = Σ_j H(i, j). One can see that
D′ii = Σ_{j=1}^{n} Σ_{ℓ=1}^{r} ωℓ Σ_{k=1}^{Kℓ} Zℓ(i, k) Zℓ(j, k) = Σ_{ℓ=1}^{r} ωℓ Nℓ(i),   (15)
where Zℓ is the n × Kℓ cluster-indicator matrix, Zℓ(i, k) = I[cℓ(xi) = k], and Nℓ(i) is the size of the cluster which includes point xi in the ℓ-th partition variant.
Using H instead of the similarity matrix W in (8), together with the matrix D′ defined in (15), we obtain cluster-ensemble-based predictions in the form given by (12), (13).
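The degrees D′ii in (15) never require forming H explicitly; a small sketch under the same assumptions:

```python
import numpy as np

def degrees_from_clusters(labels_list, weights):
    """Diagonal D'_ii = sum_l w_l N_l(i) from (15), where N_l(i) is the size
    of the cluster containing x_i in the l-th partition."""
    Dii = np.zeros(len(labels_list[0]))
    for wl, lab in zip(weights, labels_list):
        _, inv, counts = np.unique(lab, return_inverse=True, return_counts=True)
        Dii += wl * counts[inv]               # N_l(i) for every point i
    return Dii
```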
WSR-LRCM algorithm
Weakly Supervised Regression algorithm based on the Low-Rank
representation of the co-association matrix (WSR-LRCM):
Input:
X: dataset
ai , σi , i = 1, . . . , n1: uncertain input parameters for labeled and
inaccurately labeled points;
r, Ω: number of runs and set of parameters for the k-means
clustering (number of clusters, maximum number of iterations,
parameters of the initialization process).
Output:
a∗, σ∗: predicted estimates of uncertain parameters for objects
from sample X (including predictions for the unlabeled sample).
WSR-LRCM algorithm
Steps:
1. Generate r variants of the clustering partition for parameters randomly chosen from Ω; calculate the weights ω1, . . . , ωr.
2. Find the graph Laplacian in the low-rank representation.
3. Calculate predicted estimates of the uncertainty parameters using (12) and (13).
end.
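A possible end-to-end sketch of these steps, reusing the helper sketches above and scikit-learn's k-means (uniform ensemble weights are an assumption; the labeled points are assumed to come first):

```python
import numpy as np
from sklearn.cluster import KMeans

def wsr_lrcm(X, y, s, gamma=1e-3, beta=1e-3, r=10, k_range=range(2, 12), seed=0):
    """Sketch of WSR-LRCM: ensemble of k-means runs -> low-rank co-association
    matrix -> closed-form predictions (12)-(13)."""
    rng = np.random.RandomState(seed)
    # 1. r clustering variants with parameters randomly chosen from Ω
    labels_list = [KMeans(n_clusters=int(rng.choice(list(k_range))), n_init=1,
                          random_state=int(rng.randint(1 << 30))).fit_predict(X)
                   for _ in range(r)]
    weights = np.full(r, 1.0 / r)
    # 2. low-rank graph Laplacian ingredients
    A = coassociation_factor(labels_list, weights)
    Dii = degrees_from_clusters(labels_list, weights)
    # 3. predictions of the uncertainty parameters
    return wsr_lowrank(A, Dii, y, s, gamma, beta)
```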
Numerical examples
First example with artificial data
Dataset: a mixture of two multidimensional normal distributions N(m1, σX I), N(m2, σX I), m1, m2 ∈ R8, d = 8.
Noise: 2 independent U(0, 1) features.
Ground truth: Y = 1 + ε for the first component, Y = 2 + ε for the second, ε ∼ N(0, σε²).
Monte Carlo generates samples of the given size n according to the specified distribution mixture.
Training: 66.6%, Xtrain; Test: 33.4%, Xtest.
In the training dataset:
10% fully labeled samples;
20% inaccurately labeled objects;
rest: unlabeled data.
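A sketch of how such data could be generated (hypothetical; not all parameters of the original simulation are stated on the slide):

```python
import numpy as np

def make_mixture(n, sigma_x=3.0, sigma_eps=0.1, seed=0):
    """Two Gaussian components in R^8 (means m1 = 0, m2 = 10) plus two
    U(0,1) noise features; targets 1 + eps or 2 + eps per component."""
    rng = np.random.RandomState(seed)
    z = rng.randint(0, 2, size=n)                      # component indicator
    X = 10.0 * z[:, None] + sigma_x * rng.randn(n, 8)  # N(m_k, sigma_X I)
    X = np.hstack([X, rng.rand(n, 2)])                 # two noise features
    Y = (1.0 + z) + sigma_eps * rng.randn(n)           # ground-truth targets
    return X, Y
```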
First example with artificial data
This partitioning mimics a typical situation in weakly supervised learning: a small number of accurately labeled instances, a medium-sized set of uncertain labelings, and a lot of unlabeled examples.
To model the inaccurate labeling, we use the parameters defined in (1):
σi = δ · σY,
where σY is the standard deviation of Y over the labeled data and δ > 0.
First example with artificial data
The ensemble variants are generated by random initialization of
centroids; to increase the diversity of base clusterings, we set the
number of clusters in each run as K = 2, . . . , Kmax , where
Kmax = 2 + r, and r = 10 is the ensemble size.
The parameters of the objective functional, β = 0.001 and γ = 0.001, were estimated.
The quality of prediction is estimated on the test sample as the Mean Wasserstein Distance (MWD) between the parameters predicted according to (12), (13) and their ground-truth values:
MWD = (1/ntest) Σ_{xi∈Xtest} [ (atrue_i − a∗_i)² + (σ∗_i)² ],
where ntest is the test sample size and atrue_i = ytrue_i is the true value of the target feature.
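A one-line sketch of this metric (the predicted uncertainty enters through σ∗; the ground truth is treated as exact):

```python
import numpy as np

def mwd(a_true, a_pred, sigma_pred):
    """Mean Wasserstein Distance between predicted N(a*, sigma*) and the
    exact ground-truth values, as defined above."""
    return np.mean((a_true - a_pred) ** 2 + sigma_pred ** 2)
```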
First example with artificial data
We compare the suggested method WSR-LRCM with its simplified version (denoted WSR-RBF in the table below), whose output predictions were calculated according to (6) and (7).
The experiment was repeated 40 times.
m1 = (0, . . . , 0)⊤, m2 = (10, . . . , 10)⊤, σX = 3, and δ = 0.1.
n        σε     WSR-LRCM             WSR-RBF
                MWD     time (sec)   MWD     time (sec)
1000     0.01   0.002   0.04         0.007   0.04
         0.1    0.012   0.04         0.017   0.04
         0.25   0.065   0.04         0.070   0.04
5000     0.01   0.001   0.14         0.004   1.71
         0.1    0.011   0.14         0.014   1.72
         0.25   0.064   0.15         0.067   1.75
10000    0.01   0.001   0.33         0.002   9.40
         0.1    0.011   0.33         0.012   9.35
         0.25   0.064   0.33         0.065   9.36
10^5     0.01   0.001   6.72         -       -
10^6     0.01   0.001   89.12        -       -
Weakly supervised WSR-LRCM vs. semi-supervised SSR-RBF
n = 1000, σε = 0.1
Table: Results of experiments with WSR-LRCM and SSR-RBF
algorithms. Averaged MWD estimates are calculated for different values
of parameter δ.
δ          0.1     0.25    0.5
WSR-LRCM   0.012   0.017   0.038
SSR-RBF    0.051   0.051   0.051
δ accounts for the degree of uncertainty: the larger its value is, the
more similar the results become.
See more in [Berikov, V., Litvinenko, A. (2021). Weakly Supervised
Regression Using Manifold Regularization and Low-Rank Matrix
Representation. In: Pardalos, P., Khachay, M., Kazakov, A. (eds) MOTOR
2021. LNCS, vol 12755. Springer, Cham.
https://doi.org/10.1007/978-3-030-77876-7_30]
The second example with real data
Gas Turbine CO and NOx Emission Data Set, UC Irvine Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Gas+Turbine+CO+and+NOx+Emission+Data+Set (accessed 06 Apr 2021).
11 features (temperature, pressure, humidity, etc.) of a gas turbine.
The monitoring was carried out during 2011-2015.
Predicted outputs: Carbon monoxide (CO) and Nitrogen oxides (NOx).
The second example with real data
We make predictions for CO over the year 2015 (in total, 7384
observations).
Datasets: learning (66.6%) and test (33.4%) samples.
The accurately labeled sample is 1% of the entire dataset;
inaccurately labeled instances: 10%;
unlabeled samples: the rest.
Used k-means clustering (the number of clusters varies from 100 to 100 + r).
Forecast: the averaged MWD is 1.85 for WSR-LRCM and 5.18 for SSR-RBF.
The second example with real data
WSR-LRCM vs. fully supervised algorithms.
We calculate the standard Mean Absolute Error (MAE) using the estimates a∗ defined in (12) as the predicted feature outputs:
MAE = (1/ntest) Σ_{xi∈Xtest} |ytrue_i − a∗_i|.
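The corresponding sketch (only the predicted means a∗ enter here):

```python
import numpy as np

def mae(y_true, a_pred):
    """Mean Absolute Error between true targets and the predicted means a*."""
    return np.mean(np.abs(y_true - a_pred))
```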
                                 MAE     time
WSR-LRCM, 100+ clusters          0.634   1.99
Random Forest (RF), 300 trees    0.774   0.35
Linear Regression (LR)           0.873   0.38
From the experiments, one may conclude that the proposed WSR-LRCM gives more accurate predictions than the other compared methods in the case of a small proportion of labeled samples.
Conclusion
1. Introduced a weakly supervised regression method (WSR-LRCM) using the manifold regularization technique.
2. To model uncertain labeling, we have used the normal distribution.
3. The measure of similarity between uncertain labelings was formulated in terms of the Wasserstein distance between probability distributions.
4. Ensemble clustering is used for obtaining the co-association matrix, which we consider as the similarity matrix.
5. WSR-LRCM is faster due to the low-rank representation of the co-association matrix.
6. Additional information on uncertain labelings improves the regression quality.
Other applications
Masks of healthy (purple contours) and affected (yellow contours) brain tissues
and fragments emplacement for healthy and affected tissues (green and red
squares respectively) for FS = 16 (left) and 30 (right). Red color saturation
indicates the probability of assigning the fragment to the affected tissue.
Nedel’ko V et al. Comparative Analysis of Deep Neural Network and
Texture-Based Classifiers for Recognition of Acute Stroke using Non-Contrast
CT Images // 2020 Ural Symposium (USBEREIT). – IEEE, 2020, pp 376-379.
DOI:10.1109/USBEREIT48449.2020.9117784
Acknowledgements
The research was partly supported by
1. Scientific research program “Mathematical methods of pattern recognition
and prediction” in the Sobolev Institute of Mathematics SB RAS.
2. Russian Foundation for Basic Research grants 18-07-00600,
18-29-0904mk, 19-29-01175
3. Russian Ministry of Science and Education under the 5-100 Excellence
Programme.
4. Funding from the Alexander von Humboldt Foundation (A. Litvinenko).