Weakly Supervised Regression on Uncertain Datasets

Alexander Litvinenko^1, joint work with Vladimir Berikov^2, Roman Kozinets^2, Kirill Kalmutskiy and Layla Cherikbaeva^4

^1 RWTH Aachen, Germany, ^2 Sobolev Institute of Mathematics, ^3 Novosibirsk State University, Russia, ^4 Al-Farabi Kazakh National University, Kazakhstan
Plan
After a short review of ML tasks, we
▶ introduce weakly-supervised regression problem
▶ propose a method which combines graph Laplacian
regularization and cluster ensemble (collective decision
making) methodologies
▶ solve an auxiliary minimisation problem
▶ apply our method to two regression problems.
Machine learning (ML) problems
ML problems can be classified as
▶ fully supervised,
▶ unsupervised,
▶ semi-supervised,
▶ weakly-supervised.
Motivation: Recognition of acute stroke from CT scans
Manual annotation of a large number of computed tomography
(CT) digital images
Figure: Example of CT image: (a) initial image, (b) stroke area
annotated by a radiologist, (c) inaccurate mask.
V. Berikov, A. Litvinenko, I. Pestunov, and Y.Sinyavskiy, On a Weakly
Supervised Classification Problem, AIST 2021 Proceedings, accepted to
Springer Nature
Machine learning (ML) problems
Consider a dataset X = {x1, . . . , xn}, where xi ∈ Rd is a feature
vector, d the dimensionality of feature space, and n the sample
size.
Fully supervised learning assumes we are
given a set Y = {y1, . . . , yn}, yi ∈ DY ,
of target feature labels for each data point.
GOAL: To find a decision function y = f (x) (classifier, regression
model) for predicting target feature values for any new data point
x ∈ Rd from the same statistical population.
The function should be optimal in some sense, e.g., minimize the expected loss.
Unsupervised learning setting
In an unsupervised learning problem, the target feature is not
specified.
It is necessary to find a meaningful representation of the data, i.e., a partition P = {C1, . . . , CK} of X into a relatively small number K of homogeneous clusters describing the structure of the data.
The desired number of clusters is either a predefined parameter or has to be determined by the algorithm.
The obtained cluster partition can be uncertain due to:
1. a lack of knowledge about the data structure,
2. uncertainty in setting optional parameters of the learning algorithm,
3. dependence on random initializations.
Construction of the consensus partition
Semi-supervised learning problems
The target feature labels are known only for a part of the data set
X1 ⊂ X.
We assume that X1 = {x1, . . . , xn1 }, and the unlabeled part is
X0 = {xn1+1, . . . , xn}.
The set of labels for points from X1 is denoted by
Y1 = {y1, . . . , yn1 }.
GOAL: Predict target features Y0 = (yn1+1, . . . , yn) either for
given unlabeled data X0 (i.e., perform transductive learning) or for
arbitrary new observations from the same statistical population
(inductive learning).
Note: To improve prediction accuracy, the information contained in
both labeled and unlabeled datasets is used.
Weakly-supervised learning problems
Def.: For some data points the labels are known, for some they are unknown, and for others they are uncertain
(due to a lack of resources, random distortions in label identification, etc.).
To model uncertainty in the label identification, we suppose that
for each i-th point, i = 1, . . . , n1, the value yi of the target feature
is a realization of a random variable Yi with cumulative
distribution function (cdf) Fi (y) defined on DY .
We suppose that Fi (y) belongs to a given distribution family.
Weakly-supervised learning problems
Further assume:
Yi ∼ N(ai , σi ), (1)
where ai and σi are the mean and the standard deviation.
Assume ai = yi and σi = si are known for each (weakly) labeled
observation, i = 1, . . . , n1.
For an exactly known observation yi, we postulate a normal uncertainty model with ai = yi and a small standard deviation σi ≈ 0.
We aim at finding a weak labeling of the points from X0, i.e., determining the cdf Fi(y) for i = n1 + 1, . . . , n according to an objective criterion.
Assumptions:
Semi- and weakly-supervised learning assume:
1. cluster assumption (points from the same cluster often have
the same labels or labels close to each other)
2. manifold assumption (points with similar labels belong to a
smooth manifold).
Optimization problem
Let F = {F1, . . . , Fn} denote the set of cdfs for n data points; each
cdf Fi is represented by a pair of parameters (ai , σi ).
We solve
find F∗ = arg min_F J(F), where
J(F) = Σ_{xi∈X1} D(Yi, Fi) + γ Σ_{xi,xj∈X} D(Fi, Fj) Wij.   (2)
Here D is a statistical distance between two distributions (such as
the Wasserstein distance, KLD,...)1.
1st term: penalizes dissimilarity to the observed labels on the labeled data;
2nd term (smoothing): if two points xi, xj are similar, their labeling distributions should not differ much.
1 Litvinenko, A., Marzouk, Y., Matthies, H.G., Scavino, M., Spantini, A. Computing f-divergences and distances of high-dimensional probability density functions. Numer. Linear Algebra Appl. 2022;e2467. https://doi.org/10.1002/nla.2467
Objective functional
Use the Wasserstein distance wp between distributions P and Q
over a set DY as a measure of their dissimilarity:
wp(P, Q) := ( inf_{γ∈Γ(P,Q)} ∫_{DY×DY} ρ(y1, y2)^p dγ(y1, y2) )^{1/p},
where Γ(P, Q) is the set of all probability distributions on DY × DY with marginal distributions P and Q, ρ is a distance metric, and p ≥ 1.
For normal distributions Pi = N(ai, σi), Qj = N(aj, σj) and the Euclidean metric, the squared w2 distance equals
w2²(Pi, Qj) = (ai − aj)² + (σi − σj)².
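For illustration, a minimal sketch of this closed form (a hypothetical helper; it assumes the parametrization N(ai, σi) from (1), with the second argument a standard deviation):

```python
def w2_squared_gauss(a_i, s_i, a_j, s_j):
    """Squared 2-Wasserstein distance between N(a_i, s_i) and N(a_j, s_j),
    where the second parameter is the standard deviation."""
    return (a_i - a_j) ** 2 + (s_i - s_j) ** 2

# Two weak labels with the same mean but different uncertainty:
print(w2_squared_gauss(1.0, 0.1, 1.0, 0.5))   # ≈ 0.16
```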
Objective functional
Add an L2 regularizer:
J(a, σ) = Σ_{i=1}^{n1} [ (yi − ai)² + (si − σi)² ] + γ Σ_{i,j=1}^{n} [ (ai − aj)² + (σi − σj)² ] Wij + β(∥a∥² + ∥σ∥²),   (3)
where β > 0 is a regularization parameter, a = (a1, . . . , an)⊤, σ = (σ1, . . . , σn)⊤, and Wij describes the degree of similarity between two points.
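A minimal sketch of how (3) could be evaluated (hypothetical names; the labeled points are assumed to occupy the first n1 positions):

```python
import numpy as np

def objective_J(a, sigma, y, s, W, gamma, beta):
    """Evaluate the regularized functional (3).
    a, sigma: length-n parameter vectors for all points;
    y, s: length-n1 observed means / st. devs (labeled part first);
    W: n x n similarity matrix."""
    n1 = len(y)
    data_fit = np.sum((y - a[:n1]) ** 2 + (s - sigma[:n1]) ** 2)
    smooth = np.sum(W * ((a[:, None] - a[None, :]) ** 2
                         + (sigma[:, None] - sigma[None, :]) ** 2))
    ridge = beta * (a @ a + sigma @ sigma)
    return data_fit + gamma * smooth + ridge
```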
Optimal solution
To find the optimal solution, we differentiate (3) and get:
∂J/∂ai = 2(ai − yi) + 4γ Σ_{j=1}^{n} (ai − aj) Wij + 2βai = 0,  i = 1, . . . , n1,   (4)
∂J/∂ai = 4γ Σ_{j=1}^{n} (ai − aj) Wij + 2βai = 0,  i = n1 + 1, . . . , n.   (5)
By L := D − W we denote the standard graph Laplacian, where D is the diagonal matrix with elements Dii = Σ_j Wij.
Optimal solution
Denote Y1,0 = (y1, . . . , yn1, 0, . . . , 0)⊤ (with n − n1 zeros) and let B be a diagonal matrix with elements
Bii = β + 1 for i = 1, . . . , n1, and Bii = β for i = n1 + 1, . . . , n.
Combining (4), (5) and using vector-matrix notation, we finally get
(B + 2γL)a = Y1,0,
thus the optimal solution is
a∗ = (B + 2γL)−1 Y1,0.   (6)
Similarly, one can obtain the optimal value of σ:
σ∗ = (B + 2γL)−1 S1,0,   (7)
where S1,0 = (s1, . . . , sn1, 0, . . . , 0)⊤ (again with n − n1 zeros).
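A dense-matrix sketch of (6)-(7) (a hypothetical helper; it assumes a symmetric similarity matrix W and labeled points listed first):

```python
import numpy as np

def wsr_dense(W, y, s, gamma, beta):
    """Closed-form estimates (6)-(7) with a dense similarity matrix W."""
    n = W.shape[0]
    n1 = len(y)
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian D - W
    B = np.diag(np.r_[np.full(n1, beta + 1.0), np.full(n - n1, beta)])
    Y10 = np.r_[y, np.zeros(n - n1)]
    S10 = np.r_[s, np.zeros(n - n1)]
    M = B + 2.0 * gamma * L
    return np.linalg.solve(M, Y10), np.linalg.solve(M, S10)   # a*, sigma*
```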
Speeding up linear algebra operations
Let the similarity matrix be given in a low-rank form
W = AA⊤,   (8)
where A ∈ Rn×m, m ≪ n. Further, we have
B + 2γL = B + 2γD − 2γAA⊤ = G − 2γAA⊤,   (9)
where G := B + 2γD.
The following Woodbury matrix identity is well known in linear algebra:
(S + UV)−1 = S−1 − S−1U(I + VS−1U)−1VS−1,   (10)
where S ∈ Rn×n is an invertible matrix, U ∈ Rn×m and V ∈ Rm×n.
Speeding up linear algebra operations
Let S = G, U = −2γA and V = A⊤. One can see that
G−1 = diag(1/(B11 + 2γD11), . . . , 1/(Bnn + 2γDnn)).   (11)
From (6), (9), (10) and (11) we obtain:
a∗ = (G−1 + 2γG−1A(I − 2γA⊤G−1A)−1A⊤G−1) Y1,0.   (12)
Similarly, from (7), (9), (10) and (11) we have:
σ∗ = (G−1 + 2γG−1A(I − 2γA⊤G−1A)−1A⊤G−1) S1,0.   (13)
The computational complexity of (12), (13) is reduced and can be estimated as O(nm + m3) instead of O(n3).
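A sketch of the low-rank formulas (12)-(13) under the same assumptions (W = AA⊤; the diagonal of D is passed as a vector; helper name is hypothetical):

```python
import numpy as np

def wsr_lowrank(A, Dii, y, s, gamma, beta):
    """Estimates (12)-(13) via the Woodbury identity; A is the n x m factor
    of W = A A^T and Dii holds the row sums of W."""
    n, m = A.shape
    n1 = len(y)
    Bii = np.r_[np.full(n1, beta + 1.0), np.full(n - n1, beta)]
    g_inv = 1.0 / (Bii + 2.0 * gamma * Dii)              # diag of G^{-1}, eq. (11)
    core = np.eye(m) - 2.0 * gamma * (A.T * g_inv) @ A   # I - 2γ A^T G^{-1} A

    def apply(rhs):                                      # (G - 2γ A A^T)^{-1} rhs
        t = g_inv * rhs
        return t + 2.0 * gamma * g_inv * (A @ np.linalg.solve(core, A.T @ t))

    Y10 = np.r_[y, np.zeros(n - n1)]
    S10 = np.r_[s, np.zeros(n - n1)]
    return apply(Y10), apply(S10)                        # a*, sigma*
```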
An example of co-association matrix of cluster ensemble
W1,2 shows how often a pair {x1, x2} belongs to the same cluster.
An example of co-association matrix of cluster ensemble
W = (1/L) A · A⊤, where A = [A1, A2, . . . , AL] is a block matrix.
Co-association matrix of cluster ensemble
Use a co-association matrix of cluster ensemble as a similarity
matrix in (3).
Consider partitions {Pℓ}, ℓ = 1, . . . , r, where Pℓ = {Cℓ,1, . . . , Cℓ,Kℓ}, Cℓ,k ⊂ X, Cℓ,k ∩ Cℓ,k′ = ∅, and Kℓ is the number of clusters in the ℓ-th partition.
For each partition Pℓ determine the matrix Hℓ = (hℓ(i, j)), i, j = 1, . . . , n, whose elements indicate whether a pair xi, xj belongs to the same cluster in the ℓ-th variant or not:
hℓ(i, j) = I[cℓ(xi) = cℓ(xj)],
where cℓ(x) is the cluster label assigned to x.
The weighted averaged co-association matrix is
H = Σ_{ℓ=1}^{r} ωℓ Hℓ,   (14)
where ω1, . . . , ωr are weights, ωℓ ≥ 0, Σ ωℓ = 1.
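A sketch of how the low-rank factor of (14) could be assembled from the base clusterings (hypothetical helper; since Hℓ = Zℓ Zℓ⊤ for the cluster-indicator matrix Zℓ, stacking scaled indicator blocks gives H = AA⊤):

```python
import numpy as np

def coassociation_factor(labels_list, weights=None):
    """Return A with H = A @ A.T for the weighted co-association matrix (14);
    labels_list[l] is the length-n vector of cluster labels c_l(x_i)."""
    r = len(labels_list)
    n = len(labels_list[0])
    w = np.full(r, 1.0 / r) if weights is None else np.asarray(weights)
    blocks = []
    for wl, lab in zip(w, labels_list):
        lab = np.asarray(lab, dtype=int)
        Z = np.zeros((n, lab.max() + 1))
        Z[np.arange(n), lab] = 1.0            # cluster-indicator matrix Z_l
        blocks.append(np.sqrt(wl) * Z)
    return np.hstack(blocks)                  # A in R^{n x m}, m = sum_l K_l
```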
Graph Laplacian
The graph Laplacian matrix for H can be written in the form
L = D′ − H,
where D′ = diag(D′11, . . . , D′nn), D′ii = Σ_j H(i, j). One can see that
D′ii = Σ_{j=1}^{n} Σ_{ℓ=1}^{r} ωℓ Σ_{k=1}^{Kℓ} Zℓ(i, k) Zℓ(j, k) = Σ_{ℓ=1}^{r} ωℓ Nℓ(i),   (15)
where Zℓ is the n × Kℓ cluster-indicator matrix, Zℓ(i, k) = I[cℓ(xi) = k], and Nℓ(i) is the size of the cluster which includes point xi in the ℓ-th partition variant.
Using H instead of the similarity matrix W in (8), together with the matrix D′ defined in (15), we obtain cluster-ensemble-based predictions in the form given by (12), (13).
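The degrees D′ii in (15) never require forming H explicitly; a small sketch under the same assumptions:

```python
import numpy as np

def degrees_from_clusters(labels_list, weights):
    """Diagonal D'_ii = sum_l w_l N_l(i) from (15), where N_l(i) is the size
    of the cluster containing x_i in the l-th partition."""
    Dii = np.zeros(len(labels_list[0]))
    for wl, lab in zip(weights, labels_list):
        _, inv, counts = np.unique(lab, return_inverse=True, return_counts=True)
        Dii += wl * counts[inv]               # N_l(i) for every point i
    return Dii
```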
WSR-LRCM algorithm
Weakly Supervised Regression algorithm based on the Low-Rank
representation of the co-association matrix (WSR-LRCM):
Input:
X: dataset
ai , σi , i = 1, . . . , n1: uncertain input parameters for labeled and
inaccurately labeled points;
r, Ω: number of runs and set of parameters for the k-means
clustering (number of clusters, maximum number of iterations,
parameters of the initialization process).
Output:
a∗, σ∗: predicted estimates of uncertain parameters for objects
from sample X (including predictions for the unlabeled sample).
WSR-LRCM algorithm
Steps:
1. Generate r variants of the clustering partition for parameters randomly chosen from Ω; calculate the weights ω1, . . . , ωr.
2. Find the graph Laplacian in the low-rank representation.
3. Calculate predicted estimates of the uncertainty parameters using (12) and (13).
end.
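A possible end-to-end sketch of these steps, reusing the helper sketches above and scikit-learn's k-means (uniform ensemble weights are an assumption; the labeled points are assumed to come first):

```python
import numpy as np
from sklearn.cluster import KMeans

def wsr_lrcm(X, y, s, gamma=1e-3, beta=1e-3, r=10, k_range=range(2, 12), seed=0):
    """Sketch of WSR-LRCM: ensemble of k-means runs -> low-rank co-association
    matrix -> closed-form predictions (12)-(13)."""
    rng = np.random.RandomState(seed)
    # 1. r clustering variants with parameters randomly chosen from Ω
    labels_list = [KMeans(n_clusters=int(rng.choice(list(k_range))), n_init=1,
                          random_state=int(rng.randint(1 << 30))).fit_predict(X)
                   for _ in range(r)]
    weights = np.full(r, 1.0 / r)
    # 2. low-rank graph Laplacian ingredients
    A = coassociation_factor(labels_list, weights)
    Dii = degrees_from_clusters(labels_list, weights)
    # 3. predictions of the uncertainty parameters
    return wsr_lowrank(A, Dii, y, s, gamma, beta)
```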
Numerical examples
First example with artificial data
Dataset: a mixture of two multidimensional normal distributions N(m1, σX I), N(m2, σX I), m1, m2 ∈ R8, d = 8.
Noise: 2 independent U(0, 1) features.
Ground truth: Y = 1 + ε for the first component, Y = 2 + ε for the second, ε ∼ N(0, σε²).
Monte Carlo generates samples of the given size n according to the specified distribution mixture.
Training: 66.6%, Xtrain; Test: 33.4%, Xtest.
In the training dataset:
10% fully labeled samples;
20% inaccurately labeled objects;
rest: unlabeled data.
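A sketch of how such data could be generated (hypothetical; not all parameters of the original simulation are stated on the slide):

```python
import numpy as np

def make_mixture(n, sigma_x=3.0, sigma_eps=0.1, seed=0):
    """Two Gaussian components in R^8 (means m1 = 0, m2 = 10) plus two
    U(0,1) noise features; targets 1 + eps or 2 + eps per component."""
    rng = np.random.RandomState(seed)
    z = rng.randint(0, 2, size=n)                      # component indicator
    X = 10.0 * z[:, None] + sigma_x * rng.randn(n, 8)  # N(m_k, sigma_X I)
    X = np.hstack([X, rng.rand(n, 2)])                 # two noise features
    Y = (1.0 + z) + sigma_eps * rng.randn(n)           # ground-truth targets
    return X, Y
```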
First example with artificial data
This partitioning mimics a typical situation in weakly supervised learning: a small number of accurately labeled instances, a medium-sized set of uncertain labelings, and a lot of unlabeled examples.
To model the inaccurate labeling, we use the parameters defined in (1):
σi = δ · σY,
where σY is the standard deviation of Y over the labeled data and δ > 0.
First example with artificial data
The ensemble variants are generated by random initialization of
centroids; to increase the diversity of base clusterings, we set the
number of clusters in each run as K = 2, . . . , Kmax , where
Kmax = 2 + r, and r = 10 is the ensemble size.
The parameters of the objective functional, β = 0.001 and γ = 0.001, were estimated.
The quality of prediction is estimated on the test sample as the Mean Wasserstein Distance (MWD) between the parameters predicted according to (12), (13) and their ground-truth values:
MWD = (1/ntest) Σ_{xi∈Xtest} [ (atrue_i − a∗_i)² + (σ∗_i)² ],
where ntest is the test sample size and atrue_i = ytrue_i is the true value of the target feature.
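A one-line sketch of this metric (the predicted uncertainty enters through σ∗; the ground truth is treated as exact):

```python
import numpy as np

def mwd(a_true, a_pred, sigma_pred):
    """Mean Wasserstein Distance between predicted N(a*, sigma*) and the
    exact ground-truth values, as defined above."""
    return np.mean((a_true - a_pred) ** 2 + sigma_pred ** 2)
```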
First example with artificial data
We compare the suggested method WSR-LRCM with its simplified version (denoted WSR-RBF in the table below), whose output predictions were calculated according to (6) and (7).
The experiment was repeated 40 times.
m1 = (0, . . . , 0)⊤, m2 = (10, . . . , 10)⊤, σX = 3, and δ = 0.1.
n        σε     WSR-LRCM             WSR-RBF
                MWD     time (sec)   MWD     time (sec)
1000     0.01   0.002   0.04         0.007   0.04
         0.1    0.012   0.04         0.017   0.04
         0.25   0.065   0.04         0.070   0.04
5000     0.01   0.001   0.14         0.004   1.71
         0.1    0.011   0.14         0.014   1.72
         0.25   0.064   0.15         0.067   1.75
10000    0.01   0.001   0.33         0.002   9.40
         0.1    0.011   0.33         0.012   9.35
         0.25   0.064   0.33         0.065   9.36
10^5     0.01   0.001   6.72         -       -
10^6     0.01   0.001   89.12        -       -
Weakly supervised WSR-LRCM vs. semi-supervised SSR-RBF
n = 1000, σε = 0.1
Table: Results of experiments with WSR-LRCM and SSR-RBF
algorithms. Averaged MWD estimates are calculated for different values
of parameter δ.
δ          0.1     0.25    0.5
WSR-LRCM   0.012   0.017   0.038
SSR-RBF    0.051   0.051   0.051
δ accounts for the degree of uncertainty: the larger its value is, the
more similar the results become.
See more in [Berikov, V., Litvinenko, A. (2021). Weakly Supervised
Regression Using Manifold Regularization and Low-Rank Matrix
Representation. In: Pardalos, P., Khachay, M., Kazakov, A. (eds) MOTOR
2021. LNCS, vol 12755. Springer, Cham.
https://doi.org/10.1007/978-3-030-77876-7_30]
The second example with real data
Gas Turbine CO and NOx Emission Data Set, UC Irvine Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/Gas+Turbine+CO+and+NOx+Emission+Data+Set (accessed 06 Apr 2021).
11 features (temperature, pressure, humidity, etc.) of a gas turbine.
The monitoring was carried out during 2011-2015.
Predicted outputs: Carbon monoxide (CO) and Nitrogen oxides (NOx).
The second example with real data
We make predictions for CO over the year 2015 (in total, 7384
observations).
Datasets: learning (66.6%) and test (33.4%) samples.
The accurately labeled sample is 1% of the entire dataset;
inaccurately labeled instances: 10%;
unlabeled samples: the rest.
Used k-means clustering (the number of clusters varies from 100 to 100 + r).
Forecast: the averaged MWD is 1.85 for WSR-LRCM and 5.18 for SSR-RBF.
The second example with real data
WSR-LRCM vs. fully supervised algorithms.
We calculate the standard Mean Absolute Error (MAE) using the estimates a∗ defined in (12) as the predicted feature outputs:
MAE = (1/ntest) Σ_{xi∈Xtest} |ytrue_i − a∗_i|.
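The corresponding sketch (only the predicted means a∗ enter here):

```python
import numpy as np

def mae(y_true, a_pred):
    """Mean Absolute Error between true targets and the predicted means a*."""
    return np.mean(np.abs(y_true - a_pred))
```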
                                 MAE     time
WSR-LRCM, 100+ clusters          0.634   1.99
Random Forest (RF), 300 trees    0.774   0.35
Linear Regression (LR)           0.873   0.38
From the experiments, one may conclude that the proposed WSR-LRCM gives more accurate predictions than the other compared methods in the case of a small proportion of labeled samples.
Conclusion
1. Introduced a weakly supervised regression method (WSR-LRCM) using the manifold regularization technique.
2. To model uncertain labeling, we have used the normal distribution.
3. The measure of similarity between uncertain labelings was formulated in terms of the Wasserstein distance between probability distributions.
4. Ensemble clustering is used for obtaining the co-association matrix, which we consider as the similarity matrix.
5. WSR-LRCM is faster due to the low-rank representation of the co-association matrix.
6. Additional information on uncertain labelings improves the regression quality.
Other applications
Masks of healthy (purple contours) and affected (yellow contours) brain tissues
and fragments emplacement for healthy and affected tissues (green and red
squares respectively) for FS = 16 (left) and 30 (right). Red color saturation
indicates the probability of assigning the fragment to the affected tissue.
Nedel’ko V et al. Comparative Analysis of Deep Neural Network and
Texture-Based Classifiers for Recognition of Acute Stroke using Non-Contrast
CT Images // 2020 Ural Symposium (USBEREIT). – IEEE, 2020, pp 376-379.
DOI:10.1109/USBEREIT48449.2020.9117784
Acknowledgements
The research was partly supported by
1. Scientific research program “Mathematical methods of pattern recognition
and prediction” in the Sobolev Institute of Mathematics SB RAS.
2. Russian Foundation for Basic Research grants 18-07-00600,
18-29-0904mk, 19-29-01175
3. Russian Ministry of Science and Education under the 5-100 Excellence
Programme.
4. Funding from the Alexander von Humboldt Foundation (A. Litvinenko).