Lec14: Evaluation Framework for Medical Image Segmentation

MEDICAL IMAGE COMPUTING (CAP 5937)
LECTURE 14: Evaluation Framework for Medical Image
Segmentation
Dr. Ulas Bagci
HEC 221, Center for Research in Computer
Vision (CRCV), University of Central Florida
(UCF), Orlando, FL 32814.
bagci@ucf.edu or bagci@crcv.ucf.edu
SPRING 2017

Outline
• How to evaluate accuracy of image segmentation?
– Gold standard ~ surrogate of truths
– Qualitative
• Visual
• Inter- and intra-observer agreement rates
– Quantitative
• Volumetric measurements (regression)
• Region overlaps
• Shape based measurements
• Theoretical comparisons
• STAPLE, Uncertainty guidance, and evaluation w/o truths

Visual Assessment
Manual image segmentation from the full spectrum of IDEAL MRI data to delineate SAT (red),
VAT (green), liver (blue), pancreas (yellow), and kidneys (purple). Left to right: water-only,
fat-only, in-phase, out-of-phase, fat fraction, and segmented labels from SliceOmatic.
Reference: Assessment of Abdominal Adiposity and Organ Fat with Magnetic Resonance Imaging (chp11).

Inherent Uncertainty
Comparison of glioblastoma multiforme (GBM) segmentation results on an axial slice:
semi-automatic segmentation under Slicer (green, left image) and pure manual segmentation
(blue, middle image). Egger et al., Nat Sci Rep., 2012.

Inherent Uncertainty
red: endocardium; green: epicardium; yellow: ground truth
Queiros et al., European Heart Journal, 2016.

Segmentation Evaluation
Can be considered to consist of two components:
(1) Theoretical
Study mathematical equivalence among algorithms.
(2) Empirical
Study practical performance of algorithms in specific application
domains.

Segmentation Evaluation: Theoretical
Fundamental challenges in segmentation evaluation:
(Ch1) Are major pI (purely image-based) frameworks such as active
contours, level sets, graph cuts, fuzzy connectedness, and watersheds
truly distinct, or does some level of equivalence exist among them?
(Ch2) How to develop truly distinct methods constituting real
advance?
(Ch3) How to choose a method for a given application domain?
(Ch4) How to set an algorithm optimally for an application
domain?
Currently any method A can be shown empirically to be better than any
method B, even when they are equivalent.

Attributes commonly used by segmentation methods:
(1) Connectedness
(2) Texture
(3) Smoothness of boundary
(4) Gradient / homogeneity
(5) Shape information about object
(6) Noise handling
(7) Optimization employed
(8) Orientedness of boundary

SEGMENTATION EVALUATION: Theoretical
Attributes utilized by well-known delineation models
Model         | Connected         | Gradient          | Texture           | Smooth | Shape | Noise    | Optimize
Fuzzy con     | Yes               | Gr = hom affinity | Obj feat affinity | No     | No    | Scale FC | In RFC
Chan-Vese     | No                | No                | Yes               | Yes    | No    | No       | Yes
Mum-Shah      | No                | No                | Yes               | Yes    | No    | Yes      | Yes
KWT snake     | Boundary          | Yes               | No                | Yes    | No    | No       | Yes
MSV LS        | Fg when expanding | Yes               | No                | No     | No    | No       | No
Live wire     | Boundary          | Yes               | Yes               | Yes    | User  | No       | Yes
Act. shape    | Yes               | No                | No                | No     | Yes   | No       | Yes
Act. app      | Yes               | No                | Yes               | No     | Yes   | No       | Yes
Graph cut     | Usually not       | Yes               | Possible          | No     | No    | No       | Yes
Clustering    | No                | No                | Yes               | No     | No    | No       | Yes
Deep Learning | Yes               | Yes               | Yes               | Yes    | Yes   | Yes      | Yes

Segmentation Evaluation: Empirical
T: A task. Example: Estimating the volume of brain.
B: A body region. Example: Head.
P: Imaging protocol. Example: T2 weighted MR imaging with a particular set of
parameters.
Application domain: A particular triple $\langle T, B, P \rangle$.
Q: A set of scenes acquired for a particular application
domain $\langle T, B, P \rangle$.

The segmentation efficacy of a method M in an application
domain $\langle T, B, P \rangle$ may be characterized by three groups
of factors:
Precision (Reliability): Repeatability taking into account all
subjective actions influencing the result.
Accuracy (Validity): Degree to which the result agrees with
truth.
Efficiency (Viability): Practical viability of the method.

Validation of Image Segmentation
• Spectrum of accuracy versus realism in reference standard.
• Digital phantoms.
– Ground truth known accurately.
– Not so realistic.
• Acquisitions and careful segmentation.
– Some uncertainty in ground truth.
– More realistic.
• Autopsy/histopathology.
– Addresses pathology directly; resolution.
• Clinical data ?
– Hard to know ground truth.
– Most realistic model.
Slide Credit: N. Archip

Comparison To Higher Resolution
[Figure panels: MRI, Photograph, MRI]
Provided by Peter Ratiu and Florin Talos.
Credit: N. Archip

Precision
Repeatability taking into account all subjective actions
that influence the segmentation result.
• Intra operator variations
• Inter operator variations
• Intra scanner variations
• Inter scanner variations
Inter scanner variations include variations due to the
same brand and different brands.

Precision
A measure of precision for method M in a trial that produces
$C_M^{O_1}$ and $C_M^{O_2}$ for situation $T_i$ is given by
$PR_M^{T_i} = 1 - \frac{|C_M^{O_1} - C_M^{O_2}|}{|C_M^{O_1} + C_M^{O_2}|}, \quad i = 3, 4.$
Situations $T_i$: intra/inter operator, intra/inter scanner.
$C_M^{O_1}$, $C_M^{O_2}$ may be binary or fuzzy segmentations.
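A measure of the form 1 - |C1 - C2| / |C1 + C2| can be sketched in a few lines. This is only an illustration: it assumes the bars denote the sum over voxels, that the difference is taken as a voxelwise absolute difference, and that both scenes hold membership values in [0, 1] (binary or fuzzy). The function name `precision_pr` and the toy arrays are hypothetical.

```python
import numpy as np

def precision_pr(seg1, seg2):
    """1 - sum|seg1 - seg2| / sum(seg1 + seg2) for binary or fuzzy scenes in [0, 1]."""
    a = np.asarray(seg1, dtype=float)
    b = np.asarray(seg2, dtype=float)
    denom = (a + b).sum()
    if denom == 0:
        # Both segmentations are empty; treated as perfect agreement (assumption).
        return 1.0
    return 1.0 - np.abs(a - b).sum() / denom

# Two repeated trials of the same (hypothetical) method on the same scene.
trial_1 = np.array([[0, 1, 1, 0],
                    [0, 1, 1, 0]])
trial_2 = np.array([[0, 1, 1, 1],
                    [0, 1, 0, 0]])
pr = precision_pr(trial_1, trial_2)
print(pr)
```

For binary inputs this quantity coincides with the Dice overlap of the two trials, so identical trials give 1 and disjoint trials give 0.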

Accuracy
The degree to which segmentations agree with true
segmentation.
Surrogates of truth are needed.
For any image C acquired for application domain $\langle T, B, P \rangle$:
$C_M^{O}$ - segmentation of O in C by method M,
$C_{td}$ - surrogate of true delineation of O in C.

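[Figure: Venn diagram inside the reference super set $U_d$ showing the true segmentation $C_{td}$
and the segmentation $C_M^{O}$ by algorithm M, with the regions TP, FP, FN, and TN marked.]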

$FNVF_M^d = \frac{|C_{td} - C_M^{O}|}{|C_{td}|}, \quad TPVF_M^d = \frac{|C_{td} \cap C_M^{O}|}{|C_{td}|},$
$FPVF_M^d = \frac{|C_M^{O} - C_{td}|}{|U_d - C_{td}|}, \quad TNVF_M^d = \frac{|U_d - C_M^{O} - C_{td}|}{|U_d - C_{td}|}.$
$U_d$: A binary scene representing a reference super set
(for example, this may be the body region that is imaged).
$FNVF_M^d$: Amount of tissue truly in O that is missed by M.
$FPVF_M^d$: Amount of tissue falsely delineated by M.
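The four volume fractions can be computed directly from binary masks. The sketch below assumes binary NumPy arrays for the surrogate truth, the segmentation, and the reference super set, with both segmentations contained in the super set; the function name `volume_fractions` and the toy 1D scene are hypothetical.

```python
import numpy as np

def volume_fractions(c_td, c_m, u_d):
    """Return TPVF, FNVF, FPVF, TNVF for binary masks (truth, method, super set)."""
    td = np.asarray(c_td, dtype=bool)
    m = np.asarray(c_m, dtype=bool)
    u = np.asarray(u_d, dtype=bool)
    n_td = np.count_nonzero(td)        # |C_td|
    n_bg = np.count_nonzero(u & ~td)   # |U_d - C_td|
    return {
        "TPVF": np.count_nonzero(td & m) / n_td,
        "FNVF": np.count_nonzero(td & ~m) / n_td,
        "FPVF": np.count_nonzero(u & m & ~td) / n_bg,
        "TNVF": np.count_nonzero(u & ~m & ~td) / n_bg,
    }

# Toy 1D scene with 10 voxels: truth covers voxels 2..5, the method marks 3..7.
u_d = np.ones(10, dtype=bool)
c_td = np.zeros(10, dtype=bool)
c_td[2:6] = True
c_m = np.zeros(10, dtype=bool)
c_m[3:8] = True

vf = volume_fractions(c_td, c_m, u_d)
print(vf)
```

Because the denominators are the true object size and the true background size, the false and true fractions within each pair sum to one.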

Requirements for accuracy metrics:
(1) Capture M's behavior of trade-off between FP and FN.
(2) Satisfy laws of tissue conservation:
    $FNVF_M^d = 1 - TPVF_M^d$
    $FPVF_M^d = 1 - TNVF_M^d$
(3) Capable of characterizing the range of behavior of M.
(4) Any monotonic function g(FNVF, FPVF) is fine as a
metric.
(5) Appropriate for $\langle T, B, P \rangle$.

Delineation Operating Characteristic (DOC)
[Figure: DOC curve with axes FPVF and 1-FNVF for brain WM segmentation in PD MR images.]
Each value of parameter vector p of M gives a point on the
DOC curve.
The DOC curve characterizes the behavior of M over a range of
parametric values of M.
$A_M$: Area under the DOC curve.
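A DOC curve can be traced by sweeping a parameter and recording one (FPVF, 1-FNVF) point per setting. The sketch below is a toy illustration under several assumptions: the method M is a plain threshold on a hypothetical score map, the parameter vector p is that single threshold, and the curve is closed with the end points (0, 0) and (1, 1) before a trapezoidal area is taken. None of these choices come from the slides.

```python
import numpy as np

def doc_points(score, c_td, thresholds):
    """One (FPVF, 1 - FNVF) point per threshold for a thresholding 'method'."""
    score = np.asarray(score, dtype=float)
    td = np.asarray(c_td, dtype=bool)
    points = []
    for p in thresholds:
        m = score >= p
        fpvf = np.count_nonzero(m & ~td) / np.count_nonzero(~td)
        tpvf = np.count_nonzero(m & td) / np.count_nonzero(td)  # equals 1 - FNVF
        points.append((fpvf, tpvf))
    return np.array(points)

def area_under_doc(points):
    """Trapezoidal area after adding the assumed end points (0, 0) and (1, 1)."""
    pts = np.vstack([[0.0, 0.0], points, [1.0, 1.0]])
    order = np.lexsort((pts[:, 1], pts[:, 0]))  # sort by FPVF, then by 1 - FNVF
    x, y = pts[order, 0], pts[order, 1]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

score = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])  # hypothetical scores
truth = np.array([0, 0, 0, 1, 0, 1, 1, 1])
points = doc_points(score, truth, thresholds=[0.3, 0.5, 0.75])
a_m = area_under_doc(points)
print(points, a_m)
```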

Optimally setting an algorithm for $\langle T, B, P \rangle$
[Figure: DOC curve with axes FPVF and 1-FNVF, each ranging from 0 to 1.]
p - parameter vector for method M
$g_p(FPVF, FNVF)$ - monotonic fn
$p^* = \arg\min_p [g_p(FPVF, FNVF)]$
Set M to operate at p*.
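Choosing the operating point p* amounts to a search over candidate parameter values. The sketch below assumes a threshold-based toy method and picks, as one possible monotonic function g, the Euclidean distance of (FPVF, FNVF) from the ideal point (0, 0). Both the method and this choice of g are illustrative assumptions.

```python
import numpy as np

def fractions(score, c_td, p):
    """FPVF and FNVF of a thresholding 'method' operated at parameter p."""
    td = np.asarray(c_td, dtype=bool)
    m = np.asarray(score, dtype=float) >= p
    fpvf = np.count_nonzero(m & ~td) / np.count_nonzero(~td)
    fnvf = np.count_nonzero(~m & td) / np.count_nonzero(td)
    return fpvf, fnvf

def g(fpvf, fnvf):
    """Assumed cost: distance to the ideal point, monotonic in both arguments."""
    return float(np.hypot(fpvf, fnvf))

def best_parameter(score, c_td, candidates):
    """Grid search for p* = argmin_p g(FPVF(p), FNVF(p))."""
    costs = [g(*fractions(score, c_td, p)) for p in candidates]
    return candidates[int(np.argmin(costs))], costs

score = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])  # hypothetical scores
truth = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_star, costs = best_parameter(score, truth, [0.3, 0.5, 0.75])
print(p_star, costs)
```

A different g (for example, one that penalizes false negatives more heavily) would generally select a different p*, which is why g should be chosen for the application domain.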

Existent Segmentation Data
• Manual segmentation performed by 4 independent experts
• Low grade glioma
[Figure: original image and the segmentations by Expert 1, Expert 2, Expert 3, and Expert 4.]

Expert and Student Segmentations
[Figure: test image, expert consensus, and segmentations by Student 1, Student 2, and Student 3.]

Efficiency
Describes practical viability of a method.
Four factors should be considered:
(1) Computational time for one-time training of M: $t_M^{c_1}$
(2) Computational time for segmenting each scene: $t_M^{c_2}$
(3) Human time for one-time training of M: $t_M^{h_1}$
(4) Human time for segmenting each scene: $t_M^{h_2}$
(2) and (4) are crucial. (4) determines the degree of
automation of M.

Precision:
$PR_M^{T_1}$: intra operator
$PR_M^{T_2}$: inter operator
$PR_M^{T_3}$: intra scanner
$PR_M^{T_4}$: inter scanner
Accuracy:
$FNVF_M^d$: FN fraction for delineation
$FPVF_M^d$: FP fraction for delineation
$A_M$: Area under the DOC curve
Efficiency:
$t_M^{c_1}$: computational time for algorithm training.
$t_M^{c_2}$: computational time for scene segmentation.
$t_M^{h_1}$: operator time for algorithm training.
$t_M^{h_2}$: operator time for scene segmentation.

Remarks
(1) Precision, accuracy, efficiency are interdependent.
accuracy → efficiency.
precision and accuracy → difficult.
(2) “Automatic segmentation method” has no meaning unless the
results are proven on a large number of data sets with
acceptable precision, accuracy, efficiency, and with $t_M^{h_2} = 0$.
(3) A descriptive answer to “is method M1 better than M2 under
$\langle T, B, P \rangle$?” in terms of the 11 parameters is more meaningful
than a “yes” or “no” answer.
(4) DOC is essential to describe the range of behavior of M.

Velazquez et al, Scientific Reports 2013.

Shape Based Metrics for Segmentation Evaluation
[Figure: two segmentation examples.]
Example 1: Sensitivity=94.69%, Specificity=94.19%
Example 2: Sensitivity=72.99%, Specificity=78.16%
If only the DSC (Dice similarity coefficient, an overlap measure) is used, the DSC values are
similar in both examples, but the sensitivity and specificity values are not.
Is the overlap measure alone sufficient?
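The point that an overlap score alone can hide differences is easy to reproduce on toy data. The sketch below computes DSC, sensitivity, and specificity from binary masks and builds two hypothetical segmentations of the same truth that share the same DSC while differing in sensitivity and specificity. The masks are made up for illustration and are not the examples from the slide.

```python
import numpy as np

def overlap_metrics(c_td, c_m):
    """DSC, sensitivity, and specificity of a binary segmentation against truth."""
    td = np.asarray(c_td, dtype=bool)
    m = np.asarray(c_m, dtype=bool)
    tp = np.count_nonzero(td & m)
    fp = np.count_nonzero(~td & m)
    fn = np.count_nonzero(td & ~m)
    tn = np.count_nonzero(~td & ~m)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

truth = np.zeros(12, dtype=bool)
truth[2:7] = True            # object occupies voxels 2..6

seg_a = np.zeros(12, dtype=bool)
seg_a[3:10] = True           # over-segments: TP=4, FP=3, FN=1
seg_b = np.zeros(12, dtype=bool)
seg_b[1:5] = True            # under-segments: TP=3, FP=1, FN=2

metrics_a = overlap_metrics(truth, seg_a)
metrics_b = overlap_metrics(truth, seg_b)
print(metrics_a)
print(metrics_b)
```

Both segmentations reach a DSC of 2/3, yet one trades sensitivity for specificity and the other does the opposite, which is why complementary metrics are reported alongside the overlap.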

Hausdorff Distance
• Can be used as a complementary evaluation metric to the
overlap measure for measuring boundary mismatches!
• Lower Hausdorff Distance (HD), better segmentation
accuracy!
$HD(A, B) = \max\left( \max_{a \in A} d_B(a), \; \max_{b \in B} d_A(b) \right)$
$d_B(a) = \min_{b \in B} d(a, b)$ is the distance of a point a
on A from B.
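The Hausdorff distance between two boundaries can be sketched with a brute-force pairwise distance matrix. The code below assumes each boundary is given as an array of point coordinates and uses the Euclidean distance for d(a, b); the function name and the toy point sets are hypothetical. The pairwise matrix needs memory proportional to the product of the two point counts, so this is meant for small examples.

```python
import numpy as np

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets (rows are points)."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # d[i, j] = Euclidean distance between a[i] and b[j].
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    d_b_of_a = d.min(axis=1)   # d_B(a): distance of each point of A from B
    d_a_of_b = d.min(axis=0)   # d_A(b): distance of each point of B from A
    return float(max(d_b_of_a.max(), d_a_of_b.max()))

boundary_a = [(0, 0), (1, 0), (2, 0)]
boundary_b = [(0, 1), (1, 1), (2, 1), (5, 1)]
hd = hausdorff_distance(boundary_a, boundary_b)
print(hd)
```

In this example every point of A lies within distance 1 of B, but the outlying point (5, 1) of B is far from A, so the single outlier dominates the result. That sensitivity to isolated boundary errors is what makes HD complementary to overlap measures.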

Segmentation Evaluation: STAPLE
• STAPLE (Simultaneous Truth and Performance Level
Estimation):
– An algorithm for estimating performance and ground truth from a
collection of independent segmentations.
– Warfield, Zou, Wells, MICCAI 2002.
– Warfield, Zou, Wells, IEEE TMI 2004.
– Publicly available.
– The STAPLE algorithm (Warfield et al., 2004) is a region formulation
for producing consensus segmentations.
– When foreground is small → weight w is small
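The idea of estimating a consensus and each rater's performance at the same time can be illustrated with a small expectation-maximization loop for binary labels. This is a simplified sketch and not the published implementation: it assumes independent voxels, a single global foreground prior taken as the mean of all rater decisions, identical initial sensitivity and specificity for every rater, and a fixed iteration count. The function name `staple_binary` and the toy decisions are hypothetical.

```python
import numpy as np

def staple_binary(decisions, n_iter=50, init=0.9, eps=1e-6):
    """EM sketch: decisions has shape (n_voxels, n_raters) with entries 0 or 1."""
    d = np.asarray(decisions, dtype=float)
    prior = d.mean()                     # assumed global foreground prior
    p = np.full(d.shape[1], init)        # per-rater sensitivity
    q = np.full(d.shape[1], init)        # per-rater specificity
    for _ in range(n_iter):
        # E-step: posterior probability W that each voxel is truly foreground.
        a = prior * np.prod(np.where(d == 1, p, 1 - p), axis=1)
        b = (1 - prior) * np.prod(np.where(d == 1, 1 - q, q), axis=1)
        w = a / (a + b)
        # M-step: re-estimate rater performance, clipped away from 0 and 1.
        p = np.clip((w @ d) / w.sum(), eps, 1 - eps)
        q = np.clip(((1 - w) @ (1 - d)) / (1 - w).sum(), eps, 1 - eps)
    return w, p, q

# Three hypothetical raters labelling ten voxels (one column per rater).
decisions = np.array([
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
]).T
w, sensitivity, specificity = staple_binary(decisions)
consensus = (w > 0.5).astype(int)
print(consensus, sensitivity.round(3), specificity.round(3))
```

Raters who agree with the emerging consensus receive higher estimated sensitivity and specificity, and their votes then count for more in the next E-step.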

Segmentation Evaluation: STAPLE
• Segmentations are generated by sampling independently at
each voxel.
• However, the produced segmentations may not be realistic
for two reasons.
– First, the variability of the segmentation does not account for the
intensity in the image, so borders with strong gradients are
as variable as borders with weak gradients. This is counterintuitive,
because the basic hypothesis of image segmentation is that changes of
intensity are correlated with changes of labels.
– Second, borders of the segmented structures are unrealistic mainly
due to their lack of geometric regularity.

Regression Analysis in Clinical Problems
• Linear regression between volume(s)
– automated segmentation’s volume vs. manual segmentation’s volume
– Bland-Altman plot
• Linear regression between visual inspection (raters)
– Kappa statistics
– t-test / p-value
• Significantly different volumes ? Score ?
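Two of the listed analyses can be sketched with NumPy alone: a linear fit between automated and manual volumes, and Cohen's kappa for agreement between two raters' categorical scores. The volumes and ratings below are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np

def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept and Pearson correlation r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)

def cohen_kappa(rater_1, rater_2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    r1 = np.asarray(rater_1)
    r2 = np.asarray(rater_2)
    p_observed = np.mean(r1 == r2)
    labels = np.union1d(r1, r2)
    p_chance = sum(np.mean(r1 == k) * np.mean(r2 == k) for k in labels)
    # Undefined when p_chance == 1 (both raters always use a single label).
    return float((p_observed - p_chance) / (1 - p_chance))

manual_volume = np.array([10.0, 15.0, 20.0, 25.0, 30.0])   # hypothetical volumes
auto_volume = 1.1 * manual_volume - 1.0                    # hypothetical volumes
slope, intercept, r = linear_fit(manual_volume, auto_volume)

scores_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]                  # hypothetical ratings
scores_b = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(scores_a, scores_b)
print(slope, intercept, r, kappa)
```

A slope near 1 with an intercept near 0 indicates volumetric agreement; a high correlation alone does not, because a method can correlate perfectly while being systematically biased.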

Regression Analysis in Clinical Problems
Manual segmentation
Vedentham, et al. JCIS, 2014

What is Bland-Altman plot?
• A Bland-Altman plot is a method of data plotting used in analyzing the agreement
between two different assays.
• Claim: any two methods that are designed to measure the
same parameter should have good correlation.
– X-axis: mean of the two measurements
– Y-axis: difference between the two measurements
• Good first step analyzing the data!
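The quantities behind a Bland-Altman plot are simple to compute. The sketch below returns the per-case mean and difference (the x and y coordinates of the plot) together with the mean difference (bias) and the conventional 95% limits of agreement, bias ± 1.96 standard deviations. The limits are a standard addition that the slide does not mention, and the volumes are invented for illustration.

```python
import numpy as np

def bland_altman(measure_1, measure_2):
    """Return plot coordinates, bias, and 95% limits of agreement."""
    m1 = np.asarray(measure_1, dtype=float)
    m2 = np.asarray(measure_2, dtype=float)
    mean = (m1 + m2) / 2.0           # x-axis: mean of the two measurements
    diff = m1 - m2                   # y-axis: difference between them
    bias = diff.mean()
    sd = diff.std(ddof=1)            # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return mean, diff, float(bias), limits

auto_volume = [10.2, 15.1, 20.3, 24.8, 30.5]     # hypothetical automated volumes
manual_volume = [10.0, 15.0, 20.0, 25.0, 30.0]   # hypothetical manual volumes
mean, diff, bias, (lower, upper) = bland_altman(auto_volume, manual_volume)
print(bias, lower, upper)
```

Plotting `diff` against `mean` and drawing horizontal lines at `bias`, `lower`, and `upper` gives the usual figure; points outside the limits flag cases where the two methods disagree more than expected.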

Bland-Altman Plots (e.g., airway segmentation
evaluation)
Xu, Bagci, et al. MedIA, 2015.

New Directions: Sampling Image
Segmentations (Le et al, MedIA, 2016)
• Automatically produce plausible image segmentation samples
from a single expert segmentation!

• A probability distribution of image segmentation boundaries is
defined as a Gaussian Process, which leads to segmentations
that are spatially coherent and consistent with the presence
of salient borders in the image.

Remark: Gaussian Process (GP) ?
Credit: Ghahramani

Sample segmentation contours according to
mean inter-sample dice coefficient!
(Top Left) Mean of the GP µ; (Top Middle) Sample of the level set function φ(a) drawn from
𝒢𝒫(µ,Σ); (Others) GPSSI samples. The ground truth is outlined in red, the GPSSI samples are
outlined in orange.

(Left) Signed geodesic distance µ(a) of the ROI with isocontours –45, 0, 45, 100, 200.
(Right) One can check that the samples most probably lie in the region delineated by
the isocontours µ(a)=±45. The sampled contours are in orange.
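The idea of drawing level set functions from a Gaussian process and thresholding them can be illustrated in one dimension. This is a toy sketch and not the GPSSI method: it assumes a 1D grid, a mean µ equal to a signed distance that is negative inside the object, a squared-exponential covariance that ignores image intensities, and arbitrary values for the variance and length scale. It also reports the mean Dice coefficient between pairs of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D grid; the hypothetical expert object occupies voxels 40..59.
x = np.arange(100.0)
mu = np.abs(x - 49.5) - 10.0   # signed distance, negative inside (assumed convention)

# Squared-exponential covariance (assumed kernel) plus jitter for numerical stability.
sigma, length = 2.0, 8.0
cov = sigma ** 2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / length) ** 2)
chol = np.linalg.cholesky(cov + 1e-6 * np.eye(x.size))

# Draw level set functions phi ~ GP(mu, cov) and threshold them at zero.
n_samples = 20
phi = mu + rng.standard_normal((n_samples, x.size)) @ chol.T
samples = phi <= 0.0           # one binary segmentation per row

def dice(a, b):
    """Dice coefficient of two binary masks (1.0 if both are empty)."""
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

pairs = [(i, j) for i in range(n_samples) for j in range(i + 1, n_samples)]
mean_dice = float(np.mean([dice(samples[i], samples[j]) for i, j in pairs]))
print(mean_dice)
```

Increasing `sigma` spreads the sampled boundaries further from the zero level of µ and lowers the mean inter-sample Dice, so the variance acts as a knob for how much variability the samples show.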

Provocative Question?
• Can we evaluate segmentation error without the ground
truth?
– With machine learning support, can we design a classifier that
LEARNS the segmentation error and adapts itself for better delineation?

Summary
• Segmentation Evaluation
– Theoretical vs. Empirical
– Visual Assessment
– Volumetric Agreement
– Efficacy (efficiency, accuracy, …)
– STAPLE
– New Trends!
– Segmentation Challenges (choose your project!)

Slide Credits and References
• Credits to: Jayaram K. Udupa of Univ. of Penn., MIPG
• Bagci’s CV Course 2015 Fall.
• K.D. Toennies, Guide to Medical Image Analysis.
• Handbook of Medical Imaging, Vol. 2. SPIE Press.
• Handbook of Biomedical Imaging, Paragios, Duncan, Ayache.
• Suetens, P., Medical Imaging, Cambridge Press.
• Neculai Archip, Ph.D.
• Simon K. Warfield, Ph.D. (See STAPLE Algorithm)
