Introduction to Computer Vision
Quantitative Biomedical Imaging Group
Institute of Biomedical Engineering
Big Data Institute
University of Oxford
Jens Rittscher
Mathematics (with Computer Science)
University of Bonn, Germany
Mathematics, a universal language, plays a role in many disciplines
questions / opportunities
Economics Biology Computer Vision
Machine Learning for Computer Vision
What is Computer Vision?
o train machines to interpret the visual world
o analyse what objects are in an image
o detect specific objects of interest
Edge Detection
Image Segmentation
Classification
Visual Motion
Course Themes
Jens Rittscher
Institute of Biomedical
Engineering & Big Data Institute
University of Oxford
GE Global Research, Niskayuna, NY
University of Oxford
DPhil - Engineering Science
(Computer Vision)
Title: Recognising Human Motion
Senior Scientist
Computer Vision and
Visualisation
Project Leader
Biomedical Imaging
Manager
Computer Vision
Senior Research Fellow (IBME)
Group Leader (TDI)
Adjunct Member (LICR)
2000
2005
2013
Professor of Engineering Science
Cell Tracking
Zebrafish Imaging
Computational Pathology
Re-identification
Group Segmentation
University of Oxford
Tissue Imaging
Endoscopy
[Slide illustration: excerpt from Sharma et al., Gastroenterology, Vol. 131, No. 5, describing the Prague C & M criteria for endoscopically recognized Barrett's esophagus. The circumferential (C) and maximal (M) extents are measured above the gastroesophageal junction, so that C = 2 cm with a 3 cm distal "tongue" gives the classification C2M5. Internal validation by 5 working-group members on 50 video clips yielded reliability coefficients of 0.91 for C and 0.66 for M; external validation by 22 assessors on 29 clips yielded 0.94 for C and 0.93 for M, with near-perfect recognition of the gastric folds and diaphragmatic hiatus (0.88 and 0.85).]
• Learn image processing & machine learning
techniques in the context of a concrete application
setting
• Gain experience in working with images and the
application of machine learning models
Machine Learning for Computer Vision
Lectures
Exercises
Course Components
Data Science
Theory
You have a strong
background in
mathematics and statistics
and like to apply the
methods to real-world
problems.
Practice
You have the necessary
practical programming
skills to implement your
ideas and work on large
data sets.
Context
You have a strong interest
or background knowledge
in a particular scientific
field that excites you.
Structure of the course
Feature Extraction Image Segmentation Object Detection
Traditional Computer Vision
Revisiting Computer Vision with Deep Learning
Object Detection
Semantic Segmentation
Machine Learning
Deep Learning
Motion & Tracking
Course structure
Unit: Core topics (lectures & exercises)
Day 1: Introduction, representation of digital images, filtering, feature extraction (Lectures 1, 2)
Day 2: Image segmentation (Lectures 3, 4; Exercises 1, 2, (3))
Day 3: Machine learning (part 1); discussion of exercise sheet 1 (Lecture 5)
Day 4: Machine learning (part 2); object detection (Lectures 6, 7; Exercises 3, 4, (5))
Day 5: Deep learning elements; discussion of exercise sheet 2 (Lecture 8)
Course structure
Unit: Core topics (lectures & exercises)
Day 6: Deep learning detection; deep learning segmentation (Lectures 9, 10; Exercises 6, 7, (8))
Day 7: Autoencoders; discussion of exercise sheet 3 (Lecture 11)
Day 8: Video processing; visual tracking (Lectures 12, 13; Exercises 9, 10, (11))
Day 9: Application and translation of AI; discussion of exercise sheet 4 (Lecture 14)
Day 10: Research Talk
The exercises are a fundamental part of the course; they help you understand the course material in more depth.
They will cover the following aspects:
• Understanding of the core methods
• Help to apply the concepts in practice
• Provide direction for additional study
The points from the exercises account for 30% of the final grade.
Exercises 3, 6, 9, 12, 18 are optional.
Exercises
Programming and software
Python libraries
We advise working with the Anaconda distribution, which is based on Python 3.x. Missing packages can be installed with the conda installer; a quick import check is sketched after the list.
• NumPy
• scikit-image (http://scikit-image.org/)
• scikit-learn (http://scikit-learn.org/)
• OpenCV (http://opencv.org/) – not required for the exercises
• pyDICOM
For medical image processing:
• SimpleITK (http://www.simpleitk.org/)
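To check that these packages are available in your environment, you can run a short snippet along the following lines (a minimal sketch; the strings are the standard import names of the packages above, e.g. skimage for scikit-image, sklearn for scikit-learn, cv2 for OpenCV):

```python
import importlib

# Standard import names for the packages listed above; cv2 and SimpleITK are optional.
packages = ["numpy", "skimage", "sklearn", "pydicom", "cv2", "SimpleITK"]

for name in packages:
    try:
        module = importlib.import_module(name)
        print(f"{name:10s} OK (version {getattr(module, '__version__', 'unknown')})")
    except ImportError:
        print(f"{name:10s} missing - install it with conda or pip")
```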
You will find a set of Python
notebooks on GitHub.
You can copy these onto your local
computer or run them online.
Your Python setup
Python example
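Below is a small, self-contained example in the spirit of this slide. It is only a sketch, assuming the packages listed above and using a sample image bundled with scikit-image: the image is converted to grayscale, smoothed with a Gaussian filter, and passed through the Canny edge detector.

```python
from skimage import data, color, filters, feature

# Load a bundled sample image and convert it to a grayscale float array
image = color.rgb2gray(data.astronaut())

# Linear filtering: Gaussian smoothing suppresses noise before edge detection
smoothed = filters.gaussian(image, sigma=2)

# Feature extraction: Canny edge detector applied to the smoothed image
edges = feature.canny(smoothed, sigma=1.5)

print(image.shape, edges.sum(), "edge pixels")
```

Running the notebook versions of such examples is a good way to verify your Python setup before the first exercise sheet.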
Literature
Some history …
• Computer vision started in the late 1960s in groups that pioneered
artificial intelligence
• The goal was to build machines and systems that could ‘see’, i.e.
interpret the visual world
• As such the field has very close links with robotics
Computer Vision
• A seminal book that describes a
general framework for understanding
visual perception
• Reconstructing the scene from a set of
primitives (lines, simple geometric
structures) is a central theme
David Marr - Vision
Takeo Kanade
Over 50 years of contributions to computer vision and robotics
PhD Thesis 1974
Kyoto, Japan
Neural Network Based
Face Detection
H. A. Rowley, S. Baluja, T. Kanade
CVPR 1996
[Figure from the paper: the detection pipeline. An input image pyramid is subsampled into extracted 20 by 20 pixel windows; each window is preprocessed (lighting correction, histogram equalization) and fed to a neural network whose hidden units have localized receptive fields, producing the network output.]
variation across the face. The linear function will approx-
imate the overall brightness of each part of the window,
and can be subtracted from the window to compensate for a
variety of lighting conditions. Then histogram equalization
is performed, which non-linearly maps the intensity values
to expand the range of intensities in the window. The his-
togram is computed for pixels inside an oval region in the
window. This compensates for differences in camera input
gains, and improves the contrast in some cases.
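A rough NumPy/scikit-image sketch of this preprocessing is given below. It is an approximation for illustration only (not the authors' code), and the oval mask over the window is omitted.

```python
import numpy as np
from skimage import exposure

def preprocess_window(window):
    """Lighting correction and histogram equalization for a 20x20 window."""
    h, w = window.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Least-squares fit of a linear brightness function I(x, y) ~ a*x + b*y + c
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, window.ravel().astype(float), rcond=None)
    plane = (A @ coeffs).reshape(h, w)
    # Subtracting the fitted plane compensates for overall brightness variation
    corrected = window.astype(float) - plane
    # Histogram equalization non-linearly expands the range of intensities
    return exposure.equalize_hist(corrected)
```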
The preprocessed window is then passed through a neural
network. The network has retinal connections to its input
layer; the receptive fields of hidden units are shown in
Figure 1. There are three types of hidden units: 4 which
look at 10x10 pixel subregions, 16 which look at 5x5 pixel
subregions, and 6 which look at overlapping 20x5 pixel
horizontal stripes of pixels. Each of these types was chosen
to allow the hidden units to represent localized features that
might be important for face detection. Although the figure
shows a single hidden unit for each subregion of the input,
these units can be replicated. For the experiments which
are described later, we use networks with two and three sets
of these hidden units. Similar input connection patterns are
commonly used in speech and character recognition tasks
[Waibel et al., 1989, Le Cun et al., 1989]. The network has
a single, real-valued output, which indicates whether or not
the window contains a face.
To train the neural network used in stage one to serve as an
accurate filter, a large number of face and non-face images
are needed. Nearly 1050 face examples were gathered
from face databases at CMU and Harvard. The images
were normalized in scale, orientation, and position, as follows:
1. Rotate image so both eyes appear on a horizontal line.
2. Scale image so the distance from the point between
the eyes to the upper lip is 12 pixels.
3. Extract a 20x20 pixel region, centered 1 pixel above
the point between the eyes and the upper lip.
In the training set, 15 face examples are generated from each
original image, by randomly rotating the images (about their
center points) up to 10°, scaling between 90% and 110%,
translating up to half a pixel, and mirroring. Each 20x20
window in the set is then preprocessed (by applying lighting
correction and histogram equalization). The randomization
gives the filter invariance to translations of less than a pixel
and scalings of 10%. Larger changes in translation and
scale are dealt with by applying the filter at every pixel
position in an image pyramid, in which the images are
scaled by factors of 1.2.
Practically any image can serve as a non-face example
because the space of non-face images is much larger than
the space of face images. However, collecting a small yet
“representative” set of non-faces is difficult. Instead of col-
lecting the images before training is started, the images are
collected during training in the following manner, adapted
from [Sung and Poggio, 1994]:
1. Create an initial set of non-face images by generating
1000 images with random pixel intensities. Apply the
preprocessing steps to each of these images.
2. Train the neural network to produce an output of 1 for the face examples, and -1 for the non-face examples.
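The multi-scale search described above (applying the detector at every pixel position of an image pyramid scaled by factors of 1.2) can be sketched with scikit-image as follows. This is an illustrative approximation, not the original implementation; a real detector would preprocess each window and pass it through the trained network.

```python
from skimage import color, data, transform

def sliding_windows(image, window=20, downscale=1.2):
    """Yield every 20x20 window from a Gaussian image pyramid (downscale factor 1.2)."""
    gray = color.rgb2gray(image) if image.ndim == 3 else image
    for scale, level in enumerate(transform.pyramid_gaussian(gray, downscale=downscale)):
        if min(level.shape) < window:
            break
        for y in range(level.shape[0] - window + 1):
            for x in range(level.shape[1] - window + 1):
                yield scale, (y, x), level[y:y + window, x:x + window]

# Example: count the candidate windows in a sample image
n_windows = sum(1 for _ in sliding_windows(data.astronaut()))
print(n_windows, "windows to classify")
```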
A Statistical Approach to 3D Object Detection
Applied to Faces and Cars
H. Schneiderman and T. Kanade
2000
[Figures 15 and 16 from the paper: the wavelet representation of an image, with LL, LH, HL, and HH subbands at three levels; each subsequent level represents a higher octave of spatial frequencies, and the wavelet coefficients are each quantized to five values.]
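Such a decomposition can be reproduced with PyWavelets (pywt), which is not on the course library list; the sketch below computes a three-level 2D wavelet transform of a sample image and prints the shapes of the approximation (LL) and detail subbands.

```python
import pywt
from skimage import data

image = data.camera().astype(float)  # 512x512 grayscale sample image

# Three-level 2D discrete wavelet transform (Haar basis for simplicity)
coeffs = pywt.wavedec2(image, wavelet="haar", level=3)

print("LL (approximation):", coeffs[0].shape)
for details in coeffs[1:]:  # from the coarsest to the finest detail level
    h, v, d = details       # horizontal, vertical, and diagonal detail subbands
    print("detail subbands:", h.shape, v.shape, d.shape)
```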
Structure from motion
C. Tomasi and T. Kanade, 1991
Feature tracking
B.D. Lucas and T. Kanade 1981
C. Tomasi and T. Kanade 1991
If we now partition the matrices $L$, $\Sigma$, and $R$ as follows:

$$L = \begin{bmatrix} L' & L'' \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma' & 0 \\ 0 & \Sigma'' \end{bmatrix}, \qquad
R = \begin{bmatrix} R' \\ R'' \end{bmatrix},$$

we have

$$L \Sigma R = L' \Sigma' R' + L'' \Sigma'' R''.$$

Let $W^*$ be the ideal measurement matrix, that is, the matrix we would obtain in the absence of noise. Because of the rank principle, the non-zero singular values of $W^*$ are at most three. Since the singular values in $\Sigma$ are sorted in non-increasing order, $\Sigma'$ must contain all the singular values of $W^*$ that exceed the noise level. As a consequence, the term $L'' \Sigma'' R''$ must be due entirely to noise, and the product $L' \Sigma' R'$ is the best possible rank-3 approximation to $W^*$.

We can now restate our key point.

The Rank Principle for Noisy Measurements: all the shape and motion information in $W$ is contained in its three greatest singular values, together with the corresponding left and right eigenvectors. Thus, the best possible approximation to the ideal measurement matrix $W^*$ is the product $\hat{W} = L' \Sigma' R'$.
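To make the rank principle concrete, here is a minimal NumPy sketch (not code from the paper) that truncates the SVD of a measurement matrix W, whose 2F rows stack the image coordinates of P tracked features, to its three largest singular values.

```python
import numpy as np

def rank3_factorization(W):
    """Truncate the SVD of the measurement matrix W (2F x P) to rank 3."""
    L, s, Rt = np.linalg.svd(W, full_matrices=False)
    L3, S3, Rt3 = L[:, :3], np.diag(s[:3]), Rt[:3, :]
    W_hat = L3 @ S3 @ Rt3        # best rank-3 approximation of W
    motion = L3 @ np.sqrt(S3)    # 2F x 3 motion factor (up to an affine ambiguity)
    shape = np.sqrt(S3) @ Rt3    # 3 x P shape factor
    return W_hat, motion, shape
```

The split into motion and shape factors is only determined up to an invertible 3x3 transformation; the metric constraints used in the full Tomasi-Kanade method to resolve this ambiguity are omitted here.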
• Registration can be achieved through a local search using gradients
• Tracking is improved by selecting which features should be tracked (see the sketch below)
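A minimal sketch of these two ideas using OpenCV (listed earlier as optional): select Shi-Tomasi "good features to track", then follow them with the pyramidal Lucas-Kanade tracker. The video path is a placeholder.

```python
import cv2

cap = cv2.VideoCapture("example_video.mp4")  # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Select corners with strong gradient structure (features worth tracking)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok or pts is None or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: gradient-based local search for each feature
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    pts = new_pts[status.ravel() == 1].reshape(-1, 1, 2)
    prev_gray = gray

cap.release()
```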
• The robotic system controls 30 cameras based
on the operator-controlled master camera
• The feeds from 30 cameras are blended into
one dynamic panorama
In collaboration with CBS & Princeton Video Imaging
Image Guided Navigation System
to Measure Intraoperatively
Acetabular Implant Alignment
1998
The transformation is first determined using manually specified anatomical landmarks to perform corresponding point registration [6]. Once this initial estimate is determined, the surface-based registration algorithm described in [15] uses the pre- and intra-operative data to refine the initial transformation estimate.
Once the location of the pelvis is determined via registration, navigational feedback can be provided to the surgeon on a television monitor, as seen in Fig. 7. This feedback is used by the surgeon to accurately position the acetabular implant within the acetabular cavity. To align the cup within the acetabulum in the placement determined by the pre-operative plan, the cross-hairs representing the tip of the implant and the top of the handle must be aligned at the fixed cross hair in the center of the image. Once aligned, the implant is in the pre-operatively planned orientation.
Fig. 6. Surface-based registration. Fig. 7. Navigational feedback. Fig. 8. Real-time tracking of the pelvis.
Foundation of the Quality of Life Technology Center, CMU, 2008
Understanding the Phase Contrast Optics
to Restore Artifact-free Microscopy
Images for Segmentation
MICCAI 2012
Link to the YouTube lecture.
Takeo Kanade’s Kyoto Prize lecture
• Lectures and attendance
• 20% attendance
• 20% exercises
• Examination
• 30% mid-term exam
• 30% final exam
Course evaluation
• Form a study group of four students – in the second half of the course
we will have a small challenge and you will have to work as a team
• Every week we will devote 15 minutes to answering questions. Over the entire course we expect each group to prepare 3 questions. Please submit these questions to the coordinator.
• In week 5 we will have a revision class. Each group should submit one question/topic they would like to revise.
Group working & participation