Vision Algorithmics
Adrian F. Clark
VASE Laboratory, Computer Science & Electronic Engineering
University of Essex, Colchester, CO4 3SQ
<alien@essex.ac.uk>
This essay explores vision software development. Rather than focussing on how to use languages,
libraries and tools, it instead explores some of the more difficult problems to solve: having the
wrong software model, numerical problems, and so on. It also touches on non-vision software that
often proves useful, and offers some guidelines on the practicalities of developing vision software.
Contents
1 Introduction
2 Failure of Programming Model
3 Programming Versus Theory
4 Numerical Issues
5 The Right Algorithm in The Right Place
6 Vision Packages
7 Massaging Program Outputs
8 Concluding Remarks
1 Introduction
The purpose of a PhD is to give training and experience in research. The principal output of a
PhD is one or more contributions to the body of human knowledge. This might suggest that the
main concern of a thesis is describing the contribution, the “Big Idea,” but the reality is different:
the thesis must present experimental evidence, analysed in a robust way, that demonstrates that
the Big Idea is indeed valid. For theses in the computer vision area, this gathering and analysis
of experimental evidence invariably involves programming algorithms. Indeed, informal surveys
have found that PhD students working in vision typically spend over 50% of their time in pro-
gramming, debugging and testing algorithms. Hence, becoming experienced in vision software
development and evaluation is an important, if not essential, element of research training.
Many researchers consider the programming of vision algorithms to be straightforward be-
cause it simply instantiates the underlying mathematics. This attitude is naïve, for it is well
known that mathematically-tractable solutions do not guarantee viable computational algorithms.
Indeed, the author considers a programming language to be a formal notation for expressing al-
gorithms, one that can be checked by a machine, and a program to be the formal description of
an algorithm that incorporates mathematical, numerical and procedural aspects.
Rather than present a dusty survey of the vision algorithms in the 100+ image processing packages available commercially or for free, this essay tries to do something different:
it attempts to illustrate, through the use of examples, the major things that you should bear in
mind when coding up your own algorithms or looking at others’ implementations. It concentrates
largely on what can go wrong. The main idea the author is trying to instill is a suspicion of all
results produced by software! Example code is written in C but the principles are equally valid in
C++, Java, Matlab, Python or any other programming language you may use.
Having made you thoroughly paranoid about software issues, the document will then briefly
consider what packages are available free of charge both to help you write vision software and
analyse the results that come from it. That will be followed by a brief discussion that leads into
the exploration of performance characterisation in a separate lecture in the Summer School.
2 Failure of Programming Model
Let us consider the first area in which computer vision programs often go wrong. The problem,
which is a fundamental one in programming terms, is that the conceptual model one is using as
the basis of the software has deficiencies.
Perhaps the best way to illustrate this is by means of an example. The following C code
performs image differencing, a simple technique for isolating moving areas in image sequences.
typedef unsigned char byte;
void sub_ims (byte **i1, byte **i2, int ny, int nx)
{
int y, x;
for (y = 0; y < ny; y++)
for (x = 0; x < nx; x++)
i1[y][x] = i1[y][x] - i2[y][x];
}
The procedure consists of two nested loops, the outer one scanning down the lines of the image
and the inner one accessing each pixel of the line; it was written with exposition in mind, not
speed. Let us consider what is good and bad about the code.
Firstly, the code accesses the pixels in the correct order. Two-dimensional arrays in C are
represented as ‘arrays of arrays,’ which means that the last subscript should cycle fastest to access
adjacent memory locations. On PC- and workstation-class machines, all of which have virtual
memory subsystems, failure to do this could lead to large numbers of page faults, substantially
slowing down the code. Incidentally, it is commonly reported that this double-subscript approach
is dreadfully inefficient as it involves multiplications to subscript into the array — but that is
complete bunkum!
There are several bad aspects to the code. Arrays i1 and i2 are declared as being of type
unsigned char, which means that pixel values must lie in the range 0–255; so the code is
constrained to work with 8-bit imagery. This is a reasonable assumption for the current generation
of video digitizers, but the software cannot be used on data recorded from a 10-bit flat-bed scanner,
or from a 14-bit remote sensing satellite, without throwing some potentially important resolution
away.
What happens if, for some particular values of the indices y and x, the value in i2 is greater
than that in i1? The subtraction will yield a negative value, one that cannot be written back
into i1. Some run-time systems will generate an exception; others will silently reduce the result
modulo 256 and continue, so that the user erroneously believes that the operation succeeded.
In any case, one will not get what one expected. The simplest way around both this and the
previous problem is to represent each pixel as a 32-bit signed integer rather than as an unsigned
byte. Although images will then occupy four times as much memory (and require longer to read
from disk, etc.), it is always better to get the right answer slowly than an incorrect one quickly.
Alternatively, one could even represent pixels as 32-bit floating-point numbers as they provide the
larger dynamic range (at the cost of reduced accuracy) needed for things like Fourier transforms.
On modern processors, the penalty in computation speed is negligible for either 32-bit integers or
floating-point numbers. Indeed, because of the number of integer units on a single processor and
the way its pipelines are filled, floating-point code can sometimes run faster than integer code!
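A minimal sketch of the same differencing operation with 32-bit signed pixels (the name sub_ims_int is purely illustrative) shows how little needs to change:

/* Image differencing with 32-bit signed pixels: negative results are
   now representable, so nothing overflows or wraps silently. */
void sub_ims_int (int **i1, int **i2, int ny, int nx)
{
    int y, x;

    for (y = 0; y < ny; y++)
        for (x = 0; x < nx; x++)
            i1[y][x] = i1[y][x] - i2[y][x];
}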
You should note that this problem with overflowing the representation is not restricted to
subtraction: addition and multiplication have at least as much potential to wreak havoc with the
data. Division requires even more care for, with a purely integer representation, one must either
recognise that the fractional part of any division will be discarded or remember to include code
that rounds the result, as in
i1[y][x] = (i1[y][x] + 255) / i2[y][x];
for 8-bit unsigned imagery; the number added changes to 65535 for 16-bit unsigned imagery, and
so on.
There are ways of avoiding these types of assumptions. Although it is not really appropriate
to go into these in detail in this document, an indication of the general approaches is relevant.
Firstly, by making a class (or structured data type in non-object-oriented languages such as C)
called ‘image’, one can arrange to record the data type of the pixels in each image; a minimal
approach in C is:
typedef struct {
    int type;     /* code identifying the pixel data type */
    void *data;   /* pointer to the pixel data itself */
} image;
...
image *i1, *i2;
Inside the processing routine, one can then select separate loops based on the value of i1->type.
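As a sketch of how such a dispatch might look, treating the pixel data as a flat array (the type codes here are purely illustrative, not from any particular library):

enum {PIX_BYTE, PIX_INT, PIX_FLOAT};   /* illustrative pixel-type codes */

void sub_ims_generic (image *i1, image *i2, int ny, int nx)
{
    long i, n = (long) ny * nx;

    switch (i1->type) {
    case PIX_BYTE: {
        byte *p = (byte *) i1->data, *q = (byte *) i2->data;
        for (i = 0; i < n; i++) p[i] = p[i] - q[i];
        break;
    }
    case PIX_INT: {
        int *p = (int *) i1->data, *q = (int *) i2->data;
        for (i = 0; i < n; i++) p[i] = p[i] - q[i];
        break;
    }
    /* ...and so on for the other pixel types... */
    }
}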
The code also has serious assumptions implicitly built in. It assumes that images comprise only
one channel of information, so that colour or multi-spectral data require multiple invocations.
More seriously, it assumes that an image can be held entirely in computer memory. Again, this is
reasonable for imagery grabbed from a video camera but perhaps unreasonable for 6000 × 6000
pixel satellite data or for images recorded from 24 Mpixel digital cameras being processed on an
embedded system.
The traditional way of avoiding having to store the entire image in memory is to employ line-
by-line access:
for (y = 0; y < ny; y++) {
buf = getline (y);
for (x = 0; x < nx; x++)
...operate on buf[x]...
putline (y, buf);
}
This doesn’t actually have to involve line-by-line access to a disk file as getline can return a
pointer to the line of an image held in memory: it is logical line-by-line access. This approach has
been used by the author and colleagues to produce code capable of being used unchanged on both
serial and parallel hardware as well as on clusters, grids and clouds.
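As an illustration of how the logical access might be realised for an in-memory image (the names and the file-scope pointer are purely for exposition; the '_mem' suffix avoids clashing with POSIX getline), getline can be as simple as:

static int **the_image;   /* assumed to be allocated elsewhere */

/* Return a pointer to line y of the in-memory image. */
int *getline_mem (int y)
{
    return the_image[y];
}

/* Nothing to do: the buffer returned by getline_mem points into the
   image itself, so any changes have already been made in place. */
void putline_mem (int y, int *buf)
{
    (void) y;
    (void) buf;
}

A disk-based or distributed implementation would provide the same two routines with different bodies, leaving the processing loop itself untouched.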
3 Programming Versus Theory
Sticking with this notion of using familiar image processing operations to illustrate problems, let us
consider convolution performed in image space. This involves multiplying the pixels in successive
regions of the image with a set of coefficients, putting the result of each set of computations
back into the middle of the region. For example, the well-known Laplacean operator uses the
coefficients
$$\begin{pmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{pmatrix}$$
while the coefficients
$$\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$
yield a 3 × 3 blur. When programming up any convolution operator that uses such a set of coefficients, the most important design question is what happens at the edges? There are several approaches in regular use:
• don’t process the edge region, which is a one-pixel border for a 3 × 3 ‘mask’ of coefficients,
for example;
• reduce the size of the mask as one approaches the edge;
• imagine the image is reflected along its first row and column and program the edge code
accordingly;
• imagine the image wraps around cyclically.
Which of these is used is often thought to be down to the whim of the programmer; but only one
of these choices matches the underlying mathematics.
The correct solution follows from the description of the operator as a convolution performed in
image space. In signal processing, convolution is usually performed in Fourier space: the image is
Fourier transformed to yield a spectrum; that spectrum is multiplied by a filter, and the resulting
product inverse-transformed. The reason that convolutions are not programmed in this way in
practice is that, for small-sized masks, the Fourier approach requires more computation, even
when using the FFT discussed below. The use of Fourier transformation, or more precisely the
discrete Fourier transform, involves implicit assumptions; the one that is relevant here is that the
data being transformed are cyclic in nature, as though the image is the unit cell of an infinitely-
repeating pattern. So the only correct implementation, at least as far as the theory is concerned,
is the last option listed above.
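To make the wrap-around concrete, a 3 × 3 convolution with cyclic boundary handling might be sketched as follows (the names are illustrative, and the routine writes into a separate output array so that partially-processed pixels are never re-used as input):

/* 3x3 convolution with cyclic (wrap-around) boundary handling.  The
   modulo arithmetic treats the image as the unit cell of an
   infinitely-repeating pattern, matching the DFT's assumption. */
void convolve3x3_cyclic (float **in, float **out, int ny, int nx,
                         float mask[3][3])
{
    int y, x, dy, dx;

    for (y = 0; y < ny; y++)
        for (x = 0; x < nx; x++) {
            float sum = 0.0f;
            for (dy = -1; dy <= 1; dy++)
                for (dx = -1; dx <= 1; dx++) {
                    int yy = (y + dy + ny) % ny;   /* wrap vertically */
                    int xx = (x + dx + nx) % nx;   /* wrap horizontally */
                    sum += mask[dy + 1][dx + 1] * in[yy][xx];
                }
            out[y][x] = sum;
        }
}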
4 Numerical Issues
One thing that seems to be forgotten far too often these days is that floating-point arithmetic is
not particularly accurate. The actual representation of a floating-point number is $m \times 2^e$, analogous to scientific notation, where m, the mantissa, can be thought of as having the binary point immediately to its left and e, the exponent, is the power of two that ‘scales’ the mantissa. Hence, (IEEE-format) 32-bit floating-point corresponds to 7–8 decimal digits and 64-bit to roughly twice that. Because of this and other issues, floating-point arithmetic is significantly less accurate than a pocket calculator! The most common problem is that, in order to add or subtract two numbers, the representation of the smaller number must be changed so that it has the same exponent as the larger, and this involves right-shifting binary digits out of the mantissa. This means that, if the numbers differ by a factor of about $10^7$, all the digits of the mantissa are shifted out and the smaller number effectively becomes zero. The ‘obvious’ solution is to use 64-bit floating-point (double in C) but this simply postpones the problem to bigger numbers; it does not solve it.
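The effect is easy to demonstrate for yourself; a minimal example:

#include <stdio.h>

int main (void)
{
    float big = 1.0e8f;       /* needs more bits than a 24-bit mantissa */
    float sum = big + 1.0f;   /* the 1.0 is shifted out entirely */

    printf ("%s\n", sum == big ? "the 1.0 was lost" : "the 1.0 survived");
    return 0;
}

On an IEEE-conformant system this prints the first message: the single-precision mantissa cannot hold both magnitudes at once.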
You might think that these sorts of problems will not occur in computer vision; after all, images
generally involve numbers only in the range 0–255. But that is simply not true; let us consider
two examples of where this kind of problem can bite the unwary programmer.
The first example is well-known in numerical analysis. The task of solving a quadratic equation
crops up surprisingly frequently, perhaps in fitting to a maximum when trying to locate things
accurately in images, or when intersecting vectors with spheres when working on 3D problems.
The solution to
$$ax^2 + bx + c = 0$$
is something almost everyone learns at school:
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
and indeed this works well for many quadratic problems. But there is a numerical problem hidden
in there.
When the discriminant, $b^2 - 4ac$, involves values that make $b^2 \gg 4ac$, the nature of floating-point subtraction alluded to above can make $4ac \to 0$ relative to $b^2$, so that the discriminant becomes $\pm b$ ... and this means that the lower of the two solutions to the equation is $-b + b = 0$.
Fortunately, numerical analysts realized this long ago and have devised mathematically-equivalent
formulations that do not suffer from the same numerical instability. If we first calculate
$$q = -\tfrac{1}{2}\left(b + \operatorname{sgn}(b)\sqrt{b^2 - 4ac}\right)$$
then the two solutions to the quadratic are given by
$$x_1 = c/q \quad \text{and} \quad x_2 = q/a.$$
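A sketch of this formulation in C (purely illustrative; the linear and complex-root cases are not handled beyond a failure return):

#include <math.h>

/* Solve a*x^2 + b*x + c = 0 using the numerically stable formulation.
   Returns 0 on success, -1 if a is zero or the roots are complex. */
int solve_quadratic (double a, double b, double c, double *x1, double *x2)
{
    double disc = b * b - 4.0 * a * c;
    double q;

    if (a == 0.0 || disc < 0.0) return -1;
    q = -0.5 * (b + copysign (sqrt (disc), b));
    *x1 = c / q;
    *x2 = q / a;
    return 0;
}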
Even code that looks straightforward can lead to difficult-to-find numerical problems. Let us
consider yet another simple image processing operation, finding the standard deviation (SD) of
the pixels in an image. The definition of the SD is straightforward enough:
$$\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$
where $\bar{x}$ is the mean of the image and N the number of pixels in it. (We shall ignore the distinction between population and sample SDs, which is negligible for images.) If we program this equation to calculate the SD, the code has to make two passes through the image: the first calculates the mean, $\bar{x}$, while the second finds deviations from the mean and hence calculates the SD.
Most of us probably learnt in statistics courses how to re-formulate this: substitute the defini-
tion of the mean in this equation and simplify the result. We end up needing to calculate
$$\sum x^2 - \frac{\left(\sum x\right)^2}{N}$$
which can be programmed up using a single pass through the image as follows.
#include <math.h>

float sd (float **im, int ny, int nx)
{
    float v, var, sum = 0.0, sum2 = 0.0;
    int y, x;

    for (y = 0; y < ny; y++) {
        for (x = 0; x < nx; x++) {
            v = im[y][x];
            sum = sum + v;
            sum2 = sum2 + v * v;
        }
    }
    v = nx * ny;
    var = (sum2 - sum * sum / v) / v;
    if (var <= 0.0) return 0.0;
    return sqrt (var);
}
I wrote such a routine myself while doing my PhD and it was used by about thirty people on a
daily basis for two or three years before someone came to see me with a problem: on his image,
the routine was returning a value of zero for the SD, even though there definitely was variation in
the image.
After a few minutes’ examination of the image, we noticed that the image in question com-
prised large values but had only a small variation. This identified the source of the problem: the
single-pass algorithm involves a subtraction and the quantities involved, which represented the
sums of squares or similar, ended up differing by less than one part in $\sim 10^7$ and hence yielded
inaccurate results from my code. I introduced two fixes: the first was to change sum and sum2 to be double rather than float; the second was to look for cases where the subtraction in question yielded a result that was small or negative and, in those cases, calculate only the mean and then make a second pass through the image. It’s a fairly ad hoc solution but proved adequate, and I and my
colleagues haven’t been bitten again.
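For completeness, here is a sketch of the straightforward two-pass form, which avoids the cancellation entirely at the cost of reading the image twice:

#include <math.h>

/* Two-pass standard deviation: find the mean first, then accumulate
   squared deviations from it.  Slower than the single-pass version
   above but immune to the catastrophic cancellation just described. */
double sd_twopass (float **im, int ny, int nx)
{
    double sum = 0.0, sumsq = 0.0, mean;
    double n = (double) ny * nx;
    int y, x;

    for (y = 0; y < ny; y++)
        for (x = 0; x < nx; x++)
            sum += im[y][x];
    mean = sum / n;

    for (y = 0; y < ny; y++)
        for (x = 0; x < nx; x++) {
            double d = im[y][x] - mean;
            sumsq += d * d;
        }
    return sqrt (sumsq / n);
}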
5 The Right Algorithm in The Right Place
The fast Fourier transform (FFT) is a classic example of the distinction between mathematics and
‘algorithmics.’ The underlying discrete Fourier transform is normally represented as a matrix mul-
tiplication; hence, for N data points, O(N²) complex multiplications are required to evaluate it.
The (radix-2) FFT algorithm takes the matrix multiplication and, by exploiting symmetry proper-
ties of the transform kernel, reduces the number of multiplications required to O(N log2 N); for a
256 × 256 pixel image, this is a saving of roughly
$$\left(\frac{256^2}{256 \times 8}\right)^2 = \left(\frac{2^{16}}{2^{11}}\right)^2 = 1024 \text{ times!}$$
These days, an FFT is entirely capable of running in real time on even a moderately fast processor;
a DFT is not.
On the other hand, one must not use cute algorithms blindly. When median filtering, for
example, one obviously needs to determine the median of a set of numbers, and the obvious
way to do that is to sort the numbers into order and choose the middle value. Reaching for our
textbook, we find that the ‘quicksort’ algorithm (see, for example, [1]) has the best performance
for random data, again with computation increasing as O(N log2 N), so we choose that. But
O(N log2 N) is its best-case performance; its worst-case performance is O(N²), so if we have an
unfortunately-ordered set of data, quicksort is a poor choice! A better compromise is probably
Shell’s sort algorithm: its best-case performance isn’t as good as that of the quicksort but its
worst-case performance is nowhere near as bad — and the algorithm is simpler to program and
debug.
Even selecting the ‘best’ algorithm is not the whole story. Median filtering involves working
on many small sets of numbers rather than one large set, so the way in which the sort algorithm’s
performance scales is not the overriding factor. Quicksort is most easily implemented recursively,
and the time taken to save registers and generate a new call frame will probably swamp the com-
putation, even on a RISC processor; so we are pushed towards something that can be implemented
iteratively with minimal overhead — which might even mean a bubble sort! But wait: why bother
sorting at all? There are median-finding algorithms that do not involve re-arranging data, either
by using multiple passes over the data or by histogramming. One of these is almost certainly
faster.
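For 8-bit data, for example, the median can be read straight out of a histogram; a sketch:

/* Median of n 8-bit values via a 256-bin histogram: no re-arrangement
   of the data and O(n) work.  For even n this returns the upper of the
   two middle values. */
unsigned char median8 (const unsigned char *v, int n)
{
    int hist[256] = {0};
    int i, count = 0;

    for (i = 0; i < n; i++) hist[v[i]]++;
    for (i = 0; i < 256; i++) {
        count += hist[i];
        if (count > n / 2) return (unsigned char) i;
    }
    return 0;   /* not reached for n > 0 */
}

For the small windows used in median filtering, the histogram can even be updated incrementally as the window slides across the image, which is faster still.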
Where do you find out about numerical algorithms? The standard reference these days is [2],
though I should warn you that many numerical analysts do not consider its algorithms to be state-
of-the-art, or even necessarily good. (A quick web-search will turn up many discussions regarding
the book and its contents.) The best numerical algorithms that the author is aware of are those
licensed from NAg, the Numerical Algorithms Group. All UK higher education establishments
should have NAg licenses.
Considerations of possible numerical issues such as the ones highlighted here are important if
you need to obtain accurate answers or if your code forms part of an active vision system.
6 Vision Packages
If you’ve been reading through this document carefully, you’re probably wondering whether it is
ever possible to produce vision software that works reliably. Well, it is possible — but you cannot
tell whether or not a piece of software has been written carefully and uses sensible numerical tech-
niques without looking at its source code. For that reason, the author, like many other researchers,
has a strong preference for open-source software.
What is out there in the open-source world? There are many vision packages, though
all of them are the results of fairly small groupings of researchers and developers. Recent years
have seen the research community focus on OpenCV and Matlab. As a rule of thumb, people who
need their software to run in real time or have huge amounts of data to crunch through choose
OpenCV, while people whose focus is algorithm development often work with Matlab. As both
of these are covered elsewhere in this Summer School, here are a few others that you might be
interested in looking at:
VXL: http://vxl.sourceforge.net/
A heavily-templated C++ class library being developed jointly by people in a few research
groups, including (the author believes) Oxford and Manchester. VXL grew out of Target Jr,
which was a prototype of the Image Understanding Environment, a well-funded initiative
to develop a common programming platform for the vision community. (Unfortunately, the
IUE’s aspirations were beyond what could be achieved with the hardware and software of
the time and it failed.)
Tina: http://www.tina-vision.net/
Tina is probably the only vision library currently around that makes a conscious effort to
provide facilities that are statistically robust. While this is a great advantage and a lot of
work has been put into making it more portable and easy to work with, it is fair to say that
a significant amount of effort needs to be put into learning to use it.
EVE: http://vase.essex.ac.uk/software/eve.html
EVE, the Easy Vision Environment, is a pure-Python vision package built on top of numpy.
It aims to provide easy-to-use functionality for common image processing and computer
vision tasks, the intention being for them to be used during interactive sessions, from the
Python interpreter’s command prompt or from an enhanced interpreter such as ipython as
well as in scripts. It does not use Python’s OpenCV wrappers, giving the Python developer
an independent check of OpenCV functionality.
In the author’s opinion, the biggest usability problem with current computer vision packages
is that they cater principally for the programmer: they rarely provide facilities for prototyping
algorithms via a command language or for producing graphical user interfaces; and their visual-
isation capabilities are typically not great. These deficiencies are ones that Matlab, for example,
addresses very well.
Essentially all of the vision packages available today, free or commercial, are designed for a
single processor; they cannot spread computation across a cluster, for example, or migrate compu-
tation onto a GPU. That is a problem for algorithms that may take minutes to run and consequently
tens of hours to evaluate. There is a definite need for something that does this, provides a script-
ing language and good graphical capabilities, is based around solid algorithms, and is well tested.
OpenCL (not to be confused with OpenGL, the well-known 3D rendering package) may help us
achieve that, and the author has observed a few GPU versions of vision algorithms (SIFT, ransac,
etc.) starting to appear.
With the exception of OpenCV, which includes a (limited) machine learning library, vision
packages work on images. Once information has been extracted from images, you usually have to
find other facilities. This is another distinct shortcoming, as the author is certain that many task-
specific algorithms are very weak in their processing of the data extracted from images. However,
there are packages around that can help:
Image format conversion. It sometimes seems that there are as many different image formats
as there are research groups! You are almost certain to need to convert the format of some
image data files, and there are two good toolkits around that make this very easy, both of
which are available for Windows and Unix (including Linux and MacOS X). Firstly, there is
NETPBM, an enhancement of Jef Poskanzer’s PBMPLUS (http://netpbm.sourceforge.net/): this is a set of programs that convert most popular formats to and from its own
internal format. If you use an image format within your research group that isn’t already
supported by NETPBM, the author’s advice to you is to write compatible format converters
as quickly as possible; they’ll prove useful at some point. The second package worthy of
mention is ImageMagick (http://www.imagemagick.org/), which includes both a good
converter and a good display program. There are also interfaces between ImageMagick and
the Perl, Python and Tcl interpreters, which can be useful.
Machine learning. Weka (http://www.cs.waikato.ac.nz/ml/weka/) is a Java machine learn-
ing package really intended for data mining. It contains tools for data pre-processing, classi-
fication, regression, clustering, association rules, and visualization. I’ve heard good reports
about its facilities — if your data can be munged into something that it can work on.
Statistical software. The statistics research community has been good at getting its act together,
much better than the vision community, and much of its research now provides facilities that
interface to R (http://www.r-project.org/). R is a free implementation of S (and S+),
commercial statistical packages that grew out of work at Bell Laboratories. R is available
for both Unix and Windows (it’s a five-minute job to install on Debian Linux, for example)
and there are books around that provide introductions for all levels of users. Unusually for
free software, it provides excellent facilities for graphical displays of data.
Neural networks. Many good things have been said about netlab (http://www.ncrg.aston.ac.uk/netlab/), a Matlab plug-in that provides facilities rarely seen elsewhere.
Genetic algorithms. Recent years have seen genetic algorithms gain in popularity as a robust
optimisation mechanism. There are many implementations of GAs around the Net; but the
author has written a simple, real-valued GA package with a root-polishing option which
seems to be both fast and reliable. It’s available in both C and Python; do contact him if you
would like a copy.
Whatever you choose to work with, whether from this list or not, treat any results with the same
suspicion as you treat those from your own code.
7 Massaging Program Outputs
One of the things you will inevitably have to do is take the output from a program and manipulate
it into some other form. This usually occurs when “plugging together” two programs or when
trying to gather success and failure rates. There are two specific things that will make your life easier:
• learn a scripting language; the most suitable is currently Python, largely because of the
excellent numpy and scipy extensions and its OpenCV functionality;
• don’t build a graphical user interface (GUI) into your program.
Scripting languages are designed to be used as software ‘glue’ between programs; they all provide
easy-to-use facilities for processing textual information, including regular expressions. Which of
them you choose is largely up to personal preference; the author has used Lua, Perl, Python, Ruby
and Tcl — after a long time working with Perl and Tcl, he finally settled on Python. Indeed, most of
the author’s programming is done in a scripting language as it is really a much more time-efficient
way to work.
Building a GUI into a piece of research code is a mistake because it makes it practically im-
possible to use in any kind of automated processing. A much better solution is to produce a
comprehensive command-line interface, one that allows most internal parameters and thresholds
to be set, and then use one of the scripting languages listed above to produce the GUI. For ex-
ample, the author built the interface shown in Figure 1 using Tcl and its Tk graphical toolkit in
about two hours. The interface allows the user to browse through a set of nearly 8,000 images and
an expert’s opinion of them, fine-tuning them when the inevitable mistakes crop up. This 250-line
GUI script works on Unix (including Linux and MacOS X) and Windows entirely unchanged.
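For the command-line side, here is a sketch of the sort of interface meant above, built on POSIX getopt (the option letters and parameters are purely illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of a command-line interface that exposes internal parameters,
   so that scripts (or a separate GUI wrapper) can drive the program. */
int main (int argc, char *argv[])
{
    double threshold = 0.5;   /* illustrative internal parameters */
    int mask_size = 3;
    int opt;

    while ((opt = getopt (argc, argv, "t:m:")) != -1) {
        switch (opt) {
        case 't': threshold = atof (optarg); break;
        case 'm': mask_size = atoi (optarg); break;
        default:
            fprintf (stderr, "usage: %s [-t threshold] [-m size] image...\n", argv[0]);
            return 1;
        }
    }
    /* ...process the image files named in argv[optind] onwards... */
    printf ("threshold = %g, mask size = %d\n", threshold, mask_size);
    return 0;
}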
Figure 1: Custom-designed graphical interface using a scripting language
8 Concluding Remarks
The aim of this essay is to make you aware of the major classes of problem that can and do crop
up in vision software. Beyond that, the major message to take home is not to believe results, either
those from your own software or anyone else’s, without checking. A close colleague of the author’s
will often spend a day getting a feel for the nature of her imagery, and she can spot a problem
in a technique better than anyone else I’ve ever worked with. That is not, I suggest, entirely
coincidence.
Another important thing is to see if changing tuning parameters has the effect that you expect.
For example, one of the author’s students recently wrote some stereo software. Its accuracy was
roughly as anticipated but, following my suggestion that she explore the software using simulated
imagery, she found that accuracy decreased as the image size increased. She was uneasy about this
and, after checking with me for a ‘second opinion,’ took a good look at her software. Sure enough,
she found a bug and, re-running the tests after fixing it, found that accuracy then increased with
image size — in keeping with both our expectations. Without both a feel for the problem and
carrying out experiments to see that expected results do indeed occur, the bug might have lain
dormant in the code for weeks or months, only to become apparent after more research had been
built on it — effort that would have to be repeated.
This idea of exploring an algorithm by varying its parameters is actually an important one.
Measuring the success and failure rates as each parameter is varied allows the construction of re-
ceiver operating characteristic (ROC) curves, currently one of the most popular ways of presenting
performance data in vision papers.
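As a sketch of the bookkeeping involved, one ROC point might be computed as follows for a given threshold; sweeping the threshold over its range traces out the curve (the score and ground-truth arrays are illustrative):

/* Compute one ROC point (true- and false-positive rates) from detector
   scores and ground-truth labels at a single threshold. */
void roc_point (const double *score, const int *truth, int n,
                double threshold, double *tpr, double *fpr)
{
    int tp = 0, fp = 0, pos = 0, neg = 0, i;

    for (i = 0; i < n; i++) {
        if (truth[i]) pos++; else neg++;
        if (score[i] >= threshold) {
            if (truth[i]) tp++; else fp++;
        }
    }
    *tpr = pos > 0 ? (double) tp / pos : 0.0;
    *fpr = neg > 0 ? (double) fp / neg : 0.0;
}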
However, it is possible to go somewhat further. Let us imagine we have written a program
that processes images of the sky and attempts to estimate the proportion of the image that has
cloud cover (a surprisingly difficult task); this is one of the problems that the data in Figure 1 are
used for. A straightforward evaluation of the algorithm, the kind presented in most papers, would
simply report how often the amount of cloud cover detected by the program matches the expert’s
opinion. However, as the expert also indicated the predominant type of cloud in the image, we
can explore the algorithm in much more detail. We might expect that a likely cause of erroneous
classification is mistaking thin, wispy cloud for blue sky. As both cirrus and altostratus are thin and
wispy, we can run the program against the database of images and see how often the estimate
of cloud cover is wrong when these types of cloud occur. This approach is much, much more
valuable: it gathers evidence to see if there is a particular type of imagery on which our program
will fail, and that tells us where to expend effort in improving the algorithm. This is performance
characterisation, the first step towards producing truly robust vision algorithms — and that forms
the basis of the material that is presented in a separate Summer School lecture.
References
[1] Robert Sedgewick. Algorithms in C. Addison-Wesley, 1990.
[2] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical
Recipes in C. Cambridge University Press, second edition, 1992.
More Related Content

PDF
2010 bristol q1_formal-property-checkers
PDF
Better neuroimaging data processing: driven by evidence, open communities, an...
PPTX
Defuzzification
PDF
Face Identification Project Abstract 2017
PDF
Midterm Exam Solutions Fall03
PDF
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
PDF
Final Exam Questions Fall03
PDF
Deep learning for detecting anomalies and software vulnerabilities
2010 bristol q1_formal-property-checkers
Better neuroimaging data processing: driven by evidence, open communities, an...
Defuzzification
Face Identification Project Abstract 2017
Midterm Exam Solutions Fall03
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Final Exam Questions Fall03
Deep learning for detecting anomalies and software vulnerabilities

What's hot (17)

PPTX
Design and Analysis of DNA String Matching by Optical Parallel Processing
PDF
C04701019027
PDF
Credit card fraud detection and concept drift adaptation with delayed supervi...
PDF
Accord.Net: Looking for a Bug that Could Help Machines Conquer Humankind
PDF
Deep learning and applications in non-cognitive domains I
PDF
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
PDF
New Design Architecture of Chaotic Secure Communication System Combined with ...
PDF
Deep learning and applications in non-cognitive domains II
PDF
Using Deep Learning to Find Similar Dresses
DOC
Presentation on Machine Learning and Data Mining
PDF
Recent Trends in Deep Learning
PDF
A Survey on Visual Cryptography Schemes
PPT
Eckovation machine learning project
PPTX
Inverse Modeling for Cognitive Science "in the Wild"
PDF
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
PPTX
Attractive light wid
PPTX
User Interfaces that Design Themselves: Talk given at Data-Driven Design Day ...
Design and Analysis of DNA String Matching by Optical Parallel Processing
C04701019027
Credit card fraud detection and concept drift adaptation with delayed supervi...
Accord.Net: Looking for a Bug that Could Help Machines Conquer Humankind
Deep learning and applications in non-cognitive domains I
EXTENDING OUTPUT ATTENTIONS IN RECURRENT NEURAL NETWORKS FOR DIALOG GENERATION
New Design Architecture of Chaotic Secure Communication System Combined with ...
Deep learning and applications in non-cognitive domains II
Using Deep Learning to Find Similar Dresses
Presentation on Machine Learning and Data Mining
Recent Trends in Deep Learning
A Survey on Visual Cryptography Schemes
Eckovation machine learning project
Inverse Modeling for Cognitive Science "in the Wild"
Performance Analysis of Various Data Mining Techniques on Banknote Authentica...
Attractive light wid
User Interfaces that Design Themselves: Talk given at Data-Driven Design Day ...
Ad

Viewers also liked (6)

PDF
Performance characterization in computer vision
PDF
Machine learning for computer vision part 2
PDF
A primer for colour computer vision
PDF
BMVA summer school MATLAB programming tutorial
PDF
Statistical models of shape and appearance
PDF
Image formation
Performance characterization in computer vision
Machine learning for computer vision part 2
A primer for colour computer vision
BMVA summer school MATLAB programming tutorial
Statistical models of shape and appearance
Image formation
Ad

Similar to Vision Algorithmics (20)

PDF
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
PDF
Modul mBlock 5 and arduino.pdf
PPT
UNIT-II.ppt kkljfuudvmllmhghdwscnmlitfxcchmkk
PPT
UNIT-II.ppt artificial intelligence cse bkk
PDF
rooter.pdf
PDF
Report face recognition : ArganRecogn
PDF
Partial Object Detection in Inclined Weather Conditions
PDF
Automatic License Plate Recognition using OpenCV
PDF
Automatic License Plate Recognition using OpenCV
PPTX
Code instrumentation
PDF
Questions On The Equation For Regression
PDF
ODSC West 2022 – Kitbashing in ML
DOCX
Image Recognition Expert System based on deep learning
PDF
Color based image processing , tracking and automation using matlab
PDF
Computer graphics by bahadar sher
PDF
Log polar coordinates
PDF
Finding Resource Manipulation Bugs in Linux Code
PDF
How to avoid bugs using modern C++
PDF
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
PDF
Lecture-1-2-+(1).pdf
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
Modul mBlock 5 and arduino.pdf
UNIT-II.ppt kkljfuudvmllmhghdwscnmlitfxcchmkk
UNIT-II.ppt artificial intelligence cse bkk
rooter.pdf
Report face recognition : ArganRecogn
Partial Object Detection in Inclined Weather Conditions
Automatic License Plate Recognition using OpenCV
Automatic License Plate Recognition using OpenCV
Code instrumentation
Questions On The Equation For Regression
ODSC West 2022 – Kitbashing in ML
Image Recognition Expert System based on deep learning
Color based image processing , tracking and automation using matlab
Computer graphics by bahadar sher
Log polar coordinates
Finding Resource Manipulation Bugs in Linux Code
How to avoid bugs using modern C++
Bridging Concepts and Practice in eScience via Simulation-driven Engineering
Lecture-1-2-+(1).pdf

More from potaters (10)

PDF
Ln l.agapito
PDF
Motion and tracking
PDF
Machine learning fro computer vision - a whirlwind of key concepts for the un...
PDF
Low level vision - A tuturial
PDF
Local feature descriptors for visual recognition
PDF
Image segmentation
PDF
Cognitive Vision - After the hype
PDF
Graphical Models for chains, trees and grids
PDF
Medical image computing - BMVA summer school 2014
PDF
Decision Forests and discriminant analysis
Ln l.agapito
Motion and tracking
Machine learning fro computer vision - a whirlwind of key concepts for the un...
Low level vision - A tuturial
Local feature descriptors for visual recognition
Image segmentation
Cognitive Vision - After the hype
Graphical Models for chains, trees and grids
Medical image computing - BMVA summer school 2014
Decision Forests and discriminant analysis

Recently uploaded (20)

PPTX
PMR- PPT.pptx for students and doctors tt
PDF
Communicating Health Policies to Diverse Populations (www.kiu.ac.ug)
PDF
CuO Nps photocatalysts 15156456551564161
PPT
Animal tissues, epithelial, muscle, connective, nervous tissue
PPTX
Cells and Organs of the Immune System (Unit-2) - Majesh Sir.pptx
PPTX
A powerpoint on colorectal cancer with brief background
PPTX
limit test definition and all limit tests
PPTX
bone as a tissue presentation micky.pptx
PPTX
LIPID & AMINO ACID METABOLISM UNIT-III, B PHARM II SEMESTER
PPTX
Substance Disorders- part different drugs change body
PDF
Chapter 3 - Human Development Poweroint presentation
PDF
The Future of Telehealth: Engineering New Platforms for Care (www.kiu.ac.ug)
PPTX
Understanding the Circulatory System……..
PDF
5.Physics 8-WBS_Light.pdfFHDGJDJHFGHJHFTY
PDF
Metabolic Acidosis. pa,oakw,llwla,wwwwqw
PDF
Unit 5 Preparations, Reactions, Properties and Isomersim of Organic Compounds...
PPTX
Platelet disorders - thrombocytopenia.pptx
PPTX
TORCH INFECTIONS in pregnancy with toxoplasma
PPT
Cell Structure Description and Functions
PDF
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...
PMR- PPT.pptx for students and doctors tt
Communicating Health Policies to Diverse Populations (www.kiu.ac.ug)
CuO Nps photocatalysts 15156456551564161
Animal tissues, epithelial, muscle, connective, nervous tissue
Cells and Organs of the Immune System (Unit-2) - Majesh Sir.pptx
A powerpoint on colorectal cancer with brief background
limit test definition and all limit tests
bone as a tissue presentation micky.pptx
LIPID & AMINO ACID METABOLISM UNIT-III, B PHARM II SEMESTER
Substance Disorders- part different drugs change body
Chapter 3 - Human Development Poweroint presentation
The Future of Telehealth: Engineering New Platforms for Care (www.kiu.ac.ug)
Understanding the Circulatory System……..
5.Physics 8-WBS_Light.pdfFHDGJDJHFGHJHFTY
Metabolic Acidosis. pa,oakw,llwla,wwwwqw
Unit 5 Preparations, Reactions, Properties and Isomersim of Organic Compounds...
Platelet disorders - thrombocytopenia.pptx
TORCH INFECTIONS in pregnancy with toxoplasma
Cell Structure Description and Functions
From Molecular Interactions to Solubility in Deep Eutectic Solvents: Explorin...

Vision Algorithmics

  • 1. Vision Algorithmics Adrian F. Clark VASE Laboratory, Computer Science & Electronic Engineering University of Essex, Colchester, CO4 3SQ halien@essex.ac.uki This essay explores vision software development. Rather than focussing on how to use languages, libraries and tools, it instead explores some of the more difficult problems to solve: having the wrong software model, numerical problems, and so on. It also touches on non-vision software that often proves useful, and offers some guidelines on the practicalities of developing vision software. Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Failure of Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 Programming Versus Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4 Numerical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5 The Right Algorithm in The Right Place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6 Vision Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7 Massaging Program Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1
  • 2. Vision Algorithmics 2 1 Introduction The purpose of a PhD is to give training and experience in research. The principal output of a PhD is one or more contributions to the body of human knowledge. This might suggest that the main concern of a thesis is describing the contribution, the “Big Idea,” but the reality is different: the thesis must present experimental evidence, analysed in a robust way, that demonstrates that the Big Idea is indeed valid. For theses in the computer vision area, this gathering and analysis of experimental evidence invariably involves programming algorithms. Indeed, informal surveys have found that PhD students working in vision typically spend over 50% of their time in pro- gramming, debugging and testing algorithms. Hence, becoming experienced in vision software development and evaluation is an important, if not essential, element of research training. Many researchers consider the programming of vision algorithms to be straightforward be- cause it simply instantiates the underlying mathematics. This attitude is naïve, for it is well known that mathematically-tractable solutions do not guarantee viable computational algorithms. Indeed, the author considers a programming language to be a formal notation for expressing al- gorithms, one that can be checked by a machine, and a program to be the formal description of an algorithm that incorporates mathematical, numerical and procedural aspects. Rather than present a dusty survey of the vision algorithms available in the 100+ image pro- cessing packages available commercially or for free, this essay tries to do something different: it attempts to illustrate, through the use of examples, the major things that you should bear in mind when coding up your own algorithms or looking at others’ implementations. It concentrates largely on what can go wrong. The main idea the author is trying to instill is a suspicion of all results produced by software! Example code is written in C but the principles are equally valid in C++, Java, Matlab, Python or any other programming language you may use. Having made you thoroughly paranoid about software issues, the document will then briefly consider what packages are available free of charge both to help you write vision software and analyse the results that come from it. That will be followed by a brief discussion that leads into the exploration of performance characterisation in a separate lecture in the Summer School.
  • 3. Vision Algorithmics 3 2 Failure of Programming Model Let us consider the first area in which computer vision programs often go wrong. The problem, which is a fundamental one in programming terms, is that the conceptual model one is using as the basis of the software has deficiencies. Perhaps the best way to illustrate this is by means of an example. The following C code performs image differencing, a simple technique for isolating moving areas in image sequences. typedef unsigned char byte; void sub_ims (byte **i1, byte **i2, int ny, int nx) { int y, x; for (y = 0; y < ny; y++) for (x = 0; x < nx; x++) i1[y][x] = i1[y][x] - i2[y][x]; } The procedure consists of two nested loops, the outer one scanning down the lines of the image and the inner one accessing each pixel of the line; it was written with exposition in mind, not speed. Let us consider what is good and bad about the code. Firstly, the code accesses the pixels in the correct order. Two-dimensional arrays in C are represented as ‘arrays of arrays,’ which means that the last subscript should cycle fastest to access adjacent memory locations. On PC- and workstation-class machines, all of which have virtual memory subsystems, failure to do this could lead to large numbers of page faults, substantially slowing down the code. Incidentally, it is commonly reported that this double-subscript approach is dreadfully inefficient as it involves multiplications to subscript into the array — but that is complete bunkum! There are several bad aspects to the code. Arrays i1 and i2 are declared as being of type unsigned char, which means that pixel values must lie in the range 0–255; so the code is constrained to work with 8-bit imagery. This is a reasonable assumption for the current generation of video digitizers, but the software cannot be used on data recorded from a 10-bit flat-bed scanner, or from a 14-bit remote sensing satellite, without throwing some potentially important resolution away. What happens if, for some particular values of the indices y and x, the value in i2 is greater than that in i1? The subtraction will yield a negative value, one that cannot be written back into i1. Some run-time systems will generate an exception; others will silently reduce the result modulo 256 and continue, so that the user erroneously believes that the operation succeeded. In any case, one will not get what one expected. The simplest way around both this and the previous problem is to represent each pixel as a 32-bit signed integer rather than as an unsigned byte. Although images will then occupy four times as much memory (and require longer to read from disk, etc.), it is always better to get the right answer slowly than an incorrect one quickly. Alternatively, one could even represent pixels as 32-bit floating-point numbers as they provide the larger dynamic range (at the cost of reduced accuracy) needed for things like Fourier transforms. On modern processors, the penalty in computation speed is negligible for either 32-bit integers or floating-point numbers. Indeed, because of the number of integer units on a single processor and the way its pipelines are filled, floating-point code can sometimes run faster than integer code!
  • 4. Vision Algorithmics 4 You should note that this problem with overflowing the representation is not restricted to subtraction: addition and multiplication have at least as much potential to wreak havoc with the data. Division requires even more care for, with a purely integer representation, one must either recognise that the fractional part of any division will be discarded or remember to include code that rounds the result, as in i1[y][x] = (i1[y][x] + 255) / i2[y][x]; for 8-bit unsigned imagery; the number added changes to 65535 for 16-bit unsigned imagery, and so on. There are ways of avoiding these types of assumptions. Although it is not really appropriate to go into these in detail in this document, an indication of the general approaches is relevant. Firstly, by making a class (or structured data type in non-object-oriented languages such as C) called ‘image’, one can arrange to record the data type of the pixels in each image; a minimal approach in C is: struct { int type; void *data; } image; ... image *i1, *i2; Inside the processing routine, one can then select separate loops based on the value of i1->type. The code also has serious assumptions implicitly built in. It assumes that images comprise only one channel of information, so that colour or multi-spectral data require multiple invocations. More seriously, it assumes that an image can be held entirely in computer memory. Again, this is reasonable for imagery grabbed from a video camera but perhaps unreasonable for 6000 ⇥ 6000 pixel satellite data or for images recorded from 24 Mpixel digital cameras being processed on an embedded system. The traditional way of avoiding having to store the entire image in memory is to employ line- by-line access: for (y = 0; y < ny; y++) { buf = getline (y); for (x = 0; x < nx; x++) ...operate on buf[x]... putline (y, buf); } This doesn’t actually have to involve line-by-line access to a disk file as getline can return a pointer to the line of an image held in memory: it is logical line-by-line access. This approach has been used by the author and colleagues to produce code capable of being used unchanged on both serial and parallel hardware as well as on clusters, grids and clouds.
  • 5. Vision Algorithmics 5 3 Programming Versus Theory Sticking with this notion of using familiar image processing operations to illustrate problems, let us consider convolution performed in image space. This involves multiplying the pixels in successive regions of the image with a set of coefficients, putting the result of each set of computations back into the middle of the region. For example, the well-known Laplacean operator uses the coefficients 0 @ 1 1 1 1 8 1 1 1 1 1 A while the coefficients 0 @ 1 1 1 1 1 1 1 1 1 1 A yield a 3 ⇥ 3 blur. When programming up any convolution operator that uses such a set of coef- ficients, the most important design question is what happens at the edges? There are several approaches in regular use: • don’t process the edge region, which is a one-pixel border for a 3 ⇥ 3 ‘mask’ of coefficients, for example; • reduce the size of the mask as one approaches the edge; • imagine the image is reflected along its first row and column and program the edge code accordingly; • imagine the image wraps around cyclically. Which of these is used is often thought to be down to the whim of the programmer; but only one of these choices matches the underlying mathematics. The correct solution follows from the description of the operator as a convolution performed in image space. In signal processing, convolution is usually performed in Fourier space: the image is Fourier transformed to yield a spectrum; that spectrum is multiplied by a filter, and the resulting product inverse-transformed. The reason that convolutions are not programmed in this way in practice is that, for small-sized masks, the Fourier approach requires more computation, even when using the FFT discussed below. The use of Fourier transformation, or more precisely the discrete Fourier transform, involves implicit assumptions; the one that is relevant here is that the data being transformed are cyclic in nature, as though the image is the unit cell of an infinitely- repeating pattern. So the only correct implementation, at least as far as the theory is concerned, is the last option listed above.
  • 6. Vision Algorithmics 6 4 Numerical Issues One thing that seems to be forgotten far too often these days is that floating-point arithmetic is not particularly accurate. The actual representation of a floating-point number is m ⇥ 2e , analog- ous to scientific notation, where m, the mantissa, can be thought of as having the binary point immediately to its left and e, the exponent, is the power of two that ‘scales’ the mantissa. Hence, (IEEE-format) 32-bit floating-point corresponds to 7–8 decimal digits and 64-bit to roughly twice that. Because of this and other issues, floating-point arithmetic is significantly less accurate than a pocket calculator! The most common problem is that, in order to add or subtract two numbers, the representation of the smaller number must be changed so that it has the same exponent as the larger, and this involves right-shifting binary digits in the mantissa to the right. This means that, if the numbers differ by about 107 , all the digits of the mantissa are shifted out and the lower number effectively becomes zero. The ‘obvious’ solution is to use 64-bit floating-point (double in C) but this simply postpones the problem to bigger numbers, it does not solve it. You might think that these sorts of problems will not occur in computer vision; after all, images generally involve numbers only in the range 0–255. But that is simply not true; let us consider two examples of where this kind of problem can bite the unwary programmer. The first example is well-known in numerical analysis. The task of solving a quadratic equation crops up surprisingly frequently, perhaps in fitting to a maximum when trying to locate things accurately in images, or when intersecting vectors with spheres when working on 3D problems. The solution to ax2 + bx + c = 0 is something almost everyone learns at school: x = b ± p b2 4ac 2a and indeed this works well for many quadratic problems. But there is a numerical problem hidden in there. When the discriminant, b2 4ac, involves values that make b2 4ac, the nature of floating- point subtraction alluded to above can make 4ac ! 0 relative to b2 so that the discriminant becomes ±b...and this means that the lower of the two solutions to the equation is b + b = 0. Fortunately, numerical analysts realized this long ago and have devised mathematically-equivalent formulations that do not suffer from the same numerical instability. If we first calculate q = 1 2 Ä b + sgn(b) p b2 4ac ä then the two solutions to the quadratic are given by x1 = c/q and x2 = q/a. Even code that looks straightforward can lead to difficult-to-find numerical problems. Let us consider yet another simple image processing operation, finding the standard deviation (SD) of the pixels in an image. The definition of the SD is straightforward enough 1 N NX i=1 (xi ¯x)2
  • 7. Vision Algorithmics 7 where ¯x is the mean of the image and N the number of pixels in it. (We shall ignore the distinction between population and sample SDs, which is negligible for images.) If we program this equation to calculate the SD, the code has to make two passes through the image: the first calculates the mean, ¯x, while the second finds deviations from the mean and hence calculates the SD. Most of us probably learnt in statistics courses how to re-formulate this: substitute the defini- tion of the mean in this equation and simplify the result. We end up needing to calculate X x2 P x 2 N which can be programmed up using a single pass through the image as follows. float sd (image *im, int ny, int nx) { float v, var, sum = sum2 = 0.0; int y, x; for (y = 0; y < ny; y++) { for (x = 0; x < nx; x++) { v = im[y][x]; sum = sum + v; sum2 = sum2 + v * v; } } v = nx * ny; var = (sum2 - sum * sum/v) / v; if (var <= 0.0) return 0.0; return sqrt(var); } I wrote such a routine myself while doing my PhD and it was used by about thirty people on a daily basis for two or three years before someone came to see me with a problem: on his image, the routine was returning a value of zero for the SD, even though there definitely was variation in the image. After a few minutes’ examination of the image, we noticed that the image in question com- prised large values but had only a small variation. This identified the source of the problem: the single-pass algorithm involves a subtraction and the quantities involved, which represented the sums of squares or similar, ended up differing by less than one part in ⇠ 107 and hence yielded inaccurate results from my code. I introduced two fixes: the first was to change sum and sum2 to be double rather than float; and to look for cases where the subtraction in question yielded a result that was was small or negative and, in those cases, calculate only the mean and then make a second pass through the image. It’s a fairly ad hoc solution but proved adequate, and I and my colleagues haven’t been bitten again.
  • 8. Vision Algorithmics 8 5 The Right Algorithm in The Right Place The fast Fourier transform (FFT) is a classic example of the distinction between mathematics and ‘algorithmics.’ The underlying discrete Fourier transform is normally represented as a matrix mul- tiplication; hence, for N data points, O(N2 ) complex multiplications are required to evaluate it. The (radix-2) FFT algorithm takes the matrix multiplication and, by exploiting symmetry proper- ties of the transform kernel, reduces the number of multiplications required to O(N log2 N); for a 256 ⇥ 256 pixel image, this is a saving of roughly ✓ 2562 256 ⇥ 8 ◆2 = ✓ 216 211 ◆2 = 1024 times! These days, an FFT is entirely capable of running in real time on even a moderately fast processor; a DFT is not. On the other hand, one must not use cute algorithms blindly. When median filtering, for example, one obviously needs to determine the median of a set of numbers, and the obvious way to do that is to sort the numbers into order and choose the middle value. Reaching for our textbook, we find that the ‘quicksort’ algorithm (see, for example, [1]) has the best performance for random data, again with computation increasing as O(N log2 N), so we choose that. But O(N log2 N) is its best-case performance; its worst-case performance is O(N2 ), so if we have an unfortunately-ordered set of data, quicksort is a poor choice! A better compromise is probably Shell’s sort algorithm: its best-case performance isn’t as good as that of the quicksort but its worst-case performance is nowhere near as bad — and the algorithm is simpler to program and debug. Even selecting the ‘best’ algorithm is not the whole story. Median filtering involves working on many small sets of numbers rather than one large set, so the way in which the sort algorithm’s performance scales is not the overriding factor. Quicksort is most easily implemented recursively, and the time taken to save registers and generate a new call frame will probably swamp the com- putation, even on a RISC processor; so we are pushed towards something that can be implemented iteratively with minimal overhead — which might even mean a bubble sort! But wait: why bother sorting at all? There are median-finding algorithms that do not involve re-arranging data, either by using multiple passes over the data or by histogramming. One of these is almost certainly faster. Where do you find out about numerical algorithms? The standard reference these days is [2], though I should warn you that many numerical analysts do not consider its algorithms to be state- of-the-art, or even necessarily good. (A quick web-search will turn up many discussions regarding the book and its contents.) The best numerical algorithms that the author is aware of are those licensed from NAg, the Numerical Algorithms Group. All UK higher education establishments should have NAg licenses. Considerations of possible numerical issues such as the ones highlighted here are important if you need to obtain accurate answers or if your code forms part of an active vision system.
6 Vision Packages

If you've been reading through this document carefully, you're probably wondering whether it is ever possible to produce vision software that works reliably. Well, it is possible, but you cannot tell whether or not a piece of software has been written carefully and uses sensible numerical techniques without looking at its source code. For that reason, the author, like many other researchers, has a strong preference for open-source software.

What is out there in the open-source world? There are many vision packages, though all of them are the work of fairly small groups of researchers and developers. Recent years have seen the research community focus on OpenCV and Matlab. As a rule of thumb, people who need their software to run in real time or have huge amounts of data to crunch through choose OpenCV, while people whose focus is algorithm development often work with Matlab. As both of these are covered elsewhere in this Summer School, here are a few others that you might be interested in looking at:

VXL: http://vxl.sourceforge.net/
A heavily-templated C++ class library being developed jointly by people in a few research groups, including (the author believes) Oxford and Manchester. VXL grew out of Target Jr, which was a prototype of the Image Understanding Environment, a well-funded initiative to develop a common programming platform for the vision community. (Unfortunately, the IUE's aspirations were beyond what could be achieved with the hardware and software of the time, and it failed.)

Tina: http://www.tina-vision.net/
Tina is probably the only vision library currently around that makes a conscious effort to provide facilities that are statistically robust. While this is a great advantage, and a lot of work has been put into making it more portable and easier to work with, it is fair to say that a significant amount of effort is needed to learn to use it.

EVE: http://vase.essex.ac.uk/software/eve.html
EVE, the Easy Vision Environment, is a pure-Python vision package built on top of numpy. It aims to provide easy-to-use functionality for common image processing and computer vision tasks, the intention being that it can be used during interactive sessions, from the Python interpreter's command prompt or from an enhanced interpreter such as ipython, as well as in scripts. It does not use Python's OpenCV wrappers, giving the Python developer an independent check of OpenCV functionality.

In the author's opinion, the biggest usability problem with current computer vision packages is that they cater principally for the programmer: they rarely provide facilities for prototyping algorithms via a command language or for producing graphical user interfaces, and their visualisation capabilities are typically not great. These are deficiencies that Matlab, for example, addresses very well.

Essentially all of the vision packages available today, free or commercial, are designed for a single processor; they cannot spread computation across a cluster, for example, or migrate computation onto a GPU. That is a problem for algorithms that may take minutes to run and consequently tens of hours to evaluate. There is a definite need for something that does this, provides a scripting language and good graphical capabilities, is based around solid algorithms, and is well tested.
OpenCL (not to be confused with OpenGL, the well-known 3D rendering library) may help us achieve that, and the author has observed a few GPU versions of vision algorithms (SIFT, RANSAC and so on) starting to appear.

With the exception of OpenCV, which includes a (limited) machine learning library, vision packages work on images. Once information has been extracted from images, you usually have to find other facilities. This is another distinct shortcoming, as the author is certain that many task-specific algorithms are very weak in their processing of the data extracted from images. However, there are packages around that can help:

Image format conversion. It sometimes seems that there are as many different image formats as there are research groups! You are almost certain to need to convert the format of some image data files, and there are two good toolkits around that make this very easy, both of which are available for Windows and Unix (including Linux and MacOS X). Firstly, there is NETPBM, an enhancement of Jef Poskanzer's PBMPLUS (http://netpbm.sourceforge.net/): this is a set of programs that convert most popular formats to and from its own internal format. If you use an image format within your research group that isn't already supported by NETPBM, the author's advice is to write compatible format converters as quickly as possible; they'll prove useful at some point. The second package worthy of mention is ImageMagick (http://www.imagemagick.org/), which includes both a good converter and a good display program. There are also interfaces between ImageMagick and the Perl, Python and Tcl interpreters, which can be useful.

Machine learning. Weka (http://www.cs.waikato.ac.nz/ml/weka/) is a Java machine learning package really intended for data mining. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualisation. I've heard good reports about its facilities, provided your data can be munged into something it can work on.

Statistical software. The statistics research community has been good at getting its act together, much better than the vision community, and much of its research now provides facilities that interface to R (http://www.r-project.org/). R is a free implementation of S (and S+), commercial statistical packages that grew out of work at Bell Laboratories. R is available for both Unix and Windows (it's a five-minute job to install on Debian Linux, for example) and there are books around that provide introductions for all levels of users. Unusually for free software, it provides excellent facilities for graphical displays of data.

Neural networks. Many good things have been said about netlab (http://www.ncrg.aston.ac.uk/netlab/), a Matlab plug-in that provides facilities rarely seen elsewhere.

Genetic algorithms. Recent years have seen genetic algorithms gain in popularity as a robust optimisation mechanism. There are many implementations of GAs around the Net; the author has written a simple, real-valued GA package with a root-polishing option which seems to be both fast and reliable. It's available in both C and Python; do contact him if you would like a copy.

Whatever you choose to work with, whether from this list or not, treat any results with the same suspicion as you treat those from your own code.
7 Massaging Program Outputs

One of the things you will inevitably have to do is take the output from a program and manipulate it into some other form. This usually occurs when "plugging together" two programs or when trying to gather success and failure rates. Two specific things make your life easier here:

• learn a scripting language; the most suitable is currently Python, largely because of the excellent numpy and scipy extensions and its OpenCV bindings;

• don't build a graphical user interface (GUI) into your program.

Scripting languages are designed to be used as software 'glue' between programs; they all provide easy-to-use facilities for processing textual information, including regular expressions. Which of them you choose is largely a matter of personal preference; the author has used Lua, Perl, Python, Ruby and Tcl and, after a long time working with Perl and Tcl, finally settled on Python. Indeed, most of the author's programming is done in a scripting language, as it is a much more time-efficient way to work.

Building a GUI into a piece of research code is a mistake because it makes the code practically impossible to use in any kind of automated processing. A much better solution is to provide a comprehensive command-line interface, one that allows most internal parameters and thresholds to be set (a sketch of what is meant appears at the end of this section), and then use one of the scripting languages listed above to produce the GUI. For example, the author built the interface shown in Figure 1 using Tcl and its Tk graphical toolkit in about two hours. The interface allows the user to browse through a set of nearly 8,000 images and an expert's opinion of them, fine-tuning the latter when the inevitable mistakes crop up. This 250-line GUI script runs on Unix (including Linux and MacOS X) and Windows entirely unchanged.

Figure 1: Custom-designed graphical interface built using a scripting language
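As an illustration of the kind of command-line interface meant above, the fragment below uses the standard POSIX getopt to expose a threshold and an output filename as options; the option letters and the commented-out process_image routine are hypothetical, chosen only to show the pattern, not taken from any of the author's programs.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical driver: every tunable parameter is settable from the
   command line, so a script (or a GUI wrapper) can drive the program
   without any changes to the code. */
int main (int argc, char *argv[])
{
    double threshold = 0.5;              /* sensible default */
    const char *outfile = "result.pgm";
    int c;

    while ((c = getopt (argc, argv, "t:o:")) != -1) {
        switch (c) {
        case 't': threshold = atof (optarg); break;
        case 'o': outfile = optarg; break;
        default:
            fprintf (stderr, "usage: %s [-t threshold] [-o outfile] image...\n",
                     argv[0]);
            return 1;
        }
    }
    for (; optind < argc; optind++) {
        /* process_image (argv[optind], threshold, outfile);  -- hypothetical */
        printf ("would process %s with threshold %g into %s\n",
                argv[optind], threshold, outfile);
    }
    return 0;
}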
8 Concluding Remarks

The aim of this essay has been to make you aware of the major classes of problem that can and do crop up in vision software. Beyond that, the major message to take home is not to believe results, either from your own software or anyone else's, without checking. A close colleague of the author's will often spend a day getting a feel for the nature of her imagery, and she can spot a problem in a technique better than anyone else I've ever worked with. That is not, I suggest, entirely coincidence.

Another important habit is to check whether changing tuning parameters has the effect that you expect. For example, one of the author's students recently wrote some stereo software. Its accuracy was roughly as anticipated but, following my suggestion that she explore the software using simulated imagery, she found that accuracy decreased as the image size increased. She was uneasy about this and, after checking with me for a 'second opinion,' took a good look at her software. Sure enough, she found a bug and, re-running the tests after fixing it, found that accuracy then increased with image size, in keeping with both our expectations. Without both a feel for the problem and experiments to confirm that expected results do indeed occur, the bug might have lain dormant in the code for weeks or months, only to become apparent after more research had been built on it, effort that would then have to be repeated.

This idea of exploring an algorithm by varying its parameters is an important one. Measuring the success and failure rates as each parameter is varied allows the construction of receiver operating characteristic (ROC) curves, currently one of the most popular ways of presenting performance data in vision papers. However, it is possible to go somewhat further. Let us imagine we have written a program that processes images of the sky and attempts to estimate the proportion of the image that has cloud cover (a surprisingly difficult task); this is one of the problems that the data in Figure 1 are used for. A straightforward evaluation of the algorithm, the kind presented in most papers, would simply report how often the amount of cloud cover detected by the program matches the expert's opinion. However, as the expert also indicated the predominant type of cloud in each image, we can explore the algorithm in much more detail. We might expect that a likely cause of erroneous classification is mistaking thin, wispy cloud for blue sky. As both cirrus and altostratus are thin and wispy, we can run the program against the database of images and see how often the estimate of cloud cover is wrong when these types of cloud occur. This approach is much more valuable: it gathers evidence about whether there is a particular type of imagery on which our program fails, and that tells us where to expend effort in improving the algorithm. This is performance characterisation, the first step towards producing truly robust vision algorithms, and it forms the basis of the material presented in a separate Summer School lecture.

References

[1] Robert Sedgewick. Algorithms in C. Addison-Wesley, 1990.

[2] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge University Press, second edition, 1992.