Computer Vision – From traditional approaches to deep neural networks

Computer Vision
From traditional approaches to deep neural networks
Stanislav Frolov München, 27.02.2018

● Computer vision
● Human vision
● Traditional approaches and methods
● Artificial neural networks
● Summary
Outline of this talk
What we are going to talk about

● Trained deep neural networks for object
detection during my master's thesis
● Still fascinated by and interested in the field
Stanislav Frolov
Big Data Engineer @inovex

● Teach computers how to see
● Automatic extraction, analysis and understanding of
images
● Infer useful information, interpret and make decisions
● Automate tasks that the human visual system can do
● One of the most exciting fields in AI and ML
What is computer vision
General

Motivation
● Era of pixels
● Internet consists
mostly of images
● Explosion of visual
data
● Cannot be labeled
by humans

Drivers
● Two drivers for computer vision explosion
○ Compute (faster and cheaper)
○ Data (more data > algorithms)

Interdisciplinary field
● Contributing disciplines: Computer Science, Mathematics,
Engineering, Physics, Biology, Psychology
● Related areas: Information Retrieval, Machine Learning,
Graphs, Algorithms, Systems, Architecture, Robotics,
Speech, NLP, Image Processing, Optics, Solid-State Physics,
Neuroscience, Cognitive Sciences, Biological Vision

● Imaging for statistical pattern recognition
● Image transformations such as pixel-by-pixel operations
○ Contrast enhancement
○ Edge extraction
○ Noise reduction
○ Geometrical and spatial operations (e.g. rotations)
Related fields - image processing

● Creates new images from scene descriptions
● Produces image data from 3D models
● “Inverse” of computer vision
● AR as a combination of both
Related fields - computer graphics

● Mainly manufacturing applications
● Image-based automatic inspection, process control,
robot guidance
● Usually employs strong assumptions (colour, shape,
light, structure, orientation, ...) -> works very well
● Output often pass/fail or good/bad
● Additionally numerical/measurement data, counts
Related fields - machine vision

● Create “intelligent” systems
● Studying computational aspects of intelligence
● Make computers do things at which, at the moment,
people are better
● Many techniques play an important role (ML, ANNs)
● Currently does a few things better/faster at scale than
humans can
● Whether it can do anything “human” remains an open question
Related fields - AI

● Related fields have a large intersection
● Basic techniques used, developed and studied are very
similar
Related fields - summary

● Two stage process
○ Eyes take in light reflected off the objects and retina
converts 3D objects into 2D images
○ Brain’s visual system interprets 2D images and “rebuilds”
a 3D model
What is human vision
General

● A pair of 2D images with slightly different viewpoints
allows depth to be inferred
● Position of nearby objects will vary more across the two
images than the position of more distant objects
Stereoscopic vision

● Prior knowledge of relative sizes and depths is often key
for understanding and interpretation
Prior knowledge

● Texture and changes in texture help with depth
perception
Texture pattern

Biases and illusions in human perception
● Shadows make all the difference in interpretation
● Gradual changes in light are ignored so that we are
not misled by shadows

A few more illusions
● Two arrows with different orientations have the same
length

● Assumptions and familiarity (distorted room)
● Face recognition bias
● Up-down orientation bias
Biases and illusions in human perception

Summary
● Illusions are fun, but the puzzle of understanding
human vision is far from complete

● Recognition
● Localization
● Detection
● Segmentation
Typical tasks

● Part-based detection
○ Deformable parts model
○ Pose estimation and poselets
Typical tasks

● Image captioning
(actions, attributes)
Typical tasks

● Motion analysis
○ Egomotion (camera)
○ Optical flow (pixels)
Typical tasks

● Scene understanding and reconstruction
Typical tasks

● Image restoration
● Colouring black & white photos
Typical tasks

Solving this is useful for many applications

Typical applications
● Assistance systems for cars and people
● Surveillance
● Navigation (obstacle avoidance, road following, path
planning)
● Photo interpretation
● Military (“smart” weapons)
● Manufacturing (inspection, identification)
● Robotics
● Autonomous vehicles (dangerous zones)

Typical applications
● Recognition and tracking
● Event detection
● Interaction (man-machine interfaces)
● Modeling (medical, manufacturing, training, education)
● Organizing (database index, sorting/clustering)
● Fingerprint and biometrics
● …

Why it is difficult
● Occlusion
● Deformation
● Scale
● Clutter
● Illumination
● Viewpoint
● Object pose
● Tons of classes and
variants
● Often n:1 mapping
● Computationally
expensive
● Full understanding of
biological vision is
missing

● Input: image(s) + labels
● Output: Semantic data, labels
● Digital image pixels usually have three channels [R,G,B],
each in [0...255], plus a location [x,y]
● Digital images are just vectors
System overview
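
The statement that digital images are just vectors can be made concrete with a small sketch. It assumes NumPy and a made-up 2x3 RGB image; the array layout (height, width, channel) is one common convention, not something the slides prescribe.

```python
import numpy as np

# Hypothetical 2x3 RGB image: height 2, width 3, three channels in [0, 255].
image = np.zeros((2, 3, 3), dtype=np.uint8)
image[0, 1] = [255, 0, 0]  # pixel at row y=0, column x=1 set to pure red

# Flattening turns the image into one long vector of H * W * 3 values.
vector = image.reshape(-1)
print(image.shape, vector.shape)
```

The pixel location is implicit in the position of a value inside the vector, which is why the same data can be viewed either as a grid or as a flat list of numbers.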

1. Image acquisition (camera, sensors)
2. Pre-processing (sampling, noise reduction,
augmentation)
3. Feature extraction (lines, edges, regions, points)
4. Detection and segmentation
5. Post-processing (verification, estimation, recognition)
6. Decision making
● -> Ability of a machine to step back and interpret the big
picture of those pixels
System overview

1950s
● 2D imaging for statistical pattern recognition
● Theory of optical flow based on a fixed point
towards which one moves
History

Image processing
● Histograms
● Filtering
● Stitching
● Thresholding
● ...
Traditional approaches
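
Three of the image processing operations listed here (histograms, filtering, thresholding) can be sketched in a few lines. This is a minimal illustration assuming NumPy and a random 8x8 grayscale image; the 3x3 mean filter and the threshold of 128 are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(8, 8)).astype(np.uint8)  # made-up grayscale image

# Histogram: how many pixels have each of the 256 possible intensities.
hist = np.bincount(gray.ravel(), minlength=256)

def box_filter(img):
    """3x3 mean filter; borders are handled by repeating the edge pixels."""
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / 9.0

smoothed = box_filter(gray)  # filtering (here: simple noise reduction)

# Thresholding: pixels at or above the threshold become 1, the rest 0.
binary = (gray >= 128).astype(np.uint8)
```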

1960s
● Desire to extract 3D structure from 2D images for
scene understanding
● Began at pioneering AI universities to mimic human
visual system as stepping stone for intelligent robots
● Summer vision project at MIT: attach a camera to a
computer and have it “describe what it saw”
History

● Given to 10 undergraduate students
● … an attempt to use our summer workers effectively …
● … construction of a significant part of a visual system …
● … task can be segmented into sub-problems …
● … participate in the construction of a system complex
enough to be a real landmark in the development of
“pattern recognition” …
History: summer vision project @MIT 1966

● Goal: analyse scenes and identify objects
● Structure of system:
○ Region proposal
○ Property lists for regions
○ Boundary construction
○ Match with properties
○ Segment
● Basic foreground/background segmentation with simple
objects (cubes, cylinders, ...)

● Unlike general intelligence, computer vision seemed
tractable
● Amusing anecdote, but it never aimed to “solve”
computer vision
● Computer vision today differs from what it was thought
to be in 1966

1970s
● Many algorithms that still exist today were formulated
● Edges, lines and objects as interconnected
structures
History

Traditional approaches
Edge detection based on
● Brightness
● Gradients
● Geometry
● Illumination
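
Gradient-based edge detection can be illustrated with the Sobel operator, which is one common choice and is used here as an assumption (the slide does not name a specific operator). The sketch assumes NumPy and a synthetic image with a single vertical step edge.

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels (no padding, so output shrinks by 2)."""
    img = img.astype(np.float64)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # Slide the kernels over the image (cross-correlation; the sign flip of a
    # true convolution does not affect the magnitude).
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# Synthetic image: dark left half, bright right half (vertical step edge).
step = np.zeros((5, 6))
step[:, 3:] = 1.0
edges = sobel_magnitude(step)  # large values only around the brightness jump
```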

Traditional approaches - part based detector
● Objects composed of features of parts and their spatial
relationship
● Challenge: how to define and combine the parts

1980s
● More rigorous mathematical analysis and
quantitative aspects
● Optical character recognition
● Sliding window approaches
● Usage of artificial neural networks
History

Traditional approaches - HOG detection (histogram of
oriented gradients)
● Concept dates from the 1980s, but it was not widely used until 2005
● Create HOG descriptors (object generalizations)
● One feature vector per object
● Train with SVM
● Sliding window @multiple scales

Traditional approaches - HOG detection (histogram of
oriented gradients)
● Computation of HOG descriptors:
1. Compute gradients
2. Compute histograms on cells
3. Normalize histograms
4. Concatenate histograms
● Requires a lot of engineering
● Must build ensembles of feature descriptors
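
The four computation steps above can be sketched as follows. This is a simplified illustration assuming NumPy, not a full HOG implementation: the cell size, the 9 unsigned orientation bins and the per-cell normalization are assumptions made for brevity (practical HOG variants typically normalize over overlapping blocks of cells).

```python
import numpy as np

def hog_descriptor(img, cell=4, bins=9):
    img = img.astype(np.float64)
    # 1. Compute gradients (np.gradient returns d/dy first, then d/dx).
    gy, gx = np.gradient(img)
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned angle
    bin_index = np.minimum((orientation / (180.0 / bins)).astype(int), bins - 1)

    histograms = []
    for cy in range(img.shape[0] // cell):
        for cx in range(img.shape[1] // cell):
            window = (slice(cy * cell, (cy + 1) * cell),
                      slice(cx * cell, (cx + 1) * cell))
            # 2. Histogram of orientations per cell, weighted by gradient magnitude.
            hist = np.bincount(bin_index[window].ravel(),
                               weights=magnitude[window].ravel(),
                               minlength=bins)
            # 3. Normalize the histogram (simplified: per cell, L2 norm).
            histograms.append(hist / (np.linalg.norm(hist) + 1e-6))
    # 4. Concatenate all cell histograms into one feature vector.
    return np.concatenate(histograms)

# Made-up test image: brightness increases from left to right.
ramp = np.tile(np.arange(8.0), (8, 1))
descriptor = hog_descriptor(ramp)
```

The resulting vector is what would be passed to a classifier such as an SVM for each sliding-window position.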

1990s
● Significant interaction with computer graphics
(rendering, morphing, stitching)
● Approaches using statistical learning
● Eigenfaces (“ghost faces”) through principal component
analysis (PCA)
History
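
The eigenface idea can be sketched with PCA computed through a singular value decomposition. The example assumes NumPy and uses random data as a stand-in for real face images; the image size (16x16), the number of faces (20) and the number of components (5) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random((20, 16 * 16))  # stand-in data: 20 flattened 16x16 "faces"

# Center the data around the mean face.
mean_face = faces.mean(axis=0)
centered = faces - mean_face

# The rows of vt are the principal directions, i.e. the eigenfaces.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
k = 5
eigenfaces = vt[:k]

# Each face is described by k weights instead of 256 pixel values.
weights = centered @ eigenfaces.T
reconstruction = mean_face + weights @ eigenfaces
```

Recognition then reduces to comparing the low-dimensional weight vectors of a query face with those of known faces.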

Traditional approaches - deformable parts model (DPM)
● Objects are constructed from their parts
● First match whole object, then refine on the parts
● HOG + part-based + modern features
● Slow but good at difficult objects
● Involves many heuristics

Features
● Feature points
○ Small area of pixels with certain properties
● Feature detection
○ Use features for identification
○ Activate if “object” present
● Examples:
○ Lines, edges, colours, blobs, …
○ Animals, faces, cars, ...

Traditional approaches - classical recognition
● Init: extract features for objects in different scales,
colours, orientations, rotations, occlusion levels
● Inference: extract features from query image and find
closest match in database or train a classifier
● Computationally expensive (hundreds of features in
image, millions in database) and complex due to errors
and mismatches

History
Before the new era
● Bags of features
● Handcrafted ensembles
Input -> Feature extraction (Feat. 1, Feat. 2, ..., Feat. n) -> Final decision

The new era of computer vision

● Elementary building
block
● Inspired by biological
neurons
● Mathematical function
y=f(wx+b)
● Learnable weights
Artificial neural networks
Fundamentals - artificial neuron
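
The function y=f(wx+b) can be written out directly. The sketch assumes NumPy and picks the sigmoid as activation f, which is only one possible choice; the input and weight values are made up.

```python
import numpy as np

def neuron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0])      # inputs
w = np.array([0.5, -0.25])    # learnable weights
b = 0.0                       # learnable bias
y = neuron(x, w, b)           # w.x + b = 0.5 - 0.5 + 0 = 0, and sigmoid(0) = 0.5
```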

● Collection of neurons
organized in layers
● Universal
approximators
● Fully-connected
network here
Fundamentals - artificial neural networks
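
A fully-connected network is a chain of such neurons grouped into layers. The following forward pass is a minimal sketch assuming NumPy, ReLU activations in the hidden layer and randomly initialized weights; the layer sizes (3 inputs, 4 hidden units, 2 outputs) are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Apply each (weights, bias) pair in turn; the last layer stays linear."""
    a = x
    for weights, bias in layers[:-1]:
        a = relu(weights @ a + bias)
    weights, bias = layers[-1]
    return weights @ a + bias

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # hidden layer: 3 inputs -> 4 units
    (rng.normal(size=(2, 4)), np.zeros(2)),  # output layer: 4 units -> 2 outputs
]
out = forward(np.array([1.0, 0.5, -1.0]), layers)
```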

Fundamentals - training
● Basically an optimization
problem
● Find minimum of a loss
function by an iterative
process (training)
● Designing the loss function
is sometimes tricky

Fundamentals - training
Simple optimizer algorithm:
1. Forward pass with a batch of data
2. Calculate error between actual and wanted output
3. Nudge weights in proportion to error into the right
direction (same data would result in smaller error)
4. Repeat until convergence
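
The four steps above can be sketched with plain gradient descent on a tiny linear model. The example assumes NumPy, a mean squared error loss, synthetic data generated from y = 3x + 1, and a fixed number of steps in place of a real convergence check; for simplicity it uses the whole data set as a single batch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(64, 1))
target = 3.0 * x + 1.0  # synthetic "wanted output"

w, b = 0.0, 0.0
learning_rate = 0.1
losses = []
for step in range(500):                    # 4. repeat (fixed budget here)
    prediction = w * x + b                 # 1. forward pass
    error = prediction - target            # 2. error between actual and wanted output
    losses.append(float(np.mean(error ** 2)))
    grad_w = 2.0 * np.mean(error * x)      # 3. gradient of the loss w.r.t. w and b
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w            #    nudge the weights against the gradient
    b -= learning_rate * grad_b
```

After training, w and b should be close to the values 3 and 1 used to generate the data. Real networks follow the same loop, with backpropagation computing the gradients for all layers.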

Fundamentals - CNN
● Local neighborhood
contributes to activation
● Exploit spatial
information
● Hierarchical feature
extractors
● Fewer parameters
(Figure labels: input, filters, receptive field, activation)

Fundamentals - CNN
● Filter of size 3x3 applied to an input of 7x7
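
Applying a 3x3 filter to a 7x7 input can be sketched as below. The example assumes NumPy, stride 1 and no padding, which gives a 5x5 output; the averaging kernel and the input values are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # Each output value depends only on a local 3x3 neighborhood.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(49, dtype=np.float64).reshape(7, 7)
kernel = np.ones((3, 3)) / 9.0
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
```

The same nine weights are reused at every position, which is where the parameter savings compared to a fully-connected layer come from.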

Fundamentals - pooling
● Max-pooling
● Dimension reduction/adaption
● Existence is more important than location
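
Max-pooling can be sketched in a few lines. The example assumes NumPy and non-overlapping 2x2 windows; the activation values are made up.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

activation = np.array([[1, 2, 0, 0],
                       [3, 4, 0, 1],
                       [0, 0, 9, 8],
                       [0, 5, 7, 6]], dtype=np.float64)
pooled = max_pool2d(activation)  # 4x4 input becomes 2x2: [[4, 1], [5, 9]]
```

The output records that a strong activation existed somewhere in each window, but not where exactly.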

Fundamentals - pooling
● Zero-padding
● Controlling dimensions
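
How zero-padding controls dimensions can be shown with the usual output-size formula for a convolution, (n + 2p - k) / s + 1. The sketch assumes NumPy; the helper name `output_size` is hypothetical.

```python
import numpy as np

def output_size(n, k, p=0, s=1):
    """Output width for input width n, kernel size k, padding p and stride s."""
    return (n + 2 * p - k) // s + 1

image = np.ones((7, 7))
# Surround the image with a one-pixel border of zeros: 7x7 becomes 9x9.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(output_size(7, 3))        # without padding the output shrinks
print(output_size(7, 3, p=1))   # with one pixel of padding the size is preserved
```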

Fundamentals - general network architecture
Input image -> convolutional layers -> ... -> Final decision

Fundamentals - hierarchical feature extractors
Activations for:
● First layers: lines, edges, blobs, colours, ...
● Deeper layers: parts of abstract objects, abstract objects

Modern history of object recognition

● Classification and detection
○ 27k images
○ 20 classes
■ person, bird, cat, cow, dog, horse, sheep, aeroplane,
bicycle, boat, bus, car, motorbike, train, bottle,
chair, dining table, potted plant, sofa, tv/ monitor
Benchmark
Datasets - PASCAL VOC

● Challenges on a subset of ImageNet
○ 14M labeled images
○ 20k object categories
● ILSVRC* usually on 1k categories including 90 out of
120 dog breeds
Benchmark
Datasets - ImageNet
*ImageNet Large Scale Visual Recognition
Challenge

● ILSVRC 2012 winner by a large margin, reducing the top-5
error from 25% to 16%
● Proved the effectiveness of CNNs and kicked off a new era
● 8 layers, 650k neurons, 60M parameters
Roadmap - AlexNet

● ILSVRC 2013 winner with a best top-5 error of 11.6%
● AlexNet but using smaller 7x7 kernels to keep more
information in deeper layers
Roadmap - ZFNet

● ILSVRC 2013 localization winner
● Uses AlexNet on multi-scale input images with sliding
window approach
● Accumulates bounding boxes for final detection (instead
of non-max suppression)
Roadmap - OverFeat

● 2k proposals generated by selective search
● SVM trained for classification
● Multi-stage pipeline
Roadmap - RCNN (region based CNN)

● Not a winner but famous due to simplicity and
effectiveness
● Replace large-kernel convolutions by stacking several
small-kernel convolutions
Roadmap - VGGNet
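
Why stacking small kernels can replace one large kernel is a matter of simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; both helper names are hypothetical.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (no biases) with `channels` input and output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 64
# Three stacked 3x3 layers see the same 7x7 region as one 7x7 layer ...
print(stacked_receptive_field(3, 3), stacked_receptive_field(7, 1))
# ... but need fewer weights: 3 * 9 * 64 * 64 versus 49 * 64 * 64.
print(conv_weights(3, channels, num_layers=3), conv_weights(7, channels))
```

The stacked version also applies a non-linearity after every layer, which a single large kernel cannot do.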

● ILSVRC 2014 winner
● Stacks up “inception” modules
● 22 layers, 5M parameters
Roadmap - InceptionNet (GoogleNet)

● Jointly learns classification and bounding-box regression
● Employs region of interest (RoI) pooling, which allows
the convolutional computations to be reused
Roadmap - Fast RCNN

● Directly predicts all objects and classes in one shot
● Very fast
● Processes images at ~40 FPS on a Titan X GPU
● First real-time state-of-the-art detector
● Divides input images into multiple grid cells which are
then classified
Roadmap - YOLO (you only look once)

● ILSVRC 2015 winner with a 3.6% error rate (human
performance is 5-10%)
● Employs residual blocks, which allow very deep
networks to be built (hundreds of layers)
● Additional identity mapping
Roadmap - ResNet (Microsoft)
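
The identity mapping of a residual block can be sketched as follows. The example assumes NumPy and uses two small fully-connected transformations in place of the convolutions of the real architecture; only the shortcut structure is the point here.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is a small two-layer transformation."""
    hidden = np.maximum(w1 @ x, 0.0)
    residual = w2 @ hidden                   # F(x): what the block has to learn
    return np.maximum(residual + x, 0.0)     # identity shortcut adds the input back

x = np.array([1.0, 2.0, 3.0])
zero_weights = np.zeros((3, 3))
# With all weights at zero the block simply passes its (non-negative) input through.
passthrough = residual_block(x, zero_weights, zero_weights)
```

Because doing nothing is easy for each block to represent, adding more blocks does not have to make the network worse, which is one intuition for why very deep stacks become trainable.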

● Not a recognition network
● A region proposal network
● Popularized prior/anchor boxes (found through
clustering) to predict offsets
● Much better strategy than starting the predictions with
random coordinates
● Since then, heuristic approaches have gradually
faded out and been replaced
Roadmap - MultiBox
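
Predicting offsets relative to a prior box, rather than raw coordinates, can be sketched as below. The center/size parameterization with exponential scaling is an assumption borrowed from common detector implementations, not necessarily the exact MultiBox formulation; the helper name `decode_box` and the box values are hypothetical.

```python
import numpy as np

def decode_box(prior, offsets):
    """Turn predicted offsets into a box, relative to a prior box (cx, cy, w, h)."""
    cx, cy, w, h = prior
    dx, dy, dw, dh = offsets
    return (cx + dx * w,       # shift the center by a fraction of the prior size
            cy + dy * h,
            w * np.exp(dw),    # scale width and height multiplicatively
            h * np.exp(dh))

prior = (0.5, 0.5, 0.2, 0.4)
unchanged = decode_box(prior, (0.0, 0.0, 0.0, 0.0))  # zero offsets keep the prior
shifted = decode_box(prior, (0.5, 0.0, 0.0, 0.0))    # center moves right by half a width
```

A network that outputs zeros already produces a plausible box, which is the advantage over starting from random coordinates.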

● Fast RCNN with heuristic region proposal replaced by
region proposal network (RPN) inspired by MultiBox
● RPN shares full-image convolutional features with the
detection network (cost-free region proposal)
● RPN uses an “attention” mechanism to tell where to look
● ~5 FPS on a Tesla K40 GPU
● End-to-end training
Roadmap - Faster RCNN

● SSD leverages the Faster RCNN’s RPN to directly
classify objects inside each prior box (similar to YOLO)
● Predicts category scores and box offsets for a fixed set
of default bounding boxes
● Fixes the predefined grid cells used in YOLO by using
multiple aspect ratios
● Produces predictions of different scales
● ~59 FPS
Roadmap - SSD (single shot multibox detector)

● TensorFlow: open-source software library for machine
learning applications
● TensorFlow Object Detection API
○ A collection of pretrained models
○ Construct, train and deploy object detection models
TensorFlow object detection API

● Humans are good at understanding the big picture
● Neural networks are good at details
● But they can be fooled...
Summary
Human vs machine

● Needs a large amount of data
● Lots of engineering
● Trial and error
● Long training time
● Still lots of hyperparameter tuning
● No general network (generalization not answered)
● Little mathematical foundation
Summary
Computer vision is still difficult

● Despite all of these advances, the dream of having a
computer interpret an image at the same level as a
human remains unrealized
Summary
Computer vision is hard

Thank You
Stanislav Frolov
Big Data Engineer
sfrolov@inovex.de
0173 318 11 35
inovex GmbH
Lindberghstraße 3
80939 München
