Computer Vision for
Beginners
Sanghamitra Deb
Staff Data Scientist
Chegg Inc
Outline
• Introduction and Applications
• CNN and State of the Art Deep Learning Classification Models
• CV Classification Pipeline in Pytorch
• Classification Metrics
• Intro to Object Detection
• Metrics for Object Detection
• Summary
What is Computer Vision?
“If We Want Machines to Think, We Need to Teach Them to See.”
“Understanding vision and building visual systems is really understanding intelligence.”
“And by see, I mean to understand, not just to record pixels.”
-- Fei-Fei Li
Applications: Image Search
Driven by Image Similarity
Applications: HealthCare
• Cancer Screening
• Disease Diagnostics
• Surgical Assistance Technology
Example: Detection of diabetic retinopathy.
https://www.kaggle.com/c/diabetic-retinopathy-detection
Computer Vision Applications: Education
• Gauging engagement at a personalized level
• Gaze analysis: Providing feedback to students
• Tailored Learning Experiences for students and Personalized teaching modes
• Providing optimal Career Paths
• Uncovering Learning Gaps
• Generating Customized Education Content
• Collaborative Learning: pair students up based on similar learning styles.
Computer Vision Applications: Self-Driving Cars
Self-driving cars use cameras, radar, and lasers to perceive the world around them and build a digital map.
Computer vision helps cars see via object detection and classification.
Computer Vision Applications: Mobile Apps
Computer Vision Techniques
Classification & Object Detection
Classification
Given a set of pixels, determine the category of the image
85% cat
12% dog
2% hat
1% mug
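As a minimal sketch of what these percentages are: a classifier's final layer produces raw scores (logits) that a softmax turns into class probabilities. The logit values below are made up to roughly reproduce the numbers above.

import torch

# Hypothetical logits for the classes [cat, dog, hat, mug] from a classifier's final layer
logits = torch.tensor([4.2, 2.3, 0.5, -0.2])

# Softmax converts logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=0)
for name, p in zip(["cat", "dog", "hat", "mug"], probs):
    print(f"{name}: {p:.1%}")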
DL Classification Networks
Timeline (ImageNet performance and model size): LeNet (1998) → AlexNet (2012) → ZFNet (2013) → Network-in-Network (2013) → VGGNet (2014) → GoogLeNet (2014) → ResNet (2015) → FractalNet (2016) → DenseNet (2017)
AlexNet 2012
The AlexNet architecture is a conv layer followed by a pooling layer and normalization, then conv-pool-norm again, then a few more conv layers, a pooling layer, and several fully connected layers at the end -- eight layers in total.
For data augmentation, AlexNet used flipping, jittering, cropping, and colour normalization. Other training details: Dropout of 0.5, SGD + momentum of 0.9, an initial learning rate of 1e-2 reduced by a factor of 10 whenever validation accuracy plateaued, and L2 regularization with a weight decay of 5e-4. It was trained on GTX 580 GPUs with 3 GB of memory.
https://towardsdatascience.com/architecture-comparison-of-alexnet-vggnet-resnet-inception-densenet-beb8b116866d
AlexNet
What was the state before?
• Small networks
• Few applications
What was novel?
• ReLU for non-linearity
• Local Response Normalization
• Dropout regularization
• Max-pooling as an alternative to average-pooling
• GPU training
• Much larger network
What was the state after?
• It worked well!
• Shot heard around the world
• A revolution was underway
• CNNs had arrived
• Specialized hardware use
• Many more applications
• “Deep Learning” interest rapidly increased
What were the lessons learned?
• Neural Networks were now ready for prime-time
• Could do useful tasks
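Not in the original slides, but for reference: a pre-trained AlexNet can be loaded from torchvision much like the resnet-18 used later in this deck. A minimal sketch (the pretrained flag follows the older torchvision API used elsewhere here):

import torch
from torchvision import models

# Load AlexNet with ImageNet weights (older torchvision API, matching the resnet18 example later)
alexnet = models.alexnet(pretrained=True)
alexnet.eval()  # inference mode

# A dummy batch: 1 image, 3 channels, 224x224 (ImageNet-style input)
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = alexnet(x)  # shape: (1, 1000) -- one score per ImageNet class
print(logits.shape)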
VGGNet - 2014
VGG 16 is a 16-layer architecture built from pairs of convolution layers, pooling layers, and fully connected layers at the end. The core idea of the VGG network is much deeper networks with much smaller filters; VGGNet increased the number of layers from the eight layers in AlexNet.
3 x 3 conv filters are the smallest possible filters with fewer parameters. They look at pixels that are immediate neighbors, and stacking many of them results in a larger receptive field.
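A quick sketch of the point about small filters: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters (the channel count of 64 is arbitrary).

import torch.nn as nn

C = 64  # arbitrary channel count for illustration

# Two stacked 3x3 convs: 5x5 receptive field, with a non-linearity in between in a real network
stacked = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU(), nn.Conv2d(C, C, 3, padding=1))

# A single 5x5 conv: same receptive field, more parameters
single = nn.Conv2d(C, C, 5, padding=2)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(stacked), n_params(single))  # 2*(64*64*9 + 64) = 73,856  vs  64*64*25 + 64 = 102,464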
VGGNet
What was the state before?
• AlexNet “deep” CNN not deep enough nor wide enough
What was novel?
• Much smaller 3x3 filters
• Multiple 3x3 filters within each layer
• Resulted in a huge network
What was the state after?
• Worked extremely well!
• Pushed the envelope on model size
What were the lessons learned?
• The depth of a network is a critical component to accuracy.
• But, big networks are expensive to train and slow to evaluate.
Vanishing/Exploding Gradients
Computing the gradients of the “front” layers of an n-layer network involves multiplying n per-layer factors together.
When the network is deep and these factors are small, the product goes to zero (the gradient vanishes).
When these factors are large, the product blows up (the gradient explodes).
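A back-of-the-envelope illustration (the per-layer factors 0.5 / 1.5 and the depth of 50 are made-up numbers):

# Repeatedly multiplying small or large per-layer gradient factors across a deep network
small, large, depth = 0.5, 1.5, 50
print(small ** depth)  # ~8.9e-16 -- effectively zero: the gradient "vanishes"
print(large ** depth)  # ~6.4e+08 -- blows up: the gradient "explodes"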
ResNet - 2015
Right: regular CNN. Left: instead of fitting the desired mapping H(x) directly, the block fits the residual F(x) = H(x) - x. A skip / shortcut connection adds the input x to the output after a few weight layers.
These blocks can be stacked to build networks over 150 layers deep.
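A minimal sketch of the residual block idea described above (simplified; real ResNet blocks also handle downsampling and channel changes): the block learns F(x), and the skip connection adds x back so the output is F(x) + x.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convs bypassed by an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))  # first weight layer
        out = self.bn2(self.conv2(out))            # second weight layer -> F(x)
        return torch.relu(out + x)                 # skip connection adds the input back: F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output has the same shape as the input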
Plain Network vs ResNet
ResNet
What was the state before?
• Ultra-deep networks still suffered from the vanishing gradient problem
What was novel?
• Fewer filters reduce computational complexity & the number of parameters for the same depth
• “The Residual Block”: add a bypass to avoid dead units.
• Bypassed TWO layers!
• Combine serial & parallel units.
What was the state after?
• Sizeable jump in number of layers
• Prize-winner...again
What were the lessons learned?
• Behaves like an ensemble of shallow networks
• Plausible model for biological visual cortex
Image classification Pipeline --- Pytorch
1. Data preprocessing: move the data for each class into its own folder.
2. Choose a classification model, for example resnet-18.
3. CPU (feature extraction): freeze the layers and extract features, apply data augmentations, train a classifier such as an SVM or LR, then measure performance metrics.
4. GPU (fine tuning): make sure the weights for the layers are not frozen, add the final output layer, choose your optimizer, train until accuracy on the validation data converges, then measure performance metrics.
Code
Train, Test & Validation
Train – 72%, Test – 20%, Validation – 8% ---- the percentages can vary depending on the total size of the dataset.
• Train --- data used to train the model
• Validation --- data that the model has not seen but is used for parameter tuning, i.e. the model is optimized based on performance on this set.
• Test --- the model has not seen this data and it is not used in any part of the computation. Final performance metrics are reported on this data.
from sklearn.model_selection import train_test_split
train0, test = train_test_split(df_labels,
    shuffle=True, random_state=42, test_size=0.2,
    stratify=df_labels.classes)
train, valid = train_test_split(train0,
    shuffle=True, random_state=42, test_size=0.1,
    stratify=train0.classes)
Code
Data is typically unbalanced; use stratified sampling to preserve the class ratios.
Data Augmentation
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'valid': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}
Moving & Loading Data
train_dir = parent_dir + 'train/'
test_dir = parent_dir + 'test/'
val_dir = parent_dir + 'valid/'   # folder name must match the 'valid' key used by ImageFolder below
!mkdir $train_dir
!mkdir $test_dir
!mkdir $val_dir
for gid in class_names:
    dir_curr_test = test_dir + str(gid) + '/'
    dir_curr_train = train_dir + str(gid) + '/'
    dir_curr_val = val_dir + str(gid) + '/'
    !mkdir $dir_curr_test
    !mkdir $dir_curr_train
    !mkdir $dir_curr_val
    val_df_curr = valid[valid.group_id == gid]
    for index, row in val_df_curr.iterrows():
        fname_curr = all_data_dir + row['filenames']
        !cp $fname_curr $dir_curr_val   # copy each image into its class folder (repeat for train / test)
data_dir = parent_dir
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x])
                  for x in ['train', 'valid', 'test']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                              shuffle=True, num_workers=4)
               for x in ['train', 'valid', 'test']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'valid', 'test']}
class_names = [x for x in image_datasets['train'].classes]
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Training
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()   # Set model to training mode
            else:
                model.eval()    # Set model to evaluate mode
            running_loss = 0.0
            running_corrects = 0
            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                # zero the parameter gradients
                optimizer.zero_grad()
                # forward; track history only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()
            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))
            # keep the weights of the best model seen on the validation set
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
    print('Training complete in {:.0f}s, best val Acc: {:.4f}'.format(time.time() - since, best_acc))
    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
Choose pre-trained model
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
Output Layer
model_ft.fc = nn.Linear(num_ftrs, len(class_names))
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=8)
Pre-trained weights: To freeze or not to freeze
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False
• Feature Extraction: pre-trained weights should be frozen (see the sketch below)
• Fine Tuning: weights should not be frozen, especially if the data is significantly different from the dataset on which the model was pre-trained
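A sketch of the feature-extraction setting using the helper above (model_ft, num_ftrs, and class_names as defined on the earlier slides): freeze the backbone, then replace the final layer so that only the new head is trained.

# Feature extraction: freeze all pre-trained weights, then add a fresh output layer
set_parameter_requires_grad(model_ft, feature_extracting=True)
model_ft.fc = nn.Linear(num_ftrs, len(class_names))  # the new layer's weights require grad by default

# Only pass the trainable parameters to the optimizer
params_to_update = [p for p in model_ft.parameters() if p.requires_grad]
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)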
Metrics
Performance Metrics
True Positives --- number of observations where the model correctly predicts the positive class.
False Positives --- number of observations where the model incorrectly predicts the positive class.
False Negatives --- number of observations where the model incorrectly predicts the negative class.
True Negatives --- number of observations where the model correctly predicts the negative class.
Performance Metrics
Precision: TP/(TP+FP) --- what percentage of predicted positives are actually positive?
Recall: TP/(TP+FN) --- what percentage of the positive class gets captured by the model?
Accuracy: (TP+TN)/(TP+FP+TN+FN) --- what percentage of predictions are correct?
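A small sketch computing these metrics with scikit-learn (the label vectors are made up for illustration):

from sklearn.metrics import precision_score, recall_score, accuracy_score

# Hypothetical ground truth and model predictions for a binary problem (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75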
Confusion Matrix
• Good for checking where your model is incorrect
• For multi-class classification it shows which classes get confused with each other (see the sketch below)
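A sketch of a multi-class confusion matrix with scikit-learn (hypothetical labels); the off-diagonal counts show which classes the model mixes up.

from sklearn.metrics import confusion_matrix

# Hypothetical 3-class labels: rows of the matrix are true classes, columns are predicted classes
y_true = ["cat", "cat", "dog", "dog", "hat", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog", "hat"]))
# [[2 1 0]     e.g. one true "cat" was predicted as "dog";
#  [0 2 0]     off-diagonal counts show which classes the model confuses
#  [1 0 0]]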
Thresholding
In binary classification, choosing randomly gives a probability of 0.5 of belonging to a class.
It is possible to improve the percentage of correct results at the cost of coverage.
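A sketch of that trade-off: only accept predictions whose top softmax probability clears a threshold and track how much of the data is still covered (the threshold of 0.8 is arbitrary).

import torch

def predict_with_threshold(logits, threshold=0.8):
    """Return predicted classes, keeping only predictions whose confidence clears the threshold."""
    probs = torch.softmax(logits, dim=1)
    conf, preds = probs.max(dim=1)
    keep = conf >= threshold                  # boolean mask of "confident enough" predictions
    coverage = keep.float().mean().item()     # fraction of examples we still make a call on
    return preds[keep], keep, coverage

# Usage sketch with random logits for a 4-class problem
logits = torch.randn(16, 4)
preds, keep, coverage = predict_with_threshold(logits, threshold=0.8)
print(f"coverage: {coverage:.0%}")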
Object Detection: YOLO
Outputs the coordinates of the bounding box and a confidence score for the class of the object.
You Only Look Once
• Single-shot detection – combines regression & classification in one pass
• Extremely fast – can process 45 frames per second
• Trade-off between speed and accuracy
• Easy to use – the code is open source (see the sketch below)
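Not part of the original slides, but one low-effort way to try a YOLO-family detector is through torch.hub, assuming the ultralytics/yolov5 hub entry and its dependencies are available; the image path is a placeholder.

import torch

# Load a small pre-trained YOLOv5 model from the Ultralytics hub (downloads weights on first run)
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

# Run detection on a local image (placeholder path); results hold boxes, confidences, and class labels
results = model("my_image.jpg")
results.print()          # summary of detections
boxes = results.xyxy[0]  # tensor rows: [x1, y1, x2, y2, confidence, class] per detection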
Metrics for Object Detection
Precision and recall are computed above an IoU threshold; 0.5 is common.
AP: Average Precision (AP) is the area under the precision-recall curve.
mAP: mAP for object detection is the average of the AP calculated over all the classes. mAP@0.5 means the mAP calculated at an IoU threshold of 0.5.
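A sketch of the IoU computation these thresholds are based on, for boxes given as [x1, y1, x2, y2] (the coordinates in the usage line are made up).

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: a predicted box vs. a ground-truth box (made-up coordinates)
print(iou([10, 10, 60, 60], [20, 20, 70, 70]))  # ~0.47 -- below a 0.5 threshold, so not a match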
Example Detection
If multiple boxes are detected above the confidence threshold, choose the one with the highest IoU.
Summary
• Enormous advances have been made in Computer Vision in the past decade using Deep Learning.
• State-of-the-art pre-trained DL models are best suited for most projects.
o Easy-to-use code is readily available.
o High-performance models can be built with good training data.
• The success of a project is related to defining and measuring metrics correctly.
• Programming frameworks such as Pytorch and Keras abstract details and are useful for building models fast.
Thank You
@sangha_deb
Editor's Notes
  • #4: CV is the process of a computer recording the input coming from a camera, analyzing it, drawing conclusions, and making decisions based on it.
  • #6: Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment. Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.
  • #20: When training deep networks there comes a point where an increase in depth causes accuracy to saturate, then degrade rapidly. This is called the "degradation problem." This highlights that not all neural network architectures are equally easy to optimize.
  • #21: ResNet uses a technique called "residual mapping" to combat this issue. Instead of hoping that every few stacked layers directly fit a desired underlying mapping, the Residual Network explicitly lets these layers fit a residual mapping.