Computer Vision for
Beginners
Sanghamitra Deb
Staff Data Scientist
Chegg Inc
Outline
• Introduction and Applications
• CNN and State of the Art Deep Learning Classification Models
• CV Classification Pipeline in Pytorch
• Classification Metrics
• Intro to Object Detection
• Metrics for Object Detection
• Summary
What is Computer Vision?
“If We Want Machines to Think, We Need to Teach Them to See.”
“Understanding vision and building visual systems is really understanding intelligence.”
“And by see, I mean to understand, not just to record pixels.”
-- Fei-Fei Li
Applications: Image Search
Driven by Image Similarity
Applications: HealthCare
• Cancer Screening
• Disease Diagnostics
• Surgical Assistance Technology
Example: Detection of diabetic retinopathy.
https://www.kaggle.com/c/diabetic-retinopathy-detection
Computer Vision Applications: Education
• Gauging engagement at a personalized level
• Gaze analysis: Providing feedback to students
• Tailored Learning Experiences for students and Personalized teaching modes
• Providing optimal Career Paths
• Uncovering Learning Gaps
• Generating Customized Education Content
• Collaborative Learning: pair students up based on similar learning styles.
Computer Vision Applications: Self-Driving Cars
Self-driving cars use cameras, radar, and lasers to perceive the world around them and build a digital map.
Computer vision helps cars see via object detection and classification.
Computer Vision Applications: Mobile Apps
Computer Vision Techniques
Classification & Object Detection
Classification
Given a set of pixels, determine the category of the image
85% cat
12% dog
2% hat
1% mug
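As a minimal sketch of what these percentages are: a classifier's final layer produces raw scores (logits) that a softmax turns into class probabilities. The logit values below are made up to roughly reproduce the numbers above.

import torch

# Hypothetical logits for the classes [cat, dog, hat, mug] from a classifier's final layer
logits = torch.tensor([4.2, 2.3, 0.5, -0.2])

# Softmax converts logits into probabilities that sum to 1
probs = torch.softmax(logits, dim=0)
for name, p in zip(["cat", "dog", "hat", "mug"], probs):
    print(f"{name}: {p:.1%}")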
DL Classification Networks
Timeline (ImageNet performance and model size): LeNet (1998) → AlexNet (2012) → ZFNet (2013) → Network-in-Network (2013) → VGGNet (2014) → GoogLeNet (2014) → ResNet (2015) → FractalNet (2016) → DenseNet (2017)
AlexNet 2012
The AlexNet architecture is a conv layer followed by a pooling layer and normalization, then conv-pool-norm again, then a few more conv layers, a pooling layer, and several fully connected layers at the end -- eight layers in total.
For data augmentation, AlexNet used flipping, jittering, cropping, and colour normalization. Other training details: Dropout of 0.5, SGD + momentum of 0.9, an initial learning rate of 1e-2 reduced by a factor of 10 whenever validation accuracy plateaued, and L2 regularization with a weight decay of 5e-4. It was trained on GTX 580 GPUs with 3 GB of memory.
https://towardsdatascience.com/architecture-comparison-of-alexnet-vggnet-resnet-inception-densenet-beb8b116866d
AlexNet
What was the state before?
• Small networks
• Few applications
What was novel?
• ReLU for non-linearity
• Local Response Normalization
• Dropout regularization
• Max-pooling as an alternative to average-pooling
• GPU training
• Much larger network
What was the state after?
• It worked well!
• Shot heard around the world
• A revolution was underway
• CNNs had arrived
• Specialized hardware use
• Many more applications
• “Deep Learning” interest rapidly increased
What were the lessons learned?
• Neural Networks were now ready for prime-time
• Could do useful tasks
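Not in the original slides, but for reference: a pre-trained AlexNet can be loaded from torchvision much like the resnet-18 used later in this deck. A minimal sketch (the pretrained flag follows the older torchvision API used elsewhere here):

import torch
from torchvision import models

# Load AlexNet with ImageNet weights (older torchvision API, matching the resnet18 example later)
alexnet = models.alexnet(pretrained=True)
alexnet.eval()  # inference mode

# A dummy batch: 1 image, 3 channels, 224x224 (ImageNet-style input)
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = alexnet(x)  # shape: (1, 1000) -- one score per ImageNet class
print(logits.shape)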
VGGNet - 2014
VGG 16 is a 16-layer architecture built from pairs of convolution layers, pooling layers, and fully connected layers at the end. The core idea of the VGG network is much deeper networks with much smaller filters; VGGNet increased the number of layers from the eight layers in AlexNet.
3 x 3 conv filters are the smallest possible filters with fewer parameters. They look at pixels that are immediate neighbors, and stacking many of them results in a larger receptive field.
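A quick sketch of the point about small filters: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters (the channel count of 64 is arbitrary).

import torch.nn as nn

C = 64  # arbitrary channel count for illustration

# Two stacked 3x3 convs: 5x5 receptive field, with a non-linearity in between in a real network
stacked = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU(), nn.Conv2d(C, C, 3, padding=1))

# A single 5x5 conv: same receptive field, more parameters
single = nn.Conv2d(C, C, 5, padding=2)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(stacked), n_params(single))  # 2*(64*64*9 + 64) = 73,856  vs  64*64*25 + 64 = 102,464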
VGGNet
What was the state before?
• AlexNet “deep” CNN not deep enough nor wide enough
What was novel?
• Much smaller 3x3 filters
• Multiple 3x3 filters within each layer
• Resulted in a huge network
What was the state after?
• Worked extremely well!
• Pushed the envelope on model size
What were the lessons learned?
• The depth of a network is a critical component to accuracy.
• But, big networks are expensive to train and slow to evaluate.
Vanishing/Exploding Gradients
Computing the gradients of the “front” layers of an n-layer network involves multiplying n per-layer factors together.
When the network is deep and these factors are small, the product goes to zero (the gradient vanishes).
When these factors are large, the product blows up (the gradient explodes).
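A back-of-the-envelope illustration (the per-layer factors 0.5 / 1.5 and the depth of 50 are made-up numbers):

# Repeatedly multiplying small or large per-layer gradient factors across a deep network
small, large, depth = 0.5, 1.5, 50
print(small ** depth)  # ~8.9e-16 -- effectively zero: the gradient "vanishes"
print(large ** depth)  # ~6.4e+08 -- blows up: the gradient "explodes"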
ResNet - 2015
Right: regular CNN. Left: instead of fitting the desired mapping H(x) directly, the block fits the residual F(x) = H(x) - x. A skip / shortcut connection adds the input x to the output after a few weight layers.
These blocks can be stacked to build networks over 150 layers deep.
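A minimal sketch of the residual block idea described above (simplified; real ResNet blocks also handle downsampling and channel changes): the block learns F(x), and the skip connection adds x back so the output is F(x) + x.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convs bypassed by an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))  # first weight layer
        out = self.bn2(self.conv2(out))            # second weight layer -> F(x)
        return torch.relu(out + x)                 # skip connection adds the input back: F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output has the same shape as the input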
Plain Network vs ResNet
ResNet
What was the state before?
• Ultra-deep networks still suffered from the vanishing gradient problem
What was novel?
• Fewer filters reduce computational complexity & the number of parameters for the same depth
• “The Residual Block”: add a bypass to avoid dead units.
• Bypassed TWO layers!
• Combine serial & parallel units.
What was the state after?
• Sizeable jump in number of layers
• Prize-winner...again
What were the lessons learned?
• Behaves like an ensemble of shallow networks
• Plausible model for biological visual cortex
Image classification Pipeline --- Pytorch
1. Data preprocessing: move the data for each class into its own folder.
2. Choose a classification model, for example resnet-18.
3. CPU (feature extraction): freeze the layers and extract features, apply data augmentations, train a classifier such as an SVM or LR, then measure performance metrics.
4. GPU (fine tuning): make sure the weights for the layers are not frozen, add the final output layer, choose your optimizer, train until accuracy on the validation data converges, then measure performance metrics.
Code
Train, Test & Validation
Train – 72%, Test – 20%, Validation – 8% ---- the percentages can vary depending on the total size of the dataset.
• Train --- data used to train the model
• Validation --- data that the model has not seen but is used for parameter tuning, i.e. the model is optimized based on performance on this set.
• Test --- the model has not seen this data and it is not used in any part of the computation. Final performance metrics are reported on this data.
from sklearn.model_selection import train_test_split
train0, test = train_test_split(df_labels,
    shuffle=True, random_state=42, test_size=0.2,
    stratify=df_labels.classes)
train, valid = train_test_split(train0,
    shuffle=True, random_state=42, test_size=0.1,
    stratify=train0.classes)
Code
Data is typically unbalanced; use stratified sampling to preserve the class ratios.
Data Augmentation
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'valid': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'test': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}
Moving & Loading Data
train_dir = parent_dir + 'train/'
test_dir = parent_dir + 'test/'
val_dir = parent_dir + 'valid/'   # folder name must match the 'valid' key used by ImageFolder below
!mkdir $train_dir
!mkdir $test_dir
!mkdir $val_dir
for gid in class_names:
    dir_curr_test = test_dir + str(gid) + '/'
    dir_curr_train = train_dir + str(gid) + '/'
    dir_curr_val = val_dir + str(gid) + '/'
    !mkdir $dir_curr_test
    !mkdir $dir_curr_train
    !mkdir $dir_curr_val
    val_df_curr = valid[valid.group_id == gid]
    for index, row in val_df_curr.iterrows():
        fname_curr = all_data_dir + row['filenames']
        !cp $fname_curr $dir_curr_val   # copy each image into its class folder (repeat for train / test)
data_dir = parent_dir
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x])
                  for x in ['train', 'valid', 'test']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                              shuffle=True, num_workers=4)
               for x in ['train', 'valid', 'test']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'valid', 'test']}
class_names = [x for x in image_datasets['train'].classes]
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Training
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        # Each epoch has a training and validation phase
        for phase in ['train', 'valid']:
            if phase == 'train':
                model.train()   # Set model to training mode
            else:
                model.eval()    # Set model to evaluate mode
            running_loss = 0.0
            running_corrects = 0
            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                # zero the parameter gradients
                optimizer.zero_grad()
                # forward; track history only in train
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()
            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))
            # keep the weights of the best model seen on the validation set
            if phase == 'valid' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())
    print('Training complete in {:.0f}s, best val Acc: {:.4f}'.format(time.time() - since, best_acc))
    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
Choose pre-trained model
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
Output Layer
model_ft.fc = nn.Linear(num_ftrs, len(class_names))
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)
# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=8)
Pre-trained weights: To freeze or not to freeze
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False
• Feature Extraction: pre-trained weights should be frozen (see the sketch below)
• Fine Tuning: weights should not be frozen, especially if the data is significantly different from the dataset on which the model was pre-trained
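A sketch of the feature-extraction setting using the helper above (model_ft, num_ftrs, and class_names as defined on the earlier slides): freeze the backbone, then replace the final layer so that only the new head is trained.

# Feature extraction: freeze all pre-trained weights, then add a fresh output layer
set_parameter_requires_grad(model_ft, feature_extracting=True)
model_ft.fc = nn.Linear(num_ftrs, len(class_names))  # the new layer's weights require grad by default

# Only pass the trainable parameters to the optimizer
params_to_update = [p for p in model_ft.parameters() if p.requires_grad]
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)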
Metrics
Performance Metrics
True Positives --- number of observations where the model correctly predicts the positive class.
False Positives --- number of observations where the model incorrectly predicts the positive class.
False Negatives --- number of observations where the model incorrectly predicts the negative class.
True Negatives --- number of observations where the model correctly predicts the negative class.
Performance Metrics
Precision: TP/(TP+FP) --- what percentage of predicted positives are actually positive?
Recall: TP/(TP+FN) --- what percentage of the positive class gets captured by the model?
Accuracy: (TP+TN)/(TP+FP+TN+FN) --- what percentage of predictions are correct?
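A small sketch computing these metrics with scikit-learn (the label vectors are made up for illustration):

from sklearn.metrics import precision_score, recall_score, accuracy_score

# Hypothetical ground truth and model predictions for a binary problem (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 6/8 = 0.75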
Confusion Matrix
• Good for checking where your model is incorrect
• For multi-class classification it shows which classes get confused with each other (see the sketch below)
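A sketch of a multi-class confusion matrix with scikit-learn (hypothetical labels); the off-diagonal counts show which classes the model mixes up.

from sklearn.metrics import confusion_matrix

# Hypothetical 3-class labels: rows of the matrix are true classes, columns are predicted classes
y_true = ["cat", "cat", "dog", "dog", "hat", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog", "hat"]))
# [[2 1 0]     e.g. one true "cat" was predicted as "dog";
#  [0 2 0]     off-diagonal counts show which classes the model confuses
#  [1 0 0]]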
Thresholding
In binary classification, choosing randomly gives a probability of 0.5 of belonging to a class.
It is possible to improve the percentage of correct results at the cost of coverage.
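A sketch of that trade-off: only accept predictions whose top softmax probability clears a threshold and track how much of the data is still covered (the threshold of 0.8 is arbitrary).

import torch

def predict_with_threshold(logits, threshold=0.8):
    """Return predicted classes, keeping only predictions whose confidence clears the threshold."""
    probs = torch.softmax(logits, dim=1)
    conf, preds = probs.max(dim=1)
    keep = conf >= threshold                  # boolean mask of "confident enough" predictions
    coverage = keep.float().mean().item()     # fraction of examples we still make a call on
    return preds[keep], keep, coverage

# Usage sketch with random logits for a 4-class problem
logits = torch.randn(16, 4)
preds, keep, coverage = predict_with_threshold(logits, threshold=0.8)
print(f"coverage: {coverage:.0%}")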
Object Detection: YOLO
Outputs the coordinates of the bounding box and a confidence score for the class of the object.
You Only Look Once
• Single-shot detection – combines regression & classification in one pass
• Extremely fast – can process 45 frames per second
• Trade-off between speed and accuracy
• Easy to use – the code is open source (see the sketch below)
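Not part of the original slides, but one low-effort way to try a YOLO-family detector is through torch.hub, assuming the ultralytics/yolov5 hub entry and its dependencies are available; the image path is a placeholder.

import torch

# Load a small pre-trained YOLOv5 model from the Ultralytics hub (downloads weights on first run)
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

# Run detection on a local image (placeholder path); results hold boxes, confidences, and class labels
results = model("my_image.jpg")
results.print()          # summary of detections
boxes = results.xyxy[0]  # tensor rows: [x1, y1, x2, y2, confidence, class] per detection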
Metrics for Object Detection
Precision and recall are computed above an IoU threshold; 0.5 is common.
AP: Average Precision (AP) is the area under the precision-recall curve.
mAP: mAP for object detection is the average of the AP calculated over all the classes. mAP@0.5 means the mAP calculated at an IoU threshold of 0.5.
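A sketch of the IoU computation these thresholds are based on, for boxes given as [x1, y1, x2, y2] (the coordinates in the usage line are made up).

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: a predicted box vs. a ground-truth box (made-up coordinates)
print(iou([10, 10, 60, 60], [20, 20, 70, 70]))  # ~0.47 -- below a 0.5 threshold, so not a match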
Example Detection
If multiple boxes are detected above the confidence threshold, choose the one with the highest IoU.
Summary
• Enormous advances have been made in Computer Vision in the past decade using Deep Learning.
• State-of-the-art pre-trained DL models are best suited for most projects.
o Easy-to-use code is readily available.
o High-performance models can be built with good training data.
• The success of a project is related to defining and measuring metrics correctly.
• Programming frameworks such as Pytorch and Keras abstract details and are useful for building models fast.
Thank You
@sangha_deb
Editor's Notes
  • #4: CV is the process of a computer recording the input coming from a camera, analyzing it, drawing conclusions, and making decisions based on it.
  • #6: Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow up, miscommunication, and delayed treatment. Clinicians can identify DR by the presence of lesions associated with the vascular abnormalities caused by the disease. While this approach is effective, its resource demands are high. The expertise and equipment required are often lacking in areas where the rate of diabetes in local populations is high and DR detection is most needed. As the number of individuals with diabetes continues to grow, the infrastructure needed to prevent blindness due to DR will become even more insufficient.
  • #20: When training deep networks there comes a point where an increase in depth causes accuracy to saturate, then degrade rapidly. This is called the "degradation problem." This highlights that not all neural network architectures are equally easy to optimize.
  • #21: ResNet uses a technique called "residual mapping" to combat this issue. Instead of hoping that every few stacked layers directly fit a desired underlying mapping, the Residual Network explicitly lets these layers fit a residual mapping.