CNN Vision Applications

CNNs for
Segmentation, Localization, and

detection
M. Soleymani
Sharif University of Technology
Fall 2017
Most slides have been adopted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017
and some from John Canny lectures, cs294-129, Berkeley, 2016. .
AlexNet
[Krizhevsky, Sutskever, Hinton, 2012]
• ImageNet Classification with Deep Convolutional Neural Networks

Image classification
Other Computer Vision Tasks
Classification and localization
What & where
Classification + Localization
Classification: C classes
Input: Image
Output: Class label CAT
Evaluation metric: Accuracy
Localization:
Input: Image
Output: Box in the image (x, y, w, h) (x, y, w, h)
Evaluation metric: Intersection over Union
Classification + Localization: Do both

Idea #1: Localization as Regression
Input: image
Neural Output:
Net Box coordinates
(4 numbers)
Loss:
Correct output: L2 distance
box coordinates
Only one object, (4 numbers)
simpler than
detection
Simple Recipe for Classification + Localization
• Step 1: Train (or download) a classification model (e.g., VGG)
Convolution
Fully-connected
and Pooling
layers
Softmax
loss
Final conv
Class
Image feature map
scores
• Step 2: Attach new fully-connected “regression head” to the network
Fully-connected
layers
“Classification head”
Convolution Class scores
and Pooling
Fully-connected
layers
“Regression head”
Final conv Box
Image feature map coordinates
• Step 3: Train the regression head only with SGD and L2 loss
Fully-connected
layers
Convolution
and Pooling Class scores
Fully-connected
layers
L2 loss
Final conv
Box
Fully-connected
layers
Convolution
Fully-connected
layers
L2 loss
Final conv
Box
Fully-connected
layers
Softmax loss
Convolution
Fully-connected
layers
L2 loss
Final conv
Box
Classification + Localization
Often pretrained on
ImageNet (Transfer learning)
Aside: Human Pose Estimation
Aside: Human Pose Estimation
Where to attach the regression head?
After conv layers: After last FC layer:

Overfeat, VGG DeepPose, R-CNN
Convolution Fully-
and Pooling connected
layers
Softmax
loss
Final conv
Class
Image feature map
scores
Object detection
Object Detection: Impact of Deep Learning
Object Detection as Regression?
Each image needs a different number of outputs!
Object Detection as Classification: Sliding Window
Problem: Need to apply CNN to huge number of

locations and scales, very computationally expensive!
Region Proposals
R-CNN
R-CNN
R-CNN
R-CNN
R-CNN Training
• Step 1: Train (or download) a classification model for ImageNet
(AlexNet)
Convolution Fully-connected
and Pooling layers
Softmax loss
Final conv
Class scores
feature map
Image 1000 classes
R-CNN Training
• Step 2: Fine-tune model for detection
- Instead of 1000 ImageNet classes, want 20 object classes + background
- Throw away final fully-connected layer, reinitialize from scratch
- Keep training model using positive / negative regions from detection images
Re-initialize this layer:

Convolution Fully-connected was 4096 x 1000,
and Pooling layers now will be 4096 x 21
Softmax loss
Final conv
Class scores:
feature map
Image 21 classes
R-CNN Training
Step 3: Extract features
- Extract region proposals for all images
- For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
- Have a big hard drive: features are ~200GB for PASCAL dataset!
Convolution
and Pooling
pool5
features
Image Region Crop + Warp Forward pass Save to disk
Proposals
R-CNN Training
• Step 4: Train one binary SVM per class to classify region features
Training image
regions
Cached region
features
Positive samples for cat Negative samples for cat

SVM SVM
R-CNN Training
• Step 5 (bbox regression): For each class, train a linear regression model to map
from cached features to offsets to GT boxes to make up for “slightly wrong”
proposals
Training image
regions
Cached region
features
Regression targets (0, 0, 0, 0) (.25, 0, 0, 0) (0, 0, -0.125, 0)

(dx, dy, dw, dh) Proposal is Proposal too Proposal too
Normalized good far to left wide
coordinates
R-CNN: Problems
• Ad hoc training objectives
– Fine-tune network with softmax classifier (log loss)
– Train post-hoc linear SVMs (hinge loss)
– Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow

– 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
– Fixed by SPP-net [He et al. ECCV14]
Fast R-CNN
Fast R-CNN
Share computation of
convolutional layers
between proposals for
an image
Fast R-CNN: Region of Interest Pooling
and Pooling layers
Hi-res input Hi-res conv Problem: Fully-connected

image: features: layers expect low-res
3 x 800 x 600 CxHxW conv features: C x h x w
with region with region
proposal proposal
Project region
proposal onto conv
feature map
and Pooling layers

proposal proposal
Convolution Divide projected

region into h x w Fully-connected
and Pooling layers
grid

proposal proposal
Max-pool
Convolution within each
grid cell Fully-connected
and Pooling layers
Hi-res input Hi-res conv RoI conv features: Fully-connected layers

image: features: Cxhxw expect low-res conv
3 x 800 x 600 CxHxW for region features:
with region with region proposal Cxhxw
proposal proposal
Can back
Convolution propagate similar
to max pooling Fully-connected
and Pooling layers
Hi-res input Hi-res conv RoI conv features: Fully-connected layers

image: features: Cxhxw expect low-res conv
3 x 800 x 600 CxHxW for region features:
with region with region proposal Cxhxw
proposal proposal
Fast R-CNN: RoI Pooling
Fast R-CNN
Share computation of
convolutional layers
between proposals for
an image
R-CNN vs SPP vs Fast R-CNN
Problem: Runtime dominated by region proposals!

Faster R-CNN
• Make CNN do proposals!
– Solely based on CNN
– No external modules
• Each step is end-to-end
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks”. NIPS 2015
Faster R-CNN
• Insert Region Proposal Network
(RPN) to predict proposals from
features Jointly
• Train with 4 losses:
– RPN classify object / not object
– RPN regress box coordinates
– Final classification score (object
classes)
– Final box coordinates
Faster R-CNN: Make CNN do proposals!
Object Detection
Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
Semantic Segmentation
Semantic Segmentation Idea: Sliding Window
Problem: Very inefficient!

Not reusing shared features between
overlapping patches
Semantic Segmentation Idea: Fully Convolutional
In-Network upsampling: “Unpooling”
In-Network upsampling: “Max Unpooling”
Learnable Upsampling: Transpose Convolution
Other names:
Deconvolution (bad)
Upconvolution
Fractionally strided convolution
Backward strided convolution
Transpose Convolution: 1D Example
Convolution as Matrix Multiplication (1D Example)
Convolution as Matrix Multiplication (1D Example)
Computer Vision Tasks

CNN Vision Applications

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CNN Vision Applications

Uploaded by

Copyright:

Available Formats

CNNs for

Segmentation, Localization, and

• ImageNet Classification with Deep Convolutional Neural Networks

Classification + Localization: Do both

After conv layers: After last FC layer:

Problem: Need to apply CNN to huge number of

Re-initialize this layer:

Positive samples for cat Negative samples for cat

Regression targets (0, 0, 0, 0) (.25, 0, 0, 0) (0, 0, -0.125, 0)

• Training is slow (84h), takes a lot of disk space

• Inference (detection) is slow

Hi-res input Hi-res conv Problem: Fully-connected

Hi-res input Hi-res conv Problem: Fully-connected

Convolution Divide projected

Hi-res input Hi-res conv Problem: Fully-connected

Hi-res input Hi-res conv RoI conv features: Fully-connected layers

Hi-res input Hi-res conv RoI conv features: Fully-connected layers

Problem: Runtime dominated by region proposals!

Problem: Very inefficient!

You might also like