
CNNs for Segmentation, Localization, and Detection
M. Soleymani
Sharif University of Technology
Fall 2017

Most slides have been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017,
and some from John Canny's lectures, cs294-129, Berkeley, 2016.
AlexNet
[Krizhevsky, Sutskever, Hinton, 2012]

• ImageNet Classification with Deep Convolutional Neural Networks


Image classification
Other Computer Vision Tasks
Classification and localization
What & where
Classification + Localization
Classification: C classes
Input: Image
Output: Class label (e.g., CAT)
Evaluation metric: Accuracy

Localization:
Input: Image
Output: Box in the image (x, y, w, h)
Evaluation metric: Intersection over Union

Classification + Localization: Do both
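The localization metric above can be sketched in a few lines. This is a minimal illustration, assuming (x, y) is the top-left corner of the box and that coordinates are in pixels; the function name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner, w and h are positive.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    # Union = sum of the two areas minus the part counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A predicted box that matches the correct box exactly scores 1, and a box that does not overlap it at all scores 0.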


Idea #1: Localization as Regression

Input: image
Neural Net output: box coordinates (4 numbers)
Correct output: box coordinates (4 numbers)
Loss: L2 distance
Only one object, simpler than detection
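As a rough sketch of the loss on this slide, the snippet below compares a predicted box with the correct box. It assumes the squared L2 distance is used (the slide only says "L2 distance") and that boxes are (x, y, w, h); the name `l2_box_loss` is hypothetical.

```python
import numpy as np

def l2_box_loss(pred_box, true_box):
    """Squared L2 distance between predicted and correct (x, y, w, h).

    Assumption: the squared form is used, which is common for training.
    """
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    return float(np.sum(diff ** 2))
```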
Simple Recipe for Classification + Localization
• Step 1: Train (or download) a classification model (e.g., VGG)
[Figure: Image → convolution and pooling → final conv feature map → fully-connected layers → class scores → softmax loss]
• Step 2: Attach new fully-connected “regression head” to the network
[Figure: the final conv feature map feeds two sets of fully-connected layers: the “classification head” (class scores) and the “regression head” (box coordinates)]
• Step 3: Train the regression head only with SGD and L2 loss
[Figure: L2 loss applied to the box coordinates produced by the regression head]
[Figure: both heads share the final conv feature map; class scores go to the softmax loss, box coordinates go to the L2 loss]
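The recipe above can be sketched with NumPy. This is only an illustration under stated assumptions: random vectors stand in for the flattened final conv feature map of a pretrained network, each head is a single linear layer rather than several fully-connected layers, and all sizes and names (`FEAT_DIM`, `W_cls`, `W_box`, `sgd_step_box_head`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, NUM_CLASSES, BATCH = 512, 10, 4   # hypothetical sizes

# Stand-in for the flattened final conv feature map of a pretrained network.
features = rng.standard_normal((BATCH, FEAT_DIM))

# Two heads on the same features (one linear layer each, for brevity).
W_cls = rng.standard_normal((FEAT_DIM, NUM_CLASSES)) * 0.01   # classification head
W_box = rng.standard_normal((FEAT_DIM, 4)) * 0.01             # regression head

def softmax_loss(scores, labels):
    """Mean cross-entropy of softmax(scores) against integer labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def l2_loss(pred, target):
    """Mean squared L2 distance between predicted and correct boxes."""
    return float(((pred - target) ** 2).sum(axis=1).mean())

def sgd_step_box_head(feats, target, W, lr=1e-3):
    """One SGD step on the regression head only; other weights stay frozen."""
    grad = 2.0 * feats.T @ (feats @ W - target) / len(feats)
    return W - lr * grad

labels = np.array([0, 3, 3, 7])
true_boxes = rng.random((BATCH, 4))          # made-up (x, y, w, h) targets

scores, boxes = features @ W_cls, features @ W_box
cls_loss = softmax_loss(scores, labels)
box_loss_before = l2_loss(boxes, true_boxes)
W_box_new = sgd_step_box_head(features, true_boxes, W_box)
box_loss_after = l2_loss(features @ W_box_new, true_boxes)
```

Only `W_box` is updated here, which mirrors Step 3: the pretrained layers and the classification head are left as they are while the regression head is fit with the L2 loss.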
Classification + Localization

Often pretrained on ImageNet (Transfer learning)
Aside: Human Pose Estimation
Where to attach the regression head?

After conv layers: Overfeat, VGG
After last FC layer: DeepPose, R-CNN
[Figure: Image → convolution and pooling → final conv feature map → fully-connected layers → class scores → softmax loss]
Object detection
Object Detection: Impact of Deep Learning
Object Detection as Regression?
Each image needs a different number of outputs!
Object Detection as Classification: Sliding Window

Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
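To get a feel for the cost, the sketch below counts how many crops a sliding-window detector would have to classify. The window sizes, stride, and the name `count_windows` are hypothetical choices for illustration, not values from the lecture.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Number of window positions over all (w, h) window sizes.

    Each position is one crop that would need its own CNN forward pass.
    """
    total = 0
    for w, h in win_sizes:
        if w > img_w or h > img_h:
            continue  # this window does not fit in the image
        nx = (img_w - w) // stride + 1
        ny = (img_h - h) // stride + 1
        total += nx * ny
    return total

# Hypothetical setting: an 800 x 600 image, three square window sizes, stride 8.
n_crops = count_windows(800, 600, [(64, 64), (128, 128), (256, 256)], stride=8)
```

Even this coarse setting, with only three scales and no aspect-ratio variation, gives thousands of crops per image.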
Region Proposals
R-CNN
R-CNN Training
• Step 1: Train (or download) a classification model for ImageNet (AlexNet)
[Figure: Image → convolution and pooling → final conv feature map → fully-connected layers → class scores (1000 classes) → softmax loss]
• Step 2: Fine-tune model for detection
- Instead of 1000 ImageNet classes, want 20 object classes + background
- Throw away final fully-connected layer, reinitialize from scratch
- Keep training model using positive / negative regions from detection images
- Re-initialize this layer: was 4096 x 1000, now will be 4096 x 21
[Figure: Image → convolution and pooling → final conv feature map → fully-connected layers → class scores (21 classes) → softmax loss]
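The layer swap in Step 2 can be sketched as follows. This is a minimal NumPy illustration, assuming a small Gaussian initialization and a hypothetical helper name `reinit_final_layer`; in a real framework the earlier pretrained layers would be kept and fine-tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinit_final_layer(in_dim=4096, num_classes=21, scale=0.01):
    """Fresh final fully-connected layer: 20 object classes + background.

    The pretrained 4096 x 1000 ImageNet layer is discarded, not reused.
    Assumption: small random Gaussian weights and zero biases.
    """
    return {
        "W": rng.standard_normal((in_dim, num_classes)) * scale,
        "b": np.zeros(num_classes),
    }

detection_fc = reinit_final_layer()

# Stand-in for 4096-d features of two regions from the layer before the final one.
region_features = rng.standard_normal((2, 4096))
class_scores = region_features @ detection_fc["W"] + detection_fc["b"]
```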
• Step 3: Extract features
- Extract region proposals for all images
- For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk
- Have a big hard drive: features are ~200GB for PASCAL dataset!
[Figure: Image → region proposals → crop + warp → forward pass (convolution and pooling) → save pool5 features to disk]
• Step 4: Train one binary SVM per class to classify region features
[Figure: training image regions → cached region features → positive samples and negative samples for the cat SVM]
• Step 5 (bbox regression): For each class, train a linear regression model to map
from cached features to offsets to GT boxes to make up for “slightly wrong”
proposals

Regression targets (dx, dy, dw, dh), normalized coordinates:
(0, 0, 0, 0): proposal is good
(.25, 0, 0, 0): proposal too far to left
(0, 0, -0.125, 0): proposal too wide
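One way to compute such targets is sketched below. It assumes the parametrization used in the R-CNN paper (center offsets scaled by the proposal size, log ratios for width and height) and boxes given as (center x, center y, width, height); the slide itself does not spell out the formula, and the name `regression_targets` is hypothetical.

```python
import math

def regression_targets(proposal, gt):
    """Targets (dx, dy, dw, dh) that map a proposal onto the ground-truth box.

    Assumption: boxes are (cx, cy, w, h) with (cx, cy) the box center.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    dx = (gx - px) / pw          # shift right, in units of proposal width
    dy = (gy - py) / ph          # shift down, in units of proposal height
    dw = math.log(gw / pw)       # negative if the proposal is too wide
    dh = math.log(gh / ph)       # negative if the proposal is too tall
    return dx, dy, dw, dh

# A proposal that sits a quarter of its width too far to the left.
targets_left = regression_targets((50, 50, 40, 40), (60, 50, 40, 40))
```

Normalizing by the proposal size keeps the targets comparable across small and large boxes, so one linear model per class can serve all proposals.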
R-CNN: Problems
• Ad hoc training objectives
– Fine-tune network with softmax classifier (log loss)
– Train post-hoc linear SVMs (hinge loss)
– Train post-hoc bounding-box regressions (least squares)

• Training is slow (84h), takes a lot of disk space

• Inference (detection) is slow


– 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
– Fixed by SPP-net [He et al. ECCV14]
Fast R-CNN

Share computation of convolutional layers between proposals for an image
Fast R-CNN: Region of Interest Pooling
Problem: Fully-connected layers expect low-res conv features: C x h x w
• Hi-res input image: 3 x 800 x 600 with region proposal
• Hi-res conv features: C x H x W with region proposal
• Project region proposal onto conv feature map
• Divide projected region into h x w grid
• Max-pool within each grid cell, giving RoI conv features: C x h x w for region proposal
• Can back propagate similar to max pooling
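The RoI pooling steps on these slides (project, divide into a grid, max-pool each cell) can be sketched in NumPy. This is a simplified illustration, not the Fast R-CNN implementation: the rounding scheme, the feature-map stride of 16, and the names `project_roi` and `roi_max_pool` are assumptions.

```python
import numpy as np

def project_roi(box, stride):
    """Map an image-space box (x0, y0, x1, y1) onto the conv feature map.

    Assumption: the conv layers downsample the image by `stride`.
    """
    x0, y0, x1, y1 = box
    return (int(np.floor(x0 / stride)), int(np.floor(y0 / stride)),
            int(np.ceil(x1 / stride)), int(np.ceil(y1 / stride)))

def roi_max_pool(features, roi, out_h, out_w):
    """Max-pool the region `roi` of a (C, H, W) feature map to (C, out_h, out_w).

    `roi` is (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive,
    and is assumed to lie inside the feature map.
    """
    x0, y0, x1, y1 = roi
    ys = np.linspace(y0, y1, out_h + 1)   # grid cell boundaries along y
    xs = np.linspace(x0, x1, out_w + 1)   # grid cell boundaries along x
    out = np.empty((features.shape[0], out_h, out_w), dtype=features.dtype)
    for i in range(out_h):
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)   # at least one row per cell
        for j in range(out_w):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            out[:, i, j] = features[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

feature_map = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = roi_max_pool(feature_map, (0, 0, 4, 4), out_h=2, out_w=2)
```

Whatever the size of the projected region, the output is always C x h x w, which is what lets proposals of different shapes share the same fully-connected layers.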
Fast R-CNN: RoI Pooling
R-CNN vs SPP vs Fast R-CNN

Problem: Runtime dominated by region proposals!


Faster R-CNN
• Make CNN do proposals!
– Solely based on CNN
– No external modules
• Each step is end-to-end

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks”. NIPS 2015
Faster R-CNN
• Insert Region Proposal Network (RPN) to predict proposals from features
• Jointly train with 4 losses:
– RPN classify object / not object
– RPN regress box coordinates
– Final classification score (object classes)
– Final box coordinates
Faster R-CNN: Make CNN do proposals!
Object Detection

Source: http://icml.cc/2016/tutorials/icml2016_tutorial_deep_residual_networks_kaiminghe.pdf
Semantic Segmentation
Semantic Segmentation Idea: Sliding Window

Problem: Very inefficient! Not reusing shared features between overlapping patches
Semantic Segmentation Idea: Fully Convolutional
In-Network upsampling: “Unpooling”
In-Network upsampling: “Max Unpooling”
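A small NumPy sketch of max unpooling follows, assuming 2 x 2 pooling with stride 2 on a single-channel map: the pooling step remembers where each maximum came from, and the unpooling step writes values back to those positions and fills the rest with zeros. The function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling (stride 2) on an (H, W) array with even H and W.

    Returns the pooled map and, per output cell, the index (0..3) of the max
    inside its 2 x 2 block in row-major order.
    """
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(H // 2, W // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool_2x2(y, idx):
    """Place each value of y at the remembered max position; zeros elsewhere."""
    h, w = y.shape
    out = np.zeros((h, w, 4), dtype=y.dtype)
    rows, cols = np.indices((h, w))
    out[rows, cols, idx] = y
    return out.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(2 * h, 2 * w)

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, positions = max_pool_2x2(x)
upsampled = max_unpool_2x2(np.array([[1, 2], [3, 4]]), positions)
```

In a segmentation network the pooling and unpooling layers are paired, so each upsampling layer reuses the positions recorded by its matching downsampling layer.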
Learnable Upsampling: Transpose Convolution

Other names:
Deconvolution (bad)
Upconvolution
Fractionally strided convolution
Backward strided convolution
Transpose Convolution: 1D Example
Convolution as Matrix Multiplication (1D Example)
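The matrix view can be sketched in NumPy for the 1D case. The assumptions here are a kernel of size 3, zero padding of 1, and convolution in the cross-correlation sense used by CNNs; the helper name `conv_matrix_1d` and the kernel values are made up for illustration.

```python
import numpy as np

def conv_matrix_1d(kernel, in_len, stride=1, pad=1):
    """Matrix X such that X @ a is the 1D convolution of a with `kernel`.

    Each row holds one shifted copy of the kernel. Columns that would
    multiply the zero padding are dropped, so X has shape (out_len, in_len).
    """
    k = len(kernel)
    out_len = (in_len + 2 * pad - k) // stride + 1
    X = np.zeros((out_len, in_len + 2 * pad))
    for i in range(out_len):
        X[i, i * stride:i * stride + k] = kernel
    return X[:, pad:pad + in_len]

kernel = np.array([1.0, 2.0, 3.0])
a = np.ones(4)

# Stride 1: convolution is X @ a, and the output has the same length as a.
X1 = conv_matrix_1d(kernel, in_len=4, stride=1)
conv_out = X1 @ a

# Stride 2: convolution maps length 4 to length 2 ...
X2 = conv_matrix_1d(kernel, in_len=4, stride=2)
down = X2 @ a
# ... and multiplying by the transpose maps length 2 back to length 4.
up = X2.T @ np.ones(2)
```

With stride 2 the transposed matrix turns a short vector into a longer one, which is why the transpose convolution acts as learnable upsampling; with stride 1 the transpose is just another ordinary convolution.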
Semantic Segmentation Idea: Fully Convolutional
Computer Vision Tasks
