
Semantic Segmentation


Semantic segmentation
• Labelling every pixel in an image
• A key part of Scene Understanding
• Applications
  • Autonomous navigation
  • Assisting the partially sighted
  • Medical diagnosis
  • Image editing

Architecture
• Deep network is constructed by modifying ResNet-101
• Fully connected layers of ResNet-101 are converted to convolutional layers
• Model weights of the ImageNet-pretrained ResNet-101 network are finetuned
• Convolutional filters in the layers that follow pooling are modified to atrous spatial pyramid pooling (ASPP)
• ASPP consists of one 1x1 convolution and three 3x3 convolutions with rates (6, 12, 18) - all with 256 filters and batch normalization - plus image-level features (global average pooling)

ASPP block Architecture
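The ASPP block described above maps directly onto a few standard layers. Below is a minimal sketch in TensorFlow/Keras (the inference framework named later in this report), assuming an arbitrary backbone feature map as input; it is an illustration of the idea, not the exact production graph.

```python
# Minimal ASPP sketch: 1x1 conv + three 3x3 atrous convs (rates 6, 12, 18),
# all with 256 filters and batch norm, plus an image-level pooling branch.
import tensorflow as tf
from tensorflow.keras import layers


def aspp_block(features, filters=256, rates=(6, 12, 18)):
    """Atrous Spatial Pyramid Pooling over a backbone feature map."""
    branches = []

    # 1x1 convolution branch.
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(features)
    x = layers.BatchNormalization()(x)
    branches.append(layers.ReLU()(x))

    # Three 3x3 atrous convolutions with the (6, 12, 18) rates.
    for rate in rates:
        y = layers.Conv2D(filters, 3, padding="same", dilation_rate=rate,
                          use_bias=False)(features)
        y = layers.BatchNormalization()(y)
        branches.append(layers.ReLU()(y))

    # Image-level features: global average pooling, 1x1 conv, upsample back.
    size = tf.shape(features)[1:3]
    img = layers.GlobalAveragePooling2D(keepdims=True)(features)
    img = layers.Conv2D(filters, 1, use_bias=False)(img)
    img = layers.BatchNormalization()(img)
    img = layers.ReLU()(img)
    img = tf.image.resize(img, size)
    branches.append(img)

    # Concatenate all branches and project with a final 1x1 convolution.
    out = layers.Concatenate()(branches)
    out = layers.Conv2D(filters, 1, padding="same", use_bias=False)(out)
    out = layers.BatchNormalization()(out)
    return layers.ReLU()(out)
```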


Semantic Segmentation on Cityscapes dataset
• Complexity
  • 30 classes
• Diversity
  • 50 cities, several months (spring, summer, fall)
  • Daytime, good/medium weather conditions
  • Manually selected frames
  • Large number of dynamic objects
  • Varying scene layout
  • Varying background
• Volume
  • 5 000 annotated images with fine annotations
  • 20 000 annotated images with coarse annotations

Framework used
• TensorFlow CPU version for inference on the Cityscapes dataset
• MXNet GPU version for training and inference on the IIIT Hyderabad dataset
• Competitive accuracy of 65.2 mIoU compared to the baseline of 55.7 in the ECCV 2018 competition
• Demo video given at the Great Wall conference organized on 12/07/2018

Semantic Segmentation on IIIT Hyderabad dataset
• Complexity
  • 32 classes
• Diversity
  • Unconstrained driving conditions, highly challenging
  • Hyderabad, Bangalore outskirts
  • Daytime, good/medium weather conditions
  • Manually selected frames
  • Large number of dynamic objects
  • Varying scene layout
  • Varying background
• Volume
  • 10,003 annotated images with fine annotations

Knowledge transfer
• Provided an initial set of code for using the KITTI dataset: viewing 3D point clouds and transferring data to/from the 3D LiDAR to the camera using projection matrices (see the projection sketch below)
• Bird's-eye-view projection of 3D LiDAR data
• Data parsers for 3D object detection on the KITTI dataset
• Data parsers for semantic segmentation on the CamVid and Cityscapes datasets
• Made the data parsers compatible with the IIIT Hyderabad dataset in addition to the Cityscapes dataset

GPU investigation
• Aggregated information relevant to desktop development, data-center solutions and embedded GPUs
• Gathered information on GPUs using an Amazon EC2 spot instance
GPU_Specs (comparison table)
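The LiDAR-to-camera transfer mentioned under Knowledge transfer reduces to chaining the KITTI calibration matrices. The NumPy sketch below assumes KITTI-style matrices (Tr_velo_to_cam, R0_rect, P2) and simplified shapes; it is an illustration, not the delivered code.

```python
# Project 3D LiDAR points into the camera image using KITTI-style calibration.
import numpy as np


def to_hom(mat):
    """Pad a 3x3 or 3x4 calibration matrix to a 4x4 homogeneous matrix."""
    out = np.eye(4)
    out[:mat.shape[0], :mat.shape[1]] = mat
    return out


def project_lidar_to_image(points_xyz, P2, R0_rect, Tr_velo_to_cam):
    """points_xyz: (N, 3) LiDAR points.
    P2: (3, 4) camera projection matrix, R0_rect: (3, 3), Tr_velo_to_cam: (3, 4).
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])          # (N, 4) homogeneous

    # LiDAR frame -> camera frame -> rectified camera frame -> image plane.
    cam = to_hom(R0_rect) @ to_hom(Tr_velo_to_cam) @ pts_h.T  # (4, N)
    img = P2 @ cam                                            # (3, N)

    in_front = img[2] > 0.1                                   # drop points behind the camera
    uv = (img[:2] / img[2]).T                                 # (N, 2) pixel coordinates
    return uv, in_front
```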
Speed Improvement in Semantic Segmentation
Motivation

• Fast and accurate semantic segmentation has been a fundamental challenge in computer vision.
• It is unnecessary for deep semantic segmentation models to reprocess every single pixel of every frame in a video sequence.
• Our goal is to develop a real-time semantic segmentation system by leveraging the temporal correlations between consecutive frames.

Approach

• The DVSNet framework consists of 3 major steps.
• The first step in the DVSNet framework is dividing the input frames into 4 frame regions.
• In step 2, the decision network (DN) analyzes the frame region pairs between consecutive frames and evaluates the expected confidence scores for the 4 regions separately. DN compares the expected confidence score of each region against a predetermined threshold. If the expected confidence score of a region is lower than the threshold, the corresponding region is sent to a segmentation path. Otherwise, it is forwarded to a spatial warping path, which includes the flow network.
• Based on the decisions of DN, in step 3, frame regions are forwarded to the different paths to generate their regional semantic segmentations. For the spatial warping path, a special warping function W is employed to process the output of the flow network together with the segmentation Sk from the same region of the key frame, generating a new segmentation Oc for that region.

Decision network (DN)
• DN is a lightweight CNN consisting of only a single convolutional layer and 3 fully connected layers.
• DN takes as input the feature maps from one of the intermediate layers of the flow network and is trained to perform regression. The feature maps fed into DN are allowed to come from any of the layers of the flow network; they represent the spatial transfer information between a key frame region and its corresponding current frame region.
• In the training phase, the goal of DN is to learn to predict an expected confidence score for a frame region as close to the ground truth confidence score as possible. The predicted expected confidence score is compared with the ground truth confidence score to calculate a mean squared error (MSE) loss.
• In the inference phase, the ground truth confidence score is not accessible to either DN or the flow network. (A minimal sketch of DN and the routing rule appears after the results below.)

Results
• Semantic segmentation evaluated on the Cityscapes dataset (30 classes; 5 000 images with fine annotations and 20 000 with coarse annotations; 50 cities, several months, daytime, good/medium weather, manually selected frames, many dynamic objects, varying scene layouts and backgrounds).
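As a rough illustration of the decision network and the thresholding rule described above, here is a hedged Keras sketch. The input feature shape, layer widths and the 0.8 threshold are assumptions for illustration, not values taken from the DVSNet paper.

```python
# Decision network (DN) sketch: one conv layer + three fully connected layers
# regressing an expected confidence score per frame region, trained with MSE.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_decision_network(feature_shape=(30, 40, 32)):
    feat = layers.Input(shape=feature_shape)          # flow-network feature maps (assumed shape)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(feat)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    score = layers.Dense(1)(x)                        # expected confidence score
    model = models.Model(feat, score)
    model.compile(optimizer="adam", loss="mse")       # MSE against the ground-truth score
    return model


def route_region(expected_score, threshold=0.8):
    """Send a region to the segmentation path when DN is not confident enough,
    otherwise to the cheaper spatial-warping (flow) path."""
    return "segmentation_path" if expected_score < threshold else "spatial_warping_path"
```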

Future-work
• Integrate the backbone framework with MobileNet-V2

References
[1] Yu-Shuan Xu, Tsu-Jui Fu, Hsuan-Kung Yang and Chun-Yi Lee. Dynamic Video Segmentation Network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Instance Segmentation
Instance Segmentation
• Detect instances, give category, label the pixels
• Simultaneous detection and segmentation
• A key part of Scene Understanding
• Applications
  • Autonomous navigation
  • Assisting the partially sighted
  • Medical diagnosis
  • Image editing

Instance Segmentation on Cityscapes dataset
• Complexity
  • 30 classes
• Diversity
  • 50 cities, several months (spring, summer, fall)
  • Daytime, good/medium weather conditions
  • Manually selected frames
  • Large number of dynamic objects
  • Varying scene layout
  • Varying background
• Volume
  • 5 000 annotated images with fine annotations
  • 20 000 annotated images with coarse annotations

Architecture
1. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
➢ We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input.
➢ RoIAlign improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics.
➢ We use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
2. Adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
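The bilinear-sampling idea behind RoIAlign can be sketched in a few lines of NumPy. The snippet below samples four regularly spaced points per output bin and aggregates them; it handles a single-channel feature map and one RoI for clarity and is not the optimized implementation used in Mask R-CNN.

```python
# RoIAlign sketch: bilinear interpolation at four sampled points per bin.
import numpy as np


def bilinear(feat, y, x):
    """Bilinearly interpolate feat (H, W) at a real-valued location (y, x)."""
    h, w = feat.shape
    y0 = min(max(int(np.floor(y)), 0), h - 1)
    x0 = min(max(int(np.floor(x)), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)


def roi_align(feat, roi, out_size=7, agg=np.mean):
    """roi = (y1, x1, y2, x2) in feature-map coordinates (floats, no quantization)."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Four regularly sampled points inside bin (i, j), aggregated by mean or max.
            samples = [bilinear(feat,
                                y1 + (i + sy) * bin_h,
                                x1 + (j + sx) * bin_w)
                       for sy in (0.25, 0.75) for sx in (0.25, 0.75)]
            out[i, j] = agg(samples)
    return out
```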

Faster R-CNN vs. Mask R-CNN
• Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
• Loss calculation: the multi-task loss on each sampled RoI is L = L_cls + L_box + L_mask (classification, bounding-box regression and mask losses).

Results:
• MXNet GPU version for training and inference on the Cityscapes dataset

Motion segmentation
Motivation
• Predict both the object label and the motion status of each pixel in an image.
• Given a pair of consecutive images, the network learns to fuse features from self-generated optical flow maps and semantic segmentation kernels to yield pixel-wise semantic motion labels.

Approach
• The framework consists of 3 major sections:
  • A section that learns motion features from generated optical flow maps
  • A parallel section that generates features for semantic segmentation
  • A fusion section that combines both the motion and semantic features and further learns deep representations for pixel-wise semantic motion segmentation

Architecture
• Motion Feature Learning
  • This stream generates features that represent motion-specific information. Successive frames are first passed through a section of this stream that generates high-quality optical flow maps. FlowNet2 is used for this purpose.
• Semantic Feature Learning
  • This stream takes an input image x and generates semantic features. The architecture follows the design of a contractive segment that aggregates semantic information while decreasing the spatial dimensions of the feature maps, and an expansive segment that upsamples the feature maps back to the full input resolution.
• Semantic Motion Fusion
  • This stream takes the feature tensors from Motion Feature Learning and Semantic Feature Learning, concatenates them, and learns further deep representations through a series of additional layers. Finally, towards the end of this stream, deconvolution is used to upsample the low-resolution feature maps from 2048×24×48 back to the input resolution. This upsampled output has joint labels corresponding to a semantic class and a motion status: static or moving. (A minimal sketch of this fusion stream appears after the results below.)

Dataset:
• City-KITTI-Motion dataset: 3734 training images and 1100 for validation.
• The KITTI-Motion images have a wider resolution of 1280×384; the Cityscapes-Motion images have a resolution of 768×384.

Results:
• Caffe GPU version for training and inference on the City-KITTI-Motion dataset.
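A hedged Keras sketch of the Semantic Motion Fusion stream follows: the two feature tensors are concatenated, refined by extra convolutions, and upsampled by transposed convolutions from 24×48 back to the 768×384 input resolution. The channel counts, number of layers and the output split into semantic and motion labels are assumptions for illustration, not the exact trained network.

```python
# Fusion stream sketch: concatenate motion and semantic features, refine,
# then deconvolve from 24x48 back up to the 384x768 input resolution.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_fusion_stream(motion_shape=(24, 48, 1024), semantic_shape=(24, 48, 1024),
                        num_classes=12, num_motion=2):
    motion_feat = layers.Input(shape=motion_shape)
    semantic_feat = layers.Input(shape=semantic_shape)

    # Concatenate the two streams (~2048 channels) and learn joint representations.
    x = layers.Concatenate()([motion_feat, semantic_feat])
    for filters in (1024, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Deconvolution (transposed convolution) to upsample the low-resolution maps:
    # 24x48 -> 48x96 -> 96x192 -> 192x384 -> 384x768.
    for filters in (256, 128, 64, 32):
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same",
                                   activation="relu")(x)

    # Joint prediction: a semantic class together with a motion status (static/moving).
    logits = layers.Conv2D(num_classes * num_motion, 1)(x)
    return models.Model([motion_feat, semantic_feat], logits)
```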
3D Object detection
Motivation
• The use of deep learning approaches for real-time object detection from sparse LIDAR point clouds has not been fully explored.
• High levels of sparsity in the data make it difficult to interpret object structure.
• Our goal is to develop a real-time object detection system for highly sparse 3D point clouds using a lightweight CNN called PointPillarNet.

LIDAR Datasets:
➢ KITTI dataset
➢ Nuscenes/Nutonomy dataset
➢ Apolloscape dataset

Approach
• Our focus is to address object detection in point clouds collected from LIDAR scans with sparse vertical density.
• Classes of interest: car, pedestrian, cyclist and ground.
• Sensor: Velodyne VLP-64.
• Gathering training data is a labour-intensive task. Thus, we leverage the KITTI, Nuscenes and Apolloscape datasets, which have labels for car, pedestrian and cyclist.

Architecture
• The raw point cloud is converted to a stacked pillar tensor and a pillar index tensor.
• The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-image for a convolutional neural network (see the scatter sketch at the end of this section).
• The features from the backbone are used by the detection head to predict 3D bounding boxes for objects.

Results:
➢ Average forward processing time for pedestrian/cyclist was around 19 ms.
➢ Average forward processing time for vehicle/car was around 164 ms.
➢ mAP for pedestrian/bicyclist was around 59.07% on the KITTI dataset.
➢ mAP for vehicle/car was around 74% on the KITTI dataset.

Conclusion
➢ We proposed methods to perform fast and accurate object detection using highly sparse LIDAR point clouds for car, pedestrian and bicyclist instances using deep learning.
➢ In general, we achieve high classification performance for cars, pedestrians and bicyclists.
➢ In all of our results, we achieve high recall scores.
➢ Finally, there is room for improvement concerning the results using Point Graph CNN.

Future-work
• Improving the scope of results for classification for pedestrians, cars and cyclists.
• Incorporate Point Graph CNN to handle non-grid-structured point clouds.
• Explore the capabilities of the network in unstructured environments.

References
[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: the KITTI dataset. 2013.
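The "scatter back to a 2D pseudo-image" step described in the architecture can be illustrated with plain NumPy. The grid size and channel count below are assumptions for illustration, not the exact PointPillars configuration.

```python
# Scatter encoded pillar features back onto a bird's-eye-view canvas using the
# pillar index tensor, producing a pseudo-image for a standard 2D CNN backbone.
import numpy as np


def scatter_pillars(pillar_features, pillar_indices, grid_h, grid_w):
    """pillar_features: (P, C) one feature vector per non-empty pillar.
    pillar_indices:  (P, 2) integer (row, col) BEV grid cell of each pillar.
    Returns a (C, grid_h, grid_w) pseudo-image (empty cells stay zero)."""
    num_pillars, channels = pillar_features.shape
    canvas = np.zeros((channels, grid_h, grid_w), dtype=pillar_features.dtype)
    rows, cols = pillar_indices[:, 0], pillar_indices[:, 1]
    canvas[:, rows, cols] = pillar_features.T
    return canvas


# Example: 100 non-empty pillars with 64-channel features on a 496x432 BEV grid.
features = np.random.randn(100, 64).astype(np.float32)
indices = np.stack([np.random.randint(0, 496, 100),
                    np.random.randint(0, 432, 100)], axis=1)
pseudo_image = scatter_pillars(features, indices, 496, 432)   # shape (64, 496, 432)
```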
Annotation Tool—AI based tracker
Algo            | Real-time | Smoothing        | Drift | Performance
MIL (V1.0)      | Yes       | Window averaging | Yes   | Not good, can track only a few frames
KCF/CSK (V2.0)  | Yes       | Window averaging | Yes   | Just OK, can track only a few frames and then drifts
Boosting (V2.1) | Yes       | Window averaging | Yes   | Just OK, can track only a few frames and then drifts
Re3 (V2.2)      | Yes       | None             | No    | Good, can track more than 500 frames continuously; no drift occurs as long as the frames do not change abruptly

Network Topology (Re3):

Network Structure (Re3):
a) Image crop pairs are fed in at each timestep. Both crops are centered around the object's location in the previous frame and padded to two times the width and height of the object.
b) Before every pooling stage, we add a skip layer to preserve high-resolution spatial information.
c) The weights of the two image streams are shared.
d) The output from the convolutional layers feeds into a single fully connected layer and an LSTM.
e) The network predicts the top-left and bottom-right corners of the new bounding box.
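Step (a), cropping around the previous box with 2x padding, can be sketched as below; boundary handling is simplified and the function is an illustration rather than the Re3 code.

```python
# Crop an image around the previous bounding box, padded to twice the object's
# width and height. The same window is applied to both the previous and the
# current frame to form the crop pair fed to the network.
import numpy as np


def crop_around_box(image, box):
    """image: (H, W, 3); box = (x1, y1, x2, y2) from the previous frame.
    Returns a crop of roughly (2 * box_h, 2 * box_w) centred on the box."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    bw, bh = (x2 - x1), (y2 - y1)
    half_w, half_h = bw, bh              # 2x padding -> half-size equals the box size
    h, w = image.shape[:2]
    xa = int(max(0, cx - half_w))
    xb = int(min(w, cx + half_w))
    ya = int(max(0, cy - half_h))
    yb = int(min(h, cy + half_h))
    return image[ya:yb, xa:xb]
```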

Tracker performance

Key Findings
➢ Tracks the small traffic light robustly, even for more than 1000 frames.
➢ Annotation becomes much easier and about 5 times more efficient compared to versions V1.0/V2.0.

References
Gordon, D., Farhadi, A., Fox, D.: Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3(2), 788-795 (2018). https://arxiv.org/abs/1705.06368

Various real-time trackers evaluated on the ImageNet Video test set. Area under the curve (AUC) is shown for each method.
Conditional Imitation Learning
Background :
➢ The basic idea behind imitation learning is to train a controller that mimics an expert.
➢ The controller receives an observation o_t from the environment and a command c_t. It produces an action a_t that affects the environment, advancing to the next time step.
➢ The training data is a set of observation-action pairs D = {<o_i, a_i>} generated by the expert. The assumption is that the expert is successful at performing the task of interest and that a controller trained to mimic the expert will also perform the task well.
➢ This is a supervised learning problem, in which the parameters θ of a function approximator F(o; θ) must be optimized to fit the mapping of observations to actions:

    minimize over θ:  Σ_i ℓ(F(o_i; θ), a_i)

Drawback :
➢ The assumption behind this formulation is that the expert's actions are fully explained by the observations, i.e. there exists a function E that maps observations to the expert's actions.
➢ If this assumption holds, a sufficiently expressive approximator will be able to fit the function given enough data.
➢ However, in more complex scenarios the assumption that the mapping of observations to actions is a function breaks down.

CIL
➢ To address this, we begin by explicitly modeling the expert's internal state by a vector h, which together with the observation explains the expert's action: a_i = E(o_i, h_i). h can include information about the expert's intentions, goals and prior knowledge.
➢ We expose the latent state h to the controller by introducing an additional command input: c = c(h). The conditional imitation learning objective can then be rewritten as:

    minimize over θ:  Σ_i ℓ(F(o_i, c_i; θ), a_i)

➢ At training time, the command c is provided by the expert. At test time, commands can be used to affect the behavior of the controller; they can come from a human user or a planning module, e.g. "turn right at the next intersection".
➢ The training dataset becomes D = {<o_i, c_i, a_i>}.
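To make the command-conditioned controller F(o, c; θ) concrete, here is a hedged Keras sketch: an image observation is encoded by a small CNN, the command c is concatenated with the image features, and the network regresses the action. The encoder, layer sizes, the four-way command encoding and the two-dimensional action are illustrative assumptions, not the exact architecture from the CIL paper.

```python
# Command-conditioned controller sketch: F(o, c; theta) trained to fit
# (o_i, c_i) -> a_i on the dataset D = {<o_i, c_i, a_i>}.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_conditional_controller(obs_shape=(88, 200, 3), num_commands=4, action_dim=2):
    obs = layers.Input(shape=obs_shape, name="observation")
    cmd = layers.Input(shape=(num_commands,), name="command")   # one-hot command c_i

    # Observation encoder (assumed small CNN).
    x = obs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)

    # Expose the command to the controller alongside the observation features.
    c = layers.Dense(64, activation="relu")(cmd)
    j = layers.Concatenate()([x, c])
    j = layers.Dense(256, activation="relu")(j)
    action = layers.Dense(action_dim, name="action")(j)         # a_i (e.g. steering, throttle)

    model = models.Model([obs, cmd], action)
    model.compile(optimizer="adam", loss="mse")
    return model
```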
Traffic light detection

Motivation
• Real-time traffic light detection using the TensorFlow object detection module.

Trial 1
• Dataset
  • Gathered open-source datasets from Berkeley, Apolloscape, Lisa, Bosch and GWM.
  • Developed a parser to parse all datasets given in json/csv format (a tfrecord conversion sketch appears at the end of this section).
  • Since Apolloscape works on a small portion of the ROI in an entire image, we dynamically select the ROI based on the dimensions of the traffic light.
  • Generated tfrecords.
  • Trained on the dataset mixtures from Berkeley, Lisa and Apolloscape.
• Models used: FPN ResNet-50, FasterRCNN-ResNet101.
• Key findings:
  • Traffic lights did not get detected => traffic lights were not incorporated in the dataset.

Trial 2
• Gathered GWM data using the bag-tracker based on the Re3 tracker.
• Datasets generated: around 8000+ images.
• Models used: FPN ResNet-50, FasterRCNN-ResNet101, SSD_Mobilenetv2.
• Key findings:
  • Traffic light detection accuracy not good using FPN ResNet-50.
  • Traffic light detection accuracy not good using SSD_Mobilenetv2.
  • Traffic light detection accuracy good using FasterRCNN-ResNet101.
  • Execution time shot up with FasterRCNN-ResNet101 if the resolution was not 300*300.
  • Detection results were much better than the HavalUS results; the reason is that we obtained more annotations using the Re3 tracker in a short span of time.

Trial 3: solution to overcome the shoot-up time
• Key findings:
  • No more shoot-up in execution time. Our data always has the same width/height ratio.

Trial 4
• Model: FasterRCNN_50 (faster_rcnn_resnet50_coco.config), fixed_image_resizer: 350*350
• Dataset: GWM dataset, ~8726 images
• Inference time ~100 ms ==> real time
• Modifications
• Key findings:
  a) Traffic lights detected consistently
  b) Inference time on a single image: fixed, irrespective of the size of the image
  c) No shoot-up in execution time
  d) Inference time around 100/110 ms

Trial 5
• Model: FasterRCNN_50 (faster_rcnn_resnet50_lowproposals_coco_2018_01_28), image_resizer: 350*350
• Dataset: GWM dataset, ~8726 images
• Modifications
• Key findings:
  a) Traffic lights detected, with slightly lower accuracy compared to our custom model
  b) Inference time on a single image: varies with the size of the image
  c) Shoot-up in execution time
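For reference, the tfrecord generation mentioned in Trial 1 can be sketched as below using the TensorFlow Object Detection API's standard Example fields. The annotation parser, the exact label map and the label id are assumptions and are not shown here.

```python
# Convert one annotated frame into a tf.train.Example using the standard
# TF Object Detection API feature keys (normalized bounding boxes).
import tensorflow as tf


def build_tf_example(encoded_jpeg, width, height, boxes, label_id=10,
                     label_text=b"traffic light"):
    """boxes: list of (xmin, ymin, xmax, ymax) in absolute pixel coordinates.
    label_id is whatever the label map assigns to the traffic light class (assumed)."""
    xmins = [b[0] / width for b in boxes]     # normalized coordinates
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]

    feature = {
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
        "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        "image/object/class/text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[label_text] * len(boxes))),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label_id] * len(boxes))),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Writing examples to a tfrecord file:
# with tf.io.TFRecordWriter("traffic_lights.tfrecord") as writer:
#     writer.write(build_tf_example(...).SerializeToString())
```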