Semantic segmentation
Labelling every pixel in an image
A key part of Scene Understanding
Applications
Autonomous navigation
Assisting the partially sighted
Medical diagnosis
Image editing
Architecture
Deep Network is constructed by modifying ResNet-101
Fully connected layers of ResNet-101 are converted to convolutional layers
Model weights of ImageNet-pretrained ResNet-101 network are finetuned
Convolutional filters in the layers that follow pooling are modified to atrous spatial pyramid pooling
ASPP consists of one 1x1 convolution and three 3x3 convolutions with (6,12,18) rates, all with 256 filters and batch normalization, plus image level features (global average pooling)
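To make the ASPP rates above concrete, the following minimal sketch computes how many input pixels each atrous branch spans along one axis, and the padding that would keep the spatial size unchanged at stride 1. The function names are illustrative and not taken from the poster; only the kernel sizes and rates come from the text.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span (in input pixels) covered by one atrous kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def same_padding(kernel_size: int, rate: int) -> int:
    """Padding per side that keeps the spatial size at stride 1 (odd kernels)."""
    return (effective_kernel_size(kernel_size, rate) - 1) // 2


# The four convolutional ASPP branches listed above: (kernel size, atrous rate).
ASPP_BRANCHES = [(1, 1), (3, 6), (3, 12), (3, 18)]

if __name__ == "__main__":
    for k, r in ASPP_BRANCHES:
        print(f"{k}x{k} conv, rate {r}: covers {effective_kernel_size(k, r)} px, "
              f"padding {same_padding(k, r)}")
```

With these rates, the three 3x3 branches see windows of 13, 25 and 37 pixels while keeping the same 9 weights per filter, which is what lets ASPP mix several scales of context cheaply.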
Approach
The DVSNet framework consists of 3 major steps.
The first step in the DVSNet framework is dividing the input frames into 4 frame regions.
In step 2, DN analyzes the frame region pairs between consecutive frames, and evaluates the expected confidence scores for the 4 regions separately. DN compares the expected confidence score of each region against a predetermined threshold. If the expected confidence score of a region is lower than the threshold, the corresponding region is sent to a segmentation path. Otherwise, it is forwarded to a spatial warping path, which includes the flow network.
Based on the decisions of DN, in step 3, frame regions are forwarded to different paths to generate their regional semantic segmentations. For the spatial warping path, a special warping function W is employed to process the output of the flow network with the segmentation Sk from the same region of the key frame to generate a new segmentation Oc for that region.
➢ The decision network (DN) is a lightweight CNN that consists of only a single convolutional layer and 3 fully connected layers.
➢ DN takes as input the feature maps from one of the intermediate layers of the flow network, and is trained to perform regression.
➢ In the training phase, the goal of DN is to learn to predict an expected confidence score for a frame region as close to the ground truth confidence score as possible. The predicted expected confidence score is compared with the ground truth confidence score to calculate a mean squared error (MSE) loss.
➢ In the inference phase, the ground truth confidence score is accessible to neither DN nor the flow network. The feature maps fed into DN are allowed to come from any of the layers of the flow network. These feature maps represent the spatial transfer information between a key frame region and its corresponding current frame region.
Semantic Segmentation on Cityscape dataset
• Complexity
  • 30 classes
• Diversity
  • 50 cities, several months (spring, summer, fall)
  • Daytime, good/medium weather conditions
  • Manually selected frames
  • Large number of dynamic objects
  • Varying scene layout
  • Varying background
• Volume
  • 5 000 annotated images with fine annotations
  • 20 000 annotated images with coarse annotations
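The region routing in steps 2 and 3 of the Approach above, together with the MSE loss used to train DN, can be sketched as follows. This is a minimal illustration under stated assumptions, not the DVSNet implementation: the confidence scores, the threshold value and the path names are hypothetical, and the decision network itself is replaced by a plain list of scores.

```python
from typing import Dict, List


def mse_loss(predicted: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and ground truth confidence scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def route_regions(confidences: List[float], threshold: float) -> Dict[int, str]:
    """Map each region index to the path it is forwarded to.

    A score below the threshold sends the region to the segmentation path;
    otherwise the region goes to the spatial warping path.
    """
    return {
        idx: "segmentation" if score < threshold else "spatial_warping"
        for idx, score in enumerate(confidences)
    }


if __name__ == "__main__":
    # Hypothetical expected confidence scores for the 4 frame regions.
    scores = [0.95, 0.62, 0.88, 0.40]
    print(route_regions(scores, threshold=0.8))
    # Training-time loss against hypothetical ground truth scores.
    print(mse_loss(scores, [0.90, 0.70, 0.85, 0.50]))
```

Raising the threshold sends more regions through the slower segmentation path, while lowering it reuses more warped key-frame results, so the threshold acts as the accuracy/speed knob.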
Future-work
Integrate backbone framework with MobileNet-V2
References
[1] Yu-Shuan Xu, Tsu-Jui Fu, Hsuan-Kung Yang, and Chun-Yi Lee. Dynamic Video Segmentation Network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Instance Segmentation
Detect instances, give each a category, label the pixels
Simultaneous detection and segmentation
A key part of Scene Understanding
Applications
Autonomous navigation
Assisting the partially sighted
Medical diagnosis
Image editing
Architecture
1. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
➢ We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input.
➢ RoIAlign improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics.
➢ We use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).
2. Adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression.
Instance Segmentation on Cityscape dataset
• Complexity
  • 30 classes
• Diversity
  • 50 cities, several months (spring, summer, fall)
  • Daytime, good/medium weather conditions
  • Manually selected frames
  • Large number of dynamic objects
  • Varying scene layout
  • Varying background
• Volume
  • 5 000 annotated images with fine annotations
  • 20 000 annotated images with coarse annotations
Results :
mxnet gpu version for training and inference on cityscape dataset
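The RoIAlign sampling described under Architecture above (bilinear interpolation at four regularly spaced points per bin, then max or average) can be sketched in plain Python. This is a simplified single-channel illustration, not the Mask R-CNN code: coordinates are assumed to be in feature-map index units, the 2x2 sample layout is one way to place four regular samples, and all names are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def bilinear(feature: Grid, y: float, x: float) -> float:
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    height, width = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_bin(feature: Grid, y_start: float, x_start: float,
                  bin_h: float, bin_w: float, mode: str = "avg") -> float:
    """Pool one RoI bin from four regularly spaced sample points (a 2x2 grid)."""
    samples = [
        bilinear(feature,
                 y_start + (iy + 0.5) * bin_h / 2,
                 x_start + (ix + 0.5) * bin_w / 2)
        for iy in range(2)
        for ix in range(2)
    ]
    return max(samples) if mode == "max" else sum(samples) / len(samples)
```

The bin origin and size stay as floats throughout, so nothing is rounded to the feature grid; that is the difference from RoIPool that the text calls quantization-free.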
Motion segmentation
Motivation
Predict both the object label and motion status of each pixel in an image.
Given a pair of consecutive images, the network learns to fuse features from self-generated optical flow maps and semantic segmentation kernels to yield pixel-wise semantic motion labels.
Approach
The framework consists of 3 major steps.
A section that learns motion features from generated optical flow maps
A parallel section that generates features for semantic segmentation
Fusion section that combines both the motion and semantic features and further learns deep representations for pixel-wise semantic motion segmentation.
Architecture
Motion Feature Learning
● This stream generates features that represent motion specific information. Successive frames are first passed through a section of this stream that generates high quality optical flow maps. FlowNet2 is used for this purpose.
Semantic Feature Learning
● This stream takes an input image and generates semantic features. The architecture follows the design of a contractive segment that aggregates semantic information while decreasing the spatial dimensions of the feature maps and an expansive segment that upsamples the feature maps back to the full input resolution.
Semantic Motion Fusion
● This stream takes the feature tensors from Motion Feature and Semantic Feature, concatenates them, and learns further deep representations through a series of additional layers. Finally, towards the end of this stream, we use deconvolution for upsampling the low-resolution feature maps from 2048×24×48 back to the input resolution. This upsampled output has joint labels corresponding to a semantic class and a motion status: static or moving.
Dataset :
City-KITTI-Motion dataset : 3734 training images and 1100 for validation.
The KITTI-Motion dataset has a wider resolution of 1280×384.
The Cityscape-motion dataset has a resolution of 768×384.
Results :
caffe gpu version for training and inference on City-KITTI-Motion dataset.
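As a small worked example of the fusion stream's output, the sketch below checks the upsampling factor implied by the numbers above and shows one possible way to pack a semantic class and a motion status into a single label. It assumes that 2048×24×48 means channels × height × width and that 768×384 means width × height; the joint label encoding is purely hypothetical, since the text does not say how the labels are stored.

```python
from typing import Tuple


def upsample_factor(input_hw: Tuple[int, int], feature_hw: Tuple[int, int]) -> int:
    """Integer factor needed to bring a feature map back to the input size."""
    fy, ry = divmod(input_hw[0], feature_hw[0])
    fx, rx = divmod(input_hw[1], feature_hw[1])
    if ry or rx or fy != fx:
        raise ValueError("input size is not a uniform integer multiple of the feature map")
    return fy


def encode_joint_label(semantic_class: int, moving: bool) -> int:
    """Pack (semantic class, motion status) into one label id (hypothetical scheme)."""
    return 2 * semantic_class + int(moving)


def decode_joint_label(label: int) -> Tuple[int, bool]:
    """Inverse of encode_joint_label."""
    return label // 2, bool(label % 2)


if __name__ == "__main__":
    # 384 x 768 input (height x width) against a 24 x 48 feature map.
    print(upsample_factor((384, 768), (24, 48)))
    print(decode_joint_label(encode_joint_label(7, True)))
```

Under these assumptions the deconvolution has to recover a factor of 16 in each spatial dimension.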
3D Object detection
Motivation
The use of Deep Learning approaches for real-time object detection from sparse LIDAR point clouds has not been fully explored.
High levels of sparsity in the data make it difficult to interpret object structure.
Our goal is to develop a real-time object detection system for highly sparse 3D point clouds using a lightweight CNN called PointPillarNet.
Approach
Our focus is to address object detection in point clouds collected from LIDAR scans with sparse vertical density.
Classes of interest: car, pedestrian, cyclist and ground.
Sensor: Velodyne VLP-64
Gathering training data is a labour intensive task. Thus, we leverage the KITTI, Nuscenes and Apolloscape datasets, which have labels for car, pedestrian and cyclist.
LIDAR Datasets:
➢ KITTI dataset
➢ Nuscenes/Nutonomy dataset
➢ Apolloscape dataset
Architecture
The raw point cloud is converted to a stacked pillar tensor and pillar index tensor.
The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-image for a convolutional neural network.
The features from the backbone are used by the detection head to predict 3D bounding boxes for objects.
Results:
➢ Average forward processing time for pedestrian/cyclist was around 19 ms
➢ Average forward processing time for vehicle/car was around 164 ms
➢ mAP for pedestrian/bicyclist was around 59.07% on the KITTI dataset
➢ mAP for vehicle/car was around 74% on the KITTI dataset
Conclusion
➢ We proposed methods to perform fast and accurate object detection using highly sparse LIDAR point clouds for the instances car, pedestrian and bicyclist using Deep Learning.
➢ In general, we achieve high classification performance for cars, pedestrians and bicyclists.
➢ In all of our results, we achieve high recall scores.
➢ Finally, there is room for improvement concerning the results using Point Graph CNN.
Future-work
Improving the scope of results for classification for pedestrians, cars and cyclists.
Incorporate Point Graph CNN to handle non-grid structured point clouds.
Explore the capabilities of the network in unstructured environments.
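The pillar conversion and scatter steps described under Architecture above can be sketched as follows. This is a toy, dictionary-based illustration under stated assumptions rather than the network's tensor code: the grid cell size, the grid origin and the stand-in "feature" (the highest z value in each pillar) are hypothetical, and a real encoder would learn the per-pillar features.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in metres
PillarIndex = Tuple[int, int]        # (row, column) in the x-y grid


def build_pillars(points: List[Point], cell: float,
                  x_min: float, y_min: float) -> Dict[PillarIndex, List[Point]]:
    """Group points into vertical pillars on a regular x-y grid."""
    pillars: Dict[PillarIndex, List[Point]] = defaultdict(list)
    for x, y, z in points:
        col = int((x - x_min) // cell)
        row = int((y - y_min) // cell)
        if row >= 0 and col >= 0:       # drop points before the grid origin
            pillars[(row, col)].append((x, y, z))
    return dict(pillars)


def scatter_to_pseudo_image(features: Dict[PillarIndex, float],
                            rows: int, cols: int) -> List[List[float]]:
    """Place one feature per pillar back at its grid cell; empty cells stay 0."""
    image = [[0.0] * cols for _ in range(rows)]
    for (row, col), value in features.items():
        if row < rows and col < cols:
            image[row][col] = value
    return image


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.5), (0.2, 0.15, 1.2), (1.7, 0.3, 0.4)]
    pillars = build_pillars(cloud, cell=0.5, x_min=0.0, y_min=0.0)
    # Stand-in for the learned encoder: the highest z value in each pillar.
    features = {idx: max(p[2] for p in pts) for idx, pts in pillars.items()}
    for line in scatter_to_pseudo_image(features, rows=2, cols=4):
        print(line)
```

Only non-empty pillars are stored, which is why the representation suits sparse scans: the dense 2D grid is materialised only at the scatter step, right before the convolutional backbone.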
References
[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. 2013
Annotation Tool: AI based tracker
Algo             Real-time  Smoothing         Drift  Performance
MIL (V1.0)       Yes        Window averaging  Yes    Not good, can track only few frames
KCF/CSK (V2.0)   Yes        Window averaging  Yes    Just OK, can track only few frames and then drifts
Boosting (V2.1)  Yes        Window averaging  Yes    Just OK, can track only few frames and then drifts
Network Topology(Re3) :
Network Structure(Re3) :
a) Image crop pairs are fed in at each timestep. Both crops are centered around the object’s location in the previous frame, and padded to two times the width and height of the object.
b) Before every pooling stage, we add a skip layer to preserve high-resolution spatial information.
c) The weights from the two image streams are shared.
d) The output from the convolutional layers feeds into a single fully connected layer and an LSTM.
e) The network predicts the top left and bottom right corners of the new bounding box.
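Point a) above can be illustrated with a short sketch that computes the crop region from the previous bounding box. It is a minimal example under assumptions, not the Re3 code: boxes are taken to be (x1, y1, x2, y2) pixel corners, and clipping the crop to the image bounds is an assumed choice for handling objects near the border.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top left and bottom right corners


def padded_crop(prev_box: Box, image_w: int, image_h: int, scale: float = 2.0) -> Box:
    """Crop region centered on the previous box and `scale` times its width and height."""
    x1, y1, x2, y2 = prev_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = scale * (x2 - x1) / 2.0, scale * (y2 - y1) / 2.0
    # Clipping to the image is an assumption here; padding the crop is another option.
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(image_w), cx + half_w), min(float(image_h), cy + half_h))


if __name__ == "__main__":
    # A 20 x 40 box centered at (50, 60) inside a 200 x 200 image.
    print(padded_crop((40, 40, 60, 80), image_w=200, image_h=200))
```

The same region would be cut from both the previous and the current frame, so the network sees where the object was and the context into which it may have moved.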
Tracker performance
Various real-time trackers evaluated on the ImageNet Video test set. Area under the curve (AUC) is shown for each method.
Key Findings
➢ Robustly tracks a small traffic light for more than 1000 frames
➢ Annotation becomes much easier and 5 times more efficient compared with versions V1.0 and V2.0
References
Gordon, D., Farhadi, A., Fox, D.: Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3(2), 788–795 (2018)
https://arxiv.org/abs/1705.06368
Conditional Imitation Learning
Background :
➢ The basic idea behind imitation learning is to train a controller that mimics an expert.
➢ The controller receives an observation o_t from the environment and a command c_t. It produces an action a_t that affects the environment, advancing to the next time step.
➢ The training data is a set of observation-action pairs D = {<o_i,a_i>} generated by the expert. The assumption is that the expert is successful at performing the task of interest and that a controller trained to mimic the expert will also perform the task well.
Drawback :
➢ The assumption behind this formulation is that the expert’s actions are fully explained by the observations: there exists a function E that maps observations to the expert’s actions.
➢ If this assumption holds, a sufficiently expressive approximator will be able to fit the function given enough data.
➢ However, in more complex scenarios the assumption that the mapping of observations to actions is a function breaks down.
CIL
➢ To address this, we begin by explicitly modeling the expert’s internal state by a vector h, which together with the observation explains the expert’s action: a_i = E(o_i,h_i).
➢ h can include information about the expert’s intentions, goals and prior knowledge. The conditional imitation learning objective can be rewritten as:
➢ We expose the latent state h to the controller by introducing an additional command input: c = c(h).
➢ At training time, the command c is provided by the expert. At test time, commands can be used to affect the behavior of the controller; they can come from a human user or a planning module, for example “turn right at the next intersection”.
➢ The training dataset becomes D = {<o_i, c_i, a_i>}
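The role of the command input can be sketched with a toy controller trained on triples <o_i, c_i, a_i>. This is an illustrative sketch only: the branch-per-command controller, the mean squared error loss, the command names and the scalar steering actions are all assumptions made for the example and are not stated in the text above.

```python
from typing import Callable, Dict, List, Tuple

Observation = List[float]
Sample = Tuple[Observation, str, float]   # (o_i, c_i, a_i)


def make_controller(branches: Dict[str, Callable[[Observation], float]]
                    ) -> Callable[[Observation, str], float]:
    """Build a controller F(o, c) that selects a sub-policy by command."""
    def controller(observation: Observation, command: str) -> float:
        return branches[command](observation)
    return controller


def imitation_loss(controller: Callable[[Observation, str], float],
                   dataset: List[Sample]) -> float:
    """Mean squared difference between controller actions and expert actions."""
    errors = [(controller(o, c) - a) ** 2 for o, c, a in dataset]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical steering policies: the same observation maps to different actions.
    controller = make_controller({
        "turn_left": lambda o: -0.5,
        "turn_right": lambda o: 0.5,
        "go_straight": lambda o: 0.0,
    })
    data: List[Sample] = [
        ([0.1, 0.2], "turn_left", -0.5),
        ([0.1, 0.2], "turn_right", 0.5),   # same observation, different expert action
    ]
    print(imitation_loss(controller, data))
```

The two samples share one observation but carry different expert actions, so no function of the observation alone could fit both; once the command is an input, the mapping is a function again.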
Traffic light detection
Motivation
Real-time traffic light detection using TensorFlow object detection module.
Trial1
Gather open-source datasets from Berkeley, Apolloscape, Lisa, Bosch and GWM
Developed a parser to parse all datasets given in json and csv format
Since apollo works on a small portion of ROI in an entire image, we dynamically select the ROI based on the dimensions of the traffic light
Generated tfrecords
Train on the dataset mixtures from Berkeley, Lisa, Apolloscape
Models used : FPN ResNet-50, FasterRCNN-ResNet101
Key Findings:
● Traffic lights were not detected => traffic lights are not incorporated in the dataset.
Trial2
Gather GWM data using bag-tracker based on Re3 tracker
Datasets generated: around 8000+ images
Models used : FPN ResNet-50, FasterRCNN-ResNet101, SSD_Mobilenetv2
Key Findings:
● Traffic light detection accuracy not good using FPN ResNet-50.
● Traffic light detection accuracy not good using SSD_Mobilenetv2.
● Traffic light detection accuracy good using FasterRCNN-ResNet101.
● Execution time shot up with FasterRCNN-ResNet101 if the resolution was not of size 300*300.
● Detection results were much better than the HavalUS results. Reason: we got more annotations using the Re3 tracker in a short span of time.
Trial4:
Model : FasterRCNN_50 (faster_rcnn_resnet50_coco.config)
fixed_image_resizer: 350*350
Dataset
GWM dataset ~8726 images
Inference time ~100 ms => real time
Modifications
Key findings:
a) Traffic light detected consistently
b) Inference time on a single image : fixed irrespective of the size of the image
c) No shoot up in execution time
d) Inference time around 100/110 ms
Trial5
Model : FasterRCNN_50 (faster_rcnn_resnet50_lowproposals_coco_2018_01_28)
image_resizer: 350*350
Dataset
GWM dataset ~8726 images
Modifications