
Object Detection using Google AI Open Images

Learn to build your own self-driving car!!!… just kidding

Atindra Bandi
Dec 14, 2018 · 9 min read

By Atindra Bandi, Alyson Brown, Sagar Chadha, Amy Dang, Jason Su

When was the last time you logged into your phone using nothing but your face? Or clicked a selfie with some friends and used a Snapchat filter that put fancy dog ears on your face? Did you know that these cool features are enabled by a neural network that not only recognizes that there is a face in the photo but also detects where the ears should go? Your phone, in a sense, can ‘see’ you, and it even knows what you look like!

The technology that helps computers ‘see’ is called “computer vision”.


In recent years, computer vision applications have become increasingly commonplace thanks to an explosion in computing power that has made deep learning models faster and more feasible. Many companies, such as Amazon, Google, Tesla, Facebook, and Microsoft, are investing heavily in this technology and its applications.

Computer Vision Tasks


We focus on two main computer vision tasks: image classification and object detection.

1. Image Classification focuses on grouping an image into a predefined category. To achieve this, we need multiple images of the class of interest and we train a computer to essentially convert pixel numbers into symbols. This is just saying that the computer sees a photo of a cat and says that there is a cat in it.


2. Object detection utilizes an image classifier to figure out what is present in an image and where. These tasks have been made easier through the use of Convolutional Neural Networks (CNNs), which make it possible to detect multiple classes in a single pass over the image.

For more details on the difference between these tasks, please refer to the following article.

Computer vision is cool!


Recognizing that many interesting data science applications in the future will involve working with images, my team of budding data scientists and I decided to try our hands at the Google AI Open Images challenge hosted on Kaggle. We thought of this as the perfect opportunity to get our hands dirty with neural networks and convolutions, and potentially impress our professors and classmates.


This challenge provided us with 1.7 million images containing 12 million bounding box annotations (their X and Y coordinates relative to the image) for 500 object classes. You can find the data here.

We highly recommend Andrew Ng’s Coursera course on Convolutional Neural Networks to anyone who wants to learn more about CNNs.

Getting Our Hands Dirty!


Exploratory Data Analysis: As with all data analyses, we began by exploring what images we had and the types of objects we needed to detect.

Frequency of Classes in the Training Dataset


A quick look at the training images revealed that certain objects appeared far more often than others. The chart above shows the distribution of the top 43 classes. It is clear that there is a huge disparity, and it would need to be resolved somehow. In the interest of time and money (GPU costs are high :( ), we chose the aforementioned 43 object classes and a subset of ~300K images containing these objects. We had about 400 images for each object class in the training data.
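For reference, here is a minimal sketch of the kind of frequency check we ran. It assumes the challenge’s train-annotations-bbox.csv file (one row per bounding box, with a LabelName column) and the class-descriptions-boxable.csv mapping from label codes to readable names; your file names and columns may differ.

import pandas as pd

# One row per bounding box; LabelName holds the machine-readable class code.
boxes = pd.read_csv('train-annotations-bbox.csv')

# Mapping from label codes to human-readable class names (no header row).
names = pd.read_csv('class-descriptions-boxable.csv',
                    header=None, names=['LabelName', 'ClassName'])

# Count boxes per class and keep the 43 most frequent classes.
counts = (boxes.merge(names, on='LabelName')
               .groupby('ClassName').size()
               .sort_values(ascending=False))
top_43 = counts.head(43)
print(top_43)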

Choosing the Object Detection Algorithm


We considered various object detection algorithms, including VGG, Inception, and YOLO, but ultimately chose YOLO because of its speed, its modest computational requirements, and the abundance of online articles that could guide us through the process. Faced with computational and time constraints, we made two key decisions:

1. Use a YOLO v2 model that had already been trained to identify certain objects.

2. Leverage transfer learning to retrain its last convolutional layer to recognize previously unseen objects such as guitar, house, man/woman, bird, etc.

Inputs for YOLO


The YOLO algorithm requires some specific inputs:

1. Input image size: The YOLO network is designed to work with specific input image sizes. We sent in images with a size of 608 * 608.

2. Number of classes: 43. This is required to define the dimensions of YOLO's output.

3. Anchor boxes: The number and dimensions of the anchor boxes to be used.

4. Confidence and IoU thresholds: Thresholds that define which anchor boxes to keep and how to pick between overlapping anchor boxes.

5. Image names with bounding box information: For each image, we need to tell YOLO what is in it, in a specific format as shown below.

Sample input for YOLO
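The sample-input image from the original post is not shown here. Purely as an illustration, a single training example in the sort of annotation structure used by common Keras YOLO v2 implementations looks roughly like this (the field names and values below are ours, not an excerpt from our actual files):

# Illustrative only: one image plus its labelled boxes.
sample_annotation = {
    'filename': 'images/0a1b2c3d4e5f.jpg',   # hypothetical image path
    'width': 1024,
    'height': 768,
    'object': [
        {'name': 'Bus',   'xmin': 120, 'ymin': 80,  'xmax': 560, 'ymax': 420},
        {'name': 'Wheel', 'xmin': 150, 'ymin': 360, 'xmax': 230, 'ymax': 430},
    ],
}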

Below is the code snippet defining the YOLO inputs:


import numpy as np

# The class labels we trained on (the full 43-label list is abridged here)
LABELS = ['Shirt', 'Trousers', 'Swimwear', 'Tie', 'Bus',
          'Sunglasses', 'Jacket', 'Dress', 'Human eye', 'Suit',
          'Human head', 'Human hand', 'Human leg', 'Human nose',
          'Wheel', 'Boat', 'House', 'Bird', 'Guitar', 'Fast food']

# Setting the input image size to 608 X 608
IMAGE_H, IMAGE_W = 608, 608

# We will use 19 X 19 grids for our images, with 5 anchor boxes per grid cell
GRID_H, GRID_W = 19, 19
BOX = 5

# Getting the total number of classes/labels we will be predicting (43 in the full setup)
CLASS = len(LABELS)

# Assigning 1's to all class labels
CLASS_WEIGHTS = np.ones(CLASS, dtype='float32')

# Pr(object) * Pr(class | object) must exceed OBJ_THRESHOLD for a box to be kept
OBJ_THRESHOLD = 0.3    # 0.5

# If overlapping boxes have IoU > NMS_THRESHOLD, non-max suppression keeps only one
NMS_THRESHOLD = 0.3    # 0.45
Inputs into YOLO

YOLO v2 Architecture
The architecture is shown below: it has 23 convolution layers, each with its own batch normalization, Leaky ReLU activation, and max pooling.

Representation of the actual YOLO v2 architecture.

These layers try to extract multiple important features from images so that the various classes can be detected. For the purpose of object detection, the YOLO algorithm divides the input image into a 19*19 grid, with 5 different anchor boxes in each grid cell. It then tries to detect classes within each of these grid cells and assigns a detected object to one of the 5 anchor boxes of that cell. The anchor boxes differ in shape and are intended to capture differently shaped objects within each grid cell.

The YOLO algorithm outputs a matrix (shown below) for each of the defined anchor boxes:


Given that we had to train the algorithm for 43 classes, the output dimensions were:
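The original figure with the exact numbers is not reproduced here, but the arithmetic is straightforward: each anchor box predicts 4 box coordinates, 1 objectness score, and 43 class probabilities, and there are 5 anchor boxes in each of the 19*19 grid cells.

# Output tensor shape per image: 19 x 19 grid cells, 5 anchor boxes per cell,
# and (4 coordinates + 1 objectness score + 43 class probabilities) per box.
GRID_H, GRID_W, BOX, CLASS = 19, 19, 5, 43
values_per_box = 4 + 1 + CLASS                  # 48
output_shape = (GRID_H, GRID_W, BOX, values_per_box)
print(output_shape)                             # (19, 19, 5, 48) -> 86,640 values in total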


These matrices give us, for each anchor box, the probability that an object is present and the probabilities of which class that object is. To filter out anchor boxes that don't contain any class, or that capture the same object as another box, we use two thresholds: an IoU threshold to filter out anchor boxes capturing the same object, and a confidence threshold to filter out boxes that don't contain any class with high confidence.
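As a rough illustration of how these two thresholds work together, here is a minimal NumPy sketch of confidence filtering followed by greedy non-max suppression. It simplifies what a full YOLO decoder does (no anchor decoding and no per-class handling), and the function names are ours:

import numpy as np

def iou(box_a, box_b):
    # Boxes are (xmin, ymin, xmax, ymax); IoU = intersection area / union area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def filter_boxes(boxes, scores, obj_threshold=0.3, nms_threshold=0.3):
    # boxes: N x 4 array of (xmin, ymin, xmax, ymax); scores: N confidence values.
    # Step 1 - confidence threshold: drop boxes whose score is too low.
    keep = scores >= obj_threshold
    boxes, scores = boxes[keep], scores[keep]
    # Step 2 - greedy non-max suppression: keep the highest-scoring box and
    # drop any remaining box that overlaps a kept box with IoU above the threshold.
    order = np.argsort(scores)[::-1]
    selected = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in selected):
            selected.append(i)
    return boxes[selected], scores[selected]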

Below is an illustration of the last few layers of the YOLO v2 architecture:

# Layer 20
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_20', use_bias=False)(x)
x = BatchNormalization(name='norm_20')(x)
x = LeakyReLU(alpha=0.1)(x)

# Layer 21
skip_connection = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_21', use_bias=False)(skip_connection)
skip_connection = BatchNormalization(name='norm_21')(skip_connection)
skip_connection = LeakyReLU(alpha=0.1)(skip_connection)
skip_connection = Lambda(space_to_depth_x2)(skip_connection)

x = concatenate([skip_connection, x])

# Layer 22
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_22', use_bias=False)(x)
x = BatchNormalization(name='norm_22')(x)
x = LeakyReLU(alpha=0.1)(x)

Last few layers of YOLO v2 architecture (Only for illustration purposes)


Transfer Learning
Transfer learning is the idea of taking a neural network that has already been trained to classify images and using it for our specific purpose. This saves computation time, since we don't need to train a huge number of weights: for instance, the YOLO v2 model we used has about 50 million weights, and training them from scratch would easily have taken 4–5 days on the Google Cloud instance we were using.

To successfully implement transfer learning, we had to make a few updates to our model:

• Input image size: The model that we downloaded used input images of size 416*416. Since some of the objects we were training for were very small (birds, footwear), we didn't want to squish the input image that much, so we used input images of size 608*608.

• Grid size: We changed the grid dimensions so that the image is divided into 19*19 grid cells instead of the 13*13 default of the model we downloaded.

• Output layer: Since we were training on a different number of classes (43, versus the 80 the original model was trained on), the output layer was changed to produce the matrix dimensions discussed above; a short sketch of this change follows.
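For concreteness, this is roughly what the swapped-out detection head looks like in Keras. It follows the pattern used by common YOLO v2 Keras implementations, so treat it as a sketch rather than a verbatim excerpt from our code.

from keras.layers import Conv2D, Reshape
from keras.models import Model

# Continues the earlier snippets: `x` is the final feature map, `input_image`
# is the 608 x 608 input tensor, and BOX / CLASS / GRID_H / GRID_W are as above.
output = Conv2D(BOX * (4 + 1 + CLASS), (1, 1),
                strides=(1, 1), padding='same', name='DetectionLayer')(x)
output = Reshape((GRID_H, GRID_W, BOX, 4 + 1 + CLASS))(output)

model = Model(input_image, output)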

We re-initialized the weights of YOLO’s last convolutional layer so that we could train it on our dataset, which eventually helped us identify our unique classes. Below is the code snippet for this step:


# Taking the last convolutional layer
layer = model.layers[-4]
weights = layer.get_weights()

# Randomly re-initializing the weights and biases of the last layer
new_kernel = np.random.normal(size=weights[0].shape) / (GRID_H * GRID_W)
new_bias = np.random.normal(size=weights[1].shape) / (GRID_H * GRID_W)

# Writing the new random weights back into the layer
layer.set_weights([new_kernel, new_bias])
Re-initializing the last convolution layer of YOLO

Cost Function
In any object detection problem, we want to identify the right object at the right place in an image, with high confidence. There are three major components to the cost function:

1. Classification Loss: the squared error of the class conditional probabilities when an object is detected. The loss function thus penalizes classification error only if an object is present in a grid cell.

2. Localization Loss: the squared error between the predicted bounding box locations and sizes and the ground-truth boxes, computed only for the boxes responsible for detecting an object. To weight the loss from the bounding box coordinate predictions, we use a regularization parameter (λcoord). Further, to make sure that small deviations in larger boxes matter less than in smaller boxes, the algorithm uses the square root of the bounding box width and height.


3. Confidence Loss: the squared error of the bounding box’s confidence score. Most of the boxes are not responsible for detecting an object, so the equation is split into two parts: one for the boxes that detect an object and one for the rest. A regularization term λnoobj (default: 0.5) is applied to the latter part to weigh down the boxes not detecting an object.

Please feel free to refer to the original YOLO paper for a detailed look at the cost function.
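For reference, the loss from the original YOLO paper (which YOLO v2 largely retains) combines the three parts above roughly as follows, where $\mathbb{1}_{ij}^{obj}$ indicates that box $j$ in grid cell $i$ is responsible for an object and $S^2$ is the number of grid cells:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$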

The beauty of YOLO is that it uses errors that are easy to optimize with standard optimizers such as Stochastic Gradient Descent (SGD), SGD with momentum, or Adam. The code snippet below shows the parameters we used for optimizing the cost function.

# Optimization functions we tried
optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
# optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
# optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

# Compiling the model with YOLO's custom loss function
model.compile(loss=custom_loss, optimizer=optimizer)

# Training on batches produced by our data generator
model.fit_generator(generator=train_batch,
                    steps_per_epoch=len(train_batch),
                    epochs=100,
                    verbose=1)
Training algorithm for YOLO (Adam optimizer)


Output Accuracy: mean Average Precision (mAP Score)

There are many metrics for evaluating object detection models; for our project we decided to use the mAP score, which is the average of the maximum precision at different recall values, averaged over all IoU thresholds. In order to understand mAP, we'll do a quick review of precision, recall, and IoU (intersection over union).

Precision & Recall


Precision measures the percentage of positive predictions that are correct. Recall is the proportion of actual positives that the model correctly identifies. These two values are inversely related and also depend on the score threshold you set for the model (in our case, the confidence score). Their mathematical definitions are presented below:
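The formula image from the original post is not reproduced here; the standard definitions, in terms of true positives (TP), false positives (FP), and false negatives (FN), are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
$$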


Intersection over Union (IoU)

IoU measures how much overlap there is between two regions: the area of their intersection divided by the area of their union. This tells you how well the predictions from your object detector match the ground truth (the true object boundary). To summarize, the mAP score is the mean AP over all IoU thresholds.
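To make the ‘average of the maximum precision at different recall values’ idea concrete, here is a small sketch of how average precision can be computed for one class from its precision-recall curve (a standard area-under-the-curve formulation; our actual scores came from the competition's evaluation, so treat this as illustrative):

import numpy as np

def average_precision(recall, precision):
    # recall and precision are arrays for one class, sorted by increasing recall.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision value with the maximum precision at that recall or higher.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the areas of the rectangles under the resulting step curve.
    steps = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[steps + 1] - r[steps]) * p[steps + 1])

# mAP is then the mean of these AP values over all classes (and IoU thresholds).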

Results

Object Detection — Car Traffic Video
drive.google.com

Conclusion


Object detection is different from other computer vision tasks. You can use a pre-trained model and adapt it to meet your needs. You’ll probably need GCP or another platform that offers more computing power. The math is hard, so read others’ articles and fail fast.

Lessons Learned
In the beginning, we found that the model was not able to predict many of the classes because many of them had only a few training images, which resulted in an imbalanced training dataset. We therefore decided to use just the 43 most popular classes, which is not a perfect approach, but it meant each class had at least 500 images. However, our predictions’ confidence scores were still pretty low. To solve this problem, we selected only images that contained our target classes.

Object detection is a very challenging topic, but don’t be scared: try to learn as much as possible from the many open resources online, like Coursera, YouTube instructional videos, GitHub, and Medium. All this free wisdom can help you succeed in this amazing field!

Future Work: Continuations or Improvements

1. Train the model on more classes to detect a greater variety of objects. To reach this goal, we first need to solve the problem of imbalanced data. A potential solution is to collect more images of these rarer classes; other ways to tackle the imbalance include:

a. Data augmentation: change existing images slightly to create new images (see the short sketch after this list).

b. Image duplication: use the same image multiple times to train the algorithm on a specific rare class.

c. Ensembling: train one model on the popular classes and another on the rare classes, and use predictions from both.

2. In addition, we can try an ensemble of different models, such as MobileNet, VGG, etc., which are convolutional neural network architectures that are also used for object detection.
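As a rough illustration of the data-augmentation idea in 1a, the sketch below horizontally flips an image and mirrors its bounding boxes to create one extra training example (the function and field names are ours, for illustration only):

def hflip_example(image, boxes):
    # image: NumPy-style H x W x 3 array; boxes: list of dicts with pixel
    # coordinates 'xmin', 'xmax', 'ymin', 'ymax'.
    # Returns a horizontally flipped copy of the image and its boxes.
    height, width = image.shape[:2]
    flipped_image = image[:, ::-1, :].copy()
    flipped_boxes = [{**b, 'xmin': width - b['xmax'], 'xmax': width - b['xmin']}
                     for b in boxes]
    return flipped_image, flipped_boxes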

If you’d like to take a detailed look at our team’s code, here’s the GitHub link. Please feel free to provide any feedback or comments!

bandiatindra/Object-Detection-Project (github.com)
