
Object Detection using Google AI Open Images

Learn to build your own self-driving car!!!… just kidding

Atindra Bandi
Dec 14, 2018 · 9 min read

By Atindra Bandi, Alyson Brown, Sagar Chadha, Amy Dang, Jason Su

When was the last time you logged into your phone using nothing but your face? Or clicked a selfie with some friends and used a Snapchat filter that put fancy dog ears on your face? Did you know that these cool features are enabled by a neural network that not only recognizes that there is a face in the photo but also detects where the ears should go? Your phone, in a sense, can ‘see’ you, and it even knows what you look like!

The technology that helps computers ‘see’ is called “computer vision”.


In recent years, computer vision applications have become increasingly commonplace thanks to an explosion in computing power that has made deep learning models faster and more feasible. Many companies, such as Amazon, Google, Tesla, Facebook, and Microsoft, are investing heavily in this technology and its applications.

Computer Vision Tasks


We focus on two main computer vision tasks: image classification and object detection.

1. Image Classification focuses on grouping an image into a predefined category. To achieve this, we need multiple images of the class of interest and we train a computer to essentially convert pixel numbers into symbols. This is just saying that the computer sees a photo of a cat and says that there is a cat in it.


2. Object detection utilizes an image classifier to figure out what is present in an image and where. These tasks have been made easier through the use of Convolutional Neural Networks (CNNs), which make it possible to detect multiple classes in a single pass over the image.

For more details on the difference between these tasks, please refer to the following article.

Computer vision is cool!


Recognizing that many interesting data science applications in the future will involve working with images, my team of budding data scientists and I decided to try our hands at the Google AI Open Images challenge hosted on Kaggle. We thought of this as the perfect opportunity to get our hands dirty with neural networks and convolutions, and potentially impress our professors and classmates.


This challenge provided us with 1.7 million images containing 12 million bounding box annotations (their X and Y coordinates relative to the image) for 500 object classes. You can find the data here.

We highly recommend Andrew Ng’s Coursera course on Convolutional Neural Networks to anyone who wants to learn more about CNNs.

Getting Our Hands Dirty!


Exploratory Data Analysis: As with all data analyses, we began by exploring what images we had and the types of objects we needed to detect.

Frequency of Classes in the Training Dataset


A quick look at the training images revealed that certain objects appeared far more often than others. The chart above shows the distribution of the top 43 classes. It is clear that there is a huge disparity, and it would need to be resolved somehow. In the interest of time and money (GPU costs are high :( ), we chose the aforementioned 43 object classes and a subset of ~300K images containing these objects. We had about 400 images for each object class in the training data.
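For reference, here is a minimal sketch of the kind of frequency check we ran. It assumes the challenge’s train-annotations-bbox.csv file (one row per bounding box, with a LabelName column) and the class-descriptions-boxable.csv mapping from label codes to readable names; your file names and columns may differ.

import pandas as pd

# One row per bounding box; LabelName holds the machine-readable class code.
boxes = pd.read_csv('train-annotations-bbox.csv')

# Mapping from label codes to human-readable class names (no header row).
names = pd.read_csv('class-descriptions-boxable.csv',
                    header=None, names=['LabelName', 'ClassName'])

# Count boxes per class and keep the 43 most frequent classes.
counts = (boxes.merge(names, on='LabelName')
               .groupby('ClassName').size()
               .sort_values(ascending=False))
top_43 = counts.head(43)
print(top_43)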

Choosing the Object Detection Algorithm


We considered various object detection algorithms, including VGG, Inception, and YOLO, but ultimately chose YOLO because of its speed, its modest computational requirements, and the abundance of online articles that could guide us through the process. Faced with computational and time constraints, we made two key decisions:

1. Use a YOLO v2 model that had already been trained to identify certain objects.

2. Leverage transfer learning to retrain its last convolutional layer to recognize previously unseen objects such as guitar, house, man/woman, bird, etc.

Inputs for YOLO


The YOLO algorithm requires some specific inputs:

1. Input image size: The YOLO network is designed to work with specific input image sizes. We sent in images with a size of 608 * 608.

2. Number of classes: 43. This is required to define the dimensions of YOLO's output.

3. Anchor boxes: The number and dimensions of the anchor boxes to be used.

4. Confidence and IoU thresholds: Thresholds that define which anchor boxes to keep and how to pick between overlapping anchor boxes.

5. Image names with bounding box information: For each image, we need to tell YOLO what is in it, in a specific format as shown below.

Sample input for YOLO
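The sample-input image from the original post is not shown here. Purely as an illustration, a single training example in the sort of annotation structure used by common Keras YOLO v2 implementations looks roughly like this (the field names and values below are ours, not an excerpt from our actual files):

# Illustrative only: one image plus its labelled boxes.
sample_annotation = {
    'filename': 'images/0a1b2c3d4e5f.jpg',   # hypothetical image path
    'width': 1024,
    'height': 768,
    'object': [
        {'name': 'Bus',   'xmin': 120, 'ymin': 80,  'xmax': 560, 'ymax': 420},
        {'name': 'Wheel', 'xmin': 150, 'ymin': 360, 'xmax': 230, 'ymax': 430},
    ],
}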

Below is the code snippet defining the YOLO inputs:


import numpy as np

# The class labels we trained on (the full 43-label list is abridged here)
LABELS = ['Shirt', 'Trousers', 'Swimwear', 'Tie', 'Bus',
          'Sunglasses', 'Jacket', 'Dress', 'Human eye', 'Suit',
          'Human head', 'Human hand', 'Human leg', 'Human nose',
          'Wheel', 'Boat', 'House', 'Bird', 'Guitar', 'Fast food']

# Setting the input image size to 608 X 608
IMAGE_H, IMAGE_W = 608, 608

# We will use 19 X 19 grids for our images, with 5 anchor boxes per grid cell
GRID_H, GRID_W = 19, 19
BOX = 5

# Getting the total number of classes/labels we will be predicting (43 in the full setup)
CLASS = len(LABELS)

# Assigning 1's to all class labels
CLASS_WEIGHTS = np.ones(CLASS, dtype='float32')

# Pr(object) * Pr(class | object) must exceed OBJ_THRESHOLD for a box to be kept
OBJ_THRESHOLD = 0.3    # 0.5

# If overlapping boxes have IoU > NMS_THRESHOLD, non-max suppression keeps only one
NMS_THRESHOLD = 0.3    # 0.45
Inputs into YOLO

YOLO v2 Architecture
The architecture is shown below: it has 23 convolution layers, each with its own batch normalization, Leaky ReLU activation, and max pooling.

Representation of the actual YOLO v2 architecture.

These layers try to extract multiple important features from images so that the various classes can be detected. For the purpose of object detection, the YOLO algorithm divides the input image into a 19*19 grid, with 5 different anchor boxes in each grid cell. It then tries to detect classes within each of these grid cells and assigns a detected object to one of the 5 anchor boxes of that cell. The anchor boxes differ in shape and are intended to capture differently shaped objects within each grid cell.

The YOLO algorithm outputs a matrix (shown below) for each of the defined anchor boxes:


Given that we had to train the algorithm for 43 classes, the output dimensions were:
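The original figure with the exact numbers is not reproduced here, but the arithmetic is straightforward: each anchor box predicts 4 box coordinates, 1 objectness score, and 43 class probabilities, and there are 5 anchor boxes in each of the 19*19 grid cells.

# Output tensor shape per image: 19 x 19 grid cells, 5 anchor boxes per cell,
# and (4 coordinates + 1 objectness score + 43 class probabilities) per box.
GRID_H, GRID_W, BOX, CLASS = 19, 19, 5, 43
values_per_box = 4 + 1 + CLASS                  # 48
output_shape = (GRID_H, GRID_W, BOX, values_per_box)
print(output_shape)                             # (19, 19, 5, 48) -> 86,640 values in total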


These matrices give us, for each anchor box, the probability that an object is present and the probabilities of which class that object is. To filter out anchor boxes that don't contain any class, or that capture the same object as another box, we use two thresholds: an IoU threshold to filter out anchor boxes capturing the same object, and a confidence threshold to filter out boxes that don't contain any class with high confidence.
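As a rough illustration of how these two thresholds work together, here is a minimal NumPy sketch of confidence filtering followed by greedy non-max suppression. It simplifies what a full YOLO decoder does (no anchor decoding and no per-class handling), and the function names are ours:

import numpy as np

def iou(box_a, box_b):
    # Boxes are (xmin, ymin, xmax, ymax); IoU = intersection area / union area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def filter_boxes(boxes, scores, obj_threshold=0.3, nms_threshold=0.3):
    # boxes: N x 4 array of (xmin, ymin, xmax, ymax); scores: N confidence values.
    # Step 1 - confidence threshold: drop boxes whose score is too low.
    keep = scores >= obj_threshold
    boxes, scores = boxes[keep], scores[keep]
    # Step 2 - greedy non-max suppression: keep the highest-scoring box and
    # drop any remaining box that overlaps a kept box with IoU above the threshold.
    order = np.argsort(scores)[::-1]
    selected = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in selected):
            selected.append(i)
    return boxes[selected], scores[selected]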

Below is an illustration of the last few layers of the YOLO v2 architecture:

# Layer 20
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_20', use_bias=False)(x)
x = BatchNormalization(name='norm_20')(x)
x = LeakyReLU(alpha=0.1)(x)

# Layer 21
skip_connection = Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_21', use_bias=False)(skip_connection)
skip_connection = BatchNormalization(name='norm_21')(skip_connection)
skip_connection = LeakyReLU(alpha=0.1)(skip_connection)
skip_connection = Lambda(space_to_depth_x2)(skip_connection)

x = concatenate([skip_connection, x])

# Layer 22
x = Conv2D(1024, (3,3), strides=(1,1), padding='same', name='conv_22', use_bias=False)(x)
x = BatchNormalization(name='norm_22')(x)
x = LeakyReLU(alpha=0.1)(x)

Last few layers of YOLO v2 architecture (Only for illustration purposes)


Transfer Learning
Transfer learning is the idea of taking a neural network that has already been trained to classify images and using it for our specific purpose. This saves computation time, since we don't need to train a huge number of weights: for instance, the YOLO v2 model we used has about 50 million weights, and training them from scratch would easily have taken 4–5 days on the Google Cloud instance we were using.

To successfully implement transfer learning, we had to make a few updates to our model:

• Input image size: The model that we downloaded used input images of size 416*416. Since some of the objects we were training for were very small (birds, footwear), we didn't want to squish the input image that much, so we used input images of size 608*608.

• Grid size: We changed the grid dimensions so that the image is divided into 19*19 grid cells instead of the 13*13 default of the model we downloaded.

• Output layer: Since we were training on a different number of classes (43, versus the 80 the original model was trained on), the output layer was changed to produce the matrix dimensions discussed above; a short sketch of this change follows.
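For concreteness, this is roughly what the swapped-out detection head looks like in Keras. It follows the pattern used by common YOLO v2 Keras implementations, so treat it as a sketch rather than a verbatim excerpt from our code.

from keras.layers import Conv2D, Reshape
from keras.models import Model

# Continues the earlier snippets: `x` is the final feature map, `input_image`
# is the 608 x 608 input tensor, and BOX / CLASS / GRID_H / GRID_W are as above.
output = Conv2D(BOX * (4 + 1 + CLASS), (1, 1),
                strides=(1, 1), padding='same', name='DetectionLayer')(x)
output = Reshape((GRID_H, GRID_W, BOX, 4 + 1 + CLASS))(output)

model = Model(input_image, output)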

We re-initialized the weights of YOLO’s last convolutional layer so that we could train it on our dataset, which eventually helped us identify our unique classes. Below is the code snippet for this step:


# Taking the last convolutional layer
layer = model.layers[-4]
weights = layer.get_weights()

# Randomly re-initializing the weights and biases of the last layer
new_kernel = np.random.normal(size=weights[0].shape) / (GRID_H * GRID_W)
new_bias = np.random.normal(size=weights[1].shape) / (GRID_H * GRID_W)

# Writing the new random weights back into the layer
layer.set_weights([new_kernel, new_bias])
Re-initializing the last convolution layer of YOLO

Cost Function
In any object detection problem, we want to identify the right object at the right place in an image, with high confidence. There are three major components to the cost function:

1. Classification Loss: the squared error of the class conditional probabilities when an object is detected. The loss function thus penalizes classification error only if an object is present in a grid cell.

2. Localization Loss: the squared error between the predicted bounding box locations and sizes and the ground-truth boxes, computed only for the boxes responsible for detecting an object. To weight the loss from the bounding box coordinate predictions, we use a regularization parameter (λcoord). Further, to make sure that small deviations in larger boxes matter less than in smaller boxes, the algorithm uses the square root of the bounding box width and height.


3. Confidence Loss: the squared error of the bounding box’s confidence score. Most of the boxes are not responsible for detecting an object, so the equation is split into two parts: one for the boxes that detect an object and one for the rest. A regularization term λnoobj (default: 0.5) is applied to the latter part to weigh down the boxes not detecting an object.

Please feel free to refer to the original YOLO paper for a detailed look at the cost function.
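For reference, the loss from the original YOLO paper (which YOLO v2 largely retains) combines the three parts above roughly as follows, where $\mathbb{1}_{ij}^{obj}$ indicates that box $j$ in grid cell $i$ is responsible for an object and $S^2$ is the number of grid cells:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$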

The beauty of YOLO is that it uses errors that are easy to optimize with standard optimizers such as Stochastic Gradient Descent (SGD), SGD with momentum, or Adam. The code snippet below shows the parameters we used for optimizing the cost function.

# Optimization functions we tried
optimizer = Adam(lr=0.5e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
# optimizer = SGD(lr=1e-4, decay=0.0005, momentum=0.9)
# optimizer = RMSprop(lr=1e-4, rho=0.9, epsilon=1e-08, decay=0.0)

# Compiling the model with YOLO's custom loss function
model.compile(loss=custom_loss, optimizer=optimizer)

# Training on batches produced by our data generator
model.fit_generator(generator=train_batch,
                    steps_per_epoch=len(train_batch),
                    epochs=100,
                    verbose=1)
Training algorithm for YOLO (Adam optimizer)


Output Accuracy: mean Average Precision (mAP Score)

There are many metrics for evaluating object detection models; for our project we decided to use the mAP score, which is the average of the maximum precision at different recall values, averaged over all IoU thresholds. In order to understand mAP, we'll do a quick review of precision, recall, and IoU (intersection over union).

Precision & Recall


Precision measures the percentage of positive predictions that are correct. Recall is the proportion of actual positives that the model correctly identifies. These two values are inversely related and also depend on the score threshold you set for the model (in our case, the confidence score). Their mathematical definitions are presented below:
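The formula image from the original post is not reproduced here; the standard definitions, in terms of true positives (TP), false positives (FP), and false negatives (FN), are:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
$$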


Intersection over Union (IoU)

IoU measures how much overlap there is between two regions: the area of their intersection divided by the area of their union. This tells you how well the predictions from your object detector match the ground truth (the true object boundary). To summarize, the mAP score is the mean AP over all IoU thresholds.
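To make the ‘average of the maximum precision at different recall values’ idea concrete, here is a small sketch of how average precision can be computed for one class from its precision-recall curve (a standard area-under-the-curve formulation; our actual scores came from the competition's evaluation, so treat this as illustrative):

import numpy as np

def average_precision(recall, precision):
    # recall and precision are arrays for one class, sorted by increasing recall.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Replace each precision value with the maximum precision at that recall or higher.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the areas of the rectangles under the resulting step curve.
    steps = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[steps + 1] - r[steps]) * p[steps + 1])

# mAP is then the mean of these AP values over all classes (and IoU thresholds).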

Results

Object Detection — Car Traffic Video
drive.google.com

Conclusion


Object detection is different from other computer vision tasks. You can use a pre-trained model and adapt it to meet your needs. You’ll probably need GCP or another platform that offers more computing power. The math is hard, so read others’ articles and fail fast.

Lessons Learned
In the beginning, we found that the model was not able to predict many of the classes because many of them had only a few training images, which resulted in an imbalanced training dataset. We therefore decided to use just the 43 most popular classes, which is not a perfect approach, but it meant each class had at least 500 images. However, our predictions’ confidence scores were still pretty low. To solve this problem, we selected only images that contained our target classes.

Object detection is a very challenging topic, but don’t be scared: try to learn as much as possible from the many open resources online, like Coursera, YouTube instructional videos, GitHub, and Medium. All this free wisdom can help you succeed in this amazing field!

Future Work: Continuations or Improvements

1. Train the model on more classes to detect a greater variety of objects. To reach this goal, we first need to solve the problem of imbalanced data. A potential solution is to collect more images of these rarer classes; other ways to tackle the imbalance include:

a. Data augmentation: change existing images slightly to create new images (see the short sketch after this list).

b. Image duplication: use the same image multiple times to train the algorithm on a specific rare class.

c. Ensembling: train one model on the popular classes and another on the rare classes, and use predictions from both.

2. In addition, we can try an ensemble of different models, such as MobileNet, VGG, etc., which are convolutional neural network architectures that are also used for object detection.
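As a rough illustration of the data-augmentation idea in 1a, the sketch below horizontally flips an image and mirrors its bounding boxes to create one extra training example (the function and field names are ours, for illustration only):

def hflip_example(image, boxes):
    # image: NumPy-style H x W x 3 array; boxes: list of dicts with pixel
    # coordinates 'xmin', 'xmax', 'ymin', 'ymax'.
    # Returns a horizontally flipped copy of the image and its boxes.
    height, width = image.shape[:2]
    flipped_image = image[:, ::-1, :].copy()
    flipped_boxes = [{**b, 'xmin': width - b['xmax'], 'xmax': width - b['xmin']}
                     for b in boxes]
    return flipped_image, flipped_boxes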

If you’d like to take a detailed look at our team’s code, here’s the GitHub link. Please feel free to provide any feedback or comments!

bandiatindra/Object-Detection-Project (github.com)
