Gentle guide on how YOLO Object Localization works with Keras (Part 1)

https://www.dlology.com/blog/gentle-guide-on-how-yolo-object-localization-works-with-keras/

We will dive a little deeper and understand how the YOLO object localization algorithm works.
I have seen some impressive real-time demos of object localization. One of them uses the TensorFlow Object Detection API; you can customize it to detect your cute pet - a raccoon.
https://github.com/tensorflow/models/tree/master/research/object_detection

Having played around with the API for a while, I began to wonder why the object localization works so well.
Few online resources explained it to me in a definitive and easy-to-understand way, so I decided to write this post on my own for anyone who is curious about how the object localization algorithm works.
This post might contain some advanced topics, but I will try to explain them in as beginner-friendly a way as possible.

Intro to Object Localization

What is object localization, and how does it compare to object classification?

You might have heard of ImageNet models, which do well at classifying images. One such model is trained to tell whether a specific object, such as a car, is present in a given image.
Object classification

An object localization model is similar to a classification model, but the trained localization model also predicts where the object is located in the image by drawing a bounding box around it. For example, a car is located in the image below. The bounding box information (center point coordinates, width, and height) is also included in the model output.
Object localization


Let’s say we have 3 types of targets to detect:
1 - pedestrian
2 - car
3 - motorcycle

For the classification model, the output will be a list of 3 numbers representing the probability of each class. For the image above, with only a car inside, the output may look like [0.001, 0.998, 0.001]. The second class, the car, has the largest probability.

The output of the localization model will include the bounding box information, so the output will look like this:

y = [pc, bx, by, bh, bw, c1, c2, c3]

pc: 1 means an object is detected; if it is 0, the rest of the output is ignored.
bx: x coordinate of the object's center, measured from the upper-left corner of the image
by: y coordinate of the object's center, measured from the upper-left corner of the image
bh: height of the bounding box
bw: width of the bounding box
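
To make this concrete, here is a minimal sketch in plain NumPy of how such a label vector could be assembled for the 3-class example above, where c1, c2, c3 are the per-class indicators. The `make_label` helper, the exact ordering, and the normalized coordinates are assumptions for illustration, not something prescribed by the original post.

```python
import numpy as np

# Hypothetical helper to build one label vector y = [pc, bx, by, bh, bw, c1, c2, c3]
# for the 3-class example (pedestrian, car, motorcycle). Coordinates are assumed
# to be normalized to [0, 1] relative to the image.
def make_label(pc, bx=0.0, by=0.0, bh=0.0, bw=0.0, class_id=None, num_classes=3):
    y = np.zeros(5 + num_classes, dtype=np.float32)
    y[0] = pc                      # 1 if an object is present, 0 otherwise
    if pc == 1:
        y[1:5] = [bx, by, bh, bw]  # box center, height, and width
        y[5 + class_id] = 1.0      # one-hot class indicator (c1, c2, c3)
    return y

# A car (class index 1) roughly centered, about half the image in size:
print(make_label(1, bx=0.5, by=0.5, bh=0.4, bw=0.6, class_id=1))
# [1.  0.5 0.5 0.4 0.6 0.  1.  0. ]

# No object at all: pc is 0 and the remaining values are ignored.
print(make_label(0))
# [0. 0. 0. 0. 0. 0. 0. 0.]
```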


At this point, you may have come up with an easy way to do object localization: apply a sliding window across the entire input image, kind of like using a magnifier to look at one region of a map at a time and check whether that region contains something that interests us. This method is easy to implement and doesn't even require training another localization model, since we can use a popular image classifier model, have it look at every selected region of the image, and output the probability for each class of target.
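
As a rough illustration only, here is a minimal sketch of that naive sliding-window idea, assuming you already have a pretrained Keras classifier that takes fixed-size crops and returns class probabilities; the window size, stride, and threshold are arbitrary choices for the example.

```python
import numpy as np

def sliding_window_detect(image, classifier, win=64, stride=16, threshold=0.9):
    """Run a plain image classifier over every window of the input image.

    image:      numpy array of shape (H, W, 3)
    classifier: a Keras model that maps (win, win, 3) crops to class probabilities
    Returns a list of (x, y, win, win, class_id, score) candidate boxes.
    """
    boxes = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            crop = image[y:y + win, x:x + win]
            probs = classifier.predict(crop[np.newaxis], verbose=0)[0]
            class_id = int(np.argmax(probs))
            if probs[class_id] > threshold:
                boxes.append((x, y, win, win, class_id, float(probs[class_id])))
    return boxes
```

For a 256 x 256 image with these settings, that is already 13 x 13 = 169 forward passes for a single window size, which leads straight to the problem below.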

But this method is very slow, since we have to predict on lots of regions and try lots of box sizes to get a more accurate result. It is computationally intensive, so it is hard to achieve the real-time object localization performance required by an application like self-driving cars.

Here is a trick to make it a little bit faster.

If you are familiar with how a convolutional network works, it can reproduce the sliding-window effect by holding the virtual magnifier for you. As a result, it generates all the predictions for a given bounding box size in one forward pass of the network, which is much more computationally efficient. But still, the position of the bounding box will not be very accurate, depending on how we choose the stride and how many different sizes of bounding boxes we try.

[Figure: convolutional implementation of sliding windows on a 16 x 16 x 3 input with a 14 x 14 window and stride 2]
credit: Coursera deeplearning.ai

In the image above, the input image has a shape of 16 x 16 pixels with 3 color channels (RGB). The convolutional sliding window, shown as the blue square in the upper left, has size 14 x 14 and a stride of 2, meaning the window slides vertically or horizontally 2 pixels at a time. The upper-left cell of the output tells you whether any of the 4 types of target objects is detected in the upper-left 14 x 14 section of the image.
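
Here is a minimal, untrained Keras sketch of that idea; the layer sizes loosely follow the deeplearning.ai figure and the 4 output channels stand for the 4 classes, but the exact architecture is an assumption for illustration. Because every layer is convolutional, the same weights applied to a 16 x 16 image produce a 2 x 2 grid of predictions, one for each 14 x 14 window at stride 2, in a single forward pass.

```python
import numpy as np
from tensorflow.keras import layers, models

# A tiny fully convolutional network: no Dense layers, so it accepts any input size.
inputs = layers.Input(shape=(None, None, 3))
x = layers.Conv2D(16, 5, activation="relu")(inputs)      # 14x14 -> 10x10
x = layers.MaxPooling2D(2)(x)                            # 10x10 -> 5x5
x = layers.Conv2D(400, 5, activation="relu")(x)          # the "fully connected" layer as a conv
x = layers.Conv2D(400, 1, activation="relu")(x)
outputs = layers.Conv2D(4, 1, activation="softmax")(x)   # per-window class scores
model = models.Model(inputs, outputs)

print(model.predict(np.zeros((1, 14, 14, 3)), verbose=0).shape)  # (1, 1, 1, 4): one window
print(model.predict(np.zeros((1, 16, 16, 3)), verbose=0).shape)  # (1, 2, 2, 4): a 2x2 grid of windows
```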

If you are not entirely sure about the convolutional implementation of the sliding window I just described, no problem, because the YOLO algorithm we explain later will handle all of this for you.

Why do we need Object Localization?

One obvious application is self-driving cars, where detecting and localizing other cars, road signs, and bikes in real time is critical.

What else can it do?

What about a security camera that tracks and predicts the movement of a suspicious person entering your property?
Or in a fruit packaging and distribution center, we could build an image-based volume-sensing system. It might even use the size of the bounding box to approximate the size of an orange on the conveyor belt and do some smart sorting.


Can you think of some other useful application for object localization? Please share your fresh ideas below!

The second part of the series, “Gentle guide on how YOLO Object Localization works with Keras (Part 2)”, is linked under Related posts below.

Wordbook
Keras is an open source neural network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit or Theano.

Related posts
Gentle guide on how YOLO Object Localization works with Keras (Part 2)
https://www.dlology.com/blog/gentle-guide-on-how-yolo-object-localization-works-with-keras-part-2/


Reposted from blog.csdn.net/chengyq116/article/details/82119203