An overview of YOLO object detection for beginners

This article introduces the advantages of YOLO (You Only Look Once) object detection, its development over the past few years, and some real-life applications.

What is object detection

Object detection is a technique used in computer vision to identify and locate objects in images or videos.

Image localization refers to the process of identifying the correct location of one or more objects using bounding boxes, which correspond to the rectangular shape surrounding the object.

This process is sometimes confused with image classification or image recognition, which aims to assign an image, or an object within an image, to one of a set of classes.

The illustrations below correspond to the computer vision techniques explained above. The object detected in the image is a "person".

img

In this article, we will first look at the benefits of object detection and then introduce YOLO, a state-of-the-art object detection algorithm.

In the second part, we will focus on the YOLO algorithm and how it works. After that, we will present some practical applications of YOLO.

The final section explains the evolution of YOLO since 2015 and then summarizes the next steps.

What is YOLO?

You Only Look Once (YOLO) is a state-of-the-art real-time object detection algorithm introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their famous research paper "You Only Look Once: Unified, Real-Time Object Detection".

The core idea of YOLO is to treat object detection as a regression problem rather than a classification task. The entire image is fed to a single convolutional neural network (CNN), which directly outputs the locations of the bounding boxes and their classes in one pass.

Why is YOLO so popular in the field of object detection?

The reasons why YOLO is ahead of its competitors include:

  • High speed
  • High detection accuracy
  • Good generalization ability
  • Open source

1. Fast

YOLO is very fast because it does not involve a complex pipeline. It can process images at 45 frames per second (FPS). Furthermore, compared to other real-time systems, YOLO achieves more than twice the mean average precision (mAP), making it an excellent choice for real-time processing. From the graph below, we can see that YOLO is much faster than other object detectors, reaching 91 FPS.

YOLO Speed compared to other state-of-the-art object detectors

2. High detection accuracy

YOLO achieves high detection accuracy with very few background errors, that is, false positives on regions that contain no object.

3. Better generalization

This is especially true of the newer versions of YOLO, discussed later in this article. With these improvements, YOLO generalizes better to new domains, making it ideal for applications that rely on fast and robust object detection. For example, the study "Automatic Melanoma Detection Using YOLO Deep Convolutional Neural Networks" shows that YOLOv1 has the lowest average accuracy, while YOLOv2 and YOLOv3 achieve higher average accuracy.

4. Open source

Open sourcing YOLO allows the community to continuously improve the model. This is one of the reasons YOLO has achieved so many improvements in a limited time.

YOLO Architecture

The YOLO architecture is similar to GoogLeNet. As shown in the figure below, it has a total of 24 convolutional layers, four max pooling layers, and two fully connected layers.

YOLO Architecture from the original paper

The architecture works as follows:

  • The input image is resized to 448x448 and then processed through a convolutional network.
  • A 1x1 convolution is first applied to reduce the number of channels, followed by a 3x3 convolution to generate a cuboidal output.
  • The activation function used internally is leaky ReLU, except for the last layer, which uses a linear activation function.
  • Additional techniques, such as batch normalization and dropout, regularize the model and prevent overfitting.

How does YOLO object detection work?

Having seen the architecture of YOLO in the previous section, let us briefly introduce how the YOLO algorithm performs object detection using a simple use case.

Imagine you built a YOLO application that can detect players and footballs in a given image. How would you explain this process to someone, especially a non-expert?

YOLO Object Detection Image by Jeffrey F Lin on Unsplash

The algorithm operates based on the following four methods:

  • Residual blocks
  • Bounding box regression
  • IoU (Intersection over Union)
  • Non-Maximum Suppression (NMS)

Let's look at each method in more detail.

Residual blocks

This step first divides the original image (A) into NxN grid cells with equal shape, where N is 4 in our case, as shown in the image on the right. Each cell in the grid is responsible for locating and predicting the class of the objects it covers, as well as the probability/confidence value.

Application of grid cells to the original image
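As a rough illustration of the grid step, the sketch below maps an object's center point to the grid cell that would be responsible for it. The function name `responsible_cell`, the pixel coordinates, and the 400x400 image size are assumptions made for this example; only the 4x4 grid comes from the text above.

```python
def responsible_cell(cx, cy, img_w, img_h, n=4):
    """Return (row, col) of the grid cell that contains the point (cx, cy).

    cx, cy       : object center in pixels (hypothetical input format)
    img_w, img_h : image width and height in pixels
    n            : number of grid cells per side (N = 4 in the article)
    """
    # Scale the coordinates to [0, n) and truncate to get the cell index.
    # min(..., n - 1) keeps points on the right/bottom edge inside the grid.
    col = min(int(cx / img_w * n), n - 1)
    row = min(int(cy / img_h * n), n - 1)
    return row, col


# A center at (300, 100) in a 400x400 image falls in row 1, column 3.
print(responsible_cell(300, 100, 400, 400))  # (1, 3)
```

Each cell is only asked to predict objects whose center lands inside it, which is why the mapping from center to cell matters.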

Bounding box regression

The next step is to determine the bounding boxes corresponding to all objects in the image. There can be as many bounding boxes as there are objects in a given image. YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where Y is the final vector representation of each bounding box.

Y = [pc,bx,by,bh,bw,c1,c2]

This is especially important during the training phase of the model.

  • pc corresponds to the probability score that the grid cell contains an object. For example, all grid cells in red will have a probability score greater than zero. The image on the right is a simplified version, since each yellow cell has a probability of zero (insignificant).

Identification of significant and insignificant grids

  • bx and by are the center coordinates of the bounding box, relative to the enclosing grid cell.

  • bh and bw are the height and width of the bounding box, relative to the enclosing grid cell.

  • c1 and c2 correspond to two classes Player and Ball. You can have as many classes as you need for your application.

To understand better, let's take a closer look at the bottom right player.

Bounding box regression identification
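To make the vector Y more concrete, here is a minimal sketch that builds it for one labeled box. It follows this article's convention of expressing the box relative to its grid cell. The function name `encode_box`, the pixel box format (x_min, y_min, x_max, y_max), the 400x400 image, and the class order (0 = Player, 1 = Ball) are all assumptions for illustration.

```python
def encode_box(box, class_id, img_size=400, n=4, num_classes=2):
    """Build Y = [pc, bx, by, bh, bw, c1, c2] for one ground-truth box.

    box      : (x_min, y_min, x_max, y_max) in pixels (assumed format)
    class_id : 0 for Player, 1 for Ball (assumed order)
    Returns the responsible (row, col) cell and the vector Y.
    """
    x_min, y_min, x_max, y_max = box
    cell = img_size / n                      # side length of one grid cell
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2

    # Grid cell that contains the box center.
    col = min(int(cx // cell), n - 1)
    row = min(int(cy // cell), n - 1)

    # Center offset inside the cell, and size in units of one cell.
    bx = cx / cell - col
    by = cy / cell - row
    bh = (y_max - y_min) / cell
    bw = (x_max - x_min) / cell

    # One-hot class scores: c1 = Player, c2 = Ball.
    classes = [0.0] * num_classes
    classes[class_id] = 1.0

    pc = 1.0  # this cell contains an object
    return (row, col), [pc, bx, by, bh, bw] + classes


cell, y = encode_box((300, 250, 400, 400), class_id=0)
print(cell, y)  # (3, 3) [1.0, 0.5, 0.25, 1.5, 1.0, 1.0, 0.0]
```

Because sizes are measured in units of one cell, a value above 1 (here bh = 1.5) simply means the box is taller than a single grid cell.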

IoU (Intersection over Union)

Usually, a single object in an image can have multiple grid box candidates as predictions, but not all of them are relevant. The goal of the IoU (a value between 0 and 1) is to discard the irrelevant grid boxes and keep only the relevant ones. The logic is as follows:

  • The user defines an IoU selection threshold, e.g. 0.5.

  • Then, YOLO calculates the IoU of each grid cell, which is the intersection area divided by the union area.

  • Finally, it ignores predictions for grid cells with IoU ≤ threshold and keeps those with IoU > threshold.

Below is an example of applying the grid selection process to the bottom left object. We can observe that the object initially had two grid candidates, and in the end only "Grid 2" was selected.

Process of selecting the best grids for prediction
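The IoU computation itself is short. The sketch below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name `iou` and this box format are choices made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


threshold = 0.5
score = iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7, so 1/7
print(score, score > threshold)
```

With a threshold of 0.5, this pair would be discarded, since the boxes share only one seventh of their combined area.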

Non-Maximum Suppression (NMS)

Thresholding the IoU is not always enough, because an object may still have multiple boxes with an IoU above the threshold, and keeping all of them may introduce noise. This is why we use NMS to keep only the boxes with the highest detection probability scores.
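A common way to implement this idea is greedy NMS: repeatedly keep the highest-scoring box and drop every remaining box that overlaps it too much. The sketch below is one such version, with the function names, box format (x_min, y_min, x_max, y_max), and default threshold chosen for illustration rather than taken from any YOLO codebase.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. Returns indices of the boxes to keep."""
    # Candidate indices sorted from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop candidates that overlap the kept box more than the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first by about 0.68 IoU, so it is treated as a duplicate detection of the same object and removed.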

Application scenarios of YOLO

YOLO object detection has a variety of applications in our daily life. In this part, we will cover some applications in the following fields: healthcare, agriculture, and security monitoring.

1- Applications in industry

Object detection has been introduced to many practical industrial domains, such as healthcare and agriculture. Let's understand each field with concrete examples.

Healthcare

In the medical field, especially in surgery, locating organs in real time can be challenging because of biological diversity from one patient to another. The study "Kidney Recognition in CT Using YOLOv3" uses YOLOv3 to help localize kidneys in 2D and 3D from computed tomography (CT) scans.

Agriculture

Artificial intelligence and robotics play an important role in modern agriculture. Harvesting robots are vision-based robots introduced to replace the manual picking of fruits and vegetables. One of the best models in this field uses YOLO. In "Tomato Detection Based on a Modified YOLOv3 Framework", the authors describe how they used YOLO to identify different types of fruits and vegetables for efficient harvesting.

Comparison of YOLO-tomato models

Security Monitoring

Object detection is most commonly used for security surveillance, but that is not its only application in this field. During the COVID-19 outbreak, YOLOv3 was used to estimate social distancing violations between people. You can learn more about this topic from "Deep Learning-Based Framework for Social Distancing Monitoring of COVID-19".

Comparison of YOLO, YOLOv2, YOLO9000, YOLOv3, YOLOv4, YOLOR, YOLOX, YOLOv5, YOLOv6, and YOLOv7

Since YOLO was first released in 2015, it has evolved through different versions. In this section, we'll look at the differences between each version.

YOLO Timeframe 2015 to 2022

YOLO/YOLOv1, the starting point

The first version of YOLO was a game changer in object detection due to its ability to recognize objects quickly and efficiently.

However, like many other solutions, the first version of YOLO had its own limitations:

  • It struggles to detect small objects that appear in groups, such as a crowd of people in a stadium. This is because each grid cell in the YOLO architecture is designed to detect a single object.
  • YOLO also fails to reliably detect objects with new or unusual shapes.
  • Finally, the loss function used to approximate detection performance treats errors in small and large bounding boxes equally, which leads to incorrect localizations.

YOLOv2 or YOLO9000

YOLOv2 was created in 2016 to make the YOLO model better, faster, and stronger.

Improvements include but are not limited to using Darknet-19 as a new architecture, batch normalization, higher input resolution, convolutional layers with anchors, dimensional clustering, and fine-grained features.

1- Batch normalization

Adding batch normalization layers improves mAP by 2%. Batch normalization also has a regularizing effect, which helps prevent overfitting.

2- Higher input resolution

YOLOv2 directly uses a higher-resolution 448×448 input instead of 224×224, which lets the model adjust its filters to perform better on higher-resolution images. After fine-tuning for 10 epochs on ImageNet data at this resolution, the approach improves accuracy by 4% mAP.

3- Convolutional layer using anchor boxes

YOLOv2 simplifies the problem by removing the fully connected layers and predicting bounding boxes with anchor boxes, instead of directly predicting the exact coordinates of the boxes as YOLOv1 does. This approach lowered accuracy slightly but increased model recall by 7%, leaving more room for improvement.

4- Dimensional Clustering

The above mentioned anchor boxes are automatically discovered by YOLOv2 using k-means dimension clustering with k = 5 instead of manual selection. This novel approach provides a good trade-off between model recall and precision.
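The clustering step can be sketched as k-means over box widths and heights, where "closeness" is measured with IoU instead of Euclidean distance so that large boxes do not dominate. The code below is a simplified illustration: the function names, the random initialization, and the toy data are assumptions, and the article's k = 5 appears only as the default.

```python
import random


def wh_iou(wh, centroid):
    """IoU of two boxes that share the same top-left corner (width, height only)."""
    inter = min(wh[0], centroid[0]) * min(wh[1], centroid[1])
    union = wh[0] * wh[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes_wh, k=5, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchor shapes using IoU similarity."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes_wh, k)        # start from k random boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes_wh:
            # Assign each box to the centroid it overlaps most (highest IoU).
            best = max(range(k), key=lambda j: wh_iou(wh, centroids[j]))
            clusters[best].append(wh)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:         # converged
            break
        centroids = new_centroids
    return centroids


# Toy data: three small square boxes and three wide boxes.
toy = [(10, 10), (12, 12), (11, 11), (100, 50), (110, 55), (105, 52)]
print(sorted(kmeans_anchors(toy, k=2)))
```

On this toy data the two anchors settle on one small square shape and one wide shape, which is the behavior the clustering is meant to capture.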

For a better understanding of k-means dimensional clustering, check out our K-Means Clustering in Python with scikit-learn and K-Means Clustering in R tutorials. They dive into the concepts of k-means clustering using Python and R.

5- Fine-grained features

YOLOv2 predicts detections on a 13×13 feature map, which is sufficient for large objects. For finer objects, however, the architecture is modified by turning the 26×26×512 feature map into a 13×13×2048 feature map and concatenating it with the original features. This approach improved model performance by 1%.
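The 26×26×512 to 13×13×2048 conversion is a space-to-depth rearrangement: each 2×2 block of neighboring positions is stacked into the channel dimension, so no values are lost. The pure-Python sketch below only illustrates the shape change; the function name and nested-list layout are assumptions, and the channel ordering is not guaranteed to match the Darknet implementation.

```python
def space_to_depth(fmap, stride=2):
    """Rearrange an [H][W][C] nested list into [H/stride][W/stride][C*stride*stride]."""
    height, width = len(fmap), len(fmap[0])
    out = []
    for i in range(0, height, stride):
        row = []
        for j in range(0, width, stride):
            stacked = []
            # Concatenate the channels of every position in the stride x stride block.
            for di in range(stride):
                for dj in range(stride):
                    stacked.extend(fmap[i + di][j + dj])
            row.append(stacked)
        out.append(row)
    return out


# A 26x26 map with 512 channels becomes a 13x13 map with 2048 channels.
fine = [[[0.0] * 512 for _ in range(26)] for _ in range(26)]
coarse = space_to_depth(fine)
print(len(coarse), len(coarse[0]), len(coarse[0][0]))  # 13 13 2048
```

After this step the fine-grained map has the same 13×13 spatial size as the coarse map, so the two can be concatenated along the channel axis.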

YOLOv3 - Incremental Improvement

Incremental improvements were made on top of YOLOv2 to create YOLOv3.

Major changes include a new backbone network: Darknet-53. The backbone has 53 convolutional layers, and with the detection layers stacked on top, the full YOLOv3 network reaches 106 layers, including upsampling layers and residual blocks. Compared with Darknet-19, the backbone of YOLOv2, it is larger and more accurate, at some cost in speed. This new architecture is beneficial in many ways:

1- Better bounding box prediction

YOLOv3 uses logistic regression to predict an objectness score for each bounding box.

2- More accurate category prediction

Unlike YOLOv2, which uses softmax, YOLOv3 uses independent logistic classifiers to predict the classes of a bounding box. This is useful in more complex domains with overlapping labels (e.g. person → soccer player). Softmax constrains each box to a single class, an assumption that does not always hold.
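The difference between the two output layers can be seen on a small made-up example. In the sketch below, the logits, label names, and 0.5 threshold are illustrative assumptions: softmax forces the scores to compete and sum to 1, while independent sigmoids let several labels pass the threshold at once.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1 (single-label view)."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent logistic score in (0, 1) for one class."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_predict(logits, labels, threshold=0.5):
    """Keep every label whose independent sigmoid score exceeds the threshold."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) > threshold]


labels = ["person", "soccer player", "ball"]
logits = [2.0, 1.5, -3.0]                    # made-up raw scores for one box

probs = softmax(logits)
print(labels[probs.index(max(probs))])       # softmax picks a single class: person
print(multilabel_predict(logits, labels))    # ['person', 'soccer player']
```

With softmax, "soccer player" loses to "person" even though both are correct; with per-class sigmoids, both labels are kept.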

3- More accurate predictions at different scales

YOLOv3 makes predictions at three different scales for each location in the input image, using features upsampled from earlier layers. This strategy captures fine-grained and more meaningful semantic information, which improves the quality of the output.

YOLOv4 - Optimal Speed and Accuracy of Object Detection

This version of YOLO has the best object detection speed and accuracy compared to all previous versions and other state-of-the-art object detectors.

The image below shows that YOLOv4 improves on YOLOv3 by 10% in AP and by 12% in FPS.

YOLOv4 Speed compared to YOLOv3

YOLOv4 is specifically designed for production systems and optimized for parallel computing.

The backbone of the YOLOv4 architecture is CSPDarknet53, a network with 29 convolutional layers using 3×3 filters and approximately 27.6 million parameters.

Compared with YOLOv3, this architecture adds the following components for better object detection:

  • The Spatial Pyramid Pooling (SPP) block significantly increases the receptive field, separates the most relevant context features, and does not reduce the network speed.
  • YOLOv4 uses PANet instead of the Feature Pyramid Network (FPN) used in YOLOv3 to aggregate parameters from different detection levels.
  • Data augmentation uses the mosaic technique, which combines four training images, together with a self-adversarial training approach.
  • Optimal hyperparameter selection is performed using a genetic algorithm.

Receptive field: the region of the input image that a single point on a layer's output feature map maps back to. Put simply, it is the area of the original image that one feature of the convolutional neural network can "see".
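For a plain stack of convolution and pooling layers, the receptive field can be computed layer by layer: each layer widens it by (kernel size − 1) times the product of the strides of all earlier layers. The helper below is a small sketch of that standard recurrence; the function name and the (kernel_size, stride) input format are assumptions, and it ignores dilation and branching.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    field = 1   # a single input pixel sees only itself
    jump = 1    # distance in input pixels between neighboring units of the current layer
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Two stacked 3x3 convolutions with stride 1 see a 5x5 input region.
print(receptive_field([(3, 1), (3, 1)]))          # 5
# Inserting a 2x2 pooling layer with stride 2 makes later layers grow the field faster.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

This is why deeper layers, and layers that follow downsampling, "see" progressively larger parts of the image.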

YOLOR - You Only Learn One Representation

YOLOR is a unified network for multiple tasks that combines explicit and implicit knowledge.

YOLOR unified network architecture

Explicit and implicit knowledge

Explicit knowledge refers to normal or conscious learning. Implicit knowledge refers to learning that happens subconsciously through experience.

By combining the two, YOLOR is able to create a more powerful architecture based on three processes: (1) feature alignment, (2) prediction refinement for object detection, and (3) canonical representation for multi-task learning.

1- Feature alignment

This approach introduces an implicit representation into the feature maps of each Feature Pyramid Network (FPN), which can improve the accuracy by about 0.5%.

2- Prediction refinement for object detection

Model predictions can be refined by adding implicit representations to the output layer of the network.

3- Canonical representation for multi-task learning

Performing multi-task training requires performing joint optimization on a loss function shared by all tasks. This process may degrade the overall performance of the model, which can be mitigated by integrating canonical representations during model training.

From the graph below, we can see that YOLOR achieves the state-of-the-art inference speed compared to other models on MS COCO data.

YOLOR vs YOLOv4

YOLOX - Exceeding YOLO Series in 2021

YOLOX uses a modified version of YOLOv3 as its baseline, with Darknet-53 as the backbone network.

Described in the paper "YOLOX: Exceeding YOLO Series in 2021", YOLOX provides the following four key features to build a better model than earlier versions.

1- An efficient decoupled head

The coupled head used in previous YOLO versions was shown to degrade the performance of the model. YOLOX uses a decoupled head, which separates the classification and localization tasks and thus improves the performance of the model.

2- Powerful data augmentation

Integrating Mosaic and MixUp into the data augmentation method significantly improves the performance of YOLOX.

3- Anchor-free system

Anchor-based algorithms rely on clustering under the hood, which increases inference time. YOLOX removes the anchor mechanism, which reduces the number of predictions per image and significantly improves inference time.

4- SimOTA Label Assignment

The authors introduce SimOTA, a more powerful label assignment strategy, instead of using intersection-over-union (IoU) methods. SimOTA not only reduces training time, but also avoids additional hyperparameter issues, leading to state-of-the-art results. Furthermore, it improves detection mAP by 3%.

YOLOv5

Unlike other versions, YOLOv5 has no published research paper, and it is the first YOLO version implemented in PyTorch instead of Darknet.

Released by Glenn Jocher in June 2020, YOLOv5 is similar to YOLOv4 in using CSPDarknet53 as the backbone of its architecture. The release includes four model sizes: YOLOv5s (smallest), YOLOv5m, YOLOv5l, and YOLOv5x (largest).

A major improvement in the YOLOv5 architecture is the Focus layer, a single layer that replaces the first three layers of YOLOv3. This change reduces the number of layers and parameters and increases both forward and backward speed without a significant impact on mAP.

The illustration below compares the training time between YOLOv4 and YOLOv5s.

YOLOv4 vs YOLOv5 Training Time

YOLOv6 - A Single-Stage Object Detection Framework for Industrial Applications

The YOLOv6 (MT-YOLOv6) framework, released by the Chinese e-commerce company Meituan, is dedicated to industrial applications, with a hardware-friendly efficient design and high performance.

The framework is written in PyTorch. Although it is not part of the official YOLO series, it was named YOLOv6 because its backbone was inspired by the original single-stage YOLO architecture.

Compared with the previous YOLOv5, YOLOv6 introduces three major improvements: a hardware-friendly backbone and neck design, an efficient decoupled head, and a more effective training strategy.

As shown below, YOLOv6 shows significant improvements in both accuracy and speed on the COCO dataset over previous YOLO versions.

YOLO Model Comparison

  • The throughput of YOLOv6-N on NVIDIA Tesla T4 GPU is 1234 FPS, achieving 35.9% AP.

  • At 869 FPS, YOLOv6-S reaches 43.3% AP, setting a new state of the art.

  • YOLOv6-M and YOLOv6-L achieve better accuracy performance at the same inference speed, 49.5% and 52.3%, respectively.

YOLOv7 - Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

YOLOv7 was released in July 2022 in the paper "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors". This version made significant progress in the field of object detection, surpassing all previous models in both accuracy and speed.

YOLOV7 VS Competitors

YOLOv7 made major changes at two levels: (1) its architecture and (2) its trainable bag-of-freebies:

1- Architectural level

YOLOv7 improves its architecture by integrating the Extended Efficient Layer Aggregation Network (E-ELAN), which enables the model to learn more diverse features for better learning.

In addition, YOLOv7 scales its architecture by concatenating the architectures of the models it is derived from, such as YOLOv4, Scaled YOLOv4, and YOLO-R, so that the model can meet the needs of different inference speeds.

YOLO Compound Scaling Depth

2- Trainable bag-of-freebies

The term "bag-of-freebies" refers to methods that improve model accuracy by changing only the training procedure, without increasing the inference cost. This is why YOLOv7 improves detection accuracy as well as inference speed.

YOLOv8

YOLOv8 is a state-of-the-art (SOTA) model that builds on the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility.

Comparison with other YOLO versions [Source: Ultralytics]

It uses anchor-free detection and new convolutional layers to make predictions more accurate.

Comparison of different YOLOv8 versions [Source: Roboflow]

Results:

The results obtained by YOLOv8 on RF100 are improved compared to other versions.

img

Conclusion

This article has described the advantages of YOLO compared to other state-of-the-art object detection algorithms and traced its evolution since 2015.

Given how rapidly YOLO is developing, it is likely to remain a leading approach in the field of object detection for a long time.

The next step will be to apply the YOLO algorithm to real cases.

Origin blog.csdn.net/shangyanaf/article/details/130399439