Overview of object detection algorithms and commonly used libraries

Object detection is the process of locating and identifying objects in images, and it is one of the major achievements of deep learning and image processing. A common way to localize an object is to draw a bounding box around it. The approach is highly general: a single detection model can be trained to recognize and locate many specific object classes.

Typically, object detection models are trained to detect the presence of specific objects, and the trained model can be applied to images, video, or real-time streams. Object detection attracted widespread attention even before deep learning and modern image processing techniques emerged. Methods such as SIFT and HOG, with their feature and edge extraction techniques, were successful at object detection and had relatively few competitors in the field.

With the introduction of convolutional neural networks (CNNs) and advances in computer vision, object detection has become increasingly popular. The new wave of object detection driven by deep learning has opened up a wide range of possibilities.

Object detection exploits the distinctive properties of each category to identify the desired object. When looking for a square, a model can look for perpendicular corners that form a shape with four sides of equal length. When looking for a circular object, it can look for a center point from which a circle can be constructed. Such recognition techniques are used for face recognition and object tracking.

In this article we will explore different object detection algorithms and libraries.

Application scenarios of object detection

Object detection is already widely used in daily life. For example, smartphones are unlocked via facial recognition, and suspicious activity is identified in video surveillance of stores and warehouses.

Here are some of the main applications of object detection:

  • License Plate Recognition : Combines object detection and optical character recognition (OCR) technology to identify alphanumeric characters on vehicles. Object detection is used to capture images and detect vehicles in a specific image. After the model detects the license plate, OCR technology converts the 2D data into machine-encoded text.
  • Face Detection and Recognition : One of the major applications of object detection is face detection and recognition. With modern algorithms we can detect faces in images or videos. Thanks to one-shot learning, a face can even be recognized from a single training image.
  • Object Tracking : In a baseball or cricket game, the ball may be hit a great distance. In these situations it is useful to track the movement of the ball and the distance it covers. Object tracking provides continuous information about the direction of the ball's movement.
  • Self-driving cars : For self-driving cars, it is crucial to study the different elements surrounding the vehicle while driving. An object detection model trained on multiple categories is crucial for good performance of autonomous vehicles.
  • Robotics : Many tasks such as lifting, pick-and-place operations and other real-time work are performed by robots. Object detection is critical for robots to detect objects and automate tasks.

Since the popularity of deep learning in the early 2010s, the quality of algorithms used to solve object detection problems has continued to improve. We'll explore the most popular algorithms, understand how they work, their advantages, and their pitfalls in certain scenarios.

1. Histogram of Oriented Gradients (HOG)

Introduction

Histogram of Oriented Gradients (HOG) is one of the oldest object detection methods, having first appeared in 1986. Although there were some developments over the next decade, it was not until 2005 that this approach started to gain popularity in many computer vision-related tasks. HOG uses feature extractors to identify objects in images.

A feature descriptor is a representation of an image patch that keeps only the most essential information and discards the rest. Its job is to convert an image of a given size into an array, or feature vector. HOG uses gradient orientations to capture the most informative parts of the image.

Architecture overview

Before looking at the overall architecture of HOG, let us first see how it works. For each pixel in the image, the gradient is computed from the intensity differences of its horizontal and vertical neighbors. The gradient magnitude and gradient angle together describe how strongly, and in which direction, the image changes at that pixel.

Consider an image patch of a specific size. The first step is to compute the gradients and divide the patch into cells of 8×8 pixels. Each cell yields 64 gradient vectors, which are grouped into angle intervals (bins) to form a histogram for that cell. This reduces the 64 vectors to just 9 values.

Once we have the 9-bin histogram for each cell, we can group the cells into overlapping blocks. The final steps are to form the feature blocks, normalize the resulting feature vectors, and concatenate them to obtain the overall HOG descriptor.
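The per-cell step described above can be sketched in plain Python. This is a minimal illustration, not a full HOG implementation: the function names `cell_histogram` and `l2_normalize` are hypothetical, gradients use simple central differences clamped at the cell border, and each pixel votes into a single bin (production implementations usually split the vote between neighboring bins and normalize over blocks of cells).

```python
import math

def cell_histogram(cell, bins=9):
    """Return an unsigned-orientation gradient histogram for one square cell.

    `cell` is a list of rows of grayscale intensities. Gradients use central
    differences, with border pixels clamped to the cell edge.
    """
    size = len(cell)
    bin_width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(size):
        for x in range(size):
            gx = cell[y][min(x + 1, size - 1)] - cell[y][max(x - 1, 0)]
            gy = cell[min(y + 1, size - 1)][x] - cell[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            # Fold the angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

def l2_normalize(vector, eps=1e-6):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vector) + eps ** 2)
    return [v / norm for v in vector]

# An 8x8 cell with a vertical edge: dark left half, bright right half.
cell = [[0] * 4 + [255] * 4 for _ in range(8)]
hist = cell_histogram(cell)
descriptor = l2_normalize(hist)
```

For this cell all of the gradient energy falls into the first bin, because the intensity changes only in the horizontal direction.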

HOG's achievements

  1. Created a feature descriptor for performing object detection.
  2. Can be combined with support vector machines (SVMs) to achieve high-precision object detection.
  3. Creates a sliding window effect for calculations at each location.
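The sliding-window idea in the list above can be illustrated with a short generator. This is only a sketch: `sliding_windows` is a hypothetical helper, and the 64×128 window and 8-pixel stride are assumed values chosen for illustration. In a HOG pipeline, each window would be turned into a descriptor and scored by a classifier such as an SVM.

```python
def sliding_windows(image_width, image_height, window=(64, 128), stride=8):
    """Yield (x, y, w, h) for every window that fits fully inside the image."""
    win_w, win_h = window
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Enumerate all window positions on a 128x256 image.
windows = list(sliding_windows(128, 256))
```

The number of windows grows quickly with image size and shrinking stride, which is one reason the approach is computationally expensive.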

Points to consider

Limitations - Although Histogram of Oriented Gradients (HOG) was quite revolutionary in the early stages of object detection, the method has several problems. Computing gradients for every pixel of a complex image is time-consuming, and the method does not work well in some object detection scenarios.

When to use HOG?

HOG is best used as a first baseline for object detection, against which other algorithms and their performance can be compared. Even so, HOG achieves considerable accuracy in many object detection and facial feature recognition tasks.

2. Region-based convolutional neural network (R-CNN)

Introduction

Region-based convolutional neural network (R-CNN) improves on earlier object detection methods such as HOG and SIFT. In the R-CNN model, we extract a set of candidate regions, known as region proposals (usually around 2000 per image), with the help of the selective search algorithm, which picks out the regions most likely to contain objects.

The working process of R-CNN

Selective search works by generating many small sub-segments of the image and then using a greedy algorithm to repeatedly merge similar segments into larger ones. The regions produced along the way become the candidate region proposals for the task.
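The greedy merging idea can be sketched as follows. This is a heavily simplified illustration and not the actual selective search algorithm: it assumes each segment is described only by a bounding box, a mean intensity, and an area, and it measures similarity by mean intensity alone, whereas the real algorithm combines several similarity measures. All function names are hypothetical.

```python
def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def touches(a, b):
    """True when two boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def greedy_proposals(regions):
    """Greedily merge the most similar neighbouring regions.

    `regions` is a list of (box, mean_intensity, area) tuples. Every box that
    exists at any point of the merging is returned as a proposal.
    """
    regions = list(regions)
    proposals = [box for box, _, _ in regions]
    while len(regions) > 1:
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if not touches(regions[i][0], regions[j][0]):
                    continue
                diff = abs(regions[i][1] - regions[j][1])
                if best is None or diff < best[0]:
                    best = (diff, i, j)
        if best is None:  # no neighbouring pairs left to merge
            break
        _, i, j = best
        (box_i, mean_i, area_i), (box_j, mean_j, area_j) = regions[i], regions[j]
        area = area_i + area_j
        merged = (merge_boxes(box_i, box_j),
                  (mean_i * area_i + mean_j * area_j) / area, area)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])
    return proposals

# Three side-by-side segments: two dark ones and one bright one.
segments = [
    ((0, 0, 10, 10), 50.0, 100),
    ((10, 0, 20, 10), 55.0, 100),
    ((20, 0, 30, 10), 200.0, 100),
]
proposals = greedy_proposals(segments)
```

Here the two dark segments are merged first because they are the most similar neighbors, and the bright segment is only absorbed in the final step.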

Once selective search has finished, the next task is to extract features and make predictions. Each candidate proposal is passed through a convolutional neural network, which outputs an n-dimensional (2048 or 4096) feature vector. Using a pre-trained convolutional neural network makes this feature extraction step straightforward.

The final step of R-CNN is to classify each region and label its bounding box. A classification model predicts the object class of each proposal, while a regression model corrects the bounding box coordinates of the proposed region.
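To make the bounding box correction concrete, here is a minimal sketch of how regression outputs are commonly applied to a proposal. The function name `apply_box_deltas` is hypothetical, and the parameterization (center shift scaled by box size, width and height scaled in log space) is assumed to be the usual R-CNN-style one.

```python
import math

def apply_box_deltas(proposal, deltas):
    """Refine a proposal (x1, y1, x2, y2) with regression outputs (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = proposal
    dx, dy, dw, dh = deltas
    width, height = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * width, y1 + 0.5 * height
    # Shift the centre proportionally to the box size, scale the size in log space.
    new_cx, new_cy = cx + dx * width, cy + dy * height
    new_w, new_h = width * math.exp(dw), height * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

# Shift a 40x80 proposal to the right by 10% of its width.
refined = apply_box_deltas((10.0, 20.0, 50.0, 100.0), (0.1, 0.0, 0.0, 0.0))
```

With all deltas set to zero the proposal is returned unchanged, so the regressor only has to learn small corrections.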

Problems with R-CNN

  1. Although feature extraction is efficient using pre-trained CNN models, the entire process of extracting all region proposals and ultimately the best region is very slow using current algorithms.
  2. Another major disadvantage of the R-CNN model is that both training and prediction are slow. The approach requires significant computing resources, which reduces the feasibility of the entire process. The overall architecture can therefore be considered quite expensive.
  3. The initial step can produce poor candidates, because selective search is a fixed algorithm and cannot be improved by learning at this stage. This can cause many problems when training the model.

When to use R-CNN?

Like HOG, R-CNN is best used as a first baseline for testing the performance of object detection models. Because prediction can take longer than expected, the more modern versions of R-CNN are usually preferred.

3. Fast R-CNN and Faster R-CNN

Introduction

Although the R-CNN model achieves ideal results in object detection, it suffers from some major shortcomings in speed. To solve this problem, faster methods were introduced, including Fast R-CNN and Faster R-CNN.

Faster R-CNN and Fast R-CNN are both object detection algorithms in the R-CNN family. They provide improvements over the original R-CNN in terms of performance and speed. Here's a brief comparison of the two methods:

Fast R-CNN

  1. Speed : Fast R-CNN is faster than the original R-CNN because it avoids repeated calculations for each sub-region by applying a convolutional neural network over the entire image.
  2. RoI pooling : Fast R-CNN introduces Region of Interest (RoI) pooling, which extracts a fixed-size feature for each region proposal produced by selective search from the feature map of the pre-trained network.
  3. End-to-end training : Fast R-CNN can be trained end-to-end, which means the entire network can be trained at once without the need for staged training.
  4. Limitations : Fast R-CNN still uses a selective search algorithm to generate region proposals, which can lead to speed bottlenecks.
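RoI pooling, mentioned in the list above, can be sketched in a few lines. This is an illustrative version under simplifying assumptions: the feature map is a single-channel 2D list, the region is given in feature-map cells with inclusive corners, and `roi_max_pool` is a hypothetical name. The point is that regions of any size are reduced to the same fixed grid.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (x1, y1, x2, y2) (inclusive, in feature-map
    cells) of a 2D feature map into a fixed output_size grid."""
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Integer bin edges; max(...) keeps every bin at least one cell tall.
        y_start = y1 + (i * roi_h) // out_h
        y_end = max(y1 + ((i + 1) * roi_h) // out_h, y_start + 1)
        row = []
        for j in range(out_w):
            x_start = x1 + (j * roi_w) // out_w
            x_end = max(x1 + ((j + 1) * roi_w) // out_w, x_start + 1)
            row.append(max(feature_map[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled

# A 6x6 feature map whose values increase from left to right, top to bottom.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(feature_map, roi=(1, 1, 4, 5))
```

Because the output size is fixed, the layers after RoI pooling can be ordinary fully connected layers regardless of the proposal's shape.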

Faster R-CNN

  1. Speed : Faster R-CNN is faster than Fast R-CNN, mainly due to the introduction of the Region Proposal Network (RPN).
  2. Region Proposal Network : Faster R-CNN replaces the selective search algorithm with RPN to generate region proposals faster.
  3. End-to-end training : Like Fast R-CNN, Faster R-CNN can also be trained end-to-end.
  4. Performance : Faster R-CNN shows high accuracy in object detection tasks, thanks to its consideration of multiple scales, sizes, and aspect ratios of anchor boxes.

In short, Faster R-CNN is an improved version of Fast R-CNN, which mainly accelerates the generation process of region proposals by introducing a Region Proposal Network (RPN). This makes Faster R-CNN improved compared to Fast R-CNN in terms of speed and performance.

Faster R-CNN architecture

Faster R-CNN is one of the best versions of the R-CNN family, with greatly improved performance and speed. While the R-CNN and Fast R-CNN models use selective search to compute region proposals, Faster R-CNN introduces a superior Region Proposal Network (RPN). The RPN produces proposals efficiently by evaluating boxes over a wide range of scales and aspect ratios across the image.

The region proposal network greatly reduces the marginal cost of computing proposals, typically to about 10 milliseconds per image. The network consists of convolutional layers that produce a feature map of the image. At each location of the feature map, we generate multiple anchor boxes with different scales and aspect ratios. For each anchor box, the network makes a binary prediction of whether it contains an object or background and regresses the corresponding bounding box.
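Anchor generation can be sketched as below. The function name `generate_anchors` is hypothetical, and the stride, scales, and aspect ratios are assumed values that follow commonly cited defaults (three scales times three ratios, giving nine anchors per location); a real implementation would take them from the model configuration.

```python
import math

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred on every feature-map cell.

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Centre of this feature-map cell in input-image coordinates.
            cx, cy = (fx + 0.5) * stride, (fy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A tiny 4x3 feature map yields 4 * 3 * 9 anchors.
anchors = generate_anchors(feat_w=4, feat_h=3)
```

Anchors near the border extend outside the image, so implementations typically clip or ignore them.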

Next, non-maximum suppression is used to eliminate redundant, heavily overlapping proposals. The proposals that survive non-maximum suppression are passed to the RoI pooling layer, and the remaining steps are the same as in Fast R-CNN.
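A minimal sketch of greedy non-maximum suppression is shown below. The names `iou` and `non_max_suppression` are illustrative, and the 0.5 overlap threshold is an assumed value; the idea is to keep the highest-scoring box and discard any lower-scoring box that overlaps a kept box too much.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box is kept.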

Advantages and limitations of Fast R-CNN

Advantages

  1. Speed : Fast R-CNN has a significant improvement in speed compared to the original R-CNN. This is mainly due to the application of convolutional neural networks on the entire image, avoiding repeated calculations for each sub-region.
  2. RoI pooling : Fast R-CNN introduces Region of Interest (RoI) pooling, which extracts a fixed-size feature for each region proposal produced by selective search from the feature map of the pre-trained network.
  3. End-to-end training : Fast R-CNN can be trained end-to-end, which means the entire network can be trained at once without the need for staged training.
  4. Accuracy : Fast R-CNN exhibits high accuracy in object detection tasks, thanks to a multi-task loss that trains classification and bounding box regression jointly.

Limitations

  1. Region proposals : Fast R-CNN still uses a selective search algorithm to generate region proposals, which can lead to speed bottlenecks.
  2. Real-time applications : Although the speed of Fast R-CNN has improved compared to the original R-CNN, in real-time applications, it may still not be able to meet strict real-time requirements. For these application scenarios, faster detection methods such as YOLO or SSD can be considered.
  3. Computing resources : Although Fast R-CNN has improved in speed, it still requires more computing resources, especially when processing high-resolution images.
  4. Small object detection : Fast R-CNN may not perform well when detecting small objects because its feature extraction process may cause information loss of small objects. To address this problem, you can try to use other methods, such as Feature Pyramid Networks (FPN) to improve the model.

Advantages and limitations of Faster R-CNN

Advantages

  1. Speed : Compared with R-CNN and Fast R-CNN, Faster R-CNN has significantly improved speed, which is mainly due to the introduction of the Region Proposal Network (RPN).
  2. Accuracy : Faster R-CNN shows high accuracy in object detection tasks, thanks to its consideration of multiple scales, sizes, and aspect ratios of anchor boxes.
  3. End-to-end training : Faster R-CNN can be trained end-to-end, which means the entire network can be trained at once without the need for staged training.

Limitations

  1. Computing resources : Although Faster R-CNN has improved in speed, it still requires more computing resources, especially when processing high-resolution images.
  2. Real-time applications : Although the speed of Faster R-CNN has improved compared to its predecessors, in real-time applications, it may still not be able to meet strict real-time requirements. For these application scenarios, faster detection methods such as YOLO or SSD can be considered.
  3. Small object detection : Faster R-CNN may not perform well when detecting small objects because its feature extraction process may cause information loss of small objects. To address this problem, you can try to use other methods, such as Feature Pyramid Networks (FPN) to improve the model.

4. Single Shot MultiBox Detector (SSD)

Introduction

Single Shot MultiBox Detector (SSD) is one of the most efficient ways to perform object detection in real time. Faster R-CNN achieves high accuracy, but it runs at only about 7 frames per second, which falls short of the needs of real-time applications.

SSD solves the time-consuming problem of the Faster R-CNN method by increasing the number of frames per second by nearly five times. It abandons the region proposal network and instead uses multi-scale features and default boxes for object detection.
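The default boxes mentioned above can be illustrated with a short sketch. It assumes the commonly described scheme in which box scales are spaced linearly between a minimum and a maximum fraction of the image size across the feature maps; the function names and the values 0.2, 0.9, and the aspect ratios are assumptions for illustration.

```python
import math

def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (fraction of image size)."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of default boxes for one scale; ratio = width / height."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]

# Six feature maps: fine maps get small boxes, coarse maps get large boxes.
scales = default_box_scales(6)
shapes = default_box_shapes(scales[0])
```

Assigning small boxes to high-resolution feature maps and large boxes to low-resolution ones is what lets a single forward pass cover objects of different sizes.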

Architecture overview

SSD Architecture

The architecture of SSD is mainly divided into three parts. The first is the feature extraction stage, which selects key feature maps. This part of the architecture only contains fully convolutional layers and no other layers. After extracting all necessary feature maps, the next step is the processing of the detection head, which also includes a fully convolutional neural network.

In this second stage, however, the task of the detection head is not to find the semantic meaning of the image but to produce the most suitable bounding boxes for all feature maps. After these two critical stages, the final stage passes the results through a non-maximum suppression layer to reduce the errors caused by duplicate bounding boxes.

SSD limitations

Although SSD significantly improves speed, it works on reduced-resolution versions of the image, which loses fine detail. For small objects, the SSD architecture generally performs worse than Faster R-CNN.

When to use SSD?

Single-shot detectors are often the preferred method. The main reason to choose one is that fast predictions on images with larger objects matter more than the highest possible accuracy. For smaller objects that require more precise predictions, other methods should be considered.

5. YOLO (You Only Look Once)

An overview of YOLO object detection for beginners

6. RetinaNet

Introduction

RetinaNet is an object detection model introduced in 2017. It was considered one of the best single-shot object detection models at the time, surpassing other popular object detection algorithms. RetinaNet matches the R-CNN family in accuracy while maintaining a speed comparable to YOLOv2 and SSD. Because it is both efficient and accurate, RetinaNet is widely used in fields such as object detection in satellite imagery.

Architecture overview

RetinaNet Architecture

RetinaNet's architecture produces more effective and efficient results by addressing, to some extent, the problems of previous single-shot detectors. In this architecture, focal loss is used instead of the traditional cross-entropy loss, which addresses the class imbalance problem found in architectures such as YOLO and SSD.
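The difference between focal loss and cross-entropy can be sketched for a single binary prediction. The function names are illustrative, and the defaults alpha = 0.25 and gamma = 2 are assumed as commonly used values. The factor (1 - p_t)^gamma shrinks the loss of easy, well-classified examples so that the many easy background boxes do not dominate training.

```python
import math

def cross_entropy(p, target):
    """Binary cross-entropy for one prediction; p is P(positive class)."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, target is 0 or 1.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)  # confident and correct
hard = focal_loss(0.10, 1)  # confident and wrong
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy.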

RetinaNet is built from three components: a ResNet backbone (specifically ResNet-101), a Feature Pyramid Network (FPN), and focal loss. The Feature Pyramid Network is one of the best ways to overcome most of the shortcomings of earlier architectures. It combines the semantically rich features of low-resolution feature maps with the semantically weak features of high-resolution feature maps.
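The way a feature pyramid combines coarse and fine features can be sketched with plain lists. This is a deliberately reduced illustration: it assumes single-channel maps whose resolutions differ by exactly a factor of two, uses nearest-neighbor upsampling followed by element-wise addition, and leaves out the convolutions that a real network applies around the merge. The function names are hypothetical.

```python
def upsample2x(feature):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def top_down_merge(coarse, fine):
    """Add an upsampled coarse map to a fine map of twice the resolution."""
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]

coarse = [[1, 2],
          [3, 4]]                      # low resolution, semantically strong
fine = [[10] * 4 for _ in range(4)]    # high resolution, semantically weak
merged = top_down_merge(coarse, fine)
```

Repeating this merge from the coarsest level down gives every pyramid level access to high-level semantics while keeping its own spatial detail.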

In the final output, we can create classification and regression models similar to those in the other object detection methods discussed above. A classification network makes multi-class predictions, while a regression network predicts the bounding boxes.

When to use RetinaNet?

RetinaNet is currently one of the best object detection methods in many different tasks. It can be used as an alternative to single-shot detectors for a variety of tasks to achieve fast and accurate image results.

Object detection libraries

1. ImageAI

Introduction

The ImageAI library is designed to provide developers with a variety of computer vision algorithms and deep learning methods to complete tasks related to object detection and image processing. The main goal of the ImageAI library is to provide a concise and efficient way to write object detection projects.

To learn more about this topic, be sure to visit the official documentation for the ImageAI library linked below. Most of the available code is written in Python with the popular deep learning framework TensorFlow. As of June 2021, the library uses a PyTorch backend for its image processing computations.

Overview

The ImageAI library supports a large number of object detection-related operations, including image recognition, image object detection, video object detection, video detection analysis, custom image recognition training and inference, and custom object detection training and inference. Image recognition capabilities can identify up to 1,000 different objects in a given image.

Image and video object detection tasks will help detect 80 of the most common objects in daily life. Video detection analysis will help in computing timely analysis of specific objects detected in videos or in real-time. In this library, it is also possible to introduce custom images to train your own samples. With updated images and datasets, you can train more objects for object detection tasks.

GitHub

https://github.com/OlafenwaMoses/ImageAI

2. GluonCV

Introduction

GluonCV is one of the best library frameworks with state-of-the-art implementation of deep learning algorithms for various computer vision applications. The main goal of this library is to help enthusiasts in this field achieve productive results in a shorter time. It has some of the best features including large training datasets, implementation techniques, and well-designed APIs.

Overview

The GluonCV library framework supports a large number of tasks you can accomplish with it. These projects include image classification tasks, image, video or real-time object detection tasks, semantic segmentation and instance segmentation, pose estimation to determine the pose of a specific body, and action recognition to detect the type of human activity being performed. These features make this library one of the best object detection libraries for faster results.

The framework provides all the state-of-the-art technologies required to perform the aforementioned tasks. It supports MXNet and PyTorch, and comes with extensive tutorials and additional support from which you can start exploring numerous concepts. It contains a large collection of training models from which you can explore and create specific machine learning models to perform specific tasks.

Once you have MXNet or PyTorch installed in your virtual environment, you can install this object detection library by following the instructions in its repository, choosing the settings that suit your setup. It also gives you access to the Model Zoo, a collection of models that makes it easy to deploy machine learning models. All these features make GluonCV an excellent object detection library.

GitHub

https://github.com/dmlc/gluon-cv

3. Detectron2

Introduction

Developed by Facebook's AI Research (FAIR) team, the Detectron2 framework is considered a next-generation library that supports most of the most advanced detection technologies, object detection methods, and segmentation algorithms. The Detectron2 library is an object detection framework based on PyTorch. The library is highly flexible and scalable, providing users with a variety of high-quality implementation algorithms and techniques. It also supports numerous applications and production projects on Facebook.

Overview

The Detectron2 library, developed by Facebook on top of PyTorch, has great practical value and can be trained on a single GPU or multiple GPUs to produce fast and effective results. With it you can implement several high-quality object detection algorithms. The state-of-the-art techniques and algorithms supported by the library include DensePose, Panoptic Feature Pyramid Networks, and many other variants of the Mask R-CNN model family.

The Detectron2 library also allows users to easily train custom models and datasets. The installation process is fairly simple and needs only two dependencies: PyTorch and the COCO API. Once these requirements are met, you can install Detectron2 and train a wide range of models. To learn more about using the library, see the repository linked below.

GitHub

https://github.com/facebookresearch/detectron2

4. YOLOv3_TensorFlow

Introduction

YOLOv3, released in 2018, is one of the most successful models in the YOLO series. It improves on the previous versions, with significant gains in detection speed and accuracy. The YOLOv3_TensorFlow library is a YOLOv3 implementation based on TensorFlow and aims to provide developers with an easy-to-use object detection tool.

Overview

The YOLOv3_TensorFlow library supports real-time object detection tasks, suitable for images and videos. It provides pre-trained weight files that can be directly used for object detection. Additionally, you can use custom datasets to fine-tune your model to fit specific application scenarios.

The main features of the YOLOv3_TensorFlow library include:

  • High-speed real-time object detection
  • Supports multiple object categories
  • Can run on CPU and GPU
  • Supports custom dataset training

To use the YOLOv3_TensorFlow library, you need to install TensorFlow and other related dependencies. After meeting these requirements, you can clone the GitHub repository and start using YOLOv3 for object detection tasks.

GitHub

https://github.com/wizyoung/YOLOv3_TensorFlow

5. EfficientDet

Introduction

EfficientDet is an efficient object detection model developed by the Google Brain team. It is built on the EfficientNet backbone and uses a weighted Bidirectional Feature Pyramid Network (BiFPN), an extension of the Feature Pyramid Network (FPN). EfficientDet performs well in terms of speed and accuracy and is an object detection library worth paying attention to.

Overview

The EfficientDet library provides a variety of pre-trained models suitable for different computing capabilities and application scenarios. It supports real-time object detection tasks and can run on CPU, GPU and TPU. EfficientDet also allows users to train with custom datasets to suit specific needs.

Key features of the EfficientDet library include:

  • Efficient object detection performance
  • Supports multiple object categories
  • Can run on CPU, GPU and TPU
  • Supports custom dataset training

To use the EfficientDet library, you need to install TensorFlow and other related dependencies. After meeting these requirements, you can clone the GitHub repository and start using EfficientDet for object detection tasks.

GitHub

https://github.com/google/automl/tree/master/efficientdet

Conclusion

Object detection remains one of the most important deep learning and computer vision applications to date. We have seen many improvements and advances in object detection methods.

It started with algorithms such as Histogram of Oriented Gradients, introduced in 1986, which performed simple object detection on images with considerable accuracy. Now we have more modern architectures such as Faster R-CNN, Mask R-CNN, YOLO and RetinaNet.

Object detection is not limited to images and can be effectively performed on video and live footage with high accuracy. In the future, we will also see more successful object detection algorithms and libraries.

Origin blog.csdn.net/shangyanaf/article/details/132988174