02 Deep Learning - Object Detection - Introduction to Traditional Methods

1. The evolution and comparison of object detection methods

       "Object detection" is a research hotspot in computer vision and machine learning. From the "cold weapon era" of hand-crafted detectors such as the Viola-Jones detector and DPM to today's GPU-powered deep learning models such as R-CNN and YOLO, the development of object detection reads like a condensed history of computer vision. The entire development process is summarized in the figure below:

        As the figure shows, before 2012 traditional hand-crafted feature algorithms dominated object detection. With the rise of convolutional neural networks (CNNs) in 2012, however, object detection shifted to deep learning, whose results are far better than those of hand-crafted features. To this day, deep-learning-based detectors remain the mainstream of object detection.

    Although deep learning algorithms far outperform hand-crafted features in object detection, the traditional algorithms laid the groundwork and are still worth understanding. This article records the beginning of my study of object detection algorithms and gives an in-depth introduction to the principles and effects of the traditional algorithms.

Prerequisite knowledge:

  • Feature extraction commonly uses color-based, texture-based, shape-based, and semantics-based image representations from computer vision and pattern recognition.
  • Feature extraction methods in computer vision fall into three categories: low-level, mid-level, and high-level features; the first two are the most commonly used.

         Low-level features: basic features such as color, texture, and shape, usually hand-designed.
         Mid-level features: features learned and mined on top of low-level features.
         High-level features: semantic features learned on top of low- or mid-level features, e.g., whether a person is wearing a hat.

  • Use a classifier to make a decision on the extracted features:

        (Binary) decide whether the current window is background or the target to be detected;
        (Multi-class) decide whether it is background, and if not, which category it belongs to.

2. Definition of detection algorithms based on traditional hand-crafted features

       In the development of object detection, traditional algorithms based on hand-crafted features were once the mainstream. These algorithms identify targets by designing and extracting manually engineered features, including Haar features, HOG features, and SIFT features. This article explores these traditional algorithms in depth and introduces their principles, advantages and disadvantages, and applications in computer vision.


       Traditional detection algorithms based on hand-crafted features are an early family of object detectors that identify targets through manually designed and extracted features. These features are usually based on local image information such as edges, texture, and color. On top of the extracted features, traditional algorithms typically use a classifier or detector to decide whether a target is present in the image and to report its location and size.

3. The main traditional hand-crafted features and algorithms

Haar features and the face detection algorithm - Viola-Jones (understand)

  • Haar feature extraction
  • Train a face classifier (e.g., with the AdaBoost algorithm)
  • Sliding window (problem: choosing a good step size for the sliding windows)

       Haar features are rectangle-difference features computed on the image matrix and were first used in face detection. The Viola-Jones algorithm is a fast face detector based on Haar features; it uses AdaBoost for feature selection and a cascade of classifiers. The algorithm achieved remarkable accuracy and efficiency in face detection tasks.

       In 2004, Paul Viola and Michael Jones published the epoch-making article "Robust Real-Time Face Detection" in IJCV (building on their 2001 CVPR paper); later researchers called the face detector it describes the Viola-Jones (VJ) detector. The VJ detector achieved real-time face detection for the first time with extremely limited computing resources, running tens or even hundreds of times faster than contemporary detection algorithms, which greatly advanced the commercialization of face detection. The ideas behind the VJ detector profoundly influenced the development of object detection for at least a decade.

       The VJ detector uses the most traditional and conservative detection strategy, sliding-window detection: it traverses every scale and every pixel position of the image and decides, window by window, whether the current window contains a face. The idea seems simple, but the computational cost is enormous. Three key elements allow VJ face detection to run in real time on limited hardware: fast computation of multi-scale Haar features, an effective feature selection algorithm, and an efficient multi-stage processing strategy.

       For fast computation of multi-scale Haar features, the VJ detector uses the integral image to accelerate feature extraction. With an integral image, the cost of computing a feature is independent of the window size, and the time-consuming construction of an image pyramid for multi-scale processing is avoided.
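As a minimal sketch (not the original OpenCV implementation), an integral image can be built in one pass so that the sum of any rectangle costs only four table lookups, independent of window size:

```python
def integral_image(img):
    """Build a summed-area table: ii[y][x] = sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 0, 0, 3, 3))  # 45: the whole image
print(rect_sum(ii, 1, 1, 2, 2))  # 28: 5 + 6 + 8 + 9
```

However large the rectangle, `rect_sum` does the same four lookups, which is exactly why Haar feature computation becomes independent of window size.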

       For feature selection, unlike hand-crafted features in the traditional sense, the Haar features used in the VJ detector are not pre-designed by humans. The VJ detector starts from an over-complete set of Haar features and uses the AdaBoost algorithm to select, from a huge feature pool (about 180k dimensions), the very few features most useful for face detection, reducing unnecessary computational overhead.

       For multi-stage processing, the authors proposed a cascaded decision structure, vividly named "cascades". The whole detector is composed of multiple AdaBoost stages, each made up of several weak decision stumps. The core idea of the cascade is to spend little computation on background windows and more on candidate target windows: as soon as any stage decides the current window is background, the remaining stages are skipped and the detector moves on to the next window.
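The early-rejection logic can be sketched with hypothetical stages (here each stage is just a scoring function and a threshold; the real detector uses boosted Haar stumps):

```python
def cascade_classify(window, stages):
    """Run a window through cascade stages; reject as soon as any stage fails.

    `stages` is a list of (score_fn, threshold) pairs, ordered cheap-to-expensive.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # background: stop, spend no more computation here
    return True  # survived every stage: candidate face

# Toy stages (hypothetical scores): mean brightness, then contrast.
stages = [
    (lambda w: sum(w) / len(w), 50),   # cheap first stage
    (lambda w: max(w) - min(w), 30),   # more selective second stage
]
print(cascade_classify([60, 80, 100, 40], stages))  # True
print(cascade_classify([10, 12, 11, 13], stages))   # False (rejected at stage 1)
```

Because most windows in a typical image are background, most of them are rejected by the first, cheapest stage, which is where the speedup comes from.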

Haar features are mainly difference-based; there are four basic types:

value = white - black

  • The first type, edge features: two adjacent rectangles are differenced, in four orientations (0, 180, 45, and 135 degrees).
  • The second type, linear features: a wider region is split so that two rectangles are differenced.
  • The third type, center features: the surrounding region is differenced against the central region, similar to LDP features.
  • The fourth type: relationships among multiple rectangles.

    A histogram is then built from the extracted values. Since the difference operation itself computes a gradient, Haar features are essentially gradient features.
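A two-rectangle edge feature, computed directly as value = white - black, can be sketched as follows (plain pixel sums are used here for clarity; in the real detector an integral image makes each sum O(1)):

```python
def region_sum(img, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return sum(img[yy][xx] for yy in range(y, y + h) for xx in range(x, x + w))

def haar_two_rect_horizontal(img, x, y, w, h):
    """Edge-type Haar feature: white (left half) minus black (right half)."""
    half = w // 2
    white = region_sum(img, x, y, half, h)
    black = region_sum(img, x + half, y, half, h)
    return white - black

# A vertical edge: bright on the left, dark on the right.
img = [[200, 200, 10, 10],
       [200, 200, 10, 10]]
print(haar_two_rect_horizontal(img, 0, 0, 4, 2))  # 760 = 200*4 - 10*4
```

A strong response (large |value|) indicates an intensity edge aligned with the feature, which is why these differences behave like gradients.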

AdaBoost algorithm: an ensemble learning method in machine learning.

  • Initialize the sample weights w so that they sum to 1
  • Train a weak classifier
  • Update the sample weights (increase the weights of misclassified samples)
  • Loop back to the second step
  • Combine the results of all weak classifiers by weighted vote
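The steps above can be sketched as a minimal AdaBoost on 1-D data with threshold stumps as the weak classifiers (a toy illustration, not the VJ feature-selection code):

```python
import math

def adaboost(xs, ys, rounds=5):
    """Minimal AdaBoost with threshold stumps on 1-D data; ys are in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n                       # step 1: weights summing to 1
    ensemble = []                           # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        # step 2: pick the stump h(x) = polarity * sign(x - t) with least weighted error
        best = None
        for t in xs:
            for polarity in (1, -1):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if polarity * (1 if xi > t else -1) != yi)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, t, polarity))
        # step 3: raise the weights of misclassified samples, then renormalize
        for i in range(n):
            pred = polarity * (1 if xs[i] > t else -1)
            w[i] *= math.exp(-alpha * ys[i] * pred)
        s = sum(w)
        w = [wi / s for wi in w]            # step 4: loop back to step 2
    return ensemble

def predict(ensemble, x):
    # step 5: weighted vote of all weak classifiers
    score = sum(a * p * (1 if x > t else -1) for a, t, p in ensemble)
    return 1 if score > 0 else -1

xs = [1, 2, 3, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])  # [-1, -1, -1, 1, 1, 1]
```

In the VJ detector each stump is tied to one Haar feature, so boosting doubles as feature selection: features whose stumps are never picked simply cost nothing at test time.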

Sliding window method:

       Slide windows of different sizes over the input image from left to right and top to bottom, running the (pre-trained) classifier on the current window at each position. If the current window receives a high classification probability, an object is considered detected. After windows of every size have been evaluated, the detections from different windows overlap heavily, so the non-maximum suppression (NMS) method is finally used to filter them.
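The final filtering step can be sketched as a greedy NMS over `(x1, y1, x2, y2)` boxes (a standard formulation; the 0.5 IoU threshold is a typical choice, not a fixed constant):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring boxes and
    drop any box that overlaps an already-kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```

Boxes 0 and 1 overlap by IoU ≈ 0.82, so only the higher-scoring one survives, while the distant box 2 is untouched.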
 

HOG features and the SVM algorithm (understand) (pedestrian detection, OpenCV implementation)

  • Extract HOG features
  • Train an SVM classifier
  • Use a sliding window to extract candidate regions and classify them
  • NMS
  • Output the detection results

       The HOG (Histogram of Oriented Gradients) feature describes the local gradient orientations of an image and is widely used in pedestrian detection and object recognition. Combined with an SVM (Support Vector Machine) classifier, HOG features can perform detection tasks in complex scenes.
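The core of the descriptor, a magnitude-weighted orientation histogram for a single cell, can be sketched as follows (a simplified version without bin interpolation or block normalization; 9 unsigned bins is the common choice):

```python
import math

def cell_hog(img, bins=9):
    """Orientation histogram of one cell: unsigned gradients (0-180 degrees),
    magnitude-weighted, as in a basic HOG descriptor."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # central difference in x
            gy = img[y + 1][x] - img[y - 1][x]   # central difference in y
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += mag
    return hist

# A cell containing a pure vertical edge: all gradient energy lands in one bin.
cell = [[0, 0, 255, 255]] * 4
print(cell_hog(cell))  # only the 0-degree bin is nonzero
```

In the full descriptor, several such cell histograms are grouped into blocks, L2-normalized, and concatenated, which is what the feature extraction steps below describe.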


HOG feature extraction steps:

  • Grayscale conversion + gamma correction (e.g., taking the square root of pixel values to normalize illumination)
  • Compute the gradient map (the gradient in the x and y directions at each pixel; the orientation angle follows from tan(theta) = gy/gx)
  • Divide the image into small cells and compute a gradient orientation histogram for each cell
  • Group several cells into a block and normalize the concatenated block feature
  • Concatenate all blocks and normalize again
  • The dimensionality depends on the number of quantized angles and the cell size (the smaller the cell, the higher the dimension), and it is usually very large

SIFT features and the SIFT algorithm (understand)

       SIFT (Scale-Invariant Feature Transform) is a feature based on local extreme points in scale space, used mainly for image matching and object recognition. The SIFT algorithm locates and identifies targets in images by extracting keypoints and feature descriptors.
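The keypoint-candidate test at the heart of SIFT, checking whether a point is a local extremum across position and scale in the difference-of-Gaussians (DoG) stack, can be sketched as (toy data, not a full SIFT implementation):

```python
def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is a local max or min among its 26 neighbors
    across position and the two adjacent scales (SIFT candidate test)."""
    v = dog[s][y][x]
    neighbors = [dog[ss][yy][xx]
                 for ss in (s - 1, s, s + 1)
                 for yy in (y - 1, y, y + 1)
                 for xx in (x - 1, x, x + 1)
                 if (ss, yy, xx) != (s, y, x)]
    return v > max(neighbors) or v < min(neighbors)

# Three tiny DoG layers with a clear peak in the middle layer.
dog = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[1, 1, 1], [1, 9, 1], [1, 1, 1]],
    [[0, 2, 0], [2, 3, 2], [0, 2, 0]],
]
print(is_scale_space_extremum(dog, 1, 1, 1))  # True: 9 exceeds all 26 neighbors
```

Candidates found this way are then refined, filtered for low contrast and edges, assigned orientations, and described with local gradient histograms to form the final SIFT descriptor.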

DPM, the Deformable Parts Model (object detection) (learn)

DPM feature extraction

  • signed gradient
  • unsigned gradient

      Signed: the full 0-360 degree angle space is quantized into an 18-dimensional histogram.
      Unsigned: 0-180 degrees, quantized into 9 dimensions; together each cell yields a 27-dimensional histogram.
      In HOG, PCA is used to reduce the dimensionality of such high-dimensional features.
      DPM instead uses an approximation to PCA: the 27-dimensional histograms extracted for each cell under its 4 normalizations are summed - 4 sums taken across the 27 orientation channels and 27 sums taken across the 4 normalizations - and concatenated to give the final 31-dimensional feature vector. This accumulation is much faster than a true PCA projection.
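Following the description above, the projection from a 4 x 27 block (4 normalizations x 27 orientation channels) to the 31-dimensional DPM feature can be sketched as:

```python
def dpm_project(features_4x27):
    """Collapse a 4 x 27 block of normalized histogram values into the 31-d
    DPM feature: 27 sums over the 4 normalizations plus 4 sums over the
    27 orientation channels (an approximation to a PCA projection)."""
    col_sums = [sum(row[c] for row in features_4x27) for c in range(27)]  # 27 values
    row_sums = [sum(row) for row in features_4x27]                        # 4 values
    return col_sums + row_sums                                            # 31-d vector

# Hypothetical block where every normalized histogram entry is 1.0.
block = [[1.0] * 27 for _ in range(4)]
vec = dpm_project(block)
print(len(vec))          # 31
print(vec[0], vec[27])   # 4.0 (a column sum), 27.0 (a row sum)
```

The sums are trivially cheap compared with a learned PCA projection, which is where the speedup mentioned above comes from.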


DPM (object detection)

  • Compute the DPM feature map
  • Compute the response maps of the root filter and the part filters (the response value indicates how likely the current region is to be the target; it can be understood as an energy distribution)
  • Train a latent SVM classifier
  • Detection and recognition

4. Basic pipeline of traditional object detection algorithms

Process one:

       Given an image to be detected, it is fed to the detection algorithm as input. The sliding window method extracts candidate boxes from the image, features are extracted from the image inside each candidate box (mainly using the extraction methods introduced in the prerequisite knowledge above), and a classifier decides the feature class, yielding a series of candidate boxes for the current target. These candidate boxes may overlap, in which case the non-maximum suppression algorithm (NMS) merges or filters them; the remaining boxes are the final detection results, i.e., the output.
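Process one can be sketched as a skeleton in which the four components are placeholders (the toy feature, classifier, and threshold below are hypothetical stand-ins for the methods described earlier):

```python
def detect(image, windows, extract_features, classify, nms):
    """Skeleton of the classic pipeline: sliding-window candidates ->
    feature extraction -> classification -> NMS merging."""
    candidates, scores = [], []
    for (x, y, w, h) in windows:
        patch = [row[x:x + w] for row in image[y:y + h]]
        score = classify(extract_features(patch))
        if score > 0.5:                     # hypothetical acceptance threshold
            candidates.append((x, y, x + w, y + h))
            scores.append(score)
    keep = nms(candidates, scores)
    return [candidates[i] for i in keep]

# Toy stand-ins: the "feature" is mean brightness, the "classifier" fires on
# bright patches, and the "NMS" keeps every candidate.
mean = lambda patch: sum(map(sum, patch)) / (len(patch) * len(patch[0]))
bright = lambda f: 1.0 if f > 5 else 0.0
keep_all = lambda boxes, scores: list(range(len(boxes)))

image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
windows = [(0, 0, 2, 2), (1, 1, 2, 2)]
print(detect(image, windows, mean, bright, keep_all))  # [(1, 1, 3, 3)]
```

Swapping in Haar + AdaBoost or HOG + SVM for the feature and classifier slots recovers the detectors described in section 3.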

Process 2:

       Given an image as input, the target regions are extracted by feature extraction plus bounding-box regression; NMS is again used to merge candidate boxes, producing the final output.

Note:

  • Process 1: Applicable to traditional target detection methods and target detection methods based on deep learning
  • Process 2: Suitable for target detection methods based on deep learning

5. Problems with detection algorithms based on traditional hand-crafted features

  • 1. Hand-designed features are difficult to design in the first place, and the resulting features often have problems such as failing to adapt to varying conditions; that is, they are not robust and are inefficient.
  • 2. The strategy of extracting candidate boxes with a sliding window and then classifying each box is very cumbersome, and the sliding-window extraction itself is time-consuming.

6. Advantages and disadvantages of traditional algorithms based on hand-crafted features

Advantages:

  • a. Relatively simple: Traditional algorithms based on hand-crafted features are usually simple and easy to implement and do not require a large number of training samples.
  • b. Lower computational complexity: Since the feature extraction process is usually simple, traditional algorithms are computationally efficient.
  • c. Strong interpretability: Manual features are designed manually, and their good interpretability helps analyze the performance and results of the algorithm.

Disadvantages:

  • a. Depends on feature design: The performance of traditional algorithms based on hand-crafted features largely depends on the quality and selection of feature design. Different tasks require different features, so it takes a lot of manpower and time to design and tune features.
  • b. Not suitable for complex scenes: Traditional algorithms usually have weak processing capabilities for complex scenes, especially when the target scale and shape change greatly or there is occlusion.
  • c. Cannot handle large-scale data: as the scale of data grows, the computational complexity and recognition performance of traditional hand-crafted-feature algorithms become limiting.

Source: blog.csdn.net/qq_41946216/article/details/132778457