HOG feature extraction process explained (reposted)

1. HOG features:

       The Histogram of Oriented Gradients (HOG) feature is a feature descriptor used for object detection in computer vision and image processing. It builds features by computing and accumulating histograms of gradient directions over local regions of an image. HOG features combined with an SVM classifier have been widely used in image recognition, with particular success in pedestrian detection. It is worth noting that HOG+SVM pedestrian detection was proposed by the researcher Dalal (INRIA, France) at CVPR in 2005. Although many pedestrian detection algorithms have been proposed since, they are basically variations on the HOG+SVM idea.

(1) Main idea:

       In an image, the appearance and shape of a local object can be well described by the distribution of gradient or edge directions. (The essence: statistics of gradients, and gradients mainly live at edges.)

(2) The specific implementation method is:

       The image is first divided into small connected regions, called cell units. Then a histogram of gradient or edge directions is collected over the pixels of each cell. Finally, these histograms are concatenated to form the feature descriptor.

(3) Improve performance:

       Contrast-normalize these local histograms over a larger region of the image (called an interval or block). The method is: first compute the density of each histogram over the block, then normalize each cell unit in the block according to this density. After this normalization, better invariance to illumination changes and shadows is obtained.

(4) Advantages:

       Compared with other feature description methods, HOG has many advantages. First, since HOG operates on local cells of the image, it maintains good invariance to geometric and photometric deformations, which only appear over larger spatial regions. Second, under coarse spatial sampling, fine orientation sampling, and strong local photometric normalization, pedestrians can be allowed some subtle body movements as long as they roughly maintain an upright posture; such movements are ignored without affecting the detection result. Therefore, the HOG feature is particularly well suited to human detection in images.

 

2. The implementation process of the HOG feature extraction algorithm:

Approximate process:

The HOG feature extraction method takes an image (the target or scan window you want to detect) and proceeds as follows (a minimal end-to-end code sketch follows the list):

1) Grayscale the image (treating it as a three-dimensional function of x, y, and z (gray value));

2) Standardize (normalize) the color space of the input image with gamma correction; the purpose is to adjust the contrast of the image, reduce the influence of local shadows and illumination changes, and suppress noise;

3) Compute the gradient (magnitude and direction) at each pixel of the image; this mainly captures contour information while further weakening illumination effects.

4) Divide the image into small cells (e.g., 6×6 pixels/cell);

5) Build a histogram of gradients over each cell (counting the occurrences of the different gradient directions) to form each cell's descriptor;

6) Group several cells into a block (e.g., 3×3 cells/block); concatenating the feature descriptors of all cells in a block gives the block's HOG feature descriptor.

7) Concatenate the HOG feature descriptors of all blocks in the image (the target you want to detect) to get the image's HOG feature descriptor. This is the final feature vector that can be used for classification.
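As a quick orientation before the step-by-step details, here is a minimal sketch of the whole pipeline using OpenCV's built-in HOGDescriptor. The image path is a placeholder, and the parameters are OpenCV's defaults rather than the 6×6/3×3 example values used above.

```python
import cv2

# OpenCV defaults: 64x128 window, 16x16 block, 8x8 block stride, 8x8 cell, 9 bins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

img = cv2.imread("pedestrian.png", cv2.IMREAD_GRAYSCALE)  # step 1: grayscale
img = cv2.resize(img, (64, 128))                          # one detection window
feature = hog.compute(img)                                # steps 2-7 happen inside
print(feature.size)  # 3780 = 105 blocks x 4 cells x 9 bins
```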


The detailed process of each step is as follows:

(1) Standardize gamma space and color space

     To reduce the influence of illumination, the whole image first needs to be normalized. In the texture intensity of an image, local surface exposure contributes a large proportion, so this compression can effectively reduce local shadows and illumination changes. Because color information has little effect, the image is usually converted to grayscale first;

     Gamma compression formula: I(x, y) = I(x, y)^γ

     For example, one can take γ = 1/2;
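Below is a minimal sketch of this gamma compression, assuming an 8-bit grayscale input; the function name is illustrative.

```python
import numpy as np

def gamma_correct(img, gamma=0.5):
    """Gamma compression of an 8-bit grayscale image, with gamma = 1/2."""
    normalized = img.astype(np.float64) / 255.0   # scale to [0, 1]
    return np.power(normalized, gamma)            # I(x, y) <- I(x, y)^gamma
```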

 

(2) Calculate the image gradient

        Compute the gradients in the horizontal and vertical directions of the image, and from them the gradient direction value at each pixel position; the derivative operation not only captures contours, silhouettes, and some texture information, but also further weakens the influence of illumination.

The gradient of pixel (x, y) in the image is:

G_x(x, y) = H(x+1, y) - H(x-1, y)
G_y(x, y) = H(x, y+1) - H(x, y-1)

where G_x(x, y), G_y(x, y), and H(x, y) denote the horizontal gradient, vertical gradient, and pixel value at pixel (x, y), respectively. The gradient magnitude and direction at pixel (x, y) are then:

G(x, y) = sqrt(G_x(x, y)^2 + G_y(x, y)^2)
α(x, y) = arctan(G_y(x, y) / G_x(x, y))

       The most common method is: first convolve the original image with the [-1, 0, 1] gradient operator to obtain the gradient component gradscalx in the x direction (horizontal, positive to the right), then convolve the original image with the [1, 0, -1]^T gradient operator to obtain the gradient component gradscaly in the y direction (vertical, positive upward). Then use the formulas above to compute the gradient magnitude and direction at each pixel.
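A sketch of this gradient computation, assuming a float grayscale image. Note that image rows grow downward, so the vertical component comes out with the opposite sign to the "upward is positive" convention above; this only flips the sign of the angle.

```python
import cv2
import numpy as np

img = cv2.imread("pedestrian.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

kx = np.array([[-1.0, 0.0, 1.0]])            # [-1, 0, 1] along x
ky = np.array([[-1.0], [0.0], [1.0]])        # along y; rows grow downward

gx = cv2.filter2D(img, -1, kx)               # gradscalx: I(x+1, y) - I(x-1, y)
gy = cv2.filter2D(img, -1, ky)               # gradscaly (downward-positive)

magnitude = np.sqrt(gx ** 2 + gy ** 2)       # G(x, y)
direction = np.rad2deg(np.arctan2(gy, gx))   # alpha(x, y), in degrees
```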

 

(3) Build a gradient direction histogram for each cell unit

        The purpose of this third step is to provide an encoding of local image regions while remaining only weakly sensitive to the pose and appearance of the human figures in the image.

We divide the image into several "cells", for example 6×6 pixels per cell. Suppose we use a 9-bin histogram to collect the gradient information of these 6×6 pixels; that is, the cell's range of gradient directions is divided into 9 direction bins (with unsigned gradients over 0-180 degrees, each bin spans 20 degrees). For example: if a pixel's gradient direction falls in 20-40 degrees, the count of the histogram's second bin is increased by one. Doing this for every pixel in the cell, i.e., a weighted projection of each pixel's gradient direction onto the histogram (mapping it to a fixed angle range), yields the cell's gradient direction histogram, which is the 9-dimensional feature vector for that cell (since there are 9 bins).

        So the pixel's gradient direction is used; what about the gradient magnitude? The gradient magnitude is used as the weight of the projection. For example: say this pixel's gradient direction falls in 20-40 degrees and its gradient magnitude is 2 (hypothetically); then the count of the histogram's second bin is increased not by one but by two.

         Cells can be rectangular or radial.
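Here is a minimal sketch of the per-cell voting just described, assuming unsigned gradients over 0-180 degrees and hard assignment to a single bin (Dalal's actual implementation interpolates between neighboring bins):

```python
import numpy as np

def cell_histogram(mag, ang, nbins=9):
    """mag, ang: gradient magnitude/direction (degrees) arrays for one cell."""
    hist = np.zeros(nbins)
    bin_width = 180.0 / nbins                         # 20 degrees per bin
    bins = ((ang % 180.0) // bin_width).astype(int)   # unsigned direction -> bin
    for b, w in zip(bins.ravel(), mag.ravel()):
        hist[b] += w                                  # magnitude-weighted vote
    return hist                                       # the cell's 9-D feature
```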

 

(4) Combine the cell units into large blocks, and normalize the gradient histogram within the block

       Due to changes in local illumination and in foreground-background contrast, the range of gradient magnitudes is very large. This calls for normalizing the gradient strength. Normalization further compresses lighting, shadows, and edges.

        The authors' approach is to group cell units into larger, spatially connected blocks. The feature vectors of all cells in a block are then concatenated to obtain the block's HOG feature. These blocks overlap one another, which means each cell's features appear multiple times, with different normalizations, in the final feature vector. We call the normalized block descriptor (vector) the HOG descriptor.

        There are two main block geometries: rectangular blocks (R-HOG) and circular blocks (C-HOG). An R-HOG block is generally a square grid, characterized by three parameters: the number of cells per block, the number of pixels per cell, and the number of histogram channels per cell.

       For example: the optimal parameter settings for pedestrian detection are: 3×3 cells/block, 6×6 pixels/cell, 9 histogram channels. The number of features in one block is then 3×3×9 = 81;
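A sketch of the block normalization, using plain L2 normalization for brevity (Dalal's recommended scheme, L2-Hys, additionally clips and renormalizes):

```python
import numpy as np

def normalize_block(cell_hists, eps=1e-5):
    """cell_hists: list of per-cell histograms belonging to one block."""
    v = np.concatenate(cell_hists)                  # e.g. 3x3 cells x 9 bins = 81-D
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)   # L2 normalization
```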

 

(5) Collect HOG features

      The final step is to collect HOG features from all overlapping blocks in the detection window and combine them into a final feature vector for classification.

    

(6) What is the HOG feature dimension of an image?

        By way of summary, the HOG feature extraction process proposed by Dalal: the sample image is divided into several pixel cells, the range of gradient directions is evenly divided into 9 bins, a histogram of the gradient directions of all pixels inside each cell is accumulated over these direction bins, and a 9-dimensional feature vector is obtained per cell. Every adjacent 2×2 cells form a block, and concatenating the feature vectors within a block gives a 36-dimensional feature vector. The block scans the sample image with a stride of one cell, and finally the features of all blocks are concatenated to obtain the feature of the human body. For example, for a 64×128 image, with 8×8 pixels per cell and 2×2 cells per block, each block has 4×9 = 36 features (since each cell has 9). With an 8-pixel stride there are (64-16)/8+1 = 7 block positions horizontally and (128-16)/8+1 = 15 vertically. That is to say, a 64×128 image has 36×7×15 = 3780 features in total.

(Figure: HOG dimensions for a 16×16-pixel block with 8×8-pixel cells)
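The same arithmetic as a small helper; the function name and defaults are illustrative, matching the 64×128/16×16/8×8 example above:

```python
def hog_dim(win=(64, 128), block=(16, 16), stride=(8, 8), cell=(8, 8), nbins=9):
    blocks_x = (win[0] - block[0]) // stride[0] + 1   # 7 for a 64-wide window
    blocks_y = (win[1] - block[1]) // stride[1] + 1   # 15 for a 128-tall window
    cells_per_block = (block[0] // cell[0]) * (block[1] // cell[1])  # 4
    return blocks_x * blocks_y * cells_per_block * nbins

print(hog_dim())   # 3780
```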

 

Notes:

Pedestrian detection HOG+SVM

General idea (a hedged training sketch in code follows this list):
1. Extract HOG features of the positive and negative samples.
2. Feed them to an SVM classifier for training to obtain a model.
3. Generate a detector from the model.
4. Run the detector over the negative samples to collect the false positives (hard examples).
5. Extract HOG features of the hard examples, combine them with the features from step 1, and retrain to obtain the final detector.
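A hedged sketch of these five steps with scikit-image and scikit-learn; load_windows() is a hypothetical helper that returns 64×128 grayscale crops, and the C value is arbitrary:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def features(windows):
    # 9 orientations, 8x8 cells, 2x2 cells per block, as in the summary above
    return np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])

pos = load_windows("pos/")                   # step 1 (hypothetical loader)
neg = load_windows("neg/")
X = np.vstack([features(pos), features(neg)])
y = np.array([1] * len(pos) + [0] * len(neg))
svm = LinearSVC(C=0.01).fit(X, y)            # steps 2-3: train, get a detector

hard = [w for w in load_windows("neg_full/") # step 4: keep only false positives
        if svm.decision_function(features([w]))[0] > 0]
if hard:                                     # step 5: add hard examples, retrain
    X = np.vstack([X, features(hard)])
    y = np.concatenate([y, np.zeros(len(hard))])
final_svm = LinearSVC(C=0.01).fit(X, y)
```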

An in-depth look at the principle of the HOG algorithm:
1. Overview of HOG

Histograms of Oriented Gradients, as the name suggests, is a histogram of gradient directions: a way of describing a target, which is also a descriptor.
2. The origin of HOG
HOG was proposed in 2005 by a formidable PhD (Dalal). The paper link is http://wenku.baidu.com/view/676f2351f01dc281e53af0b2.html
3. Understanding the algorithm
        October has finally arrived; I can finally breathe a sigh of relief and sort out the HOG algorithm pipeline.
First of all, we need an overall picture. Each target corresponds to a one-dimensional feature vector with n dimensions in total. This n is not pulled out of thin air; it is well-founded. For example, why is the HOG detector that ships with OpenCV 3781-dimensional? This question is a real headache at first, and it kept me tangled up for a long time, but don't worry.
Let's first look at the constructor of OpenCV's HOGDescriptor structure, HOGDescriptor(Size winSize, Size blockSize, Size blockStride, Size cellSize, ...) (the later parameters are not used here), and check OpenCV's defaults: winSize(64,128), blockSize(16,16), blockStride(8,8), cellSize(8,8). Obviously HOG divides a feature window win into many blocks, and each block into many cell units (i.e., cells). The HOG feature vector strings together the small feature vectors of all these cells into one high-dimensional feature vector, so the dimension n of the one-dimensional vector for this window equals (number of blocks in the window) × (number of cells per block) × (dimension of each cell's feature vector).
Having written this, let's compute how 3781 comes about. With a window size of 64×128, block size 16×16, and block stride 8×8, the number of blocks in the window is ((64-16)/8+1) × ((128-16)/8+1) = 7×15 = 105 blocks. With a block size of 16×16 and cell size 8×8, the number of cells per block is (16/8) × (16/8) = 4 cells. To get the final dimension n we only need the vector dimension per cell. Where does that come from? Don't worry: we project each cell onto 9 bins (how to project? This stuck me for a long time too; more on that later), so each cell's vector is 9-dimensional, one number per bin. Now all three factors for the window dimension are known: n = (blocks per window) × (cells per block) × (dimension per cell). Plugging in: n = 105 × 4 × 9 = 3780, which is the feature of this window. Some will ask: why does getDefaultPeopleDetector() in OpenCV return 3781 dimensions? Because the extra dimension is a one-dimensional bias (quite maddening, right? I was stuck on it for a long time too; the next paragraph explains).
We use HOG+SVM to detect pedestrians. The final detection method is the most basic linear discriminant function, w·x + b = 0. The 3780-dimensional vector we just derived corresponds to w, and appending the one-dimensional b forms OpenCV's default 3781-dimensional detector. Detection splits into train and test: during training we extract the HOG features of a series of training samples, and the purpose of SVM training is to obtain the w and b we detect with; during testing we extract the HOG feature x of the data to be judged and plug it into the equation to decide whether it is the target.
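This is easy to verify against OpenCV itself: getDefaultPeopleDetector() returns the 3781 coefficients (the 3780-dimensional w plus the bias b). The image path below is a placeholder.

```python
import cv2

hog = cv2.HOGDescriptor()  # defaults: 64x128 win, 16x16 block, 8x8 stride, 8x8 cell, 9 bins
detector = cv2.HOGDescriptor_getDefaultPeopleDetector()
print(len(detector))       # 3781 = 3780-D w plus the 1-D bias b
hog.setSVMDetector(detector)

img = cv2.imread("street.png")   # placeholder path
rects, weights = hog.detectMultiScale(img, winStride=(8, 8))
print(len(rects), "pedestrian windows found")
```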
**************************************************************************************************
                                           gorgeous dividing line
Writing this much, I at least have a general understanding of how HOG operates. Online you can see plenty of HOG computation write-ups: normalization this, gradient computation that, projecting each cell, and so on. To someone just starting out they seem understandable after reading, yet you still don't know how to use them or how HOG and SVM cooperate; those details were of no use at all to us early on. The upside is that once you can use HOG, going back to study the principles is where the payoff is. There is plenty of material online, so I won't repeat it here.
It is also worth mentioning that when computing cell features, each pixel must be projected onto the bins. There is real subtlety in this projection, which Dalal's thesis calls "trilinear interpolation"; readers who want to understand HOG thoroughly should study it carefully.
**************************************************************************************************
                                         gorgeous dividing line, continued
Finally, a word on using libsvm versus CvSVM. I find libsvm easier to use, though CvSVM was also rewritten from libsvm 2.6 (if I remember correctly). The difference between the two is that libsvm training produces a model file, while CvSVM produces an XML file. When computing the w vector in the final w·x + b = 0: for libsvm you can process the model file directly; for CvSVM you can skip generating the XML file and read the attributes of the CvSVM object directly (this part is a bit vague; either of the two works, it does not matter much).
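For completeness, a hedged sketch of turning a trained linear SVM into the 3781-dimensional detector OpenCV expects; here a scikit-learn LinearSVC (final_svm from the earlier sketch) stands in for a libsvm or CvSVM model, and depending on which class you labeled positive the vector may need negating:

```python
import numpy as np
import cv2

# final_svm: a LinearSVC trained on 3780-D HOG features (see the earlier sketch)
w = final_svm.coef_.ravel()                  # the 3780 weights
b = final_svm.intercept_[0]                  # the extra 3781st entry
detector = np.append(w, b).astype(np.float32)

hog = cv2.HOGDescriptor()                    # default 64x128 window
hog.setSVMDetector(detector)                 # custom 3781-D detector
```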
