MCNN paper reading notes

abstract

The aim is to accurately estimate the crowd count from a single image, for arbitrary crowd density and arbitrary camera angle. A simple multi-column convolutional neural network (MCNN) is proposed that maps an image to its crowd density map. MCNN accepts input images of any size and resolution.

By using filters of different sizes to obtain different receptive fields, the features learned by the CNN in each column can adapt to variations in head size caused by different fields of view.

The accurate computation of the ground-truth density map relies on geometry-adaptive kernels, which do not require knowledge of the perspective map of the input image.

A perspective map is a way of capturing how head sizes change across the image.

The perspective map is defined as the number of pixels that 1 m in the scene occupies in the image, i.e., pixel distance / actual distance.

How to generate the ground-truth perspective map:

  1. When pedestrians are fully visible

    img

    As the figure shows,


    y_h=\frac{f(C-H)}{z_1},\quad y_f=\frac{fC}{z_1}\\
    h=y_f-y_h=\frac{fH}{z_1}\\
    h=\frac{H}{C-H}y_h\\
    p=\frac{h}{H}=\frac{1}{C-H}y_h

    Assuming the average height of a person is H = 1.75 m, we can sample the pixel height h of several fully visible people in the image. Their head positions y_h are also pixel distances that can be read off the image, so C can be inferred. With C and H = 1.75 known, the perspective value p at any y_h can then be computed.

  2. When pedestrians' bodies are not visible

    In a dense dataset like ShanghaiTech Part_A, it is hard to sample full bodies in the image and compute C as above to obtain the whole perspective map. Instead, the average head size can be estimated from each head's k nearest neighbours; after sampling head sizes at different y_h, the paper fits the perspective values at other positions with the following interpolation function:

    img

    a, b, c are the parameters to be fitted; the paper does not say what actual value H of the head size should be used.
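The derivation in case 1 can be sketched numerically. Below is a minimal NumPy example, with hypothetical sampled values, that estimates C from a few visible pedestrians and then evaluates the perspective value at any head position. The sample arrays are illustrative assumptions, not data from the paper.

```python
import numpy as np

H = 1.75  # assumed average person height in metres

# Hypothetical samples: for each visible pedestrian, the head row y_h
# (pixel distance from the vanishing line) and the body pixel height
# h = y_f - y_h, both read off the image.
y_h_samples = np.array([120.0, 200.0, 310.0])
h_samples = np.array([30.0, 50.0, 77.5])

# From h = H * y_h / (C - H):  C = H + H * y_h / h, averaged over samples.
C_est = np.mean(H + H * y_h_samples / h_samples)

def perspective(y_h, C=C_est):
    """Perspective value p(y_h) = y_h / (C - H), in pixels per metre."""
    return y_h / (C - H)
```

With C estimated once per scene, the perspective value at every row of the image follows directly from the last formula above.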

related work

Early methods built detection frameworks that estimate counts from the motion and appearance features of pedestrians, using data such as two consecutive frames or video sequences. Limitation: for very dense, heavily occluded crowds, the detector's performance degrades sharply, which hurts the final accuracy.

In video crowd counting, clustering the trajectories of tracked features has been proposed. For example, [24] used a highly parallelised version of the KLT tracker together with clustering to estimate crowd counts. [3] proposed tracking simple image features and probabilistically grouping them into clusters representing independently moving individuals. However, tracking-based methods cannot handle still images at all, let alone dense crowds in them.

contribution

Purpose: predict the crowd count in still images taken from arbitrary viewing angles of dense crowds, with reasonable accuracy.

Method: find an approach that can automatically learn effective features, namely a convolutional neural network.

Multi-column convolutional neural network: each column learns features at a different scale. Given an input image, MCNN outputs a crowd density map, and the integral of this map is the predicted crowd count.

Contributions are as follows:

  1. Using a multi-column convolutional network: the three columns correspond to three different receptive fields (large, medium, small). Even if people or heads appear at different sizes due to image resolution or shooting angle, each column can still adapt.

  2. Replacing the fully connected layer with a 1x1 convolution, so the input image can be of any size without the distortion caused by resizing. The prediction is obtained from the output density map.

  3. A new dataset for evaluation. Because of limitations in viewing-angle variety (UCSD, WorldExpo'10), crowd density (UCSD), and dataset scale (UCSD, UCF_CC_50), existing datasets cannot fully test a model across scenarios. The authors therefore introduce a new large-scale dataset, ShangHaiTech, with 1,200 images and 330,000 accurately annotated heads. To their knowledge, ShangHaiTech is currently the largest dataset in terms of annotated heads, and no two images in it share the same viewpoint. It is split into two parts: Part_A is a collection of images randomly crawled from the Internet, most with very dense crowds; Part_B was taken on streets in metropolitan Shanghai. Every image is annotated, and the dataset will be made publicly available.

MCNN model

Since a density map carries richer information, and the head count can be recovered by integrating it, the model takes an image as input and outputs a density map.

Generating density maps with geometry-adaptive Gaussian kernels

Using the training annotations, generate a head-location map

H(x)=\sum^N_{i=1}{\delta(x-x_i)}

To make H(x) continuous, convolve it with a Gaussian kernel:

F(x)=H(x)*G_\sigma(x)

If the labelled maps were used for training directly, each pixel value would be either 0 or 1. It is like one person eating the whole cake while everyone around gets nothing: too extreme, and it also makes learning harder. So we use a method (this is exactly the Gaussian kernel convolution) that spreads each pixel value of 1 a little over all the other pixels, so every pixel gets some share. Note that each "1" is distributed over all pixels: if your image is 1024x1024, the "1" is spread over 1024x1024 pixels, while its total mass still sums to 1.

However, spread like this, head size still affects the result: heads closer to the camera get spread more. So the spread parameter \sigma should be chosen according to the size of each head in the image. The authors observe that in crowded scenes head size is related to the distance between neighbouring heads, and propose a data-adaptive method: the spread parameter is determined by the average distance between a head and its surrounding heads.

For each head x_i in the image, let the distances to its m nearest neighbours be [d_{i1}, d_{i2}, \dots, d_{im}]. The average distance of x_i is then \overline{d}_i=\frac{1}{m}\sum^m_{j=1}d_{ij}. The pixels associated with x_i thus correspond roughly to a circle of radius \overline{d}_i, and the spread parameter \sigma of the Gaussian kernel is set from \overline{d}_i.

F(x)=\sum^N_{i=1}\delta(x-x_i)*G_{\sigma_i}(x),\quad with\ \sigma_i=\beta\overline{d}_i

The density map generated this way is said to use a geometry-adaptive Gaussian kernel.

\beta = 0.3 works best. \beta controls how widely the Gaussian for each head is spread relative to the distance to its neighbours.
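The whole procedure above can be sketched in a few lines of NumPy. This is a minimal illustration of the geometry-adaptive kernel idea, not the authors' code; the brute-force nearest-neighbour search and per-head normalisation are implementation choices made here for brevity.

```python
import numpy as np

def adaptive_density_map(shape, heads, beta=0.3, k=3):
    """Build a density map with geometry-adaptive Gaussian kernels.

    shape: (rows, cols) of the image.
    heads: (N, 2) array-like of (row, col) head annotations.
    Each head contributes one Gaussian whose sigma is beta times the
    mean distance to its k nearest neighbouring heads.
    """
    density = np.zeros(shape, dtype=np.float64)
    rows, cols = shape
    yy, xx = np.mgrid[0:rows, 0:cols]
    heads = np.asarray(heads, dtype=np.float64)
    for i, (r, c) in enumerate(heads):
        # Distances from head i to all heads (index 0 is itself).
        d = np.sqrt(((heads - heads[i]) ** 2).sum(axis=1))
        d_bar = np.sort(d)[1:k + 1].mean() if len(heads) > 1 else 1.0
        sigma = max(beta * d_bar, 1e-6)
        g = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        density += g / g.sum()  # normalise so each head integrates to 1
    return density
```

Because every per-head Gaussian is normalised to sum to 1, integrating (summing) the resulting map recovers the head count N, exactly as the text describes.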

Multi-column convolutional neural network for training density map generation

Convolution kernels with different receptive field sizes are used to learn local features at their respective scales.

Each column of MCNN is configured with kernels of different sizes to adapt to features at different scales.

img

For pooling and activation, 2x2 max pooling with the ReLU activation function works best.

To reduce the parameter count, the columns with larger kernels extract fewer feature maps. The feature maps from the three columns are then concatenated along the channel dimension, and a 1x1 convolution fuses them into the predicted density map.
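The fusion step is easy to demystify: a 1x1 convolution with one output channel is just a learned weighted sum over the input channels at every pixel, which is also why the network accepts inputs of any size. A NumPy sketch follows; the spatial size and the per-column channel counts (8, 10, 12, matching the architecture figure in the paper as I read it) are assumptions, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

# Hypothetical per-column outputs for one image: (channels, height, width).
feat_large = np.random.rand(8, 32, 24)
feat_medium = np.random.rand(10, 32, 24)
feat_small = np.random.rand(12, 32, 24)

# Concatenate along the channel dimension: 8 + 10 + 12 = 30 channels.
merged = np.concatenate([feat_large, feat_medium, feat_small], axis=0)

# A 1x1 convolution with a single output channel is a weighted sum over
# the 30 channels at each spatial position, plus a bias.
w = np.random.rand(30)  # stand-in for the learned 1x1 kernel weights
b = 0.0
density_pred = np.einsum('c,chw->hw', w, merged) + b
```

Note that nothing here depends on the 32x24 spatial size: the same weights apply at every pixel, so any input resolution yields a correspondingly sized density map.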

The difference between the predicted and ground-truth density maps is measured by Euclidean distance. Loss function:

L(\Theta)=\frac{1}{2N}\sum^N_{i=1}||F(X_i;\Theta)-F_i||^2_2

where \Theta are the learnable parameters, N is the number of training images, X_i is the i-th input image, F(X_i;\Theta) is the predicted density map, and F_i is the ground-truth density map of that image.
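The loss is simple enough to state as code. A NumPy sketch of the formula above, operating on stacked (N, H, W) arrays of predicted and ground-truth maps:

```python
import numpy as np

def mcnn_loss(pred_maps, gt_maps):
    """Pixel-wise Euclidean loss: L = 1/(2N) * sum_i ||F(X_i) - F_i||_2^2."""
    pred_maps = np.asarray(pred_maps, dtype=np.float64)
    gt_maps = np.asarray(gt_maps, dtype=np.float64)
    N = len(pred_maps)
    return ((pred_maps - gt_maps) ** 2).sum() / (2 * N)
```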

MCNN Optimization

Stochastic Gradient Descent and Backpropagation

First, each of the three feature-extraction columns is pre-trained separately to generate a density map on its own. Then the three columns are combined, their features fused, and the density map is predicted again while fine-tuning all parameters jointly, which gives good results.

Generalizability

The advantage of MCNN is that it can perceive heads of different sizes when generating density maps. Therefore, if it is first trained on a dataset with large variation in head sizes, it can easily adapt to other datasets whose head sizes differ.

If a new domain has very few training samples, we can freeze the first few convolutional layers of each MCNN column and fine-tune only the last convolutional layer.

references:

[1] Crowd counting paper translation (1): MCNN, Zhihu (zhihu.com)

[2] Original paper: [PDF] Single-Image Crowd Counting via Multi-Column Convolutional Neural Network, Semantic Scholar


Origin blog.csdn.net/m0_61427031/article/details/132324664