The Chinese Academy of Sciences proposes DSNet, a crowd density estimation algorithm, with counting accuracy improved by 30%

Introduction

In recent years, with rapid population growth, crowd counting has been widely applied in video surveillance, traffic control, and sports events. Early work estimated the number of people by detecting bodies or heads, while other methods learned a mapping from local or global features to the actual count. More recently, the crowd counting problem has been formulated as regression of a crowd density map, whose values are then summed to obtain the number of people in the image. With the success of deep learning, researchers use convolutional neural networks (CNNs) to generate accurate crowd density maps and achieve better performance than traditional methods.

However, due to large scale variation, severe occlusion, background noise, and perspective distortion, crowd counting remains a very challenging task, and scale variation is the most important issue. To better handle it, researchers have proposed many multi-column or multi-branch networks, generally consisting of several columns of CNNs or several branches at different stages of a backbone network, where the columns or branches have different receptive fields to perceive changes in crowd size. Although these methods bring clear improvements, the scale diversity they capture is limited by the number of columns or branches.

The main challenge of scale variation lies in two aspects. First, as shown on the left of Figure 1, people in crowd images usually appear at very different sizes, from a few pixels to dozens of pixels, which requires the network to handle data with very large scale variation. Second, as shown on the right of Figure 1, the scale usually changes continuously across an image, especially in high-density images, which requires the network to sample the scale range densely. Existing methods cannot cope with both challenges at the same time.

image

Figure 1 Large scale variations in crowd counting datasets. Left: input image and corresponding ground-truth density map from ShanghaiTech. Right: input image and corresponding ground-truth density map from the UCF-QNRF dataset.

This paper proposes DSNet, a new dense-scale single-column neural network for crowd counting. DSNet is composed of densely connected dilated convolution blocks, so it can output features with different receptive fields and capture crowd information at different scales. The convolution block of DSNet is similar in structure to that of DenseASPP but uses a different combination of dilation rates. The authors carefully choose these rates for the layers within each block so that the block samples more densely over continuously changing scales. At the same time, the chosen combination of dilation rates lets all pixels of the receptive field contribute to the feature computation, preventing gridding effects. To further improve the scale diversity captured by DSNet, the authors stack three dense dilated convolution blocks and link them with dense residual connections. The final network can sample a very large range of scale variations in a denser manner, so as to handle the large scale changes in crowd counting.

Most previous methods train the network with the traditional Euclidean loss, which is based on the assumption of pixel independence. This loss ignores the global and local consistency of the density map, which harms counting results. To solve this problem, the authors propose a multi-scale density level consistency loss to ensure that the global and local density levels of the estimated and ground-truth crowd density maps are consistent.

Paper contributions

  1. A dense dilated convolution block (DDCB) with carefully selected dilation rates is proposed. DDCB samples densely over continuously changing scales. DSNet can be trained end-to-end and handles both crowded and sparse crowd images.

  2. A multi-scale density level consistency loss is introduced to improve model performance. This loss enforces global and local consistency between the estimated density map and the ground-truth density map.

  3. The authors conducted extensive experiments on four challenging public crowd counting datasets. Compared with existing state-of-the-art methods, this method achieves the best performance: counting accuracy improves by 30% on the UCF-QNRF and UCF_CC_50 datasets and by 20% on the ShanghaiTech and UCSD datasets.

2 The DSNet method

The basic idea of DSNet is an end-to-end single-column CNN with denser scale diversity to cope with large scale variations and density-level differences in both dense and sparse scenes. The architecture of DSNet is shown in Figure 2.

Figure 2 The DSNet architecture. DSNet consists of a backbone formed by the first ten layers of VGG-16, three dense dilated convolution blocks (DDCB) linked by dense residual connections (DRC), and three convolutional layers for crowd density map regression. The DDCBs with DRC enlarge the scale diversity and receptive field of the features to cope with large scale variations and thus estimate the density map accurately.

2.1 DSNet architecture

The proposed DSNet consists of a backbone network serving as the feature extractor, three dense dilated convolution blocks stacked with dense residual connections to enlarge the scale diversity, and three convolutional layers for crowd density map regression.

 Backbone network

The backbone used in this paper is the first ten layers of VGG-16 together with three pooling layers. Experience with multi-column networks shows that convolutional layers with small kernels but greater depth are more effective than layers with larger kernels but fewer layers. This backbone also achieves the best trade-off between accuracy and computational cost, making it suitable for accurate and fast crowd counting.

 Dense dilated convolution block (DDCB)

To cope with the challenge of scale variation, a network architecture is needed that can capture a large range of scales in as dense a manner as possible. This paper proposes a new dense dilated convolution block containing three dilated convolutional layers with dilation rates 1, 2, and 3. This setting preserves information from denser scales, with small gaps between receptive field sizes. Each dilated layer within the block is densely connected to the others, so every layer can access all subsequent layers and pass on the information that needs to be preserved. After dense connection, the obtained scale diversity increases, as shown in Figure 3.

Figure 3 The scale diversity of DDCB corresponding to the dilation rate setting (1, 2, 3) in the densely stacked dilated convolutions. k denotes the receptive field size of the corresponding combination.
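The scale diversity in Figure 3 can be sanity-checked with a small sketch. Assuming each 3×3 convolution with dilation d enlarges the receptive field by 2d, dense connections let a path through the block use any subset of the three dilated layers, which yields a dense ladder of receptive-field sizes (the helper below is illustrative, not the paper's code):

```python
from itertools import combinations

def receptive_field(dilations):
    # each 3x3 conv with dilation d enlarges the receptive field by 2*d
    rf = 1
    for d in dilations:
        rf += 2 * d
    return rf

rates = (1, 2, 3)
# with dense connections, every non-empty subset of the dilated layers
# corresponds to one path through the block, hence one receptive field
sizes = sorted({receptive_field(c)
                for r in range(1, len(rates) + 1)
                for c in combinations(rates, r)})
print(sizes)  # [3, 5, 7, 9, 11, 13]
```

The sizes step by 2 pixels with no gaps, which is the "denser sampling of continuously changing scales" the block is designed for; a single chain without dense connections would expose only the final size, 13.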

Another advantage of carefully choosing the dilation rates is that it overcomes the gridding effect. As shown in Figure 4, a dilated convolutional layer with rate 6 is stacked on a dilated convolutional layer with rate 3. In the one-dimensional case, after these two layers, one output pixel can only gather information from 7 input pixels. The phenomenon becomes worse with two-dimensional input: the final pixel can only view the original information in a grid pattern, and most of the information (86.4%) is lost. The local information of the original feature map is completely lost, and because large dilation rates relate information over large distances where it may be irrelevant, this is not conducive to capturing the detailed features needed in crowd counting. With the new combination of dilation rates, the top layer can cover all pixel information of the original feature map, avoiding irrelevant long-distance information caused by an excessive dilation rate in the middle layer. This is critical to the accuracy of crowd counting.

Figure 4 (a) Stacking dilated convolutional layers with large dilation rates, as in DenseASPP, causes a "gridding effect" in which much information is lost. Red indicates the source of the information. (b) The DDCB proposed in this paper uses successive convolutional layers with dilation rates (1, 2, 3) to cover all pixel information.
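The gridding argument can be reproduced numerically. This sketch (a hypothetical helper, not from the paper) tracks which 1-D input offsets contribute to a single output position after stacking dilated convolutions:

```python
def reachable_offsets(dilations, ksize=3):
    # input offsets that reach one output position after stacking
    # 1-D dilated convolutions with the given rates
    offsets = {0}
    for d in dilations:
        taps = [(i - ksize // 2) * d for i in range(ksize)]  # e.g. {-d, 0, d}
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

gridded = reachable_offsets([3, 6])     # large rates, DenseASPP-style
dense   = reachable_offsets([1, 2, 3])  # the DDCB rates
print(len(gridded), gridded)  # 7 scattered taps -> gridding
print(len(dense))             # 13 contiguous taps, full receptive field
```

With rates (3, 6) only 7 of the 13 positions inside the receptive field are ever seen, and they form a grid; with (1, 2, 3) every position from −6 to 6 contributes, matching Figure 4(b).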

 Dense residual connection (DRC)

Although DDCB provides dense scale diversity, the hierarchical features between different blocks are not fully utilized. The authors therefore improve the architecture with dense residual connections to further improve information flow. Compared with traditional dense connections, they also prevent the network from growing wider. In this way, the output of a DDCB can directly access each layer of the subsequent DDCBs, enabling continuous information transfer. Compared with ordinary residual connections, dense residual connections further enlarge the scale diversity and adaptively retain the features suited to a specific scene during information flow.

2.2 Loss function

Most previous studies use the Euclidean distance as the loss function for crowd counting. It only considers pixel-wise error and ignores the global and local correlation between the estimated and ground-truth density maps. In this paper, the authors combine a multi-scale density level consistency loss with the Euclidean loss to measure global and local consistency.

 Euclidean loss

The Euclidean distance measures the pixel-level estimation error between the estimated density map and the ground truth. The loss function is defined as follows:

L_E = \frac{1}{2N} \sum_{i=1}^{N} \left\| G(X_i; \theta) - D_i \right\|_2^2

Where N is the number of images in a batch, G(X_i; θ) is the estimated density map of training image X_i with network parameters θ, and D_i is the ground-truth density map of X_i.
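A minimal NumPy sketch of this loss, assuming the common 1/(2N) normalization (the paper's exact scaling may differ); the toy batch is illustrative:

```python
import numpy as np

def euclidean_loss(est, gt):
    # est, gt: (N, H, W) batches of estimated / ground-truth density maps
    n = est.shape[0]
    diff = (est - gt).reshape(n, -1)
    return float(np.sum(diff ** 2) / (2 * n))  # 1/(2N) * sum of squared errors

# toy check on a 2-image batch of 4x4 maps (hypothetical data)
est = np.zeros((2, 4, 4))
gt = np.zeros((2, 4, 4))
gt[0, 1, 1] = 1.0  # one annotated head in the first map
print(euclidean_loss(est, gt))  # 0.25
```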

 Multi-scale density level consistency loss

In addition to the pixel-level loss, the authors also consider global and local density level consistency between the estimated density map and the ground truth. The newly proposed training loss is defined as:

L_C = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{s} \frac{1}{k_j} \left\| P_{ave}(G(X_i; \theta), k_j) - P_{ave}(D_i, k_j) \right\|_1

Where s is the number of scale levels used for the consistency check, P_{ave} is the average pooling operation, and k_j is the specified output size of the average pooling.

The scale levels divide the density map into different sub-regions and form pooled representations that describe the density levels of people at different locations. Given this density-level context, the estimated density map should be consistent with the ground truth at different scales. The number of scale levels and the output size at each scale control the trade-off between training speed and estimation accuracy. The authors use three scale levels with output sizes 1×1, 2×2, and 4×4. The first, with output size 1×1, captures the global characteristics of the density level, while the other two represent the local density levels of image patches.
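A rough NumPy sketch of the idea, assuming non-overlapping average pooling and a per-scale 1/k_j normalization (the paper's exact normalization may differ; all names and data are illustrative):

```python
import numpy as np

def avg_pool(dmap, k):
    # average-pool a (H, W) density map down to a k x k grid
    h, w = dmap.shape
    return dmap.reshape(k, h // k, k, w // k).mean(axis=(1, 3))

def consistency_loss(est, gt, levels=(1, 2, 4)):
    # L1 distance between pooled density levels at each scale level
    loss = 0.0
    for k in levels:
        loss += np.abs(avg_pool(est, k) - avg_pool(gt, k)).sum() / (k * k)
    return float(loss)

# toy maps: the estimate is uniformly denser than the ground truth
est = np.ones((4, 4)) * 0.5
gt = np.ones((4, 4)) * 0.25
print(consistency_loss(est, gt))  # 0.75, i.e. 0.25 disagreement per scale
```

The 1×1 level compares total density (a global constraint), while the 2×2 and 4×4 levels compare quadrant- and patch-level density (local constraints).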

 Final objective function

Through the weighted summation of the above two loss functions, the entire network is trained using the following objective function:

L = L_E + \lambda L_C

Where λ is the weight balancing the pixel-level loss and the density level consistency loss. The values of λ used for the different datasets in the experiments are shown in Table 1.

image

 Table 1 λ values for the different datasets

3 Implementation

3.1 Ground truth generation

For images with dense crowds, a geometry-adaptive kernel is used to generate the ground-truth density maps, while for images with relatively sparse crowds, a fixed Gaussian kernel is used.
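For the fixed-kernel case, a minimal sketch of density map generation (the σ value and head positions are illustrative assumptions): placing one normalized Gaussian per annotated head makes the map integrate to the crowd count, which is what the summation-based counting in the introduction relies on.

```python
import numpy as np

def gaussian_density_map(shape, heads, sigma=4.0):
    # place one normalised Gaussian at each annotated head position so
    # that the map sums to the number of people (fixed-kernel case)
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape)
    for (py, px) in heads:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # each head contributes exactly 1 to the total
    return dmap

heads = [(30, 40), (60, 80), (64, 20)]  # hypothetical annotations
dmap = gaussian_density_map((128, 128), heads)
print(round(dmap.sum(), 4))  # 3.0, the number of annotated heads
```

The geometry-adaptive variant differs only in that σ is set per head from the distance to its nearest neighbors rather than fixed.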

3.2 Evaluation metrics

During testing, the entire image is fed into the network to generate an estimated density map. Mean absolute error (MAE) and mean squared error (MSE) are used to evaluate network performance. MAE reflects the accuracy of the model, while MSE reflects its robustness; lower values indicate better performance. The two metrics are defined as follows:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| C_i - C_i^{gt} \right|, \qquad MSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( C_i - C_i^{gt} \right)^2}

Where n is the number of images in the test set, C_i is the predicted count, and C_i^{gt} is the ground-truth count.
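The two metrics can be sketched directly in NumPy (note that crowd-counting papers conventionally report the root of the mean squared error under the name MSE); the counts below are made up for illustration:

```python
import numpy as np

def mae(pred, true):
    # mean absolute error over per-image predicted vs ground-truth counts
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(true, float))))

def mse(pred, true):
    # RMSE, reported as "MSE" by convention in crowd-counting papers
    d = np.asarray(pred, float) - np.asarray(true, float)
    return float(np.sqrt(np.mean(d ** 2)))

pred = [105, 298, 51]  # hypothetical predicted counts
true = [100, 300, 60]  # corresponding ground-truth counts
print(mae(pred, true))  # (5 + 2 + 9) / 3 ≈ 5.333
print(mse(pred, true))
```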

4 Experiments

4.1 Datasets

The paper evaluates DSNet on four public crowd counting datasets: ShanghaiTech, UCF-QNRF, UCF_CC_50, and UCSD.

ShanghaiTech: contains 1198 annotated images with 330,165 people in total, divided into Part A and Part B. Part A contains 482 images downloaded from the Internet, all of highly congested scenes, with crowd counts ranging from 33 to 3139; the training set contains 300 images and the test set 182. Part B contains 716 images of relatively sparse crowd scenes captured by fixed street cameras, with crowd counts ranging from 12 to 578; the training set contains 400 images and the test set 316.

UCF-QNRF: the most recently released and largest crowd dataset. It contains 1535 dense-crowd images collected from Flickr, web search, and Hajj footage. The dataset covers a wide range of scenes with rich variation in viewpoint, illumination, and density; counts range from 49 to 12,865, which makes the dataset more difficult and realistic. In addition, the images are of very high resolution, leading to drastic changes in head size.

UCF_CC_50: includes 50 black-and-white, low-resolution images of very dense crowd scenes. The annotated count per image ranges from 94 to 4543, with an average of 1280, which makes it challenging for deep learning methods.

UCSD: consists of 2000 frames captured by a surveillance camera, each of size 238×158. The density of this dataset is relatively low, from 11 to 46 people per image, about 25 on average. Frames 601 to 1400 are used as the training set and the remaining frames as the test set.

4.2 Comparison experiments

The authors conducted comparison experiments on four challenging public crowd counting datasets. The results are shown in Table 2. The proposed method achieves state-of-the-art performance on all datasets and all evaluation metrics, showing that it suits not only congested crowd scenes but also sparse ones.

image

Table 2 Comparison with state-of-the-art methods on the ShanghaiTech, UCF-QNRF, UCF_CC_50, and UCSD datasets. DSNet achieves the best performance, with a large improvement over current state-of-the-art methods.

Several example density maps produced by the method are shown in Figure 5. Clearly, DSNet achieves good results. Figure 5 also verifies that the method can capture heads of different sizes, making DSNet more robust and accurate.

image

Figure 5 Illustration of estimated density maps and crowd counts generated by DSNet. The first row shows four samples taken from the ShanghaiTech A, ShanghaiTech B, and UCF-QNRF datasets. The second row shows the density maps estimated by DSNet. The last row shows the corresponding ground-truth density maps. DSNet generates density maps close to the ground truth and accurate crowd counts.

4.3 Ablation experiments

In this section, the authors conduct ablation experiments on the ShanghaiTech B dataset to analyze the network components and the loss function.


Network architecture: DSNet comprises the backbone network, dense dilated convolution blocks, dense residual connections, and the multi-scale density level consistency loss. To demonstrate their effectiveness, the authors run experiments adding these components one by one. The results are shown in Table 3.

image

Table 3 Estimation errors of different network components on the ShanghaiTech B dataset. The number in parentheses is the number of dense dilated convolution blocks.

The authors use the backbone network plus the last three convolutional layers as the baseline model, which yields an MAE of 15.21, the weakest entry in the table, yet still better than most existing methods. Simply adding the proposed DDCBs reduces the MAE to 7.33, a large improvement over previous methods and the best performance up to that point. This shows that the densely scaled, large-receptive-field features produced by the dense dilated convolution blocks are essential for accurate and reliable crowd counting.

In addition, adding dense residual connections between the three dense dilated convolution blocks also improves the results, further lowering the MAE to 7.06, which indicates that dense residual connections further enlarge the scale diversity by reusing features from different DDCBs.

Finally, the density level consistency loss is added to train the whole network. It further reduces the MAE to 6.74, the best performance of the proposed method and state of the art on this dataset. The result shows that this loss makes the density levels of the estimated density map consistent with those of the ground truth, both globally and locally.

The authors also compare dense residual connections with plain residual connections. The results are shown in Table 4. With plain residual connections the estimation error drops to 6.81, because the features of the preceding block are reused, though features of other blocks with different scales are ignored. Dense residual connections address this and further reduce the MAE to 6.74, indicating that the scale diversity is further enlarged and the features are more effective.

image

Table 4 Estimation errors of different residual connections.

Loss function: The proposed loss uses three scale levels (i.e., average pooling with output sizes 1×1, 2×2, and 4×4). The authors experiment with these three scale levels to show that each of them helps regularize the consistency between the estimated density map and the ground truth. The results are shown in Table 5.

image

Table 5 Estimation errors of the proposed consistency loss at different scale levels. The numbers are the output sizes of the average pooling operation.

Before the consistency loss is added, the network reaches an MAE of 7.06. With a single scale level of output size 1×1, i.e., the global context of the density level of the whole input image, the MAE decreases to 6.95. Performance continues to improve as constraints on local density levels with output sizes 2×2 and 4×4 are added, lowering the MAE to 6.88 and 6.74 respectively. These incremental experiments show that both global and local regularization of density levels help constrain the estimated density map to agree with the ground-truth density map at different scales, producing high-quality density maps.

5 Conclusion

This paper proposes DSNet, a new end-to-end single-column model built from densely dilated convolution blocks with dense residual connections, which can estimate crowd counts accurately. These two components enlarge the scale diversity and receptive field of the features and address the problem of large scale variation, thus achieving good performance on the crowd counting task. In addition, the paper introduces a new loss that constrains the density levels of the estimated density map to be consistent with the ground truth at different scales. The method achieves state-of-the-art results on four challenging public crowd counting datasets, with significant improvements over previous methods.



Origin blog.51cto.com/15060462/2678067