Crowd Density Estimation--Paper Reading: DM-Count

  1. Paper address
  2. GitHub open source code address

The following translation is mainly machine translation combined with manual proofreading.

Abstract

In crowd counting, each training image contains multiple people, each annotated with a dot. Existing crowd counting methods either need to smooth each annotated point with a Gaussian kernel or estimate the likelihood of every pixel given the annotated points. In this paper, we show that imposing Gaussians on the annotations hurts generalization performance. Instead, we propose Distribution Matching for crowd counting (DM-Count). In DM-Count, we use Optimal Transport (OT) to measure the similarity between the normalized predicted density map and the normalized GT density map. To stabilize the OT computation, we further include a Total Variation (TV) loss in our model. We show that DM-Count has a tighter generalization error bound than the Gaussian smoothing methods. In terms of mean absolute error (MAE), DM-Count substantially outperforms previous state-of-the-art methods on two large-scale counting datasets (UCF-QNRF and NWPU), and achieves state-of-the-art results on the ShanghaiTech and UCF-CC50 datasets. DM-Count reduces the error of the latest published state-of-the-art result by approximately 16%.

1. Introduction

Image-based crowd counting is an important research problem with applications in many domains, including journalism and surveillance. Current state-of-the-art methods [54, 8, 25, 55, 61, 59, 17, 48, 21, 23, 36] treat crowd counting as a density map estimation problem: a deep neural network first produces a 2D crowd density map for a given input image, and the total crowd count is then estimated by summing the density values over all spatial locations of the map. For images of dense crowds, this density map estimation approach has been shown to be more robust than detect-then-count methods [22, 19, 62, 12], because it is less sensitive to occlusion and does not need to make hard, binarized detection decisions prematurely.

A key step in developing density map estimation methods is training a deep neural network that maps an input image to the corresponding annotated density map. In all existing crowd counting datasets [15, 60, 14, 51], the annotated density map of each training image is a sparse binary mask in which each person's head is marked with a single dot. The spatial extent of each person is not provided, because delineating it is laborious, especially under heavy occlusion. Given training images with point annotations, training a density map estimation network amounts to optimizing the network parameters to minimize a differentiable loss function that measures the difference between the predicted density map and the point-annotation map. It is worth noting that the former is a dense real-valued matrix, while the latter is a sparse binary matrix. Given the sparsity of the points, it is difficult to train with a loss defined on the pixel-wise differences between the annotated and predicted density maps, since the values in the sparse binary matrix are heavily imbalanced between 0 and 1. One way to alleviate this problem is to turn each annotated point into a Gaussian blob, so that the ground truth is more balanced and the network is easier to train. Almost all previous crowd density map estimation methods [56, 57, 60, 38, 20, 33, 35, 4, 28, 50, 27, 40, 29, 26] follow this convention. Unfortunately, the performance of the resulting network depends heavily on the quality of this "pseudo ground truth", and given the large variation in the size and shape of people under the perspective of crowded scenes, setting the kernel width correctly is not an easy task.

Recently, Ma et al. [31] proposed a Bayesian loss to measure the difference between the predicted and annotated density maps. This method converts a binary GT annotation map into N "smoothed GT" density maps, where N is the number of people; each pixel value of a smoothed GT density map is the posterior probability of the corresponding annotation point given that pixel's location. Empirically, this method outperforms the other aforementioned methods [60, 38, 20, 33, 35, 4]. However, this loss function has two main problems. First, it still requires a Gaussian kernel to construct the likelihood function for each annotated point, which involves setting the kernel width. Second, the loss corresponds to an underdetermined system of equations with infinitely many solutions: the loss can be zero for many density maps that are not similar to the GT density map. As a result, training with this loss may yield a predicted density map that is very different from the GT density map.

In this paper, we address the shortcomings of existing methods with the following contributions.
• We show theoretically and empirically that imposing Gaussians on the annotations hurts the generalization performance of crowd counting networks.
• We propose DM-Count, a Distribution Matching approach to crowd counting. Unlike previous work, DM-Count does not require any Gaussian smoothing of the GT annotations. Instead, we use optimal transport (OT) to measure the similarity between the normalized predicted density map and the normalized GT density map. To stabilize the OT computation, we further add a Total Variation (TV) loss.
• We give generalization error bounds for counting loss, OT loss, TV loss and overall loss in our method. All bounds are tighter than those of the Gaussian smoothing method.
• Empirically, our method greatly improves the state-of-the-art on four challenging crowd counting datasets: UCF-QNRF, NWPU, ShanghaiTech and UCF-CC50. Notably, our method reduces the published state-of-the-art MAE on the NWPU dataset by about 16%.

2. Previous work

2.1 Crowd counting methods

Crowd counting methods can be divided into three categories: detect-then-count methods, direct count regression, and density map estimation. Early methods [22, 19, 62, 12] detect people, heads, or upper bodies in images and then count them. However, accurate detection is difficult in dense crowds, and it requires bounding box annotations, which are laborious and ambiguous to obtain under severe occlusion. Later methods [5, 6, 49, 7] avoid the detection problem and directly learn to regress counts from feature vectors, but their results are hard to interpret and the point annotations are underutilized. Recent studies [20, 35, 15, 4, 31, 50, 27, 40, 29, 26, 54, 8, 25, 55, 61, 59, 17, 48, 21, 23, 30, 47, 53, 43, 37, 56, 18, 39, 24, 44, 42] estimate density maps, which has been shown to be more reliable than the detect-then-count and count-regression approaches.
Density map estimation methods typically define the training loss on the pixel-wise difference between a Gaussian-smoothed density map and the predicted one. Instead of using a single kernel width to smooth the point annotations, [60, 14, 47] use adaptive kernel widths, chosen based on the distance to the nearest neighboring annotation points. In [15], multiple smoothed GT density maps are generated at different density levels, and the final loss incorporates reconstruction errors from all of them. However, these methods assume that the crowd is evenly distributed, whereas real crowd distributions are highly irregular. The Bayesian loss method [31] uses a Gaussian to construct a likelihood function for each annotated point; however, since the loss is underdetermined, it may fail to predict the correct density. A detailed analysis is given in Section 4.2. (A sketch of the adaptive-kernel smoothing recipe follows below.)
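To make the "pseudo ground truth" construction concrete, here is a minimal sketch of geometry-adaptive Gaussian smoothing in the spirit of [60]: each annotated point is blurred with a kernel whose width is proportional to the average distance to its k nearest neighbours. This is only an illustration of the general recipe, not the implementation of any cited paper; the choices k=3, beta=0.3, and the fallback sigma for a single annotation are arbitrary.

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.ndimage import gaussian_filter

def adaptive_gaussian_density(points, height, width, k=3, beta=0.3):
    """Geometry-adaptive Gaussian smoothing of point annotations (sketch).

    points: (N, 2) array of (row, col) head locations.
    Returns a (height, width) density map whose sum is approximately N
    (mass falling outside the image borders is lost).
    """
    density = np.zeros((height, width), dtype=np.float32)
    if len(points) == 0:
        return density
    tree = KDTree(points)
    for p in points:
        # Kernel width from the mean distance to the k nearest neighbours
        # (index 0 of the query result is the point itself).
        dists, _ = tree.query(p, k=min(k + 1, len(points)))
        sigma = beta * float(np.mean(np.atleast_1d(dists)[1:])) if len(points) > 1 else 15.0
        blob = np.zeros((height, width), dtype=np.float32)
        r = int(np.clip(round(p[0]), 0, height - 1))
        c = int(np.clip(round(p[1]), 0, width - 1))
        blob[r, c] = 1.0
        density += gaussian_filter(blob, sigma, mode='constant')  # not optimized: one filter per point
    return density
```

The width parameter beta (and the multi-level smoothing of [15]) is exactly the kind of hand-tuned choice that DM-Count avoids.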

2.2 Optimal Transport

We propose a novel loss function based on Optimal Transport (OT) [46]. To better understand the proposed method, we briefly review the Monge-Kantorovich OT formulation in this section.

Optimal transport is the problem of transforming one probability distribution into another with minimal cost. Let $\mathcal{X}=\left\{\mathbf{x}_{i} \mid \mathbf{x}_{i} \in \mathbb{R}^{d}\right\}_{i=1}^{n}$ and $\mathcal{Y}=\left\{\mathbf{y}_{j} \mid \mathbf{y}_{j} \in \mathbb{R}^{d}\right\}_{j=1}^{n}$ be two sets of points in a $d$-dimensional vector space. Let $\mu$ and $\nu$ be two probability measures defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively, with $\mu, \nu \in \mathbb{R}_{+}^{n}$ and $\mathbf{1}_{n}^{T}\mu=\mathbf{1}_{n}^{T}\nu=1$ ($\mathbf{1}_{n}$ is the $n$-dimensional all-ones vector). Let $c: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}_{+}$ be the cost of moving a unit of mass from a point in $\mathcal{X}$ to a point in $\mathcal{Y}$, and let $\mathbf{C}$ be the corresponding $n \times n$ cost matrix with $\mathbf{C}_{ij}=c\left(\mathbf{x}_{i}, \mathbf{y}_{j}\right)$. Let $\Gamma$ be the set of transport plans that move the probability mass from $\mathcal{X}$ to $\mathcal{Y}$: $\Gamma=\left\{\gamma \in \mathbb{R}_{+}^{n \times n}: \gamma \mathbf{1}=\mu, \gamma^{T} \mathbf{1}=\nu\right\}$. The Monge-Kantorovich optimal transport (OT) cost between $\mu$ and $\nu$ is defined as:
$$\mathcal{W}(\mu, \nu)=\min _{\gamma \in \Gamma}\langle\mathbf{C}, \gamma\rangle \tag{1}$$
Intuitively, if the probability distribution $\mu$ is viewed as a unit amount of "dirt" piled on $\mathcal{X}$ and $\nu$ as a unit amount of dirt piled on $\mathcal{Y}$, the OT cost is the minimum "cost" of turning one pile into the other. The OT cost is a principled metric for quantifying the dissimilarity between two probability distributions that also takes into account the distance between "dirt" locations.
The OT cost can also be computed via the dual formulation:
$$\mathcal{W}(\mu, \nu)=\max _{\alpha, \beta \in \mathbb{R}^{n}}\langle\alpha, \mu\rangle+\langle\beta, \nu\rangle, \quad \text { s.t. } \alpha_{i}+\beta_{j} \leq c\left(\mathbf{x}_{i}, \mathbf{y}_{j}\right), \forall i, j \tag{2}$$
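To make the formulation concrete, the sketch below computes an entropy-regularized approximation of (1) with Sinkhorn iterations, the algorithm the authors later use to solve (2) (Section 3). It is a generic textbook illustration with made-up inputs, not the authors' implementation; `reg` is the entropic regularization weight, and the scaling vectors relate to the dual potentials of the regularized problem via alpha ≈ reg·log u and beta ≈ reg·log v.

```python
import numpy as np

def sinkhorn_ot(mu, nu, C, reg=1.0, n_iters=100):
    """Entropy-regularized OT via Sinkhorn iterations (minimal sketch).

    mu, nu: probability vectors of length n (non-negative, each summing to 1).
    C: (n, n) cost matrix with C[i, j] = c(x_i, y_j).
    Returns the approximate transport plan gamma and the cost <C, gamma>.
    """
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)                     # enforce row marginals:    gamma 1   = mu
        v = nu / (K.T @ u)                   # enforce column marginals: gamma^T 1 = nu
    gamma = u[:, None] * K * v[None, :]
    return gamma, float(np.sum(gamma * C))

# Tiny example: two distributions over 4 points on a line, quadratic cost.
x = np.arange(4, dtype=float)
C = (x[:, None] - x[None, :]) ** 2
mu = np.array([0.7, 0.1, 0.1, 0.1])
nu = np.array([0.1, 0.1, 0.1, 0.7])
gamma, cost = sinkhorn_ot(mu, nu, C)
print(cost)   # approximate OT cost of moving the mass from mu to nu
```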

3. DM-Count: Distribution Matching for Crowd Counting

We consider crowd counting to be a distribution matching problem. In this section, we propose DM-Count (Distribution Matching for Crowd Counting), a network for crowd counting that takes an image as input and outputs a map of density values. The final count estimate is obtained by summing over the predicted density map. DM-Count is agnostic to the network architecture; in our experiments, we use the same network as in the Bayesian loss paper [31]. Unlike all previous density map estimation methods, which require a Gaussian kernel to smooth the GT point annotations, DM-Count does not need any Gaussian preprocessing of the GT annotations.

Let $\mathbf{z} \in \mathbb{R}_{+}^{n}$ denote the vectorized binary map of point annotations and $\hat{\mathbf{z}} \in \mathbb{R}_{+}^{n}$ the vectorized predicted density map returned by the neural network. Treating $\mathbf{z}$ and $\hat{\mathbf{z}}$ as unnormalized density functions, we define the loss function of DM-Count as the sum of three terms: a counting loss, an OT loss, and a Total Variation (TV) loss. The first term measures the difference between the total masses, while the last two measure the difference between the normalized density distributions.

Counting loss. Let $\|\cdot\|_{1}$ denote the $L_{1}$ norm of a vector, so $\|\mathbf{z}\|_{1}$ and $\|\hat{\mathbf{z}}\|_{1}$ are the ground-truth and predicted counts, respectively. The goal of crowd counting is to make $\|\hat{\mathbf{z}}\|_{1}$ as close as possible to $\|\mathbf{z}\|_{1}$, so the counting loss is defined as the absolute difference between them:
$$\ell_{C}(\mathbf{z}, \hat{\mathbf{z}})=\left|\|\mathbf{z}\|_{1}-\|\hat{\mathbf{z}}\|_{1}\right| \tag{3}$$
Optimal transport loss. Both $\mathbf{z}$ and $\hat{\mathbf{z}}$ are unnormalized density functions, but we can turn them into probability density functions (pdfs) by dividing them by their respective sums. Besides OT, the Kullback-Leibler divergence and the Jensen-Shannon divergence could also measure the similarity between two pdfs. However, when the source and target distributions do not overlap, these measures do not provide useful gradients for training the network [32]. We therefore use OT in this work and define the OT loss as:
$$\ell_{OT}(\mathbf{z}, \hat{\mathbf{z}})=\mathcal{W}\left(\frac{\mathbf{z}}{\|\mathbf{z}\|_{1}}, \frac{\hat{\mathbf{z}}}{\|\hat{\mathbf{z}}\|_{1}}\right)=\left\langle\alpha^{*}, \frac{\mathbf{z}}{\|\mathbf{z}\|_{1}}\right\rangle+\left\langle\beta^{*}, \frac{\hat{\mathbf{z}}}{\|\hat{\mathbf{z}}\|_{1}}\right\rangle \tag{4}$$
where $\alpha^{*}$ and $\beta^{*}$ are the solutions of Eq. (2). We use the quadratic transport cost $c(\mathbf{z}(i), \hat{\mathbf{z}}(j))=\|\mathbf{z}(i)-\hat{\mathbf{z}}(j)\|_{2}^{2}$, where $\mathbf{z}(i)$ and $\hat{\mathbf{z}}(j)$ are the 2D coordinates of pixels $i$ and $j$, respectively. To prevent division-by-zero errors, we add a machine-precision constant to the denominators. Since the entries of $\hat{\mathbf{z}}$ are non-negative, Eq. (4) is differentiable with respect to $\hat{\mathbf{z}}$, with gradient:
$$\frac{\partial \ell_{OT}(\mathbf{z}, \hat{\mathbf{z}})}{\partial \hat{\mathbf{z}}}=\frac{\beta^{*}}{\|\hat{\mathbf{z}}\|_{1}}-\frac{\left\langle\beta^{*}, \hat{\mathbf{z}}\right\rangle}{\|\hat{\mathbf{z}}\|_{1}^{2}} \tag{5}$$
This gradient can be backpropagated to learn the parameters of the density estimation network.
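As a concrete reference for equations (4) and (5), the sketch below builds the quadratic-cost matrix over pixel coordinates and assembles the OT loss and its gradient, assuming the dual potentials have already been approximated (for example as reg·log u and reg·log v from a Sinkhorn run like the one sketched in Section 2.2). Variable names are ours and the code is not the released implementation.

```python
import numpy as np

def grid_cost_matrix(height, width):
    """Quadratic transport cost between all pairs of pixel coordinates.

    Note: the matrix is (H*W) x (H*W), so this is only feasible for small
    or downsampled density maps.
    """
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing='ij')
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)   # (n, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sum(diff ** 2, axis=-1)                                       # (n, n)

def ot_loss_and_grad(z, z_hat, alpha, beta, eps=1e-8):
    """OT loss, eq. (4), and its gradient w.r.t. z_hat, eq. (5) (sketch).

    z, z_hat: flattened GT and predicted density maps of length n.
    alpha, beta: approximate dual potentials of eq. (2).
    eps guards against division by zero, as mentioned in the text.
    """
    z_sum = z.sum() + eps
    z_hat_sum = z_hat.sum() + eps
    loss = np.dot(alpha, z / z_sum) + np.dot(beta, z_hat / z_hat_sum)
    grad = beta / z_hat_sum - np.dot(beta, z_hat) / (z_hat_sum ** 2)        # eq. (5)
    return loss, grad
```

The gradient here is with respect to the predicted map; during training it is backpropagated further into the network parameters, as noted above.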

Total Variation loss. In each training iteration, we use the Sinkhorn algorithm [34] to approximate $\alpha^{*}$ and $\beta^{*}$. Its time complexity is $O\left(n^{2} \log n / \epsilon^{2}\right)$ [9], where $\epsilon$ is the desired optimality gap, i.e., the difference between the returned objective and the optimal one. When optimizing with the Sinkhorn algorithm, the objective drops sharply at the beginning but converges only slowly to the optimal solution in later iterations. In practice we set a maximum number of iterations, so the Sinkhorn algorithm returns only an approximate solution. As a result, when the OT loss is optimized with the Sinkhorn algorithm, the predicted density map ends up close to the GT density map but not identical to it. The OT loss approximates dense crowd regions well, but the approximation can be poor in low-density regions. To address this, we additionally use the Total Variation (TV) loss, defined as:
$$\ell_{TV}(\mathbf{z}, \hat{\mathbf{z}})=\left\|\frac{\mathbf{z}}{\|\mathbf{z}\|_{1}}-\frac{\hat{\mathbf{z}}}{\|\hat{\mathbf{z}}\|_{1}}\right\|_{TV}=\frac{1}{2}\left\|\frac{\mathbf{z}}{\|\mathbf{z}\|_{1}}-\frac{\hat{\mathbf{z}}}{\|\hat{\mathbf{z}}\|_{1}}\right\|_{1} \tag{6}$$
The TV loss also stabilizes the training process. Optimizing the OT loss with the Sinkhorn algorithm is a min-max saddle-point optimization, similar to GAN training [13]. As shown in Pix2Pix GAN [16], the stability of GAN training can be improved by adding a reconstruction loss; the TV loss plays an analogous role here and likewise stabilizes training. The gradient of the TV loss with respect to the predicted density map $\hat{\mathbf{z}}$ is:
$$\frac{\partial \ell_{TV}(\mathbf{z}, \hat{\mathbf{z}})}{\partial \hat{\mathbf{z}}}=-\frac{1}{2}\left(\frac{\operatorname{sign}(\mathbf{v})}{\|\hat{\mathbf{z}}\|_{1}}-\frac{\langle\operatorname{sign}(\mathbf{v}), \hat{\mathbf{z}}\rangle}{\|\hat{\mathbf{z}}\|_{1}^{2}}\right) \tag{7}$$
where $\mathbf{v}=\mathbf{z} /\|\mathbf{z}\|_{1}-\hat{\mathbf{z}} /\|\hat{\mathbf{z}}\|_{1}$ and $\operatorname{sign}(\cdot)$ is the element-wise sign function.
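A matching sketch of the TV loss (6) and its gradient (7), again on flattened density maps and with the same eps guard; it mirrors the formulas above rather than any released code.

```python
import numpy as np

def tv_loss_and_grad(z, z_hat, eps=1e-8):
    """Total Variation loss, eq. (6), and its gradient w.r.t. z_hat, eq. (7) (sketch)."""
    z_sum = z.sum() + eps
    z_hat_sum = z_hat.sum() + eps
    v = z / z_sum - z_hat / z_hat_sum
    loss = 0.5 * np.abs(v).sum()
    s = np.sign(v)
    grad = -0.5 * (s / z_hat_sum - np.dot(s, z_hat) / (z_hat_sum ** 2))     # eq. (7)
    return loss, grad
```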
The Overall Objective. The overall loss function is the sum of the counting loss, the OT loss, and the TV loss:
$$\ell(\mathbf{z}, \hat{\mathbf{z}})=\ell_{C}(\mathbf{z}, \hat{\mathbf{z}})+\lambda_{1} \ell_{OT}(\mathbf{z}, \hat{\mathbf{z}})+\lambda_{2}\|\mathbf{z}\|_{1} \ell_{TV}(\mathbf{z}, \hat{\mathbf{z}}) \tag{8}$$

where $\lambda_{1}$ and $\lambda_{2}$ are tunable hyperparameters for the OT and TV losses. To keep the TV loss on the same scale as the counting loss, we multiply it by the total head count $\|\mathbf{z}\|_{1}$.

Given $K$ training images $\left\{I_{k}\right\}_{k=1}^{K}$ and their corresponding point-annotation maps $\left\{\mathbf{z}_{k}\right\}_{k=1}^{K}$, we learn a deep neural network $f$ for density map estimation by minimizing $L(f)=\frac{1}{K} \sum_{k=1}^{K} \ell\left(\mathbf{z}_{k}, f\left(I_{k}\right)\right)$.
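Putting the pieces together, a sketch of the overall objective (8) and its gradient with respect to the predicted map is shown below; it reuses `ot_loss_and_grad` and `tv_loss_and_grad` from the sketches above, uses the subgradient sign(||z_hat||_1 - ||z||_1) for the counting loss, and plugs in the lambda values reported later in Section 5.2. The assembly is ours, not the reference code; in training, this gradient would be backpropagated through the network f.

```python
import numpy as np

def dm_count_loss_and_grad(z, z_hat, alpha, beta, lam1=0.1, lam2=0.01, eps=1e-8):
    """Overall DM-Count objective, eq. (8), and its gradient w.r.t. z_hat (sketch).

    Relies on ot_loss_and_grad / tv_loss_and_grad defined in the earlier sketches.
    """
    # Counting loss | ||z||_1 - ||z_hat||_1 | and its subgradient.
    count_loss = abs(z.sum() - z_hat.sum())
    count_grad = np.sign(z_hat.sum() - z.sum()) * np.ones_like(z_hat)

    ot_loss, ot_grad = ot_loss_and_grad(z, z_hat, alpha, beta, eps)
    tv_loss, tv_grad = tv_loss_and_grad(z, z_hat, eps)

    # The TV term is scaled by the total head count ||z||_1, as in eq. (8).
    loss = count_loss + lam1 * ot_loss + lam2 * z.sum() * tv_loss
    grad = count_grad + lam1 * ot_grad + lam2 * z.sum() * tv_grad
    return loss, grad
```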

4. Generalization Bounds and Theoretical Analysis

In this section, we analyze the theoretical properties of the Gaussian smoothing method, the Bayesian loss, and the proposed DM-Count. Proofs of the theorems in this section can be found in the supplementary material. First, we introduce some notation.

Let $\mathcal{D}=\{(I, \mathbf{z})\}$ be the joint distribution of crowd images and the corresponding point-annotation maps. Let $\mathcal{H}$ be a hypothesis space; each $h \in \mathcal{H}$ maps an image $I \in \mathcal{I}$ to one dimension of $\mathbf{z} \in \mathcal{Z}$. Let $\mathcal{F}=\mathcal{H} \times \cdots \times \mathcal{H}$ ($n$ times) be the space of mappings; each $f \in \mathcal{F}$ maps $I \in \mathcal{I}$ to $\mathbf{z} \in \mathcal{Z}$. Let $\mathbf{t}$ be the Gaussian-smoothed density map for each $\mathbf{z} \in \mathcal{D}$, and let $\tilde{\mathcal{D}}=\{(I, \mathbf{t})\}$ be the joint distribution of $(I, \mathbf{t})$. Let $S=\left\{\left(I_{k}, \mathbf{z}_{k}\right)\right\}_{k=1}^{K}$ and $\tilde{S}=\left\{\left(I_{k}, \mathbf{t}_{k}\right)\right\}_{k=1}^{K}$ be finite sets of $K$ samples drawn i.i.d. from $\mathcal{D}$ and $\tilde{\mathcal{D}}$, respectively. Let $R_{S}(\mathcal{H})$ denote the empirical Rademacher complexity [3] of $\mathcal{H}$ with respect to $S$. Given a dataset $D \in\{\mathcal{D}, S, \tilde{\mathcal{D}}, \tilde{S}\}$, a mapping $f \in \mathcal{F}$, and a loss function $\ell$, let $\mathcal{R}(D, f, \ell)=\mathbb{E}_{(I, \mathbf{s}) \sim D}[\ell(\mathbf{s}, f(I))]$ denote the expected risk, and let $\ell_{1}(\mathbf{z}, \hat{\mathbf{z}})=\|\mathbf{z}-\hat{\mathbf{z}}\|_{1}$. Let $f_{\Delta}^{D}=\operatorname{argmin}_{f \in \mathcal{F}} \mathcal{R}\left(D, f, \ell_{\Delta}\right)$ be the minimizer of the risk $\mathcal{R}\left(D, f, \ell_{\Delta}\right)$ of loss $\ell_{\Delta}$ on dataset $D$, where $D \in\{\mathcal{D}, S, \tilde{\mathcal{D}}, \tilde{S}\}$ and $\Delta \in\{1, C, OT, TV, \emptyset\}$.

4.1 Generalization Error Bound for the Gaussian Smoothing Method

Many existing methods (e.g., [60, 20, 35]) train with Gaussian-smoothed annotation maps. Below we give the generalization error bound for the $\ell_{1}$ loss on the density maps.
Theorem 1: Suppose that $\forall f \in \mathcal{F}$ and $(I, \mathbf{t}) \sim \tilde{\mathcal{D}}$ we have $\ell(\mathbf{t}, f(I)) \leq B$. Then for any $0<\delta<1$, with probability at least $1-\delta$:
a) the upper bound on the generalization error is
$$\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right) \leq \mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{\mathcal{D}}}, \ell_{1}\right)+2 n R_{S}(\mathcal{H})+5 B \sqrt{2 \log (8 / \delta) / K}+\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}$$
b) the lower bound on the generalization error is
$$\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right) \geq\left|\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}-\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{S}}, \ell_{1}\right)\right|$$
In this theorem, as the number of samples $K$ goes to infinity, $2 n R_{S}(\mathcal{H})$ and $5 B \sqrt{2 \log (8 / \delta) / K}$ go to $0$. Theorem 1.a) shows that the expected risk $\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right)$ of the empirical minimizer trained on the Gaussian-smoothed GT, evaluated on the real GT data, is, given sufficient training data, no more than $\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{\mathcal{D}}}, \ell_{1}\right)+\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}$. Theorem 1.b) shows that this expected risk is no less than $\left|\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}-\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{S}}, \ell_{1}\right)\right|$. Note that $\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{S}}, \ell_{1}\right) \leq \mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}$, so the smaller $\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{S}}, \ell_{1}\right)$ is, the larger the lower bound on the expected risk $\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right)$ becomes. In other words, the better the model $f_{1}^{\tilde{S}}$ fits the Gaussian-smoothed GT $\tilde{\mathcal{D}}$, the worse it generalizes to the real GT $\mathcal{D}$. Moreover, since $\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1} \neq 0$, even when $\mathcal{R}\left(\tilde{\mathcal{D}}, f_{1}^{\tilde{S}}, \ell_{1}\right)=0$ we have $\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right)>0$: the risk on the real GT is bounded away from zero by $\mathbb{E}_{(I, \mathbf{z}) \sim \mathcal{D}}\|\mathbf{z}-\mathbf{t}\|_{1}$. This is undesirable, since we want the risk $\mathcal{R}\left(\mathcal{D}, f_{1}^{\tilde{S}}, \ell_{1}\right)$ evaluated on the real GT to approach $0$.

4.2 The Underdetermined Bayesian Loss

The Bayesian loss [31] is:

$$\ell_{\text{Bayesian}}(\mathbf{z}, \hat{\mathbf{z}})=\sum_{i=1}^{N}\left|1-\left\langle\mathbf{p}_{i}, \hat{\mathbf{z}}\right\rangle\right|, \quad \text{where} \quad \mathbf{p}_{i}=\frac{\mathcal{N}\left(\mathbf{q}_{i}, \sigma^{2} \mathbf{1}_{2 \times 2}\right)}{\sum_{i=1}^{N} \mathcal{N}\left(\mathbf{q}_{i}, \sigma^{2} \mathbf{1}_{2 \times 2}\right)} \tag{9}$$
where $N$ is the number of heads in $\mathbf{z}$, and $\mathcal{N}\left(\mathbf{q}_{i}, \sigma^{2} \mathbf{1}_{2 \times 2}\right)$ is a Gaussian distribution with mean $\mathbf{q}_{i}$ and covariance $\sigma^{2} \mathbf{1}_{2 \times 2}$, $\mathbf{q}_{i}$ being the $i^{th}$ annotated point in $\mathbf{z}$. The dimension of both $\mathbf{p}_{i}$ and $\mathbf{z}$ is $n$, the number of pixels in the density map. However, because the number of annotated points $N$ is smaller than $n$, the Bayesian loss is underdetermined: for a GT annotation $\mathbf{z}$, there are infinitely many $\hat{\mathbf{z}}$ with $\ell_{\text{Bayesian}}(\mathbf{z}, \hat{\mathbf{z}})=0$ and $\hat{\mathbf{z}} \neq \mathbf{z}$. Consequently, the predicted density map can be very different from the GT density map.
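To make the underdetermination concrete, the 1D toy sketch below builds the posteriors p_i of eq. (9) on a small grid and constructs, by hand, a density map z_hat whose Bayesian loss is essentially zero yet which places all of its mass next to, rather than on, the annotated pixels. This is our illustrative construction (the 1D grid, sigma = 2, and the chosen head positions are arbitrary), not an experiment from the paper.

```python
import numpy as np
from scipy.stats import norm

# 1D toy: n pixels, N = 2 annotated head locations (pixels 10 and 28).
n, sigma = 40, 2.0
grid = np.arange(n, dtype=float)
heads = [10, 28]

# Posteriors p_i(x) = N(x; q_i, sigma^2) / sum_j N(x; q_j, sigma^2), cf. eq. (9).
lik = np.stack([norm.pdf(grid, loc=q, scale=sigma) for q in heads])   # (N, n)
P = lik / lik.sum(axis=0, keepdims=True)

z = np.zeros(n)
z[heads] = 1.0                         # binary GT annotation map

z_hat = np.zeros(n)                    # a clearly different density map:
z_hat[[9, 11]] = 0.5                   # first head's unit mass split over its neighbours
z_hat[[27, 29]] = 0.5                  # second head's unit mass split over its neighbours

bayesian_loss = np.abs(1.0 - P @ z_hat).sum()
print(bayesian_loss, np.abs(z_hat - z).sum())   # loss ~ 0, yet ||z_hat - z||_1 = 4
```

Any non-negative z_hat satisfying the N equations <p_i, z_hat> = 1 attains zero loss, and with N far smaller than n there are infinitely many such maps.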

4.3 Generalization Error Bounds for Loss Functions in DM-Count

In the following theorem, we give generalization error bounds for the loss functions of the proposed method.
Theorem 2: Suppose that $\forall f \in \mathcal{F}$ and $(I, \mathbf{z}) \sim \mathcal{D}$ we have $\|\mathbf{z}\|_{1} \geq 1$ and $\|f(I)\|_{1} \geq 1$ (which can be satisfied by appending to both $\mathbf{z}$ and $f(I)$ a dummy dimension with value 1), and $\ell_{C}(\mathbf{z}, f(I)) \leq B$. Then for any $0<\delta<1$, with probability at least $1-\delta$:
a) the generalization error bound for the counting loss is
$$\mathcal{R}\left(\mathcal{D}, f_{C}^{S}, \ell_{C}\right) \leq \mathcal{R}\left(\mathcal{D}, f_{C}^{\mathcal{D}}, \ell_{C}\right)+2 n R_{S}(\mathcal{H})+5 B \sqrt{2 \log (8 / \delta) / K}$$
b) the generalization error bound for the OT loss is
$$\mathcal{R}\left(\mathcal{D}, f_{OT}^{S}, \ell_{OT}\right) \leq \mathcal{R}\left(\mathcal{D}, f_{OT}^{\mathcal{D}}, \ell_{OT}\right)+4 \mathbf{C}_{\infty} n^{2} R_{S}(\mathcal{H})+5 \mathbf{C}_{\infty} \sqrt{2 \log (8 / \delta) / K}$$
c) the generalization error bound for the TV loss is
$$\mathcal{R}\left(\mathcal{D}, f_{TV}^{S}, \ell_{TV}\right) \leq \mathcal{R}\left(\mathcal{D}, f_{TV}^{\mathcal{D}}, \ell_{TV}\right)+n^{2} R_{S}(\mathcal{H})+5 \sqrt{2 \log (8 / \delta) / K}$$
d) the generalization error bound for the overall loss is
$$\mathcal{R}\left(\mathcal{D}, f^{S}, \ell\right) \leq \mathcal{R}\left(\mathcal{D}, f^{\mathcal{D}}, \ell\right)+\left(2 n+4 \lambda_{1} \mathbf{C}_{\infty} n^{2}+\lambda_{2} N n^{2}\right) R_{S}(\mathcal{H})+5\left(B+\lambda_{1} \mathbf{C}_{\infty}+\lambda_{2} N\right) \sqrt{2 \log (8 / \delta) / K}$$
where $\mathbf{C}_{\infty}$ is the maximum value of the OT cost matrix and $N=\sup \left\{\|\mathbf{z}\|_{1} \mid \forall(I, \mathbf{z}) \sim \mathcal{D}\right\}$ is the maximum head count over the dataset.
In the above theorem, as $K$ increases, $R_{S}(\mathcal{H})$ and $\sqrt{2 \log (1 / \delta) / K}$ decrease, and the expected risk $\mathcal{R}\left(\mathcal{D}, f_{\Delta}^{S}, \ell_{\Delta}\right)$ of each empirical minimizer $f_{\Delta}^{S}$ converges to the expected risk $\mathcal{R}\left(\mathcal{D}, f_{\Delta}^{\mathcal{D}}, \ell_{\Delta}\right)$ of the optimal minimizer $f_{\Delta}^{\mathcal{D}}$, for $\Delta \in\{C, OT, TV, \emptyset\}$. This means that all the bounds are tight. Furthermore, all the upper bounds are tighter than that of the Gaussian smoothing method given in Theorem 1.a). Theorem 2.b) shows that the bound for the OT loss depends on the maximum transport cost $\mathbf{C}_{\infty}$; hence we should use a small transport cost in OT for better generalization. The coefficient of $R_{S}(\mathcal{H})$ is $O(n)$ for the counting loss and $O\left(n^{2}\right)$ for the OT and TV losses. This means that for larger image sizes we need more training images: the required number grows linearly with the size of $\mathbf{z}$ for the counting loss, but quadratically for the OT and TV losses. When combining the three losses, we therefore set $\lambda_{1}$ and $\lambda_{2}$ to small values to balance the three loss terms.

5. Experiments

In this section, we present experiments on toy data and benchmark crowd counting datasets. A more detailed dataset description, implementation details and experimental setup can be found in the supplementary material.

5.1 Results on Toy Data

To understand the empirical behavior of different methods, we consider a toy problem in which the task is to move a source density map $\hat{\mathbf{z}}$ to a target density map $\mathbf{z}$ using the pixel-wise loss, the Bayesian loss, and DM-Count. The source density map $\hat{\mathbf{z}}$ is initialized from a uniform distribution between $0$ and $0.01$, and the target density map is shown on the far left of Figure 1. All three methods start from the same source density map. Figure 1 visualizes $\hat{\mathbf{z}}$ at final convergence. The pixel-wise loss produces a blurry density map with a higher count. The Bayesian loss performs better than the pixel-wise loss in terms of counting error, peak signal-to-noise ratio (PSNR), and structural similarity (SSIM) [52], but its final density map is very different from the target, with high values at many locations without annotation points. This confirms our analysis that the Bayesian loss corresponds to an underdetermined system, so the output density map may differ significantly from the target density map. In contrast, DM-Count produces more accurate counts and density maps, and it significantly outperforms the Bayesian loss in both PSNR and SSIM.

5.2 Results on Benchmark Datasets

We conduct experiments on four challenging crowd counting datasets: UCF-QNRF [15], NWPU [51], ShanghaiTech [60] and UCF-CC-50 [14]. Notably, the NWPU dataset is the largest and most challenging crowd counting dataset publicly available today. GT counts for its test images are not released, and results on the test set must be obtained by submitting to the evaluation server at https://www.crowdbenchmark.com/nwpucrowd.html. Following previous work [35, 15, 4, 14, 60], we use mean absolute error (MAE), root mean square error (RMSE), and mean normalized absolute error (NAE) as evaluation metrics; for all three, smaller is better. For a fair comparison, we use the same network as in the Bayesian loss paper [31]. In all experiments, we set $\lambda_{1}=0.1$ and $\lambda_{2}=0.01$, set the Sinkhorn entropic regularization parameter to 10, and use 100 Sinkhorn iterations. On average, the OT computation takes 25 ms per image.
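For reference, the training settings stated above can be collected into a small configuration block; the numeric values are exactly those reported in this section, while the field names are ours and may not match the released code.

```python
# Settings reported in Section 5.2 (field names are ours).
DM_COUNT_CONFIG = {
    "backbone": "same network as the Bayesian loss paper [31]",
    "lambda_1": 0.1,        # weight of the OT loss in eq. (8)
    "lambda_2": 0.01,       # weight of the TV loss in eq. (8), further scaled by ||z||_1
    "sinkhorn_reg": 10,     # entropic regularization of the Sinkhorn solver
    "sinkhorn_iters": 100,  # see the ablation in Section 5.3
}
```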

Quantitative results. Tables 1 and 2 compare the performance of DM-Count with various methods. DM-Count outperforms all other methods in all experiments, except that CAN achieves a better MSE on the NWPU dataset. Even though we use the same set of hyperparameters for DM-Count in all experiments, it still achieves the best performance, which shows that DM-Count is stable across datasets.

In all experiments, DM-Count outperforms both the pixel-wise and the Bayesian losses under the same network architecture and training procedure. This demonstrates the effectiveness of the proposed loss function; in particular, the pixel-wise loss is much worse than DM-Count in Table 1. Furthermore, DM-Count achieves state-of-the-art performance on all four datasets even without the multi-scale architectures of [4, 47] or the deeper networks of [2, 50]. This shows how important a good loss function is in crowd counting.

DM-Count significantly outperforms state-of-the-art methods on the large and challenging UCF-QNRF and NWPU datasets. Specifically, on UCF-QNRF, DM-Count reduces the MAE and MSE of the Bayesian loss from 88.7 to 85.6 and from 154.8 to 148.3, respectively. It is worth noting that on the NWPU test set (results obtained by submitting to the evaluation server), DM-Count greatly reduces both MAE and NAE: MAE drops from 105.4 to 88.4, and NAE drops from 0.203 to 0.169.

Qualitative results. Figure 2 shows the density maps predicted with the pixel-wise loss, the Bayesian loss, and DM-Count. The figure shows that 1) the counts produced by DM-Count are closer to the GT counts, and 2) the density maps produced by DM-Count are sharper than those of the pixel-wise and Bayesian losses. In Figure 2, DM-Count also yields much higher PSNR and SSIM than the pixel-wise and Bayesian losses. On the entire UCF-QNRF test set, the average PSNR and SSIM are 34.79 and 0.43 for the pixel-wise loss, 34.55 and 0.42 for the Bayesian loss, and 40.65 and 0.55 for DM-Count. Because the pixel-wise loss uses a Gaussian-smoothed ground truth, it produces density maps that are blurrier than the real ground truth, which empirically validates our theoretical analysis of the generalization bound for the Gaussian smoothing method. As shown, the pixel-wise and Bayesian losses fail to localize people in densely crowded areas, whereas DM-Count localizes people well in both dense and sparse regions. Figure 3 shows further density maps predicted by DM-Count; they agree well with the crowd density in both sparse and dense areas, demonstrating the effectiveness of DM-Count for spatial density estimation.


5.3 Ablation Study

Hyperparameter study. We tune $\lambda_{1}$ and $\lambda_{2}$ of DM-Count on the UCF-QNRF dataset. First, we fix $\lambda_{1}$ at 0.1 and vary $\lambda_{2}$ over 0.01, 0.05, and 0.1; the MAE is 85.6, 87.8, and 88.5, respectively. Since $\lambda_{2}=0.01$ gives the best result, we then fix $\lambda_{2}$ at 0.01 and vary $\lambda_{1}$ over 0.01, 0.05, and 0.1; the MAE is 87.2, 86.2, and 85.6, respectively. We therefore set $\lambda_{1}=0.1$ and $\lambda_{2}=0.01$ and use these values for all datasets.

The effect of the number of Sinkhorn iterations. Table 3 lists the results of DM-Count on the UCF-QNRF dataset with different numbers of Sinkhorn iterations. As shown in the table, using a small number of iterations degrades the performance of DM-Count, indicating that the obtained OT solution is not accurate. When the number of iterations increases to 100, DM-Count surpasses the previous state of the art, and the performance stabilizes beyond 100 iterations. Therefore, we use 100 Sinkhorn iterations in all our experiments.

Contribution of each loss term. The DM-Count loss consists of three parts: the counting loss, the OT loss, and the TV loss. We investigate the contribution of each term on the UCF-QNRF dataset; the results are listed in Table 5. As shown in the table, all components are essential to the final performance, with the OT loss being the most important.

Robustness to noisy annotations. Crowd annotation is done by placing a point on each person. This process is ambiguous and leads to unavoidable annotation errors. We study how the different loss functions behave under such errors: we add uniform random noise to the original annotations and train the different models with the same noisy annotations. The noise magnitude is drawn uniformly between 0% and 5% of the image height, which is about 80 pixels on average. As shown in Table 4, the proposed DM-Count is more robust to annotation errors than the pixel-wise and Bayesian losses.

6. Conclusion

In this paper, we have shown that smoothing GT point annotations with a Gaussian kernel hurts the generalization performance of the model when it is evaluated on the real GT data. Instead, we treat crowd counting as a distribution matching problem and propose DM-Count, an optimal-transport-based method, to address it. Unlike previous work, DM-Count does not require a Gaussian kernel to smooth the annotated points, and its generalization error bound is tighter than that of the Gaussian smoothing methods. Extensive experiments on four crowd counting benchmarks demonstrate that DM-Count significantly outperforms previous state-of-the-art methods.


Origin blog.csdn.net/weixin_43335465/article/details/110816981