Dense Crowd Estimation / Crowd Counting Algorithms Based on Convolutional Neural Networks [Tutorial and Pitfalls Included]


Foreword

Graduation season is approaching, and the most important thing before it is the graduation project; my topic is research on dense crowd estimation algorithms. Like most of you, I started from scratch, piecing together the whole picture of crowd estimation while reading papers, modifying code, training models, and tuning parameters. This article strives to give beginners a deeper understanding of crowd density estimation algorithms, covering the whole process from understanding the algorithm in the early stage to reproducing paper code and training models later on. Network-disk download links for some of the code and datasets that appear in this article are provided at the end.

1. What is Dense Crowd Estimation?

In short: given a picture, count the number of people in it. Algorithms of this kind based on convolutional neural networks are inseparable from the generation of ground truth: a Gaussian kernel function turns the annotated head positions into a density map, and integrating the density map gives the estimated head count. Obviously the algorithm does not count people one by one; the headcount is estimated by integration. Why do it this way? Because the focus of crowd estimation is density. A picture may contain hundreds or thousands of people, and counting them individually is unrealistic even for an algorithm. The best approach at present is therefore to fit an estimate, which can be regarded as a kind of regression.
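To make the idea concrete, here is a minimal sketch (the three head coordinates are made up for illustration; real datasets store them in mat or json annotation files):

# density map sketch: head coordinates -> Gaussian density map -> count
import numpy as np
from scipy.ndimage import gaussian_filter

heads = [(40, 60), (42, 65), (100, 120)]  # (row, col) of each annotated head
density = np.zeros((240, 320))            # zero matrix at the image resolution
for r, c in heads:
    density[r, c] = 1                     # one unit impulse per head
density = gaussian_filter(density, sigma=15)  # spread each impulse with a Gaussian kernel
print(density.sum())  # ~3.0: integrating the density map recovers the head count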
If you want a general overview first, the following articles are worth reading:

A brief introduction to CNN-based crowd density map estimation methods
Crowd counting: starting from MCNN

2. Preparation before experiment

1. Github open source project - Awesome Crowd Counting

A project named Awesome will never let you down. This one organizes papers on crowd estimation together with the code for the models in those papers; of course, not every paper comes with code. One principle: start by reading a paper that has code, because a paper alone is always abstract, and you only really take the first step when you run the code yourself.
The project address is below. https://hub.fastgit.org is a GitHub mirror site that works around intermittent connectivity to GitHub; I strongly recommend it to readers.

Awesome Crowd Counting

2. Dataset download

Datasets for crowd density estimation are relatively limited; there are basically just a few common ones.
1. ShanghaiTech dataset [most commonly used]
A total of 1198 labeled pictures, divided into two parts, part_A and part_B; the crowds in part_B are more sparsely distributed than in part_A.
This dataset was first introduced in the MCNN paper: 300 images of part_A are used for training and 182 for testing; 400 images of part_B are used for training and 316 for testing.
2. WorldExpo'10 dataset
A total of 3980 labeled pictures, of which 3380 are used for training and the rest for testing.
The test set consists of 5 different scenes with 120 images each. An ROI is provided for each scene, so crowd counting is performed only within the ROI.
3. UCF_CC_50 dataset
A total of 50 pictures. This dataset is characterized by a small number of images but large variation in crowd counts.
4. Others
There are also the UCSDpeds pedestrian dataset captured by surveillance cameras and the mall_dataset collected in a shopping mall.

3. Environment configuration

There is not much to say about environment configuration; readers should set it up themselves. Configuring the environment is the first step of any experiment and good exercise in its own right. Development is undoubtedly in Python, version 3.7 or higher. I recommend installing Anaconda and learning to manage different projects through virtual environments. If pip or conda installs are slow, use domestic mirrors such as Tsinghua's or Alibaba's to speed them up. I chose PyTorch as the deep learning framework, for the reason explained in my earlier AlexNet article: PyTorch is more Pythonic. Finally, it is best to have a dedicated graphics card for deep learning training rather than running on your laptop.

3. ShanghaiTech dataset experiment

1. Paper code reproduction

Strictly speaking, readers would do well to read the original papers before this step, but it is not a big problem; the code is actually easier to understand than the paper. I got two papers' code running: [CAN] Context-Aware Crowd Counting and [CSR] CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. The two codebases are structurally almost identical. CSRNet was published earlier than CAN, and although the CAN paper presents experiments showing it surpasses earlier models, after running the experiments myself I am more optimistic about CSRNet, even though its training time is much longer than CAN's.

2. CAN reproduction

Code downloaded straight from GitHub will definitely throw errors when run, but only a few places need modifying before it runs normally.
For details, see the article: [CAN] Context-Aware Crowd Counting reproduction notes.
[Pitfall] That article mainly covers training on part_B of the ShanghaiTech dataset. Training on part_A may still raise errors, and the problem lies in batch_size: part_B images all share a uniform size, while part_A images vary. With inconsistent image sizes, batch_size can only be set to 1. On top of that, when training on part_A you may even hit an abnormal situation where the loss becomes NaN mid-training. The consensus online is that the learning rate is too high and causes a gradient explosion; just reduce the learning rate to about one tenth of the original.
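A minimal sketch of both fixes, with a dummy dataset and model standing in for the project's own classes:

# batch_size=1 for variable-size images + lowered learning rate (sketch only)
import torch
from torch.utils.data import DataLoader, Dataset

class VariableSizeImages(Dataset):  # stand-in for the project's dataset class
    def __init__(self, sizes):
        self.sizes = sizes
    def __len__(self):
        return len(self.sizes)
    def __getitem__(self, i):
        h, w = self.sizes[i]
        return torch.randn(3, h, w)  # dummy image tensor of varying size

# part_A images differ in size and cannot be stacked into one batch tensor:
loader = DataLoader(VariableSizeImages([(300, 400), (512, 680)]), batch_size=1)
for img in loader:
    print(img.shape)  # fine; batch_size > 1 would raise a collate error here

# If the loss becomes NaN (suspected gradient explosion), cut the learning
# rate to about a tenth of the original, e.g. 1e-5 -> 1e-6:
model = torch.nn.Conv2d(3, 1, 3)  # dummy model, just to build an optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.95)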

3. CSRNet reproduction

Modify the original code in the same way as for the CAN reproduction, and training should basically run without problems.
[Remember] After each training run, make an extra backup of the trained parameters, i.e. the pth file; otherwise it may be overwritten the next time you train. I learned this lesson the hard way!
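A few lines are enough to automate the backup; the paths here are examples, adjust them to whatever your training script saves:

# backup_weights.py - timestamped copy so the next run cannot overwrite it
import os
import shutil
import time

os.makedirs("backup", exist_ok=True)
src = "model_best.pth.tar"  # whatever your training script writes
dst = f"backup/model_best_{time.strftime('%Y%m%d_%H%M%S')}.pth.tar"
shutil.copy(src, dst)
print("backed up to", dst)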

4. Visual parameter adjustment

Generally speaking, results on part_A often fall well short of the papers'. In my training, CSRNet came very close to the result reported in its paper, but CAN was merely passable. So next we visualize CAN's training on part_A and manually tune the learning rate and other parameters from the curves (so far I have only adjusted CAN's learning rate).
I strongly recommend Baidu PaddlePaddle's VisualDL: just install it and, by adding a few lines to the code, you can conveniently view the visualized training curves in the browser. The official VisualDL usage guide is comprehensive; just follow along. I currently use only the simple scalar-chart function; readers can try the hyperparameter feature themselves, which is said to work well.
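The "few lines of code" look roughly like this (a sketch; the tag name and log directory are my own choices, and the dummy value stands in for your real epoch MAE):

# visualdl_log.py - log one scalar per epoch, then run `visualdl --logdir ./log`
from visualdl import LogWriter
import random

writer = LogWriter(logdir="./log/can_partA_lr1e-5")
for epoch in range(500):
    mae = 100 * random.random() / (epoch + 1)  # replace with the real epoch MAE
    writer.add_scalar(tag="train/mae", step=epoch, value=mae)
writer.close()
# open the address VisualDL prints (localhost:8040 by default) in the browser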
The curves from my training are as follows:
1. CAN training curve at learning rate 1e-5: [figure]
2. CAN training curve at learning rate 1e-6: [figure]
3. CAN training curve at learning rate 1e-7: [figure]
The smaller the loss value (i.e., MAE), the better the result; the curve at 1e-5 bottoms out lowest, so I chose a learning rate of 1e-5 for training.
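For reference, the MAE used here and the RMSE reported below are the standard counting metrics, where $y_i$ is the ground-truth count of image $i$, $\hat{y}_i$ is the predicted count (the integral of the predicted density map), and $N$ is the number of test images:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

MAE measures the accuracy of the estimates, while RMSE is more sensitive to occasional large errors and so reflects robustness.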

5. Performance evaluation of the reproduced code

[figure: MAE/RMSE of the reproduced models compared with the original papers]

Here [re] denotes the reproduced code. As the table shows, the reproduced CAN results differ considerably from the original paper's, and I am still looking for the reason. If the problem is my configuration, I may need to run the experiment a few more times or even on a different machine; otherwise the problem lies with the original paper, i.e. results reported from incomplete data, which would not be unheard of in academia. I hope it is the former. The code framework, after all, is similar to CSRNet's.
By contrast, the reproduced CSRNet code, trained on a 3060 GPU in the school laboratory, actually scored better than the original paper, which is remarkable.
To sum up, CSRNet is a crowd counting algorithm with excellent performance; its accuracy and stability stand out among the many algorithms. Four years after its publication in 2018, it is still widely referenced.

4. UCF_CC_50 dataset experiment

1. Dataset directory restructuring

The UCF_CC_50 dataset downloaded from the Internet usually has the pictures and the mat files together in one folder, whereas ShanghaiTech is split into a train set and a test set. To adapt our reproduced code to the UCF dataset, we need to restructure the UCF directory. Following the paper, the 50 pictures are divided into 5 groups of 10 for cross-validation: 4 groups for training and 1 group for testing, repeated 5 times with different splits, and the results averaged. My five groups are arranged as follows:

group    test set images
part1    1-10.jpg
part2    11-20.jpg
part3    21-30.jpg
part4    31-40.jpg
part5    41-50.jpg
Taking part1 as an example, the layout is consistent with ShanghaiTech's file directory: [figure]
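A sketch of the restructuring script I mean (assuming the common 1.jpg / 1_ann.mat file naming of the UCF_CC_50 distribution; adjust paths and names to your copy):

# ucf_split.py - split UCF_CC_50 into 5 ShanghaiTech-style folds
import os
import shutil

src = "UCF_CC_50"  # all 50 jpg and mat files in one folder
for part in range(1, 6):
    test_ids = set(range((part - 1) * 10 + 1, part * 10 + 1))  # part1 -> 1..10, etc.
    for split in ("train_data", "test_data"):
        for sub in ("images", "ground_truth"):
            os.makedirs(f"part{part}/{split}/{sub}", exist_ok=True)
    for i in range(1, 51):
        split = "test_data" if i in test_ids else "train_data"
        shutil.copy(f"{src}/{i}.jpg", f"part{part}/{split}/images/{i}.jpg")
        shutil.copy(f"{src}/{i}_ann.mat", f"part{part}/{split}/ground_truth/{i}_ann.mat")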
[Pitfall] The mat files in UCF_CC_50 and ShanghaiTech store the head coordinates under different dictionary keys; just change the key as shown below. [figure]
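In code, the difference amounts to this (the key names are the ones commonly found in these two datasets; verify them against your own mat files):

# reading head coordinates from the two datasets' mat files (sketch)
import scipy.io as io

# ShanghaiTech: coordinates are nested under the 'image_info' key
mat = io.loadmat("GT_IMG_1.mat")
points = mat["image_info"][0, 0][0, 0][0]  # (N, 2) array of head positions

# UCF_CC_50: coordinates are stored directly under 'annPoints'
mat = io.loadmat("1_ann.mat")
points = mat["annPoints"]                  # (N, 2) array of head positions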

2. Experimental results

Training takes no less time than on ShanghaiTech. My results are as follows for readers' reference:
1. CAN training results: [figure]
2. CSRNet training results: [figure]
The charts above show why I say CSRNet is better. Someone may claim a certain network module is useless, but a claim alone cannot prove it; experimental data has to speak. In academia it does happen that a whole model's superiority is demonstrated by the numbers achieved at one particular moment while the key factor of robustness is ignored, hence the rumor that papers in the AI field cannot be reproduced.

5. DIY dataset experiment

Getting the experiments to run only proves that you can configure an environment and execute a program by yourself. To understand crowd estimation algorithms further, you should build your own dataset and train on it.

1. Dataset collection

For this step, readers can decide according to their own situation: shoot crowd images with a phone camera or other equipment, or collect ready-made pictures from the Internet. I chose the former, photographing crowds on campus with my phone, as shown: [figure]

[Pitfall] Before annotating heads, pay attention to one important parameter: resolution. Ground truth must be generated for training, and the size of the h5 file is directly tied to the resolution. For example, my phone shoots at 3000x4000 = 12 megapixels, and the generated h5 file is close to 100 MB (a 3000x4000 float64 density map is 3000x4000x8 bytes, about 92 MB, which matches). During training a GeForce RTX 3060 immediately reports out of memory, whereas the h5 file for any ShanghaiTech image is only about 6 MB. So if your photos' resolution is too high, you must reduce it.

2. Reduce the image resolution

# resolution_changer.py
import glob
import os
from PIL import Image

img_path = glob.glob("dataset/train/images/*.jpg")
path_save = "dataset/train/images/"
for file in img_path:
    name = os.path.join(path_save, os.path.basename(file))  # overwrite in place
    print(name)
    im = Image.open(file)
    # Shrink proportionally to fit within 1080x720; the results will not all be
    # exactly that size, but they are guaranteed to be no larger.
    im.thumbnail((1080, 720))
    print(im.format, im.size, im.mode)
    im.save(name, 'JPEG')

This code came straight off the Internet; readers can change img_path and path_save themselves. Note that the two paths point to the same directory, except that img_path carries an extra *.jpg pattern, so the shrunken images overwrite the originals in place.

3. Dataset head annotation

[T-PAMI] NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting and Localization is a website worth a look; you can in fact find it in Awesome-Crowd-Counting. Next we use the annotation tool CCLabeler-master. If you cannot get around the firewall and GitHub is hard to reach, download it from the Lanzou cloud link at the end of this article. The project comes with complete instructions and is easy to use; nevertheless, let me say a few more words.
1. First put all the reduced-resolution pictures under cclabeler-master\data\images and number them. [figure]
2. Go to cclabeler-master\users\ and you will see test.json. Open it: the password field is the password used when logging in to the browser interface; the data field stores the names of the pictures to label (you can write a Python script to generate the list; see the sketch after these steps), without file extensions; done and half stay empty. [figure]
3. Open cmd, cd to the cclabeler-master directory, and execute python manage.py runserver 0.0.0.0:8000. Once the prompt appears, enter localhost:8000 in the browser and log in. [figure]
4. The interface looks as follows; see the project's HOWTO.pdf for the specific operations. [figure]
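The script mentioned in step 2 can be as simple as this sketch (the four fields are the ones described above; the password and image numbering are examples, so check HOWTO.pdf for your version of the tool):

# make_user_json.py - generate users/test.json for CCLabeler
import json

user = {
    "password": "123456",                             # browser login password
    "data": [str(i).zfill(3) for i in range(1, 51)],  # image names without suffix, e.g. "001"
    "done": [],                                       # leave empty before labeling
    "half": [],                                       # leave empty before labeling
}

with open("cclabeler-master/users/test.json", "w") as f:
    json.dump(user, f, indent=4)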

4. Dataset integration

After labeling is finished, the head position coordinates are stored under cclabeler-master\data\jsons\; the json files there are the equivalent of the mat files in public datasets. Following the structure of the ShanghaiTech dataset, organize everything into train and test, each with ground_truth and images subdirectories.
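For reference, the layout I describe looks like this (SCAU_50 is just my dataset's name; the json files are copied over from data\jsons\):

SCAU_50/
├── train/
│   ├── images/         # 001.jpg, 002.jpg, ...
│   └── ground_truth/   # 001.json, 002.json, ...
└── test/
    ├── images/
    └── ground_truth/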

5. DIY dataset training

Since the DIY annotations are json files rather than the mat files of traditional datasets, the corresponding spots in the code need slight adjustment. Only make_dataset.py needs changing; nothing else seems to. Once the ground-truth h5 files are generated, all subsequent steps are the same as before. [figure]
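The change boils down to reading the head points from the json instead of a mat file. A sketch of the relevant part of make_dataset.py (paths and the points/x/y field names follow the DIY annotations above; 'density' is the h5 key the training code reads):

# make_dataset.py (DIY variant, sketch): json annotations -> Gaussian GT -> h5
import json
import h5py
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

img_path = "SCAU_50/train/images/001.jpg"
json_path = img_path.replace(".jpg", ".json").replace("images", "ground_truth")

img = plt.imread(img_path)
k = np.zeros((img.shape[0], img.shape[1]))
with open(json_path) as f:
    for p in json.load(f)["points"]:         # instead of loading a mat file
        x, y = int(p["x"]), int(p["y"])
        if y < k.shape[0] and x < k.shape[1]:
            k[y, x] = 1

k = gaussian_filter(k, 15)
h5_path = img_path.replace(".jpg", ".h5").replace("images", "ground_truth")
with h5py.File(h5_path, "w") as hf:
    hf["density"] = k  # subsequent training reads this 'density' dataset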

6. Training results for reference only

CAN (learning rate 1e-5): MAE = 4.231, RMSE = 7.334
CSRNet (no changes needed): MAE = 5.774, RMSE = 7.546
My dataset has only 50 pictures, 40 for training and 10 for testing, trained for 500 epochs. It seems CAN beat CSRNet this time.

6. Follow-up work of crowd counting

Finishing network training is far from the end of the work: up to this point we have only trained the network weights and run the model, without digging into the details of the whole project. Next I introduce several tasks that will help readers writing a similar graduation thesis: generating the density maps used for training, and generating the density maps the network predicts.

1. Generate the density maps used for network training

Today's mainstream crowd counting algorithms use a convolutional neural network to learn the mapping from an image's low-level features to a density map, rather than the earlier direct mapping from image to headcount. The training target of the network is therefore a density map, and the network's output is also a density map. One advantage of a density map is that it can be integrated (summed) directly, which in Python code amounts to calling its sum() method.
The code to generate the density map fed into the network is as follows; it is best run in a Jupyter notebook.

# ipynb file
import h5py
import scipy.io as io
import PIL.Image as Image
import numpy as np
import os
import glob
from matplotlib import pyplot as plt
from scipy.ndimage.filters import gaussian_filter
import scipy
from scipy import spatial
import json
import cv2
from matplotlib import cm as CM
from image import *
from model import CSRNet  # CSRNet as the example; this file is best placed at the same level as the project
import torch
import random
# remove the following line if running as a plain .py file
%matplotlib inline

img_path = r'SCAU_50\train\images\35.jpg'  # change to the path of the image you want to display
# This is a DIY dataset, so a json file is read; with a public dataset, read the
# mat file instead. The idea is the same: load the head position coordinates.
json_path = img_path.replace('.jpg', '.json').replace('images', 'ground_truth')
with open(json_path, 'r') as f:
    mat = json.load(f)
arr = []
for item in mat['points']:  # the DIY annotations use 'points'; for ShanghaiTech see the corresponding part of the project source
    arr.append([item['x'], item['y']])
gt = np.array(arr)
img = plt.imread(img_path)
k = np.zeros((img.shape[0], img.shape[1]))  # zero matrix at the image's resolution
# img.shape gives height first, then width, so below gt[i][1] is compared
# against the height and gt[i][0] against the width
for i in range(0, len(gt)):
    if int(gt[i][1]) < img.shape[0] and int(gt[i][0]) < img.shape[1]:
        k[int(gt[i][1]), int(gt[i][0])] = 1  # set 1 at each head coordinate in the zero matrix

k = gaussian_filter(k, 15)  # Gaussian filtering; 15 is the sigma - the larger it is, the blurrier the map
plt.subplot(1, 2, 1)  # split the canvas into 1 row x 2 columns; current image at position 1
plt.imshow(img)
plt.subplot(1, 2, 2)  # current image at position 2
plt.imshow(k, cmap=CM.jet)
plt.show()

The resulting image is as follows: [figure]

2. Generate the density map of the network's prediction

When test.py is called for prediction, the neural network reads in any crowd picture and outputs the corresponding density map, which we then display.

# ipynb file
import sys
# Append the project path so it can be imported directly in Jupyter notebook;
# CANNet is the example here. Skip this step if you are not using Jupyter.
sys.path.append(r"E:\大四上\毕设\Context-Aware-Crowd-Counting-master")

import glob
import numpy as np
from image import *
from model import CANNet
import os
import torch
from torch.autograd import Variable
from sklearn.metrics import mean_squared_error, mean_absolute_error
from torchvision import transforms
from pprint import pprint

import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib
# The next two lines make matplotlib display Chinese; they can be skipped
matplotlib.rcParams['font.sans-serif'] = ['Fangsong']
matplotlib.rcParams['axes.unicode_minus'] = False
%matplotlib inline

transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                std=[0.229, 0.224, 0.225]),
])  # converter used for the RGB transform; the mean and std values are the
    # commonly accepted ImageNet statistics - copy them as-is, no need to dig deeper

model = CANNet()  # build the network model
model = model.cuda()

checkpoint = torch.load(r'E:\大四上\毕设\Context-Aware-Crowd-Counting-master\scau_model_best.pth.tar')  # load the trained weights
model.load_state_dict(checkpoint['state_dict'])
model.eval()  # switch to evaluation mode for prediction

# Any picture will do; this is copied from the CANNet project source and shown
# with my local paths for reference only
img = transform(Image.open('E:\\大四上\\dataset\\SCAU_50\\train\\images\\35.jpg').convert('RGB')).cuda()
img = img.unsqueeze(0)
h, w = img.shape[2:4]
h_d = h // 2
w_d = w // 2
# the input image is split into four quadrants
# (Variable is a no-op in modern PyTorch but kept from the original code)
img_1 = Variable(img[:, :, :h_d, :w_d].cuda())
img_2 = Variable(img[:, :, :h_d, w_d:].cuda())
img_3 = Variable(img[:, :, h_d:, :w_d].cuda())
img_4 = Variable(img[:, :, h_d:, w_d:].cuda())
density_1 = model(img_1).data.cpu().numpy()
density_2 = model(img_2).data.cpu().numpy()
density_3 = model(img_3).data.cpu().numpy()
density_4 = model(img_4).data.cpu().numpy()

# stitch the two upper patches together; the ellipsis (...) selects all remaining dimensions
up_map = np.concatenate((density_1[0, 0, ...], density_2[0, 0, ...]), axis=1)
down_map = np.concatenate((density_3[0, 0, ...], density_4[0, 0, ...]), axis=1)
# combine the upper and lower halves into the complete map
final_map = np.concatenate((up_map, down_map), axis=0)
plt.imshow(final_map, cmap=cm.jet)  # display the predicted density map
print(final_map.sum())  # the sum of the map is the predicted headcount

The effect is shown for reference: [figure: the original image, the input density map used for training, and the density map predicted by the network]

Resources at the end of the article

https://wwn.lanzoul.com/b030neddc
Password: regn
Because the trained network weights are too large as single files, they are not on the network disk. The link contains the runnable code for cclabeler, CSRNet and CAN, plus the original UCF_CC_50 dataset.
Considering that readers may lack the resources to train the models themselves, the trained weights for CSRNet and CANNet (the result of 500+ epochs of training on a 3060 graphics card) are available; if needed, contact me via QQ (the number is in my profile). They are not free; the price is around 20-30 RMB.

Summary

The road is long, but you are not walking it alone: many people have faced the same difficulties before you, and countless more will after. All we can do is act with a clear conscience and do our best. The project is not finished; I will keep updating this article, so stay tuned, and feel free to exchange ideas and learn together.
Finally, thanks to the authors and organizations behind the links cited in this article.
