[Multi-target tracking and counting] (3) Hands-on vehicle and pedestrian tracking and counting with DeepSORT

1. Introduction to DeepSORT

Paper address:

https://arxiv.org/pdf/1703.07402.pdf

Reference article:

Explanation of DeepSort

Code address:

https://github.com/mikel-brostrom/Yolov5_DeepSort_OSNet (you can refer to this repository; if you want my source code, send me a private message)

SORT vs. DeepSORT:

Although SORT is a very simple, effective, and practical multi-object tracking algorithm, it matches detections only by IoU. This makes it very fast, but it also produces a correspondingly large number of ID switches;

Building on SORT, DeepSORT integrates appearance information, which lets the model handle targets that are occluded for a long time and reduces ID switches by about 45%; the appearance information comes from a trained ReID model;

Question: why does the SORT algorithm produce so many ID switches?

Because the association metric it uses is only accurate when the uncertainty of the state prediction is small; in other words, its state estimation has inherent limitations.

DeepSORT improves this association cost: the original metric is replaced by a cost that combines motion information and appearance information, as sketched below;
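As a rough illustration (not the author's exact code), the DeepSORT paper combines a Mahalanobis (motion) distance and a cosine (appearance) distance into a single cost with a weight λ; a minimal NumPy sketch, where d_motion and d_appearance are assumed to be precomputed cost matrices of the same shape:

import numpy as np

def combined_cost(d_motion, d_appearance, lam=0.0):
    """Weighted combination from the DeepSORT paper:
    c[i, j] = lam * d_motion[i, j] + (1 - lam) * d_appearance[i, j].
    The paper reports that lam = 0 (appearance only) works well when there is
    camera motion, while the motion term is still used as a gate to rule out
    physically impossible matches.
    """
    return lam * np.asarray(d_motion) + (1.0 - lam) * np.asarray(d_appearance)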

2. Pedestrian ReID feature training

Pedestrian dataset introduction:

Market-1501 dataset: collected on the campus of Tsinghua University and released in 2015; it contains 1,501 pedestrian identities and 32,668 detected pedestrian bounding boxes;

It can be downloaded under this link: download address

Training steps:

1. Since the dataset is not already organized by person, the first step is to group the images: all images of the same person are stored in one folder, based on the identity encoded in each image's file name;

[Figure: folder structure after grouping images by person ID]

Explanation:

As can be seen from the figure above, folder 0002 contains all images of one and the same person; there are 751 such folders in total, i.e. the data is split into 751 classes; a minimal sketch of this grouping step is shown below;
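A minimal sketch of this preprocessing step, assuming the standard Market-1501 file-name convention in which the leading four digits are the person ID (e.g. 0002_c1s1_000451_03.jpg); the directory paths here are placeholders, not the author's actual script:

import os
import shutil

SRC_DIR = 'Market-1501/bounding_box_train'   # placeholder path to the raw training images
DST_DIR = 'Market-1501/train_by_id'          # placeholder output directory, one folder per person

os.makedirs(DST_DIR, exist_ok=True)
for fname in os.listdir(SRC_DIR):
    if not fname.endswith('.jpg'):
        continue
    person_id = fname.split('_')[0]          # e.g. '0002'; '-1' marks junk images in some splits and can be skipped
    if person_id == '-1':
        continue
    person_dir = os.path.join(DST_DIR, person_id)
    os.makedirs(person_dir, exist_ok=True)
    shutil.copy(os.path.join(SRC_DIR, fname), os.path.join(person_dir, fname))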

2. Define the network structure; its final output is a vector with one dimension per class, i.e. a 751-dimensional vector in this task;

Network structure:

The front of the network down-samples the input through several convolution layers with ReLU activations; the focus here is on the final classification layers;

[Figure: ReID network structure]

Finally, two fully connected layers map the extracted features to the 751-dimensional output vector;

In effect, the final output works like a classification task: given an input image, the network produces the image's 751-dimensional feature vector;
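Below is a minimal PyTorch sketch in the spirit of this description (a few Conv/ReLU stages followed by two fully connected layers ending in 751 outputs); it is an illustration, not the author's exact network definition:

import torch
import torch.nn as nn

class ReIDNet(nn.Module):
    def __init__(self, num_classes=751):
        super().__init__()
        # Convolutional backbone: down-samples the input while extracting features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Two fully connected layers; the last one has one output per identity (751 here).
        self.classifier = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        feats = self.backbone(x).flatten(1)   # (N, 128) feature vector per image
        return self.classifier(feats)         # (N, 751) class scores

# Example: a batch of two 128x64 pedestrian crops.
out = ReIDNet()(torch.randn(2, 3, 128, 64))
print(out.shape)   # torch.Size([2, 751])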

Training result:

Training for only 40 epochs already gives good results on this dataset:

[Figure: training results after 40 epochs]

3. Utility class code explanation

First understand the code structure of the entire project:

[Figure: project code structure]

Let's take a look at the entire implementation flow chart:

[Figure: overall implementation flow chart]

The important utility classes are explained below, one by one:

nn_matching.py

Role: for each target, returns a nearest-neighbor distance metric, i.e. the smallest distance to any sample observed so far for that target.

1. Euclidean distance calculation

"""
a :NxM 矩阵,代表 N 个样本,每个样本 M 个数值 
b :LxM 矩阵,代表 L 个样本,每个样本有 M 个数值 
返回的是 NxL 的矩阵,比如 dist[i][j] 代表 a[i] 和 b[j] 之间的平方和距离
"""
def _pdist(a, b):
    a, b = np.asarray(a), np.asarray(b)
    if len(a) == 0 or len(b) == 0:
        return np.zeros((len(a), len(b)))
    a2, b2 = np.square(a).sum(axis=1), np.square(b).sum(axis=1)
    r2 = -2. * np.dot(a, b.T) + a2[:, None] + b2[None, :]
    r2 = np.clip(r2, 0., float(np.inf))			# 将矩阵小于0的值都变为0
    return r2

The code above is just the expansion of the squared-distance formula ||a - b||² = ||a||² + ||b||² - 2·a·b, computed for every pair of rows at once. For a detailed derivation, see the following blog post:

https://blog.csdn.net/frankzd/article/details/80251042

Extension:

Finding the Euclidean distance to the nearest neighbor:

def _nn_euclidean_distance(a, b):
    distances = _pdist(a, b)
    return np.maximum(0.0, distances.min(axis=0))    # minimum over the first axis: for each b[j], the distance to its nearest sample in a
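A quick toy check of the two functions above (hypothetical 2-D points, not from the project):

a = np.array([[0., 0.], [1., 1.]])
b = np.array([[0., 1.], [3., 4.]])
print(_pdist(a, b))                    # [[ 1. 25.]
                                       #  [ 1. 13.]]
print(_nn_euclidean_distance(a, b))    # [ 1. 13.]: squared distance from each b[j] to its nearest sample in a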

2. Cosine distance calculation

"""
a :NxM 矩阵,代表 N 个样本,每个样本 M 个数值 
b :LxM 矩阵,代表 L 个样本,每个样本有 M 个数值 
返回的是 NxL 的矩阵,比如 c[i][j] 代表 a[i] 和 b[j] 之间的余弦距离
"""


# np.linalg.norm 求向量的范式,默认是 L2 范式 
a = np.asarray(a) / np.linalg.norm(a, axis=1, keepdims=True)
b = np.asarray(b) / np.linalg.norm(b, axis=1, keepdims=True)
return 1. - np.dot(a, b.T) # 余弦距离 = 1 - 余弦相似度

Reference blog post: https://blog.csdn.net/u013749540/article/details/51813922

Note: the nearest-neighbor lookup works exactly as in the Euclidean case: take the minimum of the distance matrix along the first axis, as in the sketch below;
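A minimal sketch of that cosine nearest-neighbor function, mirroring the Euclidean version above and following the structure of the standard deep_sort nn_matching.py:

def _nn_cosine_distance(a, b):
    distances = _cosine_distance(a, b)
    return distances.min(axis=0)    # for each b[j], the cosine distance to its nearest sample in a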

3. Calculation of the cost matrix

"""
计算features和targets之间的距离,返回一个成本矩阵(代价矩阵)
"""
cost_matrix = np.zeros((len(targets), len(features)))
for i, target in enumerate(targets):
	cost_matrix[i, :] = self._metric(self.samples[target], features)	# 默认采用余弦距离
return cost_matrix

Personal understanding:

I think nn_matching implements the main addition that DeepSORT brings over SORT: once appearance features are introduced, the feature vector of each target is obtained from the ReID network, the features of targets in consecutive frames are compared using the cosine nearest-neighbor distance explained above, and the result is a cost matrix;
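To make this concrete, a toy example (hypothetical feature arrays, not from the project) that builds the cost matrix between two tracks' stored features and three new detection features, using the functions defined above:

import numpy as np

np.random.seed(0)
samples = {0: np.random.rand(5, 128), 1: np.random.rand(5, 128)}    # per-track galleries of ReID features
detections = np.random.rand(3, 128)                                 # features of the current detections

cost_matrix = np.zeros((len(samples), len(detections)))
for i, track_id in enumerate(samples):
    cost_matrix[i, :] = _nn_cosine_distance(samples[track_id], detections)
print(cost_matrix.shape)    # (2, 3): rows are tracks, columns are detections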

linear_assignment.py

Function: matches tracks to detections using the cost matrix and the Hungarian algorithm; cascade matching is built on top of this;

# Compute the cost matrix
cost_matrix = distance_metric(
    tracks, detections, track_indices, detection_indices)
cost_matrix[cost_matrix > max_distance] = max_distance + 1e-5

# Run the Hungarian algorithm to get the matched index pairs:
# row indices refer to tracks, column indices refer to detections
"""
Official documentation:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html
"""
row_indices, col_indices = linear_assignment(cost_matrix)

matches, unmatched_tracks, unmatched_detections = [], [], []
# Find the unmatched detections
for col, detection_idx in enumerate(detection_indices):
    if col not in col_indices:
        unmatched_detections.append(detection_idx)
# Find the unmatched tracks
for row, track_idx in enumerate(track_indices):
    if row not in row_indices:
        unmatched_tracks.append(track_idx)
# Iterate over the matched (track, detection) index pairs
for row, col in zip(row_indices, col_indices):
    track_idx = track_indices[row]
    detection_idx = detection_indices[col]
    # If the corresponding cost is above the threshold max_distance, also treat the pair as unmatched
    if cost_matrix[row, col] > max_distance:
        unmatched_tracks.append(track_idx)
        unmatched_detections.append(detection_idx)
    else:
        matches.append((track_idx, detection_idx))
return matches, unmatched_tracks, unmatched_detections
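As a standalone illustration of the assignment step (a toy example, not project code), scipy.optimize.linear_sum_assignment finds the minimum-cost pairing of rows (tracks) to columns (detections):

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.6]])          # 2 tracks x 3 detections
rows, cols = linear_sum_assignment(cost)    # minimize the total assignment cost
print(rows, cols)                           # [0 1] [0 1]: track 0 -> detection 0, track 1 -> detection 1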

4. Main code explanation

First look at the detection module, which uses the YOLOv5 network; its accuracy directly affects the quality of the whole task;

objdetector.py

Function: implements object detection, i.e. detects the targets in each video frame;

import numpy as np
import torch
import objtracker
# YOLOv5 helper imports (the exact module paths can vary between YOLOv5 versions):
from models.experimental import attempt_load
from utils.datasets import letterbox
from utils.general import non_max_suppression, scale_coords
from utils.torch_utils import select_device

# Target classes to detect (remove a class here if you do not want it detected)
OBJ_LIST = ['person', 'car', 'bus', 'truck']
# YOLOv5 model weights; other model variants can also be used here
DETECTOR_PATH = 'weights/yolov5m.pt'

# First define a base class with the initialization and the method prototypes
class baseDet(object):
    def __init__(self):
        self.img_size = 640		# input size after resizing
        self.threshold = 0.3	# confidence threshold
        self.stride = 1

    def build_config(self):
        self.frameCounter = 0

    def feedCap(self, im, func_status):
        # This dictionary is the final return value, i.e. the model outputs packed into a dict
        retDict = {
            'frame': None,
            'list_of_ids': None,
            'obj_bboxes': []
        }
        self.frameCounter += 1
        # objtracker is called here; it runs the ReID model to extract features and then performs matching
        im, obj_bboxes = objtracker.update(self, im)
        retDict['frame'] = im
        retDict['obj_bboxes'] = obj_bboxes

        return retDict

    def init_model(self):
        raise EOFError("Undefined model type.")

    def preprocess(self):
        raise EOFError("Undefined model type.")

    def detect(self):
        raise EOFError("Undefined model type.")

# A wrapper around the YOLOv5 detector to make it easier to use
class Detector(baseDet):
    def __init__(self):
        super(Detector, self).__init__()
        self.init_model()
        self.build_config()

    # Load the model
    def init_model(self):
        self.weights = DETECTOR_PATH
        self.device = '0' if torch.cuda.is_available() else 'cpu'
        self.device = select_device(self.device)
        model = attempt_load(self.weights, map_location=self.device)
        model.to(self.device).eval()
        model.half()    # half precision (FP16)
        self.m = model
        self.names = model.module.names if hasattr(
            model, 'module') else model.names

    # Preprocess the incoming video frame
    def preprocess(self, img):
        img0 = img.copy()
        img = letterbox(img, new_shape=self.img_size)[0]
        img = img[:, :, ::-1].transpose(2, 0, 1)    # BGR to RGB, HWC to CHW
        img = np.ascontiguousarray(img)
        img = torch.from_numpy(img).to(self.device)
        img = img.half()  # half precision
        img /= 255.0  # normalize pixel values to [0, 1]
        if img.ndimension() == 3:
            img = img.unsqueeze(0)
        return img0, img		# img0 is the original image, img is the preprocessed one

    def detect(self, im):
        im0, img = self.preprocess(im)
        pred = self.m(img, augment=False)[0]		# run the detector on the image to get raw predictions
        pred = pred.float()
        pred = non_max_suppression(pred, self.threshold, 0.4)	# non-maximum suppression
        pred_boxes = []
        for det in pred:
            if det is not None and len(det):
                det[:, :4] = scale_coords(
                    img.shape[2:], det[:, :4], im0.shape).round()
                for *x, conf, cls_id in det:
                    lbl = self.names[int(cls_id)]
                    if not lbl in OBJ_LIST:			# skip classes that are not in the list we want to detect
                        continue
                    x1, y1 = int(x[0]), int(x[1])
                    x2, y2 = int(x[2]), int(x[3])
                    pred_boxes.append(
                        (x1, y1, x2, y2, lbl, conf))
        return im, pred_boxes							# return the original image and the detected boxes
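A minimal usage sketch of the detector (the image path is a placeholder; the class loads the model in half precision, so a CUDA-capable GPU is assumed):

import cv2

det = Detector()                    # loads weights/yolov5m.pt and builds the config
frame = cv2.imread('test.jpg')      # placeholder image path
_, boxes = det.detect(frame)        # boxes: list of (x1, y1, x2, y2, label, confidence)
for x1, y1, x2, y2, lbl, conf in boxes:
    print(lbl, float(conf), (x1, y1, x2, y2))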

objtracker.py

Role: the tracker class, which tracks the detected targets;

cfg = get_config()
cfg.merge_from_file("deep_sort/configs/deep_sort.yaml")
# First instantiate a DeepSort object, which wraps the utility classes described above
deepsort = DeepSort(cfg.DEEPSORT.REID_CKPT,
                    max_dist=cfg.DEEPSORT.MAX_DIST, min_confidence=cfg.DEEPSORT.MIN_CONFIDENCE,
                    nms_max_overlap=cfg.DEEPSORT.NMS_MAX_OVERLAP, max_iou_distance=cfg.DEEPSORT.MAX_IOU_DISTANCE,
                    max_age=cfg.DEEPSORT.MAX_AGE, n_init=cfg.DEEPSORT.N_INIT, nn_budget=cfg.DEEPSORT.NN_BUDGET,
                    use_cuda=True)

# The key function is update, which updates the tracked boxes for the current frame
def update(target_detector, image):
    # Use the detector defined earlier to get the detection boxes
    _, bboxes = target_detector.detect(image)
    bbox_xywh = []
    confs = []
    bboxes2draw = []
    if len(bboxes):
        # Adapt detections to the DeepSORT input format: (center_x, center_y, width, height)
        for x1, y1, x2, y2, _, conf in bboxes:
            obj = [
                int((x1+x2)/2), int((y1+y2)/2),
                x2-x1, y2-y1
            ]
            bbox_xywh.append(obj)
            confs.append(conf)
        xywhs = torch.Tensor(bbox_xywh)
        confss = torch.Tensor(confs)

        # Pass detections to DeepSORT; this yields the final boxes and track IDs for this frame
        outputs = deepsort.update(xywhs, confss, image)
        for value in list(outputs):
            x1, y1, x2, y2, track_id = value
            bboxes2draw.append(
                (x1, y1, x2, y2, '', track_id)
            )
    # Draw the boxes and track IDs onto the image
    image = plot_bboxes(image, bboxes2draw)
    return image, bboxes2draw

demo.py

Function: runs target tracking on an input video file and saves the result as a new video file; in essence it calls the classes wrapped above, collects the information we want, and visualizes it;

import cv2
import imutils
from objdetector import Detector

VIDEO_PATH = './video/test_person.mp4'		# input video file
RESULT_PATH = 'result.mp4'					# output video file

def main():

    func_status = {}
    func_status['headpose'] = None

    name = 'demo'

    det = Detector()
    cap = cv2.VideoCapture(VIDEO_PATH)
    fps = int(cap.get(5))
    print('fps:', fps)			# frame rate (property index 5 is CAP_PROP_FPS)
    t = int(1000/fps)			# time interval between frames in milliseconds

    size = None
    videoWriter = None

    while True:

        # try:
        _, im = cap.read()	# read the video frame by frame
        if im is None:
            break

        result = det.feedCap(im, func_status)	# im is the current frame fed into detection and tracking
        result = result['frame']
        result = imutils.resize(result, height=500)
        # The code below saves the annotated frames into a video
        if videoWriter is None:
            fourcc = cv2.VideoWriter_fourcc(
                'm', 'p', '4', 'v')  # opencv3.0
            videoWriter = cv2.VideoWriter(
                RESULT_PATH, fourcc, fps, (result.shape[1], result.shape[0]))

        videoWriter.write(result)
        cv2.imshow(name, result)
        cv2.waitKey(t)

        if cv2.getWindowProperty(name, cv2.WND_PROP_AUTOSIZE) < 1:
            # clicking the window's close button exits the loop
            break

    cap.release()
    videoWriter.release()
    cv2.destroyAllWindows()

Show results:
[Image: tracking result on the test video]

count_person.py

Function: implements counting of the detected and tracked targets;

This part of the code mainly implements collision detection with a user-defined line, counting the number and IDs of the targets that cross it. The code itself is not explained here; a simplified sketch of the idea follows;
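A simplified sketch of the line-crossing idea (a hypothetical helper, not the project's count_person.py), keyed to the (x1, y1, x2, y2, label, track_id) tuples returned by objtracker.update: keep the previous center of each track ID and increment a counter when the center moves from one side of a horizontal counting line to the other:

LINE_Y = 500                 # y coordinate of the counting line (placeholder value)
prev_centers = {}            # track_id -> previous center y
counted_ids = set()
count = 0

def update_count(tracks):
    """tracks: list of (x1, y1, x2, y2, label, track_id) for the current frame."""
    global count
    for x1, y1, x2, y2, _, track_id in tracks:
        cy = (y1 + y2) // 2
        prev = prev_centers.get(track_id)
        if prev is not None and track_id not in counted_ids:
            # The target crossed the line between the previous frame and this one.
            if (prev - LINE_Y) * (cy - LINE_Y) < 0:
                counted_ids.add(track_id)
                count += 1
        prev_centers[track_id] = cy
    return count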

Pedestrian counting result:

[Image: pedestrian counting result]

Vehicle counting works in the same way; the result is shown below:

[Image: vehicle counting result]

5. Summary

Key knowledge points:

  • Understand the whole workflow of DeepSORT;
  • Gain a preliminary understanding of object detection and object re-identification;
  • Integrate multiple models into a single task;
  • Compute distance metrics and the cost matrix;
  • Encapsulate code into classes;

Points to study further:

  • The principle and implementation of the Kalman filter;
  • How to improve the runtime efficiency of the task;
  • Deploying the task as a dynamic library callable from C++;
  • Making the pipeline and models lighter to optimize performance;


Original article: blog.csdn.net/weixin_40620310/article/details/124501917