[Open Source] Motion Counting APP Development Based on Pose Estimation (2)

1. First, a look at the current results

I picked a sit-up tutorial video from Keep to test the counting:

(CSDN cannot play the video; if you are interested, please leave a message in the comment area below.)

2. Review

In the previous issue ([Open Source] Motion Counting APP Development Based on Pose Estimation (1)), a lightweight ShuffleNet backbone plus upsampling layers was used to output heatmaps for the keypoints. Trained on the COCO dataset, it could already identify keypoints. But there was a problem: for sit-ups, the recognition accuracy was very low. Analyzing the causes, there are two main factors. One is that the people in the open-source datasets are mostly in everyday poses, and poses like sit-ups are very rare. The other is that the network itself is small and its feature extraction capacity is limited; to achieve real-time detection on mobile, the input resolution is capped at 224*224, which also limits accuracy. This issue mainly carries out optimizations around these problems and puts together an initial demo that can count in real time.
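For reference, the network from the previous issue has roughly the following shape: a ShuffleNetV2 backbone with a few deconvolution layers on top, producing one heatmap per keypoint. This is only a sketch; the layer sizes are illustrative, and the torchvision backbone (weights API of torchvision >= 0.13) stands in for the actual mobile implementation.

import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

class KeypointNet(nn.Module):
    def __init__(self, num_keypoints=17):
        super().__init__()
        backbone = shufflenet_v2_x1_0(weights=None)
        # everything up to the classifier: 3x224x224 -> 1024x7x7
        self.features = nn.Sequential(
            backbone.conv1, backbone.maxpool, backbone.stage2,
            backbone.stage3, backbone.stage4, backbone.conv5)

        def up_block(cin, cout):
            # stride-2 deconv doubles the spatial resolution
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.up = nn.Sequential(up_block(1024, 256),
                                up_block(256, 128),
                                up_block(128, 64))   # 7x7 -> 56x56
        self.head = nn.Conv2d(64, num_keypoints, 1)  # one heatmap per keypoint

    def forward(self, x):
        return self.head(self.up(self.features(x)))

# KeypointNet()(torch.randn(1, 3, 224, 224)).shape -> (1, 17, 56, 56)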

3. Rethinking the data

The COCO and MPII datasets analyzed before are not well suited to training an APP such as a sit-up counter. Moreover, this task does not require identifying that many keypoints; two are enough: one on the head and one on the knee. This does not affect the final counting function, and it lightens the network's load, letting it concentrate on just these two keypoints and thereby improving accuracy, as shown in the figure below.
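With only two keypoints, the training targets reduce to two Gaussian heatmaps, one per keypoint. A minimal sketch of rendering such targets (the output size and sigma are illustrative choices):

import numpy as np

def make_target_heatmaps(labels, out_size=56, sigma=2.0):
    # labels: (K, 2) keypoints as normalized (x, y) in [0, 1];
    # here K = 2 (head, knee)
    k = labels.shape[0]
    heatmaps = np.zeros((k, out_size, out_size), dtype=np.float32)
    grid = np.arange(out_size, dtype=np.float32)
    xs, ys = np.meshgrid(grid, grid)
    for i, (x, y) in enumerate(labels):
        cx, cy = x * out_size, y * out_size
        # unnormalized Gaussian peaked at the keypoint location
        heatmaps[i] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heatmaps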

Since there is no ready-made sit-up dataset, I had to roll up my sleeves and build one myself. Fortunately, for a routine exercise like sit-ups there are plenty of related resources online. I collected data in two ways: downloading videos and downloading pictures. First I searched the Internet for sit-up videos, downloaded about 10 clips, and extracted a portion of the frames from each video as training data. The keyframes extracted from the videos are shown in the figure below.
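The frame extraction itself is only a few lines with OpenCV; roughly something like this (the sampling interval and file naming are arbitrary choices):

import os
import cv2

def extract_frames(video_path, out_dir, every_n=10):
    # save every n-th frame of the video as a training image
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, '%05d.jpg' % saved), frame)
            saved += 1
        idx += 1
    cap.release()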

Using only frames extracted from videos has a rather serious problem: the backgrounds are too uniform, which easily causes overfitting. So I also searched the Internet for pictures with richer backgrounds, as shown in the figure below:

After collecting the data, it is time to label it. For convenience, I developed a keypoint labeling tool myself; since it is self-made, it works exactly the way I like: the left mouse button adds a mark, and the right mouse button undoes the previous one. I have to say that building small UI tools with Python + Qt is very convenient; compared with C++, it liberates a great deal of productivity!
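A stripped-down sketch of the idea behind such a tool in PyQt5 (not the full tool; the image path is a placeholder, and saving the points is omitted):

import sys
from PyQt5.QtCore import Qt
from PyQt5.QtGui import QPainter, QPen, QPixmap
from PyQt5.QtWidgets import QApplication, QLabel

class AnnotLabel(QLabel):
    def __init__(self, image_path):
        super().__init__()
        self.setPixmap(QPixmap(image_path))
        self.points = []  # clicked (x, y) positions in widget coordinates

    def mousePressEvent(self, event):
        if event.button() == Qt.LeftButton:
            self.points.append((event.x(), event.y()))  # add a mark
        elif event.button() == Qt.RightButton and self.points:
            self.points.pop()                           # undo the last mark
        self.update()

    def paintEvent(self, event):
        super().paintEvent(event)  # draw the image first
        painter = QPainter(self)
        painter.setPen(QPen(Qt.red, 6))
        for x, y in self.points:
            painter.drawPoint(x, y)

if __name__ == '__main__':
    app = QApplication(sys.argv)
    w = AnnotLabel('example.jpg')  # placeholder image path
    w.show()
    sys.exit(app.exec_())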

4. Solving overfitting

After training on the roughly 1K images collected above, you will find that generalization is poor and misdetections are frequent. It is not hard to see why: with so little data, the network has simply overfitted, and indeed the loss at the end of training is very small. The best cure for overfitting is more data, but time was limited and I really did not want to spend my youth collecting and labeling more. So I turned to data augmentation. I had already been using some photometric augmentations, such as randomly changing the brightness and randomly jittering HSV. Here I mainly add geometric transforms, which are slightly more troublesome because the label values have to be modified as well. I mainly consider crop, padding, and flip.
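For the lighting side, the augmentation is roughly of the following shape (the class name and jitter ranges here are illustrative, not the exact values used):

import random
import cv2
import numpy as np

class KPRandomHSV(object):
    # photometric jitter: labels are untouched because the keypoints do not move
    def __init__(self, hue=10, sat=0.3, val=0.3):
        self.hue, self.sat, self.val = hue, sat, val

    def __call__(self, image, labels=None):
        if random.randint(0, 1):
            hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
            hsv[..., 0] = (hsv[..., 0] + random.uniform(-self.hue, self.hue)) % 180
            hsv[..., 1] *= 1 + random.uniform(-self.sat, self.sat)
            hsv[..., 2] *= 1 + random.uniform(-self.val, self.val)
            image = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8),
                                 cv2.COLOR_HSV2BGR)
        return image, labels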

With the augmentations above, the results improve markedly and the overfitting is much less severe, but some false recalls remain: a schoolbag or a piece of clothing occasionally gets detected as a keypoint. The main reason is that the backgrounds in the training data are still not rich enough. Here the mixup method helps: select some person-free pictures from the COCO dataset as backgrounds and randomly blend them with the training pictures, which goes a long way toward solving this problem.

In the end, with all these augmentations, the results are quite good. The relevant code follows; crop and padding are implemented together.

import os
import random

import cv2
import numpy as np


class KPRandomPadCrop(object):
    """Randomly pad or crop the image borders; labels are (K, 2)
    keypoints normalized to [0, 1] and are remapped to match."""
    def __init__(self, ratio=0.25, pad_value=[128, 128, 128]):
        assert (ratio > 0 and ratio <= 1)
        self.ratio = ratio
        self.pad_value = pad_value

    def __call__(self, image, labels=None):
        # apply with probability 0.5
        if random.randint(0, 1):
            h, w = image.shape[:2]
            top_offset = int(h * random.uniform(0, self.ratio))
            bottom_offset = int(h * random.uniform(0, self.ratio))
            left_offset = int(w * random.uniform(0, self.ratio))
            right_offset = int(w * random.uniform(0, self.ratio))
            # pad
            if random.randint(0,1):
                image = cv2.copyMakeBorder(image, top_offset, bottom_offset, left_offset, right_offset, cv2.BORDER_CONSTANT, value=self.pad_value)
                if labels is not None and len(labels) > 0:
                    labels[:, 0] = (labels[:, 0] * w + left_offset) / (w + left_offset + right_offset)
                    labels[:, 1] = (labels[:, 1] * h + top_offset) / (h + top_offset + bottom_offset)
            # crop
            else:
                image = image[top_offset:h - bottom_offset, left_offset:w-right_offset]
                if labels is not None and len(labels) > 0:
                    labels[:, 0] = (labels[:, 0] * w - left_offset) / (w - left_offset - right_offset)
                    labels[:, 1] = (labels[:, 1] * h - top_offset) / (h - top_offset - bottom_offset)
        return image, labels
                
class KPRandomHorizontalFlip(object):
    """Mirror the image left-right and flip the normalized x labels."""
    def __call__(self, image, labels=None):
        if random.randint(0, 1):
            image = cv2.flip(image, 1)
            if labels is not None and len(labels) > 0:
                labels[:, 0] = 1.0 - labels[:, 0]
        return image, labels
        
  
class KPRandomNegMixUp(object):
    """Mixup with a negative (person-free) background: blend a random
    COCO background into the training image; labels are unchanged."""
    def __init__(self, ratio=0.5, neg_dir='./coco_neg'):
        self.ratio = ratio
        self.neg_dir = neg_dir
        self.neg_images = [f for f in os.listdir(neg_dir)
                           if f.endswith('.jpg') or f.endswith('.png')]

    def __call__(self, image, labels):
        if random.randint(0, 1):
            h, w = image.shape[:2]
            neg_name = random.choice(self.neg_images)
            neg_img = cv2.imread(os.path.join(self.neg_dir, neg_name))
            neg_img = cv2.resize(neg_img, (w, h)).astype(np.float32)
            neg_alpha = random.uniform(0, self.ratio)
            ori_alpha = 1 - neg_alpha
            # both inputs to addWeighted must share a dtype
            img_add = cv2.addWeighted(image.astype(np.float32), ori_alpha,
                                      neg_img, neg_alpha, 0)
            return img_add.astype(image.dtype), labels
        return image, labels
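A typical way to chain these transforms in the data pipeline is roughly the following (the order is a matter of choice, and './coco_neg' must already contain the background pictures):

pad_crop = KPRandomPadCrop()
flip = KPRandomHorizontalFlip()
mixup = KPRandomNegMixUp(neg_dir='./coco_neg')

def augment(image, labels):
    # geometric transforms first (they move the labels), then blending
    image, labels = pad_crop(image, labels)
    image, labels = flip(image, labels)
    image, labels = mixup(image, labels)
    return image, labels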

5. Online hard example mining

With the augmentations above, the trained model is reasonably capable, but testing on a large number of images shows that the network detects the knee and the head with different reliability: knee detection is quite stable, while head detection often goes wrong. The likely reason is that the head varies far more than the knee: it may face the camera, face away from it, or be occluded by an arm, so it is hard for the network to learn reliable head features. There are two ways to improve this. One is simply to give the head a larger loss weight, so that its gradient is larger than the knee's, forcing the network to pay more attention to the head. The other is online hard keypoint mining, which I came across in Megvii's CPN human pose estimation network (the authors have a video introducing it). It is actually very simple to implement: compute the loss for each keypoint separately, sort the losses, and backpropagate only a certain proportion of them. The ratio the authors use is 0.5, i.e., only half of the keypoint losses contribute to the gradient.
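A minimal PyTorch sketch of this mining step (my own rendering of the idea, not CPN's exact code). Note that with only two keypoints, a ratio of 0.5 keeps just the single harder keypoint:

import torch

def ohkm_mse_loss(pred, target, topk_ratio=0.5):
    # pred, target: (N, K, H, W) predicted and ground-truth heatmaps
    n, k = pred.shape[:2]
    # per-keypoint MSE, averaged over each heatmap
    loss_per_kp = ((pred - target) ** 2).mean(dim=(2, 3))  # (N, K)
    # keep only the hardest keypoints, i.e. the largest losses
    topk = max(1, int(k * topk_ratio))
    hard_losses, _ = loss_per_kp.topk(topk, dim=1)
    return hard_losses.mean()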

6. Summary

This stage mainly focused on improving the network's accuracy. By reducing the number of keypoints, re-collecting and labeling data, adding padding, crop, and flip augmentations, and introducing mixup and online hard example mining, the network's generalization improved step by step, and a Python demo was implemented (see the beginning of the article). The next issue will mainly turn the demo into an APP.
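For reference, the counting itself can be as simple as tracking the head-to-knee distance with two thresholds for hysteresis (the threshold values below are illustrative, not the demo's actual numbers):

def make_counter(up_thresh=0.25, down_thresh=0.45):
    state = {'phase': 'down', 'count': 0}
    def update(head, knee):
        # head, knee: normalized (x, y); the distance shrinks when sitting up
        d = ((head[0] - knee[0]) ** 2 + (head[1] - knee[1]) ** 2) ** 0.5
        if state['phase'] == 'down' and d < up_thresh:
            state['phase'] = 'up'
        elif state['phase'] == 'up' and d > down_thresh:
            state['phase'] = 'down'
            state['count'] += 1  # one full sit-up completed
        return state['count']
    return update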
