Workshop dangerous behavior detection system based on improved YOLOv7 and OpenCV (source code & tutorial)

1. Research Background

Under the trend of industrial intelligent upgrading, more and more enterprises are trying to build smart factories through technologies such as robotics and artificial intelligence. In a smart factory, workers move over a wide area and the environment is complex, and behaviors that may cause accidents, such as using open flames, smoking, or making phone calls, are strictly forbidden in the workshop. Monitoring personnel behavior by manually reviewing video is time-consuming, labor-intensive, and prone to omissions. It is therefore urgent to study workshop personnel behavior recognition methods for smart factories, so as to realize intelligent safety control.

2. Picture demonstration

2.png

3.png

4.png

3. Video demonstration

Workshop dangerous behavior detection system based on improved YOLOv7 and OpenCV (source code & tutorial), on bilibili

4. Research status at home and abroad

Early research on behavior recognition identified the outline and motion direction of the moving human body as feature information, mainly through spatio-temporal interest point methods and motion trajectory methods, of which the Improved Dense Trajectories (IDT) algorithm is the most classic model. With the development of 3D human data acquisition technology, behavior recognition can be roughly divided into two categories.

One is behavior recognition based on skeleton key points, which describes human behavior through the changes of key points between video frames. Yan et al. constructed spatio-temporal graphs from skeleton sequences and proposed the skeleton-based Spatial Temporal Graph Convolutional Networks (ST-GCN) for action recognition. The Spatial-Temporal Transformer Network (ST-TR) proposed by Plizzari et al. addressed the shortcoming that ST-GCN can only capture local features, introducing a Spatial Self-Attention (SSA) module and a Temporal Self-Attention (TSA) module to capture features.

The other category is RGB-based deep learning, which can be divided into video-based and image-based behavior recognition according to the input to be processed. Video-based behavior recognition takes a trimmed video segment as input and outputs a video-level behavior category; mainstream methods include TSN (Temporal Segment Networks), TSM (Temporal Shift Module), SlowFast, and TimeSformer (Time-Space Transformer). TSN belongs to the two-stream family, which splits feature extraction into two paths: one branch extracts spatial features from RGB video frames, and the other extracts optical-flow features along the temporal dimension. SlowFast, proposed by Feichtenhofer et al., is similar to the two-stream approach: the Slow branch learns spatial semantic information and the Fast branch learns motion information. TimeSformer, proposed by Facebook, is a convolution-free video classification method built on the Transformer self-attention mechanism; it extracts spatio-temporal sequences from video frames and applies temporal and spatial attention separately for learning. Image-based behavior recognition methods are divided into behavior classification represented by ResNet and behavior detection represented by YOLO (You Only Look Once). ResNet outputs image-level classification results, while YOLO localizes and classifies every target in the input video frame and is trained and run end to end.

Workshop behavior is mostly interaction between people and objects. Skeleton-keypoint-based behavior recognition only takes the coordinates of key points as input, discarding key objects and semantic information, so it is difficult to distinguish similar actions (such as making a phone call versus touching one's ear). Among the RGB-based deep learning methods, most video-based methods have strict requirements on the input data: a large amount of data is needed to train the model adequately, the demand on computing equipment is high, and inference is slow. Image-based behavior recognition methods, in contrast, use a lighter network structure, offer faster inference and smaller models, and are easier to deploy. Both types of behavior recognition methods above take visible-light video as input. Although visible-light images contain clear and rich texture details, targets are hard to observe in some dark or occluded areas of the workshop and are easily missed.

By contrast, an infrared image can separate the target from the background according to differences in thermal radiation, so the target features are more prominent. Commonly used surveillance video includes both infrared and visible light. Methods that use only infrared imagery, however, suffer from low resolution and missing detail. Therefore, we consider fusing visible-light and infrared images to compensate for the limitations of a single sensor and improve recognition accuracy. Fusion methods include pixel-level fusion, feature-level fusion, and decision-level fusion. Pixel-level and feature-level fusion place higher demands on computing power and time than decision-level fusion, and decision-level fusion can absorb the complementary information of visible light and infrared to reach a global optimum.

In order to effectively regulate the behavior of workshop personnel, this project improves the YOLOv7 network and designs a decision-level fusion algorithm, which reduces missed detections and improves the performance and accuracy of behavior recognition.
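To make the decision-level fusion idea concrete, the sketch below shows one simple way such a step could be implemented: detections from a visible-light detector and an infrared detector are pooled and merged per class with non-maximum suppression, so a target missed by one sensor can still be kept from the other. The function name, the confidence weighting, and the IoU threshold are illustrative assumptions, not the exact algorithm used in this project.

# Hypothetical sketch of decision-level fusion of visible-light and infrared detections.
# Box format: (x1, y1, x2, y2); both detectors are assumed to output boxes in the same
# image coordinate system and with the same class indices.
import torch
from torchvision.ops import nms


def fuse_detections(vis_dets, ir_dets, iou_thres=0.5, ir_weight=0.9):
    """vis_dets / ir_dets: tensors of shape (N, 6) = [x1, y1, x2, y2, conf, cls]."""
    if ir_dets.numel():
        ir_dets = ir_dets.clone()
        ir_dets[:, 4] *= ir_weight          # optional down-weighting of one sensor (assumption)
    dets = torch.cat([vis_dets, ir_dets], dim=0)
    fused = []
    for cls in dets[:, 5].unique():         # fuse per class so different classes never suppress each other
        cls_dets = dets[dets[:, 5] == cls]
        keep = nms(cls_dets[:, :4], cls_dets[:, 4], iou_thres)
        fused.append(cls_dets[keep])
    return torch.cat(fused, dim=0) if fused else dets


# Example: a detection found only in the infrared image is still kept after fusion.
vis = torch.tensor([[100., 100., 200., 260., 0.80, 0.]])   # class 0 = "phone_call" (assumed)
ir = torch.tensor([[105., 102., 198., 255., 0.70, 0.],
                   [400., 150., 480., 300., 0.65, 1.]])    # class 1 = "smoking" (assumed)
print(fuse_detections(vis, ir))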

5. Hand keypoint detection

Refer to the blog post about OpenCV skeleton keypoint detection, which shows a simple use of OpenCV as a DNN inference tool.
Satya Mallick published an article on using OpenCV to call the hand pose estimation model from the OpenPose project. For readers who want to use hand keypoint detection for gesture recognition, sign-language recognition, or smoking detection, it is a very easy tutorial to get started with.

Algorithm idea

The model used by the author comes from the CMU Perceptual Computing Lab's open-source library OpenPose, which integrates human body, face, and hand keypoint detection. The hand keypoint detector algorithm comes from the CVPR 2017 paper "Hand Keypoint Detection in Single Images using Multiview Bootstrapping".
Because human hands appear from many viewpoints and perform flexible, fine movements in 3D space, it is difficult to obtain accurately labeled datasets. In the paper, the authors propose an iterative improvement algorithm for hand keypoint detection called Multiview Bootstrapping, which yields a higher-precision detector.
image.png

As shown in the figure above, the authors first train a Convolutional Pose Machines network on a small labeled dataset of hand keypoints. Hands are then filmed by 31 high-definition cameras from different viewpoints, and the trained model produces initial keypoint detections in each view. These 2D detections are triangulated using the known camera poses to obtain the 3D positions of the keypoints, and the computed 3D points are reprojected into each 2D view. The reprojected 2D images and keypoint annotations are then used to retrain the detection network; after several iterations, a more accurate hand keypoint detection model is obtained.
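A minimal sketch of the triangulate-then-reproject step described above, using two views for simplicity (the paper uses 31 cameras). The projection matrices and pixel coordinates below are made-up placeholders; only the OpenCV calls and the geometry are the point here.

# Sketch: triangulate one hand keypoint from two views, then reproject it into a third view.
# All matrices and pixel coordinates are illustrative placeholders (assumptions).
import cv2
import numpy as np

# 3x4 projection matrices P = K [R | t] for two calibrated cameras (placeholder values).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.array([[1., 0., 0., -0.1],
               [0., 1., 0.,  0.0],
               [0., 0., 1.,  0.0]])

# The same keypoint detected in each view (2xN arrays of image coordinates).
pt1 = np.array([[0.20], [0.15]])
pt2 = np.array([[0.10], [0.15]])

# Triangulate to homogeneous 3D coordinates, then dehomogenize.
X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)        # shape (4, N)
X = (X_h[:3] / X_h[3]).T                             # shape (N, 3)

# Reproject the 3D point into a third view to create a new 2D "pseudo label".
P3 = np.array([[1., 0., 0., 0.2],
               [0., 1., 0., 0.0],
               [0., 0., 1., 0.0]])
x3_h = P3 @ np.vstack([X.T, np.ones((1, X.shape[0]))])
x3 = x3_h[:2] / x3_h[2]
print("3D point:", X, "reprojected in view 3:", x3.ravel())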
The model proposed in the original paper outputs 22 keypoints: 21 are hand keypoints, and the 22nd represents the background. The figure below shows the 21 keypoints of the human hand.
image.png
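Following the tutorial described above, the sketch below shows how the OpenPose hand model could be run through OpenCV's DNN module. The prototxt/caffemodel file names and the 368-pixel input height follow the common hand-pose example; the image path and the confidence threshold are assumptions to adapt to your own setup.

# Sketch: hand keypoint inference with OpenCV's DNN module and the OpenPose hand model.
# Model file names follow the OpenPose hand-pose example; verify them against your download.
import cv2

proto_file = "hand/pose_deploy.prototxt"            # assumed path
weights_file = "hand/pose_iter_102000.caffemodel"   # assumed path
n_points = 22                                        # 21 hand keypoints + background map

net = cv2.dnn.readNetFromCaffe(proto_file, weights_file)

frame = cv2.imread("hand.jpg")                       # assumed input image
h, w = frame.shape[:2]
in_h = 368
in_w = int((in_h / h) * w)                           # keep the aspect ratio

blob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (in_w, in_h), (0, 0, 0), swapRB=False, crop=False)
net.setInput(blob)
output = net.forward()                               # shape: (1, 22, out_h, out_w)

points = []
for i in range(n_points - 1):                        # skip the background channel
    prob_map = cv2.resize(output[0, i], (w, h))      # confidence map for keypoint i
    _, prob, _, point = cv2.minMaxLoc(prob_map)
    points.append(point if prob > 0.1 else None)     # 0.1 threshold is an assumption

for p in points:
    if p is not None:
        cv2.circle(frame, p, 5, (0, 255, 255), -1)
cv2.imwrite("hand_keypoints.jpg", frame)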

6. Improved YOLOv7

SPD module

Convolutional neural networks (CNNs) have achieved great success in many computer vision tasks such as image classification and object detection. However, their performance degrades rapidly on harder tasks where the image resolution is low or the objects are small. As the referenced blog points out, there is a flawed but common design in existing CNN architectures: the use of strided convolution and/or pooling layers, which loses fine-grained information and leads to less efficient feature representations. To address this, a new CNN building block named SPD-Conv is proposed to replace every strided convolutional layer and every pooling layer (thus eliminating them entirely). SPD-Conv consists of a space-to-depth (SPD) layer followed by a non-strided convolutional (Conv) layer, and can be applied to most, if not all, CNN architectures. The design is evaluated on two of the most representative computer vision tasks, object detection and image classification, by applying SPD-Conv to YOLOv7 and ResNet; empirically, the resulting models significantly outperform state-of-the-art deep learning models, especially on harder tasks with low-resolution images and small objects.
image.png

Module structure

image.png
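As a concrete illustration of the structure above, the sketch below composes the space-to-depth rearrangement with a stride-1 convolution into one SPD-Conv block. It is a minimal PyTorch sketch following the SPD-Conv description, not the exact module definitions used in this repository (those are added to common.py in the next section).

# Minimal sketch of an SPD-Conv block: space-to-depth followed by a non-strided convolution.
# For scale 2, an input of shape (B, C, H, W) becomes (B, 4C, H/2, W/2) after the SPD step,
# so spatial resolution is reduced without discarding any pixels.
import torch
import torch.nn as nn


class SPDConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # Stride-1 convolution applied after the 2x2 space-to-depth rearrangement.
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        # Gather the four pixels of every 2x2 block into the channel dimension.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))


x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)   # torch.Size([1, 128, 40, 40])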

7. Code implementation

Add the following module to the ./models/common.py file:

class space_to_depth(nn.Module):
    # Space-to-depth (SPD) layer: rearranges each 2x2 spatial block into the channel
    # dimension, halving H and W and quadrupling C without discarding any pixels.
    def __init__(self, dimension=1):
        super().__init__()
        self.d = dimension  # dimension along which the four sub-sampled maps are concatenated

    def forward(self, x):
        # (B, C, H, W) -> (B, 4C, H/2, W/2)
        return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                          x[..., ::2, 1::2], x[..., 1::2, 1::2]], self.d)
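After adding the module, a quick shape check like the one below can confirm the rearrangement. The new layer still has to be wired into the network: typically the class name is registered in models/yolo.py's parse_model and the model yaml is edited so that each stride-2 downsampling Conv is replaced by a stride-1 Conv followed by space_to_depth. The exact yaml edits depend on your YOLOv7 config and are not reproduced here.

# Quick sanity check of the space_to_depth layer (run from the repository root, assumption).
import torch
from models.common import space_to_depth

x = torch.randn(1, 64, 80, 80)
y = space_to_depth()(x)
print(y.shape)   # expected: torch.Size([1, 256, 40, 40])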


8. System Integration

The complete source code, environment deployment video tutorial, and custom UI interface are shown in the picture below:
1.png
Refer to the blog "Workshop Dangerous Behavior Detection System Based on Improved YOLOv7 and OpenCV (Source Code & Tutorial)".

