Behavior Recognition (Activity Recognition)


Behavior recognition is a broad research field, and its applications include security monitoring, health care, entertainment, etc.

Course Outline

Introduction

The application of graph convolution in behavior recognition: paper study, code interpretation, experiment

HRNet in top-down keypoint detection: paper study, code interpretation, experiment

Other Algorithms for Action Recognition

Important components in deep learning: optimizers, learning rate strategies

Introduction

Brief description of the main algorithm

Task brief:

Human action recognition (action recognition):
behavior classification (classification), behavior localization (localization), behavior detection (detection)

Action recognition refers to the automatic recognition of human behavior from video data. Behavior recognition can be divided into three levels: behavior classification (classification), behavior localization (localization) and behavior detection (detection).

  1. Behavior classification (classification)
    Behavior classification refers to dividing the entire video sequence into different behavior categories, such as walking, running, driving, etc. This is the most fundamental task in action recognition, and also the most common. In behavior classification, a deep learning model (such as convolutional neural network, recurrent neural network, etc.) is usually used to model video sequences, and a softmax classifier is used to classify each behavior category. The difficulty of behavior classification lies in solving the problem of intra-class difference and inter-class similarity, as well as the problem of data variation under different scenes and lighting conditions.

  2. Behavior localization (localization)
    Behavior localization refers to locating the time period of a certain behavior in a video sequence, for example, locating the specific time period when a person walks in a long video. Behavior localization is more challenging than behavior classification, because it needs to determine the start and end time of the behavior, and the same behavior may appear differently in different time periods. In behavior localization, the method of time window is usually used to segment the video sequence, and then the behavior of each time window is classified, and finally the time period of the behavior is determined by the method of time alignment.

  3. Behavior detection (detection)
    Behavior detection refers to detecting the occurrence of a specific behavior in a video sequence, such as detecting whether a person is walking. Behavior detection usually requires the use of techniques such as object detection to locate the position of the human body, and the use of behavior classification technology to classify the behavior at each human body position. The difficulty of behavior detection is to solve the problem of multi-person behavior detection and behavior detection in complex backgrounds.

It should be noted that action recognition faces many challenges, such as insufficient data volume, data noise, intra-class differences, and inter-class similarity issues. In order to improve the accuracy of behavior recognition, it is usually necessary to combine multiple technologies and models, such as using multiple sensor data, multiple feature extraction methods, and multiple deep learning models for fusion.

Data modality:

Appearance (appearance), depth (depth), optical flow (optical-flow), skeleton graph (skeleton)

Time dimension: t
Traversal dimensions: h, w, t

Data modality refers to the type of data used for behavior recognition. Commonly used data modality includes appearance, depth, optical flow, and skeleton. The time dimension (time dimension) refers to the time axis in the video data, and the traversal dimension (spatial dimensions) refers to the image size in the video data, usually height (height) and width (width).

  1. Appearance (appearance)
    Appearance modality refers to the image data extracted from the video, usually using deep convolutional neural network (Convolutional Neural Networks, CNN) to extract the features of the image, and use the classifier to classify the behavior. Appearance modality can increase the diversity and robustness of data by using different image preprocessing methods, such as using methods such as data augmentation and transfer learning.

  2. Depth (depth)
    The depth mode refers to the data obtained from the depth camera or other depth sensors, which can obtain the three-dimensional posture information of the human body. Depth modalities can be processed using deep convolutional neural networks or other deep learning models and fused with other data modalities to improve the accuracy of action recognition.

  3. Optical flow (optical-flow)
    Optical flow modality refers to the optical flow data extracted from the video sequence; optical flow describes the motion of pixels over time. The optical flow modality can describe the speed and direction of human motion, and behavior recognition typically combines optical flow features with deep learning models.

  4. Skeleton (skeleton)
    The skeleton mode refers to the human skeleton joint information obtained from the motion capture device, and the dynamic posture information of the human body can be obtained. Skeletal modalities can be processed using methods such as skeletal joint coordinates and skeletal motion features, and fused with other data modalities to improve the accuracy of behavior recognition.

In the temporal dimension, action recognition usually cuts the video sequence to form different video segments. In the traversal dimension, behavior recognition usually uses convolutional neural networks to extract features from video frames, and uses recurrent neural networks (Recurrent Neural Networks, RNN) or convolutional neural networks to model video segments. In order to improve the accuracy of action recognition, it is usually necessary to use multiple data modalities and processing methods for fusion.

Operator extension

Conv3d
GraphConv

  1. Conv3d
    Conv3d is a three-dimensional convolution operation for processing three-dimensional data such as video data. It is similar to the two-dimensional convolution operation in the convolutional neural network, but Conv3d is more effective for feature extraction of three-dimensional data. The Conv3d operation usually consists of multiple 3D convolution kernels, and each convolution kernel performs convolution in three dimensions to extract features in three-dimensional data.

The input of Conv3d is a five-dimensional tensor whose dimensions are [batch_size, channels, depth, height, width], where batch_size is the batch size, channels is the number of channels, depth is the depth (or number of frames), height is the height, and width is the width. The output of Conv3d is also a five-dimensional tensor laid out in the same way.
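
As a quick shape check (illustrative sizes), here is PyTorch's nn.Conv3d applied to a five-dimensional input; padding=1 is chosen so depth, height and width are preserved:

import torch
import torch.nn as nn

# a batch of 2 clips: 3 channels, 16 frames, 112x112 pixels -> [batch, channels, depth, height, width]
x = torch.randn(2, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
y = conv3d(x)
print(y.shape)   # torch.Size([2, 64, 16, 112, 112]); padding=1 preserves depth, height and width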

  2. GraphConv
    GraphConv is a convolution operation for graph neural networks for processing graph-structured data. Unlike traditional convolution operations, GraphConv does not perform sliding window operations on the image, but performs convolution operations on the vertices of the graph. GraphConv uses the adjacency matrix of the graph to describe the relationship between vertices, and uses the adjacency matrix as a convolution kernel to extract the features of the vertices.

The input of GraphConv is a three-dimensional tensor whose dimensions are [batch_size, num_nodes, num_features], where batch_size is the batch size, num_nodes is the number of vertices of the graph, and num_features is the feature dimension of each vertex. The output of GraphConv is a tensor laid out the same way (possibly with a different feature dimension).

GraphConv can be applied to various types of graph-structured data, such as social networks, recommender systems, chemical molecules, etc. It can effectively capture the relationship between vertices, thereby improving the performance of graph neural networks.
https://blog.csdn.net/weixin_44402973/article/details/103498856

Optical flow equation

The optical flow equation is a mathematical model describing the optical flow, which describes the pixel-level motion state of the same object in the image between different frames. In the optical flow method, the optical flow equation is derived based on the assumption that the light intensity is constant.

Assume that the position of a pixel in frame t is (x, y), and its position in frame t+1 is (x+u, y+v), where (u, v) is the pixel's displacement in the x and y directions. Assuming that the light intensity is constant, the gray values of the pixel at these two positions are equal. Therefore, the optical flow equation can be obtained:

I(x,y,t) = I(x+u,y+v,t+1)

where I(x,y,t) is the gray value of the pixel (x,y) in frame t, and I(x+u,y+v,t+1) is the gray value of the pixel (x+u,y+v) in frame t+1.

Using the Taylor expansion to approximate the light intensity, we can get:

I(x+u,y+v,t+1) ≈ I(x,y,t) + uIx + vIy + It

Among them, Ix, Iy and It represent the gray gradient along the x, y and time t directions at the pixel point (x, y) respectively.

Substituting the above formula into the optical flow equation, we can get:

Ix u + Iy v = -It

This is the optical flow equation, which relates the motion of a pixel between adjacent frames: (u, v) is the pixel's displacement in the x and y directions, Ix and Iy are the grayscale gradients at (x, y) along x and y, and It is the grayscale change between adjacent frames.

Optical flow

Optical flow refers to the motion state of the same object at the pixel level between two adjacent frames in an image sequence. Optical flow method is a pixel-level object motion analysis method, which can be used for tracking of moving targets, 3D reconstruction, visual odometry and other applications.

In the optical flow method, the core assumption is the assumption of constant light intensity, that is, the light intensity of the same point in adjacent frames is constant. From this, the optical flow equation can be derived, which describes the motion state of pixels in adjacent frames. The derivation of the optical flow equation is based on three assumptions: the assumption of constant light intensity, the assumption of differentiable motion and the assumption of regional consistency.

According to the assumption of constant light intensity, the light intensity of the same point in adjacent frames is constant, and the light intensity can be approximated using the first-order Taylor expansion. From this, the optical flow equation can be obtained, which describes the motion state of pixels in adjacent frames. However, the optical flow equation has only one equation and cannot be solved yet. In order to solve this problem, the equation system of multiple points in the neighborhood of a point can be constructed, and based on the assumption of regional consistency, it is considered that the neighborhood of a point moves in the same way. In this way, multiple equations can be obtained, and the least square method is used to solve the overdetermined equations to obtain the motion state of the pixel.

The Lucas-Kanade optical flow method is a sparse optical flow method, which uses the neighborhood points of pixels to construct a system of equations, which improves the efficiency of solving optical flow. For each pixel point, only some points in its neighborhood are selected to construct the equation system, thus reducing the complexity of the solution.

In short, the optical flow method is a pixel-level object motion analysis method, which can be used for tracking of moving targets, 3D reconstruction, visual odometry and other applications.
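
As an illustration of sparse (Lucas-Kanade) optical flow in practice, here is a minimal OpenCV sketch; the frame file names are placeholders:

import cv2
import numpy as np

# two consecutive grayscale frames (file names are placeholders)
prev_gray = cv2.imread('frame_t.png', cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread('frame_t1.png', cv2.IMREAD_GRAYSCALE)

# pick good corner points to track in the first frame
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                   qualityLevel=0.3, minDistance=7)

# Lucas-Kanade: solve the optical flow equation in a small window around each point
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, prev_pts, None,
                                                 winSize=(15, 15), maxLevel=2)

# (u, v) displacement for the successfully tracked points
flow = (next_pts - prev_pts)[status.flatten() == 1]
print(flow.shape)   # (num_tracked_points, 1, 2)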

Optical flow (optical-flow)

Object motion trend between adjacent frames.

Core assumptions:
  • Constant light intensity: the light-intensity difference between adjacent frames at the same point is 0; the ambient light is consistent, so changes in intensity are caused only by object motion.
  • Differentiable motion: the change over time does not cause a sudden jump in position.
  • Regional consistency (sparse optical flow): points in a small neighborhood move in the same way.

First-order Taylor expansion

Expanding I(x+dx, y+dy, t+dt) to first order and dropping the higher-order infinitesimals gives the optical flow equation
Ix·dx + Iy·dy + It·dt = 0.
Dividing by dt (dt can be regarded as 1) gives
Ix·u + Iy·v + It = 0,
where (u, v) is the movement speed of the pixel p(x, y) along the x and y directions at time t. At this point there is only one equation, which cannot be solved yet.

Similarly, construct a set of equations for multiple points in the neighborhood of point P(x, y): based on assumption 3 (neighborhood motion consistency), the grayscale changes between adjacent frames at multiple points in the neighborhood of one point yield an equation system.

Ix and Iy are the partial derivatives of the single-frame grayscale image in the x and y directions, which can be computed with the Sobel operator; It is the difference between the grayscale images of adjacent frames.

For example, with 1 central point and 8 neighboring points there are 9 equations in total, i.e., the motion within the 3×3 window is assumed consistent. There are only two unknowns (u, v), so this is an over-determined equation system, which can be solved with the least squares method.

Reference information:
Optical flow methods - Zhihu (zhihu.com)
https://zhuanlan.zhihu.com/p/384651830
Source:
Optical flow estimation - from traditional methods to deep learning - Zhihu (zhihu.com)
A Comprehensive Study of Deep Video Action Recognition
https://zhuanlan.zhihu.com/p/74460341


Skeleton (skeleton)

In the scene of the key points of the human skeleton, an undirected graph G can be used to represent the topology of the skeleton, where the point V represents the key points of the human body, and the edge E represents the connection relationship between the bones. The adjacency matrix A can be used to represent the connection between points in the graph, where A(i,j) indicates whether there is a connection between point i and point j, if there is a connection, it is 1, otherwise it is 0.

For the key points of the human body, each key point can be expressed as a tuple [k, p], where k identifies the key point and p is its position information. The position can be expressed as two-dimensional coordinates (x, y) or three-dimensional coordinates (x, y, z), where x, y and z are the key point's spatial coordinates; an extra component c can be appended to represent the key point's confidence or visibility, i.e., p = [x, y, c].

The connected edge lengths can represent distances between bones and can be calculated using Euclidean distance or other distance metrics. In the adjacency matrix, the edge length can be expressed as a weight, that is, A(i,j) represents the edge weight between point i and point j, and it is 0 if there is no connection. This allows for a more accurate representation of connectivity and distance relationships between keypoints.
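
As a small illustrative example (a hypothetical 5-joint skeleton, not a standard keypoint layout), here is how the unweighted adjacency matrix A and the keypoint array described above can be built:

import numpy as np

# hypothetical 5-keypoint skeleton: 0=head, 1=neck, 2=hip, 3=left knee, 4=right knee
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
num_joints = 5

A = np.zeros((num_joints, num_joints), dtype=np.float32)
for i, j in edges:
    A[i, j] = 1.0   # undirected graph: mark both directions
    A[j, i] = 1.0

# keypoints as [x, y, c] (position + confidence), one row per joint
keypoints = np.random.rand(num_joints, 3)
print(A)
print(keypoints.shape)   # (5, 3)
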
Undirected graph G
Points V: human key points
Edges E: bone connections
Adjacency matrix A: adjacent = 1, regardless of weight

Human key points: [k,p], p=[x,y,c] or [x,y,z]

Weighted adjacency matrix: the edge lengths of all pairwise point connections (an outer-product-like layout).

What is wrong with the figure on the right?
0 means either "not connected" or "distance 0"; how should the connection between v and v itself be expressed?
In the figure on the right, 0 in the adjacency matrix means there is no edge connection between two points, and 1 means there is an edge connection between the two points. Therefore, if a point is connected to itself, the corresponding adjacency matrix element should be 1. In the case of human key points, if a key point is connected to itself, it means that the key point has a self-connection, which can be represented by 1. If a keypoint is not connected to another keypoint, then the adjacency matrix element between these two keypoints should be 0 instead of representing a distance of 0.

For the length of the connected edge, it can be expressed as a weight, and a weighted adjacency matrix can be used to represent the graph, where the matrix elements represent the weight of the corresponding edge. If there is no connection between two keypoints, the corresponding adjacency matrix element is 0, indicating that there is no edge connection between them and no weight.

Figure source:
https://zhuanlan.zhihu.com/p/89503068
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Operator extension

https://zhuanlan.zhihu.com/p/63974249

The im2col method is a method of converting convolution operation into matrix multiplication, which is widely used in convolutional neural networks. Its implementation steps are as follows:

  1. Zero-padding is performed on the input image so that the output after convolution has the same size as the input. The size of zero padding is determined by the convolution kernel size and stride.
  2. Sliding window operation is performed on the zero-padded input image, and the pixels in each window are arranged into a column vector in a column-first manner to form a two-dimensional matrix. Each column of this matrix corresponds to a pixel in a sliding window.
  3. Concatenate all the column vectors in the sliding window into a large matrix, and each column corresponds to a pixel in a sliding window.
  4. Arrange the elements of each convolution kernel into a row vector in a row-first manner; stacking the kernels of all output channels forms a two-dimensional matrix in which each row corresponds to one convolution kernel.
  5. Multiply the kernel matrix from step 4 by the large column matrix from step 3. Each entry of the result is the dot product of one kernel with the pixels of one sliding window.
  6. Each such dot product is a scalar; together these scalars form the elements of the output matrix (one row per output channel, one column per window position).
  7. Reshape the output matrix to get the output image.

In this way, by converting the convolution operation into matrix multiplication, the efficiency and parallelism of matrix multiplication can be utilized to accelerate the calculation process of convolutional neural networks.
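
Below is a minimal NumPy sketch of the im2col idea (single channel, stride 1, no padding), intended only to make the convolution-as-matrix-multiplication view concrete:

import numpy as np

def im2col(img, k):
    """Rearrange each k*k sliding window of a 2D image into one column."""
    h, w = img.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.zeros((k * k, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = img[i:i + k, j:j + k].ravel()
            idx += 1
    return cols, (out_h, out_w)

img = np.arange(25, dtype=np.float32).reshape(5, 5)
kernel = np.ones((3, 3), dtype=np.float32) / 9.0       # 3x3 mean filter

cols, (out_h, out_w) = im2col(img, 3)
out = (kernel.ravel() @ cols).reshape(out_h, out_w)    # convolution as one matrix product
print(out.shape)   # (3, 3)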

  1. The reverse calculation of im2col is col2im. In backpropagation, the error needs to be passed back to the input data, and col2im is the process of converting the error back to the input data. The specific steps are as follows:
    1) Rearrange the error tensor into a two-dimensional matrix, and each column corresponds to a column in im2col.
    2) Rearrange the convolution kernel into a two-dimensional matrix, each row corresponds to a row in im2col.
    3) Multiply the two-dimensional error matrix and the two-dimensional convolution kernel matrix to obtain a two-dimensional output matrix.
    4) Rearrange the output matrix back to the shape of the input data, i.e., scatter the columns back into the input tensor's shape using the same window size and stride as in im2col (values at overlapping positions are accumulated).

  2. The purpose of image affine enhancement (rotation) is to increase the diversity of data and enhance the generalization ability of the model. By rotating the image, more training data can be obtained, so that the model can better learn image features. For the convolution kernel in the convolutional neural network, if the image is rotated, the convolution kernel should also be rotated accordingly, so as to better identify the features in the image. Therefore, when image rotation is performed, the convolution kernel parameters in the convolutional neural network also need to be rotated, and the model must be retrained to obtain better performance.

Reflected in the filter parameters, learned convolution kernels are not rotation-invariant: a kernel tuned to patterns in the original orientation does not respond the same way to rotated patterns, so the network either needs additional filters to cover different orientations or needs to be trained (or fine-tuned) on rotated images to adapt to the rotated features.

Graph Convolutional Neural Networks

Graph Convolutional Networks (GCN for short) is a neural network model that can process graph data. Different from the traditional convolutional neural network, GCN can process data with non-Euclidean structure, such as social network, protein molecular structure, etc., and has a wide range of application prospects.

The core idea of ​​GCN is to perform convolution operation on the graph, transfer the feature representation of a node to its neighbor nodes, and perform weighted summation. Specifically, each node in GCN has a feature representation, which can be the node's own feature vector, or an aggregate feature vector composed of the feature vectors of the node's neighbor nodes. GCN obtains a new feature vector for each node by performing linear transformation and nonlinear transformation on the feature vector of each node. This process can be seen as mapping the feature representation of nodes into a higher-dimensional space and performing convolution operations in this space.

Specifically, the convolution operation of GCN can be expressed as:

H^{(l+1)} = \sigma(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)})

where H^{(l)} denotes the feature matrix of all nodes in the l-th GCN layer, \hat{A} = A + I is the adjacency matrix with self-loops, \hat{D} is the degree matrix of \hat{A} (a diagonal matrix), \sigma is a nonlinear activation function, and W^{(l)} is the weight matrix of the l-th GCN layer. The formula means that the node feature matrix H^{(l)} is first mapped to a new feature space by the linear transformation W^{(l)}; the normalized adjacency matrix \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} then propagates each node's features to its neighbors and computes a weighted sum; finally, the nonlinear activation \sigma is applied to obtain the new feature matrix H^{(l+1)}.

The training process of GCN is usually optimized using the backpropagation algorithm, and the goal is to minimize the loss function to improve the generalization ability of the model.
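
As a concrete illustration, here is a minimal PyTorch sketch of a single GCN layer implementing the formula above with a dense adjacency matrix; the class name and shapes are illustrative, not a specific library API:

import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One GCN layer: H' = sigma(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, H, A):
        # A: (N, N) dense adjacency matrix, H: (N, in_features) node features
        A_hat = A + torch.eye(A.size(0), device=A.device)      # add self-loops
        D_hat = A_hat.sum(dim=1)                                # node degrees
        D_inv_sqrt = torch.diag(D_hat.pow(-0.5))                # D^-1/2
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt                # symmetric normalization
        return torch.relu(A_norm @ self.linear(H))              # aggregate, transform, activate

# toy usage: 4 nodes in a chain, 3-dim features
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])
H = torch.randn(4, 3)
layer = SimpleGCNLayer(3, 8)
print(layer(H, A).shape)   # torch.Size([4, 8])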

Graph Convolutional Neural Network (GCN) emerged to solve the problem of graph data processing. Traditional neural network models (such as fully connected neural networks, convolutional neural networks, etc.) can only process data with a Euclidean structure, that is, the data has a regular grid structure in space. For graph data, the connection relationship between its nodes is arbitrary and irregular, so it cannot be directly processed by traditional neural network models.

The emergence of GCN just fills this gap. GCN can transfer and aggregate information between nodes through graph convolution operations, and extract feature representations of graphs on this basis, so that it can be used for tasks such as classification, clustering, and link prediction of graph data.

In addition to the problem that traditional neural networks cannot handle graph data, there are some other reasons that have contributed to the emergence of graph convolutional neural networks:

  1. Sparseness of graph data: In graph data, the connection relationship between nodes is arbitrary, which leads to graph data is usually sparse. However, traditional neural network models require dense connection parameters, which makes it difficult for traditional neural networks to handle sparse graph data. The adjacency matrix used by GCN can effectively deal with the sparsity problem.

  2. Invariance of graph data: In graph data, the labels of nodes are arbitrary, which leads to possible changes in the relative positions of nodes, which makes it difficult for traditional neural network models to deal with. The convolution operation used by GCN is based on the adjacency matrix, which can ensure that the same nodes in different graphs have the same feature representation, thus having certain invariance.

  3. Nontriviality of graph data: Unlike traditional Euclidean data, each node in graph data is independent, and the feature vector of each node may have different dimensions. This makes it difficult for traditional neural network models to handle non-trivialities in graph data. The convolution operation used by GCN can share weights among different nodes, so as to effectively deal with non-trivial problems in graph data.

Therefore, graph convolutional neural networks emerged to solve the processing of graph data. GCN can effectively represent and learn the features of graph data, and achieves good performance in tasks such as classification, clustering, and link prediction on graphs.

Conv3d: cin -> cout, sliding over (d, h, w)

3D data: d = depth
Temporal data: d = time

Kernel: cin × (d, k, k)
Filters: cout × (cin, d, k, k)

GraphConv generally only considers adjacent points with a distance of 1

An image described as a graph:
  point set {V} = {RGB(x, y)}
  edge set {E}: {p(x, y) adjacent to p(x±1, y), p(x, y±1)}; corner points have 2 adjacent points, edge points have 3 adjacent points, and all other points have 4 adjacent points

Conv2d described as GraphConv:
Assume a 3×3 conv2d kernel whose corner entries are 0 and whose other entries are 1/5. The convolution then averages each pixel with its four up/down/left/right neighbors, which is equivalent to using GraphConv to convolve point P with its adjacent points {p(x±1, y), p(x, y±1)} (checked numerically below).
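
A small numerical check of this equivalence under the stated assumptions (interior pixels only, since border handling differs):

import torch
import torch.nn.functional as F

img = torch.rand(1, 1, 6, 6)

# 3x3 kernel: corners = 0, center and 4 neighbors = 1/5
kernel = torch.tensor([[0., 1., 0.],
                       [1., 1., 1.],
                       [0., 1., 0.]]) / 5.0
conv_out = F.conv2d(img, kernel.view(1, 1, 3, 3), padding=1)

# graph view: average each interior pixel with itself and its 4 neighbors
graph_out = torch.zeros_like(img)
for x in range(1, 5):
    for y in range(1, 5):
        neigh = img[0, 0, x, y] + img[0, 0, x-1, y] + img[0, 0, x+1, y] \
              + img[0, 0, x, y-1] + img[0, 0, x, y+1]
        graph_out[0, 0, x, y] = neigh / 5.0

print(torch.allclose(conv_out[..., 1:-1, 1:-1], graph_out[..., 1:-1, 1:-1]))   # True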


Figure source: From Graph to Graph Convolution: Talking about the Graph Neural Network Model (2) - Zhihu (zhihu.com)
Operator extension: GraphConv
  Node feature matrix H
  Adjacency matrix A: adjacent = 1, regardless of weight
  Degree matrix D: the number of adjacent points of each node
  Laplacian matrix L = D - A
  Activation function σ




Feature Aggregation Calculation Method

  1. H' = σ(A@H), what's the problem?
    Dimension check: (k, p) = (k, k) @ (k, p)
    L = D - A

  2. H' = σ(L@H@W), what's the problem?

  3. L' = D^{-1/2} A D^{-1/2}, which can be loosely understood as normalizing the adjacency matrix by the (inverse square root of the) degree matrix

  4. What's wrong with H' = σ(A@H)?
    In this case, the calculation method of feature aggregation is to directly perform matrix multiplication of adjacency matrix A and feature matrix H to obtain a new feature matrix H'. One problem with this method is that it does not take into account the degree information of nodes, which has an important impact on the aggregation and transfer of features. In this case, the greater the degree of a node, the greater the influence of neighbor nodes on the feature aggregation of the node, but directly using the adjacency matrix for feature aggregation cannot reflect this effect.

  5. What's wrong with H' = σ(L@H@W)?
    In this case, the feature aggregation method is to perform matrix multiplication of the Laplacian matrix L and the feature matrix H, and multiply it by the weight matrix W to obtain a new feature matrix H'. This method considers the degree information of nodes, because the degree matrix D in the Laplacian matrix reflects the degree information of nodes. However, there is a problem with this method, that is, it does not take into account the node's own feature information, because the feature matrix H is directly multiplied by the Laplacian matrix L, and the feature matrix is ​​not transformed. This may cause the model to under-handle the characteristics of the nodes themselves and affect the performance of the model.

  6. L' = D^{-1/2} A D^{-1/2}, i.e., the adjacency matrix normalized on both sides by the inverse square root of the degree matrix. In this case, the adjacency matrix A is first
    normalized through the degree matrix D to obtain a new matrix L'; L' is then multiplied by the feature matrix H and by the weight matrix W to obtain the new feature matrix H'. This method considers the degree information of the nodes and (once self-loops are added, as in the GCN formula above) the nodes' own features, because it takes the adjacency matrix and the degree matrix into account at the same time. It outperforms the two methods above because it better reflects how features are aggregated and propagated between nodes (see the numerical illustration below).
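
A quick NumPy illustration of the degree problem in method 1 and how the normalization in method 6 mitigates it (toy graph, constant features):

import numpy as np

# toy graph: node 0 has 3 neighbors, nodes 1-3 have 1 neighbor each
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
H = np.ones((4, 2))                      # identical features on every node

# method 1) plain aggregation: high-degree nodes get inflated features
print((A @ H)[:, 0])                     # [3. 1. 1. 1.]

# method 6) symmetric normalization with self-loops keeps the scale comparable
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
print((A_norm @ H)[:, 0])                # approx [1.31 0.85 0.85 0.85], much closer in scale than [3 1 1 1]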

Mainstream algorithms

Based on dual stream network

Dual-stream network is a deep learning model commonly used in video action recognition, which uses two parallel convolutional neural networks to process two input streams of video: image stream and optical flow stream.

The image stream is composed of the frames of the video and provides static appearance information. The optical flow stream is obtained by computing the pixel motion between adjacent frames and provides dynamic motion information.

In a two-stream network, each input stream has its own convolutional neural network, and these networks usually have different structures and parameters. For example, the image stream can use a conventional convolutional neural network such as VGG or ResNet, while the optical flow stream usually stacks optical flow fields produced by an optical flow estimation algorithm and processes them with a (2D or 3D) convolutional neural network. The features of the two streams can then be combined and fed into a fully connected layer for classification.

The advantage of the two-stream network is that it can utilize both static and dynamic information in videos, and it has shown excellent performance in action recognition tasks. It has been widely used in many video action recognition tasks, such as human action recognition, traffic action recognition, and gesture recognition, etc.

A deep learning method based on a two-stream network can be used for action recognition, which is a two-stream model using RGB images and optical flow images. Optical flow refers to the motion trajectory of a pixel in two adjacent frames of images, which can provide motion information of objects. In the dual-stream network, the RGB image and the optical flow image are respectively input into two convolutional neural networks for processing, and then their features are combined for classification to realize behavior recognition.
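
Below is a minimal late-fusion sketch under these assumptions, with torchvision's resnet18 used as a stand-in backbone for both streams; the channel counts and fusion choice are illustrative, not a specific published model:

import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        # spatial stream: a single RGB frame (3 channels)
        self.rgb_cnn = models.resnet18(weights=None)
        self.rgb_cnn.fc = nn.Linear(self.rgb_cnn.fc.in_features, num_classes)
        # temporal stream: stacked optical flow (2 * flow_stack channels: u and v)
        self.flow_cnn = models.resnet18(weights=None)
        self.flow_cnn.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)
        self.flow_cnn.fc = nn.Linear(self.flow_cnn.fc.in_features, num_classes)

    def forward(self, rgb, flow):
        # late fusion: average the class scores of the two streams
        return (self.rgb_cnn(rgb) + self.flow_cnn(flow)) / 2

model = TwoStreamNet(num_classes=101)
rgb = torch.randn(2, 3, 224, 224)      # one RGB frame per clip
flow = torch.randn(2, 20, 224, 224)    # 10 stacked (u, v) flow fields
print(model(rgb, flow).shape)          # torch.Size([2, 101])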

The following are some examples of deep learning applications for action recognition based on two-stream networks:

  1. Human behavior recognition: A deep learning method based on a two-stream network can be used for human behavior recognition, such as recognizing running, jumping, punching and other behaviors. This approach is usually processed using convolutional neural networks and recurrent neural networks.

  2. Traffic behavior recognition: A deep learning method based on a dual-stream network can be used for traffic behavior recognition, such as recognizing the behavior of vehicles, pedestrians, bicycles and other road users. This approach is usually processed using convolutional neural networks and recurrent neural networks.

  3. Action recognition: A deep learning method based on a two-stream network can be used for action recognition, such as recognizing gestures, facial expressions, and other actions. This approach is usually processed using convolutional neural networks and recurrent neural networks.

It should be noted that the deep learning method based on the dual-stream network requires a large amount of training data and computing resources, and at the same time requires special attention to issues such as data preprocessing, data enhancement, and model selection. In addition, there are still some challenges in the practical application of deep learning methods based on two-stream networks, such as how to deal with optical flow images and how to choose an appropriate network structure.

Based on 3d model

3D Convolutional Neural Network (3D CNN) is a deep learning model for processing 3D data such as videos, medical images, and 3D objects. They are extensions of 2D CNNs that can perform convolution operations in three dimensions of video or 3D images.

The basic structure of 3D CNN is similar to that of 2D CNN, including convolutional layers, pooling layers, and fully connected layers. In the convolution layer, 3D CNN uses 3D convolution kernel to convolve 3D tensor. In the pooling layer, we use 3D pooling kernels to pool 3D tensors. In the fully connected layer, we convert the 3D tensor to a 1D vector and feed it into the fully connected layer for classification or regression.
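
A minimal 3D CNN following this structure (3D convolution, pooling, then a classification head); the layer sizes are illustrative:

import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),                 # halves T, H, W
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                     # global pooling -> (B, 64, 1, 1, 1)
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                    # x: (B, C, T, H, W)
        x = self.features(x).flatten(1)
        return self.classifier(x)

clip = torch.randn(2, 3, 16, 112, 112)       # 2 clips, 16 frames each
print(Simple3DCNN()(clip).shape)             # torch.Size([2, 10])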

When we are dealing with 3D data, deep learning models based on 3D Convolutional Neural Networks (3D CNN) can be used. Similar to 2D CNN, 3D CNN also uses convolutional and pooling layers to extract features in 3D data, and uses fully connected layers for classification or regression.

The input of a 3D CNN is a 3D tensor, usually composed of multiple 3D images. Similar to 2D CNN, 3D CNN can also increase the depth of the network by stacking multiple convolutional and pooling layers. In the convolution layer, we use a 3D convolution kernel to convolve a 3D tensor. In the pooling layer, we use 3D pooling kernels to pool 3D tensors.

When training 3D CNN, we can use similar methods to 2D CNN, such as stochastic gradient descent (SGD) and backpropagation algorithm. At the same time, we can also use other advanced optimization algorithms, such as Adam and Adagrad.

3D CNNs are often used to process 3D data, such as videos, 3D objects, and medical images. They have a wide range of applications in areas such as action recognition, object recognition, and medical image segmentation.
The deep learning method based on 3D model can be used for behavior recognition, which is a method of using 3D information in video sequences for behavior recognition. Different from 2D model-based methods, 3D model-based methods can directly exploit the spatial information in videos, which can better capture the motion and behavior of objects.

Here are some examples of deep learning applications for 3D model-based action recognition:

  1. Human behavior recognition: 3D model-based deep learning methods can be used for human behavior recognition, such as recognizing running, jumping, punching and other behaviors. This approach is usually processed using 3D convolutional neural networks.

  2. Traffic behavior recognition: 3D model-based deep learning methods can be used for traffic behavior recognition, such as recognizing the behavior of vehicles, pedestrians, bicycles and other road users. This approach is usually processed using 3D convolutional neural networks.

  3. Action recognition: 3D model-based deep learning methods can be used for action recognition, such as recognizing gestures, facial expressions, and other actions. This approach is usually processed using 3D convolutional neural networks.

It should be noted that deep learning methods based on 3D models also require a large amount of training data and computing resources, and special attention needs to be paid to data preprocessing, data enhancement, and model selection. In addition, there are still some challenges in deep learning methods based on 3D models, such as how to deal with 3D models of different scales, how to deal with motion blur, and so on.

Based on 2d model + timing model

The behavior recognition algorithm based on 2D model + time series model usually refers to the method of combining 2D images and time series information for behavior recognition. The algorithm consists of two main steps: first, features are extracted from each video frame using a 2D convolutional neural network (CNN). Second, use a time series model (such as recurrent neural network, LSTM, etc.) to model the extracted feature sequence as a behavior sequence, and classify or regress it.

Specifically, the behavior recognition algorithm based on 2D model + time series model can be divided into the following steps:

  1. Data preprocessing: Preprocessing the video, such as cropping, scaling, and normalization, for subsequent processing.

  2. Feature Extraction: Use 2D CNN to extract features from video frames. You can use a pre-trained 2D CNN model (such as VGG, ResNet, etc.), or you can train a task-specific 2D CNN model yourself.

  3. Feature Sequence Modeling: The sequence of features extracted by 2D CNN is modeled as a sequence of behaviors using a temporal model. Recurrent neural network models such as LSTM and GRU can be used, or models that combine convolution and LSTM such as convolutional LSTM can be used.

  4. Behavioral classification or regression: Use fully connected layers to map the output of a time series model to class labels or regression values.

Behavior recognition algorithms based on 2D models + time series models usually require a large amount of training data and computing resources. In addition, special attention needs to be paid to issues such as data preprocessing, data augmentation, and model selection, as well as how to deal with issues such as different scales and motion blur.
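
A compact sketch of this pipeline (per-frame 2D CNN features followed by an LSTM and a classification head); the backbone and sizes are stand-ins, not a specific published model:

import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                      # 512-d feature per frame
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                            # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))            # (B*T, 512)
        feats = feats.view(b, t, -1)                     # (B, T, 512)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                       # classify from the last time step

clips = torch.randn(2, 8, 3, 224, 224)                   # 2 clips of 8 frames
print(CNNLSTM(num_classes=5)(clips).shape)               # torch.Size([2, 5])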

Skeleton-based graph model

The behavior recognition algorithm based on the skeleton graph model is a method of behavior recognition using human skeleton information. The algorithm first detects the human skeleton information from the video, then converts the skeleton information into a graph model, and finally uses a graph neural network to model and classify the graph model.

Specifically, the behavior recognition algorithm based on the skeleton graph model can be divided into the following steps:

  1. Data preprocessing: Preprocessing the video, such as cropping, scaling, and normalization, for subsequent processing.

  2. Skeleton detection: Use human skeleton detection algorithms (such as OpenPose) to extract human skeleton information from videos.

  3. Graph model construction: convert each skeleton node into a node in the graph model, and connect adjacent nodes to build a skeleton graph model. Different graph models can be selected according to different requirements, such as graph models based on adjacency matrix or graph models based on edge list, etc.

  4. Modeling with Graph Neural Networks: Modeling skeleton graph models using graph neural networks. A graph neural network model based on GCN (Graph Convolutional Network) or GAT (Graph Attention Network) can be used.

  5. Behavioral Classification: Mapping the output of a graph neural network to class labels using fully connected layers.

The behavior recognition algorithm based on the skeleton graph model can use the human skeleton information and is not affected by factors such as lighting and background in the video, so it has good robustness. In addition, the graph neural network can model different nodes and edges, so it can better capture the joint motion and pose information in the skeleton, and improve the accuracy of behavior recognition.

It should be noted that the behavior recognition algorithm based on the skeleton graph model requires a large amount of training data and computing resources. At the same time, special attention needs to be paid to data preprocessing, data augmentation and model selection, as well as how to handle different action speeds and posture changes.

slowfast

SlowFast is a deep learning model for video behavior recognition, proposed by Facebook AI Research. It combines a slow pathway and a fast pathway to capture both the semantics and the fast motion and details in videos. The SlowFast model consists of two parts: a slow pathway and a fast pathway.

The slow pathway processes the video at a low frame rate and focuses on relatively slowly changing semantic information, such as overall appearance and pose; it uses a backbone with higher channel capacity (typically a 3D ResNet). The fast pathway processes the video at a high frame rate and focuses on rapidly changing motion information; it uses a lightweight backbone with far fewer channels, and its features are fused into the slow pathway through lateral connections.

Key benefits of the SlowFast model include:

  1. Efficient feature extraction: By processing videos in layers, the SlowFast model can efficiently extract key features in videos, thereby improving the accuracy of action recognition.

  2. Sensitivity to fast actions and details: The slow and fast processes can effectively capture fast actions and details in the video, thereby improving the accuracy of behavior recognition.

  3. Scalability: The SlowFast model can be easily applied to different video action recognition tasks, such as action classification, action detection, and action localization, etc.

It should be noted that using the SlowFast model for behavior recognition requires a large amount of training data and computing resources. At the same time, special attention needs to be paid to data preprocessing, data enhancement and model selection, as well as how to deal with issues such as different scales and motion blur.
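
A tiny sketch of the two-rate temporal sampling idea behind SlowFast (the strides are illustrative); the actual pathways and lateral connections are omitted:

import torch

clip = torch.randn(1, 3, 64, 224, 224)      # (B, C, T, H, W), 64 frames

alpha = 8                                    # fast pathway keeps alpha times more frames
fast_frames = clip[:, :, ::2, :, :]          # 32 frames for the fast pathway
slow_frames = clip[:, :, ::2 * alpha, :, :]  # 4 frames for the slow pathway

print(fast_frames.shape[2], slow_frames.shape[2])   # 32 4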

st-gcn

ST-GCN is a video behavior recognition model based on Spatial-Temporal Graph Convolutional Network, which can model human skeletal motion and perform behavior classification. The ST-GCN model can capture the spatio-temporal relationship in human skeletal motion, so as to better characterize human motion characteristics.

ST-GCN mainly includes three parts: construction of spatiotemporal graph, spatiotemporal graph convolutional neural network and classifier.

  1. Construction of the spatio-temporal graph: First, the human skeleton motion sequence is converted into a spatio-temporal graph structure, each node represents a key point of the skeleton, and each edge represents the connection relationship between different key points. Then, according to the distance between key points and the relationship in time, a spatio-temporal graph is constructed.

  2. Spatio-temporal graph convolutional neural network: the constructed spatio-temporal graph is fed into ST-GCN's convolutional network, where each node represents a skeleton key point; each spatio-temporal graph convolution layer contains a spatial graph convolution, a temporal convolution and a corresponding nonlinear activation function (see the sketch after this list). Spatio-temporal features can be extracted efficiently by stacking multiple such layers.

  3. Classifier: Finally, the output of the spatio-temporal graph convolutional network is converted into a fixed-length feature vector using a global pooling operation, which is then mapped to an action category using a fully connected layer for classification.
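
A simplified sketch of one spatio-temporal block (a 1×1 feature transform, aggregation over a normalized adjacency matrix, then a temporal convolution); this illustrates the idea and is not the official ST-GCN implementation:

import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_c, out_c, A):
        super().__init__()
        self.register_buffer('A', A)                       # (V, V) normalized adjacency matrix
        self.gcn = nn.Conv2d(in_c, out_c, kernel_size=1)   # per-node feature transform
        self.tcn = nn.Conv2d(out_c, out_c, kernel_size=(9, 1), padding=(4, 0))  # temporal conv
        self.relu = nn.ReLU()

    def forward(self, x):                                   # x: (B, C, T, V)
        x = self.gcn(x)                                     # transform features
        x = torch.einsum('bctv,vw->bctw', x, self.A)        # aggregate over neighboring joints
        return self.relu(self.tcn(x))                       # convolve along the time axis

V = 18                                                      # e.g. 18 skeleton joints
A = torch.eye(V)                                            # placeholder adjacency (identity); use a normalized skeleton adjacency in practice
block = STGCNBlock(3, 64, A)
x = torch.randn(2, 3, 100, V)                               # 2 clips, 100 frames, (x, y, conf) per joint
print(block(x).shape)                                       # torch.Size([2, 64, 100, 18])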

The ST-GCN model has the following advantages:

  1. It can effectively capture the spatio-temporal relationship of human skeleton movement and improve the accuracy of behavior classification.

  2. Skeleton motion sequences with different numbers of key points and different lengths can be handled.

  3. The model parameters are relatively few and have high computational efficiency.

It should be noted that using the ST-GCN model for behavior recognition requires a large amount of training data and computing resources. At the same time, special attention needs to be paid to issues such as data preprocessing, data enhancement, and model selection, as well as how to deal with issues such as different scales and pose changes.

HRNet for top-down heatmap keypoint detection

Data modality

Data modality:

RGB image => centered image of human body detection frame

Save the box center coordinates (center) and the box width and height wh (i.e., the scale), which are used to map the results back to the original image (decode)
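
A hedged sketch of this decode step (function and argument names are illustrative): mapping a heatmap peak back to original-image coordinates using the saved box center and width/height:

import numpy as np

def decode_keypoint(hm_xy, heatmap_size, center, wh):
    """Map a heatmap coordinate back to original-image coordinates.

    hm_xy:        (x, y) location of the peak in the heatmap
    heatmap_size: (W_hm, H_hm) of the heatmap
    center:       (cx, cy) of the detection box in the original image
    wh:           (w, h) of the detection box in the original image
    """
    scale = np.array(wh, dtype=float) / np.array(heatmap_size, dtype=float)
    top_left = np.array(center, dtype=float) - np.array(wh, dtype=float) / 2.0
    return top_left + np.array(hm_xy, dtype=float) * scale

# peak at (12, 20) in a 48x64 heatmap, box centered at (300, 400) with size 192x256
print(decode_keypoint((12, 20), (48, 64), (300, 400), (192, 256)))   # [252. 352.]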

Operator extension

Operator extension:

Conv2d(stride=2,…), reduce the image size by 2 times
Conv2d(ksize=1,…), keep the original image size, only do channel transformation

Conv2d(dilation=2, …), dilated/atrous convolution: insert (dilation − 1) rows/columns of zeros between the rows/columns of the convolution kernel

Conv2d(group=k,…), grouped convolution, input channels are divided into k groups, output channels are divided into k groups
(depth separable convolution: Conv2d(group=n,…)+Conv2d(kernelsize=1,…))

Conv2dtranspose(inchannel, outchannel, stride, pad, …), transposed convolution ("deconvolution"):
insert (stride − 1) rows/columns of zeros between the rows/columns of the input image,
then subtract pad from the edges

Pool2d/unpool2d

References:
conv_arithmetic/README.md at master vdumoulin/conv_arithmetic (github.com)
Regular convolution

Conv2d(stride=2,...), reduce the image size by 2 times

Pool2d(stride=2,...), reduce the size, increase the receptive field, and will not change the number of channels

Output image size: out = ⌊(in + 2·pad − ksize) / stride⌋ + 1

Calculation of the receptive field (the receptive field of layer T as seen in layer i, with i starting from 0, i.e., the original image):
RF_i = (RF_{i+1} − 1) × stride_{i+1} + ksize_{i+1}

Example: conv1 = conv2d(3×3, stride=1), conv2 = conv2d(3×3, stride=1)
Feature0 = 7×7, feature1 = 5×5, feature2 = 3×3
Receptive field of layer 2 in layer 1: 3 = (1−1)×1 + 3
Receptive field of layer 2 in layer 0: 5 = (3−1)×1 + 3
1×1 convolution

Conv2d(ksize=1, pad=0, …), keeps the original image size

Can be used for channel conversion (channel alignment, …)
Can be used in the output head instead of a linear layer, for pixel-granularity output
(a fully connected layer restricts the input size)


Source:
1X1 Convolution, CNN, CV, Neural Networks | Analytics Vidhya (medium.com)
Dilated convolution / atrous convolution

Conv2d(dilation=2, …): insert (dilation − 1) rows/columns of zeros between the rows/columns of the convolution kernel

Im2col calculation method is similar to conv2d

Increases the receptive field

group convolution

Conv2d(group=k,…), grouped convolution, input channels are divided into k groups, output channels are divided into k groups
(depth separable convolution: Conv2d(group=n,…)+Conv2d(kernelsize=1,…))

Conv2d parameter count: cout × cin × k × k

Grouped convolution parameter count: cout × (cin / g) × k × k, i.e., 1/g of the regular convolution

=>
Depthwise separable convolution: depthwise-conv + pointwise-conv

When group = cin (= cout), the grouped convolution becomes a depthwise-conv with cin × k × k parameters;
when ksize = 1, the regular convolution becomes a pointwise-conv with cout × cin parameters.
The total, cin × k × k + cout × cin, is roughly (1/k² + 1/cout) of the regular convolution's parameter count (see the parameter-count check below).

This idea of scanning the channel dimension and the spatial dimension separately also applies to other lightweight operators.
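
A quick parameter-count check with PyTorch (illustrative channel sizes):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

cin, cout, k = 64, 128, 3

regular   = nn.Conv2d(cin, cout, k, padding=1, bias=False)
depthwise = nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False)   # group = cin
pointwise = nn.Conv2d(cin, cout, 1, bias=False)                          # ksize = 1

print(n_params(regular))                            # 73728 = 128*64*3*3
print(n_params(depthwise) + n_params(pointwise))    # 576 + 8192 = 8768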

Transposed convolution
is not strictly a deconvolution: only the spatial size is restored, not the original values (see the shape check below)

Conv2dtranspose(stride,pad,…)

  • Kernel is consistent with the previous corresponding convolutional layer
  • Stride does not refer to the sliding step of the convolution kernel, but refers to the number of rows and columns inserted with 0 values:
  1. Insert stride-1 row/column 0 value in input graph row/column
  2. The convolution kernel sliding step is constant at 1
  • Pad is to subtract sides, not add sides
  • Kernel will scan beyond the boundary of the image, keeping 1 row/column at the farthest
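
A shape check of the conv -> conv-transpose size round-trip shown in the pipeline below; for even input sizes an extra output_padding may be needed to recover the size exactly:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)

down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
up   = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)

mid = down(x)      # (7 + 2*1 - 3)//2 + 1 = 4
rec = up(mid)      # (4 - 1)*2 - 2*1 + 3 = 7: the size is restored, not the values
print(mid.shape, rec.shape)   # torch.Size([1, 1, 4, 4]) torch.Size([1, 1, 7, 7])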

Initialize the convolution kernel using bilinear interpolation

Img -> conv2d(ksize=3, stride=2, pad=1) -> mid_img -> conv2dtranspose(ksize=3, stride=2, pad=1) -> img_rec

Pooling

Pool2d(stride=2), downsampling, increases the receptive field. avg/max?
Max pooling outputs the pooled values and their local coordinates.

Unpooling (Max)
Unpool2d
accepts the pooled values and their local coordinates (indices),
restores only the maximum values to their original positions, and discards (zeros out) the other values

Both pooling and unpooling lose information during size reduction and restoration

Source:
MaxUnpool2d — PyTorch 2.0 documentation
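
A minimal example of this pooling/unpooling pair using PyTorch's MaxPool2d with return_indices and MaxUnpool2d:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 3., 4.],
                    [5., 6., 7., 8.],
                    [9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

values, indices = pool(x)        # pooled values + locations of the maxima
restored = unpool(values, indices)
print(restored)                  # maxima back in place, every other entry is 0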

Fusion method

Fusion methods (fuse):
  flatten concatenation
  element-wise addition
  channel concatenation

Late fusion:
  logits/probability fusion, direct addition

Feature fusion:
  single-stream / multi-stream
  single-hop / dense
  within-block / between-block

Alignment:
  channel alignment (conv2d(ksize=1))
  size alignment (interpolate, conv2d/conv2dtranspose, pool/unpool)

Topdown-heatmap

https://zhuanlan.zhihu.com/p/394060630

Original image -> object detection model -> human body boxes (keep the box center coordinates and the box width/height) -> person-centered image matching the keypoint model's input size -> keypoint model -> downsampled heatmaps, one channel per keypoint, whose maximum gives the estimated coordinates of that keypoint -> using the saved box center and width/height, restore the keypoint coordinates in the original image.

Top-down heatmap is a method based on human body pose estimation, which predicts human body pose by generating heatmaps of human key points.

Specifically, the method first feeds the image into a convolutional neural network for feature extraction, and then generates multiple heatmaps of human key points from the output of the last convolutional layer, where each heatmap corresponds to one key point. The value at each location of a heatmap indicates how likely that location is to be the key point: the larger the value, the more likely. Then, based on these heatmaps, specific algorithms or models can be used to extract the position of each key point and obtain the human pose estimation result.

The advantage of the top-down heatmap method is that it handles occlusion, pose changes, etc. well, and has good accuracy and robustness. However, this method still faces challenges in multi-person pose estimation and keypoint matching. Therefore, in recent years, new methods such as bottom-up heatmaps have gradually received attention.

HRNet paper process analysis

HRNet (High-Resolution Network) is a deep neural network model originally proposed for human pose estimation (keypoint detection) by researchers at Microsoft Research Asia, and later extended to tasks such as semantic segmentation, object detection and image classification. Compared with traditional deep neural networks, HRNet maintains high-resolution feature maps throughout the network while fusing features across multiple parallel branches, thereby improving the accuracy of the model.

The following is the main process analysis of the HRNet paper:

  1. High-resolution feature extraction: HRNet adopts a strategy of high-resolution input and high-resolution feature extraction, that is, the resolution of the input image is kept at a high level, and multiple branch networks are used to extract features of different scales.

  2. Multi-scale feature fusion: HRNet extracts features of different scales through multiple branch networks, and uses an efficient feature fusion method to fuse these features into a high-quality feature representation. Specifically, HRNet performs cascading or addition operations on the features in the branch network to perform feature fusion.

  3. High-resolution feature reconstruction: HRNet restores high-resolution feature maps by upsampling the low-resolution feature maps (e.g., with interpolation or transposed convolution). This allows the model to maintain high-resolution features with a relatively small computational overhead.

  4. Classification or detection: HRNet feeds the fused high-resolution feature map into the task head, e.g. a 1×1 convolution that produces keypoint heatmaps, or a fully connected layer for classification or detection tasks.

HRNet has achieved good performance in multiple image classification and target detection tasks, especially in tasks that need to extract high-resolution features.

HRNet code process analysis

The following is the specific code implementation process of HRNet:

  1. Data preprocessing: Use image processing libraries such as OpenCV to scale and crop the original image to obtain an input image of a specified size, and then convert it to Tensor format. This step can be achieved using PyTorch or TensorFlow's data loader and transformation tools.
import cv2
import numpy as np
import torch

def preprocess(img, input_size):
    img = cv2.resize(img, (input_size[1], input_size[0]))
    img = np.array(img, dtype=np.float32)
    img = img / 255.0
    img = (img - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    img = np.transpose(img, (2, 0, 1))
    img = np.expand_dims(img, axis=0)
    return torch.from_numpy(img)
  2. Building the network: The network structure of HRNet can be implemented using PyTorch. In the implementation process, it is necessary to first define the high-resolution feature extraction module, multi-scale feature fusion module and high-resolution feature reconstruction module, and then combine these modules to construct the HRNet network.
import torch.nn as nn
import torch.nn.functional as F

class HighResolutionModule(nn.Module):
    def __init__(self, num_branches, blocks, num_blocks, num_channels, fuse_method):
        super(HighResolutionModule, self).__init__()
        self.num_branches = num_branches
        self.num_channels = num_channels      # per-branch channel widths, needed by the fuse layers
        self.fuse_method = fuse_method

        self.branches = self._make_branches(num_branches, blocks, num_blocks, num_channels)
        self.fuse_layers = self._make_fuse_layers()
        self.relu = nn.ReLU(inplace=True)

    def _make_one_branch(self, branch_index, block, num_blocks, num_channels):
        layers = []
        # the first branch starts from the 64-channel stem; each later branch starts
        # from the previous branch's output (a simplified cascade, not the full HRNet wiring)
        in_channels = 64 if branch_index == 0 else num_channels[branch_index - 1]
        layers.append(block(in_channels, num_channels[branch_index], stride=2))
        for i in range(1, num_blocks):
            layers.append(block(num_channels[branch_index], num_channels[branch_index], stride=1))
        return nn.Sequential(*layers)

    def _make_branches(self, num_branches, block, num_blocks, num_channels):
        branches = []
        for i in range(num_branches):
            branches.append(self._make_one_branch(i, block, num_blocks, num_channels))
        return nn.ModuleList(branches)

    def _make_fuse_layers(self):
        if self.num_branches == 1:
            return None

        num_branches = self.num_branches
        num_channels = self.num_channels
        num_fuse_layers = num_branches - 1
        fuse_layers = []
        for i in range(num_fuse_layers):
            fuse_layer = []
            for j in range(num_branches):
                if j > i:
                    # lower-resolution branch j: 1x1 conv to match channels (upsampled later in _fuse)
                    fuse_layer.append(nn.Conv2d(num_channels[j], num_channels[i], kernel_size=1, stride=1, padding=0))
                elif j == i:
                    fuse_layer.append(None)
                else:
                    # higher-resolution branch j: downsample with (i - j) stride-2 3x3 convs
                    conv3x3s = []
                    for k in range(i - j):
                        in_channels = num_channels[j] if k == 0 else num_channels[i]
                        out_channels = num_channels[i]
                        conv3x3s.append(nn.Sequential(
                            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
                            nn.BatchNorm2d(out_channels),
                            nn.ReLU(inplace=True)))
                    fuse_layer.append(nn.Sequential(*conv3x3s))
            fuse_layers.append(nn.ModuleList(fuse_layer))
        return nn.ModuleList(fuse_layers)

    def _fuse(self, x):
        if self.num_branches == 1:
            return x

        out = []
        for i in range(len(self.fuse_layers)):
            y = x[0] if i == 0 else self.fuse_layers[i][0](x[0])
            for j in range(1, self.num_branches):
                if i == j:
                    y = y + x[j]
                elif j > i:
                    width_output = x[i].shape[-1]
                    height_output = x[i].shape[-2]
                    y = y + F.interpolate(self.fuse_layers[i][j](x[j]), size=[height_output, width_output],
                                           mode='bilinear', align_corners=True)
                else:
                    y = y + self.fuse_layers[i][j](x[j])
            out.append(self.relu(y))
        return out

    def forward(self, x):
        # x[0] is the stem feature map (64 channels in this simplified sketch)
        if self.num_branches == 1:
            return [self.branches[0](x[0])]

        # simplified cascade: branch i consumes the output of branch i-1 (one extra 2x downsampling)
        out = [self.branches[0](x[0])]
        for i in range(1, self.num_branches):
            out.append(self.branches[i](out[i - 1]))
        # exchange information across resolutions
        return self._fuse(out)
  3. Define the loss function: In HRNet, the mean square error (MSE) between the predicted heatmaps and the ground-truth heatmaps is usually used as the loss function.
import torch.nn as nn

class HeatmapLoss(nn.Module):
    def __init__(self):
        super(HeatmapLoss, self).__init__()

    def forward(self, pred, gt):
        loss = ((pred - gt) ** 2).mean()
        return loss
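The ground-truth heatmap gt is usually generated by placing a 2D Gaussian at each keypoint location. A minimal sketch is shown below; the helper name make_target_heatmaps, the heatmap size and the sigma value are illustrative assumptions rather than part of the original code.

import numpy as np

def make_target_heatmaps(keypoints, heatmap_size=(64, 48), sigma=2):
    # place a 2D Gaussian centered on each (x, y) keypoint given in heatmap coordinates;
    # returns an array of shape (num_keypoints, H, W)
    h, w = heatmap_size
    ys, xs = np.mgrid[0:h, 0:w]
    heatmaps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps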
  3. Train the model: define a training loop and use a PyTorch optimizer together with the loss function to train the HRNet model. In each iteration, the input images and target heatmaps are passed through the network, the loss is computed and backpropagated, and the network parameters are updated.
import torch
import torch.optim as optim

# choose a device and an epoch count for this example;
# train_loader is assumed to be a DataLoader over (image, target-heatmap) pairs
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_epochs = 50

# define HRNet model and optimizer; the model must live on the same device as the inputs
model = HRNet().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0001)

# define loss function
criterion = HeatmapLoss()

# define training loop
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, targets = data
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, targets)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print('Epoch %d loss: %.3f' % (epoch + 1, running_loss / len(train_loader)))
  4. Test the model: evaluate the trained HRNet model on the test set, compute the difference between the predicted keypoints and the ground truth, and report the average error.
# define testing loop
model.eval()  # switch to evaluation mode (disables dropout and batch-norm updates)
test_loss = 0.0
with torch.no_grad():
    for data in test_loader:
        inputs, targets = data
        inputs = inputs.to(device)
        targets = targets.to(device)

        outputs = model(inputs)
        loss = criterion(outputs, targets)

        test_loss += loss.item()

    print('Test loss: %.3f' % (test_loss / len(test_loader)))
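Besides the raw regression loss, keypoint models are commonly evaluated with PCK (Percentage of Correct Keypoints). Below is a minimal sketch, assuming predicted and ground-truth pixel coordinates of shape (N, num_keypoints, 2); the function name and threshold are illustrative.

import numpy as np

def pck(pred, gt, threshold=10.0):
    # fraction of keypoints whose predicted location lies within `threshold`
    # pixels of the ground truth; pred and gt have shape (N, num_keypoints, 2)
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist < threshold).mean())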
  5. Predict with the model: use the trained model to detect keypoints in new images. The image is first preprocessed and fed into the HRNet model to obtain predicted heatmaps; keypoint locations are then extracted by post-processing, for example by taking the peak of each heatmap or applying non-maximum suppression.
import numpy as np
import cv2

# load image
img = cv2.imread('test.jpg')

# preprocess image: preprocess() is assumed to resize and normalize the image
# and return a (1, 3, H, W) float tensor
input_size = (256, 192)
img_tensor = preprocess(img, input_size)

# predict heatmaps
with torch.no_grad():
    output = model(img_tensor.to(device))
    heatmaps = output[-1].cpu().numpy()

# postprocess heatmaps: decode one keypoint per channel by taking the location
# of the strongest response (a simple argmax decode; contour-based extraction could also be used)
heatmaps = heatmaps[0]  # drop the batch dimension -> (num_keypoints, h, w)
keypoints = []
for heatmap in heatmaps:
    heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
    heatmap = cv2.GaussianBlur(heatmap, (3, 3), 0)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(heatmap)
    # a confidence threshold on max_val could be added here to reject weak detections
    keypoints.append(max_loc)  # (x, y) location of the peak response

# visualize keypoints
for keypoint in keypoints:
    cv2.circle(img, keypoint, 3, (0, 255, 0), -1)
cv2.imshow('keypoints', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

This code takes a test image as input, computes the predicted keypoint coordinates, and visualizes the result.

The above is a simple human pose estimation pipeline. Many choices can be tuned, such as the network structure and the preprocessing and postprocessing methods, depending on the specific application scenario.

HRNet main modules

HRNet is a high-resolution network. Compared with traditional deep networks, its main advantage is that it extracts and maintains features at multiple resolutions in parallel, so both high-resolution and low-resolution information can be taken into account. The HRNet model is mainly composed of four modules:

  1. High-Resolution Feature Extraction Module: This module is responsible for extracting high-resolution features from the input image and feeding them into the next layer of the network. It consists of two branches: one branch downsamples the input image to obtain low-resolution features, and the other keeps the input resolution to obtain high-resolution features.

  2. Multi-Resolution Fusion Module: This module fuses features from different resolutions to take into account feature information at different resolutions. Specifically, this module restores low-resolution features to high-resolution through upsampling, and fuses them with high-resolution features.

  3. High-Resolution Feature Reconstruction Module: This module is responsible for reconstructing the fused features back to the original high-resolution features. Specifically, this module upsamples the fused features to the original resolution via a deconvolution operation.

  4. Final Prediction Module: This module uses the reconstructed high-resolution features for final prediction. In human pose estimation tasks, this module usually outputs a heatmap of keypoints to indicate where the keypoints are located in the image.

The above four modules constitute the basic structure of the HRNet model. In practice, these modules can be tuned and modified according to specific tasks to achieve better performance.
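To make the division of labor concrete, here is a toy sketch of how the four modules can be composed in code. This is an illustrative simplification, not the official HRNet implementation; all class and layer names below are made up for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HRNetSketch(nn.Module):
    # toy illustration of the four-module structure (not the official HRNet)
    def __init__(self, num_keypoints=17):
        super().__init__()
        # 1. high-resolution feature extraction: a full-resolution branch and a downsampled branch
        self.high_branch = nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(inplace=True))
        self.low_branch = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(inplace=True))
        # 2. multi-resolution fusion: project low-res features to the high-res channel count
        self.low_to_high = nn.Conv2d(64, 32, 1)
        # 3. high-resolution feature reconstruction: refine the fused high-res features
        self.reconstruct = nn.Sequential(nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU(inplace=True))
        # 4. final prediction: one heatmap channel per keypoint
        self.head = nn.Conv2d(32, num_keypoints, 1)

    def forward(self, x):
        high = self.high_branch(x)                    # full resolution
        low = self.low_branch(x)                      # half resolution
        low_up = F.interpolate(self.low_to_high(low), size=high.shape[-2:],
                               mode='bilinear', align_corners=False)
        fused = high + low_up                         # multi-resolution fusion
        out = self.reconstruct(fused)
        return self.head(out)                         # keypoint heatmaps

heatmaps = HRNetSketch()(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # torch.Size([1, 17, 256, 192])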

Full code analysis of HRNet top-down keypoint detection

The following walks through the code of HRNet keypoint detection with the top-down approach, with a detailed analysis of data preprocessing, model training, and model prediction.

  1. Data preprocessing

First, we need to preprocess the data for training in the HRNet model. Data preprocessing generally includes the following steps:

  • Scale the image so that the long side equals the input size (e.g. 256 or 384) and the short side is scaled proportionally (the simplified dataset class below resizes directly to the fixed input size; an aspect-preserving alternative is sketched after the dataset code);
  • Scale the key point coordinates to correspond to the scaled image;
  • Perform data augmentation on images, such as random rotation, random cropping, random flipping, etc.
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class KeyPointDataset(Dataset):
    def __init__(self, img_path, label_path, input_size=(256, 192), training=False):
        self.input_size = input_size
        self.img_path = img_path
        self.label_path = label_path
        self.training = training  # enables data augmentation in __getitem__
        self.img_list = []
        self.label_list = []
        with open(img_path, 'r') as f:
            for line in f.readlines():
                self.img_list.append(line.strip())
        with open(label_path, 'r') as f:
            for line in f.readlines():
                label = np.array(line.strip().split(' ')).astype(np.float32).reshape(-1, 3)
                self.label_list.append(label)
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.img_list)

    def __getitem__(self, index):
        # load image and label
        img = cv2.imread(self.img_list[index])
        label = self.label_list[index]

        # resize image and label
        h, w, _ = img.shape
        scale_h = self.input_size[0] / h
        scale_w = self.input_size[1] / w
        img = cv2.resize(img, (self.input_size[1], self.input_size[0]))  # cv2.resize expects (width, height)
        label[:, 0] *= scale_w
        label[:, 1] *= scale_h

        # data augmentation (training only): random rotation, scaling and translation
        if self.training:
            new_h, new_w = self.input_size
            angle = np.random.randint(-30, 30)
            scale = np.random.uniform(0.8, 1.2)
            trans_x = np.random.randint(-30, 30)
            trans_y = np.random.randint(-30, 30)
            center = (new_w / 2, new_h / 2)
            M = cv2.getRotationMatrix2D(center, angle, scale)
            M[:, 2] += np.array([trans_x, trans_y])
            img = cv2.warpAffine(img, M, (new_w, new_h))  # use the resized image dimensions
            label[:, :2] = self._affine_transform(label[:, :2], M)

        # normalize image
        img = self.transform(img)

        return img, label

    def _affine_transform(self, pts, M):
        n = pts.shape[0]
        pts_pad = np.concatenate([pts, np.ones((n, 1))], axis=1)
        pts_trans = np.dot(pts_pad, M.T)
        return pts_trans[:, :2]

# create dataset and dataloader
train_dataset = KeyPointDataset('train_img.txt', 'train_label.txt', training=True)
val_dataset = KeyPointDataset('val_img.txt', 'val_label.txt')
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

In the above code, we define a KeyPointDataset class that loads the dataset and performs preprocessing. In the __init__ function, we read the image paths and the labels and store them in img_list and label_list. In the __getitem__ function, we read the image and label at the given index, resize them, and (in training mode) apply data augmentation. Finally, the image is converted to a tensor and normalized.
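For reference, an aspect-preserving (letterbox-style) resize, as mentioned in the preprocessing steps above, could look like the following. resize_keep_aspect is a hypothetical helper, not part of the original code.

import cv2
import numpy as np

def resize_keep_aspect(img, keypoints, input_size=(256, 192)):
    # resize so the image fits inside input_size (h, w) without distortion,
    # pad the remainder with zeros, and scale the keypoint coordinates accordingly
    target_h, target_w = input_size
    h, w = img.shape[:2]
    scale = min(target_h / h, target_w / w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h))
    canvas = np.zeros((target_h, target_w, 3), dtype=img.dtype)
    canvas[:new_h, :new_w] = resized
    keypoints = keypoints.copy()
    keypoints[:, :2] *= scale
    return canvas, keypoints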

  2. Model training

Next, we use the PyTorch framework to train the HRNet model. Before training, we create the model and define the loss function and the optimizer.

import torch.nn as nn
import torch.optim as optim

# create the model, then define the loss function and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = HRNet().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

Then we can start training the model. In each epoch, we train on the training set and evaluate on the validation set. For each batch, we compute the model output and its loss with respect to the labels, and update the model parameters with the optimizer.

def train(model, dataloader, criterion, optimizer, device):
    model.train()
    train_loss = 0.0
    for i, (inputs, labels) in enumerate(dataloader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * inputs.size(0)
    train_loss /= len(dataloader.dataset)
    return train_loss

def evaluate(model, dataloader, criterion, device):
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
    val_loss /= len(dataloader.dataset)
    return val_loss

num_epochs = 50
for epoch in range(num_epochs):
    train_loss = train(model, train_dataloader, criterion, optimizer, device)
    val_loss = evaluate(model, val_dataloader, criterion, device)
    print(f'Epoch {epoch+1:02}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}')
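On top of the optimizer, a learning-rate schedule is often added. This is optional and not part of the original code; for example, a step decay with torch.optim.lr_scheduler.StepLR (the step_size and gamma values are illustrative):

# optional: decay the learning rate by a factor of 10 every 20 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

# call scheduler.step() once per epoch, after the train()/evaluate() calls in the loop above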
  3. Model prediction

After training is complete, we can use the trained model to make predictions. Below is sample code for single-image prediction with a trained HRNet model.

def predict(model, img_path):
    model.eval()
    img = cv2.imread(img_path)
    h, w, _ = img.shape
    scale_h = 256 / h
    scale_w = 192 / w
    img = cv2.resize(img, (192, 256))
    img = img.astype(np.float32) / 255.0
    img = (img - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
    img = np.transpose(img, (2, 0, 1))
    img = np.expand_dims(img, axis=0)
    img = torch.from_numpy(img).float().to(device)  # the normalization above produced float64; the model expects float32
    with torch.no_grad():
        output = model(img)
    output = output.cpu().numpy()[0]
    # the output is treated as per-keypoint (x, y, ...) predictions in the resized
    # image coordinates, matching the label format used during training above
    output[:, 0] /= scale_w
    output[:, 1] /= scale_h
    return output

model_path = 'hrnet.pth'
model = HRNet().to(device)
model.load_state_dict(torch.load(model_path, map_location=device))
output = predict(model, 'test.jpg')

In prediction, we first read a test image and scale it to the network input size. Then we convert the image to a tensor and normalize it. Finally, we run the trained model and inversely scale the output to obtain the final keypoint coordinates.
