A Brief Introduction to Behavior Recognition

Table of contents

1. Background introduction

2. Main direction

3. Common methods

3.1 Traditional method

3.1.1 Space-time interest points

3.1.2 Dense-trajectories

3.2 Methods based on deep learning

3.2.1 3D Convolutional Network (C3D)

3.2.2 Graph Convolutional Network (GCN)

3.2.3 LSTM (Long Short-Term Memory Network)

4. Common data sets

4.1 HMDB-51

4.2 UCF-101

4.3 The Hollywood-2 Dataset


1. Background introduction

       Behavior recognition studies the movement of objects in a video, such as determining whether a person is walking, jumping, or waving. It has important applications in video surveillance, video recommendation, and human-computer interaction. In recent decades, with the rise of neural networks, many methods have been developed to deal with the problem of action recognition. Unlike object recognition, behavior recognition needs to analyze not only the spatial dependencies among targets but also the history of how those targets change over time, which makes the problem harder. Given a series of continuous video frames, the first problem the machine faces is how to segment the sequence according to temporal correlation: a person may first walk, then wave, and then jump, and the machine must recognize that three actions occurred and separate out the video of each time period for individual judgment. The second problem is to separate the target to be analyzed from the rest of the image; for example, if a video contains a person and a dog, the person's behavior must be analyzed while the dog is ignored. The last step is to extract features of the person's behavior over a period of time, train on them, and classify the action.

2. Main direction

1. Action classification: given a trimmed video clip, determine the action category of the clip; this is usually done offline.

2. Action recognition in untrimmed video: a natural video is given without any cropping, so the start time and end time of each action must be located as well as its category.

3. Common methods

3.1 Traditional method

3.1.1 Space-time interest points

        The main idea of extracting spatio-temporal key points is that the key points in a video are usually the data that change most strongly along the spatial and temporal dimensions, and these data reflect important information about the target's motion. For example, if a person is waving a palm, the palm moves the most between consecutive frames and the surrounding image data changes the most, while the rest of the body changes little and its values stay almost the same. If this strongly changing data can be extracted and its location information further analyzed, it can be used to distinguish different actions.

       The extraction of spatio-temporal key points is an extension of the spatial key-point method. The most common way to describe an image f(x, y) at multiple scales is a Gaussian transformation:
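In its usual form, the multi-scale representation is obtained by convolving the image with a Gaussian kernel of variance $\sigma^{2}$:

$$L(x, y; \sigma) = G(x, y; \sigma) * f(x, y), \qquad G(x, y; \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right)$$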

        G is a Gaussian function, and different variances produce images at different resolutions. Harris-style detectors apply the Laplacian to images at different scales to select a characteristic scale, and use the following quantity, computed from the image gradients, to find key points:
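A standard form of the Harris measure, built from the second-moment matrix of the smoothed image gradients $L_x$ and $L_y$, is:

$$\mu = G(\cdot;\sigma_i) * \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}, \qquad H = \det(\mu) - k \,\operatorname{trace}^2(\mu)$$

Key points are taken at local maxima of $H$; the spatio-temporal interest point detector extends this construction with a temporal dimension and a separate temporal scale.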

        The key-point approach is relatively old. It ignores many video details, may lose a lot of information, and generalizes poorly. Moreover, it treats the time dimension and the space dimensions as equivalent, even though temporal and spatial characteristics are different, so the extracted spatio-temporal key points may not reflect the action information well.

3.1.2 Dense-trajectories

       The dense trajectories algorithm, and its improved version (improved Dense Trajectories, iDT for short), extracts dense tracking trajectories from a video and then computes descriptors along those trajectories.

The basic framework of the algorithm includes:

1. Dense sampling feature points

2. Trajectory tracking of feature points

3. Trajectory-based feature extraction

Four types of features are computed:

1. Trajectory features:

       Each trajectory yields a trajectory shape vector S' (when the trajectory length L = 15, S' has 30 dimensions), which encodes the local motion pattern.

2. HOG features:

       HOG computes a histogram of gradients on the grayscale image to describe the static appearance of the video block around the trajectory. The space-time volume is divided into 2×2×3 cells and the histogram uses 8 bins, so the HOG feature length is 2×2×3×8 = 96.

3. HOF features:

       HOF computes a histogram of optical flow. The histogram uses 8 + 1 bins: the first 8 bins are the same as in HOG, and the extra bin counts pixels whose optical-flow magnitude is below a threshold. The HOF feature length is therefore 2×2×3×9 = 108.

4. MBH features:

       MBH computes a histogram of gradients of the optical flow image, which can be understood as the HOG feature computed on the optical flow. Since the optical flow has an X component and a Y component, MBHx and MBHy are computed separately, so the total MBH feature length is 2×96 = 192. Finally, the features are normalized; in the DT algorithm, HOG, HOF and MBH are L2-normalized.
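As a rough illustration of the HOF idea only (not the full DT/iDT pipeline, which computes the histogram in 2×2×3 space-time cells around each tracked point), the sketch below builds a single 9-bin HOF-style histogram for one pair of frames using OpenCV's Farneback optical flow; the function name and the magnitude threshold are placeholders chosen for the example.

```python
import cv2
import numpy as np

def hof_histogram(prev_gray, curr_gray, n_bins=8, min_mag=1.0):
    """Simplified HOF-style descriptor over a whole frame:
    8 orientation bins plus 1 extra bin for near-zero optical flow."""
    # Farneback dense optical flow (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # magnitude and angle (radians)

    hist = np.zeros(n_bins + 1, dtype=np.float64)
    small = mag < min_mag
    hist[n_bins] = np.count_nonzero(small)                   # extra bin: almost static pixels

    bins = (ang[~small] / (2 * np.pi) * n_bins).astype(int) % n_bins
    np.add.at(hist, bins, mag[~small])                       # magnitude-weighted orientation votes
    return hist / (np.linalg.norm(hist) + 1e-12)             # L2-normalize, as in DT

# prev_gray and curr_gray would be consecutive frames converted with
# cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) before calling hof_histogram(prev_gray, curr_gray).
```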

3.2 Methods based on deep learning

3.2.1 3D Convolutional Network (C3D)

       As the name suggests, 3D convolution performs the convolution operation over both the temporal and spatial dimensions, so 3D convolution and pooling can preserve temporal information. 2D convolution operates only on the spatial dimensions, so the temporal information of the frames is lost. 3D convolution and pooling output three-dimensional feature maps, which contain information and features along the time axis. Because 3D convolution is computationally expensive, the depth and width of the network are kept modest: 8 convolutional layers, 5 pooling layers, 2 fully connected layers and a softmax layer are used.

The main points of the C3D network structure are as follows:

       1. All 3D convolution kernels are 3×3×3 (d×k×k, where d is the temporal depth), with stride 1×1×1.

       2. To retain more temporal information in the early layers, the pool1 kernel size is 1×2×2 with stride 1×2×2 (when the temporal depth of the pooling kernel is 1, pooling is performed within each frame separately; when it is greater than 1, pooling is also performed along the time axis, i.e. across multiple frames).

       3. All other 3D pooling layers use 2×2×2 kernels with stride 2×2×2.

       4. Each fully connected layer has 4096 output units.

       3D convolution and pooling are better suited to learning spatio-temporal features: they can model temporal information, whereas 2D convolution can only learn spatial features. When a 2D convolution is applied to an image, or to a video with its frames stacked as channels, the output is a 2D feature map and the temporal information is lost; when a 3D convolution is applied to a video, the output is itself a 3D feature map that retains the temporal information of the input. The C3D network takes the full video as input, does not rely on any preprocessing, and can easily be scaled to large datasets.
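A minimal PyTorch sketch of a C3D-style network following the points above is shown below; the exact channel widths, the 16-frame 112×112 input size, and the padding on the last pooling layer are assumptions borrowed from the common C3D configuration rather than details given in this post.

```python
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    """C3D-style network: 8 conv layers, 5 pooling layers, 2 FC layers plus a class-score layer.
    Input is assumed to be (batch, 3, 16, 112, 112): 16 RGB frames of 112x112."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # pool1: no temporal pooling
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2),
            nn.Conv3d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2),
            nn.Conv3d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2, padding=(0, 1, 1)),  # pool5, spatial padding
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),   # softmax is applied via the cross-entropy loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# quick shape check: one clip of 16 RGB frames at 112x112
logits = C3DSketch()(torch.randn(1, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([1, 101])
```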

3.2.2 Graph Convolutional Network (GCN)

        Graph-based models perform very well on graph-structured data. Graph neural networks come in two broad forms: graphs combined with RNNs, and graphs combined with CNNs (GCNs). GCNs themselves come in two types, spatial GCNs and spectral GCNs; spectral GCNs convert the graph data into a spectral-domain representation. To apply GCNs to action recognition, a graph representation of the video is needed first. A graph can be written as G = (V, E), where V is the set of vertices and E the set of edges. The joint coordinates of the target can be obtained with a pose-estimation method; these joints then serve as the vertices of the graph and are connected by edges in space (within a frame) and in time (across frames). The neighborhood of a node can be represented as follows:
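One common way to write this spatial neighborhood, borrowed from the ST-GCN formulation (so the symbols d and D below follow that convention, for joint $v_{ti}$ meaning joint $i$ in frame $t$), is:

$$B(v_{ti}) = \{\, v_{tj} \mid d(v_{tj}, v_{ti}) \le D \,\}$$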

D is the defined node distance. With this definition, each node and its adjacent nodes are grouped together; once these nodes are identified, the neighborhood behaves like a fixed-size tensor, and a convolution operation can be performed on it. Graph convolution is usually expressed as:
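A commonly used form of this graph convolution (again following ST-GCN; the normalizing term $Z_{ti}$ balances the contribution of each neighbor subset) is:

$$f_{\text{out}}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})}\, X(v_{tj})\, W\big(l(v_{tj})\big)$$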

where l(v_tj) is the label (subset index) assigned to the neighboring node, X is the graph feature representation, W is the weight corresponding to that label, and Z_ti(v_tj) is the normalizing term.

In architectures that add attention, g represents the graph convolution computation and f_att is an attention network. Since H_t contains a large amount of spatial and temporal information, applying an attention network to it helps the model focus on the informative nodes.
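As a concrete illustration of the spatial graph convolution above (ignoring the partition labels and the attention network, and using a single shared weight matrix), here is a minimal PyTorch sketch on skeleton-style input; the joint count, channel sizes and the toy adjacency matrix are made up for the example.

```python
import torch

def spatial_graph_conv(x, A, W):
    """One spatial graph convolution on skeleton features.
    x: (N, C_in, T, V) node features, A: (V, V) adjacency with self-loops,
    W: (C_in, C_out) weight matrix. Returns (N, C_out, T, V)."""
    deg = A.sum(dim=1, keepdim=True)                    # (V, 1) node degrees
    A_norm = A / deg                                    # row-normalized adjacency D^-1 (A + I)
    x_agg = torch.einsum('vw,nctw->nctv', A_norm, x)    # aggregate each node's neighbours
    return torch.einsum('nctv,cd->ndtv', x_agg, W)      # mix channels with the weight matrix

# toy example: 2 videos, 3 input channels (x, y, confidence), 16 frames, 18 joints
V = 18
A = torch.eye(V)           # self-loops; real skeleton edges would be added here
A[0, 1] = A[1, 0] = 1.0    # e.g. connect joints 0 and 1
x = torch.randn(2, 3, 16, V)
W = torch.randn(3, 64)
print(spatial_graph_conv(x, A, W).shape)   # torch.Size([2, 64, 16, 18])
```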

3.2.3 LSTM (Long Short-Term Memory Network)

       The memory of an LSTM network makes it suitable for processing inputs with long-term dependencies, and a video is a sequence of images changing over time, so several papers use LSTMs to extract video features. Srivastava et al. adopted an LSTM encoder-decoder structure: the LSTM encoder extracts features from a sequence of video frames to produce a representation of the video, and the LSTM decoder can be used for video prediction or reconstruction. The encoder input can be any video information that has already undergone feature extraction, such as optical flow or target features produced by a convolutional network; the LSTM then learns the temporal relationships on top of these features.
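A minimal PyTorch sketch of this idea, classifying a video from a sequence of pre-extracted per-frame features, is given below; the feature dimension, hidden size and class count are placeholder values.

```python
import torch
import torch.nn as nn

class FeatureLSTMClassifier(nn.Module):
    """Classify a video from a sequence of per-frame feature vectors
    (e.g. CNN or optical-flow features), assuming input of shape (batch, T, feat_dim)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=51):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, T, feat_dim)
        outputs, (h_n, c_n) = self.lstm(x)
        return self.fc(h_n[-1])            # last hidden state serves as the video descriptor

# example: a batch of 4 videos, 16 frames each, 2048-dim features per frame
logits = FeatureLSTMClassifier()(torch.randn(4, 16, 2048))
print(logits.shape)   # torch.Size([4, 51])
```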

       The output of a traditional RNN node is determined only by its weights, biases, and activation function. An RNN is a chain structure in which every time step uses the same parameters.

       The reason an LSTM can alleviate the long-term dependency problem of RNNs is that it introduces a gate mechanism to control which features are kept and which are discarded. For example, an LSTM can carry a feature observed at time t2 forward to time t9, so that information from much earlier steps can still influence the prediction at t9 (such as deciding between a singular and a plural word in a language model). An LSTM is composed of a series of LSTM units (LSTM Unit) arranged in a chain.

       In the usual diagram of an LSTM cell, each yellow box represents a neural network layer (weights, biases, and an activation function); each pink circle represents an element-wise operation; arrows represent the flow of vectors; merging arrows represent vector concatenation; and branching arrows represent vector copying.
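For reference, the standard LSTM cell equations behind this diagram (with forget gate $f_t$, input gate $i_t$, output gate $o_t$, cell state $C_t$ and hidden state $h_t$) are:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$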

4. Common data sets

4.1 HMDB-51

       HMDB-51 was released by Brown University in 2011. Most of the videos come from movies, with the rest from public databases and online video sites such as YouTube. The database contains 6,849 clips divided into 51 categories, each containing at least 101 clips. The categories mainly include facial actions (smile, laugh, ...), body movements (climb, dive, ...), and so on.

 

4.2 UCF-101

       UCF-101 belongs to a series of datasets released by the University of Central Florida (UCF); it was published in 2012. The samples in this series come from sports footage collected from BBC/ESPN broadcast TV channels as well as from the video website YouTube. UCF-101 contains 13,320 videos, and its categories include applying makeup, playing musical instruments, sports, and so on.

 

4.3 The Hollywood-2 Dataset

       The Hollywood-2 dataset was released by the IRISA research institute in 2009. It contains 12 action categories and 10 scene classes, with a total of 3,669 samples, all extracted from 69 Hollywood movies.

 

 
