qiuzitao Deep Learning: Hands-On PyTorch (16)

The simplest, most practical, and easiest-to-understand hands-on PyTorch tutorial series! (Beginner-friendly; beginners welcome, and consider bookmarking.)

Video Analysis and Action Recognition Based on 3D Convolution

1. The Principle of 3D Convolution

A video is a sequence of image frames stitched together over time, so 3D convolution has one more dimension than 2D convolution: the kernel slides over time as well as over height and width.
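A quick sanity check of the shape bookkeeping (a minimal sketch, not the project's code): a clip tensor has an extra time axis `T`, and a `(3, 3, 3)` kernel with padding 1 preserves all three spatial-temporal dimensions.

```python
import torch
import torch.nn as nn

# 2D convolution sees one image: (batch, channels, H, W).
# 3D convolution sees a clip:    (batch, channels, T, H, W),
# where T is the number of frames, so the kernel also slides over time.
clip = torch.randn(1, 3, 16, 112, 112)  # 1 clip, RGB, 16 frames of 112x112

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=(1, 1, 1))
out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112]) -- all dims preserved by padding
```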


2. The UCF101 Action Recognition Dataset

The UCF101 action recognition dataset can be downloaded from the official website: https://www.crcv.ucf.edu/data/UCF101.php


The dataset contains 101 categories of video; each category shows a person performing one type of action, such as shooting a basketball, applying lipstick, or applying eyeliner. The dataset is about 6.5 GB. I have also uploaded it to a network drive for everyone, or you can download it from the official website.

The 101 action categories in UCF101 are: Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Baseball Pitch, Basketball Shooting, Basketball Dunk, Bench Press, Biking, Billiards Shot, Blow Dry Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing Punching Bag, Boxing Speed Bag, Breaststroke, Brushing Teeth, Clean and Jerk, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting in Kitchen, Diving, Drumming, Fencing, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Golf Swing, Haircut, Hammer Throw, Hammering, Handstand Pushups, Handstand Walking, Head Massage, High Jump, Horse Race, Horse Riding, Hula Hoop, Ice Dancing, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Knitting, Long Jump, Lunges, Military Parade, Mixing Batter, Mopping Floor, Nunchucks, Parallel Bars, Pizza Tossing, Playing Guitar, Playing Piano, Playing Tabla, Playing Violin, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Sitar, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rafting, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spin, Shaving Beard, Shot Put, Skateboarding, Skiing, Skijet, Sky Diving, Soccer Juggling, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Swing, Table Tennis Shot, Tai Chi, Tennis Swing, Throw Discus, Trampoline Jumping, Typing, Uneven Bars, Volleyball Spiking, Walking with Dog, Wall Pushups, Writing on Board, YoYo.


3. Hands-On Project: Video Analysis and Action Recognition Based on 3D Convolution

Code link: https://pan.baidu.com/s/1rEVP8jJB2HGKukfFK2nIGQ
extraction code: agpd

3.1. Test Results and Project Setup

Before running the test, remember to create a data folder in the same directory as the project folder and place the decompressed dataset downloaded above inside it. A model trained for 100 epochs is provided for everyone to test; run inference.py to test it.

For training, set the paths in mypath and create a new data_process folder. First run dataset.py to preprocess the data, then run train.py to train. Because of the size of the dataset, training may take a few days.

These are the results of data preprocessing: each video is preprocessed into individual frame images, and the dataset is split into a training set, a test set, and a validation set.


3.2. Video Data Preprocessing

The dataset class takes two paths: the first is where the raw data is read from (self.root_dir), and the second is where the preprocessed data is saved (self.output_dir).

A resize operation is then applied so that every frame ends up with the same dimensions.

Next, create three folders (train, val, test), then traverse the category names in the 101 category folders (the category name serves as the data label) and split the dataset proportionally into train, val, and test.
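The split step can be sketched like this. The function name, folder layout, and split ratios here are illustrative assumptions, not the project's exact code:

```python
import os
import random
import shutil

def split_dataset(root_dir, output_dir, ratios=(0.64, 0.16, 0.2), seed=0):
    """Split each category folder into train/val/test subfolders.

    Assumed layout (illustrative): root_dir/ApplyEyeMakeup/v_xxx.avi, ...
    Creates output_dir/{train,val,test}/<category>/ and copies videos in.
    """
    random.seed(seed)
    for split in ("train", "val", "test"):
        os.makedirs(os.path.join(output_dir, split), exist_ok=True)
    for category in sorted(os.listdir(root_dir)):  # category name == label
        videos = sorted(os.listdir(os.path.join(root_dir, category)))
        random.shuffle(videos)
        n = len(videos)
        n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
        splits = {"train": videos[:n_train],
                  "val": videos[n_train:n_train + n_val],
                  "test": videos[n_train + n_val:]}
        for split, names in splits.items():
            dst = os.path.join(output_dir, split, category)
            os.makedirs(dst, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(root_dir, category, name), dst)
```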

Then we sample frames from each video. We don't take every frame, since consecutive frames are too similar; instead we take one frame every 4 frames. If that yields fewer than 16 frames in total, we reduce the stride by 1 (one frame every 3 frames); if still fewer than 16, reduce by 1 again, and so on.


The sampled frames are then processed (resized, etc.) and written out as images to the output folder.
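The sampling-and-resize logic above can be sketched as follows. The function names and the 171×128 output size are assumptions for illustration, not the project's exact code; the stride rule follows the "reduce by 1" description above:

```python
def choose_stride(frame_count, min_frames=16, start_stride=4):
    """Pick the largest stride <= start_stride that still yields at least
    min_frames sampled frames (the 'reduce the stride by 1' rule above)."""
    stride = start_stride
    while stride > 1 and frame_count // stride < min_frames:
        stride -= 1
    return stride

def extract_frames(video_path, out_size=(171, 128)):
    """Decode a video with OpenCV, keep every stride-th frame, resize it."""
    import cv2  # imported here so choose_stride works without OpenCV installed
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    stride = choose_stride(total)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            frames.append(cv2.resize(frame, out_size))
        i += 1
    cap.release()
    return frames
```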


3.3. Building Data Batches

First the data is read in, then 16 frames are taken via self.crop(). A preprocessed video may contain more than 16 frames, but we want exactly 16 per sample, so we pick a starting point and take 16 consecutive frames from there, rather than picking 16 frames at random. The crop also selects the spatial region (h and w); here it is 112×112.

The 16 here means reading 16 frames (16 images) at a time as one sample. batch_size = 6 means 6 samples are fed to the network per training step.
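A minimal sketch of the cropping described above (class and variable names are illustrative, not the project's exact code): a random run of 16 consecutive frames, then a random 112×112 spatial window.

```python
import numpy as np
import torch

class ClipCropper:
    """Random temporal crop of 16 consecutive frames plus a random
    112x112 spatial crop, as described in the text above."""
    def __init__(self, clip_len=16, crop_size=112):
        self.clip_len, self.crop_size = clip_len, crop_size

    def __call__(self, frames):  # frames: (T, H, W, C) numpy array
        t = np.random.randint(0, frames.shape[0] - self.clip_len + 1)
        y = np.random.randint(0, frames.shape[1] - self.crop_size + 1)
        x = np.random.randint(0, frames.shape[2] - self.crop_size + 1)
        clip = frames[t:t + self.clip_len,
                      y:y + self.crop_size, x:x + self.crop_size, :]
        # (T, H, W, C) -> (C, T, H, W), the layout nn.Conv3d expects
        return torch.from_numpy(
            np.ascontiguousarray(clip.transpose(3, 0, 1, 2))).float()

frames = np.zeros((20, 128, 171, 3), dtype=np.uint8)  # 20 preprocessed frames
clip = ClipCropper()(frames)
print(clip.shape)  # torch.Size([3, 16, 112, 112])
```

Wrapped in a `torch.utils.data.Dataset`, a `DataLoader(dataset, batch_size=6, shuffle=True)` would then yield batches of shape (6, 3, 16, 112, 112).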


3.4 Modules of the 3D Convolutional Network

The network parameters are initialized as convolution-pooling-convolution-pooling blocks. For 3D, the convolutions become nn.Conv3d with kernel_size = (3, 3, 3): one more dimension than before, so each feature is extracted from 3 consecutive frames. MaxPool3d likewise gains a dimension; the first pooling layer uses kernel_size = (1, 2, 2), where the 1 means the time dimension is not compressed while height and width are halved. The subsequent convolution-pooling blocks are standard and do compress the time dimension as well. The 8192 of the final fully connected layer is the flattened size of the feature map after the last pooling layer; it is mapped to 4096 dimensions and then to the number of classes, with dropout and ReLU activations added along the way, and it's done.
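The architecture described above can be sketched as a standard C3D network. Layer widths here follow the original C3D design; the project's code may differ in details, so treat this as an assumption-laden sketch rather than the project's implementation:

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """A C3D-style 3D convolutional network for 16-frame 112x112 clips."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, (3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),            # keep time, halve H and W
            nn.Conv3d(64, 128, (3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),            # now time is compressed too
            nn.Conv3d(128, 256, (3, 3, 3), padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, (3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(256, 512, (3, 3, 3), padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, (3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.Conv3d(512, 512, (3, 3, 3), padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, (3, 3, 3), padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Linear(8192, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                     # x: (N, 3, 16, 112, 112)
        x = self.features(x)                  # -> (N, 512, 1, 4, 4)
        return self.classifier(x.flatten(1))  # 512 * 1 * 4 * 4 = 8192

model = C3D()
logits = model(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 101])
```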


3.5. Training the Network Model

Network structure changes and the outputs at the start of forward propagation:

The model save operation:
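Model saving typically looks like this (a sketch of a common PyTorch checkpoint pattern; the exact keys in the project's train.py may differ):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    """Save model weights, optimizer state, and epoch number in one file."""
    torch.save({
        "epoch": epoch,
        "state_dict": model.state_dict(),
        "opt_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path):
    """Restore a checkpoint saved by save_checkpoint; returns the epoch."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["opt_dict"])
    return ckpt["epoch"]
```

Saving optimizer state alongside the weights makes it possible to resume a multi-day training run exactly where it left off.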



Origin blog.csdn.net/qiuzitao/article/details/109449586