The simplest, most practical, easiest-to-understand PyTorch tutorial series ever! (Beginner-friendly; newcomers welcome, bookmarking suggested)
Video analysis and action recognition based on 3D convolution
1. How 3D convolution works
A video is a sequence of image frames ordered in time, so a 3D convolution has one extra dimension, time, compared with a 2D convolution.
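The extra time dimension shows up directly in the tensor shapes. A minimal sketch (the layer sizes here are illustrative, not the project's actual network, which is discussed later):

```python
import torch
import torch.nn as nn

# A 2D convolution operates on (batch, channels, height, width);
# a 3D convolution adds a time (frame) dimension:
# (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)  # one 16-frame RGB clip

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=(1, 1, 1))
out = conv3d(clip)
print(out.shape)  # torch.Size([1, 64, 16, 112, 112])
```

With padding 1 on every axis, a 3×3×3 kernel preserves the frame count and spatial size, so only the channel count changes.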
2. Introduction to UCF 101 Action Recognition Data Set
The UCF101 action recognition dataset can be downloaded from the official website: https://www.crcv.ucf.edu/data/UCF101.php
The dataset contains videos of 101 action categories; each video shows a person performing one type of action, such as shooting a basketball, applying lipstick, or applying eyeliner. The dataset is about 6.5 GB; I have also uploaded it to a network drive for everyone, or you can download it from the official website.
The action categories of the UCF101 dataset are: apply eye makeup, apply lipstick, archery, baby crawling, balance beam, band marching, baseball pitch, basketball shooting, basketball dunk, bench press, biking, billiards, blow-dry hair, blowing candles, body-weight squats, bowling, boxing punching bag, boxing speed bag, breaststroke, brushing teeth, clean and jerk, cliff diving, cricket bowling, cricket shot, cutting in the kitchen, diving, drumming, fencing, field hockey penalty, floor gymnastics, frisbee catch, front crawl, golf swing, haircut, hammer throw, hammering, handstand push-ups, handstand walking, head massage, high jump, horse race, horse riding, hula hoop, ice dancing, javelin throw, juggling balls, jump rope, jumping jacks, kayaking, knitting, long jump, lunges, military parade, mixing batter, mopping the floor, nunchucks, parallel bars, pizza tossing, playing guitar, playing piano, playing tabla, playing violin, playing cello, playing daf, playing dhol, playing flute, playing sitar, pole vault, pommel horse, pull-ups, punching, push-ups, rafting, indoor rock climbing, rope climbing, rowing, salsa spin, shaving beard, shot put, skateboarding, skiing, jet ski, skydiving, soccer juggling, soccer penalty, still rings, sumo wrestling, surfing, swing, table tennis shot, tai chi, tennis swing, discus throw, trampoline jumping, typing, uneven bars, volleyball spiking, walking with a dog, wall push-ups, writing on a board, yo-yo.
3. "Video Analysis and Action Recognition Based on 3D Convolution": hands-on project
Code link: https://pan.baidu.com/s/1rEVP8jJB2HGKukfFK2nIGQ
extraction code: agpd
3.1. Test results and project configuration
Before running the test, create a data folder in the same directory as the project folder and place the decompressed dataset downloaded above inside it. A model trained for 100 epochs is provided so everyone can test; run inference.py to run the test.
For training, set the paths in mypath and create a new data_process folder. First run dataset.py to preprocess the data, then use train.py to train; because the dataset is large, training may take several days.
These are the results of data preprocessing: each video is preprocessed into images, and the dataset is split into training, validation, and test sets.
3.2. Video data preprocessing method
The dataset class takes two paths: the first is where the raw data is read from (self.root_dir), and the second is where the preprocessed data is saved (self.output_dir).
A resize operation is then required so that every frame has the same dimensions.
Next, create three folders, then traverse the names of the 101 category folders (these names are the data labels) and split the dataset proportionally into train, val, and test.
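The proportional split can be sketched as follows. The function name and the ratios below are illustrative assumptions; check dataset.py for the values the project actually uses:

```python
import random

def split_files(file_names, train_ratio=0.64, val_ratio=0.16, seed=0):
    """Shuffle one category's video files and split them into
    train/val/test lists by the given (assumed) ratios."""
    files = list(file_names)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_ratio)
    n_val = int(len(files) * val_ratio)
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test

train, val, test = split_files([f"v_{i:03d}.avi" for i in range(100)])
print(len(train), len(val), len(test))  # 64 16 20
```

The same split is applied per category, so all three subsets contain examples of every action.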
Then we extract the frames. We do not take every frame, because consecutive frames are too similar; instead we take one frame every 4 frames. If that yields fewer than 16 frames in total, we decrement the stride by 1 and take one frame every 3 frames; if the result is still fewer than 16, we decrement the stride again, and so on.
Then process the extracted frames (resize them, etc.) and write them into the output folder.
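The stride-decrement logic described above can be sketched as follows (the function name and defaults are my own, for illustration):

```python
def choose_stride(total_frames, need=16, start_stride=4):
    """Start by taking every 4th frame; if that yields fewer than
    `need` frames, decrease the stride by 1 and retry, down to 1."""
    stride = start_stride
    while stride > 1 and total_frames // stride < need:
        stride -= 1
    return stride

print(choose_stride(300))  # 4  -> 300 // 4 = 75 frames, plenty
print(choose_stride(50))   # 3  -> 50 // 4 = 12 < 16, but 50 // 3 = 16
print(choose_stride(20))   # 1  -> even stride 2 gives only 10 frames
```

Very short videos may still produce fewer than 16 frames even at stride 1; the preprocessing step has to either skip or pad such clips.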
3.3. Building data batches
First, read in the data.
Then take 16 frames with self.crop(): a video clip may contain more than 16 frames, but we want exactly 16 per sample, so we choose a starting position and take 16 consecutive frames from there, rather than picking 16 frames at random. The crop also selects the height and width of the spatial crop region, here 112×112.
The 16 here means each sample consists of 16 frames (16 images) read at a time. batch_size = 6 means 6 samples are fed to the network per training step.
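A sketch of this temporal-plus-spatial crop with NumPy (the names and exact indexing are illustrative; the project's self.crop() may differ in detail):

```python
import numpy as np

def random_crop(frames, clip_len=16, crop_size=112, rng=np.random):
    """frames: array of shape (num_frames, H, W, C).
    Pick a random temporal start and take `clip_len` consecutive
    frames, then a random crop_size x crop_size spatial window."""
    t0 = rng.randint(0, frames.shape[0] - clip_len + 1)
    h0 = rng.randint(0, frames.shape[1] - crop_size + 1)
    w0 = rng.randint(0, frames.shape[2] - crop_size + 1)
    return frames[t0:t0 + clip_len,
                  h0:h0 + crop_size,
                  w0:w0 + crop_size, :]

video = np.zeros((40, 128, 171, 3), dtype=np.uint8)  # preprocessed frames (sizes assumed)
clip = random_crop(video)
print(clip.shape)  # (16, 112, 112, 3)
```

Taking 16 consecutive frames preserves motion continuity, which is what the 3D kernels rely on; the random start position still gives some temporal augmentation.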
3.4. Modules involved in the 3D convolutional network
Initialize the network layers: the familiar convolution-pooling-convolution-pooling pattern, but now in 3D. The convolutions become nn.Conv3d with kernel_size = (3, 3, 3), one dimension more than before, so each feature is extracted across 3 frames. The pooling becomes MaxPool3d; the first pooling layer uses kernel_size = (1, 2, 2), where the 1 leaves the time dimension uncompressed while height and width are halved. The subsequent pooling layers are the usual kind and compress the time dimension as well. The 8192 of the final fully connected layer is the flattened size of the feature map after the last pooling; it is mapped to 4096 dimensions and then to the number of classes, with dropout and ReLU activations added along the way, and the network is done.
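The layer stack described above matches the standard C3D architecture; here is a sketch (shapes assume 16-frame 112×112 input clips, which is exactly what makes the flattened size come out to 8192):

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """C3D-style network: all convolutions use kernel_size (3,3,3);
    the first pooling keeps the time dimension (kernel (1,2,2)),
    later poolings halve it as well."""
    def __init__(self, num_classes=101):
        super().__init__()
        def conv(cin, cout):
            return nn.Conv3d(cin, cout, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        self.features = nn.Sequential(
            conv(3, 64), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # time kept
            conv(64, 128), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)),
            conv(128, 256), nn.ReLU(), conv(256, 256), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)),
            conv(256, 512), nn.ReLU(), conv(512, 512), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)),
            conv(512, 512), nn.ReLU(), conv(512, 512), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1)),
        )
        self.classifier = nn.Sequential(
            nn.Linear(8192, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):          # x: (batch, 3, 16, 112, 112)
        x = self.features(x)       # -> (batch, 512, 1, 4, 4), i.e. 8192 values
        return self.classifier(x.flatten(1))

model = C3D(num_classes=101)
out = model(torch.randn(2, 3, 16, 112, 112))
print(out.shape)  # torch.Size([2, 101])
```

Tracing the shapes: after the five pooling stages the feature map is (512, 1, 4, 4), and 512 × 1 × 4 × 4 = 8192, which is where the fully connected layer's input size comes from.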
3.5. Training the network model
The network structure and the output shapes at the start of forward propagation
Model save operation
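A typical save-and-restore pattern for this kind of training loop (the actual keys and file name used in train.py may differ; this is a generic sketch with a stand-in model):

```python
import torch
import torch.nn as nn

model = nn.Conv3d(3, 64, kernel_size=3)  # stand-in for the C3D model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Save model and optimizer state together so training can resume later.
torch.save({
    "epoch": 100,
    "state_dict": model.state_dict(),
    "opt_dict": optimizer.state_dict(),
}, "checkpoint.pth.tar")

# Restore.
ckpt = torch.load("checkpoint.pth.tar")
model.load_state_dict(ckpt["state_dict"])
optimizer.load_state_dict(ckpt["opt_dict"])
print(ckpt["epoch"])  # 100
```

Saving the optimizer state alongside the weights is what makes it possible to continue training from epoch 100 rather than restarting the learning-rate schedule and momentum from scratch.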