【Video classification】Reproducing training_extensions/action_recognition

reference:

https://github.com/openvinotoolkit/training_extensions/tree/develop/pytorch_toolkit/action_recognition

0. Environment

ubuntu16.04
python3.6

# Install with pip
torch==1.1.0
torchvision==0.3.0
numpy>=1.15.2
onnx>=1.3.0
opencv-python>=3.4.3.18
pandas>=0.23.4
tensorboardX>=1.4
tqdm>=4.26.0
pretrainedmodels>=0.7.4
networkx==2.3


# Install with apt
sudo apt-get install ffmpeg

 

1. Data preparation

(1) Download data

Official website: https://www.crcv.ucf.edu/data/UCF101.php . We need two downloads from it: the first is the AVI video data, the second is the official split into training and test sets.

UCF101 video classification data set: http://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar

Create a new data directory and extract the UCF101.rar file into it:

apt-get install unrar
unrar x UCF101.rar

 

Download the second item, the train/test split lists: https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-DetectionTask.zip . Extract it wherever you like; I put it in the data directory.

Reference: https://blog.csdn.net/qq_41185868/article/details/108474259

In the utils/ucf101_json.py file, replace every occurrence of:

data.ix

with:

data.iloc
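
The .ix indexer was removed in newer pandas releases, which is why this replacement is needed. A minimal illustration of the change (the exact lines in ucf101_json.py may differ, and reading trainlist01.txt here is just an example based on the split archive):

import pandas as pd

# The split lists are space-separated: video path plus, for the train lists, a class index.
data = pd.read_csv('./data/ucf-101/ucfTrainTestlist/trainlist01.txt',
                   delimiter=' ', header=None)

# Old (removed in newer pandas):  data.ix[0, 0]
# New positional equivalent:
first_video = data.iloc[0, 0]
print(first_video)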

(2) Generate json file

Run the script to generate the json annotation files:

python utils/ucf101_json.py ./data/ucf-101/ucfTrainTestlist

The generated json files are written to the ./data/ucf-101/ucfTrainTestlist directory.
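
As a quick sanity check of the result (assuming the json follows the 3D-ResNets-PyTorch layout with a "labels" list and a "database" dict, which is the convention this converter is derived from):

import json

with open('./data/ucf-101/ucfTrainTestlist/ucf101_01.json') as f:
    anno = json.load(f)

print('classes:', len(anno['labels']))                       # expect 101
subsets = [v['subset'] for v in anno['database'].values()]
print('training clips:  ', subsets.count('training'))
print('validation clips:', subsets.count('validation'))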

(3) Convert video to jpg

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_01.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_02.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6

CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_03.json \
    --raw_dir ./data/ucf-101/UCF-101/ \
    --destination_dir ./data/data/UCF101_jpg \
    --video-size 480 \
    --video-format frames \
    --video-quality 1 \
    --threads 6
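
The three invocations above differ only in the split number; if you prefer, the same flags can be driven from one small Python loop (a convenience sketch that mirrors the commands above):

import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES='1')   # same GPU pinning as above
for split in ('01', '02', '03'):
    subprocess.run([
        'python', 'utils/preprocess_videos.py',
        '--annotation_file', './data/ucf-101/ucfTrainTestlist/ucf101_{}.json'.format(split),
        '--raw_dir', './data/ucf-101/UCF-101/',
        '--destination_dir', './data/data/UCF101_jpg',
        '--video-size', '480',
        '--video-format', 'frames',
        '--video-quality', '1',
        '--threads', '6',
    ], env=env, check=True)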

(4) Generate n_frames

Reference: https://github.com/kenshohara/3D-ResNets-PyTorch/tree/CVPR2018

Create a script file for generating n_frames and place it in the utils directory:

# -*- coding: UTF-8 -*-
'''
@author: mengting gu
@contact: [email protected]
@time: 2021/1/7 15:47
@file: n_frames_ucf101_hmdb51.py
@desc: 
'''

from __future__ import print_function, division
import os
import sys


def class_process(dir_path, class_name):
    # Write an n_frames file into every video directory of one class.
    class_path = os.path.join(dir_path, class_name)
    if not os.path.isdir(class_path):
        return

    for file_name in os.listdir(class_path):
        video_dir_path = os.path.join(class_path, file_name)
        if not os.path.isdir(video_dir_path):
            continue
        # Frame files are named like image_00001.jpg; characters [6:11] are the
        # zero-padded frame index.
        image_indices = []
        for image_file_name in os.listdir(video_dir_path):
            if 'image' not in image_file_name:
                continue
            image_indices.append(int(image_file_name[6:11]))

        if len(image_indices) == 0:
            print('no image files', video_dir_path)
            n_frames = 0
        else:
            # The largest frame index equals the number of extracted frames.
            image_indices.sort(reverse=True)
            n_frames = image_indices[0]
            print(video_dir_path, n_frames)
        with open(os.path.join(video_dir_path, 'n_frames'), 'w') as dst_file:
            dst_file.write(str(n_frames))


if __name__ == "__main__":
    dir_path = sys.argv[1]
    for class_name in os.listdir(dir_path):
        class_process(dir_path, class_name)

Run the command:

python utils/n_frames_ucf101_hmdb51.py ./data/data/UCF101_jpg
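
To confirm that every extracted video directory received an n_frames file, a quick check can be run (a small sketch, assuming the UCF101_jpg layout of class folders that each contain one directory per video):

import os

root = './data/data/UCF101_jpg'
missing = []
for class_name in os.listdir(root):
    class_path = os.path.join(root, class_name)
    if not os.path.isdir(class_path):
        continue
    for video_name in os.listdir(class_path):
        video_path = os.path.join(class_path, video_name)
        if os.path.isdir(video_path) and not os.path.isfile(os.path.join(video_path, 'n_frames')):
            missing.append(video_path)

print('video directories without n_frames:', len(missing))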

 

(5) File structure

After preparing the data, the file structure:

.../
    data/ (root dir)
        data/
            UCF101_jpg/  (jpg files)
        ucf-101/
            UCF-101/ (video files)
            ucfTrainTestlist/ (annotation path)
                classInd.txt
                ucf101_01.json

2. Training and evaluation

Just as image classification tasks usually start from ImageNet pre-trained weights, video classification models are mainly initialized from models trained on Kinetics.

(1) Prepare the pre-trained model

Here we take ResNet34-VTN as an example: download the corresponding pre-trained checkpoint and put it under models/.

 

(2) File structure

.../
    data/ (root dir)
        ...
        models/
            resnet_34_vtn_rgbd_kinetics.pth
            se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth

(3) Training

Pre-trained model: resnet_34_vtn_rgbd_kinetics.pth 

CUDA_VISIBLE_DEVICES="0" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model resnet34_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --pretrain-path ./models/resnet_34_vtn_rgbd_kinetics.pth --video-path ./UCF101_jpg

 

(4) Evaluation

Pre-trained model used for evaluation: se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth

CUDA_VISIBLE_DEVICES="2" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model se-resnext101-32x4d_vtn_rgbdiff -b128 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --no-train --no-val --test --pretrain-path ./models/se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth --video-path ./UCF101_jpg

(5) Parameters

The choice of model and parameters is based on the model table in the repository README:

 

--dataset: ucf101 or kinetics;

--model: determined by the Model and Input columns; taking the best-performing 93.44% model as an example, it is se-resnext101-32x4d_vtn_rgbdiff;

-b: batch size;

--lr: learning rate;

--pretrain-path: the checkpoint downloaded from the Checkpoint column;

--video-path: if you do not use the default "./data/data/utf-101/jpeg2", set it yourself.

If --model and --pretrain-path do not match each other, errors will occur.

Training se_resnext_101_32x4d_vtn_rgbd with b=2 requires about 7-8 GB of GPU memory.
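
If it is unclear which architecture a given checkpoint belongs to, inspecting its state_dict before training is a cheap check (a sketch; the checkpoints here store weights under a 'state_dict' key, as the inference code in section 3 also assumes):

import torch

checkpoint = torch.load('./models/resnet_34_vtn_rgbd_kinetics.pth', map_location='cpu')
state_dict = checkpoint.get('state_dict', checkpoint)

# Layer-name prefixes (e.g. resnet vs. se_resnext blocks) hint at which encoder
# the checkpoint was trained with.
for key in list(state_dict.keys())[:10]:
    print(key, tuple(state_dict[key].shape))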

3. Single-frame image inference

import sys
import time
from argparse import ArgumentParser
from collections import deque
from copy import deepcopy

import cv2
import numpy as np
import torch
import torch.nn.functional as F

from action_recognition.model import create_model
from action_recognition.options import add_input_args
from action_recognition.spatial_transforms import (CenterCrop, Compose,
                                                   Normalize, Scale, ToTensor, MEAN_STATISTICS, STD_STATISTICS)
from action_recognition.utils import load_state, generate_args
import os

TEXT_COLOR = (255, 255, 255)
TEXT_FONT_FACE = cv2.FONT_HERSHEY_DUPLEX
TEXT_FONT_SIZE = 1
TEXT_VERTICAL_INTERVAL = 45
NUM_LABELS_TO_DISPLAY = 2


class TorchActionRecognition:
    def __init__(self, encoder, checkpoint_path, num_classes=400, **kwargs):
        # model_type = "{}_vtn".format(encoder)
        model_type = "{}".format(encoder)
        args, _ = generate_args(model=model_type, n_classes=num_classes, layer_norm=False, **kwargs)
        self.args = args
        self.model, _ = create_model(args, model_type)

        self.model = self.model.module
        self.model.eval()
        self.model.cuda()

        checkpoint = torch.load(str(checkpoint_path))
        load_state(self.model, checkpoint['state_dict'])

        self.preprocessing = make_preprocessing(args)
        self.embeds = deque(maxlen=(args.sample_duration * args.temporal_stride))

    def preprocess_frame(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        return self.preprocessing(frame)

    def infer_frame(self, frame):
        embedding = self._infer_embed(self.preprocess_frame(frame))
        self.embeds.append(embedding)
        sequence = self.get_seq()
        return self._infer_logits(sequence)

    def _infer_embed(self, frame):
        with torch.no_grad():
            frame_tensor = frame.unsqueeze(0).to('cuda')
            tensor = self.model.resnet(frame_tensor)
            tensor = self.model.reduce_conv(tensor)
            embed = F.avg_pool2d(tensor, 7)
        return embed.squeeze(-1).squeeze(-1)

    def _infer_logits(self, embeddings):
        with torch.no_grad():
            ys = self.model.self_attention_decoder(embeddings)
            ys = self.model.fc(ys)
            ys = ys.mean(1)
        return ys.cpu()

    def _infer_seq(self, frame):
        with torch.no_grad():
            result = self.model(frame.view(1, self.args.sample_duration, 3,
                                           self.args.sample_size, self.args.sample_size).to('cuda'))
        return result.cpu()
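
    # Note on get_seq() below: it assembles the per-frame embeddings buffered in
    # self.embeds into one clip by keeping every temporal_stride-th embedding and,
    # if fewer than sample_duration embeddings have been seen yet, repeating the
    # sequence to pad it to the expected length.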

    def get_seq(self):
        sequence = torch.stack(tuple(self.embeds), 1)
        if self.args.temporal_stride > 1:
            sequence = sequence[:, ::self.args.temporal_stride, :]

        n = self.args.sample_duration
        if sequence.size(1) < n:
            num_repeats = (n - 1) // sequence.size(1) + 1
            sequence = sequence.repeat(1, num_repeats, 1)[:, :n, :]

        return sequence


def make_preprocessing(args):
    return Compose([
        Scale(args.sample_size),
        CenterCrop(args.sample_size),
        ToTensor(args.norm_value),
        Normalize(MEAN_STATISTICS[args.mean_dataset], STD_STATISTICS[args.mean_dataset])
    ])


def draw_rect(image, bottom_left, top_right, color=(0, 0, 0), alpha=1.):
    xmin, ymin = bottom_left
    xmax, ymax = top_right

    image[ymin:ymax, xmin:xmax, :] = image[ymin:ymax, xmin:xmax, :] * (1 - alpha) + np.asarray(color) * alpha
    return image


def render_frame(frame, probs, labels):
    order = probs.argsort(descending=True)

    status_bar_coordinates = (
        (0, 0),  # top left
        (650, 25 + TEXT_VERTICAL_INTERVAL * NUM_LABELS_TO_DISPLAY)  # bottom right
    )

    draw_rect(frame, status_bar_coordinates[0], status_bar_coordinates[1], alpha=0.5)

    for i, imax in enumerate(order[:NUM_LABELS_TO_DISPLAY]):
        text = '{} - {:.1f}%'.format(labels[imax], probs[imax] * 100)
        text = text.upper().replace("_", " ")
        # cv2.putText expects the font face before the font scale.
        cv2.putText(frame, text, (15, TEXT_VERTICAL_INTERVAL * (i + 1)), TEXT_FONT_FACE,
                    TEXT_FONT_SIZE, TEXT_COLOR)

    return frame


def run_demo(model, labels, input_path, save_path):

    fps = 30
    tick = time.time()

    for file in sorted(os.listdir(input_path)):
        frame = cv2.imread(os.path.join(input_path, file))
        if frame is None:
            print("图像为空:"+str(frame))
            break
        print("Now processing file : {}".format(file))
    # while video_cap.isOpened():
    #     ok, frame = video_cap.read()
    #
    #     if not ok:
    #         break

        logits = model.infer_frame(frame)
        probs = F.softmax(logits[0], dim=0)
        frame = render_frame(frame, probs, labels)

        tock = time.time()
        expected_time = tick + 1 / fps
        if tock < expected_time:
            delay = max(1, int((expected_time - tock) * 1000))
        tick = tock

        cv2.imwrite(os.path.join(save_path, file), frame)
        # cv2.imshow("demo", frame)
        # key = cv2.waitKey(delay)
        # if key == 27 or key == ord('q'):
        #     break


def main():
    parser = ArgumentParser()
    parser.add_argument("--encoder", help="What encoder to use ", default='resnet34')
    parser.add_argument("--checkpoint", help="Path to pretrained model (.pth) file", required=True)
    parser.add_argument("--input-video", type=str, help="Path to input img or video", required=True)
    parser.add_argument("--save-path", type=str, help="Path to save img", required=True)
    parser.add_argument("--labels", help="Path to labels file (new-line separated file with label names)", type=str,
                        required=True)
    add_input_args(parser)
    args = parser.parse_args()

    with open(args.labels) as fd:
        labels = fd.read().strip().split('\n')

    extra_args = deepcopy(vars(args))
    input_data_params = set(x.dest for x in parser._action_groups[-1]._group_actions)
    for name in list(extra_args.keys()):
        if name not in input_data_params:
            del extra_args[name]

    input_path = args.input_video
    save_path = args.save_path
    try:
        model = TorchActionRecognition(args.encoder, args.checkpoint, num_classes=len(labels), **extra_args)
        # cap = cv2.VideoCapture(args.input_video)
        run_demo(model, labels, input_path, save_path)
    except Exception as error:
        print("an error occur : "+error)

if __name__ == '__main__':
    sys.exit(main())

With the input images and model prepared, run the following command:

CUDA_VISIBLE_DEVICES="0" python vtn_jpg_demo.py --encoder  resnet34_vtn --checkpoint ./data/models/resnet_34_vtn_rgb_ucf101_s1.pth --input-video ./data/ourdata/input --save-path ./data/ourdata/output --labels ./data/ucf-101/ucfTrainTestlist/classInd.txt

The rendered result frames, with the top predicted labels overlaid, are written to the --save-path directory:
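
One note on the --labels file: classInd.txt from the UCF101 split archive normally has lines of the form "1 ApplyEyeMakeup". The demo only needs the names, so if you want the on-frame overlay to show clean class names, the indices can be stripped first (a small sketch; labels.txt is simply a name I chose):

# Write a labels file containing only the class names from classInd.txt.
with open('./data/ucf-101/ucfTrainTestlist/classInd.txt') as src, \
        open('./data/ucf-101/ucfTrainTestlist/labels.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if line:
            # Keep only the class name after the numeric index.
            dst.write(line.split(' ', 1)[1] + '\n')

Then point --labels at labels.txt instead of classInd.txt.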

 

 

4. Notes

1) Various model-related errors are most likely caused by the --model argument not matching the checkpoint you load;

2) If the data is clearly placed under the corresponding data directory but the call still errors, it is because the code resolves the frame path as data/data + (video_path);

3) If the n_frames files are missing, see step 1.(4) above;

4) When evaluating the pre-trained models, likely sources of error are:

      during data preparation, the default -q=4 of utils/preprocess_videos.py was used, which saves frames at a lower image quality than the --video-quality 1 used above;

      at inference time, the parameters are wrong, either too many or too few, e.g. --no-mean-norm --no-std-norm.

Note: these pitfalls cost a lot of time, and there are some I no longer remember.

 


Original article: https://blog.csdn.net/qq_35975447/article/details/112261368