reference:
0. Environment
ubuntu16.04
python3.6
# pip install
torch==1.1.0
torchvision==0.3.0
numpy>=1.15.2
onnx>=1.3.0
opencv-python>=3.4.3.18
pandas>=0.23.4
tensorboardX>=1.4
tqdm>=4.26.0
pretrainedmodels>=0.7.4
networkx==2.3
# apt install
sudo apt-get install ffmpeg
1. Data preparation
(1) Download data
Official website: https://www.crcv.ucf.edu/data/UCF101.php . We need two downloads from it: the videos in avi format, and the lists that split them into a training set and a test set.
UCF101 video classification data set: http://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar
Create a new data directory and extract the UCF101.rar file into it:
apt-get install unrar
unrar x UCF101.rar
Download the train/test split lists: https://www.crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-DetectionTask.zip . Put the archive wherever you like; I put it in the data directory.
Reference: https://blog.csdn.net/qq_41185868/article/details/108474259
pandas has deprecated the .ix indexer (and removed it in 1.0), so in the utils/ucf101_json.py file replace every occurrence of:
data.ix
with:
data.iloc
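The replacement above can also be scripted. The following is only a minimal sketch: the helper names replace_ix_indexer and patch_file are hypothetical, and it assumes the script indexes rows and columns by position, so that .iloc is a drop-in substitute as the step above suggests.

```python
import re


def replace_ix_indexer(source):
    """Return source with every `.ix[` indexer rewritten as `.iloc[`."""
    # The leading dot and the \b make sure only an attribute named exactly
    # "ix" that is followed by "[" is rewritten.
    return re.sub(r"\.ix\b(?=\s*\[)", ".iloc", source)


def patch_file(path):
    """Rewrite the file at `path` in place (hypothetical helper)."""
    with open(path) as src:
        patched = replace_ix_indexer(src.read())
    with open(path, "w") as dst:
        dst.write(patched)


# Usage, from the project root:
#     patch_file("utils/ucf101_json.py")
```

Keeping a backup of the original file before patching is a reasonable precaution.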
(2) Generate json file
Run the following command to generate the json files:
python utils/ucf101_json.py ./data/ucf-101/ucfTrainTestlist
The generated data is in the ./data/ucf-101/ucfTrainTestlist directory.
(3) Convert video to jpg
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_01.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_02.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
CUDA_VISIBLE_DEVICES="1" python utils/preprocess_videos.py --annotation_file ./data/ucf-101/ucfTrainTestlist/ucf101_03.json \
--raw_dir ./data/ucf-101/UCF-101/ \
--destination_dir ./data/data/UCF101_jpg \
--video-size 480 \
--video-format frames \
--video-quality 1 \
--threads 6
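The three commands above differ only in the split number of the annotation file, so they can be driven from a short loop. This is a sketch, not part of the project: build_preprocess_command and run_all_splits are hypothetical helper names, the flags are copied from the commands above, and CUDA_VISIBLE_DEVICES is left to the calling shell.

```python
import subprocess


def build_preprocess_command(split):
    """Build the argument list for one split index (1, 2 or 3)."""
    annotation = "./data/ucf-101/ucfTrainTestlist/ucf101_{:02d}.json".format(split)
    return [
        "python", "utils/preprocess_videos.py",
        "--annotation_file", annotation,
        "--raw_dir", "./data/ucf-101/UCF-101/",
        "--destination_dir", "./data/data/UCF101_jpg",
        "--video-size", "480",
        "--video-format", "frames",
        "--video-quality", "1",
        "--threads", "6",
    ]


def run_all_splits():
    """Run the conversion for the three splits, stopping at the first failure."""
    for split in (1, 2, 3):
        subprocess.run(build_preprocess_command(split), check=True)


# Usage, from the project root:
#     run_all_splits()
```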
(4) Generate n_frames
Reference: https://github.com/kenshohara/3D-ResNets-PyTorch/tree/CVPR2018
Create a script file for generating n_frames and place it in the utils directory:
# -*- coding: UTF-8 -*-
'''
@author: mengting gu
@contact: [email protected]
@time: 2021/1/7 15:47
@file: n_frames_ucf101_hmdb51.py
@desc:
'''
from __future__ import print_function, division
import os
import sys
import subprocess


def class_process(dir_path, class_name):
    class_path = os.path.join(dir_path, class_name)
    if not os.path.isdir(class_path):
        return

    for file_name in os.listdir(class_path):
        video_dir_path = os.path.join(class_path, file_name)
        image_indices = []
        for image_file_name in os.listdir(video_dir_path):
            if 'image' not in image_file_name:
                continue
            image_indices.append(int(image_file_name[6:11]))

        if len(image_indices) == 0:
            print('no image files', video_dir_path)
            n_frames = 0
        else:
            image_indices.sort(reverse=True)
            n_frames = image_indices[0]
            print(video_dir_path, n_frames)
        with open(os.path.join(video_dir_path, 'n_frames'), 'w') as dst_file:
            dst_file.write(str(n_frames))


if __name__=="__main__":
    dir_path = sys.argv[1]
    for class_name in os.listdir(dir_path):
        class_process(dir_path, class_name)
Run the command:
python utils/n_frames_ucf101_hmdb51.py ./data/data/UCF101_jpg
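After the command finishes, it can be worth confirming that every video directory really received an n_frames file, since a missing one causes errors later. The sketch below assumes the class/video directory layout that the script above walks; find_missing_n_frames is a hypothetical helper name.

```python
import os


def find_missing_n_frames(jpg_root):
    """Return the video directories under jpg_root that lack an n_frames file."""
    missing = []
    for class_name in sorted(os.listdir(jpg_root)):
        class_path = os.path.join(jpg_root, class_name)
        if not os.path.isdir(class_path):
            continue
        for video_name in sorted(os.listdir(class_path)):
            video_path = os.path.join(class_path, video_name)
            if not os.path.isdir(video_path):
                continue
            if not os.path.isfile(os.path.join(video_path, "n_frames")):
                missing.append(video_path)
    return missing


# Usage, from the project root:
#     print(find_missing_n_frames("./data/data/UCF101_jpg"))
```

An empty list means every video directory has its n_frames file.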
(5) File structure
After preparing the data, the file structure:
.../
    data/ (root dir)
        data/
            UCF101_jpg/ (jpg files)
        ucf-101/
            UCF-101/ (video files)
            ucfTrainTestlist/ (annotation path)
                classInd.txt
                ucf101_01.json
2. Training and evaluation
Just as image classification usually starts from a model pre-trained on ImageNet, video classification usually starts from a model pre-trained on Kinetics.
(1) Prepare the pre-training model
Taking ResNet34-VTN as an example, download the corresponding pre-trained model and put it under models/.
(2) File structure
.../
    data/ (root dir)
        ...
    models/
        resnet_34_vtn_rgbd_kinetics.pth
        se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth
(3) Training
Pre-trained model: resnet_34_vtn_rgbd_kinetics.pth
CUDA_VISIBLE_DEVICES="0" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model resnet34_vtn_rgbdiff -b16 --lr 1e-5 --seq 16 --pretrain-path ./models/resnet_34_vtn_rgbd_kinetics.pth --video-path ./UCF101_jpg
(4) Evaluation
Pre-trained model used for evaluation: se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth
CUDA_VISIBLE_DEVICES="2" python main.py --root-path ./data --result-path ./logs/ --dataset ucf101_1 --model se-resnext101-32x4d_vtn_rgbdiff -b128 --lr 1e-5 --seq 16 --st 2 --no-mean-norm --no-std-norm --no-train --no-val --test --pretrain-path ./models/se_resnext_101_32x4d_vtn_rgbd_ucf101_s1.pth --video-path ./UCF101_jpg
(5) Parameters
Choose the model and parameters from the results figure (not reproduced here):
--dataset: ucf101 or kinetics;
--model: depends on the Model and Input entries; taking the best-performing model (93.44%) as an example, it is se-resnext101-32x4d_vtn_rgbdiff;
-b: batch size;
--lr: learning rate;
--pretrain-path: path to the model downloaded from the Checkpoint entry;
--video-path: set it yourself if you do not use the default "./data/data/utf-101/jpeg2".
If --model and --pretrain-path do not correspond to each other, errors will occur.
Training se_resnext_101_32x4d_vtn_rgbd with b=2 requires about 7-8 GB of GPU memory.
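Since a mismatch between --model and --pretrain-path is easy to make, a rough sanity check before launching a run may save time. The sketch below is only a heuristic based on the file names used in this post (for example, "rgbdiff" in the model name appears as "rgbd" in the checkpoint name); names_look_consistent and _normalize are hypothetical helpers, not part of the project, and a passing check does not guarantee that the weights will load.

```python
import os
import re


def _normalize(name):
    # Lowercase the name and drop everything except letters and digits.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def names_look_consistent(model, checkpoint_path):
    """Heuristic: does the checkpoint file name start with the model name?"""
    # Assumption taken from the names in this post: "rgbdiff" -> "rgbd".
    model_key = _normalize(model).replace("rgbdiff", "rgbd")
    stem = os.path.splitext(os.path.basename(checkpoint_path))[0]
    return _normalize(stem).startswith(model_key)


# Usage with the training command above:
#     names_look_consistent("resnet34_vtn_rgbdiff",
#                           "./models/resnet_34_vtn_rgbd_kinetics.pth")
```

Because it only compares name prefixes, the check cannot tell apart checkpoints that share the same prefix but were trained on different inputs.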
3. Single-frame image inference
import sys
import time
from argparse import ArgumentParser
from collections import deque
from copy import deepcopy
import cv2
import numpy as np
import torch
import torch.nn.functional as F
from action_recognition.model import create_model
from action_recognition.options import add_input_args
from action_recognition.spatial_transforms import (CenterCrop, Compose,
                                                   Normalize, Scale, ToTensor, MEAN_STATISTICS, STD_STATISTICS)
from action_recognition.utils import load_state, generate_args
import os
TEXT_COLOR = (255, 255, 255)
TEXT_FONT_FACE = cv2.FONT_HERSHEY_DUPLEX
TEXT_FONT_SIZE = 1
TEXT_VERTICAL_INTERVAL = 45
NUM_LABELS_TO_DISPLAY = 2
class TorchActionRecognition:
    def __init__(self, encoder, checkpoint_path, num_classes=400, **kwargs):
        # model_type = "{}_vtn".format(encoder)
        model_type = "{}".format(encoder)
        args, _ = generate_args(model=model_type, n_classes=num_classes, layer_norm=False, **kwargs)
        self.args = args
        self.model, _ = create_model(args, model_type)
        self.model = self.model.module
        self.model.eval()
        self.model.cuda()

        checkpoint = torch.load(str(checkpoint_path))
        load_state(self.model, checkpoint['state_dict'])
        self.preprocessing = make_preprocessing(args)
        self.embeds = deque(maxlen=(args.sample_duration * args.temporal_stride))

    def preprocess_frame(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        return self.preprocessing(frame)

    def infer_frame(self, frame):
        embedding = self._infer_embed(self.preprocess_frame(frame))
        self.embeds.append(embedding)
        sequence = self.get_seq()
        return self._infer_logits(sequence)

    def _infer_embed(self, frame):
        with torch.no_grad():
            frame_tensor = frame.unsqueeze(0).to('cuda')
            tensor = self.model.resnet(frame_tensor)
            tensor = self.model.reduce_conv(tensor)
            embed = F.avg_pool2d(tensor, 7)
        return embed.squeeze(-1).squeeze(-1)

    def _infer_logits(self, embeddings):
        with torch.no_grad():
            ys = self.model.self_attention_decoder(embeddings)
            ys = self.model.fc(ys)
            ys = ys.mean(1)
        return ys.cpu()

    def _infer_seq(self, frame):
        with torch.no_grad():
            result = self.model(frame.view(1, self.args.sample_duration, 3,
                                           self.args.sample_size, self.args.sample_size).to('cuda'))
        return result.cpu()

    def get_seq(self):
        sequence = torch.stack(tuple(self.embeds), 1)
        if self.args.temporal_stride > 1:
            sequence = sequence[:, ::self.args.temporal_stride, :]

        n = self.args.sample_duration
        if sequence.size(1) < n:
            num_repeats = (n - 1) // sequence.size(1) + 1
            sequence = sequence.repeat(1, num_repeats, 1)[:, :n, :]
        return sequence


def make_preprocessing(args):
    return Compose([
        Scale(args.sample_size),
        CenterCrop(args.sample_size),
        ToTensor(args.norm_value),
        Normalize(MEAN_STATISTICS[args.mean_dataset], STD_STATISTICS[args.mean_dataset])
    ])


def draw_rect(image, bottom_left, top_right, color=(0, 0, 0), alpha=1.):
    xmin, ymin = bottom_left
    xmax, ymax = top_right
    image[ymin:ymax, xmin:xmax, :] = image[ymin:ymax, xmin:xmax, :] * (1 - alpha) + np.asarray(color) * alpha
    return image


def render_frame(frame, probs, labels):
    order = probs.argsort(descending=True)

    status_bar_coordinates = (
        (0, 0),  # top left
        (650, 25 + TEXT_VERTICAL_INTERVAL * NUM_LABELS_TO_DISPLAY)  # bottom right
    )

    draw_rect(frame, status_bar_coordinates[0], status_bar_coordinates[1], alpha=0.5)

    for i, imax in enumerate(order[:NUM_LABELS_TO_DISPLAY]):
        text = '{} - {:.1f}%'.format(labels[imax], probs[imax] * 100)
        text = text.upper().replace("_", " ")
        # cv2.putText expects fontFace before fontScale
        cv2.putText(frame, text, (15, TEXT_VERTICAL_INTERVAL * (i + 1)), TEXT_FONT_FACE,
                    TEXT_FONT_SIZE, TEXT_COLOR)

    return frame


def run_demo(model, labels, input_path, save_path):
    fps = 30
    tick = time.time()
    for file in sorted(os.listdir(input_path)):
        frame = cv2.imread(os.path.join(input_path, file))
        if frame is None:
            # cv2.imread returns None when the file cannot be read as an image
            print("Image is empty: " + str(file))
            break
        print("Now processing file : {}".format(file))
        # while video_cap.isOpened():
        #     ok, frame = video_cap.read()
        #
        #     if not ok:
        #         break
        logits = model.infer_frame(frame)
        probs = F.softmax(logits[0], dim=0)
        frame = render_frame(frame, probs, labels)

        tock = time.time()
        expected_time = tick + 1 / fps
        if tock < expected_time:
            delay = max(1, int((expected_time - tock) * 1000))
        tick = tock
        cv2.imwrite(os.path.join(save_path, file), frame)
        # cv2.imshow("demo", frame)
        # key = cv2.waitKey(delay)
        # if key == 27 or key == ord('q'):
        #     break


def main():
    parser = ArgumentParser()
    parser.add_argument("--encoder", help="What encoder to use ", default='resnet34')
    parser.add_argument("--checkpoint", help="Path to pretrained model (.pth) file", required=True)
    parser.add_argument("--input-video", type=str, help="Path to input img or video", required=True)
    parser.add_argument("--save-path", type=str, help="Path to save img", required=True)
    parser.add_argument("--labels", help="Path to labels file (new-line separated file with label names)", type=str,
                        required=True)
    add_input_args(parser)
    args = parser.parse_args()

    with open(args.labels) as fd:
        labels = fd.read().strip().split('\n')

    # keep only the input-data options added by add_input_args
    extra_args = deepcopy(vars(args))
    input_data_params = set(x.dest for x in parser._action_groups[-1]._group_actions)
    for name in list(extra_args.keys()):
        if name not in input_data_params:
            del extra_args[name]

    input_path = args.input_video
    save_path = args.save_path
    try:
        model = TorchActionRecognition(args.encoder, args.checkpoint, num_classes=len(labels), **extra_args)
        # cap = cv2.VideoCapture(args.input_video)
        run_demo(model, labels, input_path, save_path)
    except Exception as error:
        # format the exception; concatenating it to a str raises TypeError
        print("an error occurred: {}".format(error))


if __name__ == '__main__':
    sys.exit(main())
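The script reads the --labels file as one label per line. In the official UCF101 split lists, each line of classInd.txt has the form "1 ApplyEyeMakeup", so the class index would end up in the rendered text. If that is not wanted, the index can be stripped while loading; the sketch below shows one way, with parse_class_index being a hypothetical helper that is not part of the script above.

```python
def parse_class_index(text):
    """Return class names from lines like '1 ApplyEyeMakeup', dropping the index."""
    labels = []
    for line in text.strip().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            # "<index> <name>" line: keep only the name
            labels.append(parts[1].strip())
        else:
            # plain label line: keep it unchanged
            labels.append(line.strip())
    return labels


# Possible use inside main():
#     with open(args.labels) as fd:
#         labels = parse_class_index(fd.read())
```

The number of labels is unchanged, so num_classes=len(labels) still matches the checkpoint.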
Prepare the input images and the model, then run the following command:
CUDA_VISIBLE_DEVICES="0" python vtn_jpg_demo.py --encoder resnet34_vtn --checkpoint ./data/models/resnet_34_vtn_rgb_ucf101_s1.pth --input-video ./data/ourdata/input --save-path ./data/ourdata/output --labels ./data/ucf-101/ucfTrainTestlist/classInd.txt
Result: each input frame is written to the --save-path directory with the top predicted labels drawn on it.
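The script keeps a sliding window of per-frame embeddings in a deque and, while the window is still short, repeats what it has until the sequence reaches sample_duration. The following sketch mirrors the logic of get_seq on plain Python lists so it can be followed without a GPU; build_sequence is a hypothetical name, and integers stand in for the embedding tensors.

```python
from collections import deque


def build_sequence(buffer, sample_duration, temporal_stride):
    """Mirror get_seq on a plain list: stride, then repeat-pad to sample_duration."""
    # The buffer must be non-empty; the demo appends a frame before calling get_seq.
    sequence = list(buffer)[::temporal_stride]
    if len(sequence) < sample_duration:
        num_repeats = (sample_duration - 1) // len(sequence) + 1
        sequence = (sequence * num_repeats)[:sample_duration]
    return sequence


# Only three frames seen so far, window of four: the start is repeated.
window = deque(maxlen=4 * 1)
window.extend([0, 1, 2])
print(build_sequence(window, 4, 1))  # [0, 1, 2, 0]
```

With a single frame in the buffer, the same embedding fills the whole sequence, which is why the predictions for the first few frames of a directory tend to be less reliable.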
4. Notes
1) If various errors about the model appear, the model name passed in most likely does not match the model the checkpoint was trained with;
2) If the data is clearly in the right place under data but running the model still fails, remember that the code builds the path as data/data + (video_path);
3) If the n_frames file is missing, see step 1 (4);
4) If testing the pre-trained model gives wrong results, possible causes are:
During data preparation, python utils/preprocess_videos.py was called with the default -q=4, which saves images at the worst quality;
During inference, the arguments are wrong, with too many or too few of them, for example --no-mean-norm --no-std-norm.
Note: these problems cost a lot of time, and there were others I can no longer remember.