mmaction2 Experiment Notes 2: How Video Data Is Processed

The action recognition task is run with the config file i3d_nl_dot_product_r50_32x2x1_100e_kinetics400_rgb.py.

With this config, the training data pipeline is:

train_pipeline = [
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.8),
        random_crop=False,
        max_wh_scale_gap=0),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
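
Each entry in the list above is a dict whose type names a step, and each step is a callable that takes a results dict and returns it (as the `__call__(self, results)` methods shown later suggest). The following is only a minimal sketch of that chaining idea. The `REGISTRY` dict, the `AddOne` and `Scale` classes, and the `build_pipeline` and `run_pipeline` helpers are hypothetical stand-ins, not mmaction2 code.

```python
# Hypothetical stand-ins for pipeline steps; these are not mmaction2 classes.
class AddOne:
    def __init__(self, key):
        self.key = key

    def __call__(self, results):
        results[self.key] += 1
        return results


class Scale:
    def __init__(self, key, factor):
        self.key = key
        self.factor = factor

    def __call__(self, results):
        results[self.key] *= self.factor
        return results


# Hypothetical registry mapping the 'type' string to a class.
REGISTRY = {'AddOne': AddOne, 'Scale': Scale}


def build_pipeline(cfgs):
    """Instantiate each step from its config dict."""
    steps = []
    for cfg in cfgs:
        cfg = dict(cfg)  # copy so the original config is not mutated
        cls = REGISTRY[cfg.pop('type')]
        steps.append(cls(**cfg))
    return steps


def run_pipeline(steps, results):
    """Pass the results dict through every step in order."""
    for step in steps:
        results = step(results)
    return results


toy_pipeline = build_pipeline([
    dict(type='AddOne', key='x'),
    dict(type='Scale', key='x', factor=3),
])
out = run_pipeline(toy_pipeline, {'x': 1})
```

The point of the sketch is only the data flow: every step reads from and writes to the same dict, which is why the order of the steps in the config matters.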

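The sections below walk through the source code to record what each step does.

The loading steps analyzed below (SampleFrames and RawFrameDecode) are implemented in "/home/cb/algorithm/mmaction2-master/mmaction/datasets/pipelines/loading.py"; the remaining steps are implemented in other modules of the same pipelines directory.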

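1. type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1

This step decides which frame indices of the video will be read. It takes num_clips clips from one video; each clip contains clip_len frames, and consecutive frames within a clip are frame_interval frames apart.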

class SampleFrames:

...

    def __call__(self, results):

        total_frames = results['total_frames']

        # Start frame of each clip. The video is split evenly into num_clips
        # segments and a random offset is drawn inside each segment.
        clip_offsets = self._sample_clips(total_frames)
        # Add every frame's offset within the clip to each clip start,
        # giving frame indices with shape (num_clips, clip_len).
        frame_inds = clip_offsets[:, None] + np.arange(
            self.clip_len)[None, :] * self.frame_interval
        frame_inds = np.concatenate(frame_inds)  # flatten to a 1-D array

        if self.temporal_jitter:
            perframe_offsets = np.random.randint(
                self.frame_interval, size=len(frame_inds))
            frame_inds += perframe_offsets

        frame_inds = frame_inds.reshape((-1, self.clip_len))
        start_index = results['start_index']
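        # Flatten again and shift every index by the first frame number.
        frame_inds = np.concatenate(frame_inds) + start_index
        # Store the sampled indices and the sampling settings in results.
        # Note: np.int was removed in NumPy 1.24; use int with newer NumPy.
        results['frame_inds'] = frame_inds.astype(np.int)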
        results['clip_len'] = self.clip_len
        results['frame_interval'] = self.frame_interval
        results['num_clips'] = self.num_clips
        return results
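
The index arithmetic above can be tried in isolation. The following is a simplified sketch, not the library code: the function name `sample_frame_inds` and its `rng` parameter are made up for illustration, and both the handling of videos that are too short (all clips start at frame 0) and the wrap-around of out-of-range indices are assumptions.

```python
import numpy as np


def sample_frame_inds(total_frames, clip_len=32, frame_interval=2,
                      num_clips=1, start_index=0, rng=None):
    """Simplified sketch of training-time frame index sampling."""
    rng = np.random.default_rng() if rng is None else rng
    ori_clip_len = clip_len * frame_interval  # span covered by one clip
    avg_interval = (total_frames - ori_clip_len + 1) // num_clips
    if avg_interval > 0:
        # One evenly spaced segment per clip, plus a random offset in it.
        base = np.arange(num_clips) * avg_interval
        clip_offsets = base + rng.integers(avg_interval, size=num_clips)
    else:
        # Assumption: the video is too short, so every clip starts at 0.
        clip_offsets = np.zeros(num_clips, dtype=np.int64)
    # Broadcasting: (num_clips, 1) + (1, clip_len) -> (num_clips, clip_len)
    frame_inds = clip_offsets[:, None] + \
        np.arange(clip_len)[None, :] * frame_interval
    # Assumption: wrap indices that run past the end of the video.
    frame_inds = np.mod(frame_inds.reshape(-1), total_frames)
    return frame_inds + start_index


inds = sample_frame_inds(300, start_index=1, rng=np.random.default_rng(0))
```

With the config values (clip_len=32, frame_interval=2, num_clips=1), a 300-frame video yields 32 indices spaced 2 apart, starting at a random position that leaves room for the whole 64-frame span.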

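2. type='RawFrameDecode'

This step loads the selected frames as image data. It reads the frames at the sampled indices and packs them together as the video data. Before this step, results only holds the directory of the frame files and the frame indices; after decoding, it holds the images themselves as ndarrays.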

class RawFrameDecode:

...

    def __call__(self, results):

        directory = results['frame_dir']
        filename_tmpl = results['filename_tmpl']
        modality = results['modality']

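        imgs = list()  # all decoded frames are collected in this list
        # `cache` (frame index -> position of its first read) and `offset`
        # are defined in code omitted from this excerpt.
        for i, frame_idx in enumerate(results['frame_inds']):
            # Avoid reading the same frame from disk more than once.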
            if frame_idx in cache:
                if modality == 'RGB':
                    imgs.append(cp.deepcopy(imgs[cache[frame_idx]]))
                else:
                    imgs.append(cp.deepcopy(imgs[2 * cache[frame_idx]]))
                    imgs.append(cp.deepcopy(imgs[2 * cache[frame_idx] + 1]))
                continue
            else:
                cache[frame_idx] = i

            frame_idx += offset
            if modality == 'RGB':
                # Build the full path of this frame's image file.
                filepath = osp.join(directory, filename_tmpl.format(frame_idx))
                img_bytes = self.file_client.get(filepath)  # read raw bytes
                # Get frame with channel order RGB directly.
                # Decode the bytes into an ndarray.
                cur_frame = mmcv.imfrombytes(img_bytes, channel_order='rgb')
                imgs.append(cur_frame)
            elif modality == 'Flow':
                ...
            else:
                raise NotImplementedError

        # Record the decoded frames and the original frame size in results.
        results['imgs'] = imgs
        results['original_shape'] = imgs[0].shape[:2]
        results['img_shape'] = imgs[0].shape[:2]

        return results
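
The cache logic for the RGB case can be isolated as below. This is a sketch under assumptions: `decode_frames` and `fake_loader` are hypothetical names, and the loader returns a plain list instead of reading and decoding an image file.

```python
import copy


def decode_frames(frame_inds, load_frame):
    """Load each requested frame once; repeated indices are deep-copied."""
    imgs = []
    cache = {}  # frame index -> position of its first occurrence in imgs
    for i, frame_idx in enumerate(frame_inds):
        if frame_idx in cache:
            imgs.append(copy.deepcopy(imgs[cache[frame_idx]]))
            continue
        cache[frame_idx] = i
        imgs.append(load_frame(frame_idx))
    return imgs


calls = []


def fake_loader(idx):
    """Hypothetical loader that records calls instead of reading disk."""
    calls.append(idx)
    return [idx, idx, idx]  # stand-in for an image array


frames = decode_frames([3, 5, 3, 7, 5], fake_loader)
```

Here the loader is invoked only for indices 3, 5 and 7, while `frames` still has five entries. The deep copy matters because later augmentation steps could otherwise modify two list entries that share one underlying array.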

3. type='Resize', scale=(-1, 256)

Rescales the video while keeping the aspect ratio, so that the shorter side of every frame becomes 256.
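
The size arithmetic of such an aspect-preserving resize can be sketched as follows. The helper name `rescale_short_side` and the round-to-nearest rule are assumptions made for illustration; the library's exact rounding may differ.

```python
def rescale_short_side(width, height, short_side=256):
    """Return (new_width, new_height) with the shorter edge at short_side."""
    factor = short_side / min(width, height)
    # Round to the nearest integer pixel size.
    return int(width * factor + 0.5), int(height * factor + 0.5)


same_size = rescale_short_side(340, 256)   # shorter side is already 256
wide_size = rescale_short_side(1280, 720)  # 720p frame
```

A 340x256 frame is left at its size, while a 1280x720 frame shrinks to 455x256.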

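4. type='MultiScaleCrop', input_size=224, scales=(1, 0.8), random_crop=False, max_wh_scale_gap=0

Crops the video at one of several scales, here 1 or 0.8. The shorter of the frame's width and height is taken as the base size, and the crop size is derived from it using scales and max_wh_scale_gap. scales lists the ratios applied to the base size. max_wh_scale_gap is the largest allowed distance, measured in positions within scales, between the scale chosen for the height and the scale chosen for the width: with 0, h and w must use the same scale; with 1, they may also use adjacent scales.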
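
The way scales and max_wh_scale_gap combine can be illustrated by enumerating the candidate crop sizes. This is a simplified sketch with a hypothetical helper name, `candidate_crop_sizes`; it covers only the choice of crop size, not where the crop is placed.

```python
def candidate_crop_sizes(img_w, img_h, scales=(1, 0.8), max_wh_scale_gap=0):
    """List the (crop_w, crop_h) pairs allowed by the scale settings."""
    base_size = min(img_w, img_h)
    crop_sizes = [int(base_size * s) for s in scales]
    candidates = []
    for i, crop_h in enumerate(crop_sizes):
        for j, crop_w in enumerate(crop_sizes):
            # i and j are positions in `scales`; keep pairs that are
            # at most max_wh_scale_gap positions apart.
            if abs(i - j) <= max_wh_scale_gap:
                candidates.append((crop_w, crop_h))
    return candidates


square_only = candidate_crop_sizes(340, 256, max_wh_scale_gap=0)
with_mixed = candidate_crop_sizes(340, 256, max_wh_scale_gap=1)
```

For a 340x256 frame, a gap of 0 leaves only the square crops 256x256 and 204x204, which is the setting used in this config. A gap of 1 would also allow 204x256 and 256x204.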

5. type='Resize', scale=(224, 224), keep_ratio=False

Resizes every frame to the fixed input size the model expects, 224x224, without keeping the aspect ratio.

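6. type='Normalize', **img_norm_cfg

Normalizes the frame data, and stacks the frames, which were a Python list up to this point, into a single ndarray whose first axis is the time dimension.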
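
The two effects of this step, stacking and per-channel normalization, can be sketched with NumPy. The function name `normalize_frames` is hypothetical, and the mean and std values below are an assumption (commonly used ImageNet RGB statistics); the actual values come from img_norm_cfg, which is not shown in this post.

```python
import numpy as np


def normalize_frames(frames, mean, std):
    """Stack a list of HxWxC frames and normalize each channel."""
    stacked = np.stack(frames).astype(np.float32)  # shape (T, H, W, C)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    # mean and std have shape (C,), so they broadcast over T, H and W.
    return (stacked - mean) / std


# Assumed values; the real ones are defined by img_norm_cfg in the config.
mean = [123.675, 116.28, 103.53]
std = [58.395, 57.12, 57.375]

frames = [np.full((4, 6, 3), 128, dtype=np.uint8) for _ in range(32)]
clip = normalize_frames(frames, mean, std)
```

The list of 32 small frames becomes one float32 array of shape (32, 4, 6, 3).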

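7. type='FormatShape', input_format='NCTHW'

Rearranges the dimensions of the video data into the required order N x C x T x H x W: number of clips, channels, frames per clip, height, width.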
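
The rearrangement can be reproduced with a reshape and a transpose. This sketch assumes the input is a single array of shape (num_clips * clip_len, H, W, C); the helper name `to_ncthw` is hypothetical.

```python
import numpy as np


def to_ncthw(imgs, num_clips, clip_len):
    """Rearrange (num_clips * clip_len, H, W, C) into (N, C, T, H, W)."""
    _, h, w, c = imgs.shape
    # Split the first axis into clips and frames per clip.
    imgs = imgs.reshape(num_clips, clip_len, h, w, c)
    # Move the channel axis in front of the time axis.
    return imgs.transpose(0, 4, 1, 2, 3)


imgs = np.zeros((32, 224, 224, 3), dtype=np.float32)
batch = to_ncthw(imgs, num_clips=1, clip_len=32)
```

With the settings of this config, one sample ends up with shape (1, 3, 32, 224, 224).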

8. type='ToTensor', keys=['imgs', 'label']

Converts the listed keys (imgs and label) to Tensors.
