Preface
This article addresses the problem of training existing CNN architectures on large-scale images under tight compute and memory constraints. The authors propose PatchGD, built on the assumption that instead of performing gradient-based updates on an entire image at once, it is better to update the model on only a small part of the image at a time, while ensuring that most of the image is covered over the course of the iterations.
PatchGD offers substantially better memory and computational efficiency when training models on large-scale images. In particular, under a limited memory budget, it handles large images far more stably and efficiently than standard gradient descent.
Welcome to follow the public account CV Technical Guide, which focuses on computer vision technique summaries, the latest technology tracking, interpretations of classic papers, and CV recruitment information.
Paper: https://arxiv.org/pdf/2301.13817.pdf
Motivation
Existing CNN-based deep learning models are mainly trained and tested at relatively low resolutions (under 300 × 300 pixels), partly because the widely used image benchmark datasets are of that scale. Applying these models to high-resolution images causes a quadratic increase in the size of the associated activations, which in turn greatly increases training computation and memory footprint. Moreover, when available GPU memory is limited, CNNs simply cannot handle such large images.
There is very limited work addressing the problem of using CNNs on very large images. One of the most common approaches is to reduce the resolution of the image through downscaling. However, this discards much of the information tied to small-scale features and can adversely affect the semantic context of the image. Another strategy is to divide the image into overlapping or non-overlapping tiles and process these tiles sequentially. However, this approach does not guarantee that the semantic links between tiles are preserved, which hinders learning. Several similar strategies attempt to learn the information contained in large images; however, their inability to capture global context limits their use.
This paper proposes a scalable training strategy aimed at building neural networks for very large images, very small computational-memory budgets, or both.
Key idea
This paper argues that "large image" should not be interpreted simply in terms of pixel count: an image should be considered too large to train a CNN on whenever the corresponding computational-memory budget is small.
Hence PatchGD, which updates the model using only part of the image at a time, while ensuring that the model sees almost the full context over the course of multiple steps.
Method
General description
At its core, PatchGD builds and fills an encoding block Z. Regardless of which parts of the input are used for a given model update, Z accumulates an encoding of the full image from the information obtained for different parts of the image in the preceding update steps.
The use of the Z block is shown in Figure (a). The input image is first divided into m×n patches, and each patch is processed by θ1 as an independent image. The patches, together with their grid positions, are passed to the model in batches, and the resulting outputs are used to fill the corresponding parts of Z.
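The patch-grid construction of Z can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `fill_Z` and `toy_encoder` are hypothetical names, and the toy encoder (channel-wise means) merely stands in for the CNN component θ1.

```python
import numpy as np

def fill_Z(image, encoder, m, n, d):
    """Split `image` (H, W, C) into an m x n grid of patches, encode each
    patch independently, and store the d-dimensional encoding at the
    patch's grid position in Z (shape (m, n, d))."""
    H, W, _ = image.shape
    ph, pw = H // m, W // n                      # patch height and width
    Z = np.zeros((m, n, d))
    for r in range(m):
        for c in range(n):
            patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            Z[r, c] = encoder(patch)             # encoding goes to the patch's position
    return Z

# Toy stand-in for theta_1: channel-wise means give a d=3 encoding per patch.
toy_encoder = lambda p: p.mean(axis=(0, 1))

image = np.random.rand(512, 512, 3)
Z = fill_Z(image, toy_encoder, m=4, n=4, d=3)
print(Z.shape)  # (4, 4, 3)
```

Because each entry of Z is indexed by the patch's grid position, the spatial layout of the original image is preserved in the encoding block.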
To obtain an end-to-end CNN model, a small sub-network consisting of convolutional and fully connected layers (θ2) is added; it processes the information contained in Z and converts it into the probability vector required for the classification task. The training and inference pipeline is shown in Figure (b) below. During training, both model components θ1 and θ2 are updated. A fraction of patches is sampled from the input image, their encodings are computed using the latest state of θ1, and the outputs overwrite the corresponding entries of Z. The partially updated Z is then used to compute the loss and update the model parameters through backpropagation.
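The key point of one inner iteration is that only the entries of Z belonging to the freshly sampled patches are refreshed; the rest keep their values from earlier steps. A hedged NumPy sketch of just this data flow (the name `patchgd_step` is illustrative; the loss/backprop step on θ1 and θ2 is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchgd_step(image, Z, encode, m, n, k):
    """One inner PatchGD iteration: sample k of the m*n patch positions,
    re-encode those patches with the current encoder state, and overwrite
    only the corresponding entries of Z. The partially refreshed Z would
    then be fed to the head theta_2 to compute the loss (not shown)."""
    H, W, _ = image.shape
    ph, pw = H // m, W // n
    positions = rng.choice(m * n, size=k, replace=False)
    for pos in positions:
        r, c = divmod(pos, n)
        patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        Z[r, c] = encode(patch)                  # only these k entries change
    return Z, positions

image = np.random.rand(512, 512, 3)
Z = np.zeros((4, 4, 3))                          # encoding grid, initialized to zeros
encode = lambda p: p.mean(axis=(0, 1))           # toy stand-in for theta_1
Z, updated = patchgd_step(image, Z, encode, m=4, n=4, k=4)
```

Over successive inner iterations, different positions are sampled, so Z gradually comes to reflect almost the whole image even though each update touched only a fraction of it.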
Mathematical formulation
PatchGD avoids updating the model on an entire image sample at once; instead it uses only part of the image to compute gradients and update the model parameters. Its model update step can therefore be expressed as:
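The update equation itself is an image in the original post and did not survive extraction. As a hedged reconstruction (not necessarily the paper's exact notation), a gradient step consistent with the surrounding description would read:

```latex
\theta^{(i,\,j+1)} \;=\; \theta^{(i,\,j)} \;-\; \alpha\,
\nabla_{\theta}\,\mathcal{L}\!\left(Z^{(i,\,j)};\,\theta^{(i,\,j)}\right)
```

where $\alpha$ is the learning rate and the loss $\mathcal{L}$ is computed from the partially updated encoding block $Z^{(i,\,j)}$ after the $j$-th inner iteration on mini-batch $i$.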
Here, i indexes the mini-batch iteration within an epoch, and j indexes the inner iteration. In each inner iteration, k patches are sampled from the input image X and a gradient update is performed.
Algorithm 1 describes model training on a batch of B images. As the first step of training, Z is initialized for each input image:
Algorithm 2 describes the filling process of Z:
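Both algorithm listings are images in the original post. As a hedged batch-level sketch of the two steps they describe (initialize a Z per image, then fill it by encoding every patch), with `init_and_fill` an illustrative name rather than the paper's code:

```python
import numpy as np

def init_and_fill(batch, encoder, m, n, d):
    """For a batch of B images of shape (B, H, W, C): initialize one
    zeroed Z grid per image (as in Algorithm 1), then fill every grid
    cell by encoding the corresponding patch (as in Algorithm 2)."""
    B, H, W, _ = batch.shape
    ph, pw = H // m, W // n
    Zs = np.zeros((B, m, n, d))                  # one Z grid per image
    for b in range(B):
        for r in range(m):
            for c in range(n):
                patch = batch[b, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
                Zs[b, r, c] = encoder(patch)
    return Zs

batch = np.random.rand(2, 256, 256, 3)
Zs = init_and_fill(batch, lambda p: p.mean(axis=(0, 1)), m=4, n=4, d=3)
```

In the paper this initial fill is what gives the later partial updates a complete starting encoding to overwrite.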
Results
Experiments are conducted on two datasets: UltraMNIST and PANDA prostate-cancer grading. UltraMNIST is a classification dataset in which each sample contains 3-5 MNIST digits at different scales, placed at random positions in the image, with the digit sum lying between 0 and 9. The PANDA dataset contains high-resolution histopathology images.
On the UltraMNIST classification task with a ResNet50 architecture and 512 × 512 images, PatchGD outperforms both GD and GD-extended by a large margin:
The same comparison with a MobileNetV2 architecture:
Validation results on the PANDA dataset with ResNet50: