After giving up on backpropagation, Hinton is back with heavyweight research on forward gradient learning!


Reprinted from: Heart of the Machine

Turing Award winner Geoffrey Hinton and his collaborators have made forward gradient learning practical.

In the field of artificial intelligence, backpropagation is one of the most fundamental concepts.

Backpropagation (BP) is a method commonly used together with an optimization algorithm (such as gradient descent) to train artificial neural networks. It computes the gradient of the loss function with respect to all the weights in the network and feeds this gradient to the optimizer, which updates the weights to minimize the loss.
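
To make this combination concrete, here is a minimal, hedged sketch (a toy example of ours, not from the paper): JAX's reverse-mode autodiff plays the role of backpropagation, and a plain gradient descent loop consumes the resulting gradient to update the weights of an assumed linear model.

```python
# Minimal illustrative sketch (not from the paper): backpropagation supplies the
# gradient, gradient descent uses it to update the weights of a toy linear model.
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    pred = x @ w                      # toy linear model
    return jnp.mean((pred - y) ** 2)  # squared-error loss

grad_fn = jax.grad(loss_fn)           # reverse-mode autodiff = backpropagation

w = jnp.zeros(3)
x = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
y = jnp.ones(8)

learning_rate = 0.1
for _ in range(100):
    g = grad_fn(w, x, y)              # dL/dw from backpropagation
    w = w - learning_rate * g         # gradient-descent step that reduces the loss
```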


In short, the core idea of BP is essentially negative feedback: given a target, this mechanism lets a neural network iterate and calibrate itself automatically. With improvements in compute, data, and many other techniques, multi-layer neural networks trained with backpropagation can now compete with humans on certain AI tasks.

Many people attribute the discovery of this technique to Geoffrey Hinton, a pioneer of deep learning and a Turing Award laureate, but Hinton himself has said that his contribution was to show clearly that backpropagation can learn interesting internal representations, and to popularize that idea: "I did this by getting a neural network to learn word vector representations so that it could predict the next word in a sequence from the vector representations of the previous words."

An example is the paper "Learning representations by back-propagating errors," published in Nature in 1986.


In any case, backpropagation has driven the development of modern deep learning. However, Geoffrey Hinton, once dubbed the "father of backpropagation," has said repeatedly in recent years that he is devising the next generation of neural networks: he is "very skeptical" of backpropagation and has proposed that "it should be abandoned and we should start over."

Since around 2017, Hinton has been searching for a new direction. Heart of the Machine has previously covered his thinking on the Forward-Forward network (see "Nearly 10,000 people watched Hinton's latest talk: the Forward-Forward neural network training algorithm, paper now public").

Recently there has been another important development: a paper by Mengye Ren, Simon Kornblith, Renjie Liao, and Geoffrey Hinton has been accepted to ICLR 2023, a top conference in artificial intelligence.


Forward gradient learning computes noisy directional gradient estimates and is a biologically plausible alternative to backpropagation for training deep neural networks. However, the standard forward gradient algorithm suffers from high variance when the number of parameters to be learned is large.
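
To illustrate what "noisy directional gradients" means, here is a minimal sketch of the basic weight-perturbed forward gradient estimator (our own toy example with an assumed squared-error loss, not the authors' code): sample a random direction v, obtain the directional derivative ∇L·v with forward-mode autodiff, and use (∇L·v)v as an unbiased but high-variance estimate of the gradient.

```python
# Hedged sketch of the basic forward gradient estimator (toy loss, assumed setup).
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)  # toy squared-error loss

def forward_gradient(w, x, y, key):
    v = jax.random.normal(key, w.shape)                       # random direction
    # forward-mode autodiff gives the directional derivative ∇L·v in one pass,
    # with no backward pass through the network
    loss, dir_deriv = jax.jvp(lambda w_: loss_fn(w_, x, y), (w,), (v,))
    return loss, dir_deriv * v                                # E[(∇L·v) v] = ∇L

w = jnp.zeros(3)
x = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
y = jnp.ones(8)
loss, g_hat = forward_gradient(w, x, y, jax.random.PRNGKey(1))  # noisy estimate of dL/dw
```

A single direction gives a very noisy estimate; the variance grows with the number of perturbed dimensions, which is exactly the scaling problem the paper sets out to address.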

Against this backdrop, Geoffrey Hinton and his co-authors propose a series of new architectures and algorithmic improvements that make forward gradient learning practical on standard deep learning benchmark tasks.


  • Paper link: https://arxiv.org/abs/2210.03310

  • GitHub link: https://github.com/google-research/google-research/tree/master/local_forward_gradient

The study shows that the variance of forward gradient estimators can be reduced significantly by perturbing activations rather than weights. The research team further improves the scalability of forward gradients by introducing a large number of local greedy loss functions, each involving only a small number of learnable parameters, and a new architecture, LocalMixer (inspired by MLPMixer), that is better suited to local learning. The proposed method matches backpropagation on MNIST and CIFAR-10 and significantly outperforms previous backpropagation-free algorithms on ImageNet.


Currently, most deep neural networks are trained with the backpropagation algorithm (Werbos, 1974; LeCun, 1985; Rumelhart et al., 1986), which efficiently computes the gradients of the weight parameters. Although artificial neural networks were originally inspired by biological neurons, backpropagation has long been considered biologically implausible, because the brain does not form symmetric backward connections or perform synchronized computation. From an engineering perspective, backpropagation is also incompatible with massive model parallelism and constrains potential hardware designs. These problems point to the need for a fundamentally different learning algorithm for deep networks.

Hinton and his co-researchers re-examined activity perturbation (Le Cun et al., 1988; Widrow & Lehr, 1990; Fiete & Seung, 2006), an alternative to weight perturbation, and explored its general applicability to training on visual tasks.

The study shows that activity perturbation produces lower-variance gradient estimates than weight perturbation and that the proposed algorithm admits a continuous-time, rate-based interpretation.

The research team addresses the scalability of forward gradient learning by designing an architecture with a large number of local greedy loss functions: the network is divided into local modules, which reduces the number of learnable parameters associated with each loss. Unlike previous work that adds local losses only along the depth dimension, this study finds that patch-wise and channel-group-wise losses are also critical. Finally, inspired by MLPMixer (Tolstikhin et al., 2021), the study designs a network called LocalMixer, which uses a linear token-mixing layer and grouped channels for better compatibility with local learning.

The study evaluates the local greedy forward gradient algorithm on supervised and self-supervised image classification problems. On MNIST and CIFAR-10, the proposed algorithm is comparable to backpropagation, while on ImageNet it performs significantly better than other schemes that use asymmetric forward and backward weights. Although the algorithm does not yet match backpropagation on larger-scale problems, the local loss design may point toward biologically plausible learning algorithms and could be a key ingredient for parallelizing the computation of next-generation models.

The study analyzes the expectation and variance of forward gradient estimators, focusing on the gradient of the weight matrix; the theoretical results are summarized in Table 1 of the paper. With a batch size of N, independent perturbations reduce the variance by a factor of N, whereas shared perturbations have a constant variance term dominated by the squared gradient norm. However, with independent weight perturbations the matrix multiplications cannot be batched, since each sample's activation vector is multiplied by a different weight matrix. In contrast, independent activity perturbation allows batched matrix multiplication.


Compared with weight perturbation, activity perturbation has lower variance because the number of perturbed elements is the number of output units rather than the size of the entire weight matrix. The only downside of activity perturbation is that storing the intermediate activations requires some additional memory.

Furthermore, the study finds that in networks with ReLU activations, ReLU sparsity can be exploited to further reduce variance: units that are not activated have zero gradient and therefore need not be perturbed.
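
A hedged sketch of these two ideas for a single linear layer (our own simplification with an assumed toy downstream loss, not the authors' implementation): the perturbation is applied to the layer's activations z = xW, so its dimensionality equals the number of output units rather than the size of W, and perturbation entries for inactive ReLU units are masked out as described above.

```python
# Hedged sketch: activity perturbation for one linear layer with a toy loss.
import jax
import jax.numpy as jnp

def activity_perturbed_grad(W, x, y, key):
    z = x @ W                                        # pre-activations, shape (N, out)

    def loss_from_z(z_):
        # assumed toy downstream loss on the ReLU output
        return jnp.mean((jax.nn.relu(z_).sum(axis=-1) - y) ** 2)

    u = jax.random.normal(key, z.shape)              # perturb activations, not weights
    u = jnp.where(z > 0, u, 0.0)                     # ReLU sparsity: skip inactive units
    _, dir_deriv = jax.jvp(loss_from_z, (z,), (u,))  # scalar directional derivative
    g_z = dir_deriv * u                              # noisy estimate of dL/dz
    return x.T @ g_z                                 # chain rule: dL/dW = xᵀ (dL/dz)

W = jax.random.normal(jax.random.PRNGKey(2), (3, 4))
x = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
y = jnp.ones(8)
g_W = activity_perturbed_grad(W, x, y, jax.random.PRNGKey(1))
```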

Scaling with local loss functions

Learning with perturbations can suffer from the "curse of dimensionality": the variance grows with the dimensionality of the perturbation, and deep networks typically have millions of parameters that would all be perturbed simultaneously. One way to limit the number of learnable dimensions is to divide the network into submodules, each with its own loss function. The study therefore suppresses variance by increasing the number of local loss functions, including:

1) Blockwise loss. First, the network is divided along the depth dimension into multiple modules. Each module consists of several layers, and at the end of each module a loss function is computed and used to update only that module's parameters. This amounts to inserting a "stop gradient" operator between modules, a locally greedy loss explored by Belilovsky et al. (2019) and Löwe et al. (2019); a code sketch of this idea appears after the figure below.


2) Patchwise loss. Sensory inputs such as images have spatial dimensions, and the study applies a separate loss to each spatial patch along these dimensions. In Vision Transformer architectures (Vaswani et al., 2017; Dosovitskiy et al., 2021), each spatial token represents a patch of the image. In modern deep networks, parameters are usually shared across spatial locations to improve data efficiency and reduce memory-bandwidth usage. Although naive weight sharing is not biologically plausible, the study still uses shared weights in this work; the effect of weight sharing could also be approximated by adding a knowledge distillation loss (Hinton et al., 2015) between patches.

3) Groupwise loss. Finally, the study turns to the channel dimension. To create multiple losses, the channels are divided into groups, each attached to its own loss function (Patel et al., 2022). To prevent groups from communicating with each other, channels are only connected to other channels within the same group.


LocalMixer residual block with local loss.
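
Below is a minimal, hedged sketch of the block-wise idea from item 1 (a toy stack of linear+ReLU modules with assumed local softmax readouts, not the paper's architecture). The stop-gradient operator keeps each loss local; ordinary jax.grad is used here only to show that gradients stay within their module, whereas the paper estimates these local gradients with forward gradients.

```python
# Hedged sketch of block-wise greedy local losses with stop-gradient boundaries.
import jax
import jax.numpy as jnp

def module_forward(params, h):
    # one "module": a single linear + ReLU layer for illustration
    return jax.nn.relu(h @ params["w"] + params["b"])

def local_loss(params, h, labels_onehot):
    # each module owns a small linear readout trained with a local classification loss
    logits = h @ params["readout"]
    return -jnp.mean(jnp.sum(jax.nn.log_softmax(logits) * labels_onehot, axis=-1))

def sum_of_local_losses(all_params, x, labels_onehot):
    total, h = 0.0, x
    for params in all_params:
        h = module_forward(params, h)
        total = total + local_loss(params, h, labels_onehot)
        # greedy training: the next module receives features but no gradient path back
        h = jax.lax.stop_gradient(h)
    return total

dims, n_classes = [8, 16, 16], 10
all_params = [
    {"w": jax.random.normal(jax.random.PRNGKey(i), (d_in, d_out)) * 0.1,
     "b": jnp.zeros(d_out),
     "readout": jnp.zeros((d_out, n_classes))}
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:]))
]
x = jax.random.normal(jax.random.PRNGKey(7), (4, dims[0]))
labels_onehot = jax.nn.one_hot(jnp.array([0, 1, 2, 3]), n_classes)

# gradients of each module's local loss reach only that module's parameters
grads = jax.grad(sum_of_local_losses)(all_params, x, labels_onehot)
```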

Feature aggregator

Simply applying losses separately to the spatial and channel dimensions leads to suboptimal performance, since each loss then sees only local information. For losses on standard tasks such as classification, the model needs a global view of the input to make decisions. Standard architectures obtain this global view through a global average pooling layer before the final classification layer. The study therefore explores strategies for aggregating information from other groups and spatial patches before each local loss function.


Feature aggregator design.
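
One way to picture such an aggregator, as an assumption on our part rather than the paper's exact design: give every token/group location a globally pooled summary for its local loss, while stopping gradients through the pooled path so each loss still updates only its own local parameters.

```python
# Hedged sketch of a feature aggregator: local features plus a gradient-stopped
# global average, so every local loss sees a global view of the input.
import jax
import jax.numpy as jnp

def aggregate(h):
    # h: (batch, tokens, groups, channels)
    pooled = h.mean(axis=(1, 2), keepdims=True)              # global average pooling
    pooled = jax.lax.stop_gradient(pooled)                   # no gradient to other locations
    pooled = jnp.broadcast_to(pooled, h.shape)
    return jnp.concatenate([h, pooled], axis=-1)             # local + global context

h = jax.random.normal(jax.random.PRNGKey(0), (2, 4, 4, 8))
agg = aggregate(h)     # shape (2, 4, 4, 16), fed to each location's local loss
```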

Implementation

Network architecture: The study proposes the LocalMixer architecture, which is better suited to local learning. It is inspired by MLPMixer (Tolstikhin et al., 2021) and consists of fully connected layers and residual blocks. Fully connected layers let each spatial patch perform its computation without interfering with other patches, which fits the local learning objective. Figure 1 of the paper shows the high-level architecture, and Figure 2 shows a detailed diagram of a residual block.

Normalization. There are many ways to normalize across different tensor dimensions in neural networks (Krizhevsky et al., 2012; Ioffe & Szegedy, 2015; Ba et al., 2016; Ren et al., 2017; Wu & He, 2018). The study opts for a local variant of layer normalization that normalizes within each local spatial patch of features (Ren et al., 2017). For grouped linear layers, each group is normalized separately (Wu & He, 2018).

Experimentally, this local normalization performs better for contrastive learning and about the same as layer normalization for supervised learning. Local normalization is also more biologically plausible, since it requires no global communication.
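
A hedged sketch of the two ingredients above, assuming a (batch, tokens, groups, channels) tensor layout rather than the paper's exact code: a grouped linear layer in which channels mix only within their own group, and a local layer norm that uses the statistics of each token/group slice alone.

```python
# Hedged sketch: grouped channel mixing and local (per-token, per-group) layer norm.
import jax
import jax.numpy as jnp

def grouped_linear(w, h):
    # w: (groups, c_in, c_out); h: (batch, tokens, groups, c_in)
    # each group has its own weight matrix, so groups never communicate
    return jnp.einsum("btgi,gio->btgo", h, w)

def local_layer_norm(h, eps=1e-6):
    # normalize each (token, group) slice over its channels only: no global statistics
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / jnp.sqrt(var + eps)

h = jax.random.normal(jax.random.PRNGKey(0), (2, 4, 4, 8))   # (batch, tokens, groups, c)
w = jax.random.normal(jax.random.PRNGKey(1), (4, 8, 8)) * 0.1
out = local_layer_norm(grouped_linear(w, h))
```

Keeping both the mixing and the statistics local is what makes the design compatible with the local losses described above.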

Normalization layers are typically placed after linear layers. In MLPMixer (Tolstikhin et al., 2021), layer normalization is placed at the beginning of each residual block. The study finds it best to place normalization both before and after each linear layer, as shown in Figure 2 of the paper. Experiments show that this choice has little effect on backpropagation, but it lets forward gradient learning learn faster and reach lower training error.

Efficient implementation of the replicated loss. Because of the feature-aggregation and loss-replication design, a naive implementation of the groups can be very inefficient in memory and computation. However, every spatial group actually computes the same aggregated features and the same loss function, which means most of the computation can be shared across loss functions in both backpropagation and forward gradient computation. The study implements custom JAX JVP/VJP functions (Bradbury et al., 2018) and observes significant memory savings and speedups for the replicated loss, which would otherwise be infeasible to run on modern hardware, as shown in the figure below.


Memory and compute usage of the naive and fused implementations of the replicated loss.
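
As a toy illustration of the fusion idea only (the authors' actual custom JVP/VJP code is more involved and lives in the repository linked above): if a scalar loss is conceptually replicated M times and summed, a custom VJP can compute it once and scale the cotangent, instead of materializing M copies.

```python
# Hedged toy sketch of fusing a replicated loss with a custom VJP in JAX.
import jax
import jax.numpy as jnp

M = 16  # assumed number of replicated local losses

@jax.custom_vjp
def replicated_sum(loss):
    # forward: the sum of M identical copies of `loss` is simply M * loss
    return M * loss

def replicated_sum_fwd(loss):
    return M * loss, None            # nothing needs to be saved for the backward pass

def replicated_sum_bwd(_, g):
    return (M * g,)                  # each of the M copies contributes the same cotangent

replicated_sum.defvjp(replicated_sum_fwd, replicated_sum_bwd)

# usage: same value and gradient as summing M copies, without building M of them
loss = jnp.asarray(0.25)
total = replicated_sum(loss)
grad = jax.grad(replicated_sum)(loss)   # == M
```

In the real implementation, what is shared is the aggregated-feature and loss computation itself; the sketch only shows the general mechanism of overriding the backward rule so that replication costs nothing extra.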

Experiments

The study compares the proposed algorithm with alternatives including backpropagation, feedback alignment, and other global forward gradient variants. Backpropagation acts as an oracle that is implausible in biology, since it computes the true gradients, whereas the proposed method computes noisy gradient estimates. Feedback alignment computes approximate gradients using a set of random backward weights.

The results of various experiments are as follows:


Supervised learning for image classification.


Self-supervised contrastive learning with linear readout.


Effect of adding local loss at different positions on forward gradient performance.


M/8/* error rates when training on CIFAR-10 with different numbers of groups.

Summary

It is generally believed that perturbation-based learning cannot scale to large deep networks. This study shows that this is partly true: the variance of the gradient estimate grows with the number of perturbed hidden dimensions, and the situation is even worse for shared weight perturbations.

On the optimistic side, the study shows that a large number of greedy local losses helps perturbation-based gradient learning scale: it explores blockwise, patchwise, and groupwise local losses as well as their combination, which performs best on larger networks. Local activity-perturbed forward gradients outperform previous backpropagation-free algorithms on larger networks. The idea of localized losses opens up opportunities for different loss designs and sheds light on how biologically plausible learning algorithms might be found, both in the brain and in alternative computing devices.
