A review and summary of the most important ideas in deep learning of recent years

Reprinted: DASOU

The author of this article, Denny Britz, summarizes the most important ideas in deep learning in chronological order. It is highly recommended reading for newcomers. Almost all of the most important ideas since 2012 are listed, ideas that have sustained countless researchers, who in turn have published countless papers built on them. In order, they are:

  • AlexNet and Dropout: AlexNet directly opened the era of deep learning and laid down the basic structure of CNN models in computer vision. Dropout, needless to say, has become a standard component.

  • Atari with deep reinforcement learning: the pioneering work of deep reinforcement learning. DQN opened a new path, and everyone began to try it on all kinds of games.

  • Seq2Seq + Attention: its impact on NLP goes without saying. For a while it was even claimed that any NLP task could be solved with Seq2Seq + Attention, and this work in fact laid the foundation for the later pure-attention Transformer.

  • Adam optimizer: not much to say; it has become a dependable workhorse for training models.

  • Generative Adversarial Networks (GANs): since 2014 this area has been a frenzy of activity, with everyone building all kinds of GANs, until StyleGAN, which consolidated these advances, came out last year. Deepfakes, which have caused all sorts of controversy, are one of the results, and recently people have even been seen using them to spread fake information.

  • Residual Networks: like Dropout and Adam, they have become a standard component that models rely on.

  • Transformers: the pure-attention model that directly replaced LSTMs in NLP, gradually achieved good results in other fields as well, and laid the foundation for the subsequent BERT-style pre-trained models.

  • BERT and fine-tuned NLP models: take a highly scalable Transformer, lots of data, and a simple self-supervised training objective, and you get a very powerful pre-trained model that sweeps a wide variety of tasks. The most recent example is GPT-3; ever since its API was released, all kinds of fancy demos, essentially elaborate auto-completion, have been shown off around the Internet.

The author reviews here some ideas that have been widely used in deep learning and have stood the test of time; the coverage is of course not exhaustive. Even so, the techniques described below already cover the basics needed to understand modern deep learning research. If you are new to the field, great: this is a very good starting point.

Influenced by the author's personal knowledge and familiarity, this list may not be comprehensive, and many noteworthy subfields are not mentioned. But the mainstream areas that most people recognize, including machine vision, natural language processing, speech, and reinforcement learning, are all covered.

The author also only discusses research that has an official or semi-official open-source implementation that can actually be run. Research that is huge and difficult to reproduce, such as DeepMind's AlphaGo or OpenAI's Dota 2 AI, is not covered.

2012: Processing the ImageNet dataset with AlexNet and Dropout

Related papers:

ImageNet Classification with Deep Convolutional Neural Networks [1]:

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

Improving neural networks by preventing co-adaptation of feature detectors[2]:

https://arxiv.org/abs/1207.0580

One weird trick for parallelizing convolutional neural networks [14]:

https://arxiv.org/abs/1404.5997

Implementation code:

PyTorch version:

https://pytorch.org/hub/pytorch_vision_alexnet/

TensorFlow version:

https://github.com/tensorflow/models/blob/master/research/slim/nets/alexnet.py


Illustration source: [1]

It is generally agreed that AlexNet kicked off the recent wave of deep learning and artificial intelligence research. AlexNet is a deep convolutional network building on LeNet, proposed by Yann LeCun years earlier. What set it apart was that, by combining the power of GPUs with a superior algorithm, AlexNet achieved a very large improvement, far surpassing all previous methods for classifying the ImageNet dataset. It proved that neural networks really do work! AlexNet was also one of the first models to use Dropout [2], which has since become a key component for improving the generalization ability of all kinds of deep learning models.

The AlexNet architecture is a series of modules composed of convolutional layers, ReLU nonlinearities, and max pooling, which have since been accepted as the standard building blocks of machine-vision networks. Today, because libraries like PyTorch are so powerful, implementing AlexNet is very simple compared with some of the latest architectures and takes only a few lines of code, for example as sketched below. It is worth noting that many implementations of AlexNet use a variant that incorporates a trick from the paper One weird trick for parallelizing convolutional neural networks [14].
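To make the structure concrete, here is a minimal sketch of the conv → ReLU → max-pool pattern described above in PyTorch. It is illustrative only (a shortened stack, not the exact paper configuration), and it assumes torch is installed; the full reference implementations are linked above.

```python
# Minimal sketch of an AlexNet-style stack: convolutions, ReLU nonlinearities,
# max pooling, and Dropout before the classifier. Shapes are illustrative.
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),   # 224x224 -> 55x55
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                   # -> 27x27
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                   # -> 13x13
)
classifier = nn.Sequential(
    nn.Dropout(p=0.5),                        # Dropout, popularized alongside AlexNet
    nn.Linear(192 * 13 * 13, 1000),           # 1000 ImageNet classes
)

x = torch.randn(1, 3, 224, 224)               # dummy image batch
logits = classifier(torch.flatten(features(x), 1))
```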

2013: Playing Atari with Deep Reinforcement Learning

Related papers:

Playing Atari with Deep Reinforcement Learning [7]:

https://arxiv.org/abs/1312.5602

Implementation code:

PyTorch version:

https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

TensorFlow version:

https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial


Illustration source:

https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning

Building on recent advances in image recognition and GPUs, DeepMind successfully trained a neural network to play Atari games from raw pixel input. Moreover, the same neural network could learn to play seven different games without being given any game-specific rules, which demonstrated the generality of the approach.

Youtube video:

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Reinforcement learning differs from supervised learning (such as image classification) in that the agent must learn to maximize the sum of rewards over many time steps (for instance, over a whole game) rather than just predict a label. Since the agent interacts directly with the environment and each action affects what it sees next, the training data is not i.i.d. This makes training many reinforcement learning models unstable, but the problem can be addressed with techniques such as experience replay.

Although there were no obvious algorithmic innovations, the work cleverly combined existing techniques, such as training convolutional neural networks on GPUs and experience replay, with some data-processing tricks to achieve results that exceeded everyone's expectations (a rough sketch of these ingredients follows below). This gave people the confidence to extend deep reinforcement learning to more complex tasks such as Go, Dota 2, and StarCraft 2.
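The sketch below combines a replay buffer with the Q-learning target used by DQN. It is a simplified sketch in PyTorch with illustrative names (`q_net`, `replay`), assuming transitions are stored as tensors; it omits details such as frame stacking, the target network, and the epsilon-greedy policy.

```python
# Minimal sketch: experience replay + Q-learning update, in the spirit of the 2013 DQN paper.
import random
from collections import deque
import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)   # experience replay buffer of (s, a, r, s', done) tensors

def dqn_update(q_net, optimizer, batch_size=32, gamma=0.99):
    # Sampling random transitions breaks the correlation between consecutive steps (non-i.i.d. data).
    states, actions, rewards, next_states, dones = map(
        torch.stack, zip(*random.sample(replay, batch_size))
    )
    q = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # TD target: r + gamma * max_a' Q(s', a'), zeroed beyond terminal states
        target = rewards + gamma * q_net(next_states).max(dim=1).values * (1 - dones)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```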

Since this paper, Atari games have become a standard benchmark for reinforcement learning research. The original method, although it surpassed human performance, did so on only 7 games. Over the following years these ideas were extended to beat humans at more and more games; only recently has the technology conquered all 57 games and surpassed human-level performance on every one, with Montezuma's Revenge, famous for requiring long-term planning, considered one of the hardest to crack.

2014: Encoder-decoder networks with attention (Seq2Seq + Attention)

Related papers:

Sequence to Sequence Learning with Neural Networks [4]:

https://arxiv.org/abs/1409.3215

Neural Machine Translation by Jointly Learning to Align and Translate [3]:

https://arxiv.org/abs/1409.0473

Code:

PyTorch version:

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

TensorFlow version:

https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt


Illustration source: the open-source Seq2Seq framework in TensorFlow:

https://ai.googleblog.com/2017/04/introducing-tf-seq2seq-open-source.html

Many of the most impressive results in deep learning are on vision-related tasks and are powered by convolutional neural networks. Although the field of NLP had some success in language modeling and translation with LSTMs and encoder-decoder architectures, it was not until the advent of attention mechanisms that the field achieved truly impressive results.

When processing language, each token (which can be a character, a word, or something in between) is fed into a recurrent network (such as an LSTM) that keeps a memory of previously processed inputs. In other words, a sentence is treated like a time series, with each token representing one time step. These recurrent models tend to "forget" earlier inputs as they process a sequence, making long-distance dependencies hard to handle. They are also difficult to optimize with gradient descent, because gradients must be propagated through many time steps, which leads to exploding and vanishing gradients.

Introducing an attention mechanism alleviates these problems by giving the network an adaptive way to "look back" at earlier time steps through direct connections. These connections let the network decide which inputs are important when generating a particular output. A simple translation example: when generating an output word, the attention mechanism typically selects one or more specific input words to refer to, as sketched below.
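Here is a minimal sketch of such an attention step, in the additive style of the Bahdanau et al. paper linked above: the decoder state is compared against every encoder state, the resulting weights say which inputs to "look back" at, and a weighted sum becomes the context for the next output word. Shapes and layer names are illustrative assumptions.

```python
# Minimal sketch of additive attention over encoder states (illustrative shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 256
W_enc = nn.Linear(hidden, hidden)
W_dec = nn.Linear(hidden, hidden)
v = nn.Linear(hidden, 1)

def attend(decoder_state, encoder_states):
    # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
    scores = v(torch.tanh(W_enc(encoder_states) + W_dec(decoder_state).unsqueeze(1)))
    weights = F.softmax(scores, dim=1)                # how relevant each input token is
    context = (weights * encoder_states).sum(dim=1)   # weighted sum, fed to the decoder
    return context, weights.squeeze(-1)

context, weights = attend(torch.randn(2, hidden), torch.randn(2, 7, hidden))
```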

2014 – Adam Optimizer

Related papers:

Adam: A Method for Stochastic Optimization  [12]:

https://arxiv.org/abs/1412.6980

Code:

Python version:

https://d2l.ai/chapter_optimization/adam.html

PyTorch version:

https://pytorch.org/docs/master/_modules/torch/optim/adam.html

TensorFlow version:

https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/optimizer_v2/adam.py#L32-L281

[Figure: probability of finding the optimal solution (y-axis) versus hyperparameter-tuning budget in number of model trainings (x-axis). Source: http://arxiv.org/abs/1910.11758]

Neural networks are generally trained by minimizing a loss function with an optimizer, whose job is to figure out how to adjust the network parameters so that the network learns the specified objective. Most optimizers are based on Stochastic Gradient Descent (SGD) (https://ruder.io/optimizing-gradient-descent/). Note, however, that many optimizers themselves contain tunable parameters such as the learning rate. Finding the right settings for a particular problem not only reduces training time but also finds better local optima of the loss function, which often means a better model.

In the past, research labs with deep pockets would run very expensive hyperparameter searches to come up with learning-rate schedules for SGD. This could beat the previous state of the art, but it often meant spending a lot of money tuning the optimizer. Such details are usually not mentioned in papers, so researchers without the same budget were stuck with worse results and could do little about it.

Adam is a boon to these researchers: it automatically adjusts the learning rate using the first and second moments of the gradient. Experiments show it is very reliable and not very sensitive to the choice of hyperparameters; in other words, Adam works out of the box without the extensive tuning other optimizers require (a sketch of the update rule follows below). While a carefully tuned SGD might yield better results, Adam makes research easier, because when something goes wrong you know it is unlikely to be a tuning problem.
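For reference, here is a minimal sketch of a single Adam update as described in the paper: running estimates of the gradient's first and second moments, bias correction, and a per-parameter step size. It uses plain NumPy and the paper's default hyperparameters; the production implementations linked above handle many more details.

```python
# Minimal sketch of one Adam step (paper defaults; theta, grad, m, v are NumPy arrays).
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return theta, m, v
```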

2014/2015 - Generative Adversarial Networks (GAN)

Related papers:

Generative Adversarial Networks [6] :

https://arxiv.org/abs/1406.2661

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [17]:

https://arxiv.org/abs/1511.06434

Code:

PyTorch version:

https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html

TensorFlow version:

https://www.tensorflow.org/tutorials/generative/dcgan


Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, to demonstrate that the model has not memorized the training set. Samples are random draws rather than cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model's distribution, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)

Source: https://developers.google.com/machine-learning/gan/gan_structure

The goal of generative models (for example, Variational Autoencoders, VAEs) is to generate data samples that resemble real ones, such as faces of people who do not exist. Here the model has to capture the entire data distribution (many pixels!) rather than just classify images as cats or dogs like a discriminative model, which makes such models difficult to train. Generative Adversarial Networks (GANs) are one such model.

The basic idea of a GAN is to train two networks simultaneously: a generator and a discriminator. The generator's goal is to produce samples that fool the discriminator, which is trained to distinguish real images from generated ones. As training progresses, the discriminator gets better at spotting fakes, and the generator gets better at fooling the discriminator and producing ever more realistic samples; this is where the "adversarial" comes from. The first GANs produced blurry, low-resolution images and were rather unstable to train. But as the technology advanced, variants and improvements such as DCGAN [17], Wasserstein GAN [25], CycleGAN [26], and StyleGAN (v2) [27] became able to produce higher-resolution, photorealistic images and videos.
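The adversarial idea fits in a few lines. Below is a minimal sketch of one training step, assuming `G` and `D` are any suitable generator and discriminator modules (for example the DCGAN ones from the tutorials linked above) and that `D` outputs a probability; it is illustrative, not a recipe for stable training.

```python
# Minimal sketch of one GAN training step (PyTorch; G and D are assumed nn.Modules).
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=100):
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # 1) Discriminator: learn to tell real images from generated ones.
    fake = G(torch.randn(n, z_dim)).detach()
    loss_d = F.binary_cross_entropy(D(real), ones) + F.binary_cross_entropy(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: learn to produce samples the discriminator labels as real.
    loss_g = F.binary_cross_entropy(D(G(torch.randn(n, z_dim))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```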

2015 – Residual Networks (ResNet)

Related papers:

Deep Residual Learning for Image Recognition [13]:

https://arxiv.org/abs/1512.03385

Code:

PyTorch version:

https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py

TensorFlow version:

https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/applications/resnet.py


Building on AlexNet, researchers invented convolutional architectures with even better performance, such as VGGNet [28] and Inception [29]. ResNet is the most important breakthrough in this line of progress. To this day, ResNet variants serve as baseline architectures for all sorts of tasks and as the basis for more complex architectures.

What makes ResNet special, besides winning the ILSVRC 2015 classification challenge, is its depth compared with other architectures. The deepest network mentioned in the paper has 1,000 layers; although it performs slightly worse than the 101- and 152-layer versions on the benchmark task, it still performs well. Training such deep networks is actually quite challenging because of the vanishing gradient problem, which sequence models also suffer from. Until then, few researchers believed that training such deep networks could yield stable results.

ResNet uses shortcut connections to help gradients flow. One way to understand it is that each block of ResNet only needs to learn the "difference" (residual) between its input and output, which is simpler than learning a full transformation; a minimal sketch of such a block follows below. Residual connections in ResNet are a special case of Highway Networks [30], which were in turn inspired by the gating mechanism in LSTMs.
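The sketch shows a basic residual block with an identity shortcut: the block's output is its learned transformation plus the unchanged input, so the stacked layers only need to model the residual. The fixed channel count and lack of downsampling are simplifying assumptions.

```python
# Minimal sketch of a basic residual block with an identity shortcut (PyTorch).
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: only the residual has to be learned

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
```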

2017 - Transformers

Related papers:

Attention is All You Need [5] :

https://arxiv.org/abs/1706.03762

Code:

PyTorch version:

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

TensorFlow version:

https://www.tensorflow.org/tutorials/text/transformer

HuggingFace Transformers library:

https://github.com/huggingface/transformers


Figure 1: Transformer - Model Architecture

Source: https://arxiv.org/abs/1706.03762

The Seq2Seq + Attention model (described earlier) works well, but because of its recurrent nature it has to be computed step by step over time. That makes it hard to parallelize: it processes one step at a time, and each step depends on the previous one. This also makes it difficult to use on long sequences; even with attention, it remains hard to model complex long-range dependencies, and most of the work still happens in the recurrent layers.

Transformers solve these problems head-on by dropping the recurrence altogether and replacing it with multiple feed-forward self-attention layers that process all inputs in parallel and create relatively short paths between inputs and outputs (which are easy to optimize with gradient descent). This makes them very fast to train, easy to scale, and able to handle far more data. To inject information about input positions (which is implicit in recurrent models), Transformers use positional encodings. To learn more about how Transformers work, I recommend reading this illustrated blog.

(http://jalammar.github.io/illustrated-transformer/)
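At the core of those feed-forward self-attention layers is scaled dot-product attention, where every position attends to every other position in parallel. Below is a minimal single-head sketch, with illustrative dimensions and no masking or multi-head projections.

```python
# Minimal sketch of scaled dot-product self-attention (single head, no masking).
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    # x: (batch, seq_len, d_model); all positions are processed in parallel.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise compatibility
    weights = F.softmax(scores, dim=-1)                       # each position attends to all others
    return weights @ v

d = 64
out = self_attention(torch.randn(2, 10, d), torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
```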

To say that Transformers performed better than almost anyone expected would be an understatement. Over the next few years they not only performed better but effectively displaced RNNs, becoming the standard architecture for most NLP and other sequence tasks, and they are even being used in machine vision.

2018 – BERT and fine-tuned NLP models

Related papers:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [9]:

https://arxiv.org/abs/1810.04805

Code:

HuggingFace implementation for fine-tuning BERT:

https://huggingface.co/transformers/training.html


Pre-training means training a model on one task and then using the learned parameters to initialize learning on a related task. This is quite intuitive: a model that has learned to classify images of cats and dogs should have picked up some basics about images and furry animals, so when it is fine-tuned to classify foxes it can be expected to do better than a model trained from scratch. Likewise, a model that has learned to predict the next word in a sentence should have learned something about human language, so its parameters are also a good initialization for related tasks such as translation or sentiment analysis.

Pre-training and fine-tuning have been successful in both computer vision and NLP. Although long standard in computer vision, it took a while to figure out how to make it work well in NLP, where most of the best results still came from fully supervised models. With the emergence of methods such as ELMo [34] and ULMFiT [35], NLP researchers could finally make pre-training really pay off (word vectors had arguably played that role earlier), and in particular the application of the Transformer produced a series of methods such as GPT and BERT.

BERT is a relatively recent pre-training achievement, and many people believe it opened a new era of NLP research. Instead of being trained to predict the next word like most pre-trained models before it, it predicts words that have been masked (deliberately removed) from sentences, and whether two sentences are adjacent. Note that these tasks require no labeled data: BERT can be trained on any text, and on a lot of it! The pre-trained model learns general properties of language and can then be fine-tuned to solve supervised tasks such as question answering or sentiment prediction. BERT performed extremely well across a wide range of tasks and swept the leaderboards when it came out. Companies like HuggingFace jumped on the bandwagon, making fine-tuned BERT models for NLP tasks easy to download and use. Since then, BERT has been surpassed in turn by newer models such as XLNet [31], RoBERTa [32], and ALBERT [33], and by now virtually the whole field builds on it.
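To see the masked-word objective in action, the HuggingFace pipeline API can fill in a [MASK] token with a pre-trained BERT in a few lines. This minimal sketch assumes the transformers library is installed and downloads the public bert-base-uncased checkpoint on first use.

```python
# Minimal sketch: querying a pre-trained BERT's masked-word predictions via HuggingFace.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Deep learning has changed the field of [MASK] processing."):
    print(pred["token_str"], round(pred["score"], 3))   # top predicted tokens and their scores
```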

2019/2020 and beyond – BIG language models, self-supervised learning

Throughout the history of deep learning, perhaps the most obvious trend is what Sutton calls the bitter lesson: algorithms that can take advantage of better parallelism (and more data) and have more model parameters beat so-called "smarter techniques" time and time again. The trend has continued into 2020 with OpenAI's GPT-3, a massive language model with 175 billion parameters that shows unexpectedly strong generalization despite its simple training objective and architecture, as its many impressive demos attest.

Following the same trend are contrastive self-supervised learning methods such as SimCLR (https://arxiv.org/abs/2002.05709), which make better use of unlabeled data. As models become larger and faster to train, techniques that can efficiently exploit the vast unlabeled datasets available online, and learn general knowledge that transfers, are becoming increasingly valuable.
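The core of these contrastive methods is a loss that pulls two augmented views of the same image together and pushes apart views of different images. Below is a minimal NT-Xent-style sketch (PyTorch; `z1` and `z2` are assumed to be the projected embeddings of two augmented views of one batch), a simplified illustration rather than the exact SimCLR implementation.

```python
# Minimal sketch of an NT-Xent-style contrastive loss, as used by SimCLR-like methods.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N normalized embeddings
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # ignore self-similarity
    # Each view's positive is its augmented counterpart in the other half of the batch.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```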

Original article:

https://dennybritz.com/blog/deep-learning-most-important-ideas/

References

[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25.
