What are the bottlenecks in the field of deep learning?


Source: Zhihu, 算法进阶
This article is about 8,000 words; suggested reading time is 15 minutes.
It discusses the bottlenecks of deep learning from the perspective of computer vision.

In recent years, deep learning has become the most dazzling star in the computer field and has given rise to many practical applications, mainly in reasoning and decision-making. However, making further progress beyond image, speech, and natural language processing, for example in understanding human emotions or imitating consciousness and motivation, involves many deeper problems. These are, for now, the black box that deep learning cannot open and the Rubik's Cube it cannot solve.

Author: mileistone

source:

https://www.zhihu.com/question/40577663/answer/309571753

Because I am familiar with computer vision, I will talk about my views on the bottleneck of deep learning from the perspective of computer vision.

1. Deep learning lacks theoretical support


The ideas in most papers are proposed by intuition, with little theoretical support behind them. An idea validated as effective through experiments is not necessarily the optimal direction, much like SGD in optimization: each step is locally optimal, but viewed globally the trajectory is not.

Without theoretical support, progress in computer vision is like SGD: effective but slow. With theoretical support, progress could be as effective and as fast as Newton's method.

The CNN model itself has many hyperparameters: how many layers to use, how many filters per layer, whether each filter is depthwise, pointwise, or an ordinary convolution, what kernel size each filter uses, and so on.

The number of possible combinations of these hyperparameters is enormous, and it is practically impossible to verify them all by experiment. In the end we can only try a few combinations based on intuition, so current CNN models can be said to work very well, but they are certainly not optimal, in either accuracy or efficiency.
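To make the search space concrete, here is a minimal sketch (PyTorch assumed; the channel and kernel numbers are arbitrary examples) of the three filter types mentioned above and how their parameter counts differ:

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

ordinary = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)                # full convolution
depthwise = nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in)   # one filter per input channel
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)                          # 1x1 conv that mixes channels

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(ordinary))                          # 73,856 (= 64*128*3*3 + 128 bias)
print(n_params(depthwise) + n_params(pointwise))   # 8,960: depthwise + pointwise is far cheaper
```

Even this one choice (ordinary versus depthwise-separable) changes the cost by almost an order of magnitude, and it is only one of many hyperparameters that interact with each other.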

Take efficiency as an example. ResNet works very well, but its computational cost is high and its efficiency is low. Yet it is certain that ResNet's efficiency can be improved, because there must be redundant parameters and redundant computation inside it; as long as we find and remove those redundant parts, efficiency naturally rises. One of the simplest methods, and the one most people use, is to reduce the number of channels in each layer.
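As a rough illustration of why channel reduction helps (simple arithmetic, not a real pruning method): the parameter count of a 3x3 convolution scales with the product of its input and output channels, so halving both cuts parameters and FLOPs by roughly four.

```python
def conv_params(c_in, c_out, k=3):
    # parameters of a single k x k convolution, ignoring bias and BatchNorm
    return c_in * c_out * k * k

print(conv_params(256, 256))   # 589,824
print(conv_params(128, 128))   # 147,456 -> about a quarter of the parameters and computation
```

The hard part, of course, is knowing how far the channels can be cut before accuracy drops, which is exactly where theory is missing.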

If a theory could estimate the capacity of a model and the capacity a given task requires, then when facing a task we could choose a model whose capacity matches it, getting both better results and better efficiency.

2. More and more engineer-like thinking in the field


Because deep learning itself lacks theory, building that theory is a hard nut to crack. Meanwhile, deep learning frameworks are becoming more and more foolproof to use, and open-source implementations of all kinds of models are available online, so many people in the industry now treat deep learning like Lego bricks.

Faced with a task, you git clone the open-source implementations of the current best models, read the assembly instructions for these building blocks (that is, the papers), and think about which blocks could be swapped, whether their order could be changed, whether adding a few blocks would improve accuracy, whether removing a few would improve efficiency, and so on.

After thinking it over, you run the experiment. If the results are good, you write a paper; if they are not as expected, you try something else.

The whole process reflects an engineer's way of thinking: it relies largely on trial and error guided by intuition, with little deep reflection. Few people ask, from a theoretical standpoint, what is wrong with the model and what improvement should be made in response to that specific problem.

To give an extreme example: a dataset is actually generated by a linear function, but we keep trying to fit it with a quadratic. When the fit is poor, we try a cubic; if that fails, a quartic, and then we give up. We seldom ask what distribution the data follows, whether there is a function family that can fit such a distribution, and if so, which function is the most suitable.
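The point of the analogy can be made concrete with a toy experiment (a sketch assuming only NumPy; the data and candidate degrees are made up): when the data is truly linear, checking held-out error across candidate function families immediately reveals that higher-degree polynomials buy nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, 200)        # data generated by a linear function plus noise

x_tr, y_tr, x_va, y_va = x[:150], y[:150], x[150:], y[150:]
for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x_tr, y_tr, degree)        # fit a polynomial of this degree
    val_err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(degree, round(val_err, 4))               # degree 1 is already essentially the best
```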

Deep learning should be a science, and it needs to be approached with scientific thinking; only then will we get better results.

3. Adversarial samples are a problem for deep learning, but not its bottleneck


I think that although adversarial examples are a problem of deep learning, they are not the bottleneck of deep learning. There are also adversarial examples in machine learning. Compared with deep learning, machine learning has more theoretical support, but it still fails to solve the problem of adversarial examples.

The reason adversarial samples feel like the bottleneck of deep learning is that images are very intuitive: when we see two almost identical pictures and the deep learning model gives two completely different classifications, the impact on us is large.

By contrast, if you modify the value of one element of a feature vector whose original label is A and an SVM's prediction changes to B, we are not surprised: "You changed a feature value, so of course the classification result changes."

Author: PENG Bo
https://www.zhihu.com/question/40577663/answer/413331053

Personally, I think that the current bottleneck of deep learning may lie in scaling. Yes, you heard that right.

We already have massive amounts of data and massive computing power, but it is difficult for us to train truly large deep network models (GB- to TB-scale models), because BP (backpropagation) is hard to parallelize at large scale. Data parallelism alone is not enough, and once model parallelism is added the speedup ratio drops sharply. Even with many improvements, the bandwidth requirements of training remain too high.
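A back-of-envelope estimate (assuming fp32 gradients and a ring all-reduce; the model size is an arbitrary example) shows why synchronous data-parallel BP is so bandwidth-hungry:

```python
params = 1_000_000_000                 # a 1-billion-parameter model (illustrative)
bytes_per_grad = 4                     # fp32
n_gpus = 16

# a ring all-reduce moves roughly 2*(N-1)/N of the gradient volume through each GPU per step
per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * params * bytes_per_grad
print(per_gpu_bytes / 1e9, "GB of gradient traffic per GPU, every training step")   # ~7.5 GB
```

Several gigabytes of traffic per step, every step, is what makes the interconnect the limiting factor.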

That is why NVIDIA's DGX-2 has only 16 V100s yet sells for 2.5 million. The same total computing power could be assembled for much less money, but it is hard to build a machine that can use that many graphics cards efficiently.

[Image: NVIDIA DGX-2]

And the GPUs inside DGX-2 are not fully interconnected:

[Figure: DGX-2 GPU interconnect topology]

Another example is the training of AlphaGo Zero: only a small number of TPUs were actually used for training, because even with tens of thousands of TPUs there is no way to use them efficiently to train the network.

If deep learning could increase training speed simply by stacking machines, with no cleverness required (the way mining scales by stacking mining rigs), so that a super-large multi-task network could learn from all kinds of data at the PB or EB level, the results would likely be astonishing.

Then we look at the current bandwidth:

https://en.wikipedia.org/wiki/List_of_interface_bit_rates

PCI-E 3.0 x16, released in 2011, offers 15.75 GB/s. Consumer computers are still at this level, and PCI-E 4.0 has yet to arrive, perhaps because there is little motivation for it (games do not need that much bandwidth).

NVLink 2.0 is 150 GB/s, which is still not enough for large-scale parallelization.
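Plugging in the link speeds quoted above (and the roughly 7.5 GB-per-step gradient traffic from the earlier estimate; latency and compute/communication overlap ignored) gives a feel for the cost:

```python
traffic_gb = 7.5   # gradient bytes per GPU per step, from the earlier estimate
for name, gb_per_s in [("PCI-E 3.0 x16", 15.75), ("NVLink 2.0", 150.0)]:
    ms = traffic_gb / gb_per_s * 1000
    print(f"{name}: {ms:.0f} ms per step spent only on gradient synchronization")
# roughly 476 ms vs 50 ms -- a large slice of a typical training step either way
```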


You may say that the bandwidth will gradually increase.

Great, so here comes the weirdest question, which I think is worth pondering:

AI chips take so much effort and are still bandwidth-limited, so why is the human brain not limited by bandwidth?

My opinion is:

  • The human brain's parallelization is done so well that only kB-level bandwidth is needed between neurons, something AI chip and algorithm researchers could learn from.

  • The learning method of the human brain is much rougher than that of BP, so it can be parallelized on such a large scale.

  • The learning method of the human brain is decentralized. In my opinion, it is closer to an energy-based method.

  • Other characteristics of the human brain can be imitated by the current transfer learning + multi-task learning + continuous learning.

  • The human brain also uses language to assist thinking. Without language, it is difficult for the human brain to learn complex things quickly.

Author: Giant
https://www.zhihu.com/question/40577663/answer/1974793135

My research field is mainly natural language processing (NLP). From the perspective of NLP, combined with my own research and work experience, I will summarize 8 typical bottlenecks behind the prosperity and appeal of deep learning.

1. High dependence on labeled data

As we all know, whether it is traditional classification, matching, sequence labeling, and text generation tasks, or recent cross-modal tasks such as image understanding, audio sentiment analysis, and Text2SQL, wherever deep learning models are used there is a heavy dependence on labeled data.

This is why, in the early stage of a project or during cold start, insufficient data makes deep learning models perform unsatisfactorily. Models need far more examples than humans to learn new things.

Although there have been some low-resource or even zero-resource works recently (such as two papers on dialogue generation [1-2]), in general these methods are only applicable to certain specific fields and are difficult to directly promote.

2. The model is domain-dependent and difficult to migrate directly

Following on from the previous point: after we obtain large-scale labeled data through long iterations with labeling teams or crowdsourcing and finally train a model, its performance plummets again as soon as the business scenario changes.

Or the model performs well only on the paper's dataset and similar results cannot be reproduced on other data. These are very common problems.

Improving the transferability of models is a very valuable topic in deep learning, and it can greatly reduce the cost of data labeling. For example, a classmate of mine is very experienced at kart-racing games. Now that the new QQ Speed mobile game is out, he can learn by analogy after a couple of rounds and easily reach the Xingyao and Chariot ranks, without starting over from the most basic drifting practice.

Although the NLP pre-training + fine-tuning method alleviates this problem, the transferability of deep learning needs to be further enhanced.

3. Giant models have high resource requirements

Although giant models with astonishing results have appeared frequently in NLP over the past two years, they discourage ordinary researchers. Pre-training costs run from tens of thousands of dollars (roughly $12,000 for BERT and $43,000 for GPT-2) up to millions, and even just using the pre-trained weights places high demands on GPUs and other hardware.

The parameter counts of large models are growing exponentially: BERT (110 million), T5 (11 billion), GPT-3 (175 billion), PanGu (200 billion)... Developing high-performance small models is another valuable direction for deep learning.
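Just storing the weights already strains hardware; a rough footprint for the parameter counts above (fp16 weights, inference only, no optimizer state or activations) looks like this:

```python
for name, n_params in [("BERT-base", 110e6), ("T5", 11e9), ("GPT-3", 175e9), ("PanGu", 200e9)]:
    gb = n_params * 2 / 1e9            # 2 bytes per fp16 weight
    print(f"{name}: ~{gb:.1f} GB of weights at fp16")
# BERT-base ~0.2 GB, T5 ~22 GB, GPT-3 ~350 GB, PanGu ~400 GB -- the largest models cannot fit on a single GPU
```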

Fortunately, there have been some good lightweight works in the NLP field, such as TinyBERT[3], FastBERT[4], etc.
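The common idea behind many of these lightweight works is knowledge distillation. The sketch below shows a generic distillation loss (PyTorch assumed; this is the standard recipe, not the exact TinyBERT/FastBERT procedure): a small student matches the large teacher's softened output distribution in addition to the ordinary label loss.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: match the teacher's temperature-scaled output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # hard targets: the ordinary supervised cross-entropy on the labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```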

4. The model lacks common sense and reasoning ability

As the original question mentions, deep learning's current understanding of human emotion remains at a shallow semantic level; it lacks good reasoning ability and cannot truly understand user demands. On the other hand, how to effectively integrate common sense or background knowledge into model training is also one of the bottlenecks deep learning needs to overcome.

One day, when a deep learning model can not only write poems, solve equations, and play Go, but also answer parents' simple common-sense questions, it will truly deserve to be called "intelligent".

5. Limited application scenarios

Although NLP has many subfields, the directions that have developed best are still classification, matching, translation, and search; the application scenarios of most other tasks remain limited.

For example, chatbots are generally used as the fallback module of a question-answering system, replying with a standard anthropomorphic phrase when the FAQ or intent module fails to match the user's question. If a chatbot is applied directly in the open domain, however, it easily turns from artificial intelligence into artificial stupidity, which puts users off.

6. Lack of efficient hyperparameter automatic search scheme

There are many hyperparameters in deep learning. Although there are automated tuning tools such as Microsoft's nni [5], tuning still relies heavily on the algorithm engineer's personal experience, and because training takes so long, verifying each parameter setting carries a high time cost.

In addition, AutoML still requires large-scale computing power to quickly produce results, so attention also needs to be paid to increasing the computing scale.
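For reference, even the simplest automated scheme, random search, captures the shape of what tools like nni automate. The sketch below is plain Python (not nni's API), and `train_and_eval` is a hypothetical user-supplied function that trains a model with a configuration and returns a validation metric.

```python
import random

search_space = {
    "lr":         lambda: 10 ** random.uniform(-5, -2),
    "batch_size": lambda: random.choice([16, 32, 64]),
    "dropout":    lambda: random.uniform(0.0, 0.5),
}

def random_search(train_and_eval, n_trials=20):
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: sample() for name, sample in search_space.items()}
        score = train_and_eval(cfg)       # the expensive part: a full training run per trial
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The cost problem is visible in the loop: every trial is a full training run, which is exactly why large-scale compute is still needed.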

7. Some papers are only oriented towards the competition SOTA

Pushing a well-known competition to SOTA and then publishing a paper has been the practice of many researchers (including me, once). A typical pipeline is:

  • Climb to first place on the leaderboard, whatever the resource cost;

  • Then work backwards to explain why the method works so well (somewhat like rationalizing after the fact).

Of course, this is not to say the approach is worthless, but we should not do research solely to climb leaderboards. In many cases, squeezing out another 0.XX% after the decimal point is genuinely meaningless and brings little benefit to the development of deep learning.

This also explains why an interviewer who asks "how did you get such a good result in that competition" is put off when the answer is "multi-model ensembling" and other model-stacking tricks: real-world scenarios are constrained by resources, time, and other factors, so things are generally not done that way.

8. Not very interpretable

The last point is also a common problem in this field. The entire deep learning network is like a black box, lacking clear and transparent interpretability.

For example, why does adding a little noise perturbation (in effect, an adversarial example) to a giant panda picture make the model classify it as a gibbon with 99.3% confidence?

[Figure: giant panda image with a small adversarial perturbation classified as a gibbon with 99.3% confidence]

Visualizing the features learned by some models (CNN, Attention, etc.) may help us understand how the model learns. Previously, the field of machine learning also used dimensionality reduction techniques (t-SNE, etc.) to understand the distribution of high-dimensional features.
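A minimal sketch of the t-SNE approach mentioned above (scikit-learn and matplotlib assumed; `features` would be activations taken from some intermediate layer of the model):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_features(features, labels):
    # features: (n_samples, n_dims) array of intermediate-layer activations
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("t-SNE of learned features")
    plt.show()
```

If points of the same class cluster together in the 2-D plot, the model has at least learned a separable representation, though this still says little about why.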

More deep learning interpretability research can refer to [6].

Recently, the 2018 Turing Award winners Bengio, LeCun, and Hinton were invited by ACM to jointly review the basic concepts and breakthrough achievements of deep learning, and they also discussed the challenges facing its future development.

Author: Zhihu user
https://www.zhihu.com/question/40577663/answer/224699031

After reading some of the answers, I feel that what everyone says is reasonable, but many of the "bottlenecks" mentioned are bottlenecks of machine learning in general, not of deep learning specifically. Let me venture an answer of my own below.

In deep learning, "deep" is the appearance, not the goal. The universal approximation theorem shows that a single hidden layer is enough to approximate any continuous function, so depth itself is not the point. What distinguishes deep learning from traditional machine learning is representation learning: the essential features (representations) of the data are learned through a well-designed hierarchical structure.

As for bottlenecks: deep learning is a kind of machine learning, so it shares machine learning's own bottlenecks, for example its heavy dependence on data. It is the "behavioral intelligence" of data rather than artificial intelligence with genuine self-awareness. The answers above cover this at length.

Besides that, it has some unique bottlenecks.

  1. The feature structure is hard to change. The data format (size, length, color channels, text vocabulary format, etc.) is demanding, and a trained feature extractor is not so easy to transfer to other tasks.

  2. It is very unstable. In NLP tasks such as text generation (QA) or image captioning, the generated content sometimes leaves you at a loss, though it can also surprise you. This uncontrollability keeps deep learning out of many engineering applications: applications that cannot afford to sacrifice recall and precision cannot be built on it, otherwise they become dangerous. By contrast, rule-based methods are far more dependable; at least when something goes wrong you can debug them.

  3. It is hard to hotfix. When something goes wrong, you basically have to retrain the parameters, and many practical difficulties arise during deployment.

  4. Optimizing deep models relies too much on personal experience. The world's three great schools of metaphysics: Western astrology, the Eastern Zhouyi, and deep learning.

  5. Model structures are becoming more and more complex, and integrating different systems is becoming harder and harder. It is as if super-soldiers keep being trained, but they share no common language and so cannot form a super-army.

  6. Sensitive-information issues. If the data used to train a model has not been desensitized, it may be possible to extract sensitive information from the model by various methods.

  7. Attack problems. The existence of adversarial samples has been confirmed: crafting a few adversarial examples can directly defeat existing algorithms. My feeling is that adversarial samples arise because feature extraction fails to learn the manifold structure of the data; in other words, a certain degree of overfitting brings about this problem.

  8. But the biggest problem at present is the demand for massive data. Since the goal is to learn the true distribution, and our data is only a small sample drawn from it, approximating the true distribution well requires as much data as possible. Once the demand for data volume goes up, many questions follow: Where does the data come from? Where is it stored? How is it cleaned? Who labels it? How do we train on such large amounts of data? How do we trade off cost (hardware, data) against effect?

  9. Extending point 8: is deep learning, which needs massive amounts of data, really "artificial intelligence"? I for one do not believe it. The human brain can generalize from limited knowledge, rather than merely following human-designed guidelines that steer machine learning toward a distribution over a feature space. Real artificial intelligence should not demand this much data and compute! (This is really a machine learning problem.)

In short, many factors limit its application. But looking at it optimistically, problems are not to be feared; they can always be solved.

Author: anonymous user
https://www.zhihu.com/question/40577663/answer/311095389

Computational graphs are getting more and more complex, and designs are getting more and more counter-intuitive.

Whether you call the innovations of Dropout/BN/Residual tricks or not, you can at least make up a plausible-looking intuitive explanation for them, and they have been applied successfully in completely different scenarios and tasks. In the past year there have been essentially no new, useful tricks at that level. The population of alchemists keeps growing, yet no universal trick has been found, which suggests the field has hit a bottleneck: the easy-to-pick peaches have all been picked.

Has the potential of these architectures been exhausted? Or have we simply not found a more general and representative task to serve as a breeding ground for new tricks? These are questions DL research needs to answer, and right now the outlook is not optimistic. Traditional DL research, which relies on changing a few lines and adding a few layers for a specific task, finds it harder and harder to produce high-quality papers.

My personal view is that if DL wants to truly wear the hat of artificial intelligence, it must do genuinely intelligent things, rather than being artificially divided into NLP/CV/ASR by application scenario. Brute-force fitting has an upper limit after all, and it has nothing in common with the way humans acquire intelligence.

Author: He Zhiyuan
https://www.zhihu.com/question/40577663/answer/224656397

Let me simply speak my mind. In my opinion, most current deep learning models, no matter how complex the network architecture, are actually doing the same thing:

Use a large amount of training data to fit an objective function y=f(x).

x and y are actually the input and output of the model, for example:

  • Image classification problem. At this time, x is generally an image numerical matrix of width*height*channel number, and y is the category of classification.

  • Speech recognition problem. x is the voice sampling signal, and y is the text corresponding to the voice.

  • machine translation. x is a sentence in the source language, and y is a sentence in the target language.

And "f" represents the model in deep learning, such as CNN, RNN, LSTM, Encoder-Decoder, Encoder-Decoder with Attention, etc. Compared with traditional machine learning models, models in deep learning usually have two characteristics:

  • The model has a large capacity and many parameters;

  • End-to-end (end-to-end).

With the help of GPU computing acceleration, deep learning can optimize large-capacity models end-to-end, thereby surpassing traditional methods in performance. This is the basic methodology of deep learning.
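In code, this whole methodology reduces to something like the sketch below (PyTorch assumed; random tensors stand in for real images and labels); across tasks, essentially only the architecture of f and the scale of the data change.

```python
import torch
import torch.nn as nn

# "f": a small end-to-end model mapping x (a flattened image) to y (10 class scores)
f = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.SGD(f.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784)             # a batch of inputs (stand-ins for real data)
y = torch.randint(0, 10, (64,))      # their labels

for _ in range(100):                 # fit y = f(x) end to end by gradient descent
    opt.zero_grad()
    loss_fn(f(x), y).backward()
    opt.step()
```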

So, what are the downsides of this approach? Personally, I think there are the following points.

1. The efficiency of training f is not high

The inefficiency of training shows up in two ways. First, training takes a long time. As everyone knows, deep learning needs GPUs to accelerate training, yet even so, training times are measured in hours or days. If the data volume is large and the model is complex (for example face recognition or speech recognition models with huge sample sizes), training time is measured in weeks or even months.

The other inefficiency is low sample efficiency. A small example: pornographic-image detection. A human needs to see only a few "training samples" to learn to recognize porn, and judging which pictures are "pornographic" is very simple for us. Training a deep learning porn-detection model, however, often requires tens of thousands of positive and negative samples, as with Yahoo's open-source yahoo/open_nsfw. In general, deep learning models need far more examples than humans to learn the same thing, because humans already carry a great deal of "prior knowledge" in the domain, while for deep learning models we lack a unified framework for supplying comparable priors.

So how are these two problems handled in practice? For long training times, the answer is more GPUs; for low sample efficiency, more labeled samples. But whether you add GPUs or add samples, you need money, and money is often the decisive constraint in real projects.

2. The unreliability of the fitted f itself

We know deep learning can greatly outperform traditional methods. But such performance figures hold only in a statistical sense and cannot guarantee correctness on individual cases. For example, an image classification model with 99.5% accuracy correctly classifies 9,950 of 10,000 test images; for a new image, even if the model outputs a very high confidence, we cannot guarantee the result is correct, because confidence and actual accuracy are not equivalent. In addition, the unreliability of f also shows up as poor interpretability: in a deep model it is usually hard to understand clearly what each parameter means.
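The gap between confidence and accuracy can even be measured directly. The sketch below (NumPy assumed) bins predictions by their confidence and compares each bin's average confidence with its actual accuracy, a simple expected-calibration-error computation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: predicted probability of the chosen class; correct: 1/0 per prediction
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight the gap by the bin's share of samples
    return ece                                # 0 means confidence matches accuracy

# e.g. a bin where the model says "99%" but is right only 90% of the time contributes a 0.09 gap
```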

A typical example is the "adversarial example". As shown below, the neural network recognizes the original image as a "panda" with 60% confidence, but after a small amount of noise is added to the image, the network recognizes it as a "gibbon" with 99% confidence. This shows that deep learning models are not as reliable as one might imagine.

[Figure: panda image plus small noise perturbation classified as a gibbon with 99% confidence]
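Such adversarial images are typically produced by following the gradient of the loss with respect to the input. The sketch below shows the fast gradient sign method, FGSM (PyTorch assumed; the model, preprocessing, and epsilon are illustrative, not the exact setup behind the panda/gibbon figure):

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=0.007):
    # image: a batched input tensor in [0, 1]; label: the true class indices
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # take a tiny step in the direction that increases the loss the most
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()
```

The perturbation is small enough to be invisible to a human, yet it can flip the model's prediction with high confidence.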

In some key fields, such as the medical field, if a model can neither guarantee the correctness of the results nor explain the results well, it can only serve as an "assistant" for humans and cannot be widely used.

3. Can f achieve "strong artificial intelligence"?

The last question is actually a bit metaphysical, not a specific technical issue, but it doesn't hurt to discuss it.

Many people care about artificial intelligence because they care about realizing "strong artificial intelligence". Following the deep learning recipe, we might describe human intelligence this way: x is a person's sensory input, y is their behavioral output, such as speech and action, and f represents their intelligence. Can human intelligence then be trained by brute-force fitting of f? Opinions differ; my own inclination is that it cannot. Human intelligence seems closer to concept abstraction, analogy, reflection, and creation than to a black-box f pulled out directly, and deep learning methods may need further development before they can simulate real intelligence.

Author: Zhang Xu
https://www.zhihu.com/question/40577663/answer/225319588

Having learned only a little of the basics, let me join the fun.

1. Deep learning requires a large amount of data, and if the amount of data is too small, it will cause serious overfitting.

2. Deep learning does not have obvious advantages when dealing with tabular data. Currently, it is relatively good at computer vision, natural language processing and speech recognition. In the context of tabular data, everyone is more willing to use models such as xgboost.

3. Theoretical support is weak, and almost no one works on the mathematical foundations of deep learning; instead, everyone swarms to churn out model papers.

4. Continuing from the previous point, hyperparameter tuning has basically fallen into alchemy mode; tuning deep learning models is already a kind of metaphysics.

5. Hardware consumption is heavy. GPUs are already a must, but they are expensive, which is why deep learning is also called a rich man's game.

6. Deployment and productionization are still difficult, especially in mobile application scenarios.

7. Unsupervised learning is still difficult. At present, deep learning training is basically based on gradient descent to minimize the loss function, so labels are required. Labeling large amounts of data is expensive. Of course, there are also unsupervised learning networks that are developing rapidly, but strictly speaking, GAN and VAE are all self-supervised learning.

Since my first point was questioned in the comments, let me explain. A sufficiently strong learner generally does not worry about underfitting: a neural network has a huge number of parameters, and given enough training epochs it can in theory fit the training set completely. But that is not what we want, because such a model generalizes very poorly. The root cause is that the dataset is too small to represent the distribution behind the whole data, so the neural network is forced to fit the distribution of the training subset almost indiscriminately, which is overfitting.

Author: zzzz

https://www.zhihu.com/question/40577663/answer/224756448

I think the biggest bottleneck of deep learning is also its biggest advantage, namely:


1. end-to-end training
2. universal approximation

Its advantage lies in its strong fitting ability.


The disadvantage is that we have almost no control over the intermediate fitting process. Everything we want the model to learn can only be conveyed through large amounts of data, a more complex network (inception modules, more layers), and more constraints (dropout, regularization), in the hope that it finally learns a decision rule equivalent to our own cognition.

To give a specific example, we want to judge whether the image is a human face.

A common criterion is whether the image contains two eyes, one nose, and one mouth, and whether their relative positions obey geometric logic. This is exactly the idea behind traditional DPM: each of these steps (sub-tasks) may go wrong, so the overall performance will not be especially good, but each sub-task needs relatively little training data, the intermediate results are fairly intuitive, and the final result matches human judgment criteria.

When deep learning does the same thing, however, apart from a little prior knowledge that can be encoded in the network structure (a CNN, for instance, assumes by default that features are locally coherent and position invariant), all other cognition has to be learned by the network itself from large amounts of data. Simple factors such as face size, position, and rotation can be simulated by data augmentation, but for skin color, background pattern, and hair, you need to find additional data to expand the network's understanding of the problem. Even then, we cannot be sure what high-level knowledge the network has actually summarized, or what judgment it will make when shown an image of Erlang Shen that is not in the training data.

This is why data is the most important ingredient in deep learning. When your data is not diverse enough, the network may learn only hacky, trivial solutions; when the data is comprehensive enough, it is more likely to summarize features more expressive than a simple nose and eyes, though we cannot understand them.

Original link:

https://www.zhihu.com/question/40577663/answer/902429604

Editor: Wang Jing

