[Deep Learning] What are the bottlenecks of deep learning? A nearly 10,000-word interpretation

1. Introduction

Although deep learning has made remarkable progress in areas such as image, speech, and natural language processing, it still faces challenges on deeper problems such as understanding human emotion, imitating consciousness, and modeling motivation.

On the one hand, concepts such as human emotion, consciousness, and motivation are very complex, involving many subjective and abstract factors. Deep learning models are mainly based on statistical pattern recognition over large-scale data, and it is often difficult for them to model and understand such subjective, abstract concepts.

On the other hand, current deep learning models mainly rely on supervised learning and reinforcement learning for training, requiring large amounts of labeled data or learning through interaction with an environment. However, deeper issues such as emotion, consciousness, and motivation often lack explicit labels and are difficult to learn from simple reward signals. This makes it much harder to apply deep learning to these problems.

In addition, the interpretability of deep learning models is also an important issue. Deep learning models are often viewed as black boxes whose internal decision-making processes are difficult to explain and understand. Model interpretability is critical for understanding a model's decision-making process and assessing its accuracy and bias when it comes to deep questions about human emotion, consciousness, and motivation.

To overcome these challenges, researchers are actively exploring various methods. One approach is to combine deep learning with techniques from other fields, such as symbolic reasoning, knowledge representation and reasoning, etc., to enhance the understanding and reasoning capabilities of the model. Another approach is to introduce data and information from more domains, such as social media data, physiological signals, and emotional expressions, to provide more comprehensive emotional understanding and motivational imitation.

Although deep learning has not yet fully opened the black box on deep-level issues such as human emotion, consciousness, and motivation, with the continuous development of technology and the deepening of research, we are expected to make more breakthroughs and progress in these areas.

Because I am familiar with computer vision, I will discuss my views on the bottlenecks of deep learning from the perspective of computer vision.

2. Deep learning lacks theoretical support

The ideas in most papers are indeed based on intuition and lack theoretical support. That an idea is shown to work by experiment does not mean it is the best direction to take. It is just like stochastic gradient descent (SGD) in optimization: each step may be locally optimal, yet the overall trajectory need not reach the global optimum.

In the absence of theoretical support, progress in computer vision may be effective but slow, like stochastic gradient descent; with theoretical support, it could be both effective and rapid, like Newton's method.
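To make the analogy concrete, here is a minimal sketch (mine, not the author's) comparing the two optimizers on a toy convex function. Newton's method uses curvature information, the stand-in for "theory" here, and converges in far fewer steps:

```python
# A minimal sketch contrasting gradient descent with Newton's method on
# the simple convex function f(x) = x^4. Newton's method rescales each
# step by the second derivative and homes in on the minimum much faster.

def f_grad(x):      # f'(x) = 4x^3
    return 4 * x**3

def f_hess(x):      # f''(x) = 12x^2
    return 12 * x**2

x_gd, x_newton = 3.0, 3.0
for step in range(20):
    x_gd -= 0.01 * f_grad(x_gd)                      # fixed learning rate
    x_newton -= f_grad(x_newton) / f_hess(x_newton)  # curvature-scaled step

print(f"after 20 steps: gradient descent x={x_gd:.4f}, Newton x={x_newton:.6f}")
```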

The convolutional neural network (CNN) model itself has many hyperparameters: the number of layers, the number of filters per layer, whether each filter is a depthwise convolution, a pointwise convolution, or a regular convolution, the kernel size of each filter, and so on.

The combinations of these hyperparameters form a huge search space, and verifying them by experiment alone is an almost impossible task. In the end, only some of the combinations can be tried, guided by intuition, so the current CNN models can only be said to work well; they are almost certainly not optimal, in either effectiveness or efficiency.
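As a rough illustration of how quickly these choices multiply, the following back-of-envelope sketch (assumptions mine) counts the parameters of a single layer under just two of the hyperparameters mentioned, kernel size and convolution type:

```python
# Parameter counts for one conv layer (biases ignored):
#   regular conv:           k * k * C_in * C_out
#   depthwise + pointwise:  k * k * C_in  +  C_in * C_out

def regular_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

for k in (3, 5):
    for c in (64, 128, 256):
        reg = regular_conv_params(k, c, c)
        sep = depthwise_separable_params(k, c, c)
        print(f"k={k}, channels={c}: regular={reg:,}  separable={sep:,}  "
              f"({reg / sep:.1f}x smaller)")
```

Two hyperparameters already span an order of magnitude in cost; a full network multiplies such choices across dozens of layers.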

Taking efficiency as an example: the current ResNet model works well, but its computation is too heavy and its efficiency is not high. What is certain is that ResNet contains redundant parameters and redundant computation; if we can find these redundant parts and remove them, efficiency will improve. One of the simplest and most commonly used methods is to reduce the number of channels in each layer.
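As a hedged sketch of that simplest method, the snippet below ranks a layer's filters by L1 norm and keeps the strongest half. It is illustrative only: a real pruning pipeline would also have to rewire the following layer and fine-tune the network afterwards.

```python
# Magnitude-based channel pruning on a single conv layer (PyTorch assumed).
import torch

conv = torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

# L1 norm of each output filter: shape (128,)
filter_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))

keep = filter_norms.argsort(descending=True)[:64]   # keep top 50% of filters
pruned_weight = conv.weight[keep]                   # shape (64, 64, 3, 3)

print("original params:", conv.weight.numel())      # 128*64*3*3 = 73,728
print("pruned params:  ", pruned_weight.numel())    # 64*64*3*3  = 36,864
```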

If there were a theory that could estimate both a model's capacity and the capacity a task requires, then for any given task we could use a model of matching capacity and obtain good results with high efficiency.

By researching and analyzing the theoretical relationship between deep learning models and computer vision tasks, the performance and efficiency of the models can be better optimized. A theoretical framework can provide estimates of model capacity, and the optimal model capacity required for a task. In this way, when faced with a specific task, we can choose an appropriate model capacity that matches the task to strike a balance between effectiveness and efficiency.

The existence of theoretical support enables us to design and improve models more targetedly. Through theoretical analysis, we can determine the redundant parameters and calculations that exist in the model, and take corresponding optimization measures. For example, based on theoretical guidance, we can precisely reduce the number of channels in each layer and remove redundant parts in the model, thereby improving computational efficiency.

In addition, theoretical support can also help us better understand the working principle and decision-making process of deep learning models. Deep learning models are often treated as black boxes, and their inner workings are often difficult to explain. However, through theoretical analysis, we can reveal the key factors, features, and decision paths in the model, thereby improving the interpretability and understandability of the model.

By bringing theory and experiment together, we can push the field of computer vision further. Experimental verification is still an important means of evaluating and validating models, but combining theoretical guidance can make experiments more targeted and efficient. Driven by theory, we can discover problems faster, solve them faster, and accelerate progress in the field of computer vision.

All in all, theoretical support plays an important role in the field of deep learning and computer vision. By establishing a theoretical framework and analyzing the relationship between models and tasks, we can optimize the performance and efficiency of models and improve the understanding of the model's decision-making process. Combining theory and experimentation will advance the field of computer vision and accelerate our understanding of and search for solutions to vision tasks.

3. More and more engineer-like thinking in the field

The field of deep learning still lacks a complete theoretical framework, which makes deep learning theory a challenging open problem. As deep learning frameworks have evolved, more and more people treat deep learning as a tool, completing tasks by assembling open-source model implementations as easily as stacking Lego blocks.

When faced with a specific task, people usually pick the open-source implementation of the current best model and read the related papers as a guide for building it. They think about how to improve some of the components, adjust the order of the blocks, or add or remove blocks to improve accuracy or efficiency.

The process involves more of an engineer's mindset, based mostly on intuition and trial and error. Few people ask what is wrong with the model and how to improve it from a theoretical point of view.

As an extreme example, suppose we have a set of data that was actually generated by a linear function, but we try to fit it with a quadratic function, find the fit poor, then try a cubic function, and give up if that fails. Few people ask what the distribution of the data looks like, whether some function fits that distribution, and if so, which function is most suitable.
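The toy scenario is easy to make concrete. In this sketch (illustrative data and degrees chosen by me), held-out error rather than training error is what reveals that the linear model is the right one:

```python
# Fit polynomials of degree 1-3 to data truly generated by a linear function.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)        # true process is linear

x_tr, y_tr, x_te, y_te = x[:100], y[:100], x[100:], y[100:]

for degree in (1, 2, 3):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: held-out MSE = {test_mse:.5f}")
```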

Deep learning should be a science, and it needs to be explored with a scientific way of thinking. Only in this way can we achieve better results. When facing deep learning problems, we need to think about more theoretical aspects and how to improve the model for the problem.

Although deep learning currently lacks sound theoretical support, we can promote the development of deep learning by combining practice and theory. Through more in-depth theoretical research, we can better understand the behavior, properties and limitations of deep learning models, and on this basis, we can propose more effective and efficient models and algorithms. At the same time, we also need to continuously verify and improve the effectiveness of the theory in practice, and promote the gradual development of deep learning into a science.

4. Adversarial examples are a problem for deep learning, but not its bottleneck

The existence of adversarial examples is indeed a problem that affects deep learning, but it is not necessarily its only bottleneck. It is worth noting that adversarial examples are not specific to deep learning; similar situations exist across machine learning. Although machine learning has more theoretical backing on the adversarial example problem, that theory does not fully resolve the challenge.

Adversarial examples caught our attention because images are intuitive: we are shocked and confused when two nearly identical images lead a deep learning model to wildly different classification results.

In contrast, if we modify the value of one element of a feature vector and the classification result of a support vector machine (SVM) flips from A to B, we may not be too surprised, because we expect such a modification to have a normal impact on the classification result.

Nonetheless, adversarial examples have attracted greater attention in deep learning, partly because deep learning models have shown impressive performance on complex tasks, and the impact of adversarial examples is obvious on tasks such as image classification. This prominence makes adversarial examples an important challenge for the field.

In order to solve the problem of adversarial examples, we need to continue research and exploration, not only to improve from a practical point of view, but also to strengthen theoretical analysis and explanation. The development of deep learning needs more theoretical support and explanation to better understand the decision-making process, characteristics and limitations of the model, so as to better solve problems such as adversarial examples, and promote the further development of the field of deep learning.

Although we have abundant data and compute, it is still difficult to train very large deep network models (at the GB to TB scale), because the backpropagation (BP) algorithm is hard to parallelize effectively at that scale. Combining data parallelism with model parallelism helps only to a point, and even with various improvements, the bandwidth requirements of training remain high.

This explains why a system like NVIDIA's DGX-2, with only 16 V100 GPUs, can sell for as much as 2.5 million RMB (roughly $400,000). Graphics cards with the same total compute can be bought for less money, but it is hard to build a system that can efficiently utilize that many cards.

This problem stems from parallelization limits in the training process of deep learning models. The BP algorithm needs to transfer gradient information between GPUs, which demands very high bandwidth and communication capacity. Current hardware and software architectures cannot fully meet these demands, so large-scale deep learning training still hits bottlenecks.
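A back-of-envelope calculation (all numbers are illustrative assumptions, not measurements) shows how quickly gradient exchange eats bandwidth in plain data-parallel training:

```python
# Why data parallelism is bandwidth-hungry: every step, each worker must
# exchange a full copy of the gradients.
params = 100e6              # a 100M-parameter model (assumed)
bytes_per_grad = 4          # fp32 gradients
steps_per_sec = 10          # assumed per-GPU training throughput

grad_bytes = params * bytes_per_grad              # ~0.4 GB per step
traffic_per_gpu = grad_bytes * 2 * steps_per_sec  # ring all-reduce moves
                                                  # roughly 2x the gradient
                                                  # size per GPU per step
print(f"gradient size per step: {grad_bytes / 1e9:.1f} GB")
print(f"per-GPU traffic: {traffic_per_gpu / 1e9:.1f} GB/s "
      f"-- a large share of PCIe 3.0 x16's ~15.75 GB/s")
```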

To address this issue, researchers are continuously working to improve parallel computing methods and algorithms for deep learning. For example, some new parallel computing strategies and communication optimization techniques are being proposed and studied to achieve more efficient deep learning model training. In addition, hardware manufacturers are also developing new accelerators and high-performance computing systems to meet the needs of large-scale deep learning.

Although scaling is an important challenge currently facing deep learning, with the continuous development and innovation of technology, it is believed that more solutions will appear in the future to overcome this bottleneck and promote the further development of deep learning.

Moreover, the GPUs inside the DGX-2 are not even fully interconnected.

Another example is the training of AlphaGo Zero: only a small number of TPUs were actually used, because even with tens of thousands of TPUs there is no way to use them all efficiently to train the network.

If deep learning could increase training speed simply by mindlessly stacking machines (the way mining operations stack mining rigs), then a super-large-scale multi-task network could be set loose on all kinds of data at the PB or even EB level, and the results would likely be astonishing.

Now look at current bandwidth figures:

https://en.wikipedia.org/wiki/List_of_interface_bit_rates

PCI-E 3.0 x16, released in 2011, offers 15.75 GB/s. Consumer computers are still at this level; 4.0 is still not out, perhaps because there is little motivation (games do not need that much bandwidth).

NVLink 2.0 is 150 GB/s, which is still not enough for large-scale parallelization.

You may say that the bandwidth will gradually increase.

Great, so here comes the weirdest question, which I think is worth pondering:

AI chips have taken so much effort and are still bandwidth-limited, so why is the human brain not limited by bandwidth?

  • The human brain parallelizes so well that only kB-level bandwidth is needed between neurons; this is worth studying for AI chip and algorithm researchers.
  • The human brain's learning method is much coarser than BP, which is precisely why it can be parallelized at such a scale.
  • The human brain's learning method is decentralized; in my opinion, it is closer to an energy-based method.
  • Other characteristics of the human brain can be imitated by today's transfer learning + multi-task learning + continual learning.
  • The human brain also uses language to assist thinking; without language, it is difficult for the human brain to learn complex things quickly.

5. Answers from Zhihu Netizens

5.1 Author: Giant

https://www.zhihu.com/question/40577663/answer/1974793135

My research field is mainly natural language processing (NLP). From the perspective of NLP, and drawing on my own research and work experience, I will summarize 8 typical bottlenecks behind the prosperity and fascination of deep learning.

  1. Highly dependent on labeled data

As we all know, whether in traditional classification, matching, sequence labeling, and text generation tasks, or in recent cross-modal tasks such as image understanding, audio sentiment analysis, and Text2SQL, wherever deep learning models are used there is a heavy dependence on labeled data.

This is why deep learning models perform unsatisfactorily in the early or cold-start stage, when data is insufficient. Models need far more examples than humans do to learn new things.

Although some low-resource and even zero-resource work has appeared recently (such as two papers on dialogue generation [1-2]), these methods generally apply only to specific domains and are hard to generalize directly.

  2. The model is domain-dependent and difficult to transfer directly

Following on from the previous point: after long iterations with labeling teams or crowdsourcing, we finally obtain large-scale labeled data and train a model, but once the business scenario changes, the model's performance plummets again.

Or the model performs well only on the paper's dataset and cannot reproduce similar results on other data. These are very common problems.

Improving the transferability of models is a very valuable topic in deep learning, since it can greatly reduce the cost of data labeling. For example, a classmate of mine is very experienced at kart racing. When the new QQ Speed mobile game was released, he could learn by analogy after two rounds and easily reach Xingyao and Chariot, without starting from the most basic drifting practice.

Although the NLP pre-training + fine-tuning method alleviates this problem, the transferability of deep learning needs to be further enhanced.

  3. Giant models have high resource requirements

Although giant models with astonishing performance have appeared frequently in NLP over the past two years, they discourage ordinary researchers. Setting aside the pre-training cost, which runs from tens of thousands of dollars (BERT → $12,000; GPT-2 → $43,000) up to millions, even just using the pre-trained weights places high demands on GPUs and other hardware.

The parameter counts of large models are growing exponentially: BERT (110 million), T5 (11 billion), GPT-3 (175 billion), PanGu (200 billion)... Developing high-performance small models is another valuable direction for deep learning.

Fortunately, there have been some good lightweight works in the NLP field, such as TinyBERT[3], FastBERT[4], etc.

  4. Models lack common sense and reasoning

As the original poster mentioned, deep learning's current understanding of human emotion remains at the shallow semantic level; it lacks real reasoning ability and cannot truly understand user intent. Beyond that, how to effectively integrate common sense or background knowledge into model training is also one of the bottlenecks deep learning needs to overcome.

When, one day, a deep learning model can not only write poems, solve equations, and play Go, but also answer everyday common-sense questions, it will truly deserve to be called "intelligent".

  5. Limited application scenarios

Although NLP has many subfields, the directions that have developed best are still classification, matching, translation, and search, and the application scenarios of most other tasks remain limited.

For example, chatbots generally serve as the fallback module of a question-answering system, replying with a standard anthropomorphic script when the FAQ or intent module fails to match the user's question. If a chatbot is deployed directly in the open domain, however, it easily turns from artificial intelligence into artificial stupidity, which puts users off.

  6. Lack of an efficient automatic hyperparameter search scheme

Deep learning involves many hyperparameters. Although there are automated tuning tools such as Microsoft's NNI [5], tuning still relies on the algorithm engineer's personal experience; and because training takes so long, validating each configuration carries a high time cost.

In addition, AutoML still requires large-scale compute to produce results quickly, so the scale of computation also needs attention. A minimal skeleton of such a search is sketched below.
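This sketch is mine, not any real NNI API; `train_and_eval` is a hypothetical stand-in for a full training run, which is exactly what makes the validation loop so time-consuming in practice. Tools like NNI add smarter samplers, early stopping, and parallel scheduling on top of a loop like this:

```python
# A bare random-search loop over a hyperparameter space.
import random

search_space = {
    "lr": lambda: 10 ** random.uniform(-5, -2),
    "batch_size": lambda: random.choice([16, 32, 64, 128]),
    "dropout": lambda: random.uniform(0.0, 0.5),
}

def train_and_eval(cfg):
    return random.random()   # stand-in for: train with cfg, return val score

best_cfg, best_score = None, float("-inf")
for trial in range(20):       # each trial is one full, slow training run
    cfg = {name: sample() for name, sample in search_space.items()}
    score = train_and_eval(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print("best config:", best_cfg, "score:", round(best_score, 3))
```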

  7. Some papers are oriented only toward competition SOTA

Pushing a well-known competition to SOTA and then publishing a paper has been the practice of many researchers (myself included, once). A typical pipeline is:

Climb to first place on the leaderboard at any cost in resources;

Then work backwards to explain why the method works so well (somewhat like a post-hoc rationalization).

Of course, this is not to say the approach is worthless, but we should not do research only to climb leaderboards. In many cases, squeezing out another 0.XX% after the decimal point is genuinely meaningless and brings little benefit to the development of deep learning.

This also explains why an interviewer who asks "how did you get such a good result in that competition" is put off on hearing "multi-model ensembling" and other model-stacking answers: real-world scenarios are constrained by resources, time, and other factors, and things are generally not done that way.

  8. Poor interpretability

The last point is also a common problem in this field. The entire deep learning network is like a black box, lacking clear and transparent interpretability.

For example, why does adding a little noise perturbation (that is, an adversarial example) to a picture of a giant panda make the model classify it as a gibbon with 99.3% confidence?

Visualizing the features learned by some models (CNN, Attention, etc.) may help us understand how the model learns. Previously, the field of machine learning also used dimensionality reduction techniques (t-SNE, etc.) to understand the distribution of high-dimensional features.
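A minimal sketch of that dimensionality-reduction approach (synthetic stand-in features of my own making; scikit-learn assumed available):

```python
# Project high-dimensional features (e.g. a model's penultimate-layer
# activations) to 2-D with t-SNE to eyeball how classes cluster.
import numpy as np
from sklearn.manifold import TSNE

# stand-in for real features: 500 samples x 128 dims, two loose clusters
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (250, 128)),
                      rng.normal(3, 1, (250, 128))])

embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedding.shape)   # (500, 2) -- ready to scatter-plot, colored by label
```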

More research on deep learning interpretability can be found in [6].

Recently, the 2018 Turing Award winners Bengio, LeCun, and Hinton were invited by ACM to review together the basic concepts and breakthrough achievements of deep learning, and they also discussed the challenges facing its future development.

5.2 Author: Zhihu User

https://www.zhihu.com/question/40577663/answer/224699031

After reading some answers, I feel that what everyone says is reasonable, but many of the bottlenecks mentioned are bottlenecks of "machine learning", not of "deep learning". Let me venture a forced answer below.

In deep learning, depth is the appearance, not the goal. The universal approximation theorem shows that a single hidden layer suffices to approximate any continuous function, so the point is not depth itself. What distinguishes deep learning from traditional machine learning is representation learning: learning the essential characteristics (representations) of the data through a well-designed hierarchical structure.
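For reference, one common (Cybenko/Hornik-style) statement of that theorem, with $\sigma$ a fixed non-polynomial activation such as the sigmoid:

```latex
% A single hidden layer with enough units can approximate any continuous f:
\forall f \in C([0,1]^n),\; \forall \varepsilon > 0,\;
\exists N,\; v_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^n
\ \text{such that}\quad
\sup_{x \in [0,1]^n} \Big| f(x) - \sum_{i=1}^{N} v_i\,\sigma(w_i^{\top} x + b_i) \Big| < \varepsilon
```

Note the theorem guarantees existence only: it says nothing about how many hidden units are needed or how to find the weights, which is exactly why depth and representation learning still matter in practice.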

As for bottlenecks: deep learning is a kind of machine learning, so it shares machine learning's own bottlenecks, for example its heavy dependence on data. It is the "behavioral intelligence" of data rather than artificial intelligence with real self-awareness. The answers above cover these points well.

Besides that, it has some unique bottlenecks.

  1. The feature structure is hard to change. The data format (size, length, color channels, text dictionary format, etc.) is demanding, and a trained feature extractor is not so easy to transfer to other tasks.
  2. It is very unstable. In NLP tasks such as text generation (QA) or image captioning, the generated content will sometimes leave you dumbfounded, even though it can often be impressive. This uncontrollability keeps deep learning out of many engineering applications: applications that cannot sacrifice recall or precision are dangerous to implement with it. By contrast, rule-based methods are far more reliable; at least when something goes wrong, you can debug them.
  3. It is hard to hotfix; when something goes wrong, you basically have to retrain the parameters. Many practical difficulties arise during deployment.
  4. The optimization of deep models relies too much on personal experience. The world's three major metaphysics: Western astrology, Eastern Zhouyi, and deep learning.
  5. Model structures are becoming more and more complex, and integrating different systems is becoming harder and harder. It is as if super-soldiers keep being raised, but they share no common language and so cannot form a super-army.
  6. Sensitive-information issues. If the training data is not desensitized, it may be possible to extract sensitive information from the model through certain methods. There is also the attack problem: the existence of adversarial examples has been confirmed, and crafting a few of them can directly defeat existing algorithms. My feeling, though, is that adversarial examples arise because feature extraction fails to learn the manifold structure of the data; in other words, a degree of overfitting brings about this problem.
  7. The biggest problem at present, however, is the demand for massive data. Because the model must learn the true distribution, and our data is only a small sample drawn from that distribution, making the model really approximate the true distribution requires as much data as possible. Once the demand for data volume grows, many questions follow: Where does the data come from? Where is it stored? How is it cleaned? Who labels it? How do we train on such volumes? How do we trade off cost (equipment, data) against effect?
  8. Extending point 7: is deep learning that requires massive amounts of data really "artificial intelligence"? I, for one, don't believe it. The human brain can generalize from limited knowledge, rather than relying on human-designed guidelines to steer machine learning toward some region of feature space. Real artificial intelligence should not demand so much data and compute! (Admittedly, this is really a machine learning problem.)

In short, many factors limit its application. But viewed optimistically, problems are nothing to fear; they can always be solved.

5.3 Author: He Zhiyuan

Let me simply speak my mind. In my view, most current deep learning models, however complex their network architecture, are actually doing the same thing:

Use a large amount of training data to fit an objective function y = f(x).

Here x and y are the model's input and output; for example:

  • Image classification: x is typically an image as a width × height × channels numeric array, and y is the class label.
  • Speech recognition: x is the sampled speech signal, and y is the corresponding text.
  • Machine translation: x is a sentence in the source language, and y is a sentence in the target language.

And "f" represents the model in deep learning, such as CNN, RNN, LSTM, Encoder-Decoder, Encoder-Decoder with Attention, etc. Compared with traditional machine learning models, models in deep learning usually have two characteristics:

  • The model has a large capacity and many parameters;
  • Trained end-to-end.

With the help of GPU computing acceleration, deep learning can optimize large-capacity models end-to-end, thereby surpassing traditional methods in performance. This is the basic methodology of deep learning.
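In code, that methodology reduces to a loop like the following minimal sketch (a toy regression task of my own choosing, with PyTorch assumed):

```python
# Pick an over-parameterized model f, feed it (x, y) pairs end to end,
# and let gradient descent (GPU-accelerated in practice) do the fitting.
import torch

f = torch.nn.Sequential(              # "f": a small but capacious MLP
    torch.nn.Linear(10, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)
opt = torch.optim.SGD(f.parameters(), lr=0.01)

x = torch.randn(1024, 10)             # stand-ins for real training data
y = x.sum(dim=1, keepdim=True)        # the "ground truth" mapping to learn

for epoch in range(100):
    loss = torch.nn.functional.mse_loss(f(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```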

So, what are the downsides of this approach? Personally, I think there are the following points.

The efficiency of training f is not high:

Training efficiency suffers in two respects. First, training takes a long time. As we all know, deep learning relies on GPU acceleration, yet even so, training time is measured in hours or days. When the data volume is large and the model complex (as in face recognition or speech recognition with large sample sizes), training time is measured in weeks or even months.

The other efficiency problem is low sample utilization. A small example: pornographic-image detection. A human needs only a few "training samples" to learn to recognize pornography, and judging which pictures are "pornographic" is trivial. Training a deep learning porn-detection model, however, often takes tens of thousands of positive and negative samples, as in Yahoo's open-source yahoo/open_nsfw. In general, deep learning models need far more examples than humans to learn the same thing, because humans bring a great deal of "prior knowledge" to the task, whereas we lack a unified framework for giving deep models the corresponding priors.

So how are these two problems handled in practice? For long training times, the answer is more GPUs; for low sample utilization, the answer is more labeled samples. But both GPUs and samples cost money, and money is often the binding constraint on real projects.

The unreliability of the fitted f itself:

We know that deep learning can greatly outperform traditional methods. But such performance figures hold only in a statistical sense and cannot guarantee correctness on individual cases. For instance, an image classification model with 99.5% accuracy correctly classifies 9,950 of 10,000 test images. Yet for a new image, even if the model outputs a very high classification confidence, we cannot guarantee the result is correct, because confidence and actual accuracy are not inherently equivalent. The unreliability of f also shows up in the model's poor interpretability: in a deep model, it is usually hard to say clearly what each parameter means.

A typical example is the adversarial example. A neural network recognizes an original picture as a "panda" with 60% confidence, but after we add a small noise perturbation to the image, the network recognizes it as a "gibbon" with 99% confidence. This shows that deep learning models are not as reliable as one might imagine.
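Such examples are classically produced with the fast gradient sign method (FGSM); the sketch below is a generic version of it (`model`, `image`, and `label` are placeholders, not code from any cited work):

```python
# FGSM: nudge each pixel by epsilon in the direction that increases the loss.
import torch

def fgsm_attack(model, image, label, epsilon=0.007):
    image = image.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # step along the sign of the input gradient, clamp to valid pixel range
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```

A perturbation of epsilon ≈ 0.007 per pixel is imperceptible to a human, yet it is often enough to flip the prediction, which is what makes the panda-to-gibbon result so unsettling.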

In some key fields, such as the medical field, if a model can neither guarantee the correctness of the results nor explain the results well, it can only serve as an "assistant" for humans and cannot be widely used.

Can f achieve "strong artificial intelligence":

This last question is a bit metaphysical rather than a concrete technical issue, but it does no harm to discuss it.

Many people care about artificial intelligence because they care about realizing "strong artificial intelligence". Following the deep learning recipe, we might frame human intelligence this way: x is a person's various sensory inputs, y is the person's behavioral outputs such as speech and action, and f represents human intelligence itself. Can human intelligence, then, be obtained by brute-force fitting of f? Opinions differ; my own inclination is that it cannot. Human intelligence seems closer to conceptual abstraction, analogy, reflection, and creation than to a black box f pulled out directly. Deep learning methods may need further development before they can simulate real intelligence.


Origin: blog.csdn.net/wzk4869/article/details/131326489