Zidong Taichu: How many high-quality papers does it take to build a domestic large model?

Original: Tan Jing

"This round of visual self-supervision algorithm, have you not kept up?"

The friend across from me, an AI algorithm engineer at a major Internet company with an annual salary of nearly 700,000 yuan, answered my concern with a rhetorical question:

"How can self-monitoring keep up?"

He looked up and added,

"Self-monitoring is not a (technology) that directly lands in business."

This was a day in June 2023.

The world is changing; even inside a storm, it is still possible to "fall behind" the storm. Everyone worries about being left behind, and some people really are, right at the moment ChatGPT struck its ruthless blow.

With no chance to train large models themselves, reading papers and studying the accompanying code has become the standard move for "keeping up with the times", "fighting anxiety", and "answering to the boss".

Reading papers takes real effort. Even as an observer and writer covering large models, I feel deeply that only by studying the papers can I avoid looking like a fool when I sit down at the keyboard.

There are plenty of famous quotes about large models in circulation, but most of them do little to reveal what large models actually are.

Putting in the effort is a luxury, and without that effort a domestic general-purpose large model is simply impossible.

"Kung Fu" is a very Chinese philosophical term, which can have a wide range of meanings: vision, innovation, determination, teamwork, dedication...

Papers are a good thread to follow, so I read many academic papers from the "Zidong Taichu" large model team.

Here, I would like to thank Dr. Wang Jinqiao, Dean of the Wuhan Institute of Artificial Intelligence (and researcher at the Institute of Automation, Chinese Academy of Sciences), and Dr. Zhang Jiajun, Vice President (also a researcher at the Institute of Automation, Chinese Academy of Sciences).

They answered more than a hundred of my questions, one after another; sometimes their WeChat replies arrived close to the small hours of the morning. That is what made this series of articles possible.


The technical topic of this post is visual self-supervision. Visual self-supervised learning is one branch of self-supervised learning techniques.

When it comes to self-supervised learning, one can't get around Yann LeCun's metaphor: "If artificial intelligence is a cake, then most of the cake is self-supervised learning, the icing is supervised learning, and the cherry on top is reinforcement learning." The statement is still controversial, but I personally like it very much.

Go back to April 2021, when the following paper was produced.

[Paper screenshot: the April 2021 DPT paper]

Let me first quote a view from Dean Wang Jinqiao to set the tone for the first paper.

The Transformer is not necessarily always the best, and its underlying principles deserve deeper exploration. More than a decade ago the convolutional neural network unified the vision arena, but "unified" does not mean "best".

I came to realize that it is precisely during a technology's "freshness period" that its underlying algorithmic structure most needs innovation.

To recount a bit of neural network history: the convolutional network (CNN) was the gathering storm, and the residual network (ResNet) was the downpour.

In 2015, ResNet, the representative work of Kaiming He and his team, swept the world as soon as it came out. It was a fundamental innovation on the CNN: in essence, it solved the problem that deep CNNs could not be trained successfully. Before that, AI scientists faced the problem that once a CNN was built deep, with many layers (more than a dozen or so), training became extremely strenuous.

That paper, cited more than 120,000 times, solved this major training problem in one stroke.

Old stories grow old, but the law of innovation stays the same.

The present stage is still the Transformer's "freshness period". So, will there be a "ResNet moment" that belongs to the Transformer?

After listening to Dean Wang Jinqiao's explanation, I understood that this "moment" will only arrive through a deep understanding of the Transformer and bold innovation on top of it. After all, algorithm design is a science of intuition (inspiration) plus experimentation.

The Transformer is the basic "element" of ChatGPT. "The Transformer first shone in natural language processing, and within just a few years expanded into vision." Statements like this appear in many large model papers.

Having become unbeatable in NLP, the Transformer then marched into vision. Vision needs the Transformer, and it also needs the diligence and ingenuity of AI scientists.

Transformer has its own unique way of playing.

At the start of training, when data is fed in, a delicate "knife technique" comes into play: the image is segmented. Put simply, the picture is cut into small blocks.

Later I will use an "artistic photo" of an eagle to explain the subtlety of this knife work. Many AI scientists have laboured over the "knife technique"; it is an interesting "point" innovation.
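Before looking at how the paper refines the knife technique, here is a minimal sketch of the standard "hard" cut used by ViT-style models, written in PyTorch; the 16-pixel patch size and the toy image shape are illustrative choices, not values from the DPT paper.

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Cut a batch of images (B, C, H, W) into a fixed grid of flattened patches."""
    B, C, H, W = images.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must divide evenly"
    # split height and width into (num_patches, patch_size) pairs
    x = images.reshape(B, C, H // patch_size, patch_size, W // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5)                       # (B, nH, nW, C, p, p)
    return x.reshape(B, -1, C * patch_size * patch_size)  # (B, num_tokens, token_dim)

tokens = patchify(torch.randn(2, 3, 224, 224))  # -> (2, 196, 768): one token per block
```

Every block becomes one token, regardless of what is in it; that rigidity is exactly what the paper below tries to relax.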

On the migration from text to vision, Dean Wang Jinqiao told me the "crux":

"Text information is naturally one-dimensional, which can be well segmented into words or words. Visual information is often two-dimensional, and the direct and equal segmentation method is easy to damage the semantic structure of the visual target, resulting in small pieces of pictures and the semantics of tokens. I can't match it."

The authors believe that, to use the Transformer structure for vision tasks, a way of adjusting how the image is cut is needed. The paper therefore proposes a deformable Transformer (DPT) structure that divides pictures adaptively. Compared with "hard" segmentation, the performance improvement is "visible to the naked eye".

Take a look at this photo of an eagle, and notice the details of each part of its body: the tail, the claws...

[Photo: the eagle image divided into patches of varying sizes]

In the picture, the eagle's tail occupies a larger area, so the image there is divided into larger blocks; the claws occupy a smaller area, so the blocks there are smaller. The rule behind this "knife method" is easy to see. My understanding: don't cut the picture rigidly, cut it according to the parts of the eagle.

Each image block corresponds to a token; the "knife method" is to cut the picture so that, as far as possible, each small block carries a single, coherent semantic. In academic terms: regions connected by attention are grouped into the same division.

Dean Wang Jinqiao explained that Transformer papers for vision fall into two categories:

The first category is "assembly innovation": stacking building blocks on top of the existing Transformer, which does bring some benefit to feature extraction and downstream task performance.

The second category is optimization of the internal network structure, including the design of the self-attention mechanism, the network topology, and the position encoding. DPT belongs to this category. It can be used for both supervised and self-supervised learning; it is a basic backbone structure.

The deformable Transformer in the DPT paper is the core underlying technology of the visual encoding part of the Zidong Taichu large model.

In recent years the landscape of visual self-supervised learning has quietly shifted, starting from the use of the Transformer for vision tasks. One kind of patience is taking one step at a time before large model technology became fashionable. They were not the first movers of global large-model innovation, but patience rewards those who put in the work.

Going back to July 2021: this is the most important foundational paper of the Zidong Taichu large model, laying the groundwork for cross-modal understanding. Three experts from the Wuhan Institute of Artificial Intelligence, Dr. Liu Jing, Dr. Wang Jinqiao, and Dr. Zhang Jiajun, came together to guard the "three modalities" and dig the moat deep.

[Paper screenshot: the July 2021 tri-modal paper]

In the paper, three modalities, image, text, and audio, are aligned. The question is: with three modalities, which one should the others be aligned to?

The technical route of this paper is the technical route of the Zidong Taichu large model: align the image and audio modalities to text and unify them in the language space. Around the world, multimodal large-model routes each have their own strengths and weaknesses. The American company Meta took the lead on open-source large models, first releasing the open-source LLaMA and then open-sourcing the multimodal model ImageBind. The foreign route is ambitious: a single model covers as many as six modalities.

ImageBind's path is to take the visual representation as the anchor and unify everything in the visual space.

Different types of data in a multimodal large model carry different "information gold content". Which modality to unify around is one of the key decisions scientists make; it is both a strategic judgement and a positioning.

From another perspective, this is not only a difference in academic views, but also a dispute over routes.

The months flew by. Dean Wang Jinqiao kept the R&D drumbeat calm but intense, and the October 2021 paper arrived.

[Paper screenshot: the October 2021 MST paper]

Any discussion of the pioneering works applying the Transformer to vision has to mention iGPT and ViT.

ViT is supervised, while iGPT is self-supervised. A great deal of work focuses on improving the effectiveness and efficiency of visual self-supervision, and this paper is in that direction.

The paper uses two methods, reconstruction and contrastive learning.

First, reconstruction. This method derives from masked training in NLP. In a language model, the model does not know that the covered character is "Tan"; the goal of the loss function is to make the output character as close as possible to the covered "Tan". In NLP the covered parts are words; in vision they are small image blocks. Reconstruction means that after each small block of the picture is covered, the model should be able to rebuild it.
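As a minimal sketch of this "cover a block, then rebuild it" idea, here is an MAE-style masked reconstruction loss in PyTorch; the linear encoder and decoder are hypothetical stand-ins, not the architecture from the paper.

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) patch tokens. Returns MSE computed only on the masked tokens."""
    B, N, D = patches.shape
    num_masked = int(N * mask_ratio)
    # random 0/1 mask: 1 = hidden (to be reconstructed), 0 = visible
    noise = torch.rand(B, N)
    mask = torch.zeros(B, N)
    mask.scatter_(1, noise.topk(num_masked, dim=1).indices, 1.0)

    visible = patches * (1 - mask).unsqueeze(-1)   # zero out the covered blocks
    pred = decoder(encoder(visible))               # predict every patch position
    # measure error only where the input was hidden, like guessing the covered word
    per_token = ((pred - patches) ** 2).mean(dim=-1)
    return (per_token * mask).sum() / mask.sum()

# toy usage with linear stand-ins for the encoder and decoder
enc, dec = nn.Linear(768, 256), nn.Linear(256, 768)
loss = masked_reconstruction_loss(torch.randn(2, 196, 768), enc, dec)
```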

Second, contrastive learning. Contrastive learning discovers features by comparing the similarities and differences between visual images. Put simply: two views of the same picture should be as close as possible in feature space, while different pictures should be as far apart as possible. At the time, contrastive learning was the most popular form of visual self-supervised learning.

In one photo Tan Jing is turning around, in another she is running, and in a graduation photo two-thirds of her body is blocked. The essence of the method is to compare appearances and judge: is this the same person, or someone else?

By learning the contrasts, the model learns the features of the visual representation. The emergence of contrastive learning shows that AI scientists have reached a deeper understanding of visual representation learning.
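Here is a minimal sketch of that idea as a generic InfoNCE-style loss in PyTorch: two views of the same image should land close in feature space, and different images far apart. It is illustrative only, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (B, D) features of two views of the same B images; row i matches row i."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature     # (B, B) similarity of every pair
    targets = torch.arange(z1.size(0))     # diagonal pairs are the positives
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```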

The contrastive learning wave was kicked off early on by Hinton's team at Google, which proposed the SimCLR algorithm in 2020. Kaiming He's team at Meta AI Research in the United States pushed the work forward with MoCo. They compare contrastive learning to looking something up in a dictionary: as everyone knows, looking it up by index is more efficient than paging through the dictionary in order.

(One approach searches through the whole queue in sequence; the other looks up the matching key directly from a global dictionary. On top of that, momentum-updated parameters change more smoothly, which keeps the model stable.)
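A minimal sketch of the momentum ("dictionary") update that MoCo popularised: the key encoder is a slowly moving copy of the query encoder, which keeps the dictionary of features stable between steps. The shapes and the momentum value here are toy choices.

```python
import torch
import torch.nn as nn

query_encoder = nn.Linear(768, 128)
key_encoder = nn.Linear(768, 128)
key_encoder.load_state_dict(query_encoder.state_dict())  # start from identical weights

@torch.no_grad()
def momentum_update(m: float = 0.999):
    # key weights drift slowly toward the query weights; m close to 1 means "smoother"
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)
```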

At that time, the crosshairs were trained on the performance of visual self-supervised learning. But this was not the final battleground.

Visual self-supervised learning faced two problems: insufficient extraction of local information and loss of spatial information. To overcome them, the paper proposes MST, which captures local relationships between image patches while preserving global semantic information.

The paper's method has two steps. The first is to improve the Transformer structure (as in the aforementioned DPT paper). The second, in order to train a general visual feature representation, that is, to train the visual encoder, is to combine contrastive learning with reconstruction training. A single training run has two objective functions, which is equivalent to satisfying two conditions at once.

In the end, the memory gained from reconstruction strengthens the contrastive learning, so the effect of self-supervised learning improves and training is also faster.
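To make the "one training run, two objective functions" point concrete, here is a sketch of a single combined step, reusing the illustrative helpers from the earlier snippets; the 0.5 weighting is an assumption, not a value from the MST paper.

```python
# two random tensors stand in for the features of two augmented views of the same images
z1, z2 = query_encoder(torch.randn(8, 768)), query_encoder(torch.randn(8, 768))
patches = torch.randn(8, 196, 768)
total = contrastive_loss(z1, z2) + 0.5 * masked_reconstruction_loss(patches, enc, dec)
total.backward()  # one backward pass pushes the encoder to satisfy both conditions
```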

Contrastive learning algorithms are more complex than language-model algorithms, but this problem will be solved as the field develops. The landmark will be the birth of a general-purpose visual large model: one that can understand all kinds of pictures and that also has language ability, so that what the visual model understands can be expressed in words.

Dean Wang Jinqiao believes that a mature visual large model at this stage will be dual-modal, covering both images and text, so that people can understand and explore the world through vision. Of course, this goal is still being chased. Current general vision models often make mistakes, but they have also demonstrated strong generality and the ability to handle complex tasks.

Wang Jinqiao's exact words were: "This kind of classic general vision model will come out before the end of 2023. If we don't build it, OpenAI will. That is the competition."

Kung fu cannot be rushed, and the calendar turns to March 2022.

[Paper screenshot: the March 2022 UniVIP paper]

After reading many survey papers, I learned that by this period contrastive learning had become the mainstream method of visual self-supervised learning.

In this period, contrastive learning relied on large numbers of single-object images, and that reliance became a limitation.

What is a single-object image? For example, the training goal is to have the model find a horse, and the picture contains nothing but the horse. That is more like a laboratory task; in the real world, the task usually involves images with multiple objects.

If you ask me: take "on the ancient road, in the west wind, a lean horse", with the lean horse in all sorts of poses, can the model still recognize it? The essence is to understand the object itself.

And the relationship between carts and horses in "I built my hut amid the world of men, yet hear no clamor of carts and horses" is, in essence, the relationship between objects and the scene, which the model also needs to learn.

The essence of many common mistakes is that the large model does not understand "relationships". If so, I reckon that at the next stage, image generation might well put the cart on top of the horse's head.

Dean Wang Jinqiao believes the goal of this paper is to learn relational features. Once learned, the model masters the "relationships": between one image block and another, and between the semantics of the corresponding tokens.

The paper's method is to build a visual large-model pre-training framework, UniVIP, and use this unified framework to learn the statistical properties between different image blocks. In other words, UniVIP is a framework for training a visual self-supervised encoder. Learning all the latent semantic relationships could be called an implicit knowledge graph.

I wrote that paragraph, but the phrasing is admittedly dry. Knowledge graphs are good at relationships, and relationships are a class of features.

In the previous stage, vision tasks did a lot of "finding things", such as object detection. But there are also regularities hidden in the relationships between those objects, and the model needs to keep learning them.

Dean Wang Jinqiao explained: "You can't understand only the local content of a picture. The self-supervised pre-training framework UniVIP is used to learn relationships within and between images (for example, scene-to-scene similarity, scene-to-object affinity, and object-to-object distinguishability)."

Behind the advance of visual self-supervised learning, the researchers themselves are advancing in step.

Not only that. Dean Wang Jinqiao continued: "In the past the model learned only one level; now it learns three levels in one go. The technical term is learning a unified representation across different granularities. The more it learns, the more it understands. The authors' aim is to try to learn all the feature information contained in the visual signal."

Onward and upward. The authors hope the visual large model can learn general knowledge of the world, build on what came before, and keep moving in the direction of a general-purpose large model.

If a large model cannot achieve "unified representation", there is no point talking about "general". Small models have their uses, but the working pattern of a pile of small models "cooperating" will not be the mainstream.

The vast majority of practitioners start from small models. Dean Wang Jinqiao's view: "Don't let the screw in front of your eyes pin you down and limit your judgement of the overall situation."

What was successful in the past may not be successful in the future.

Working hard also means fighting tough battles, investing heavily, taking a long time, and being patient.

The October 2022 paper introduces textual knowledge into the visual model. The authors keep pushing along the road of unified representation for visual multi-tasking, looking past the troubles at hand, looking forward, and exploring in depth.

[Paper screenshot: the October 2022 paper]

The focus of the May 2023 paper is to solve two problems in current masked self-supervised learning, inefficient training and inconsistent predictions, so that data is fully used during pre-training and predictions converge toward consistency.

[Paper screenshot: the May 2023 paper on balanced masking and self-consistency]

MAE's masking uses random sampling, and every sample is different, so the large model needs many training passes, which is inefficient. (In the paper's notation, K is the total number of tokens in a picture.)

During the process, pixel blocks of different sizes are masked; 4x4, for example, means 16 pixels are covered at a time. A bad situation can arise: those 16 pixels may be sampled many times, or never sampled at all. We call this per-region sampling imbalance, and it introduces uncertainty into the model's convergence.

This paper explores balanced sampling. First, the mask distribution over regions is made uniform, meaning each region is covered the same number of times and with the same probability. Second, data sampling is balanced too: different regions of different pictures are sampled a balanced number of times.

In addition, the authors propose a self-consistency loss, which makes the predictions from different input combinations agree at the same position. From a modeling perspective, the model is pushed to satisfy the principle of self-consistency, which in turn drives consistent predictions.
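To make the two ideas concrete, here is a minimal sketch under stated assumptions: a balanced masking scheme in which every token is covered (almost) equally often across views, and a self-consistency term that asks two masked views to agree wherever both hid the same token. Both helper functions are illustrative, not the paper's exact formulation.

```python
import torch

def balanced_masks(num_tokens: int, num_views: int, mask_ratio: float = 0.5):
    """(num_views, num_tokens) 0/1 masks where each token is covered (almost) equally often."""
    per_view = int(num_tokens * mask_ratio)
    # walk over a shuffled, repeated index list so coverage counts stay balanced
    order = torch.randperm(num_tokens).repeat((num_views * per_view) // num_tokens + 1)
    masks = torch.zeros(num_views, num_tokens)
    for v in range(num_views):
        masks[v, order[v * per_view:(v + 1) * per_view]] = 1.0
    return masks

def self_consistency_loss(pred_a, pred_b, both_masked):
    """Ask two views' predictions (B, N, D) to agree wherever both views hid the token."""
    disagreement = ((pred_a - pred_b) ** 2).mean(dim=-1)          # (B, N)
    return (disagreement * both_masked).sum() / both_masked.sum().clamp(min=1)

masks = balanced_masks(num_tokens=196, num_views=2)
both = masks[0] * masks[1]  # positions hidden in both views
loss = self_consistency_loss(torch.randn(4, 196, 32), torch.randn(4, 196, 32), both)
```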

Masked self-supervised learning is the focal point of visual self-supervision, and how to make masking more efficient is a question scientists must answer at this stage. Making good use of masked-modeling (MLM-style) techniques matters, and improving them matters even more. The entry point of this paper's method is to make full use of the data, balance the sampling, and get the model to converge with as few training iterations as possible.

I have saved the August 2020 paper for last, because it is the low-level groundwork of the large model.

[Paper screenshot: the August 2020 distributed-training paper]

A distributed training framework is also called underlying basic software. This is an academic paper that "has its cake and eats it too". Why say so? I will reveal the answer later; first, let me present the substance of the paper.

Large models need computing clusters to complete their arduous training. If large models had only three hard problems, the distributed framework would be one of them.

As a typical piece of infrastructure work, this is an area where Google's Jeff Dean team leads the world. This paper comes from the Zidong Taichu large model team, and it amounts to publishing an existing, proven piece of engineering practice as an academic paper.

Without a distributed framework, there is no way to train a model with a huge number of parameters. From the standpoint of paper output, this result came from running the large model's infrastructure work on an object detection task, and along the way it happened to become a top-conference paper. The person in charge of this work is Dr. Zhu Yousong.

At the same time, I also learned that this distributed framework used to run on the early computing clusters of Kunpeng Lab.

Object detection means locating and identifying objects in images or videos, an important problem in computer vision. In this paper, though, the authors are "Xiang Zhuang performing the sword dance with his eye on Pei Gong": the detection task is not the goal itself, but a way to lay the foundation for the visual self-supervision that follows.

Training with larger batches of samples makes training harder in two ways: on one hand, the underlying software supporting training must be strong; on the other, gradient optimization techniques are needed.

The paper's gradient optimization method is PMD-LAMB (Periodical Moments Decay LAMB), periodic moment decay optimization. The insight behind the algorithm: each network update depends on the accumulated historical gradient, and the lag this introduces hinders fast convergence. The design therefore uses a periodic moment decay function to control how much the historical gradient contributes to each update, so that the computed gradient stays well controlled and gradient explosion is avoided.

The moment decay function works like an ordered buffer: first in, first out, kept at a fixed size, like a transit room. Once roughly 6,000 samples have passed through the room, entries and exits, and therefore the gradient, are kept under effective control, and the loss curve descends more smoothly during training.
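As a rough illustration of the "periodically decay the historical moment" idea, and only as an illustration, here is a simplified sketch; the cosine schedule, the constants, and the update rule itself are my assumptions and not the actual PMD-LAMB algorithm from the paper.

```python
import math
import torch

def periodic_decay(step: int, period: int = 6000) -> float:
    """A factor that falls from 1 toward 0 within each period, then restarts."""
    phase = (step % period) / period
    return 0.5 * (1.0 + math.cos(math.pi * phase))

@torch.no_grad()
def pmd_style_step(param, grad, moment, step, lr=1e-3, beta=0.9, period=6000):
    # down-weight the accumulated historical moment before mixing in the fresh gradient,
    # so stale history cannot dominate the update or blow the gradient up
    moment.mul_(beta * periodic_decay(step, period)).add_(grad, alpha=1.0 - beta)
    param.add_(moment, alpha=-lr)
```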

At this stage, the problem of unified multi-task modeling in visual self-supervised learning has not been fully solved, which is one reason large visual models still lack generality.

After writing this, I believe readers should gradually understand that the complexity of visual self-supervision is much higher than that of language self-supervision.

The sampling space of visual self-supervised learning is large, so the range of random sampling is large, and during masking the complexity climbs steeply. In NLP only words are covered; text is one-dimensional, while vision is two-dimensional or even three-dimensional.

The technical difficulty of refining a domestic large model is unprecedented. The "Zidong Taichu" large model has crossed mountains and rivers, validating its work from the twin perspectives of first-rate engineering practice and advanced theory.

Domestic large models are destined to have a hard road. As a science and technology writer, I too feel heavy pressure to learn and improve. Scientists work day and night to meet the challenges of the era, and sometimes they are my spiritual pillar as well. Starlight at midnight, first light at dawn: whenever I can barely keep going through the overtime, I think of them working overtime too, feel much better, and keep writing.


Related Reading


1. In-depth chat with Dr. Zhang Jiajun: What papers are worth reading behind the big model of "Zidong Taichu" (1)

2.  Wuzhiyuan BigTrans: Allowing large-scale language models to have more than 100 language capabilities


read more

AI large model and ChatGPT series:

1. ChatGPT is on fire, how to set up an AIGC company and make money?

2.  ChatGPT: Never Bully Arts Students

3.  How does ChatGPT learn by analogy? 

4.  Exclusive丨From the departure of the great gods Alex Smola and Li Mu to the successful financing of AWS startups, look back at the evolution of the "underlying weapon" in the era of ChatGPT large-scale models

5.  Exclusive 丨 Former Meituan co-founder Wang Huiwen is "acquiring" the domestic AI framework OneFlow, looking to add a new general from light years away

6.  Is the ChatGPT large model used in criminal investigation and solving cases only a fictional story?

7.  Game of Thrones of the Large Model "Economy on the Cloud"

8.   CloudWalk's large-scale model: what is the relationship between the large model and the AI ​​platform? Why build an industry model?

9.  In-depth chat丨Fourth Paradigm Chen Yuqiang: How to use AI large models to open up the trillion-scale traditional software market?

10. In-depth chat丨JD Technology He Xiaodong: A "departure" nine years ago: laying the foundation for multi-modality and competing for large models

AI large model and paper series:

1. The open source "imitation" of ChatGPT actually works? UC Berkeley thesis, persuasion, or move forward?

comic series

1.  Is it joy or sorrow? AI actually helped us finish the Office work

2.  AI algorithm is a brother, isn't AI operation and maintenance a brother?

3.  How did the social bullishness of big data come about?

4.  AI for Science, is it "science or not"?

5.  If you want to help mathematicians, how old is AI? 

6.  The person who called Wang Xinling turned out to be the magical smart lake warehouse

7.  It turns out that the knowledge map is a cash cow for "finding relationships"?

8.  Why can graph computing be able to positively push the wool of the black industry?

9.  AutoML: Saving up money to buy a "Shan Xia Robot"?

10.  AutoML : Your favorite hot pot base is automatically purchased by robots

11. Reinforcement learning: Artificial intelligence plays chess, take a step, how many steps can you see?

12.  Time-series database: good risk, almost did not squeeze into the high-end industrial manufacturing

13.  Active learning: artificial intelligence was actually PUA?

14.  Cloud Computing Serverless: An arrow piercing the clouds, thousands of troops will meet each other

15.  Data center network : data arrives on the battlefield in 5 nanoseconds

16.   Data center network : It’s not scary to be late, what’s scary is that no one else is late

AI framework series:

1. The group of people who engage in deep learning frameworks are either lunatics or liars (1)

2. The group of people who engage in AI frameworks 丨 Liaoyuanhuo, Jia Yangqing (2)

3. Those who engage in AI frameworks (3): the fanatical AlphaFold and the silent Chinese scientists

4. The group of people who engage in AI framework (4): the prequel of AI framework, the past of big data system

Note: (3) and (4) are included only in the published book, titled "I Saw the Storm".

