What technical difficulties must be overcome to build multimodal large models?

Building multimodal large models is a frontier topic in artificial intelligence: the aim is to create models with broad adaptability and high flexibility that can meet the challenges of different domains and tasks. Achieving this, however, requires overcoming many technical difficulties. In working through these problems, Mr. He Xiaodong has offered some distinctive insights and viewpoints, pointing a direction for the development of multimodal large models.


In the study of multimodal large models, we face several technical difficulties. The first is deciding at which level multimodal fusion should take place. It is not enough simply to give a language model multimodal capabilities, because that can be achieved by merely calling another model. For example, if we let a language model call the Midjourney model to draw images, the combination appears to handle multiple modalities at the task level, but at the model level the two models remain separate, and no multimodal intelligence can emerge.
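To make the distinction concrete, here is a minimal, purely illustrative Python sketch of the task-level composition described above. `ToyLLM` and `ToyImageModel` are hypothetical stand-ins, not real APIs; the point is only that the two models exchange a prompt string and share no weights or internal representations.

```python
class ToyLLM:
    """Hypothetical language model stand-in."""
    def generate(self, prompt: str) -> str:
        return f"[text completion for: {prompt}]"

class ToyImageModel:
    """Hypothetical image model stand-in (a Midjourney-style tool)."""
    def generate(self, prompt: str) -> str:
        return f"[image rendered from: {prompt}]"

def task_level_composition(user_request: str) -> str:
    # The LLM merely *calls* a separate image model. The only channel
    # between them is a text string, so the models stay disjoint and
    # no cross-modal intelligence can emerge -- this is orchestration
    # at the task level, not fusion at the model level.
    llm, image_model = ToyLLM(), ToyImageModel()
    image_prompt = llm.generate(f"Turn this into an image prompt: {user_request}")
    return image_model.generate(image_prompt)

print(task_level_composition("a bird on a branch"))
```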

The reason large models have attracted so much attention and heated discussion is not just their scale, but that people have begun to recognize the "emergence" of intelligence in them. In past machine learning algorithms, as model size increased, the marginal benefit gradually decreased; each improvement in performance became smaller and smaller. Now, however, it has been observed that once the model size exceeds tens of billions of parameters, the marginal benefit begins to increase on some tasks, producing a sudden and significant jump in performance, which is called the "emergence" of intelligence. This "emergence" is the most fascinating aspect of large models.
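The two regimes can be sketched with purely synthetic curves. The numbers below are invented for illustration only, not measured results; they just contrast smooth diminishing returns with a threshold-like jump around tens of billions of parameters.

```python
import math

def classical_scaling(params_b: float) -> float:
    # Diminishing returns: performance grows only logarithmically with size.
    return 60 + 8 * math.log10(params_b + 1)

def emergent_task_score(params_b: float, threshold_b: float = 30.0) -> float:
    # Near-flat until roughly tens of billions of parameters, then a sharp
    # jump -- the "emergence" pattern described above (synthetic sigmoid).
    return 100 / (1 + math.exp(-(params_b - threshold_b) / 5))

for size in [1, 10, 30, 60, 100]:  # billions of parameters
    print(f"{size:>4}B  classical={classical_scaling(size):5.1f}  "
          f"emergent={emergent_task_score(size):5.1f}")
```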


If we want to see intelligence emerge at the multimodal level, language and vision must be combined at the bottom level; emergence can only occur when integration happens at the lowest level. In other words, we need to build a dense multimodal large model to achieve it.
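One common way to realize this kind of bottom-level fusion is to feed text tokens and image patches into a single shared transformer. The sketch below is a minimal toy in PyTorch under that assumption; the dimensions and layer counts are arbitrary, and it is not the specific architecture discussed in the article.

```python
import torch
import torch.nn as nn

class FusedMultimodalBlock(nn.Module):
    """Toy dense fusion: text tokens and image patches share one transformer."""
    def __init__(self, vocab_size=1000, patch_dim=768, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)  # map patch features to d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens, image_patches):
        # One joint sequence, one set of weights: fusion at the model level,
        # as opposed to a language model merely calling an external image model.
        seq = torch.cat([self.text_embed(text_tokens),
                         self.patch_proj(image_patches)], dim=1)
        return self.encoder(seq)

tokens = torch.randint(0, 1000, (1, 12))       # toy text token ids
patches = torch.randn(1, 16, 768)              # toy image patch features
out = FusedMultimodalBlock()(tokens, patches)  # (1, 28, 256) joint representation
```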

The second question is: when the model becomes more intelligent, at what level does that intelligence appear? We often say that a picture is worth a thousand words, so compared with "generating text from images", "generating images from text" is the more challenging multimodal task. Giving a machine a short text description and asking it to generate an image demands extremely high imagination from the machine.

For example, when asked to draw a bird with a text-to-image model, the AI must automatically fill in details from a rough description, matching not only the whole but also the local details. The difficulty is that the raw visual signal consists only of pixels, while the language signal initially consists only of words or characters; the two are hard to align and sit at different levels of abstraction, so we need to find an appropriate level at which multimodal information can be aligned. It currently appears that if multimodal models are to become intelligent, this intelligence will emerge at the semantic level. At Microsoft's Disruptive Technology Review conference in late 2017, we presented our work on text-driven visual content generation to Nadella and his management team.
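One well-known technique for aligning the two signals at the semantic level is a CLIP-style contrastive loss: captions and images are embedded into a shared space, matched pairs are pulled together, and mismatched pairs are pushed apart. The sketch below is a generic version of that idea, not the specific method from the work mentioned above.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of (text, image)
    pairs embedded in a shared semantic space."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(len(text_emb))            # i-th text matches i-th image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 4 caption embeddings aligned with 4 image embeddings.
loss = contrastive_alignment_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())
```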


All in all, building multimodal large models is a complex and challenging task, but it also brings great opportunities and potential. By overcoming these technical difficulties and adhering to the principles Mr. He Xiaodong emphasizes, we can create more flexible and adaptable models that bring innovation and breakthroughs to many fields. In the near future, multimodal large models may well become an important engine for the development of artificial intelligence, creating a more intelligent and efficient world for us.
