Google co-founder Brin is back at work, pouring effort into AI research and development! The next-generation general-purpose model "Gemini" is expected to launch in the second half of the year, setting up a final showdown with OpenAI!

 


Original article from Xi Xiaoyao Technology Talk | Wang Siruo

Hello everyone, I am Wang Siruo. In today's large-model melee, the core goal, or mission, has always been artificial general intelligence (AGI). Yet at present, only OpenAI and Google are in a position to construct a complete technical route in this direction, which demands massive computing power and financial backing.

Unfortunately, OpenAI has chosen to go closed-source and build a solid moat. Everyone tries to glean details of GPT-4 from the talks of its employees, or even to infer model improvements by analyzing GPT-4's behavior at different points in time, but this is just scratching the surface. A core question always stands before the community: how do we move toward general artificial intelligence? In other words, how do we make a model learn and understand the human physical world?

On the road to AGI, Google has proposed its next-generation general-purpose model "Gemini", planned for release later this year. Gemini, short for Generalized Multimodal Intelligence Network, is Google's answer to GPT-4, arguably its last stand against the threat it poses. Google co-founder Sergey Brin, who stepped away four years ago, has returned to work to help build the Gemini system. Gemini is undoubtedly the focal point of the large-model race in the second half of the year.

What has been revealed about the Gemini system lines up closely with the author's own conception of a future AGI model. Gemini will undoubtedly be a closed-source system as well, but the good news is that it brings us a step closer to AGI, and the open-source community may get the chance to learn from Google and explore this large-model paradigm.

1. The Gemini system

  • On April 20, 2023, Google CEO Pichai announced the merger of DeepMind and Google Brain into Google DeepMind, pooling Google's world-class AI talent, computing power, infrastructure, and other resources. The goal is simple: develop a multimodal model to benchmark against GPT-4, under the project code name "Gemini".

  • On May 10, 2023, the Google I/O 2023 keynote previewed the Gemini model, emphasizing that Gemini performs well at using tools and integrating APIs, and is aimed at innovations in memory and planning. Gemini is still in training but has already demonstrated multimodal capabilities not seen in any previous model. After fine-tuning and rigorous safety testing, Gemini will be offered in various sizes and capability tiers, so that it can be deployed across different products, applications, and devices.

  • On June 14, 2023, it was revealed that Google trains models on YouTube videos: Google researchers have been using YouTube to develop the company's next large language model, Gemini.

  • On June 14, 2023, Hassabis, CEO of Google DeepMind, stated that the Gemini system will be more powerful than the system behind ChatGPT. DeepMind's Gemini, still under development, is at its core a large language model for processing text, essentially similar to the GPT-4 model behind ChatGPT. But Gemini incorporates capabilities from the AlphaGo system (reinforcement learning + tree search?) along with some interesting innovations, and completing its development is expected to take several more months and tens to hundreds of millions of dollars.

  • On July 11, 2023, Hassabis said in an interview with the New York Times that Google DeepMind is developing the Gemini system for the next era: an extremely powerful general-purpose system that interacts primarily through language, has general capabilities such as mathematics and coding, and is able to reason and plan. In that scenario, specialized AI systems like AlphaGo and AlphaFold would collectively be referred to as tools.

  • On July 11, 2023, Hassabis said in an interview with The Verge that Gemini is Google's next-generation multimodal large model, combining the best ideas of the world's leading AI research teams (DeepMind and Google AI), and that next to what is coming in the next few years, today's chatbots will look trivial!

  • Wall Street Journal, July 20, 2023: "Gemini is Google's attempt to build a general artificial intelligence program that could rival OpenAI's GPT-4 model. Demis Hassabis, the Google exec in charge of the project, told employees at a recent company-wide meeting that the program will be rolled out later this year."

Gemini is a multimodal intelligent network capable of processing multiple types of data and tasks simultaneously: text, images, audio, video, 3D models, even diagrams. Gemini is more than a single model; it is a network of models, each contributing to the overall functionality of the system. This network architecture enables Gemini to handle a wide variety of tasks without building a specialized model for each one. The models in the network collaborate, share information, and learn from each other, making Gemini an extremely versatile and powerful AI tool.

Gemini uses a new architecture that fuses a multimodal encoder and decoder. The encoder's job is to convert different types of data into a common language that the decoder can understand; the decoder then takes over, generating output in different modalities depending on the encoded input and the task at hand. Users provide input in various formats (text, images, audio, video, 3D models, graphics, and so on). The encoder converts these inputs into a common representation, which is then fed into the model. The model itself is task-agnostic: it does not need to know the details of the task it is performing, it simply processes the input according to the task at hand. The decoder takes the processed representation from the model and produces the output which, depending on user preference, can be delivered in different modalities.
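
To make the encoder/decoder description concrete, here is a minimal PyTorch sketch of what such a layout could look like. Nothing about Gemini's actual architecture has been published; every module, dimension, and name below is the author's assumption, for illustration only.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Projects each modality's features into one shared latent space."""
    def __init__(self, d_model=512):
        super().__init__()
        # Hypothetical per-modality feature sizes; a real system would use
        # dedicated towers (e.g. a ViT for images), not single projections.
        self.proj = nn.ModuleDict({
            "text":  nn.Linear(768, d_model),
            "image": nn.Linear(1024, d_model),
            "audio": nn.Linear(256, d_model),
        })

    def forward(self, features, modality):
        # features: (batch, seq_len, feature_dim) for the given modality
        return self.proj[modality](features)   # -> (batch, seq_len, d_model)

class MultimodalDecoder(nn.Module):
    """Generates output tokens conditioned on the shared latents."""
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerDecoder(layer, num_layers=6)
        # One output head per target modality; only text is shown here.
        self.to_text = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, memory):
        # tgt: already-embedded output tokens; memory: encoder latents
        return self.to_text(self.backbone(tgt, memory))
```

The key idea the paragraph describes is the shared latent space: once every modality is projected into it, the same task-agnostic core and decoder can serve them all.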

2. Why is the Gemini system different?

In fact, the merger of DeepMind and Google Brain alone shows Google's determination to put all of its eggs in the Gemini basket. DeepMind, led by Demis Hassabis, has always been a somewhat "disobedient" presence within Google, while Google Brain has long been managed by veteran Jeff Dean. After this merger into Google DeepMind, Hassabis serves as CEO and Jeff Dean as chief scientist, with both reporting directly to Pichai. Against that backdrop, the naming of the Gemini ("twins") system is really quite meaningful~

It appears Gemini will be more than just a new AI model; it's a glimpse into the future of AI, and with its multimodal capabilities and creativity, Gemini will redefine what AI can do and how we interact with it.

GPT-4 vs. the Gemini system


GPT-4 is primarily a text-based model designed to handle tasks involving text data, such as writing papers, answering questions, or translating languages. Gemini, developed by Google, is a multimodal intelligent network, meaning it is designed to handle multiple types of data and tasks simultaneously. Gemini can handle text, images, audio, video, 3D models, and even graphics. This makes Gemini more general than GPT-4 because it can handle a wider range of tasks and data types.

Gemini is not a single model but a network of models, and this architecture lets it cover a wide variety of tasks without a purpose-built model for each one. The models in the network collaborate, share information, and learn from each other. Gemini is also more adaptable: reportedly it can handle new kinds of data and tasks without specialized models or task-specific fine-tuning, and it can learn from any domain or dataset without being restricted by predefined categories or labels.

3. Some ideas

Let me borrow a simple example to illustrate the author's point. Suppose you set out to learn chess. Every day, behind closed doors, you memorize game records and watch masters play, carefully observing and thinking about how the game works. Yet you still do not become a good chess player.

So you start playing against a chess master, who moves in silence and corners you every time. The repeated failures do keep you learning, but your progress is painfully slow; by your own estimate, you would need far more games to reach a decent level, and you still lose more than you win. Finally you muster the courage to ask the master for advice. The master explains openings, strategy, and tactics, has you replay the same starting positions over and over to learn how to crack them, and at each stage has you compete against apprentices at your own level. At last you feel you are making steady progress and getting the hang of chess.

This story maps onto three paradigms of learning (a toy code sketch follows the list):

  • Imitation learning. Passive learning by observation: you absorb winning methods from tens of millions of recorded games. But a master may carefully set up a plan many moves deep, and with rich context the space of possible move sequences explodes exponentially.

  • Self-learning. Playing against experts, you receive feedback only on the final outcome of each game and slowly begin to correct your play. This still requires endless trial and error to arrive at even a rough plan.

  • Guided learning. The expert teaches you solutions through short action sequences with instant feedback. By drilling a large number of such combinations, you learn effective problem-solving procedures.
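
Here is a toy, self-contained sketch contrasting the three update signals, with a single linear layer standing in for a policy over board states. Everything is illustrative; it says nothing about how Gemini actually trains:

```python
import torch
import torch.nn as nn

policy = nn.Linear(8, 4)          # toy policy: 8-dim board state -> 4 moves
opt = torch.optim.SGD(policy.parameters(), lr=0.1)
xent = nn.CrossEntropyLoss()

def imitation_step(states, expert_moves):
    """Imitation: dense supervision from millions of recorded expert games."""
    loss = xent(policy(states), expert_moves)
    opt.zero_grad(); loss.backward(); opt.step()

def self_play_step(states, moves, final_reward):
    """Self-learning: only the game's final outcome is observed (REINFORCE-style)."""
    logp = torch.log_softmax(policy(states), dim=-1)
    chosen = logp.gather(1, moves.unsqueeze(1)).sum()   # log-prob of played moves
    loss = -final_reward * chosen                       # sparse, delayed credit
    opt.zero_grad(); loss.backward(); opt.step()

def guided_step(drill_states, drill_moves):
    """Guided learning: short curated drills with immediate expert feedback.
    The signal is as dense as imitation, but the teacher picks positions that
    target a specific weakness, so each update is far more informative."""
    loss = xent(policy(drill_states), drill_moves)
    opt.zero_grad(); loss.backward(); opt.step()
```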

Most existing paradigms build a base model through pre-training (imitation learning), then apply instruction fine-tuning on top of it and align it with human feedback via reinforcement learning, RLHF (autonomous learning). But this is still far from AGI: the current RLHF alignment scheme faces many limitations (the base model's ability caps the aligned model's generalization; there is an alignment tax), and the hallucination phenomenon found in all current base models does not go away.
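
Summarized as data, the recipe this paragraph describes looks like the following (the author's reading of prevailing practice, not any vendor's documented pipeline):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    training_signal: str
    paradigm: str

# The prevailing large-model recipe, stage by stage.
PIPELINE = [
    Stage("pre-training", "next-token prediction on web-scale text", "imitation learning"),
    Stage("instruction fine-tuning", "curated (prompt, response) pairs", "imitation learning"),
    Stage("RLHF alignment", "reward model fit to human preferences", "autonomous learning"),
]

for s in PIPELINE:
    print(f"{s.name:<24} {s.paradigm:<20} signal: {s.training_signal}")
```

Note that guided learning, in the sense above, appears nowhere in this pipeline; that gap is exactly what the next paragraph is about.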

LeCun concluded that "auto-regressive LLMs are doomed" (autoregressive models will eventually fail) and proposed the world model instead. Although LeCun's idea has drawn huge controversy, imitation learning (autoregressive training) plus autonomous learning (alignment) is, in the end, rather like the Handan toddler of the Chinese idiom, who forgot his own gait while imitating others. Here is a bit of the author's own thinking: by continuously reading enormous amounts of the world's text we may come to understand the world to some degree, but it is absolutely impossible to become a grandmaster merely by watching grandmasters play; an expert's thinking may not be deducible from their behavior at all. Current alignment only really works in chat scenarios, and in vertical-domain applications the large-model alignment scheme cannot solve the fundamental problem. So, naturally, integrating guided learning into imitation learning better matches the author's picture of the next-generation model: fold reinforcement learning into the model training stage itself, so the model can continuously learn from its environment, evolve its intelligence, and ultimately reach general artificial intelligence, AGI.
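
A minimal sketch of what "folding reinforcement learning into the training stage" could mean: a single optimizer step that mixes the imitation (next-token) loss with a reward-weighted log-likelihood from environment interaction. The toy bigram model, the REINFORCE-style term, and the mixing weight beta are all the author's assumptions, not Gemini's method:

```python
import torch
import torch.nn as nn

vocab, d = 100, 32
model = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))  # toy bigram LM
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(text_tokens, env_states, env_actions, env_reward, beta=0.1):
    # 1) Imitation signal: standard next-token prediction on text.
    lm_loss = nn.functional.cross_entropy(model(text_tokens[:-1]), text_tokens[1:])

    # 2) Guided/RL signal: log-probability of the actions actually taken in an
    #    environment, weighted by the reward that environment returned.
    logp = torch.log_softmax(model(env_states), dim=-1)
    chosen = logp.gather(1, env_actions.unsqueeze(1)).sum()
    rl_loss = -env_reward * chosen

    # 3) One joint update over both signals, instead of RL as an afterthought.
    loss = lm_loss + beta * rl_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

# Toy usage: 1-D LongTensors of token/state/action ids, scalar reward.
training_step(torch.randint(0, vocab, (16,)),
              torch.randint(0, vocab, (5,)),
              torch.randint(0, vocab, (5,)),
              env_reward=1.0)
```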

Taking this idea one step further: what should we actually do?

Demis Hassabis has made it clear that AlphaGo's reinforcement-learning techniques are being used in Gemini (which has the flavor of guided learning), and that Gemini is a network of models whose components collaborate, share information, and learn from each other, letting it handle a wide variety of tasks without a specialized model for each.
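
What might "AlphaGo-style" search look like grafted onto a language model? AlphaGo proper uses Monte-Carlo tree search driven by a policy network and a value network; the sketch below is a much-simplified best-first lookahead in the same spirit, where `propose` (sample k continuations from a model) and `score` (a learned value or reward estimate) are hypothetical callables supplied by the caller:

```python
import heapq

def tree_search_decode(prompt, propose, score, beam=4, depth=3):
    """Search over candidate continuations instead of greedy decoding.

    propose(text, k) -> list of k candidate continuation strings
    score(text)      -> float, higher is better (value/reward estimate)
    """
    frontier = [(-score(prompt), prompt)]             # negate: heapq is a min-heap
    for _ in range(depth):
        candidates = []
        for _neg_v, text in frontier:
            for cont in propose(text, k=beam):        # the policy proposes branches
                grown = text + cont
                candidates.append((-score(grown), grown))
        frontier = heapq.nsmallest(beam, candidates)  # keep the best branches
    return min(frontier)[1]                           # highest-value sequence found
```

Planning of this kind spends extra compute at inference or training time in exchange for better decisions, which is exactly the trade that made AlphaGo strong.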

It tastes right! The author believes Gemini will let Google usher in its own ChatGPT moment. This conviction comes partly from DeepMind's past successes, such as AlphaFold2, which genuinely changed the paradigm of an entire field; with Google now throwing its full AI strength behind Gemini, it is impossible not to look forward to the result. And it comes partly from how closely Gemini matches the author's own conception of a future general-purpose model architecture.

Perhaps, as Hassabis said, next to Gemini today's chatbots will look insignificant.

Gemini will most likely be released in October, though it may slip later. This time Google has bet the house on Gemini. We shall wait and see!

