The violence of AI

Background

I am a freelancer and the operator of the open-source AI project https://github.com/CloudOrc/SolidUI

Technical realization

On-premise

The value of internal private data is declining. Companies have hoarded private data hoping to make their models more valuable, but the explosion of open datasets like LAION and of general-purpose LLMs has steadily eroded the worth of B2B private data. What remains valuable on the data side is symbiotic data: a company's own model combined with its customers' data. Together they form an asset that cannot be migrated, and that is the most valuable data.

Open-source data

Open-source datasets are a new entry point into AI: they demand relatively little compute, yet they have global significance and are a core part of the AI ecosystem. LAION is a good example.

Take the open-source dataset project launched a few days ago by three post-2000s founders: its goal is to collect all the data in the world, starting by integrating all of arXiv.

Fine-tuning

If the pre-trained language model is powerful enough, fine-tuning can be skipped entirely in 90% of scenarios. What replaces it is a co-construction relationship between people and the LLM system, built through prompts.

The most intuitive scenario is collaborative office software: many LLM portals will appear quickly, and customer data will flow into these platforms, building that relationship through the LLM system.

Device manufacturers, who have spent years competing on smartphone cameras, urgently need a way into this scene.

Moat

The most conspicuous KPI, the one everyone recognizes first, is not the most important. Most people fixate on it; you must find your own differentiating points.

Business Challenges and Opportunities

Historical baggage

The development of AI is not cyclical, though viewed from different entry points it can feel like a periodic fresh start. We often find that, a few years on, nobody mentions things that were once popular. At the product level this is great for entrepreneurs, because it keeps creating new possibilities; at the technical level, however, new technology has always been replacing existing practice. The underlying technology does have some continuity, yet anyone who has actually done AI entrepreneurship will find that, setting aside sales and the rest of the commercial side, 80% of the work is product engineering and only 20% is underlying technology, and that is a good ratio. If you are starting a company now and you happen not to be OpenAI or Anthropic, perhaps 10% of your work is technology, which is still fine. Seen this way, for everyone who came first, not just the technology but the scale of the industry and even the distribution of benefits are historical baggage.

Take Baidu's recent AI search, which bundles in a pile of role-play scenarios: does that value and scale really carry so much influence? The details can wait for public data.

Moderation API

Data security and compliance can be decoupled behind an API, similar to OpenAI's moderation API.
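A minimal sketch of what this decoupling looks like: the application sees only one stable moderation call, while the rules behind it can change freely. The category names and keyword rules below are hypothetical stand-ins, not a real service.

```python
# Decoupling compliance behind one API: callers depend only on the contract,
# so the backing implementation (rules, a hosted model, a vendor endpoint
# such as OpenAI's moderation API) can be swapped without touching app code.
# The categories and keywords here are made up for illustration.

BLOCKLIST = {
    "violence": {"attack", "kill"},
    "self-harm": {"self-harm"},
}

def moderate(text: str) -> dict:
    """Return a moderation verdict; callers never see the rules themselves."""
    lowered = text.lower()
    categories = {
        name: any(word in lowered for word in words)
        for name, words in BLOCKLIST.items()
    }
    return {"flagged": any(categories.values()), "categories": categories}

# Application code only inspects the verdict, never the rules.
assert moderate("weather tomorrow")["flagged"] is False
```

Because the application never touches the rules directly, swapping the naive blocklist for a hosted moderation model is invisible to callers.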

Scalable loss

You need to be able to switch roles at any time when looking at a problem. Staying forever in the technologist's role flattens the picture of entrepreneurship, costing it a dimension.

Operations people are actually closest to the customer; their perspective inevitably inflates expectations, and they do not understand the product or the technology, while technology sits furthest from the customer. The product-engineering realization of an MVP always deviates from the pure technology, so what this era lacks is someone who can put down the burden of history and switch roles at any time.

GitHub

The soil of GitHub will keep producing disruptive products in the future; LangChain is one example.

Low-code platforms are a pseudo-demand

In fact, your best-fit users often do not need to write code at all: as long as the model is strong enough, the content can be realized through prompts.

On the code side, GitHub Copilot has already been successful enough to take the lead.

The most common use of low-code is building so-called back-office or admin systems. What problem does it solve? It takes a complex problem and presents the standard problem underneath it. Look at the popular low-code platforms such as Airtable and NocoDB: what is their upstream? A SQL database, which is a very standard thing, while building an admin system from scratch is extremely cumbersome.

The upstream of LLMs is the prompt, a new set of standards.

What is the relationship between vector databases and LLMs?

The most important functions of a vector database are retrieval and recommendation. It has little to do with the LLM itself; it pairs with the embedding model.

OpenAI's embedding model is mainly used for text representation: it maps text into a lower-dimensional vector space for tasks such as similarity calculation and classification. Its main functions:

  • Text representation. The embedding model maps text into a lower-dimensional vector space, giving it a numerical representation. Text-processing tasks then become operations in vector space, which machine learning models handle more easily.
  • Similarity calculation. In the embedding space, similar texts map to nearby vectors, so we can judge the similarity of two texts by computing the similarity of their vectors (e.g., cosine similarity), enabling similar-text search and related features.
  • Text classification. In the embedding space, texts of the same category cluster together, so we can use the clustering in vector space to judge a text's category and implement classification.
  • Other downstream tasks. Embedding vectors, as numerical features of text, can be fed into other machine learning models for downstream tasks such as sentiment analysis and topic inference.
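The similarity calculation above can be sketched in pure Python. The three 4-dimensional vectors are made-up stand-ins for real embeddings (which would come from an embedding model and have far more dimensions); the math is the same either way.

```python
# Cosine similarity over embedding vectors: similar texts map to nearby
# vectors, so their cosine similarity is higher.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-d "embeddings" (real ones are model outputs with hundreds of dims).
cat    = [0.90, 0.10, 0.00, 0.20]   # "cat"
kitten = [0.85, 0.15, 0.05, 0.25]   # "kitten" -- semantically close to "cat"
stock  = [0.00, 0.80, 0.60, 0.10]   # "stock market" -- far from both

# The similar pair scores higher than the dissimilar pair.
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, stock)
```

A vector database does essentially this at scale: it indexes millions of such vectors so that the nearest neighbors of a query embedding can be retrieved quickly.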

The Midjourney data flywheel

  • Collect and organize large amounts of data. Midjourney gathers massive image, text, and other data through web crawling and manual labeling, then cleans, labels, and organizes it into a high-quality dataset.
  • Develop AI models based on the data. Using the dataset from step 1, Midjourney develops computer vision and natural language processing models that handle tasks such as image recognition and semantic understanding.
  • Expose the AI models as APIs. Midjourney opens its models to customers as APIs, which customers call from their own products or services to get the corresponding AI capabilities.
  • Let users generate more data through the APIs. As users call the APIs, they produce more data: user images, text, interaction data, and so on. Midjourney collects this new data and feeds it back into step 1, continuously enriching the dataset and improving the models.
  • Repeat continuously. By looping through steps 1-4, Midjourney builds a positive feedback cycle between data and models, driving rapid progress in both. This is the data flywheel effect.

Therefore, the core of the data flywheel model is a mutually reinforcing cycle between data and algorithms: data drives algorithmic progress, and the algorithms in turn generate more data; algorithms raise the value of the data by identifying and understanding it better. This interaction forms a flywheel that enables the continuous, rapid development of artificial intelligence.
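The loop above can be sketched as a toy simulation. Every quantity here (the starting dataset size, the log-shaped quality curve, the usage multiplier) is a made-up assumption purely to illustrate the feedback structure, not a model of Midjourney's actual numbers.

```python
# Illustrative flywheel: each turn, model quality grows with the dataset,
# a better model attracts more API usage, and usage generates new data.
import math

def flywheel(turns, data=1_000.0):
    history = []
    for _ in range(turns):
        model_quality = math.log10(data)   # step 2: train on the current data
        api_usage = 100 * model_quality    # step 3: better model -> more calls
        data += api_usage                  # step 4: usage creates new data
        history.append(model_quality)      # repeat from step 1
    return history

quality = flywheel(5)
# Quality improves every turn because the loop feeds itself.
assert all(b > a for a, b in zip(quality, quality[1:]))
```

The logarithm stands in for diminishing returns: each new batch of data helps, but less than the last, which is why the flywheel accelerates growth without making it unbounded per turn.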

Advantages of this business model:

  • Sustainable development. Through the positive feedback loop, data and algorithms promote each other and improve continuously.
  • First-mover advantage. Whoever builds the data flywheel first and captures the market gains a first-mover advantage through network effects.
  • Two-way network effects. Data network effects and algorithmic network effects interact to generate powerful two-way network effects.

Models and Methods

Vertical models

Most vertical models currently on the market are not even as good as GPT-4 within their own field, despite all the emphasis on specialized models for specific domains and tasks.

Why? Because general-purpose large models are nowhere near saturated in data or scale, any valuable domain can be folded directly into the general model with no trade-offs. It is essentially a free upgrade, so it will be merged in eventually.

Any new domain you add not only improves the model in that domain but also lifts it horizontally across the board. That is one of the most attractive properties of large models. When we worked in a single field before, our accumulation only improved us in that field; with a large model, the model grows horizontally, which is wonderful. In summary, vertical applications should differentiate on the business, not the model.

Multimodal

Multimodal learning is machine learning on data from multiple modalities. OpenAI's recent releases focus on this area, and it is still in its infancy.

Is RLHF really a must?

RLHF is still a research-stage method, neither especially stable nor strictly necessary. Recent work such as DPO (Direct Preference Optimization) shows that, given enough human preference data, reinforcement learning is not required: the language model can be optimized directly by maximum likelihood, omitting the reward model.

Therefore, RLHF is not a necessary means of alignment, mainly for the following reasons:

  • RLHF requires designing a reward function, which is difficult in itself; an unstable or flawed design can yield poor results. With enough feedback data, we can instead optimize the model directly to maximize data likelihood, with no reward modeling.
  • RL needs extensive environment interaction and trial and error to learn, which can be slow and resource-hungry in settings like dialogue systems. With enough human preference feedback, we can learn to fit that feedback directly.
  • RL training is relatively unstable, sensitive to initialization and hyperparameters, and the resulting policy is hard to interpret. Learning from preference feedback directly gives more stable, interpretable results.
  • Human feedback already contains an evaluation of the model's behavior; with enough of it, it covers the information the RL reward signal would carry, so there is no need for an extra reward-modeling step.

Overall, then, RLHF is not a necessary means of alignment. With enough human feedback and preference data, we can align and optimize the model through simpler, more direct learning, without introducing reinforcement learning or reward modeling. Human feedback itself contains the needed training signal.

RLHF remains a research topic, and simpler, more direct methods may be more practical in real applications. It is not necessarily the only or best route to alignment; what matters are the data and the optimization objective, and the means can be chosen case by case.
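To make the DPO idea concrete, here is the per-example DPO loss in pure Python. The log-probabilities below are made-up numbers standing in for real model outputs; only the formula itself is from DPO.

```python
# DPO loss for one preference pair: given a preferred answer y_w and a
# rejected answer y_l, the loss is
#   -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
#                        - (log pi(y_l) - log pi_ref(y_l))])
# i.e. plain maximum likelihood on preference data -- no reward model, no RL.
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# If the policy already prefers y_w more than the reference does, the loss
# falls below log(2); if it prefers y_l instead, the loss rises above log(2).
good = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
bad  = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-8.0, ref_logp_l=-6.0)
assert good < math.log(2) < bad
```

In training, this scalar would be averaged over a batch of preference pairs and minimized by gradient descent on the policy's parameters; the frozen reference model only supplies the `ref_logp_*` terms.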

Open source and cognition

Technological equality

At the frontier, technological equality is real: most companies cannot do much more than one another and are constrained by the same environment, so it comes down to who finds an entry point within this window of equality. Open source has its own community character and sits close to users, which is very useful, though those may not be your best-fit users; the deviation can be large, and the direction must be constantly adjusted, thinking from a product-engineering perspective.

You must continuously connect with different communities and integrate across the upstream and downstream of the supply chain.

Emergence of LLMs

LLMs have strong generative and self-labeling abilities. With abundant compute, the model is powerful and intelligent and can extrapolate a great deal of content; with scarce compute, the generated content must be repeatedly verified.

As parameter counts grew quantitatively across model iterations, ChatGPT's cognition of the world changed qualitatively. It no longer simply memorizes pre-training data: it understands new information, distills it into knowledge, and then expresses it back to you. With GPT-4 one could even say it shows creativity beyond mere cognition. The process resembles the evolution of the human brain: once its neurons crossed a certain threshold, Homo sapiens gained the capacity to dominate the earth.

LLM plugins

This is a very powerful entry point. There is no standard yet; OpenAI has not gone public, and it wants to set up a foundation-like body to formulate a global standard. Whoever builds plugins first and gets accepted by the community will go on to co-author the norms with organizations such as LangChain and OpenAI.

Origin: blog.csdn.net/qq_19968255/article/details/131368366