Generative artificial intelligence notes (AIGC notes)

More than ten years ago, artificial intelligence was just an unpopular niche field; now it is a household topic, and almost everything can be linked to it.

Artificial intelligence includes basic layer, technical layer and application layer.

The basic layer is the foundation of the artificial intelligence industry, providing data and computing power support for artificial intelligence;

The technical layer is the core of the artificial intelligence industry, mainly including the research and development and upgrading of various models and algorithms;

The application layer is the software and hardware products or solutions formed by artificial intelligence to meet the needs of specific scenarios.

Decision-making AI and generative AI

Artificial intelligence can be categorized along different dimensions. Classified by the kind of model that supports it, it can be divided into decision-making AI and generative AI.

Decision-making AI

Decision-making AI (also known as discriminative AI) learns the conditional probability distribution in the data, that is, the probability that a sample belongs to a specific category, and then judges, analyzes and predicts new scenarios. Decision-making AI has several main application areas: face recognition, recommendation systems, risk control systems, other intelligent decision-making systems, robots, and autonomous driving.

Generative AI

Generative AI learns the joint probability distribution in the data, that is, the probability distribution of the vector formed by multiple variables. It summarizes the existing data and, on that basis, uses deep learning to create imitative, recombined content, which amounts to automatically generating brand-new content.

No matter what type of model it is, the basic logic is the same: an AI model is essentially a function. It is difficult to deduce the exact expression of this function by logic alone; instead, the function is obtained through training.

From a macro perspective, decision-making AI is a technology used for decision-making. It uses technologies such as machine learning, deep learning, and computer vision to handle problems in professional fields and help enterprises and organizations optimize decision-making.

Generative AI is an AI technology used to automatically generate new content. It can use technologies such as language models, image models, and deep learning to automatically generate new text, images, audio, and video content.

Therefore, decision-making AI can be said to imitate the human decision-making process, while generative AI focuses on creating new content.
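
To make the distinction concrete, below is a toy Python sketch (not from the original text; the tiny dataset is invented) that estimates a conditional distribution P(y | x) for a "decision" and a joint distribution P(x, y) that can be sampled from to "generate" new data.

```python
# Toy illustration of conditional vs. joint probability (illustrative data only).
import numpy as np

# Tiny dataset: x is a binary feature, y is a binary label.
x = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# "Decision-making" view: estimate P(y=1 | x) and use it to classify.
for xv in (0, 1):
    p = y[x == xv].mean()
    print(f"P(y=1 | x={xv}) = {p:.2f}  ->  predict", int(p > 0.5))

# "Generative" view: estimate the joint P(x, y) and sample new (x, y) pairs from it.
joint = np.zeros((2, 2))
for xv, yv in zip(x, y):
    joint[xv, yv] += 1
joint /= joint.sum()
print("Joint P(x, y):\n", joint)

rng = np.random.default_rng(0)
idx = rng.choice(4, size=5, p=joint.ravel())   # sample flat indices
samples = [divmod(i, 2) for i in idx]          # back to (x, y) pairs
print("Generated samples:", samples)
```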


From big data to big models

Big data refers to massive amounts of data.

In the field of artificial intelligence, large amounts of data are used to train models. Models with a large number of parameters are called "large models".

Large models can analyze and process massive amounts of data and achieve better results in solving problems.

Generative AI is the product of large models.

Currently, "content" production has gone through four iterations:

PGC(professional generated content)

PGC (professional generated content), that is, "professionally produced content", mainly refers to content created by producers with professional backgrounds.

UGC(user generated content)

UGC (user generated content), "user generated content": the source of content is broader, and everyone can produce content as a user.

AIUGC(artificially intelligent UGC)

AIUGC is a combination of artificial intelligence and UGC. Artificial intelligence participates in the process of user creation of content.

AIGC

Different from PGC, UGC, and AIUGC, in the concept of AIGC the "inanimate" artificial intelligence becomes the complete source of content: an inanimate subject produces content for humans.

AIGC is a production method that uses artificial intelligence to automatically generate content. Based on generative AI models and training data, it can generate diverse content such as text, images, audio, video, and code. This rapid mode of content production injects exciting new blood into the market.


Generative AI and decision-making AI focus on different cognitive levels

Generative AI has also penetrated into the 3D field. Magic3D is an application launched by GPU (graphics processing unit) manufacturer NVIDIA: it first builds a rough 3D model at low resolution and then optimizes it to a higher resolution. Dream Fields, a model from Google Research, can generate 3D models without needing photos, making it possible to "create something out of nothing".

Generative AI learns the joint probability distribution in the data, summarizes the existing data, and creates new content.

The "previous life" of ChatGPT is closely related to the Transformer model, which was born in 2017.

In 2017, the Google Brain team published a paper titled "Attention Is All You Need" at the Neural Information Processing Systems conference. In it, the authors proposed, for the first time, the Transformer model based on the attention mechanism and applied it to understanding human language, that is, natural language processing. The Google Brain team trained this initial Transformer model on many publicly available language datasets; the model contained 65 million adjustable parameters.

After extensive training, this Transformer model achieved top industry scores in English constituency parsing, translation accuracy, and other benchmarks, leading the world and becoming the most advanced large language model of its time.

In just a few years, the influence of this model has penetrated into various fields of artificial intelligence, including various forms of natural language models, and the AlphaFold 2 model for predicting protein structures. In other words, it is the source of many subsequent powerful AI models.

In 2018, less than a year after the Transformer model was introduced, OpenAI made its own technological breakthrough: it published the paper "Improving Language Understanding by Generative Pre-Training" and launched the GPT-1 model with 117 million parameters. GPT-1 is based on the Transformer structure, but the dataset used to train it is larger.

The resulting GPT-1 model achieved better results than the basic Transformer model on four evaluation dimensions: text classification, question answering, text similarity assessment, and textual entailment. It therefore replaced the basic Transformer model as the new industry leader.

GPT-2, GPT-3, and subsequent versions were then released one after another.

At the 2022 Neural Information Processing Systems conference, OpenAI announced another breakthrough: a new large-scale pre-trained language model, ChatGPT. Its predecessor is GPT-3.5, a model OpenAI obtained by fine-tuning GPT-3; ChatGPT was built on top of GPT-3.5.

The underlying logic of AIGC

The biggest technical contributor to the development of artificial intelligence is deep learning.

The explosive growth of deep learning is due to the massive amounts of data, the powerful computing power brought by graphics processors, and the continuous improvement of models.

In 2006, computer scientist and cognitive psychologist Geoffrey Hinton first proposed the "deep belief network". Unlike traditional training methods, deep belief networks include a "pre-training" step, which lets the weights of the neural network easily find values close to the optimal solution, followed by "fine-tuning" to optimize the whole network. This staged training method significantly reduces the time needed to train deep learning models.

The past and present of deep learning

Machine learning is a branch of artificial intelligence that studies how computers can simulate and implement human learning behavior.

Deep learning is a type of machine learning.


The concept of deep learning originated from artificial neural networks.

Artificial neural networks

Artificial neural network is a model that imitates human neural network for information processing. It has the ability of autonomous learning and self-adaptation.

In 1943, mathematicians Pitts and McCulloch established the first neural network model, the M-P model, which was able to perform logical operations and laid the foundation for the development of neural networks.

Biological neurons are composed of four parts: cell body, dendrites, axons and axon terminals. The M-P model is actually an imitation of the structure of biological neurons.
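
As a rough illustration, the following Python sketch implements the M-P idea: weighted inputs are summed and passed through a hard threshold. The weights and threshold here are illustrative values, not from the original text.

```python
# Minimal sketch of an M-P neuron: weighted sum, then a hard threshold.
import numpy as np

def mp_neuron(inputs, weights, threshold):
    # Weighted sum of inputs ("dendrites"), then fire (1) or stay silent (0).
    return int(np.dot(inputs, weights) >= threshold)

# With weights of 1 and a threshold of 2, the neuron behaves like a logical AND gate.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mp_neuron([a, b], [1, 1], threshold=2))
```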

[Figure: Schematic diagram of a neuron and the M-P model]

In the 1980s, the multi-layer perceptron developed by artificial intelligence scientists such as Rumelhart, Williams, Hinton, and Yann LeCun solved the difficulty of predicting complex functions and promoted the further development of artificial neural networks.

In the 1990s, Nobel laureate Edelman proposed the neural Darwinism model and established a systems theory of neural networks. Inspired by Darwin's theory of natural selection, he linked it to the way the brain works, believing that "in the face of an unknown future, the basic requirement for successful adaptation is pre-existing diversity." This idea, which resonates with today's practice of combining multiple models for training and prediction, was of great significance to the development of neural networks in the 1990s.

deep learning

It was not until 2006 that Hinton, known as the "Godfather of Artificial Intelligence", formally proposed the concept of deep learning, arguing that existing models could be improved through a combination of unsupervised and supervised learning. This view caused great repercussions in the field of artificial intelligence, and scholars at famous universities such as Stanford began to study deep learning. 2006 is therefore known as the "first year of deep learning".

In 2009, deep learning was applied to the field of speech recognition.

In 2012, the deep learning model AlexNet won the first place in the ImageNet image recognition competition, and deep learning began to be regarded as synonymous with neural networks.

In the same year, the deep neural network developed by Andrew Ng, an authoritative scholar in the field of artificial intelligence, reduced the image-recognition error rate from 26% to 15%, a major advance for artificial intelligence in image recognition.

In 2014, DeepFace, a deep learning project developed by Facebook, achieved an accuracy of more than 97% in identifying faces.

In 2016, AlphaGo, based on deep learning, defeated South Korea's top Go player Lee Sedol, causing a worldwide sensation. This event not only validated deep learning but also made artificial intelligence known to the general public.

In 2017, deep learning began to be applied in various fields, such as urban security, medical imaging, financial risk control, classroom teaching, etc.

Up to the recent phenomenal product ChatGPT, deep learning has quietly permeated our lives.

Classic models of deep learning

Deep learning is a science built on computer neural network theory and machine learning theory. It uses multiple processing layers on top of complex network structures, combined with non-linear transformations, to abstract complex data into models, and it can recognize images, sound, and text well.

Two classic models of deep learning: CNN and RNN

CNN convolutional neural network

CNN stands for convolutional neural network.

The convolutional neural network is divided into the following levels: input layer, convolution layer, pooling layer, and fully connected layer.

[Figure: Schematic diagram of the working process of a convolutional neural network]

input layer

Performs simple processing on images, such as reducing the image dimensions, to facilitate recognition.

convolution layer

The neurons in the convolutional layer extract features in various dimensions of the image.

Feature extraction is performed on local parts of the image: to identify a puppy in an image, individual neurons are responsible only for local parts such as the dog's ears or eyes.

The convolutional layer extracts features of different scales from the image, which greatly enriches the dimensions of the acquired features and helps improve the accuracy of the final recognition.

Pooling

Compress and reduce the dimensionality of images to reduce the amount of data that needs to be processed for image recognition.

Fully connected

Connect and combine all the image features extracted previously: for example, combine the extracted local features of the puppy's head, body, legs, etc. to form a complete feature vector containing the puppy, and then identify the category.

Three characteristics of convolutional neural networks

1. Each neuron only needs to focus on a small part of the image instead of the entire image, which reduces the difficulty of recognition.

2. The neurons in the convolutional layer can be applied to different image recognition tasks (for example, neurons trained to recognize puppies can also continue to recognize other similar objects)

3. The image feature dimension is reduced but the main features of the image are retained, reducing the amount of data.

Therefore, convolutional neural networks are particularly suitable for image recognition.
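
A minimal PyTorch sketch of the layer sequence described above (input, convolution, pooling, fully connected); the image size, channel counts, and class count are illustrative assumptions.

```python
# Minimal CNN sketch: convolution -> pooling -> fully connected.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer: extract local features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling layer: downsample, keep main features
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected layer: combine features

    def forward(self, x):                # x: (batch, 3, 32, 32) image tensor
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)

logits = TinyCNN()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 2])
```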

[Figure: Schematic diagram of the convolutional neural network image recognition process]

RNN recurrent neural network

RNN stands for recurrent neural network. Research on recurrent neural networks first appeared in the late 1980s, proposed by several neural-network experts. This type of model is often used for recognizing and understanding time-series signals such as speech.

"Recurrent" implies repetition: when a recurrent neural network runs, it performs the same operation repeatedly over a sequence. A sequence is an ordered arrangement of elements that depend on one another, where order matters greatly, such as time-series data or dialogue; once the order is scrambled, the meaning and function change drastically. Recurrent neural networks solve a problem that convolutional neural networks handle poorly: recognizing continuous events. For example, given the sentence "Xiao Ming buys a lot of apples every time he goes to the supermarket because he likes to eat ()", it is easy for a person to guess "apples", but it is hard for ordinary artificial neural networks and convolutional neural networks to connect the context and give the answer; recurrent neural networks make up for this shortcoming and play an irreplaceable role in deep learning.

The reason a recurrent neural network can recognize continuous events is that it takes not only the current input data but also previously perceived data as input. Depending on the length of memory, activations are passed from the first layer to the next, and so on, until the final output is obtained.

A recurrent neural network consists of three parts: input layer, hidden layer and output layer.

The loop occurs in the hidden layer.

A specific prediction function is usually set in the hidden layer. When we feed a continuous event into the recurrent neural network, this function in the hidden layer performs an operation, and its result is fed back into the hidden layer as input for the next operation. In this way a continuous loop of predictions is formed: each prediction depends not only on the newly input data but also on the inputs of previous loops.
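
A minimal sketch of this recurrence, assuming made-up shapes and random weights: the hidden state at each step depends on the new input and on the previous hidden state.

```python
# Minimal RNN recurrence: the hidden state carries memory across time steps.
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))      # input -> hidden
W_hh = rng.normal(size=(4, 4))      # previous hidden -> hidden (the "loop")
W_hy = rng.normal(size=(2, 4))      # hidden -> output

h = np.zeros(4)                     # initial hidden state
sequence = rng.normal(size=(5, 3))  # 5 time steps, 3 features each

for x_t in sequence:
    h = np.tanh(W_xh @ x_t + W_hh @ h)   # carry memory of earlier steps forward
    y_t = W_hy @ h                       # prediction at this step
print("final hidden state:", h)
```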

[Figure: Schematic diagram of the principle of a recurrent neural network]

GAN Generative Adversarial Network

GAN stands for generative adversarial network. It was proposed by Ian Goodfellow and others in 2014, and since then variants such as CycleGAN and StyleGAN have emerged one after another. The images and videos generated in scenarios such as face swapping and clothes changing are realistic enough to pass for real. In 2020, the expression-transfer model implemented in PaddleGAN could turn a single photo into a singing video, making funny videos such as "Ant Yeah Hey" popular all over the Internet.

A generative adversarial network is an unsupervised learning model that learns through two neural networks playing a game against each other: one is the generator network and the other is the discriminator network.

The generator network takes random samples from the latent space as input: it receives a noise vector and converts it into synthetic data. Its output needs to imitate the real samples in the training set as closely as possible, and the synthetic data is then sent to the discriminator network for classification.

The discriminator network takes real samples and the generator's output as input, and its job is to distinguish the generator's output from the real samples. The two networks compete with each other and continuously adjust their parameters, until eventually the generator's output is indistinguishable from real samples.
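
A minimal PyTorch sketch of this two-network game, with a made-up two-dimensional "real" data distribution and illustrative sizes and hyperparameters:

```python
# Minimal GAN training loop: generator vs. discriminator.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))                # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(32, 2) * 0.5 + 3.0     # stand-in "real" data distribution
    z = torch.randn(32, 16)                   # noise from a Gaussian prior
    fake = G(z)

    # Discriminator: push real samples toward 1, fake samples toward 0.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator call its fakes real.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```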

[Figure: GAN network architecture diagram]

noise vector

What is noise?

To put it simply, if we want to identify the "water" in a picture, then the "desert" in the same picture is noise.


Another example is that there are cats and dogs in a picture. If we want to identify cats, then the dogs in the picture are noise.

In deep learning, noise is often added to the input data during training to improve the robustness and generalization ability of the model. This is called data augmentation. By adding noise to the input data, the model is forced to learn features that are robust to small changes in the input, which can help it perform better on new, unseen data.
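
A tiny sketch of this kind of noise-based augmentation (the values and batch shape are illustrative):

```python
# Add Gaussian noise to a batch of inputs during training (data augmentation).
import torch

def add_noise(batch, std=0.1):
    return batch + torch.randn_like(batch) * std

clean = torch.rand(8, 3, 32, 32)            # a batch of images with values in [0, 1]
noisy = add_noise(clean).clamp(0.0, 1.0)    # perturbed copy used as training input
```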

Robustness refers to a system's ability to maintain certain properties under perturbations of its (structural or size) parameters.

The input of the GAN generator is random noise, so that a different picture is generated each time. But if the input were completely random, we would not know what characteristics the generated images have and the results would be uncontrollable, so the noise is usually sampled from a prior random distribution. Commonly used random distributions:

  • Gaussian distribution: The most widely used probability distribution for continuous variables;
  • Uniform distribution: A simple distribution of a continuous variable x.

Introducing random noise makes the generated pictures diverse. For example, different noise vectors z can produce different digits:

[Figure: different noise vectors z produce different generated digits]

So the role of noise is to ensure that the generated pictures are different but within the appropriate range (not completely random). That is to ensure that the results are different, controllable and reliable.
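
For example, the generator's input noise z can be drawn from either prior listed above (the batch size and dimensionality are illustrative):

```python
# Sampling the generator's input noise z from a prior distribution.
import torch

z_gaussian = torch.randn(8, 100)          # standard Gaussian prior, 100-dim noise
z_uniform = torch.rand(8, 100) * 2 - 1    # uniform prior on [-1, 1]
```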

Popular principles of generative adversarial networks

In layman's terms, the working principle of GAN is similar to this scenario:

  • A boy tries to take photographer-quality photos, while a girl tries to find flaws in the photos.

  • In this process, the boy first takes some photos, and the girl then points out how his photos differ from the photographer's.

  • The boy improves his shooting technique based on this feedback and takes new photos, and the girl continues to suggest changes to the new photos.

  • This continues until an equilibrium is reached: the girl can no longer tell the boy's photos apart from the photographer's.

In this way, a GAN can learn the characteristics of a large amount of unlabeled data from multiple dimensions. In earlier model training processes, a model only started learning after annotators had labeled the input data;

By utilizing the mutual confrontation between the generative network and the discriminative network, GAN can spontaneously learn the rules of the input data, ensuring that the generated results are close to the real samples in the training set, thereby achieving learning from unlabeled data.

In fact, GAN is the same as all generative models. The goal is to fit the distribution of training data. For image generation tasks, it is to fit the pixel probability distribution of training set images.

[Figure: a GAN model performing style transfer on pictures]

Transformer: from sequence to sequence seq2seq

Transformer means "converter". This is also the core of Transformer, which is what it can achieve - from sequence to sequence. But from sequence to sequence, it is not simply jumping from one word to another. It has to go through many "processes" in order to achieve the desired effect.

A sequence refers to a series of data with continuity, such as text, speech, and video data. Unlike image data, where different pictures are often unrelated to one another, text, speech, and video are continuous: their content at one moment is often related to earlier moments and also affects later moments.

[Figure: example of a sequence-to-sequence problem]

Sequence-to-sequence models generally consist of an encoder and a decoder. The workflow can be simply described as encoding the input sequence on the encoder side to generate an intermediate semantic encoding vector, and then decoding this intermediate vector on the decoder side to obtain the target output sequence.

Taking the Chinese-English translation scenario as an example, the corresponding input to the encoder side is a Chinese sequence, and the corresponding output to the decoder side is the translated English sequence.
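
A minimal PyTorch sketch of this encoder-decoder workflow, with made-up vocabulary sizes and random token ids standing in for the Chinese input and English output:

```python
# Minimal seq2seq sketch: encoder compresses the input, decoder unrolls the output.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.src_emb(src))          # intermediate semantic vector
        dec_states, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_states)                            # scores over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 7))   # e.g. a batch of source-language token ids
tgt = torch.randint(0, 1000, (2, 9))   # e.g. the target-language token ids so far
print(model(src, tgt).shape)           # (2, 9, 1000)
```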

[Figure: structure of the encoder-decoder mechanism]

In actual application, the input and output data of the sequence-to-sequence model can be in different forms of data, and the corresponding model structures used on the encoder side and the decoder side can be different.

The sequence-to-sequence model seems perfect, but it still runs into problems in practice. For example, in a translation scenario, if the sentence is too long, vanishing gradients can occur. Because decoding relies on the fixed-length vector output by the last hidden state, words closer to the end are "memorized" more deeply while words far from the end are gradually diluted, so the final output is unsatisfactory. Faced with these problems, researchers have proposed corresponding solutions, such as adding an attention mechanism.

Transformer: attention mechanism

The traditional encoder-decoder architecture has limitations on sequence length; the essential reason is that it cannot reflect how much attention should be paid to different words in a sentence. In different natural language processing tasks, different parts of a sentence carry different meaning and importance. Take the sentence "I like this book because it tells a lot about growing flowers": for sentiment analysis, we should obviously pay more attention to the word "like" during training, whereas if we classify the book by its content, we should pay more attention to "growing flowers". This is where the attention mechanism comes in. It draws on the way human attention works: guided by intuition, humans can use limited attention to quickly extract the most valuable information from a large amount of information.

The attention mechanism computes the correlation between each vector output by the encoder and each vector output by the decoder, obtaining a set of correlation scores, and then normalizes them into correlation weights that characterize the correlation between elements of the input sequence and elements of the output sequence. During training, these weights are continuously adjusted and optimized; the ultimate goal is to give the decoder a reasonable reference weight for each element of the input sequence when generating its results.

The self-attention mechanism is a variant of the attention mechanism. It relies less on external information and is better at capturing the internal correlations within data or features. Take the English sentence "He thought it was light before he lifted the backpack.": does "light" here mean "not heavy" or "bright"? We need context to decide. After seeing "backpack", we know that "light" most likely means "not heavy". The self-attention mechanism computes the correlation between each word and every other word; in this sentence, when the word "light" is translated, the word "backpack" receives a higher correlation weight.
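
A minimal numpy sketch of this computation: pairwise correlation scores are normalized into weights, which then mix the value vectors (the shapes and random weights are illustrative):

```python
# Minimal scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # correlation between every pair of tokens
    weights = softmax(scores)                 # normalized relevance weights
    return weights @ V                        # each output mixes all tokens by relevance

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                   # 6 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 8)
```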

Transformer model

The Transformer model is an upgrade based on the ordinary encoder-decoder structure. Its encoding end is composed of multiple encoders connected in series, and the decoding end is also composed of multiple decoders.

It also optimizes input encoding and self-attention, such as using a multi-head attention mechanism, introducing a position encoding mechanism, etc., which can identify more complex language situations and handle more complex tasks.

[Figure: Transformer network structure diagram]

[Figure: internal structure of the Transformer encoder and decoder]

Multi-head attention: to put it simply, the attention between different tokens is computed by multiple attention heads, and each head calculates attention weights based on the correlations between tokens.

For example, in a sentence, one attention head may focus on the relationship between a word and the word that follows it, while another head focuses on the relationship between the verb and its object.

In actual operation, these attention heads are computed simultaneously, which speeds up the overall response. Once all heads have finished, their results are concatenated, then processed and output by the final feed-forward neural network layer.
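
A short sketch using PyTorch's built-in multi-head attention module; the embedding size, head count, and token count are illustrative:

```python
# Multi-head self-attention: heads run in parallel, outputs are concatenated and projected.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
tokens = torch.randn(1, 10, 64)             # one sentence of 10 token embeddings
out, weights = mha(tokens, tokens, tokens)  # self-attention: query = key = value = tokens
print(out.shape, weights.shape)             # (1, 10, 64), (1, 10, 10) weights averaged over heads
```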

"The monkey ate the banana quickly and it looks hungry." (The monkey ate the banana quickly and it looks hungry.) What does "it" refer to in this sentence? Is it "banana" or "monkey"?

When encoding the word "it", one attention head may focus more on "monkey", while another may consider the correlation between "it" and "banana" to be stronger; relying on either alone could bias the model's output. This is where the multi-head attention mechanism comes into play: still other heads notice "hungry", and through the weighted combination of the heads' results, the presence of "hungry" gives "it" a stronger correlation with "monkey", eliminating the deviation in semantic understanding to the greatest extent.

The positional encoding mechanism is another distinctive feature of the Transformer. Positional encoding is added to the input because, during computation, the model needs to know not only which words attention is focused on but also the relative positions of the words. For example, in "She bought a book and a pen.", which noun does each "a" modify, "book" or "pen"?

If only the self-attention mechanism were used, the relationship between each "a" and the noun that follows it might be ignored, with attention paid only to the correlation between "a" and other words. Introducing positional encoding solves this problem well: each word is given an additional vector representing its position in the sequence. When computing correlations, the model can then consider not only the semantic correlation between words but also their positional correlation, and thus more accurately understand which object each word in the sentence refers to or modifies.

The multi-head attention mechanism focuses on semantic correlation, and the position encoding mechanism focuses on position correlation.
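
For reference, a small numpy sketch of the sinusoidal positional encoding used in the original Transformer paper, where each position receives a vector of sines and cosines at different frequencies that is added to the token embedding:

```python
# Sinusoidal positional encoding from "Attention Is All You Need".
import numpy as np

def positional_encoding(seq_len, dim):
    pos = np.arange(seq_len)[:, None]                       # positions 0..seq_len-1
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)    # frequency depends on the dimension index
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions: cosine
    return pe

print(positional_encoding(seq_len=10, dim=16).shape)        # (10, 16)
```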

By introducing multi-head attention mechanism, position coding and other methods, Transformer has the ability to understand semantics to the maximum extent and output corresponding answers, which also lays the foundation for the emergence of subsequent large-scale pre-training models such as GPT models.

GPT series models

When a general neural network is trained, the parameters in the network are first randomly initialized, and then the algorithm is used to continuously optimize the model parameters.

GPT is a typical "pre-training + fine-tuning" two-stage model.

The training method of GPT is that the model parameters are no longer randomly initialized, but a large amount of common data is used for "pre-training" to obtain a set of model parameters;

Then use this set of parameters to initialize the model, and then use a small amount of data in a specific field for training. This process is called "fine-tuning."

Pre-training is a type of transfer learning. Pre-trained language models have brought natural language processing to a new stage - through big data pre-training and small data fine-tuning, the solution of natural language processing tasks no longer needs to rely on a large amount of manual parameter adjustment.
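
A minimal PyTorch sketch of this two-stage idea: train on plentiful general data, then initialize a copy from those parameters and fine-tune briefly on a small domain-specific dataset. The model and data here are illustrative stand-ins.

```python
# "Pre-train then fine-tune" in miniature.
import torch
import torch.nn as nn

def train(model, inputs, targets, steps, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Stage 1: "pre-training" on plentiful general-purpose data.
train(model, torch.randn(1024, 10), torch.randn(1024, 1), steps=200)

# Stage 2: "fine-tuning" -- start from the pre-trained parameters, not random ones,
# and train briefly on a small domain-specific dataset.
finetuned = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
finetuned.load_state_dict(model.state_dict())
train(finetuned, torch.randn(64, 10), torch.randn(64, 1), steps=20, lr=1e-4)
```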

The model structure of the GPT series follows the idea of continuously stacking Transformers, using Transformers as feature extractors and training with a large corpus, a large number of model parameters, and powerful computing resources. Iterative updates are completed by continuously improving the scale and quality of the training corpus and increasing the number of network parameters. The successive iterations of the GPT models also show that continuously scaling up the model in this way keeps improving its capabilities.

The training of ChatGPT is divided into three steps.

The first step is to generate a fine-tuned model through manual annotation. The annotation team first prepares a certain number of prompt word samples, part of which is prepared by the annotation team itself, and the other part comes from OpenAI's existing data accumulation.

Then they annotate these samples, that is, they manually write the corresponding response to each prompt, forming a dataset of "prompt-response" pairs. Finally, these datasets are used to fine-tune GPT-3.5 to obtain a fine-tuned model.

[Figure: ChatGPT model training steps]

The second step is to train a reward model that can evaluate how satisfactory a response is. Again, a set of prompt samples is prepared and the model obtained in the first step is asked to respond; for each prompt, the model outputs multiple responses. The annotation team then ranks the responses to each prompt, a ranking that encodes human expectations about the model's output, thereby forming a new annotated dataset that is ultimately used to train the reward model. With this reward model, the model's responses can be scored, which also provides an evaluation standard for them.
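
One common way to turn such rankings into a training signal is a pairwise ranking loss: for each pair in which annotators preferred reply A over reply B, the reward model is pushed to score A higher. Below is a minimal PyTorch sketch, with random vectors standing in for real response encodings.

```python
# Pairwise ranking loss for a reward model (illustrative stand-in features).
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

preferred = torch.randn(32, 128)   # encodings of replies ranked higher by annotators
rejected = torch.randn(32, 128)    # encodings of replies ranked lower

for _ in range(100):
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    # Push the preferred reply's score above the rejected reply's score.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```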

The third step is to use the reward model trained in the second step to optimize the reply policy through a reinforcement learning algorithm. A policy optimization algorithm is used here, which continuously adjusts the current policy based on the actions taken and the rewards received. Specifically, a set of prompt samples is prepared, the model replies to the prompts, the reward model trained in the second step scores each reply, and the reply policy is adjusted based on those scores. Humans are no longer involved in this process; the policy is updated by "AI training AI". After repeating this process several times, a policy with better response quality is obtained.

Diffusion model

Another major contributor to the rapid development of the AIGC field is, of course, the advancement of AI painting technology, in particular DALL·E 2, a powerful AI painting tool released by OpenAI in April 2022. With this tool, you can generate a completely new image simply by entering a short piece of text.

The core technology behind it is the Diffusion model.

Disadvantages of Generative Adversarial Network Models

Before the emergence of the Diffusion model, image generation based on GAN (generative adversarial network) models was the mainstream of research, but GANs have some known flaws. A GAN may fail to learn the complete probability distribution: for example, a GAN trained on images of many kinds of animals may end up generating only images of dogs. In addition, technical problems such as training instability hinder its widespread use.

Advantages of Diffusion Model

The Diffusion model uses newer training techniques that avoid the difficult tuning stage of GAN models. It can be applied directly to tasks in specific fields and achieves striking generation results, which has made research on diffusion models flourish.

The essence and principle of the Diffusion model

Diffusion is originally a physical phenomenon: a transport process driven by the thermal motion of molecules, in which molecules move from regions of high concentration to regions of low concentration through Brownian motion. It tends toward thermal equilibrium and is an entropy-driven process.

For example, a drop of ink spreads throughout a container of water. Calculating the distribution of ink molecules within a small volume of the container during this diffusion process is very difficult, because the distribution is complex and hard to sample. Eventually, however, the ink diffuses evenly through the water, and this uniform, simple molecular probability distribution can be described directly with a mathematical expression.

Statistical thermodynamics can describe the probability distribution at each moment of the diffusion process, and each step is reversible: as long as each step is small enough, the simple distribution can be transformed back into the complex one.

The diffusion model was first proposed in the 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", whose authors developed a new generative model inspired by statistical thermodynamics. The idea is actually very simple: first, noise is continuously added to images in the training dataset until they eventually become completely blurred, much like a drop of ink spreading through water until the water turns a uniform light blue; the model is then taught how to reverse this process and turn noise back into images.

The algorithm implementation of the diffusion model is divided into two processes: forward diffusion process and reverse diffusion process. The forward diffusion process can be described as gradually applying Gaussian noise to an image until the image becomes completely unrecognizable.

For example, through the forward diffusion process, the scenery in the picture becomes blurred until the entire picture becomes a mosaic. This process seems to be full of randomness, but in fact it has a specific meaning. The entire process can be expressed as a Markov chain of the forward diffusion process - a stochastic process that describes the transition from one state to another. The probability distribution of each state in this random process can only be determined by its previous state and has nothing to do with other states. Correspondingly, we can define each picture in the entire forward diffusion process as a state. Then what each picture looks like is only related to its previous picture and follows a certain probability distribution. So we first get a well-defined forward process.
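
A minimal numpy sketch of this forward (noising) process: at each step a little Gaussian noise is mixed in, and each state depends only on the previous one. The noise schedule and step count are illustrative assumptions.

```python
# Forward diffusion: gradually turn an image into (almost) pure Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 32))                 # stand-in for an image
betas = np.linspace(1e-4, 0.02, 1000)    # noise schedule over 1000 steps

for beta in betas:
    noise = rng.normal(size=x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise   # x_t depends only on x_{t-1}

print(x.mean(), x.std())   # after many steps, close to a standard Gaussian
```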

[Figure: the diffusion process of a diffusion model]

So how to apply this process to restore the mosaic image to the original image? The problem is that it is very difficult to derive a clear reverse process from the forward process. This can also be imagined based on the actual situation. A very blurry image with random noise added many times is almost impossible to completely restore to the original image.

Therefore, the diffusion model adopts an approximate method, that is, through neural network learning to approximately calculate the probability distribution of the reverse diffusion process.

After applying this method, even an image that has become completely blurred after many rounds of added noise can be restored to something close to its original appearance, and as the model learns iteratively, the final result comes ever closer to what is required.

Through the two processes of forward diffusion and reverse diffusion, the diffusion model can generate a brand new image based on an original image. This greatly reduces the difficulty of data processing during model training, which is equivalent to using a new mathematical paradigm to define the "generation" process from another perspective. Compared with the GAN model, the diffusion model only needs to train the "generator", the training objective function is simple, and there is no need to train other networks, which greatly enhances the ease of use.

The diffusion model did not receive much attention when it was first proposed.

On the one hand, this is because the GAN model was very popular at that time, and the research focus of researchers still focused on optimization based on GAN;

On the other hand, the results generated by early diffusion models were not ideal, and because the diffusion process is a Markov chain, many diffusion steps are needed to obtain good results, which makes sample generation slow.

As the author of the aforementioned paper recalled, "At the time, this model was not surprising."

Little did people know that a more modern image generation technology had quietly taken root. This new generative model burst forth with unexpected vitality and truly entered the stage of history, and generative image applications entered the modern era of "text to image".

Stable Diffusion

Stable Diffusion is a text-to-image model developed by Stability AI. With simple interaction and fast generation, it greatly lowers the barrier to use while maintaining impressive results, setting off another creative boom in AI painting.


Stable Diffusion works by first using a text encoder to convert the semantics of the prompt into a form the computer can process, that is, encoding the text into a mathematical representation the computer can understand, and then using an image generator to turn these encodings into an image that matches the semantics.

Let’s look at the text encoder part first. The computer cannot understand human language by itself, so a text encoding technique is needed, namely the CLIP model. CLIP is a multimodal deep learning model open-sourced by OpenAI; its name stands for contrastive language-image pre-training, a large-scale image-text pre-training model based on contrastive learning. The CLIP model not only performs semantic understanding, but also combines text information with image information and couples them through the attention mechanism. So how is the CLIP model trained, and how does it enable text-to-image conversion in Stable Diffusion?

To train a CLIP model that can process human language and convert it into computer vision language, you must first have a data set that combines human language and computer vision. In fact, the CLIP model is trained on 400 million images and their corresponding text descriptions collected from the Internet.

The CLIP model consists of an image encoder and a text encoder. Its training process is shown in the figure below. First, a picture and a piece of text are randomly drawn from the accumulated dataset (the text and the picture do not necessarily match). The picture and the text are encoded into two vectors by the image encoder and the text encoder respectively. The CLIP model's task is to make matching images and texts produce similar encodings; training on this basis finally yields the optimal parameters of the two encoders.

[Figure: examples of CLIP training images and their descriptions]

[Figure: CLIP model training process]

For example, given a picture of a dog and the text "a dog", the trained CLIP model will produce similar encodings from the image encoder and the text encoder, confirming that the text and the picture match and providing a basis for converting between the two. At the same time, through the CLIP model, human language and computer vision share a unified mathematical representation, which is the secret behind text-generated images. The CLIP model thus plays the most central role in the text encoder part of Stable Diffusion.
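
A simplified PyTorch sketch of this contrastive objective: matching image/text pairs in a batch should produce similar encodings, so the similarity matrix is trained with a symmetric cross-entropy whose correct answers lie on the diagonal. The encoders here are illustrative stand-ins, not the real CLIP networks.

```python
# Simplified CLIP-style contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(512, 256)   # stand-in for a vision model
text_encoder = nn.Linear(300, 256)    # stand-in for a text model

images = torch.randn(16, 512)         # a batch of 16 matching image/text pairs
texts = torch.randn(16, 300)

img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(texts), dim=-1)
logits = img_emb @ txt_emb.T / 0.07   # pairwise similarities (temperature 0.07)

labels = torch.arange(16)             # the correct match for each row is on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```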

The image generator part consists of two stages, one is the image information generation stage, and the other is the image decoding stage.

In the image information generation stage, the diffusion model first uses a random number generation function to generate a random noise, and then combines it with the encoding information generated by the text encoder part using the CLIP model to generate a semantic encoding information containing noise.

This semantic encoding is then used to generate lower-dimensional image information, the so-called latent-space information, meaning the image is represented by latent variables. This is why Stable Diffusion outperforms earlier diffusion models in processing speed and resource usage.

General diffusion models directly generate images at this stage, so more information is generated and processing is more difficult. However, Stable Diffusion first generates latent variables, so less information needs to be processed and the load is smaller.

Technically speaking, how does Stable Diffusion do this? It is accomplished by a U-Net network together with a scheduling algorithm: the scheduler controls the progress of generation, while the U-Net executes the generation step by step. The whole U-Net iteration is repeated roughly 50 to 100 times, and the quality of the latent variables improves over these iterations.

After the image information is generated, the image decoding stage is reached. The image decoding process actually takes over the hidden variables of the image information, increases its dimensionality, and restores it to a complete picture. The image decoding process is also the final process by which we can actually obtain a picture. Since the diffusion process is iteratively denoising step by step, each step injects semantic information into the latent variables and repeats until the denoising is completed. Through Unet's generation iteration during the image decoding process, the picture becomes what we want step by step.


To sum it up:

Stable Diffusion first uses the CLIP model to semantically understand the input prompt and convert it into encoding information close to an image encoding; from the perspective of the subsequent modules, the piece of text has effectively become a picture with similar semantics. Then, in the image generator module, a complete diffusion, denoising, and image-generation process is carried out to produce a picture that meets the prompt's requirements. Finally, through the joint action of the text encoder and the image generator, the seemingly magical feat of "words" turning into "paintings", text turning into images, happens.
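
As a usage sketch of this text-to-image flow, assuming the Hugging Face diffusers library, a GPU, and the runwayml/stable-diffusion-v1-5 weights are available (the prompt and settings are just examples):

```python
# Text-to-image with a Stable Diffusion pipeline (assumes diffusers + a CUDA GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at sunrise",
    num_inference_steps=50,   # U-Net denoising iterations in latent space
    guidance_scale=7.5,       # how strongly the image follows the prompt
).images[0]
image.save("lighthouse.png")
```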

AGI

In recent years, artificial intelligence solutions have made incredible progress in key areas such as natural language processing, visual recognition, and text, image and video generation. Now, artificial intelligence is attempting a huge leap toward matching human intelligence, moving from "weak artificial intelligence" that only works in specific fields toward a more versatile and arguably more powerful form: AGI (artificial general intelligence). AGI will undoubtedly become the next rapidly developing direction.

AGI can also be called "strong artificial intelligence" (strong AI), referring to artificial intelligence that matches or exceeds human intelligence and can exhibit all the intelligent behaviors of a normal human. By comparison, all the artificial intelligence we have now and have had in the past is still "weak" or "narrow" artificial intelligence: although its ability to solve a specific problem can be very strong, even surpassing humans, it struggles with other problems. For example, we can teach a machine to recognize faces, but that ability, and the process and methods used to acquire it, do little to help it keep its balance or navigate.

In the future, if artificial intelligence is to reach the level of AGI, it will need more powerful capabilities, such as: the ability to reason and make decisions under uncertainty; the ability to represent knowledge, including common-sense knowledge; the ability to plan, learn, and communicate in natural language; and the ability to integrate all of these abilities to achieve set goals.

AGI will be the next important leap in the field of artificial intelligence research.

Future research directions for AGI

First, cross-modal perception.

Each kind of information source we usually come into contact with is called a modality: text, sound, images, taste, touch, and so on.

However, most current artificial intelligence systems can only use one of them as a sensor to perceive the world. For different modes, different proprietary models need to be designed.

The inability to truly connect the various modalities is a major pain point on the way to AGI. Therefore, it is critical to study how to enable artificial intelligence systems to achieve cross-modal perception.

Second, multi-task collaboration.

Humans can handle multiple tasks simultaneously and coordinate and switch between them. When people interact with robots, they give simple instructions such as "please heat up my lunch" or "please bring me the remote control". These instructions sound simple, but executing them involves understanding the instruction, decomposing the task, planning a route, identifying objects, and a whole series of actions, each of which currently requires a dedicated system or model. This demands that robots be able to coordinate multiple tasks. Multi-task collaboration is therefore one of the most important research directions for AGI: "generality" should mean not only completing multiple tasks at the same time but also adapting quickly to new tasks that differ from the training situation.

Third, self-learning and adaptation.

Human beings have the ability to learn and adapt, and can improve their capabilities through continuous learning and experience accumulation. Therefore, studying how to make artificial intelligence systems have self-learning and adaptive capabilities is also a necessary step to achieve AGI.

It mainly includes three directions: incremental learning, transfer learning and domain adaptation.

Incremental learning: just as people keep learning new knowledge every day without forgetting what they have already learned, incremental learning means that a learning system can continuously and effectively learn new knowledge from new samples while retaining most of the previously learned knowledge. It addresses the problem of "catastrophic forgetting" in deep learning.

Transfer learning is a very common human ability. For example, learning to identify apples may help us identify pears, and learning to play the electronic keyboard may help us learn the piano. In machine learning, we can take a model developed for task A as a starting point and reuse it when developing a model for task B, that is, improve learning of a new task by transferring knowledge from related tasks that have already been learned (a minimal code sketch follows after the three directions). The core of transfer learning is to find and make good use of the similarities between the source domain and the target domain.

Domain adaptation can be regarded as a type of transfer learning, which aims to use well-labeled data in the source domain to learn an accurate model and apply it to the target domain with no labels or only a small amount of labels. The core problem it wants to solve is the mismatch between the joint probability distributions of the source domain and target domain data.
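
Returning to the transfer-learning idea above, here is a minimal PyTorch sketch: a feature extractor assumed to have been trained for task A is frozen and reused for task B by training only a new output head. The models and data are illustrative stand-ins.

```python
# Reusing a task-A feature extractor for task B (transfer learning in miniature).
import torch
import torch.nn as nn

# Pretend this backbone was already trained on task A (e.g. recognizing apples).
backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU())

# Reuse it for task B (e.g. recognizing pears): freeze it and add a new head.
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(128, 2)

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 64), torch.randint(0, 2, (32,))
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(head(backbone(x)), y)   # only the new head is updated
    loss.backward()
    opt.step()
```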

Fourth, emotional understanding.

Being able to understand and express emotions is the most important characteristic of human beings. It often affects the next step of events in communication and collaboration. Allowing artificial intelligence systems to understand emotions, including emotional expression, emotion analysis, and emotion generation, is a key direction for realizing AGI.

Fifth, super computing power.

Implementing AGI requires huge computing resources and supercomputing power.

Advantages and bottlenecks of AIGC

Advantages

AIGC applications, represented by ChatGPT, have become deeply involved in the transformation of enterprise business processes, integrating the automatic generation of text, images, video, code, and other content with existing enterprise management systems to streamline and optimize originally complex business processes, greatly improving organizations' operational efficiency.

The impact of AIGC on business processes is undoubtedly positive. Whether it is used for content writing, intelligent customer service, or schedule management, or in business areas such as marketing, sales, finance, and human resources, AIGC can streamline or optimize business processes to varying degrees, shorten process cycles, improve process efficiency, and ultimately reduce costs and increase efficiency for enterprises and organizations.

Bottlenecks

1. The generated content can be inaccurate or unreliable; text may be directly pieced together and lack logic.

2. The operation of AIGC requires huge computing power support. The development of technology is also intensifying the demand for computing power, which will inevitably generate huge costs and even require exploration of changes in computing methods.

3. The development and application of AIGC may also cause unemployment and replace some blue-collar and white-collar jobs, thus causing widespread anxiety and panic in society. Issues such as data security and privacy protection, and copyright disputes are also bottlenecks for the further development of AIGC.

4. AIGC’s generation capabilities come from data and models, and the data source itself can cause hidden concerns. The development of AIGC also needs to face and solve data issues (security of data transmission, security of data protection, etc.).


Source: blog.csdn.net/yinweimumu/article/details/134866469