Four pictures of the development history of Daoqing AI large model (1943-2023)

The most popular topic right now is GPT, a large language model (LLM). "LLM" stands for "Large Language Model" and usually refers to natural language processing models with huge parameter counts and complex architectures, such as GPT-3 (Generative Pre-trained Transformer 3). These models perform very well on text and language tasks, but it is their large parameter volume and computational requirements that earn them the name "large models". Of course, there are also models that automatically generate images, but their influence is not yet as great as GPT's.

Definition of large model

The concept of a "large model" is relative to the small models that came before it, and the two generated pictures illustrate the point well. A small model is like a small island: an ecosystem made up of some animals and plants, but its area (computing power), species (parameters) and ecosystem structure (model structure) are all rather limited. A large model is more like an archipelago. It is not just a simple splicing together of small islands; the islands are interconnected into a large whole, even a large country.

[Figure: a small model as a single island versus a large model as an interconnected archipelago]
From a technical perspective, a large model usually refers to a deep neural network with a very large number of parameters and a complex structure. As computing power and large-scale datasets have grown, researchers have begun to build neural networks with hundreds of millions or even hundreds of billions of parameters; these are commonly called large models. Compared with earlier small models, large models show significant increases in both parameter count and complexity.

Here are some of the main differences between large models and the earlier small models:

1. Number of parameters: Large models have more parameters than previous small models. This means that large models can better capture complex patterns and features in the data, resulting in better performance on a variety of tasks.

2. Complexity: Large models usually have deeper network structures, including more hidden layers. This depth can help the model learn higher-level abstract features, thereby improving its representation capabilities.

3. Generalization ability: A large model may fit the training data better, but its generalization ability is not guaranteed. Without enough data or regularization, large models can easily overfit, i.e. perform well on training data but poorly on new data.

4. Computing resources: Training and deploying large models requires more computing resources, including more computing time and memory. This makes large models potentially more expensive in practical applications.

5. Data volume: Large models usually require more data for training to prevent overfitting. This may require more annotation work and data collection.

6. Domain applicability: Large models may perform well in certain specific fields, especially tasks that require complex feature representation, such as natural language processing and computer vision. However, for some simple tasks, large models can be too complex, resulting in a waste of resources.

7. Interpretability: The complexity of a large model may make its internal decision-making process difficult to explain. This may raise privacy and transparency issues in some applications.

8. New algorithms and technologies: The development of large models has promoted the research of new algorithms and technologies, including regularization methods, optimization techniques and parameter initialization strategies to reduce overfitting and improve training efficiency.

In general, large models may perform very well on specific tasks, but they also face some challenges and limitations, such as computational resource requirements, generalization capabilities, etc. When choosing between using a large model or a small model, you need to weigh different factors and make a decision based on your specific problem and needs.

A macroscopic perspective on the development history of AI models

Looking at the entire history of model development, progress has not been smooth sailing: since the field's birth there have been periods of rapid development and periods of 'darkness'. I have drawn a simple timeline to help everyone understand this history. The whole development can be divided into six stages:
1. Initial development period: 1943-1960s
2. Reflection development period: 1970s
3. Application development period: 1980s
4. Stable development period: 1990s-2010
5. Prosperous development period: 2011-2018
6. LLM prosperous development period: 2018-2023
Of course, there is still some dispute about this historical division, but it does not affect understanding of the overall process. The account here is organized around neural-network-based model systems. The development of models has in fact tracked continuous technological innovation, and the models themselves are directly tied to the computing power available. After neural networks were first introduced, many scholars were interested in this new topic, but limitations in computing power led to a stagnation lasting more than ten years. Thanks to Moore's Law, CPUs with strong computing capabilities could be used to build more complex models, although that era was still the early, machine-learning stage of modelling. It was not until the emergence of the GPU that models truly flourished and LLMs appeared.
[Figure: timeline of the six development stages of AI models, 1943-2023]

A microscopic perspective on the development history of AI models

From a micro perspective, this history is the result of one study after another. The figure below summarizes the historical stages of model development; in text form, I list the key milestones from beginning to end.
[Figure: key research milestones in model development, listed chronologically]

1 Initial development period: 1943-1960s

After the concept of artificial intelligence was proposed, symbolism and connectionism (neural networks) both developed, and a number of eye-catching research results followed, such as machine theorem proving, the checkers program and human-machine dialogue, setting off the first climax of artificial intelligence development.

In 1943, the American neuroscientist Warren McCulloch and the logician Walter Pitts proposed a mathematical model of the neuron, one of the cornerstones of the modern artificial intelligence discipline.

In 1950, Alan Mathison Turing proposed the "Turing test" (testing whether a machine can exhibit intelligence indistinguishable from a human's), and the idea of making machines produce intelligence began to enter the public's view.

In 1950, Claude Shannon proposed programming a computer to play chess, opening the study of computer game playing.

In 1956, the term artificial intelligence (AI) was officially adopted at the Dartmouth Summer Research Project on Artificial Intelligence. This was the first artificial intelligence workshop in history and marked the birth of artificial intelligence as a discipline.

In 1957, Frank Rosenblatt simulated and implemented a neural network model he invented called the "Perceptron" on an IBM-704 computer.

In 1958, David Cox proposed logistic regression.
Logistic regression (LR) is a linear classification model whose structure resembles the perceptron. The main differences are that the neuron's activation function f is the sigmoid and that the model's training objective is maximum likelihood, i.e. maximizing the probability of correct classification.
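As an illustration, here is a minimal sketch of logistic regression trained by maximizing the likelihood (equivalently, gradient descent on the negative log-likelihood); the toy data and learning rate are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary labels (hypothetical values).
X = np.array([[0.5, 1.2], [1.0, 0.8], [-0.7, -1.1], [-1.2, -0.3]])
y = np.array([1, 1, 0, 0])

w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted probability of class 1
    w -= lr * X.T @ (p - y) / len(y)  # gradient of the negative log-likelihood
    b -= lr * np.mean(p - y)

print(sigmoid(X @ w + b).round(2))    # probabilities should approach [1, 1, 0, 0]
```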

In 1959, Arthur Samuel gave machine learning a clear definition: a "field of study that gives computers the ability to learn without being explicitly programmed".

In 1961, Leonard Merrick Uhr and Charles M. Vossler published a pattern recognition paper titled A Pattern Recognition Program That Generates, Evaluates and Adjusts its Own Operators, which described an attempt to design a pattern recognition program using machine learning, or self-organizing, processes.

In 1965, I. J. Good published an article on the threat that artificial intelligence may pose to mankind in the future, which can be regarded as the pioneer of the "AI threat theory". He argued that machine superintelligence and an inevitable intelligence explosion would ultimately be beyond human control. The later dire predictions about artificial intelligence made by the famous scientist Stephen Hawking, the entrepreneur Elon Musk and others echo Good's warning from half a century earlier.

In 1966, the MIT scientist Joseph Weizenbaum published an article in Communications of the ACM titled "ELIZA, a computer program for the study of natural language communication between man and machine", describing how the ELIZA program makes it possible, to a certain extent, for humans and computers to hold natural language conversations. ELIZA works by decomposing the input using keyword-matching rules and then generating replies from the reassembly rules associated with each decomposition rule.

In 1967, Thomas Cover and Peter Hart proposed the k-nearest neighbor algorithm (KNN). The core idea of KNN is: given a training dataset and a new input instance, find the K training instances closest to the new instance, and assign the new instance to the category to which the majority of those K instances belong.
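A minimal KNN sketch along those lines, using made-up toy points and a plain majority vote (the distance metric and k value are illustrative choices):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)         # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                         # indices of the K closest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority category

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> "A"
```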

In 1968, Edward Feigenbaum proposed the first expert system, DENDRAL, and gave a preliminary definition of the knowledge base, which also helped give birth to the second wave of artificial intelligence. The system has very rich chemical knowledge and can help chemists infer molecular structures from mass spectrometry data. Expert systems are an important branch of AI and, together with natural language understanding and robotics, are listed as the three major AI research directions. An expert system is defined as a computer model of human expert reasoning used to deal with complex real-world problems that normally require expert interpretation, drawing the same conclusions an expert would; it can be regarded as the combination of a "knowledge base" and an "inference engine".

In 1969, Marvin Minsky, a representative of "symbolism", raised the XOR linear-inseparability problem in his book "Perceptrons": a single-layer perceptron cannot separate the XOR data. Solving it requires a nonlinear multi-layer network (an MLP, with at least two layers), but at the time there was no effective training algorithm for multi-layer networks. These arguments dealt a heavy blow to neural network research, which entered a roughly ten-year low period.
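To make the point concrete, here is a small sketch (using backpropagation, which only arrived later) showing that a two-layer MLP can learn XOR, which no single-layer perceptron can separate; the architecture, seed and learning rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # hidden layer with 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # output layer

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                         # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)              # backpropagate the squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())                          # should approach [0, 1, 1, 0]
```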

2 Reflection and development period: 1970s

The breakthroughs of the early stage greatly raised people's expectations for artificial intelligence, and researchers began attempting more challenging tasks. However, the lack of computing power and theory meant that the over-ambitious goals could not be met, and the development of artificial intelligence entered a new low period.

In 1974, in his Harvard doctoral thesis, Paul Werbos first proposed training artificial neural networks by error backpropagation (BP), although it attracted little attention at the time. The basic idea of BP is not to adjust the weights with the error itself (as the perceptron does) but with the derivative (gradient) of the error: the error gradient is propagated backwards and the model weights are updated to reduce the learning error, fit the learning target, and realize the network's role as a universal function approximator.

In 1975, Marvin Minsky proposed a knowledge representation learning framework theory for artificial intelligence in his paper "A Framework for Representing Knowledge".

In 1976, Randall Davis built and maintained large-scale knowledge bases and proposed that the use of integrated object-oriented models can improve the integrity of knowledge base (KB) development, maintenance and use.

In 1976, Edward H. Shortliffe of Stanford University and others completed the first medical expert system MYCIN for the diagnosis, treatment and consultation services of blood infectious diseases.

In 1976, Douglas Lenat, then a doctoral student at Stanford University, published "AM: An Artificial Intelligence Approach to Discovery in Mathematics as Heuristic Search", describing a program called "AM" that developed new concepts in mathematics under the guidance of a large number of heuristic rules, ultimately rediscovering hundreds of common concepts and theorems.

In 1977, the logic-based machine learning system of Hayes-Roth and others made great progress, but it could only learn a single concept and was never put into practical application.

In 1979, a computer program created by Hans Berliner defeated the world backgammon champion, a landmark event. (Subsequently, behavior-based robotics developed rapidly under the promotion of Rodney Brooks, Sutton and others and became an important branch of artificial intelligence. The self-learning backgammon program created by Gerald Tesauro and others also laid the foundation for the later development of reinforcement learning.)

3 Application development period: 1980s

Artificial intelligence has entered a new climax in application development. The expert system simulates the knowledge and experience of human experts to solve problems in specific fields, achieving a major breakthrough in artificial intelligence from theoretical research to practical application, and from the discussion of general reasoning strategies to the application of specialized knowledge. Machine learning (especially neural networks) explores different learning strategies and various learning methods, and has begun to slowly recover in a large number of practical applications.

In 1980, the first international symposium on machine learning was held at Carnegie Mellon University (CMU) in the United States, marking the rise of machine learning research around the world.

In 1980, Drew McDermott and Jon Doyle proposed non-monotonic logic and later robotic systems.

In 1980, Carnegie Mellon University developed an expert system called XCON for DEC Corporation, which saved the company $40 million per year and achieved great success.

In 1981, Richard P. Paul published the first robotics textbook, "Robot Manipulators: Mathematics, Programming, and Control", marking the maturing of robotics as a discipline.

In 1982, David Marr's masterpiece "Vision" was published, proposing a computational theory of vision and constructing a systematic theoretical framework for computer vision, which also had a profound influence on cognitive science.

In 1982, John Hopfield invented the Hopfield network, which was the prototype of the earliest RNN. The Hopfield neural network model is a single-layer feedback neural network (the neural network structure can be mainly divided into feedforward neural network, feedback neural network and graph network), with feedback connections from output to input. Its emergence has inspired the field of neural networks and has been widely used in machine learning, associative memory, pattern recognition, optimized computing, parallel implementation of VLSI and optical devices in artificial intelligence.

In 1983, Terrence Sejnowski, Geoffrey Hinton and others invented the Boltzmann machine, also known as the stochastic Hopfield network. It is essentially an unsupervised model that reconstructs the input data in order to extract features for predictive analysis.

In 1985, Judea Pearl proposed the Bayesian network. He is famous for advocating the probabilistic approach to artificial intelligence and developing Bayesian networks, and is also celebrated for his theory of causal and counterfactual reasoning based on structural models. A Bayesian network is an uncertainty-handling model that simulates causal relationships in human reasoning; the common naive Bayes classifier, for example, is the most basic application of a Bayesian network.
The topology of a Bayesian network is a directed acyclic graph (DAG): the random variables of the system under study are placed in a directed graph according to their conditional (in)dependence relationships, with circles representing random variables and arrows representing the conditional dependencies between them.
The joint probability of all the variables is then obtained by multiplying their local conditional probability distributions, one per variable conditioned on its parents.
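A tiny sketch of that factorization on a hypothetical three-variable network (Rain and Sprinkler as parents of WetGrass); all structure and probabilities below are made-up illustrative numbers, not from the text.

```python
# Joint probability = product of each variable's probability given its parents.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},   # P(S | R), keyed by rain then sprinkler
                          False: {True: 0.40, False: 0.60}}
P_wet_given = {(True, True): 0.99, (True, False): 0.80,      # P(W=True | S, R), keyed by (sprinkler, rain)
               (False, True): 0.90, (False, False): 0.00}

def joint(rain, sprinkler, wet):
    p_wet_true = P_wet_given[(sprinkler, rain)]
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * (p_wet_true if wet else 1 - p_wet_true))

print(joint(rain=True, sprinkler=False, wet=True))   # 0.2 * 0.99 * 0.8 = 0.1584
```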

In 1986, Rodney Brooks published the paper "A Robust Layered Control System for a Mobile Robot", marking the creation of behavior-based robotics as a discipline, and the robotics community began to turn its attention to practical engineering topics.

In 1986, Geoffrey Hinton and others proposed combining the multi-layer perceptron (MLP) with backpropagation (BP) training (the method still faced many computing-power challenges at the time, essentially around the chain-rule gradient computation). This solved the problem that single-layer perceptrons cannot perform nonlinear classification and opened a new wave of enthusiasm for neural networks.

In 1986, Ross Quinlan proposed the ID3 decision tree algorithm.

The decision tree model can be regarded as a combination of multiple rules (if, then). It is completely different from the neural network black box model in that it has good model interpretability.
The core idea of ID3 is to build the decision tree with a top-down greedy strategy: at each node, choose the splitting feature with the largest information gain (information gain measures how much the uncertainty of dataset D is reduced once the information of attribute A is introduced; the larger the gain, the stronger the attribute's ability to discriminate D), and construct the tree recursively.
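A short sketch of ID3's splitting criterion, entropy and information gain, computed on a tiny made-up weather dataset (feature values and labels are purely illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    base, n, remainder = entropy(labels), len(labels), 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)   # weighted entropy after the split
    return base - remainder

outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "yes",  "yes",      "yes"]
print(information_gain(outlook, play))   # ID3 splits on the feature with the largest gain
```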

In 1989, George Cybenko proved the "universal approximation theorem". Simply put, a multi-layer feedforward network can approximate any continuous function, giving it expressive power comparable to a Turing machine; this fundamentally dispelled Minsky's doubts about the expressiveness of neural networks.
The universal approximation theorem can be regarded as a basic theoretical result for neural networks: if a feedforward network has a linear output layer and at least one hidden layer with a "squashing" activation function (such as the sigmoid), then, given enough hidden units, it can approximate any Borel measurable function from one finite-dimensional space to another with arbitrary accuracy.

In 1989, Yann LeCun (the father of the CNN) invented the convolutional neural network (CNN) by combining the backpropagation algorithm with weight-sharing convolutional layers, and for the first time successfully applied it to a handwritten character recognition system for the US Postal Service.
A convolutional neural network usually consists of an input layer, convolutional layers, pooling layers and fully connected layers. The convolutional layers extract local features from the image, the pooling layers drastically reduce the number of parameters (dimensionality reduction), and the fully connected layers, similar to a traditional neural network, produce the final output.
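A minimal sketch of that layer pattern (conv -> pool -> fully connected), written with PyTorch purely as an illustration; the channel counts and input size are arbitrary, not LeNet's actual configuration.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution: extract local features
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: downsample / reduce parameters
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)  # fully connected output layer

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

out = TinyCNN()(torch.randn(4, 1, 28, 28))   # e.g. a batch of 28x28 grayscale digits
print(out.shape)                             # torch.Size([4, 10])
```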

4 Stable development period: 1990s-2010

Due to the rapid development of Internet technology, innovative research on artificial intelligence has been accelerated, and artificial intelligence technology has been further put into practical use. All fields related to artificial intelligence have made great progress. In the early 2000s, because expert system projects required encoding too many explicit rules, which reduced efficiency and increased costs, the focus of artificial intelligence research shifted from knowledge-based systems to machine learning.

In 1995, Corinna Cortes and Vladimir Vapnik proposed the classic support vector machine (SVM). It shows unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems, and can be extended to function fitting and other machine learning problems.
The support vector machine can be regarded as an improvement on the perceptron: it is a generalized linear classifier grounded in the VC-dimension theory of statistical learning and the structural risk minimization principle. The main differences from the perceptron are: 1. The perceptron merely seeks some hyperplane that separates the samples correctly (there are infinitely many); the SVM seeks the hyperplane that not only separates the samples correctly but also keeps every sample as far from it as possible (the maximum-margin hyperplane is unique), so the SVM generalizes better. 2. For linearly inseparable problems, rather than adding nonlinear hidden layers as a perceptron-based network does, the SVM uses kernel functions, which in essence perform a nonlinear transformation of the feature space so that the data become linearly separable.
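A short scikit-learn sketch contrasting a linear kernel with an RBF kernel on an XOR-like toy set (the data, gamma and C values are made up for illustration):

```python
from sklearn.svm import SVC

# XOR-like data: not separable by any single hyperplane in the original feature space
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)   # kernel trick: implicit nonlinear mapping

print("linear kernel:", linear_svm.predict(X))   # cannot fit XOR perfectly
print("rbf kernel:   ", rbf_svm.predict(X))      # should recover [0, 1, 1, 0]
```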

In 1995, Freund and Schapire proposed the AdaBoost (Adaptive Boosting) algorithm. AdaBoost uses the boosting approach to ensemble learning, combining weak learners serially to achieve better generalization; the other important ensemble approach is the parallel bagging combination represented by random forests. In terms of the bias-variance decomposition, boosting mainly reduces bias, while bagging mainly reduces variance.
The basic idea of the AdaBoost iterative algorithm is to train a sequence of different classifiers by adjusting the weight of each training sample in each round (misclassified samples receive higher weights). Finally, each classifier is weighted by its accuracy and the classifiers are combined into a strong classifier.
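A compact sketch of that reweighting loop, using one-level scikit-learn decision trees ("stumps") as the weak learners; the 1-D toy data, the number of rounds and the {-1, +1} labels are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1], [0.3], [0.45], [0.6], [0.8], [0.9]])
y = np.array([1, 1, -1, -1, 1, 1])                 # labels in {-1, +1}

n = len(y)
w = np.full(n, 1.0 / n)                            # sample weights, initially uniform
stumps, alphas = [], []

for _ in range(5):                                 # 5 boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted error
    alpha = 0.5 * np.log((1 - err) / err)          # classifier weight: accurate stumps count more
    w *= np.exp(-alpha * y * pred)                 # raise the weights of misclassified samples
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Strong classifier: sign of the weighted vote of all stumps
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(np.sign(F))                                  # should match y on this toy set
```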

In 1997, the International Business Machines Corporation (IBM) Deep Blue supercomputer defeated the world chess champion Kasparov. Deep Blue realizes intelligence in the field of chess based on brute force exhaustion. It generates all possible moves, then performs the deepest possible search, and continuously evaluates the situation to try to find the best move.

In 1997, Sepp Hochreiter and Jürgen Schmidhuber proposed the long short-term memory neural network (LSTM).
LSTM is a recurrent neural network (RNN) with a more elaborate cell structure that introduces a forget gate, an input gate and an output gate: the input gate decides how much of the current input is written into the cell state, the forget gate decides how much of the previous cell state is retained at the current step, and the output gate controls how much of the current cell state flows into the output value. This design alleviates the vanishing-gradient problem when training on long sequences.
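A sketch of a single LSTM step in NumPy, showing how the three gates update the cell state c and the hidden state h; the weights and the input sequence are random placeholders, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate: how much of c_prev to keep
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate: how much new info to write
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate: how much of c to expose
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate cell content
    c = f * c_prev + i * g                               # new cell state
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4
W = {k: rng.normal(size=(d_hid, d_in)) for k in "fiog"}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in "fiog"}
b = {k: np.zeros(d_hid) for k in "fiog"}

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):                     # run over a length-5 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.round(3))
```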

In 1998, Tim Berners-Lee of the World Wide Web Consortium proposed the concept of Semantic Web. Its core idea is to make the entire Internet a universal information exchange medium based on semantic links by adding semantics (Meta data) that can be understood by computers to documents on the World Wide Web (such as HTML). In other words, it is to build an intelligent network that can realize barrier-free communication between people and computers.

In 2001, John Lafferty first proposed the conditional random field model (Conditional random field, CRF).
CRF is a discriminative probabilistic graphical model: given the observation sequence X, it directly models the conditional probability distribution P(Y | X) of the label sequence Y. It performs particularly well in many natural language processing tasks such as word segmentation and named entity recognition.

In 2001, Leo Breiman proposed the random forest.
Random forest is an ensemble learning method that combines many diverse weak learners (decision trees) in parallel by bagging: building multiple well-fitted but diversified models and combining their decisions improves generalization. The diversity reduces dependence on noise in particular features and lowers variance (overfitting), while the combined decision also cancels out some of the individual learners' biases.
The basic idea of the algorithm is to build each weak learner (decision tree) on a training set drawn by sampling with replacement and on a randomly drawn subset of the available features, i.e. N different weak learners are trained with diversity in both samples and feature space. Finally, the predictions of the N weak learners (categories or regression values) are combined, taking the majority category or the average as the final result.
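A compact sketch of that idea, bootstrap samples plus a random feature subset per tree combined by majority vote, built on scikit-learn decision trees with made-up data (note that standard random forests resample features at every split rather than once per tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)        # toy rule the forest should learn

n_trees, n_feats = 10, 2
trees, feat_subsets = [], []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))                   # bootstrap sample (with replacement)
    cols = rng.choice(X.shape[1], size=n_feats, replace=False)    # random feature subset
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[rows][:, cols], y[rows]))
    feat_subsets.append(cols)

votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feat_subsets)])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)             # majority vote over the trees
print("training accuracy:", (forest_pred == y).mean())
```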

In 2003, David Blei, Andrew Ng and Michael I. Jordan proposed LDA (Latent Dirichlet Allocation).
LDA is an unsupervised method that is used to infer the topic distribution of documents. The topic of each document in the document set is given in the form of a probability distribution. Topic clustering or text classification can be performed based on the topic distribution.

In 2003, Google published its three foundational papers on big data, providing ideas for the core problems of big data storage and distributed processing: distributed storage of unstructured files (GFS), distributed computing (MapReduce) and structured data storage (BigTable), laying the theoretical foundation for modern big data technology.

In 2005, Boston Dynamics launched a power-balanced four-legged robot dog that has strong versatility and can adapt to more complex terrains.

In 2006, Geoffrey Hinton and his student Ruslan Salakhutdinov formally proposed the concept of deep learning, starting a wave of deep learning in academia and industry. 2006 is also known as the first year of deep learning, and Geoffrey Hinton is known as the father of deep learning.

The concept of deep learning originates from the research of artificial neural networks. Its essence is to use multiple hidden layer network structures to learn high-order representations of the intrinsic information of the data through a large number of vector calculations.

In 2010, Sinno Jialin Pan and Qiang Yang published the article "A Survey on Transfer Learning".
Generally speaking, transfer learning is to use existing knowledge (such as trained network weights) to learn new knowledge to adapt to specific target tasks. The core is to find the similarity between existing knowledge and new knowledge.

5 Prosperous development period: 2011-2018

With the development of information technologies such as big data, cloud computing, the Internet and the Internet of Things, ubiquitous sensor data and computing platforms such as the GPU have driven the rapid development of artificial intelligence technology represented by deep neural networks, greatly narrowing the gap between science and application. AI technologies such as image classification, speech recognition, knowledge question answering, human-machine games and autonomous driving have achieved major breakthroughs and ushered in a new climax of explosive growth.

In 2011, the IBM Watson question-and-answer robot participated in the Jeopardy quiz answering competition and eventually won the championship. Watson is a computer question and answer (Q&A) system that integrates natural language processing, knowledge representation, automatic reasoning, machine learning and other technologies.

In 2012, the AlexNet neural network model designed by Hinton and his student Alex Krizhevsky won the ImageNet competition. This was the first time in history that a model performed so well on the ImageNet data set, and ignited the research enthusiasm for neural networks.
AlexNet is a classic CNN model that has been greatly improved in terms of data, algorithm and computing power. It innovatively applies methods such as Data Augmentation, ReLU, Dropout and LRN, and uses GPU to accelerate network training.

In 2012, Google officially released the Google Knowledge Graph, a knowledge base compiled by Google from multiple information sources. It superimposes a layer of entities and relationships on top of ordinary string search, helping users find the information they need faster while taking search a step closer to knowledge-based retrieval, thereby improving the quality of Google Search. A knowledge graph is a structured semantic knowledge base and a representative method of symbolism, used to describe concepts in the physical world and their relationships in symbolic form. Its basic unit is the RDF triple (entity-relation-entity); entities are connected to one another through relations, forming a networked knowledge structure.

In 2013, Durk Kingma and Max Welling proposed the Variational Auto-Encoder (VAE) in the article "Auto-Encoding Variational Bayes" at ICLR.
The basic idea of the VAE is that an encoder network maps real samples into an idealized latent distribution, and a decoder network then reconstructs generated samples from that distribution; training and learning aim to make the generated samples sufficiently close to the real ones.

In 2013, Google's Tomas Mikolov proposed the classic Word2Vec model in "Efficient Estimation of Word Representations in Vector Space" to learn distributed representations of words. Because of its simplicity and efficiency, it attracted great attention from industry and academia.
The basic idea of Word2Vec is to learn the relationship between each word and its neighboring words, thereby representing words as low-dimensional dense vectors. Such distributed representations capture the semantic information of words; intuitively, words with similar semantics end up close to each other.
The Word2Vec network structure is a shallow neural network (input layer -> linear fully connected hidden layer -> output layer). Depending on the training method, it comes in two flavors: the CBOW model (which takes a word's neighboring context words as input to predict the word itself) and the skip-gram model (which takes a word as input to predict its neighboring words).
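A small sketch of how the two training setups carve a sentence into (input, target) pairs; the sentence and window size are arbitrary examples.

```python
# CBOW: context words -> center word; skip-gram: center word -> each context word.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                    # predict the center from its context
    skipgram_pairs.extend((center, c) for c in context)     # predict each context word from the center

print(cbow_pairs[3])       # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```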

In 2014, the chat program "Eugene Goostman" "passed" the Turing Test for the first time at the "2014 Turing Test" conference held by the Royal Society.

In 2014, Goodfellow, Bengio and others proposed the Generative Adversarial Network (GAN), which is known as the coolest neural network in recent years.
A GAN is built on the idea of a two-player adversarial game and consists of two parts: a generator network (Generator, G) and a discriminator network (Discriminator, D). The generator learns a mapping G: Z→X (taking noise z as input and outputting generated, "fake" data x), while the discriminator judges whether its input comes from the real data or from the generator. In the game played during training, the generative ability and the discriminative ability of the two models both improve.

In 2015, to commemorate the 60th anniversary of the concept of artificial intelligence, the three giants of deep learning, LeCun, Bengio and Hinton (who would jointly win the 2018 Turing Award), published the joint review "Deep learning" in Nature.
The article "Deep Learning" points out that deep learning is a feature learning method that transforms original data into higher-level and abstract expressions through some simple but non-linear models, which can strengthen the distinguishing ability of input data. With enough combinations of transformations, very complex functions can be learned.

In 2015, the residual network (ResNet) proposed by Kaiming He and others from Microsoft Research won the victory in image classification and object recognition in the ImageNet large-scale visual recognition competition.
The main contribution of the residual network was identifying the "degradation" problem that arises when stacking ever more non-identity transformations, and introducing the "shortcut connection" to counter it, which alleviates the vanishing-gradient problem brought on by increasing depth in deep neural networks.
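A minimal residual block sketch in PyTorch: the block outputs F(x) + x, so the stacked layers only have to learn the residual; the channel count and layer layout are illustrative, not the exact ResNet configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # shortcut connection: add the input back in

x = torch.randn(2, 16, 32, 32)
print(ResidualBlock(16)(x).shape)          # torch.Size([2, 16, 32, 32])
```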

In 2015, Google open sourced the TensorFlow framework. It is a symbolic mathematics system based on dataflow programming and is widely used in the programming implementation of various machine learning algorithms. Its predecessor is DistBelief, Google's neural network algorithm library.

In 2015, Musk and others co-founded OpenAI. It is a non-profit research organization whose mission is to ensure that artificial general intelligence (a system that is highly autonomous and surpasses humans in most economically valuable tasks) will bring benefits to all mankind. It releases popular products such as: OpenAI Gym, GPT, etc.

In 2016, Google proposed a federated learning method, which trains algorithms on multiple distributed edge devices or servers that hold local data samples without exchanging their data samples.
The three most important privacy-protecting technologies in federated learning are differential privacy, homomorphic encryption and private set intersection. Together they allow multiple participants to build a common, powerful machine learning model without sharing their data, addressing key issues such as data privacy, data security, data access rights and access to heterogeneous data.

In 2016, AlphaGo competed in a Go human-machine battle with Lee Sedol, the world champion of Go and professional nine-dan player, and won with a total score of 4 to 1.
AlphaGo is a Go-playing artificial intelligence program whose main working principle is deep learning; it consists of four main parts: a policy network, which predicts and samples the next move given the current position; a fast rollout policy, which has the same goal as the policy network but, at some sacrifice in move quality, runs roughly 1,000 times faster; a value network, which estimates the win rate of the current position; and Monte Carlo tree search, which combines these to estimate the win rate of each candidate move.
AlphaGo Zero, updated in 2017, added reinforcement-learning-based self-training on top of the earlier versions. It knows nothing about the game beforehand and makes its own decisions purely by probing the board and the rules through its own trials and exploration. As self-play games accumulate, the neural network is gradually adjusted to improve its win-rate predictions. Even more remarkably, as training deepened, AlphaGo Zero independently discovered patterns of the game and came up with new strategies, bringing new insights to the ancient game of Go.

In 2017, Sophia, a humanoid robot developed by Hanson Robotics in Hong Kong, became the first robot in history to obtain citizenship. Sophia looks just like a human female, has rubber skin and is able to express more than 62 natural facial expressions. Algorithms in its "brain" can understand language, recognize faces, and interact with people.

6 LLM prosperous development period: 2018-2023

The development of the entire LLM can be summarized in the following figure:
[Figure: evolutionary tree of modern LLMs, from the survey "Harnessing the Power of LLMs in Practice"]
This figure nicely illustrates the development of large language models in recent years and highlights some of the most well-known models. Models on the same branch are more closely related. Transformer-based models are shown in non-grey colors: decoder-only models on the blue branch, encoder-only models on the pink branch, and encoder-decoder models on the green branch. The vertical position of a model on the timeline indicates its release date. Open-source models are shown as filled squares, closed-source models as hollow squares. The stacked bar chart in the lower right corner shows the number of models from different companies and institutions.

These models differ in their training strategies, model architectures, and application areas. To understand the LLM landscape more clearly, we divide them into two categories: encoder-only or encoder-decoder language models and decoder-only language models. In Figure 1, we show the detailed evolution process of the language model. From this evolutionary tree, we get some interesting observations:

a) Decoder-only models gradually dominate the development of LLMs. In the early stages of LLM development, decoder-only models were less popular than encoder-only and encoder-decoder models. However, after 2021, decoder-only models took off with the introduction of game-changing LLMs such as GPT-3. Meanwhile, after the initial explosive growth brought by BERT, encoder-only models gradually began to fade.

b) OpenAI consistently maintains a leading position in the LLM field, now and most likely going forward. Other companies and institutions are working to catch up with OpenAI by developing models comparable to GPT-3 and the current GPT-4. This leadership can be attributed to OpenAI's firm commitment to its technical path, even when it was not widely recognized at first.

c) Meta has made important contributions to open source LLM and promoted LLM research. When considering contributions to the open source community, especially those related to LLMs, Meta stands out as one of the most generous commercial companies, as all LLMs developed by Meta are open source.

d) LLM shows a trend towards closed source. In the early stages of LLM development (before 2020), most models were open source. However, with the introduction of GPT-3, companies are increasingly moving towards closed models such as PaLM, LaMDA and GPT-4. Therefore, it becomes more difficult for academic researchers to conduct experiments on LLM training. Therefore, API-based research may become the dominant approach in academia.

e) Encoder-decoder models still hold promise as this architecture is still actively explored and most of them are open source. Google has made significant contributions to the open source encoder-decoder architecture. However, the flexibility and versatility of the decoder model seems to make Google's persistence in this direction less promising.

BERT-style language models: encoder-decoder or encoder-only
These models are trained with unsupervised learning on unlabeled text. A common approach is to predict masked words in a sentence from the surrounding context; this training paradigm is known as the masked language model (MLM). It allows the model to develop a deeper understanding of the relationships between words and how they are used in context. Such models are trained on large text corpora using the Transformer architecture and achieve state-of-the-art results on many natural language processing tasks such as sentiment analysis and named entity recognition. Notable examples of masked language models include BERT, RoBERTa and T5. Because of this success across a wide range of tasks, MLM has become an important tool in natural language processing.
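As a quick illustration of masked-word prediction, here is a sketch using the Hugging Face transformers fill-mask pipeline with a BERT checkpoint (the model is downloaded on first use; the example sentence is arbitrary).

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("The development of artificial [MASK] has not been smooth sailing."):
    print(candidate["token_str"], round(candidate["score"], 3))  # likely fills include "intelligence"
```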

GPT-style language models: decoders only
Although language models are typically task-agnostic in architecture, these approaches require fine-tuning on datasets for specific downstream tasks. Researchers found that scaling up language models significantly improves performance with few or even zero samples. The most successful models for few-shot and zero-shot performance are autoregressive language models, which are trained to generate the next word in a sequence given the preceding words. These models have been widely used in downstream tasks such as text generation and question answering. Examples of autoregressive language models include GPT-3, OPT, PaLM and BLOOM. The introduction of GPT-3 was a game changer: it demonstrated for the first time reasonable few-shot and zero-shot performance via prompting and in-context learning, showing the superiority of autoregressive language models. There are also models optimized for specific tasks, such as Codex for code generation and BloombergGPT for the financial sector. A recent breakthrough is ChatGPT, which refines GPT-3 specifically for conversational tasks, enabling more interactive, coherent and context-aware conversations in a variety of practical applications.
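A minimal sketch of autoregressive (next-word) generation with the Hugging Face transformers pipeline, using GPT-2 as a small stand-in model (downloaded on first use; the prompt and generation length are arbitrary).

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("In 1956, the Dartmouth workshop", max_new_tokens=30, num_return_sequences=1)
print(out[0]["generated_text"])   # the model extends the prompt one token at a time
```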

References
Paper: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond (https://arxiv.org/pdf/2304.13712.pdf)
Reference blog: https://blog.csdn.net/CSDN_LYY/article/details/116924246

Origin blog.csdn.net/weixin_47567401/article/details/132310317