[ChatGPT] Behind the development of artificial intelligence: the ups and downs of the past century


Foreword

Today, large pre-trained language models such as ChatGPT are household names. The core algorithm behind GPT, the artificial neural network, has gone through eighty years of ups and downs. Apart from a handful of breakout moments during those eighty years, the theory spent most of its life in obscurity: hardly anyone paid attention to it, and at times it was even treated as "poison" by funding agencies.

The artificial neural network was born from the golden partnership between Walter Pitts, an unruly genius, and Warren McCulloch, by then an accomplished neurophysiologist. Their theory, however, outran the technology of their time, so it failed to gain widespread attention or empirical verification.

Fortunately, in the first two decades after its birth, researchers kept adding to the foundation, and the field advanced from the simplest mathematical model of a neuron and its learning rule to the perceptron, a model capable of learning. But then criticism from other researchers arrived at almost the same time as the death of Frank Rosenblatt, one of the perceptron's creators, in a boating accident. The field fell into a winter that lasted more than twenty years, until the backpropagation algorithm was introduced into the training of artificial neural networks.

Only then, after two decades of silence, did research on artificial neural networks restart. Over the following twenty years of accumulation, convolutional neural networks and recurrent neural networks appeared in turn.

Even so, the field's rapid growth in academia and industry had to wait for a hardware breakthrough roughly seventeen years ago: the arrival of general-purpose computing on GPU chips. Only then could large pre-trained language models such as ChatGPT become the household names they are today.

In a certain sense, the success of the artificial neural network is a stroke of luck, because not every line of research survives until its key breakthrough arrives and everything falls into place. In many other fields, technological breakthroughs come too early or too late and quietly fade away. Yet within this luck, the determination and persistence of the researchers involved cannot be ignored. Carried by their idealism, the artificial neural network weathered eighty years of ups and downs and finally bore fruit.

1. McCulloch-Pitts neurons

In 1941, Warren Sturgis McCulloch moved to the University of Chicago School of Medicine as a professor of neurophysiology. Shortly after arriving in Chicago, a friend introduced him to Walter Pitts, then a doctoral student at the University of Chicago. The two shared an interest in neuroscience and logic, hit it off immediately, and became like-minded friends and research partners. Pitts had been a voracious learner from an early age. At twelve, he read Russell and Whitehead's "Principia Mathematica" in the library and wrote to Russell pointing out several errors in the book. Russell appreciated the young reader's letter and wrote back inviting him to study at Cambridge University (even though Pitts was only twelve). Pitts' family, however, had little education, could not understand his thirst for knowledge, and often berated him instead. His relationship with his family gradually deteriorated, and he ran away from home at fifteen. From then on, Pitts lived as a vagrant on the University of Chicago campus, sitting in on the lectures he liked during the day and sleeping in whatever classroom he could find at night. By the time Pitts met McCulloch, he was a registered doctoral student at the university but still had no fixed residence. When McCulloch learned of this, he invited Pitts to live in his home.

When the two met, McCulloch had already published many papers on the nervous system and was a well-known expert in the field. Pitts, though still a doctoral student, had already made a name for himself in mathematical logic and was appreciated by experts including von Neumann. Although their professional fields were very different, both were deeply interested in how the human brain works and firmly believed that mathematical models could describe and simulate its function. Driven by this shared belief, the two co-published several papers and built the first artificial neural network model. Their work laid the foundation for the modern fields of artificial intelligence and machine learning, and both are recognized as pioneers of neuroscience and artificial intelligence.

In 1943, McCulloch and Pitts proposed the earliest artificial neural network model: the McCulloch-Pitts neuron [1]. The model simulates the workings of a neuron with the "on" and "off" mechanism of a binary switch. Its main components are input nodes that receive signals, an intermediate node that compares the combined input against a preset threshold, and an output node that produces the output signal. In the paper, McCulloch and Pitts demonstrated that this simplified model can implement basic logical operations such as AND, OR, and NOT, and that it can also be applied to simple problems such as pattern recognition and image processing.
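
To make the mechanism concrete, here is a minimal Python sketch of a McCulloch-Pitts style unit; the function names and choice of thresholds are illustrative, not taken from the original paper.

```python
# A McCulloch-Pitts style unit: binary inputs, a preset threshold,
# and a binary ("on"/"off") output.
def mp_neuron(inputs, threshold):
    """Fires (returns 1) when the number of active inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

# Basic logic operations expressed with the same unit:
AND = lambda a, b: mp_neuron([a, b], threshold=2)   # both inputs must be active
OR  = lambda a, b: mp_neuron([a, b], threshold=1)   # one active input is enough
NOT = lambda a: 1 - mp_neuron([a], threshold=1)     # inversion stands in for an inhibitory input

print(AND(1, 1), OR(0, 1), NOT(0))                  # -> 1 1 1
```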

2. Hebbian learning

In 1949, the Canadian psychologist Donald Hebb published "The Organization of Behavior", in which he put forward the famous theory of Hebbian learning [2]. The theory holds that "cells that fire together, wire together": neurons exhibit synaptic plasticity (synapses being the junctions through which neurons pass information to one another), and this plasticity is the basis of the brain's capacity for learning and memory.

A key step in machine learning is the rule used to update the model. When a neural network is used for machine learning, an initial structure and set of parameters must be specified; during training, each input from the training set triggers an update of the model's parameters, and this requires an update algorithm. Hebbian learning theory provided one of the earliest such update rules: Δw = η · x_pre · x_post, where Δw is the change in the synaptic weight, η is the learning rate, x_pre is the activity of the pre-synaptic neuron, and x_post is the activity of the post-synaptic neuron.
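
As an illustration, here is a minimal Python sketch of the Hebbian rule above applied to a small layer of weights; the layer sizes, initialization, and variable names are illustrative assumptions, not part of Hebb's original formulation.

```python
import numpy as np

# The Hebbian rule from the text: delta_w = eta * x_pre * x_post,
# applied to every connection between a layer of pre-synaptic units
# and a layer of post-synaptic units.
def hebbian_update(w, x_pre, eta=0.01):
    x_post = w @ x_pre                          # post-synaptic activity (weighted sum of inputs)
    return w + eta * np.outer(x_post, x_pre)    # strengthen connections between co-active units

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 3))          # 3 pre-synaptic units, 2 post-synaptic units
for _ in range(100):
    x = rng.random(3)                           # a hypothetical input activity pattern
    w = hebbian_update(w, x)
print(w)                                        # weights grow where inputs tend to co-occur
```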

The Hebbian update rule provided a theoretical basis for using artificial neural networks to imitate the behavior of biological neural networks. Hebbian learning is unsupervised: the model learns by adjusting the strength of the connections between the inputs it perceives. Because of this, it is particularly good at clustering subcategories within the input data. As research on neural networks deepened, Hebbian learning was later found to be applicable to other subfields as well, such as reinforcement learning.

3. Perceptron

In 1957, American psychologist Frank Rosenblatt proposed the perceptron model and, with it, the perceptron update rule [3]. The perceptron rule extends the Hebbian rule by training the model through an iterative, trial-and-error process: for each new data point, the perceptron computes the difference between the output predicted by the model and the actually measured output, then uses that difference to update the model's coefficients. The update is Δw = η · (t − y) · x, where t is the target output, y is the model's prediction, and x is the input. After proposing the initial model, Rosenblatt continued to develop the theory of perceptrons, and in 1959 he built the Mark I, a neural computer that used a perceptron model to recognize letters of the English alphabet.
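
The following minimal Python sketch applies the update rule above to a toy, linearly separable problem (logical OR); the dataset, learning rate, and epoch count are illustrative assumptions, not Rosenblatt's original setup.

```python
import numpy as np

# The perceptron rule from the text, delta_w = eta * (t - y) * x,
# trained on logical OR, a linearly separable toy problem.
def train_perceptron(X, t, eta=0.1, epochs=20):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            y = 1 if x_i @ w + b > 0 else 0    # current prediction
            w += eta * (t_i - y) * x_i         # error-driven weight update
            b += eta * (t_i - y)               # bias updated the same way
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])                     # logical OR
w, b = train_perceptron(X, t)
print([1 if x @ w + b > 0 else 0 for x in X])  # -> [0, 1, 1, 1]
```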

Like the McCulloch-Pitts neuron, the perceptron is a model based on the biological neuron: it receives input signals, processes them, and produces an output signal. The two differ in several ways. The output of the McCulloch-Pitts neuron can only be 0 or 1 (1 when the input exceeds a preset threshold, 0 otherwise), whereas the perceptron applies a linear activation to its weighted inputs, so its output can vary continuously like the input signals. The perceptron also assigns a coefficient to each input, which controls how strongly that input affects the output. Finally, the perceptron is a learning algorithm: its coefficients are adjusted according to the data it sees, while the McCulloch-Pitts neuron has no coefficients and therefore cannot update its behavior in response to data.

In 1962, Rosenblatt collected his years of research on the perceptron in the book "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms". The perceptron was a major advance for artificial intelligence: it was the first algorithmic model with the ability to learn, autonomously extracting regularities and features from the data it receives. It can also classify patterns, automatically sorting data into categories according to their features. In addition, the perceptron model is relatively simple and requires modest computing resources.

Despite its advantages and potential, the perceptron is a heavily simplified model with many limitations. In 1969, computer scientists Marvin Minsky and Seymour Papert published the book "Perceptrons" [5]. In it, the two authors offered an in-depth critique of the perceptron and analyzed the limitations of the single-layer neural networks it represents, including, among others, its inability to implement the "exclusive or" (XOR) function and to handle linearly inseparable problems in general. Both the authors and Rosenblatt realized that multilayer neural networks could solve problems that these single-layer networks could not. Unfortunately, the book's negative assessment of the perceptron had an enormous impact, and the public and government agencies lost interest in perceptron research. In 1971, Rosenblatt, the perceptron's inventor and foremost champion, died in a boating accident at the age of 43. Under the double blow of the book and Rosenblatt's death, the number of perceptron papers shrank rapidly year by year, and the development of artificial neural networks entered a "winter".

4. Backpropagation algorithm

Multi-layer neural networks can solve problems that single-layer networks cannot, but they bring a new problem: updating the weights of the neurons in every layer of a multi-layer model involves a large number of precise calculations. With ordinary methods these calculations are so time-consuming and laborious that training becomes very slow and the networks are of little practical use.

To solve this problem, the American social scientist and machine-learning researcher Paul Werbos proposed the backpropagation algorithm in his 1974 Harvard doctoral dissertation, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences" [6]. The basic idea is to propagate the error between the predicted output and the actual output backward from the output layer and use it to adjust the weight of every neuron in the network. In essence, the algorithm uses the chain rule of calculus to train a multi-layer perceptron network in reverse, from the output layer back toward the input layer, moving along the negative gradient direction.
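
As a concrete illustration, here is a minimal Python (NumPy) sketch of backpropagation on a small two-layer network learning XOR, the classic linearly inseparable problem; the architecture, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# A minimal two-layer network trained with backpropagation on XOR.
# The error is propagated from the output layer back toward the input
# layer using the chain rule, as described in the text above.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)      # XOR is not linearly separable

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)        # input -> hidden weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)        # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 1.0

for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: chain rule applied layer by layer, output to input
    d_out = (y - t) * y * (1 - y)                    # gradient at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)             # gradient propagated to the hidden layer
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_hid;  b1 -= eta * d_hid.sum(axis=0)

print(np.round(y).ravel())   # predictions should approach [0, 1, 1, 0]
```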

Regrettably, Werbos' dissertation received little attention for a long time after its publication. Not until the mid-1980s did psychologist David Rumelhart of the University of California, San Diego, cognitive psychologist and computer scientist Geoffrey Hinton, and computer scientist Ronald Williams publish a paper on applying the backpropagation algorithm to neural networks [7]. The paper had an enormous impact on the field of artificial intelligence. The ideas of Rumelhart and his colleagues were essentially similar to those of Werbos, yet they did not cite Werbos' dissertation, an omission that has often drawn criticism since.


The backpropagation algorithm has played a key role in the development of artificial neural networks and made the training of deep learning models possible. Since its revival in the 1980s, it has been widely used to train many kinds of neural networks: beyond the original multi-layer perceptron, it also applies to convolutional neural networks, recurrent neural networks, and more. Because of its importance, Werbos and Rumelhart are both regarded as pioneers of the neural network field.

In fact, the backpropagation algorithm was an important achievement of the "renaissance" of artificial intelligence during the 1980s and 1990s. Parallel distributed processing was the dominant methodology of this period. It centered on multi-layer neural networks and advocated speeding up their training and application through parallel computation. Because this ran counter to the previous mainstream thinking in artificial intelligence, it had epoch-making significance. The methodology was also welcomed by scholars outside computer science, in fields including psychology, cognitive science, and neuroscience. For these reasons, later generations often regard this period as a renaissance of the field of artificial intelligence.

5. Convolutional neural network

If the McCulloch-Pitts neuron marks the birth of artificial intelligence, then the United States can be called the birthplace of the artificial neural network. For the first thirty years after that birth, the United States led the field, producing key technologies such as the perceptron and the backpropagation algorithm. During the first AI "winter", however, American government and academia alike lost confidence in the potential of artificial neural networks and sharply slowed their support for and investment in the technology. For that reason, as the "winter" swept across the United States, research on artificial neural networks in other countries stepped into the spotlight of history. It was against this background that convolutional neural networks and recurrent neural networks appeared.


A convolutional neural network is a multi-layer neural network built from several distinctive structures, including convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract local features from the input signal; the pooling layers then reduce the dimensionality and complexity of the data; finally, the fully connected layers convert the data into a one-dimensional feature vector and generate the output signal (usually a prediction or classification result). This structure gives convolutional neural networks a particular advantage on data with a grid-like structure, such as images and time series.

The earliest convolutional neural network was proposed by the Japanese computer scientist Kunihiko Fukushima in 1980 [8]. Fukushima's model already contained convolutional layers and downsampling layers, structures still found in mainstream convolutional networks today. The main difference between Fukushima's model and today's convolutional networks is that the former was not trained with backpropagation; as mentioned above, backpropagation did not attract attention until the mid-1980s. Without that algorithm, Fukushima's network suffered from the same long training times and computational complexity as other multi-layer networks of the era.

In 1989, Yann LeCun, a French computer scientist working at Bell Laboratories in the United States, and his team proposed a convolutional neural network (an early member of what would become the LeNet family) and trained it with the backpropagation algorithm [9]. LeCun showed that such a network could recognize handwritten digits and characters, marking the beginning of the widespread use of convolutional neural networks in image recognition.
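
For readers who want to see the convolution, pooling, and fully connected pipeline in code, here is a minimal PyTorch sketch loosely in the spirit of LeNet; the layer sizes and input shape are illustrative assumptions, not the original architecture.

```python
import torch
import torch.nn as nn

# A tiny convolutional network: convolution extracts local features,
# pooling reduces spatial resolution, and a fully connected layer maps
# the flattened features to class scores.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # convolutional layer
            nn.ReLU(),
            nn.MaxPool2d(2),                  # pooling layer
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)               # flatten to a one-dimensional feature vector
        return self.classifier(x)

logits = TinyConvNet()(torch.randn(1, 1, 28, 28))  # e.g. one 28x28 grayscale image
print(logits.shape)                                # -> torch.Size([1, 10])
```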

6. Recurrent neural network

Like convolutional neural networks, recurrent neural networks are a class of neural network with a distinctive structure. Their defining feature is recurrence: the network's state is fed back into itself from one step to the next, instead of information flowing strictly forward through the layers. Because of this, recurrent neural networks are particularly well suited to natural language and other sequential, text-like data.

In 1990, the American cognitive scientist and psycholinguist Jeffrey Elman proposed the Elman network (also known as the simple recurrent network) [10]. It was one of the earliest recurrent neural networks. Elman used the model to show that a recurrent network can preserve the sequential structure of its data during training, laying the groundwork for the later application of such models in natural language processing.

Recurrent neural networks suffer from the vanishing gradient problem: when the network is trained with backpropagation, the gradients used to update the weights of layers close to the input (or of early time steps) shrink toward zero, so those weights change very slowly and training becomes ineffective. To solve this problem, the German computer scientist Sepp Hochreiter and his doctoral advisor Jürgen Schmidhuber proposed the long short-term memory network (LSTM) in 1997 [11]. The LSTM is a special kind of recurrent neural network that introduces a memory cell, giving the model a much better ability to retain information over long spans and thereby alleviating the vanishing gradient problem. It remains one of the most commonly used recurrent neural network models.
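
As a usage illustration, the following minimal PyTorch sketch runs a long short-term memory layer over a batch of sequences; the input size, hidden size, and sequence length are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A long short-term memory layer processing a batch of sequences.
# The cell state (c) is carried across time steps, which is what lets
# the model retain information over long spans.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 100, 8)       # 4 sequences, 100 time steps, 8 features per step
output, (h, c) = lstm(x)         # h: final hidden state, c: final cell (memory) state
print(output.shape, c.shape)     # -> torch.Size([4, 100, 16]) torch.Size([1, 4, 16])
```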

7. General-purpose computing GPU chips

In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), a platform for general-purpose computing on its GPU (Graphics Processing Unit) chips. Before this, the GPU was a processor dedicated to graphics rendering and computation, used mainly in graphics-related applications such as image processing, real-time rendering of game scenes, and video playback. CUDA made general-purpose parallel computing possible on the GPU, so tasks that previously had to run on the CPU (Central Processing Unit) could now be handed to the GPU instead. The GPU's powerful parallelism lets it carry out many computations at once, and for workloads such as matrix operations it is much faster than the CPU. Training a neural network requires large-scale matrix and tensor operations, so before general-purpose GPUs appeared, the development of artificial neural networks had long been constrained by the limited computing power of traditional CPUs, a constraint that affected both theoretical innovation and the commercialization and industrialization of existing models. The arrival of the general-purpose GPU greatly loosened both constraints.
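
As a simple illustration of what general-purpose GPU computing offers, the following PyTorch sketch moves a large matrix multiplication onto a CUDA device when one is available; the matrix sizes are illustrative.

```python
import torch

# A large matrix multiplication, the kind of operation neural network
# training repeats millions of times, offloaded to a CUDA GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b                        # executed in parallel on the GPU if one is present
print(c.device, c.shape)
```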

In 2010, Dan Ciresan, a postdoctoral researcher on Schmidhuber's team, used GPUs to achieve a dramatic speed-up in the training of convolutional neural networks [12]. But GPUs truly made their name in the neural network field in 2012. That year, the Canadian computer scientists Alex Krizhevsky, Ilya Sutskever, and the aforementioned Geoffrey Hinton proposed AlexNet [13]. AlexNet is essentially a convolutional network; Krizhevsky and his colleagues trained it on GPUs and entered it in ImageNet ILSVRC, an internationally renowned image classification competition, where, to widespread surprise, it won by a large margin. AlexNet's success greatly stimulated interest across academia and industry in applying artificial neural networks to computer vision.

8. Generative neural networks and large language models

A recurrent neural network can generate a text sequence word by word, so it is often regarded as an early generative neural network model. But although recurrent networks are good at processing and generating natural language, they have trouble capturing global information in long sequences: they cannot effectively connect pieces of information that lie far apart.

In 2017, the Transformer model was proposed [14]. This large neural network has two main parts, an encoder and a decoder. The encoder encodes the input sequence and further processes the encoded information through self-attention layers; the result is then passed to the decoder, which generates the output sequence through its own self-attention layers and related structures. The model's key innovation is the self-attention layer, which frees the network from having to process text strictly in order: it can reach directly to different positions in the text, capture the dependencies between them, and compute the semantic relationships between positions in parallel. The birth of the Transformer has had a huge impact on natural language processing and on the field of artificial intelligence as a whole; within just a few years it had been adopted in all kinds of large AI models.
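
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation described above; the embedding dimension, number of tokens, and weight matrices are illustrative assumptions, and a real Transformer adds multiple heads, residual connections, and other components.

```python
import numpy as np

# Scaled dot-product self-attention: every position attends to every
# other position in parallel, with no sequential recurrence.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values for each position
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                          # each position mixes information from all others

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 token embeddings of dimension 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # -> (5, 16)
```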

Among the large language models built on the Transformer architecture, the best known is ChatGPT, the chatbot launched by OpenAI. ChatGPT is based on the GPT-3.5 language model (Generative Pre-trained Transformer 3.5). OpenAI trained the model on a vast corpus, giving it broad abilities in language understanding and generation: it can provide information, hold conversations, write and edit text, complete software code, and comfortably handle a wide range of exams that involve language understanding.

Summary

A few weeks ago, I took part in a volunteer event at which middle school students had lunch with researchers. During the event I chatted with several students around fifteen or sixteen years old, and the conversation naturally turned to ChatGPT. I asked them: "Do you use ChatGPT? You can tell me the truth; I won't tell your teacher." One of the boys smiled shyly and said he could no longer do without it.

Eighty years ago, the wandering Pitts could only imagine a mathematical model that might simulate the function of the brain. In the world of today's young people, neural networks are no longer abstract mathematical formulas; they have become ubiquitous. What will the next eighty years bring? Will artificial neural networks develop consciousness the way human neural networks did? Will carbon-based brains continue to dominate silicon-based brains, or will it be the other way around?


Origin blog.csdn.net/liaozp88/article/details/130789856