What is the essential difference between artificial intelligence, machine learning, natural language processing, deep learning, etc.?

The meanings assigned to these terms by the popular media are often at odds with how machine learning scientists and engineers understand them. It is therefore important to define the terms precisely when we use them; Figure 1.2 shows a Venn diagram of how they relate.

Figure 1.2 Venn diagram of the relationships among natural language processing, artificial intelligence, machine learning, and deep learning

1 Artificial Intelligence

Artificial intelligence emerged in the mid-20th century as a field of study devoted to enabling computers to simulate and perform tasks normally carried out by humans. Early approaches focused on manually deriving and hardcoding explicit rules for manipulating input data in each context of interest. This paradigm is often referred to as symbolist AI. It works well for well-defined problems such as chess, but it fails dramatically on perceptual problems such as vision and speech recognition. A new paradigm was needed, in which the computer learns rules from data rather than having humans specify them explicitly. This led to the rise of machine learning.

2 Machine Learning

In the 1990s, the machine learning paradigm became dominant in artificial intelligence. Instead of explicitly encoding every possible situation, the computer is trained on paired input and output sample data and automatically extracts the mapping between input and output. Although machine learning involves a great deal of mathematics and statistics, it tends to deal with data sets so large and complex that it relies more on experimentation, empirical observation, and engineering than on mathematical theory.

A machine learning algorithm learns a representation of the input data and transforms it into an appropriate output. To do this, a machine learning model needs a set of input data (such as a set of sentences in a sentence classification task) and a corresponding set of outputs (such as {"positive", "negative"} labels for sentence classification). It also needs a loss function, which measures the deviation between the model's current output and the expected output in the dataset. To help the reader understand, consider a binary classification task, where the goal of machine learning might be to find a function called a decision boundary that cleanly separates the two types of data points, as shown in Figure 1.3. This decision boundary should also perform well on new data instances outside the training set. To make the decision boundary easier to find, it may help to first preprocess the data, converting it into a form that is easier to separate. We search for the target function within a collection of possible functions called the hypothesis set. This search is carried out automatically; the process is called learning, and carrying it out is the ultimate goal of machine learning.

Figure 1.3 An example of a major motivating task in machine learning (in the case shown here, the hypothesis set could be a set of arcs)
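To make these pieces concrete, here is a minimal sketch using scikit-learn (an assumed dependency; the toy dataset and settings are purely illustrative). The hypothesis set is the set of linear decision boundaries searched by logistic regression, and the log-loss serves as the loss function guiding the search.

```python
# Minimal sketch of the learning setup described above (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: 2-D points with binary labels (illustrative only)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fitting searches the hypothesis set (linear boundaries) for the function
# that minimizes the loss (log-loss) on the training data
model = LogisticRegression().fit(X_train, y_train)

# The learned decision boundary should also generalize to unseen points
print("held-out accuracy:", model.score(X_test, y_test))
```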

Machine learning uses the feedback signal provided by the loss function to automatically search, within a predefined hypothesis set, for the best mapping between input and output. The nature of the hypothesis set determines the class of algorithms being considered; these classes are briefly introduced next.

Classical machine learning started with probabilistic modeling methods such as Naive Bayes, which "naively" assumes that the input data features are independent of one another. Logistic regression is another probabilistic modeling approach, and it is often the first method a data scientist tries on a dataset. The hypothesis sets of both logistic regression and Naive Bayes are sets of linear functions.
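As a quick illustration (scikit-learn assumed; the data is a toy placeholder), both baselines can be fit in a few lines:

```python
# Sketch: the two linear probabilistic baselines discussed above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)

# GaussianNB "naively" assumes independent features; logistic regression
# is often the first model tried on a new dataset
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
```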

Although neural networks originated in the 1950s, it was not until the 1980s that an effective method of training large networks was discovered: the combination of backpropagation and the stochastic gradient descent (SGD) algorithm. Backpropagation provides a way to compute the network's gradients, while stochastic gradient descent uses these gradients to train the network.
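A rough sketch of this division of labor, using PyTorch (our assumption; the book's own examples may use a different framework): loss.backward() performs backpropagation to compute gradients, and optimizer.step() applies the stochastic gradient descent update.

```python
# Sketch of backpropagation + SGD (PyTorch assumed; toy data and
# hyperparameters are illustrative).
import torch

model = torch.nn.Linear(2, 1)                 # a one-layer "network"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()

X = torch.randn(100, 2)                       # toy inputs
y = (X.sum(dim=1) > 0).float().unsqueeze(1)   # toy binary labels

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)               # measure current deviation
    loss.backward()                           # backpropagation: gradients
    optimizer.step()                          # SGD: update the weights
```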

Appendix B of this book briefly introduces these concepts. The first successful practical application of neural networks came in 1989, when Yann LeCun of Bell Labs built a system for recognizing handwritten digits that was later widely used by the US Postal Service.

Kernel methods rose to prominence in the 1990s. These approaches solve classification problems by finding good decision boundaries between sets of points, as in Figure 1.3. The most popular kernel method is the support vector machine (SVM), which seeks a good decision boundary by mapping the data into a new high-dimensional representation in which a hyperplane is an effective boundary, and then maximizing the distance between that hyperplane and the nearest data points of each class. The kernel trick keeps the high computational cost of working in high-dimensional spaces under control: a kernel function computes distances between points directly, instead of explicitly constructing the high-dimensional representation, at a much lower cost. Kernel methods have solid theoretical support and are amenable to mathematical analysis; when the kernel function is linear, the method itself is linear. These qualities made the approach very popular. For perceptual machine learning problems, however, it leaves much to be desired, because it first requires a manual feature engineering step, which is error-prone.
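The following sketch (scikit-learn assumed; the synthetic dataset is illustrative) shows the kernel trick at work: concentric circles are not linearly separable in the original 2-D space, but an RBF-kernel SVM separates them as if the data had been mapped to a higher-dimensional space, without ever building that representation.

```python
# Sketch of the kernel trick (scikit-learn assumed).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # linear hypothesis set
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernelized, nonlinear boundary

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))
```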

Decision trees and related methods are another class of algorithms that is still widely used. A decision tree is a decision support aid that models decisions and their consequences as a tree structure. A tree is essentially a graph in which there is exactly one path between any two connected nodes. Alternatively, a tree can be defined as a flow chart that transforms input values into output categories. Decision trees took off in the 2010s, when tree-based methods became more popular than kernel methods. This popularity stems from the fact that decision trees are easier to visualize, understand, and explain. To help the reader understand, Figure 1.4 shows an example of a decision tree structure that classifies the input {A,B} into class 1 (if A < 10), class 2 (if A ≥ 10 and B ≤ 25), or class 3 (otherwise); a code rendering of this tree follows the figure.

Figure 1.4 Example of a decision tree structure
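The tree in Figure 1.4 is small enough to write out directly as a flow chart in code, which makes the "easy to understand and explain" point concrete (a sketch; the thresholds come from the figure as described above):

```python
# The decision tree of Figure 1.4, written as an explicit flow chart.
def classify(A, B):
    if A < 10:
        return 1           # class 1
    elif B <= 25:          # reached only when A >= 10
        return 2           # class 2
    else:
        return 3           # class 3

print(classify(5, 40))     # -> 1
print(classify(12, 20))    # -> 2
print(classify(12, 40))    # -> 3
```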

Random forests provide a practical way to apply decision trees in machine learning. The approach involves generating a large number of specialized decision trees and combining their outputs. Random forests are so flexible and generalizable that they are often the second baseline algorithm tried after logistic regression. When the Kaggle competition platform launched in 2010, random forests quickly became the most widely used algorithm on the platform. In 2014, gradient boosting machines (GBMs) displaced them. GBMs work by iteratively learning new decision-tree-based models that address the weaknesses of the models from previous iterations. At the time of this writing, they are widely regarded as the best approach to non-perceptual machine learning problems, and they remain popular on Kaggle.
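A sketch of the two ensembles side by side (scikit-learn assumed; the data and settings are illustrative):

```python
# Sketch: random forest vs. gradient boosting (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# A random forest averages many independently randomized trees; gradient
# boosting adds trees one by one, each correcting the current ensemble
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```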

Around 2012, convolutional neural networks (CNNs) trained on GPUs began winning the annual ImageNet competition, marking the beginning of the current deep learning "golden age." CNNs came to dominate all major image processing tasks, such as object recognition and object detection, and they also found application in the processing of human natural language, that is, NLP. Neural networks learn through a series of increasingly meaningful, hierarchical representations of the input data. The number of these layers determines the depth of the model; this is where the term "deep learning," the process of training deep neural networks, comes from. To distinguish them from deep learning, all of the machine learning methods described above are often referred to as shallow or traditional learning methods. Note that a neural network with few layers would also be classified as shallow, but not as traditional. Deep learning has come to dominate the field of machine learning; as the clear first choice for perceptual problems, it has caused a "revolution" in the complexity of the problems that can be handled.

While neural networks are inspired by neurobiology, they are not true models of how our nervous system works. Each layer of a neural network is parameterized by a set of numbers, called the layer's weights, that specify exactly how the layer transforms the input data. In deep neural networks, the total number of parameters can easily reach the millions. The previously mentioned backpropagation algorithm is the algorithmic engine used to find the right set of parameters, that is, to make the network learn. Figure 1.5(a) shows a visual representation of a simple feedforward neural network with two fully connected hidden layers. Figure 1.5(b) shows an equivalent simplified representation that we will often use to keep diagrams uncluttered. A deep neural network has many such layers. One well-known neural network architecture that does not have this feedforward property is the long short-term memory (LSTM) recurrent neural network (RNN). Unlike the feedforward architecture in Figure 1.5, which accepts a fixed input of length 2, LSTMs can handle input sequences of arbitrary length.

Figure 1.5 Simple feedforward neural network with two fully connected hidden layers
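In code, the network of Figure 1.5 might look like the following PyTorch sketch (the framework and the hidden layer widths are our assumptions; only the input length of 2 comes from the figure):

```python
# Sketch of the feedforward network in Figure 1.5 (PyTorch assumed;
# hidden sizes are illustrative).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 8),   # fully connected hidden layer 1
    torch.nn.ReLU(),
    torch.nn.Linear(8, 8),   # fully connected hidden layer 2
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),   # output layer
)

# Each Linear layer's weights are the parameters found by backpropagation
n_params = sum(p.numel() for p in model.parameters())
print("total parameters:", n_params)  # deep models easily reach millions
```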

As mentioned earlier, what ignited the "deep learning revolution" was the combination of advances in hardware, massive amounts of available data, and algorithmic improvements. GPUs, originally developed for the video game market, supplied the computing power, while the maturing internet began to provide unprecedented amounts of high-quality data for deep learning: sources such as Wikipedia, YouTube, and ImageNet have driven advances in computer vision and NLP. The ability of neural networks to eliminate expensive manual feature engineering, a prerequisite for successfully applying shallow learning methods to perceptual data, is arguably the factor that most improves deep learning's ease of use. Since NLP is a perceptual problem, neural networks are the type of machine learning algorithm this book focuses on, though not the only one.

3 Natural Language Processing

Language is one of the most important aspects of human cognition. There is no doubt that in order to create true artificial intelligence, machines need to learn how to interpret, understand, process and manipulate human language. This makes NLP increasingly important in the field of artificial intelligence and machine learning.

As in other subfields of AI, initial approaches to NLP problems such as sentence classification and sentiment analysis were based on explicit rules, that is, symbolist AI. Systems built this way typically failed to generalize to new tasks and broke easily. Since the advent of kernel methods in the 1990s, much human effort has gone into feature engineering: manually transforming the input data into a form that shallow learning methods can use to make correct predictions. Feature engineering is time-consuming, task-specific, and difficult for non-domain experts to master. Around 2012, the arrival of deep learning sparked a true revolution in NLP. The ability of neural networks to automatically engineer appropriate features in some of their layers lowered the bar for tackling new tasks and problems. Effort then shifted to designing appropriate neural network architectures for each given task and to tuning the various hyperparameters during training.

The standard approach to training an NLP system is to first collect a large number of data points and then label each one, for example as "positive" or "negative" in a sentence or document sentiment analysis task. Finally, these data points are provided to a machine learning algorithm, which learns the best representation of the mapping from input signal to output signal, such that the learned model also performs well on new data points. In NLP and other subfields of machine learning, this process is referred to as the supervised learning paradigm. The manual annotation process provides the "supervisory signal" for learning a representative mapping. By contrast, the paradigm that learns from unlabeled data points is called the unsupervised learning paradigm.
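A minimal sketch of this supervised paradigm for sentence sentiment (scikit-learn assumed; the four inline sentences are toy placeholders for a real labeled dataset):

```python
# Sketch of the supervised learning paradigm for sentence sentiment.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["I loved this movie", "Great acting and plot",
             "Terrible and boring", "I hated every minute"]
labels = ["positive", "positive", "negative", "negative"]  # supervisory signal

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(sentences, labels)            # learn the input-to-output mapping

print(clf.predict(["What a great film"]))   # apply to a new data point
```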

Although today's machine learning algorithms and systems are not direct replicas of biological learning systems, and should not be considered models of them, they are in some respects inspired by evolutionary biology, and that inspiration has led to significant advances. Seen in that light, traditionally repeating the supervised learning process from scratch for each new task, language, or application domain appears flawed: it is roughly the inverse of how natural systems learn, building on previously acquired knowledge and reusing it. Even so, learning from scratch on perceptual tasks has made significant progress, notably in machine translation, question answering, and chatbots, although shortcomings remain. In particular, today's systems are not robust to sharp changes in the distribution of the input signal: a system learns to perform well on one type of input, and changing the input type can cause significant performance degradation and sometimes catastrophic failure. Furthermore, to make AI more accessible, and to put NLP techniques within reach of the average engineer at a small business or anyone without the resources of a large internet company, the ability to download and reuse what others have learned becomes critical. This also matters for people in regions whose native language is not English or another popular language for which pre-trained models exist, as well as for those tackling tasks unique to their region or entirely new tasks never seen before. Transfer learning offers a solution to some of these problems.

Transfer learning enables knowledge to be transferred from one environment to another, where an environment is defined as a combination of a specific task, domain, and language. The initial environment is called the source environment and the final environment is called the target environment. The ease and success of the transfer depend on the similarity between the source and target environments: naturally, a target environment that is "similar" to the source in some sense (which we define later in the book) makes transfer easier and more likely to succeed.

Transfer learning has been used in NLP for much longer than most practitioners realize, since vectorizing words with pretrained embeddings such as Word2Vec or Sent2Vec is common practice (more on this in Section 1.3). Shallow learning methods typically use these vectors as features. We describe both techniques in more detail in Section 1.3 and Chapter 4, and apply them in various ways throughout the book. This popular approach relies on an unsupervised preprocessing step in which the embeddings are first trained without any labels. The knowledge from this step is then transferred to a specific application in a supervised learning setting, where it is refined and specialized using a smaller set of labeled samples relevant to the shallow learning problem at hand. Traditionally, this paradigm combining unsupervised and supervised learning steps has been called semisupervised learning.
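A sketch of this semisupervised pattern, using gensim's Word2Vec for the unsupervised step (the library choice, corpus, and all settings are our assumptions for illustration):

```python
# Sketch: unsupervised embedding pretraining + supervised shallow classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

corpus = [["great", "movie"], ["loved", "it"],
          ["terrible", "film"], ["hated", "it"]]    # toy tokenized text
labels = [1, 1, 0, 0]                               # toy sentiment labels

# Unsupervised step: train word embeddings without any labels
w2v = Word2Vec(sentences=corpus, vector_size=16, min_count=1, seed=0)

# Transfer step: average word vectors into features for a shallow model
features = np.array([np.mean([w2v.wv[w] for w in doc], axis=0)
                     for doc in corpus])
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features[:1]))     # specialize on the labeled samples
```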

This article is excerpted from Transfer Learning for Natural Language Processing.

This book takes you behind the technology that powers systems like ChatGPT: transfer learning for natural language processing. From shallow methods to deep ones, it will help you master NLP transfer learning and make your models stand out.

As an important method in machine learning and artificial intelligence, transfer learning is widely used in computer vision, natural language processing (NLP), speech recognition, and other fields. This book is a practical introduction to transfer learning that leads readers through hands-on work with natural language processing models. It first reviews key concepts in machine learning, surveys the history of the field, and traces the progress of NLP transfer learning; it then discusses in depth several important NLP transfer learning methods, both shallow and deep; finally, it covers an important subfield of NLP transfer learning: deep transfer learning techniques centered on the Transformer. Readers get hands-on experience applying existing state-of-the-art models to real-world applications, including an email spam classifier, an IMDb movie review sentiment classifier, an automated fact checker, a question answering system, and a translation system, among others.

The text is concise, the discussion incisive, and the organization clear. The book is suitable both for machine learning and data science developers with an NLP background and as a reference for students of computer science and related majors.
