NLP (Natural Language Processing) Learning Record

Over the past few years I have been working in the CV field, and I want to take some time to learn about NLP [purely a personal interest]. I have never studied NLP before and don't know where to start, so I am picking it up bit by bit, and I hope to learn it quickly and systematically~ I will create a dedicated column for NLP study notes and update and share in it from time to time. If there are experts in this field, pointers are also welcome. My follow-up study plan is to combine NLP with the CV content I have already studied (so as I learn this material, I will draw analogies with the CV field). Support is welcome.

Starting NLP from zero, the first step is understanding the professional terminology. Just as when I first learned CV I needed to understand what convolution is, what a channel is, what a feature map is, and so on, the same holds for NLP: for example, word vectors, word embeddings, distributed representations, etc.

Note: This article mainly covers some basic background knowledge.

Table of contents

Representation of language

1. One-hot representation of words

2. Distributed representation

2.1 Matrix-based distributional representation

2.2 Word Embedding


Representation of language

This part is about the form in which data is represented. Different tasks feed data to the model in different forms. For speech signals, we usually feed a spectrogram sequence into the model for feature extraction; for images, we usually feed RGB three-channel images into the model, and we hope the extracted features are as linearly independent as possible. In NLP, each word is expressed in the form of a vector. This is similar to the CV case, where each channel of an extracted feature map encodes a different feature of the image. [So can I understand the dimensions of a word vector as corresponding to the channels in CV?]

But unlike CV, words in NLP have contextual meaning: to understand a word, you sometimes need to understand its context. This is where NLP is subtler than CV. At the same time, two words written with different characters, such as "potato" and "spud", can refer to the same thing. Whatever model we design, we want it to fit our data, so before modeling we need to represent the data adequately.

According to current practice, word representations are divided into one-hot representation and distributed representation.

1. One-hot representation of words

The most intuitive, and so far one of the most commonly used, word representation methods in NLP is one-hot encoding [also used in CV]. This method represents each word as a very long vector whose dimension is the size of the vocabulary; most of the elements are 0, and only one dimension has the value 1, which identifies the current word.

for example:

"dog": [0, 0, 0, 0, 0, 1, 0, 0, 0, ...]

"cat": [0, 0, 0, 1, 0, 0, 0, 0, 0, ...]

This approach is also often applied to classification targets in CV: the index position whose value is 1 marks the predicted class.

But this method has problems. As the vocabulary grows, the vector dimension grows with it, and the words encoded this way are mutually independent: the vectors carry no information about the relationships between words.
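The idea above can be sketched in a few lines. The toy vocabulary here is made up for illustration; note how the dot product between any two distinct one-hot vectors is 0, which is exactly the "no related information between words" problem:

```python
import numpy as np

def one_hot(word, vocab):
    """Return a one-hot vector for `word` given an ordered vocabulary."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1  # only the word's own index is set to 1
    return vec

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
dog = one_hot("dog", vocab)
cat = one_hot("cat", vocab)

print(dog)  # [0 0 0 0 0 1]
print(cat)  # [0 1 0 0 0 0]

# Distinct one-hot vectors are orthogonal, so "cat" and "dog" look
# exactly as unrelated as "cat" and "mat":
print(np.dot(dog, cat))  # 0
```

Also note that the vector length equals `len(vocab)`, so every new word in the vocabulary makes every vector one element longer.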

2. Distributed representation

Although one-hot encoding is exact, it carries no semantic correlation and merely treats words as symbols. According to the distributional hypothesis proposed by Harris in 1954, words with similar contexts have similar semantics. Firth further elaborated and clarified the hypothesis in 1957: the semantics of a word is determined by its context. An analogy: when doing English reading and encountering an unknown word, we can often guess its meaning from the surrounding context. This is what "distribution" means here: the distribution pattern and statistical characteristics of words in context.

When we talk about the distribution of a word in context, we mean its co-occurrence relationships with other words in the text. (Co-occurrence can be understood as appearing together: in "cats love to eat fish", "cats" and "fish" are related and appear together in context, and "love" and "eat" are also related to some extent. By observing and counting a large amount of text data, we can obtain the co-occurrence patterns between words.) Specifically, we observe which other words occur within a window around a word, or within some range of context, and then count the relationships between them.
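Counting co-occurrences within a window, as described above, can be sketched like this (the window size and the "cats love to eat fish" example sentence are taken directly from the text; everything else is illustrative):

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within
    `window` tokens of each other."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "cats love to eat fish".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("cats", "love")])  # 1 — "love" is within 2 tokens of "cats"
print(counts[("eat", "fish")])   # 1
print(counts[("cats", "fish")])  # 0 — too far apart for this window
```

Run over a large corpus instead of one sentence, these counts become the statistics that distributional models are built on.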

Therefore, modeling with distributed representations needs to solve two problems: 1. choose a way to describe the context; 2. choose a model that can describe the relationship between the "target word" and its context.


Distributed representations of words can be divided into matrix-based distributional representations and word vectors, i.e. word embeddings.

2.1 Matrix-based distributional representation

The matrix-based distributional representation is usually also called the distributional semantic model. A row of the matrix serves as the representation of the corresponding word, and it describes the distribution of that word's contexts. Because the distributional hypothesis holds that words with similar contexts have similar semantics, under this representation the semantic similarity of two words can be directly converted into the spatial distance between two vectors.
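A minimal sketch of such a matrix, built from a made-up toy corpus (real systems use large corpora and often reweight the counts, e.g. with PMI or TF-IDF): each row counts which words co-occur with a given word, and cosine similarity between rows stands in for semantic similarity.

```python
import numpy as np

# Toy corpus; the sentences and vocabulary are invented for illustration.
sentences = [
    "cats eat fish",
    "dogs eat meat",
    "cats like fish",
    "dogs like meat",
    "cats can sleep",
    "dogs can sleep",
]
vocab = sorted({w for s in sentences for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# M[i, j] counts how often word j appears in the same sentence as word i;
# row i is the distributional representation of word i.
M = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    words = s.split()
    for a in words:
        for b in words:
            if a != b:
                M[idx[a], idx[b]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "cats" and "dogs" share contexts ("eat", "like", "can", "sleep"),
# so their rows are closer than "cats" and "meat":
print(cosine(M[idx["cats"]], M[idx["dogs"]]))  # 0.5
print(cosine(M[idx["cats"]], M[idx["meat"]]))  # ≈ 0.289
```

This is exactly "similar contexts → similar vectors": the similarity emerges from shared co-occurrence counts, with no labels or semantics given to the model.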

2.2 Word Embedding

In word embedding, the term "embedding" refers to the process of mapping words or phrases into a continuous vector space, which is usually low-dimensional. [It can be understood as mapping a discrete space into a continuous one: "embedding" discrete words into continuous space.] This representation places words with similar semantics closer together in the vector space (the vector carries information about the contexts the word appears in).

For example, "cat" and "dog" may be mapped to similar vectors in the embedding space because they are semantically similar, so the spatial distance between them will be smaller.
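The "cat"/"dog" example can be illustrated with a tiny lookup table. The 4-dimensional vectors below are invented by hand purely to show the geometry; real embeddings (e.g. word2vec or GloVe) are learned from large corpora and typically have 50-300 dimensions:

```python
import numpy as np

# Hypothetical hand-picked embeddings, NOT learned from data.
embeddings = {
    "cat": np.array([0.8, 0.1, 0.6, 0.2]),
    "dog": np.array([0.7, 0.2, 0.5, 0.3]),
    "car": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Semantically related words sit closer in the embedding space:
print(cosine(embeddings["cat"], embeddings["dog"]))  # high, close to 1
print(cosine(embeddings["cat"], embeddings["car"]))  # much lower
```

Unlike one-hot vectors, these dense vectors have a fixed small dimension regardless of vocabulary size, and distances between them are meaningful.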


Updating...


Origin blog.csdn.net/z240626191s/article/details/130912827