NLP Introductory Tutorial Series

Summary of introduction to NLP related knowledge

Chapter 1 Distributed Representation of Natural Language and Words



Foreword

A simple introduction to NLP, together with some of my own notes.
The goal of natural language processing is to make computers understand language. Our language is made up of words, and meaning is built from words, so for a computer to understand natural language it must understand the meaning of words. The following sections introduce some methods of representing the meaning of words.


1. Methods based on a thesaurus (synonym dictionary)

As the name implies, in a thesaurus (synonym dictionary), words that have the same meaning (synonyms) or similar meanings are grouped into the same group. In addition, a thesaurus sometimes records finer-grained relations between words, such as hypernym/hyponym (is-a) and whole/part relations; for example, both motorcycles and cars are motor vehicles. WordNet is the most famous thesaurus in the NLP field. Using it, you can obtain the synonyms of a word and compute the similarity between words. To use WordNet, you must first install the NLTK module.
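As a minimal sketch, here is how WordNet can be queried through NLTK to look up synonym groups and compute a similarity score (this assumes nltk is installed and the WordNet data has been downloaded via nltk.download('wordnet'); the specific synset names are just illustrative):

```python
import nltk
nltk.download('wordnet')          # fetch the WordNet data once
from nltk.corpus import wordnet as wn

# Synonym groups (synsets) that contain the word "car"
print(wn.synsets('car'))

# Words belonging to the first synset of "car"
print(wn.synset('car.n.01').lemma_names())

# Hypernym (upper) path: shows "car" is a kind of motor vehicle
print(wn.synset('car.n.01').hypernym_paths()[0])

# Path-based similarity between "car" and "motorcycle"
car = wn.synset('car.n.01')
motorcycle = wn.synset('motorcycle.n.01')
print(car.path_similarity(motorcycle))
```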
Of course, thesaurus dictionaries also have the following problems:

  1. It is difficult to adapt to the changes of the times. As time goes by, many new words and slang phrases appear (for example, internet memes such as "chicken, you are so beautiful"). On the other hand, the meaning, usage, and connotations (positive or derogatory) of existing words may also change.
  2. The maintenance cost is huge. Whether for English or Chinese, the number of words is very large, so building and maintaining a thesaurus requires enormous labor.
  3. It cannot express subtle differences between words.

So there are many problems with using a thesaurus. Another, count-based, approach is presented next.

2. Count-Based Methods

When using this method, we first need a corpus. The goal is to automatically and efficiently extract the essence of the language from the corpus. For example, given the corpus 'Learning makes me happy', we need to preprocess it. Simply put, there are the following steps (a short code sketch follows the list):

  1. Convert all words to lowercase
  2. Assign an ID to each word, such as 0: learning, 1: makes, etc. The reverse mapping (ID to word) is also needed
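A minimal preprocessing sketch of these two steps (the function name preprocess and the use of NumPy are my own choices for illustration):

```python
import numpy as np

def preprocess(text):
    """Lowercase the text, split it into words, and build ID mappings."""
    words = text.lower().split(' ')
    word_to_id, id_to_word = {}, {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    # The corpus as a sequence of word IDs
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word

corpus, word_to_id, id_to_word = preprocess('Learning makes me happy')
print(word_to_id)   # {'learning': 0, 'makes': 1, 'me': 2, 'happy': 3}
```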

Next, words need to be represented as vectors, which relies on the distributional hypothesis: the meaning of a word is formed by the words around it.

1. The distributional hypothesis for words

Distributed representations of words represent words as fixed-length vectors. Such a vector is dense, meaning that most of its elements are non-zero real numbers. The context of a word refers to the words around it, and the size of the context (how many words on each side are counted) is called the window size. If the window size is 1, the context contains 1 word on the left and 1 word on the right, and so on. With the distributional hypothesis in hand, we next consider how to use vectors to represent words based on it.

2. Co-occurrence matrix

Once each word has an ID, we can use a co-occurrence matrix to represent the words. For the corpus 'learning makes me happy' with a window size of 1, the matrix looks like this:

           learning  makes  me  happy
learning       0        1    0     0
makes          1        0    1     0
me             0        1    0     1
happy          0        0    1     0

In the table, the first row means that the context of 'learning' contains only 'makes', so that entry is marked 1; in the second row, the context of 'makes' is 'learning' and 'me', and so on. In this way we obtain a vector representation for each word.
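A sketch that builds such a co-occurrence matrix from the preprocessed corpus above (the helper name create_co_matrix is my own; it reuses corpus and word_to_id from the preprocessing sketch):

```python
import numpy as np

def create_co_matrix(corpus, vocab_size, window_size=1):
    """For every word ID, count how often each other word appears
    within window_size positions of it."""
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        for offset in range(1, window_size + 1):
            left, right = idx - offset, idx + offset
            if left >= 0:
                co_matrix[word_id, corpus[left]] += 1
            if right < len(corpus):
                co_matrix[word_id, corpus[right]] += 1
    return co_matrix

C = create_co_matrix(corpus, len(word_to_id))
print(C)  # reproduces the table above for 'learning makes me happy'
```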

3. Similarity between vectors

With the word vector representation, we can evaluate the similarity between words. There are many methods, such as the vector inner product, Euclidean distance, and cosine similarity. Let $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$; then their cosine similarity can be expressed as:
$$\mathrm{similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$$
In this expression, the numerator is the vector inner product and the denominator is the product of the L2 norms of the two vectors; in other words, the vectors are normalized before the inner product is taken. Cosine similarity expresses "to what extent two vectors point in the same direction". When two vectors point in exactly the same direction, the cosine similarity is 1; when they point in completely opposite directions, it is -1. When implementing this, division is involved, so a zero vector would cause a division-by-zero error; it is therefore necessary to add a very small value to the denominator.
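A minimal implementation of this formula, with a small eps added so a zero vector does not cause a division error:

```python
import numpy as np

def cos_similarity(x, y, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)  # normalize x
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)  # normalize y
    return np.dot(nx, ny)

# Similarity between 'learning' and 'me' using the co-occurrence matrix C
print(cos_similarity(C[word_to_id['learning']], C[word_to_id['me']]))
```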

4. Similar word sorting

With the similarity measure, we can rank words by how similar they are to a query word. The steps are as follows (a sketch follows the list):

  1. Take out the word vector of the query word
  2. Find the cosine similarity between the query word and all other words separately
  3. Display them in descending order of similarity
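Putting the three steps together, a possible most_similar sketch (the function name is my own; it reuses cos_similarity and the matrix C from the sketches above):

```python
import numpy as np

def most_similar(query, word_to_id, id_to_word, co_matrix, top=5):
    """Print the words most similar to `query` by cosine similarity."""
    if query not in word_to_id:
        print(f'{query} is not found')
        return
    query_vec = co_matrix[word_to_id[query]]                        # step 1
    similarity = np.array([cos_similarity(co_matrix[i], query_vec)
                           for i in range(len(id_to_word))])        # step 2
    for i in (-similarity).argsort():                               # step 3
        if id_to_word[i] == query:
            continue
        print(f'{id_to_word[i]}: {similarity[i]}')
        top -= 1
        if top == 0:
            return

most_similar('learning', word_to_id, id_to_word, C)
```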

Summary

So far, the basics of the count-based approach have been covered; an improved version will be introduced next time.

