A Summary of NLP Basics

1. Words

In Chinese, words are generally obtained by cutting the text directly with a word segmentation tool, such as the jieba (结巴, literally "stammer") segmentation toolkit, producing word or phrase tokens; in some cases a single character can also serve as the unit. In English, the unit is most commonly the single word, and in some cases English phrases are used.

1.1 Preprocessing of words

Under normal circumstances, English text first needs its tenses normalized, third-person-singular forms converted back to the base form, and capitalized initial letters converted to lowercase; sometimes the spelling of words also needs to be checked.
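
As a small illustration, here is a minimal sketch of such preprocessing using NLTK's WordNetLemmatizer (lowercasing plus lemmatization to undo tenses and third-person-singular forms); NLTK is just one convenient choice, and the sample words are made up:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lemmatizer data

lemmatizer = WordNetLemmatizer()

for word in ["Goes", "Running", "Studies"]:
    lowered = word.lower()                         # capital letter -> lowercase
    base = lemmatizer.lemmatize(lowered, pos="v")  # treat the word as a verb
    print(lowered, "->", base)  # goes -> go, running -> run, studies -> study
```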

1.2 Stop words

Stop words exist in both Chinese and English text. Stop words are a set of words that are very common but carry little real meaning, and they are usually filtered out by building a stop-word list. A typical example in English is "is", and in Chinese it is "的" (de). Whether to filter out stop words depends on the situation: for text classification, and likewise for similarity testing, it is recommended to filter them out, otherwise you will find that all texts seem to have high similarity.
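
For example, a minimal sketch of stop-word filtering with NLTK's built-in English list (a Chinese list containing words like "的" would be loaded from a file in the same way):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stops = set(stopwords.words("english"))
tokens = ["this", "is", "a", "simple", "example"]
filtered = [t for t in tokens if t not in stops]
print(filtered)  # ['simple', 'example'] -- "this", "is", "a" are dropped
```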

1.3 Word Vectors

Word vectors are a commonly used means of encoding words. The more common approaches are one-hot encoding, word2vec and GloVe, and the recently introduced fastText method is also worth trying. The one-hot method tends to blow up the dimension, because there are as many dimensions as there are dictionary entries, whereas the later methods let you control the dimension of the word vector, generally kept within 300 dimensions. A word vector can basically be understood as capturing how much semantic information the word contains. The corpus you train on should be as large as possible; if your own corpus is small, it is better to use pre-trained word vectors instead, and pre-trained data (for example, trained on Wikipedia) is generally available.
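
A minimal sketch of training word2vec vectors with gensim (assuming gensim 4.x, where the dimension parameter is vector_size); the two-sentence corpus is purely illustrative and far too small for real use:

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# vector_size controls the word-vector dimension; 300 or below is typical
model = Word2Vec(corpus, vector_size=100, window=2, min_count=1, epochs=50)

print(model.wv["cat"].shape)              # (100,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two vectors
```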

1.4 Topic Features

Several words in a text may actually refer to one and the same thing. Topic features are used to summarize these words into new features. The method you probably hear about most is TF-IDF, which is directly based on word frequency; LDA is also common, but you need to define the number of topics yourself, while the HDP method can determine on its own how many topics there are.
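
A minimal sketch with gensim, covering TF-IDF weighting and an LDA model whose topic count is fixed by hand (gensim's HdpModel could be substituted to infer the number of topics); the tiny documents are made up:

```python
from gensim import corpora, models

docs = [["cat", "dog", "pet"], ["stock", "market", "trade"], ["dog", "pet", "food"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

tfidf = models.TfidfModel(bow)  # weighting based directly on word frequency
print(tfidf[bow[0]])            # TF-IDF weights of the first document

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary)  # topic count chosen by hand
print(lda[bow[0]])              # topic distribution of the first document
```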

1.5 Part-of-speech features

Part-of-speech features are also known as POS features. Part of speech refers to the nouns, verbs, adjectives and so on that we often talk about, and it can be extracted directly with common Python packages.
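
For example, a minimal sketch with NLTK, one such common Python package:

```python
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer data
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger data

tokens = nltk.word_tokenize("He calls Tom to get your coat")
print(nltk.pos_tag(tokens))
# [('He', 'PRP'), ('calls', 'VBZ'), ('Tom', 'NNP'), ('to', 'TO'), ...]
```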

1.6 Named Entities

Named entity recognition is also called NER. Stanford NLP, for instance, directly includes named entity recognition, but it is limited to common types such as currency (dollar, pound) and time (minute, second). In many cases you need to extract named entities for your own scenario, such as marking the creatures in a text as animals or plants; for special domains like that, the only option is to train a model yourself.
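
A minimal sketch of such off-the-shelf NER; the text above mentions Stanford NLP, but spaCy (with its en_core_web_sm model installed) is used here as a similar ready-made option, and the sentence is made up:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Apple paid 5 dollars in 10 minutes.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG / 5 dollars MONEY / 10 minutes TIME
```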

2. Syntactic Analysis and Semantic Analysis

The main way to do syntactic analysis is to call an existing package; the implementations are mainly rules + probabilities. After the syntax has been analyzed, each word carries two attributes: one is the offset between the word and the word it points to (its head), and the other is the type of relation between it and that word. Every word has exactly one head word pointing to it (a sketch follows at the end of this section).

[Figure: dependency parse example from the LTP demo]

As shown in the example from https://www.ltp-cloud.com/demo/ , both of these attributes can be counted as features. In the same way, semantic dependency parsing produces even more semantic relations, and the choice among them still depends on the scenario. The second figure shows an example of semantic analysis.

[Figure: semantic dependency analysis example from the LTP demo]

A ROOT node appears in both of these figures; it is added artificially and marks the center of the sentence (usually a verb).
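
To make the two attributes concrete, here is a minimal sketch using spaCy's English dependency parser (the figures above use LTP for Chinese; spaCy is just a convenient stand-in, and the sentence is made up). Each token exposes the unique head word it points to and the relation type, and the sentence center carries the ROOT label:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He calls Tom")

for token in doc:
    # token.head: the one word this token points to; token.dep_: the relation type
    print(token.text, "->", token.head.text, token.dep_)
# He -> calls nsubj / calls -> calls ROOT / Tom -> calls dobj
```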

3. Some common routines in NLP

3.1 Angle between vectors

The angle is generally measured by its cosine value; there are many other methods for calculating distance, so I will not list them one by one. It is generally used to measure the similarity between texts. Both short and long texts are applicable, but you need to consider how to construct the features.
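
A minimal sketch of the cosine measure over two text feature vectors, assuming plain NumPy arrays (the vectors are made up, e.g. word counts):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1.0, 2.0, 0.0, 1.0])  # feature vector of text 1
v2 = np.array([1.0, 1.0, 1.0, 0.0])  # feature vector of text 2
print(cosine_similarity(v1, v2))     # closer to 1 means more similar
```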

3.2 n-gram method

The n-gram method looks at the words surrounding the current word. It can be regarded as a window of size n sliding over the text. For example, when n equals 2, the sentence "he calls Tom to get your coat" is read in turn as "he calls", "calls Tom", "Tom to", and so on. When n is 3, the window is 3, and so forth by analogy; generally n larger than 4 is not used. In the same way, part-of-speech and syntactic features can also be expanded with the n-gram method.
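
A minimal sketch of extracting n-grams by sliding a window of size n over a token list; the same function works unchanged on POS tags or syntax labels:

```python
def ngrams(tokens, n):
    """All contiguous windows of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "he calls Tom to get your coat".split()
print(ngrams(tokens, 2))
# [('he', 'calls'), ('calls', 'Tom'), ('Tom', 'to'), ('to', 'get'),
#  ('get', 'your'), ('your', 'coat')]
```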

3.3 CNN method

The CNN method is the often-mentioned convolutional neural network. For text there is only one dimension, so Conv1D is used in Keras. The input here is generally a piece of text, and each word in the text corresponds to a one-dimensional vector. It works wonders on text with well-constructed features.
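
A minimal sketch of such a text CNN in Keras, assuming integer-encoded, padded input sequences; the vocabulary size, embedding dimension and filter settings are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=100),            # one 100-dim vector per word
    layers.Conv1D(filters=64, kernel_size=3, activation="relu"),  # 1-D convolution over the text
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),                        # e.g. binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, ...) would then train on padded integer sequences
```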

3.4 LSTM method

The LSTM method is a neural network with memory, an evolution of the RNN; the GRU is a further derivative of it. It is often observed that LSTM training is slower and GRU is faster. The input format is the same as for the CNN. Generally, stacking forward and backward LSTMs (a bidirectional LSTM) is the most comfortable setup.
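
A minimal sketch of stacked forward/backward (bidirectional) LSTMs in Keras, under the same input assumptions as the CNN above; swapping layers.LSTM for layers.GRU gives the usually faster GRU variant:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=100),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # first bi-LSTM layer
    layers.Bidirectional(layers.LSTM(64)),                         # stacked second layer
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, ...) trains it exactly like the CNN above
```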

The above is a personal summary; if anything is inappropriate, please point it out. Emm... if I think of anything else, I will add it...
