NLP basics - word representation, text feature engineering

1. Word Representation: one-hot encoding, tf-idf

  1. Word representation: 0-1 one-hot encoding --> sentence representation: 0-1 (Boolean)

    Construct the vocabulary V; represent each sentence according to whether each word of V appears in it (0/1). The representation vector has size |V|.

  2. Word representation: 0-1 one-hot encoding --> sentence representation: count vector (Count)

    Construct the vocabulary V; represent each sentence by counting the number of occurrences of each word of V in it. The representation vector has size |V|.

  3. Sentence similarity: Euclidean distance, cosine similarity.

  4. TF-IDF: tfidf(d, w) = tf(d, w) * idf(w)

    tf(d, w): term frequency of word w in document d;

    idf(w) = log(N / N(w)): inverse document frequency, a measure of the importance of the word;

    N: total number of documents in the corpus;

    N(w): the number of documents in which the word w appears.
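To make these definitions concrete, here is a minimal pure-Python sketch of the Boolean and count sentence vectors, cosine similarity, and the tf-idf weighting defined above. The toy corpus and whitespace tokenization are assumptions for illustration only.

```python
# Minimal sketch: Boolean / count sentence vectors, cosine similarity, tf-idf.
import math
from collections import Counter

docs = [
    "we like natural language processing",
    "we like deep learning",
    "language models are useful",
]
tokenized = [d.split() for d in docs]

# Vocabulary V: every sentence vector has size |V|.
V = sorted({w for doc in tokenized for w in doc})

def boolean_vector(doc):
    # 0/1: whether each word of V appears in the document.
    return [1 if w in doc else 0 for w in V]

def count_vector(doc):
    # Count: number of occurrences of each word of V in the document.
    c = Counter(doc)
    return [c[w] for w in V]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# tf-idf(d, w) = tf(d, w) * log(N / N(w))
N = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))   # N(w): document frequency

def tfidf_vector(doc):
    c = Counter(doc)
    return [c[w] * math.log(N / df[w]) for w in V]

print(cosine(count_vector(tokenized[0]), count_vector(tokenized[1])))
print(tfidf_vector(tokenized[0]))
```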

2. Word2Vec

One-hot representations cannot capture the meaning of words and suffer from data sparsity, so we instead look for word representations in a dense word vector space.

An important transition: one-hot representation --> distributed representation.

2.1 Word Embedding

Word Embedding (word vector) models are built on the distributional hypothesis: words that appear close to each other (in similar contexts) tend to be more similar.

  1. Traditional (static): SkipGram, CBOW, GloVe, FastText, Matrix Factorization (MF);

  2. Consider context (dynamic representation): ELMo, BERT, XLNet;

  3. Word vector dimensionality reduction: t-SNE (see the sketch below).
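As a quick illustration of the dimensionality-reduction step, below is a small sketch that projects word vectors to 2-D with scikit-learn's t-SNE; the random 100-dimensional vectors stand in for real trained embeddings.

```python
# Minimal sketch: project word vectors to 2-D with t-SNE for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
words = [f"word_{i}" for i in range(50)]      # placeholder vocabulary
vectors = rng.normal(size=(50, 100))          # placeholder 100-d embeddings

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
for word, (x, y) in zip(words[:5], coords[:5]):
    print(word, round(float(x), 2), round(float(y), 2))
```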

If the representation of the word vector is known, the representation of the sentence or document can also be obtained quickly.

Word Embedding --> Sentence Embedding: Average Pooling, Max Pooling, …
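For example, a minimal sketch of average and max pooling over word vectors, where the tiny lookup table is a stand-in for a trained Word2Vec/FastText model:

```python
# Minimal sketch: sentence embedding via average / max pooling of word vectors.
import numpy as np

embedding = {          # assumed 4-d vectors, purely for illustration
    "nlp": np.array([0.1, 0.3, -0.2, 0.5]),
    "is":  np.array([0.0, 0.1,  0.4, 0.1]),
    "fun": np.array([0.7, -0.1, 0.2, 0.3]),
}

tokens = "nlp is fun".split()
vectors = np.stack([embedding[t] for t in tokens])

sentence_avg = vectors.mean(axis=0)   # average pooling
sentence_max = vectors.max(axis=0)    # max pooling
print(sentence_avg, sentence_max)
```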

The traditional Word Embedding models are:

  • SkipGram: a classic local method that selects associated context words according to the window size; it makes use of the local context window.

  • The core idea of FastText: solve the OOV (out-of-vocabulary) problem and improve the handling of low-frequency words. It uses n-gram features: during training, the 2-gram, 3-gram, 4-gram and other subword features of each word are considered, and after feature fusion the model is trained as in SkipGram.

  • Matrix Factorization: a classic global method that makes use of global co-occurrence counts.

  • GloVe: combines the global view (as in MF) and the local view (as in SkipGram) at the same time, and is trained with a weighted least-squares error.
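Below is a minimal training sketch for SkipGram and FastText using gensim (assuming gensim >= 4.0 is installed); the three-sentence corpus is illustrative only.

```python
# Minimal sketch: train SkipGram and FastText embeddings with gensim.
from gensim.models import Word2Vec, FastText

sentences = [
    ["we", "like", "natural", "language", "processing"],
    ["we", "like", "deep", "learning"],
    ["language", "models", "are", "useful"],
]

# sg=1 selects Skip-Gram, sg=0 selects CBOW.
sg_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# FastText adds character n-gram (subword) features, which helps with OOV
# and low-frequency words; min_n / max_n control the subword sizes.
ft_model = FastText(sentences, vector_size=50, window=2, min_count=1, sg=1,
                    min_n=2, max_n=4)

print(sg_model.wv.most_similar("language", topn=3))
print(ft_model.wv["languag"])   # out-of-vocabulary string still gets a vector via subwords
```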


Question: Which is better, CBOW or Skip-Gram?

Not necessarily, but in practice Skip-Gram is generally better than CBOW:

  1. Number of training samples (for a given window size): for example, in the sequence {w_1, w_2, w_3, w_4, w_5} with window size = 1, CBOW produces 3 training samples while Skip-Gram produces 8 (see the sketch after this list).
  2. Difficulty: CBOW predicts the center word from multiple context words, which is relatively easy; Skip-Gram predicts the context words from a single center word, which is harder.
  3. Smoothing effect: in CBOW the context words are combined by average pooling, so frequent and infrequent context words are mixed together; the averaging weakens the contribution of low-frequency words, which is why CBOW works well for frequent words but poorly for rare ones.
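The sample counts in point 1 can be checked with a small sketch that enumerates the training pairs each model generates (boundary positions with one-sided context are skipped for CBOW to match the count in the text):

```python
# Minimal sketch: enumerate CBOW vs. Skip-Gram training samples, window size 1.
tokens = ["w1", "w2", "w3", "w4", "w5"]
window = 1

# CBOW: predict the center word from its surrounding context words.
cbow = []
for i in range(window, len(tokens) - window):
    context = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
    cbow.append((context, tokens[i]))

# Skip-Gram: predict each context word from the center word.
skipgram = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            skipgram.append((center, tokens[j]))

print(len(cbow), cbow)          # 3 samples
print(len(skipgram), skipgram)  # 8 samples
```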

2.2 Gaussian Embedding

KL divergence (Kullback–Leibler divergence) is used to measure the similarity/difference between two probability distributions.

For probability distributions P(x) and Q(x):
D(P || Q) = Σ_x P(x) log( P(x) / Q(x) )
The higher the similarity between P(x) and Q(x), the smaller the KL divergence.
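A minimal sketch of this discrete KL divergence on two toy distributions:

```python
# Minimal sketch: discrete KL divergence D(P || Q).
import math

def kl_divergence(p, q):
    # Assumes p and q are aligned discrete distributions with q(x) > 0
    # wherever p(x) > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q))   # small value: the distributions are similar
```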


Question: What is the problem with the word vectors learned for high-frequency versus low-frequency words in the corpus?

From a statistical point of view, words with more occurrences are estimated more reliably. In Gaussian embedding, the word vector of each word is modeled as a probability distribution N(μ, σ), and the similarity of two words is judged by computing the KL divergence between their distributions.
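As a hedged sketch of the idea, the snippet below compares two univariate Gaussian word representations with the closed-form KL divergence between Gaussians; the means and standard deviations are made-up numbers for illustration.

```python
# Sketch: compare Gaussian word embeddings via closed-form KL divergence.
import math

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # D( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# A frequent word (small variance) vs. two rare words (large variance).
print(gaussian_kl(0.2, 0.1, 0.25, 0.5))
print(gaussian_kl(0.2, 0.1, 1.50, 0.5))   # larger KL => less similar
```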

2.3 Contextual Embedding

Contextual embeddings address the polysemy problem: the same word can receive different representations in different contexts.

Consider context: ELMo, BERT, XLNet.

3. Text feature engineering

In machine learning, data mining, and NLP, what basic features are commonly extracted from text? They are summarized below.

Text Features:

  • tf-idf: dimension equal to the vocabulary size |V|;
  • Word2Vec/Sentence2Vec: the embedding dimension k of the word vectors;
  • n-gram: use bigram, trigram and other features. For a vocabulary V, the bigram vocabulary consists of all combinations of two words from V; its size S is much larger than |V| (every pair of words is regarded as a new "word", forming a new vocabulary). Each sentence in the corpus can then be 0-1 encoded against this bigram vocabulary: combine adjacent words of the sentence, and mark 1 if the combination appears in the bigram vocabulary, otherwise 0 (see the sketch after this list);
  • POS (part-of-speech) tags;
  • Topic features: can be computed with LDA;
  • Task-specific features.
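The 0-1 bigram encoding mentioned in the n-gram item can be sketched with scikit-learn's CountVectorizer; the two-sentence corpus is illustrative only.

```python
# Minimal sketch: 0-1 bigram features with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "we like natural language processing",
    "we like deep learning",
]

# ngram_range=(2, 2) builds the bigram vocabulary; binary=True gives a
# 0/1 encoding instead of counts.
vectorizer = CountVectorizer(ngram_range=(2, 2), binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the bigram "vocabulary"
print(X.toarray())                          # 0-1 encoding of each sentence
```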

Feature engineering based on word vectors mainly includes the following aspects:

  • Sum the Word2Vec or FastText word vectors, or take their element-wise maximum.
  • Based on the Word2Vec or FastText embeddings, compute the maximum value and average value over the word vectors and use them as new features of the sample.
  • Integrate the embeddings of pre-trained models such as BERT and XLNet into the sample representation.
  • Consider the impact of interactions between words on the model. The greatest contribution to the model is not necessarily the whole sentence; it may be a part of it, such as phrases or word groups. On this basis, we use sliding windows of different sizes (k = [2, 3, 4]) and then perform average or max pooling (see the sketch after this list).
  • In the sample representation, fuse the latent features generated for the sample by an AutoEncoder model.
  • In the sample representation, fuse the topic features generated for the sample by an LDA model.
  • For classification problems, the representation above does not take category information into account, so we can obtain embeddings of all categories from a trained model. You can refer to LEAM, proposed in the paper "Joint Embedding of Words and Labels for Text Classification", to obtain word embeddings of the label space. Calculation method: obtain the label embeddings of all categories, multiply them with the input word embedding matrix, apply a softmax to the result, and average or maximize the product of the attention scores and the input word embeddings.
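A minimal numpy sketch of the sliding-window pooling mentioned above (windows of size k = 2, 3, 4, pooled inside and then across windows); shapes and values are illustrative assumptions:

```python
# Minimal sketch: sliding-window pooling over word embeddings.
import numpy as np

rng = np.random.default_rng(0)
words = rng.normal(size=(10, 8))   # (seq_len, emb_dim) word embeddings

def window_pool(emb, k, mode="max"):
    # Average inside each length-k window, then max/avg across windows.
    windows = np.stack([emb[i:i + k].mean(axis=0)
                        for i in range(len(emb) - k + 1)])
    return windows.max(axis=0) if mode == "max" else windows.mean(axis=0)

features = np.concatenate([window_pool(words, k) for k in (2, 3, 4)])
print(features.shape)   # 3 * emb_dim feature vector
```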

Here, several kinds of feature engineering are combined to construct a new sample representation, mainly via the following three methods:

  1. First: use the average pooling and max pooling of the word embeddings;
  2. Second: use windows of size 2, 3 and 4 to perform convolution-like operations on the word embeddings, then apply max/avg pooling;
  3. Third: use the representation of category labels to add semantic interaction between words and labels, so as to capture word-level semantic information more deeply (a label-attention sketch follows this list).
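Below is a hedged numpy sketch of the label-attention idea from LEAM as described above: multiply the label embeddings with the word embeddings, softmax the resulting scores, and pool the attention-weighted word embeddings. The shapes, random values, and the per-word max over labels are illustrative simplifications, not the paper's exact formulation.

```python
# Sketch: LEAM-style word/label attention pooling (simplified).
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, num_labels = 6, 8, 3

words = rng.normal(size=(seq_len, emb_dim))      # input word embeddings
labels = rng.normal(size=(num_labels, emb_dim))  # label (class) embeddings

scores = words @ labels.T                        # (seq_len, num_labels) compatibility
scores = scores.max(axis=1)                      # strongest label match per word
attn = np.exp(scores) / np.exp(scores).sum()     # softmax over positions

weighted = attn[:, None] * words                 # attention-weighted word embeddings
feat_avg = weighted.mean(axis=0)                 # average pooling
feat_max = weighted.max(axis=0)                  # max pooling
print(feat_avg.shape, feat_max.shape)
```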

Manually defined features can include:

  • length of the (raw) text;
  • number of capital letters;
  • ratio of capital letters to text length;
  • number of exclamation marks;
  • number of question marks;
  • number of punctuation marks;
  • total number of words;
  • number of unique words;
  • number of nouns, adjectives, and verbs;
  • ratio of unique words to total words;
  • ratio of nouns to total words;
  • ratio of adjectives to total words;
  • ratio of verbs to total words;
  • average word length;
  • ratio of punctuation marks to total text length;
  • whether place names are present;
  • whether person names are present;
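A small sketch that computes several of these hand-crafted features for a raw text; whitespace tokenization and the example sentence are assumptions, and the POS- and named-entity-based features would additionally require an NLP toolkit:

```python
# Minimal sketch: a few hand-crafted text features.
import string

def manual_features(text):
    tokens = text.split()
    n_chars = len(text)
    n_caps = sum(ch.isupper() for ch in text)
    n_punct = sum(ch in string.punctuation for ch in text)
    n_words = len(tokens)
    n_unique = len(set(t.lower() for t in tokens))
    return {
        "text_length": n_chars,
        "num_capitals": n_caps,
        "caps_ratio": n_caps / n_chars if n_chars else 0.0,
        "num_exclamations": text.count("!"),
        "num_questions": text.count("?"),
        "num_punctuation": n_punct,
        "num_words": n_words,
        "num_unique_words": n_unique,
        "unique_ratio": n_unique / n_words if n_words else 0.0,
        "avg_word_length": sum(len(t) for t in tokens) / n_words if n_words else 0.0,
        "punct_ratio": n_punct / n_chars if n_chars else 0.0,
    }

print(manual_features("Is NLP feature engineering still useful? Yes!"))
```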

