N-Gram text mining

N-Gram introduction:

N-Gram is based on one assumption: the occurrence of the n-th word depends only on the preceding n-1 words, not on any other words (the Markov assumption). The probability of the whole sentence is then the product of the conditional probabilities of its words, and each of those probabilities can be estimated by counting in a corpus.
Assuming the sentence T consists of the word sequence w1, w2, w3, ..., wn, the N-Gram language model is expressed by the following formula:
P(T) = P(w1) * P(w2|w1) * P(w3|w1 w2) * ... * P(wn|w1 w2 ... wn-1)
P(w2|w1): the conditional probability that w2 occurs given that w1 has already occurred;
How large should n be?

The most commonly used N-Gram models are the Bi-gram and the Tri-gram, expressed by the following formulas:
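In their standard form (writing begin for the padded start-of-sentence positions):

Bi-gram:  P(T) = P(w1|begin) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1)
Tri-gram: P(T) = P(w1|begin1 begin2) * P(w2|begin1 w1) * P(w3|w1 w2) * ... * P(wn|wn-2 wn-1)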
Note how these probabilities are estimated: P(w1|begin) = (number of sentences that start with w1) / (total number of sentences); P(w2|w1) = (number of times w1 and w2 occur together as a pair) / (number of times w1 occurs); and so on.
A classic use of the binary language model (Bi-gram) is estimating how plausible a sentence is from bigram counts, as in the sketch below. N-Gram models can be used to judge whether a sentence is expected or reasonable; they work well for Chinese part-of-speech tagging and Chinese word segmentation, and related neural models include NNLM and CBOW.
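As a minimal illustration of the counting scheme above (the toy corpus, the <s> start token, and the example sentences are invented for this sketch, and no smoothing is applied):

```python
from collections import Counter

# Toy corpus: each sentence is already tokenized (made-up example data).
corpus = [
    ["I", "love", "natural", "language", "processing"],
    ["I", "love", "deep", "learning"],
    ["I", "enjoy", "language", "modeling"],
]

START = "<s>"  # stands in for "begin" in the formulas above

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = [START] + sentence
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev), the maximum-likelihood estimate."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """P(T) under the bi-gram model: product of the conditional probabilities."""
    tokens = [START] + sentence
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob(["I", "love", "natural", "language", "modeling"]))  # plausible order -> non-zero
print(sentence_prob(["language", "love", "I"]))                         # implausible order -> 0.0
```

Real systems add smoothing (for example add-one or Kneser-Ney) so that a single unseen bigram does not force the whole sentence probability to zero.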

TF-IDF algorithm introduction

TF-IDF is used to evaluate how important a word is to one document within a document collection. The importance of a word increases in proportion to the number of times it appears in that document, but decreases with how frequently it appears across the corpus as a whole.
Term Frequency (TF): the number of times a word occurs in a document, usually normalized (for example, divided by the document's total word count) so that long documents do not dominate simply because they contain more words;
Inverse Document Frequency (IDF): a measure of the general importance of a term across the whole collection. It is obtained by dividing the total number of documents in the corpus by the number of documents containing the term, and then taking the logarithm of that quotient.
Conclusion: a word that occurs frequently in a particular document but appears in only a few documents of the whole collection receives a high TF-IDF value.

Formula:
TF-IDF is the fusion of two different metrics, combined by multiplication:
TF-IDF_{i,j} = TF_{i,j} * IDF_i
TF_{i,j} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} is the number of times term t_i appears in document d_j and the denominator is the total number of words in document d_j.
IDF_i = log( |D| / ( |{ j : t_i ∈ d_j }| + 1 ) ), where |D| is the total number of documents in the corpus.

|{ j : t_i ∈ d_j }| is the number of documents that contain the term t_i; the +1 in the denominator avoids division by zero when no document contains the term.
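A minimal sketch of these two formulas in plain Python; the three tiny documents and the probe terms are invented purely for illustration:

```python
import math
from collections import Counter

# Toy corpus of three already-tokenized documents (made-up example data).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf(term, doc):
    """TF_{i,j} = n_{i,j} / Σ_k n_{k,j}: raw count normalized by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """IDF_i = log(|D| / (|{j : t_i in d_j}| + 1)); the +1 avoids division by zero."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("mat", docs[0], docs))  # occurs in only one document -> positive score
print(tf_idf("the", docs[0], docs))  # occurs in every document   -> score at or below zero
```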

What is the difference between fusing the two metrics by addition and fusing them by multiplication?

With addition, if either of the two fused metrics is high, the combined score comes out high even when the other is low. With multiplication, a single low value drags the whole score down: for example, a word with a high TF but an IDF close to zero (because it appears in almost every document) still ends up with a TF-IDF close to zero.

Application examples

  1. Keyword extraction
  2. Sentence similarity calculation
  3. A preprocessing step that feeds other algorithms

TF-IDF article similarity calculation process:

  1. Use the TF-IDF algorithm to find the keywords of the two articles respectively;
  2. Take each article's top 15 words by TF-IDF value and merge them into one set, then calculate each article's relative word frequency over the keywords in that set;
  3. Generate word frequency vectors of two articles respectively;
  4. Calculate the cosine similarity of the two vectors; the larger the value, the more similar the two articles (see the sketch below)
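A rough, self-contained sketch of these four steps; the helper names, the toy articles, and the top-15 cutoff are illustrative choices only:

```python
import math
from collections import Counter

def tf_idf_scores(doc, docs):
    """TF-IDF score of every term in one document against the whole corpus."""
    counts = Counter(doc)
    scores = {}
    for term, n in counts.items():
        tf = n / len(doc)
        df = sum(1 for d in docs if term in d)
        scores[term] = tf * math.log(len(docs) / (df + 1))
    return scores

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def article_similarity(doc_a, doc_b, docs, top_k=15):
    # 1. find the keywords of each article with TF-IDF
    scores_a = tf_idf_scores(doc_a, docs)
    scores_b = tf_idf_scores(doc_b, docs)
    top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:top_k]
    top_b = sorted(scores_b, key=scores_b.get, reverse=True)[:top_k]
    # 2. merge the two keyword lists into one set (the shared vocabulary)
    vocab = sorted(set(top_a) | set(top_b))
    # 3. relative word-frequency vector of each article over that vocabulary
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    vec_a = [counts_a[t] / len(doc_a) for t in vocab]
    vec_b = [counts_b[t] / len(doc_b) for t in vocab]
    # 4. cosine similarity: closer to 1 means more similar
    return cosine(vec_a, vec_b)

# Toy corpus of three tiny "articles" (invented for the example).
docs = [
    "the cat sat on the mat and the cat purred".split(),
    "the cat lay on the mat while the dog barked".split(),
    "stock prices rose sharply as markets rallied today".split(),
]

print(article_similarity(docs[0], docs[1], docs))  # related articles -> high score
print(article_similarity(docs[0], docs[2], docs))  # no shared terms -> 0.0
```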

Origin blog.csdn.net/qq_29027865/article/details/93891078