NLP: TF-IDF

First of all, I do not know what practical use this exercise has. There are a lot of things I am still hazy about, but it may be fun to play with, and that makes me happy. Today I saw a screenshot of a WeChat Moments post by Mr. Ma: Tencent was founded to make good products, not to make money. Ha ha ha ha ha ha ha.

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for evaluating how important a term is to one document within a collection of documents (a corpus). A term's importance increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus.

TF-IDF is simply TF * IDF. The main idea is this: if a word or phrase appears frequently in one article (high TF) and rarely in other articles (high IDF), it is considered to discriminate well between categories and is suitable for classification.
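The post does not spell out the formulas, so here is one common textbook formulation as a sketch. It is an assumption, since several variants exist (smoothed or log-scaled, for example). The symbols are introduced here for illustration: n_{t,d} is the number of times term t occurs in document d, D is the corpus, and N is the number of documents in D.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Under this form, a term that occurs in every document has idf = log(1) = 0, so its TF-IDF weight is zero no matter how often it appears in any single document.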

TF (Term Frequency) is the frequency with which a given term t occurs in a given document d. The higher the TF, the more important t is to d; the lower the TF, the less important t is to d. Can TF alone serve as the criterion for text similarity? It cannot. For example, common Chinese words such as "I," "the," and "a" occur very frequently in any given Chinese document, and they have a high term frequency in almost every document. If TF were used as the similarity criterion, almost every document would be matched.
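To make the weakness of raw TF concrete, here is a minimal Python sketch. It assumes the document is already tokenized and defines TF as count divided by document length, which is one common choice rather than the only one. The function name `term_frequency` and the toy English document are made up for illustration.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total tokens} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

# Hypothetical toy document, already split into tokens.
doc = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
tf = term_frequency(doc)

# The most frequent term is a function word, not a content word.
top_term = max(tf, key=tf.get)
print(top_term, tf[top_term])  # the 0.375
```

The top-scoring term carries almost no information about what the document is about, which is the problem described above.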

IDF (Inverse Document Frequency): the main idea is that the fewer documents contain the term t, the larger its IDF, which indicates that t distinguishes well between categories at the level of the whole document collection. What problem does IDF address? As an example, common Chinese words such as "I," "the," and "a" have a very high term frequency in almost every document, so for the collection as a whole these words are unimportant. Across the entire collection, the criterion for a term's importance is its IDF.
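The following Python sketch shows how IDF suppresses ubiquitous words. It assumes the unsmoothed form idf(t) = log(N / df(t)), where df(t) is the number of documents containing t; libraries often use smoothed variants instead. The function names and the three-document toy corpus are hypothetical.

```python
import math
from collections import Counter

def inverse_document_frequency(corpus):
    """Return {term: log(N / number of documents containing term)}."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(tokens, idf):
    """Weight each term of one document by its TF times its corpus IDF."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: (n / total) * idf[term] for term, n in counts.items()}

# Hypothetical toy corpus: "the" appears in every document.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "stock", "market", "fell"],
]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[2], idf)

print(idf["the"])                 # 0.0, since log(3 / 3) = 0
print(idf["cat"] < idf["stock"])  # True: rarer terms get a larger IDF
print(weights)
```

Because "the" occurs in all three documents, its IDF is zero and it drops out of the weighted result, while "stock", which appears in only one document, keeps a positive weight.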

 

 


Origin www.cnblogs.com/students/p/8998971.html