In-depth understanding of the WMD algorithm


WMD (Word Mover's Distance) [1] is a method for measuring text similarity, proposed in 2015. It has the following advantages:

  • Strong results: it makes full use of word2vec's ability to transfer across domains
  • Unsupervised: it does not rely on annotated data, so there is no cold-start problem
  • Simple model: it needs only word vectors as input and has no hyperparameters
  • Interpretable: the problem is cast as a linear program, which has a globally optimal solution
  • Flexible: the importance of individual words can be adjusted by hand

Of course, it also has some disadvantages:

  • It is a bag-of-words model, so word order information is not preserved
  • It handles the out-of-vocabulary (OOV) problem poorly, since OOV words have no word vector
  • It handles negation words poorly
  • It handles domain-specific synonyms and mutually exclusive words poorly
  • High time complexity: $O(p^3 \log p)$, where $p$ is the size of the deduplicated vocabulary of the two tokenized texts

When WMD is used to compute the similarity of two texts, it performs the following steps:

  • Encode each word as a word vector using word2vec

  • Remove stop words

  • Compute the weight of each word within its text, usually its normalized word frequency

  • For each word, find the words in the other text and decide how much of its weight to move to each of them. If two words are semantically close, move all or most of the weight; if they differ a lot, move little or none. The cost of a transfer between two words is the amount moved multiplied by the distance between their vectors

  • Choose the transfers so that the total cost is minimal

  • All the weight of the words in text 1 must be moved out, and every word in text 2 must receive its full weight
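The first steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list and the 2-D vectors in `TOY_VECTORS` are made up for the example (a real pipeline would look the vectors up in a trained word2vec model), and `nbow_weights` and `transfer_cost` are hypothetical helper names, not part of any library.

```python
from collections import Counter
import math

# Hypothetical stop-word list and toy 2-D "embeddings" (assumptions for illustration).
STOP_WORDS = {"the", "in", "to", "a"}
TOY_VECTORS = {
    "obama": (1.0, 0.0),
    "president": (1.0, 0.5),
    "speaks": (0.0, 1.0),
    "greets": (0.0, 1.5),
}


def nbow_weights(tokens):
    """Drop stop words, then return each remaining word's normalized frequency."""
    kept = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
    counts = Counter(kept)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def transfer_cost(plan, vectors):
    """Sum of (amount moved) * (Euclidean distance) over all word pairs in the plan."""
    return sum(
        amount * math.dist(vectors[src], vectors[dst])
        for (src, dst), amount in plan.items()
    )


doc1 = nbow_weights("Obama speaks".split())
doc2 = nbow_weights("The President greets".split())

# One feasible plan: move each word's whole weight to its nearest counterpart.
plan = {("obama", "president"): 0.5, ("speaks", "greets"): 0.5}
print(doc1, doc2, transfer_cost(plan, TOY_VECTORS))
```

The plan here is chosen by hand. Finding the plan with the smallest total cost is the optimization problem that the remaining steps describe.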

    We first treat the words of each document as a distribution (for example, using normalized word frequencies). Consider first how to let each word in document 1 match every word in the other document with a different weight. This is simple: we just allow partial matching. We view matching as a process of "moving" the words of document 1 onto the words of document 2, where the cost of a move is the Euclidean distance between the two word vectors. For example, suppose "Obama" has weight (probability) 0.5 in document 1. If we move 0.4 to "President", 0.05 to "greets", and so on, the moving cost is $0.4 \, \lVert x_{\text{Obama}} - x_{\text{President}} \rVert_2 + 0.05 \, \lVert x_{\text{Obama}} - x_{\text{greets}} \rVert_2 + \cdots$, where $x_w$ denotes the vector of word $w$.
    There should be a constraint here: the weights that "Obama" distributes over the words of document 2 must sum to its weight in document 1, that is, $0.4 + 0.05 + \cdots = 0.5$.

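    Now consider the "mandatory" property. To make sure every word of document 2 is matched, we can also require that the weight flowing into each word of document 2 equals its weight in document 2. For "press", for example, the inflow is the weight moved "Obama" → "press", plus "speaks" → "press", plus "media" → "press", and so on. Every move that satisfies the constraints for the words on both sides has a total moving cost. We define the minimum such cost as the Word Mover's Distance (WMD) between the two documents, namely
    $$\mathrm{WMD}(d, d') = \min_{T \ge 0} \sum_{i,j} T_{ij} \, \lVert x_i - x'_j \rVert_2 \quad \text{s.t.} \quad \sum_j T_{ij} = d_i \ \ \forall i, \qquad \sum_i T_{ij} = d'_j \ \ \forall j,$$
    where $T_{ij}$ is the weight moved from word $i$ of document 1 to word $j$ of document 2, $d_i$ and $d'_j$ are the word weights in the two documents, and $x_i$ and $x'_j$ are the corresponding word vectors.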

    Author: Ziyuan

    Links: https://www.zhihu.com/question/33952003/answer/134691643
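The minimization described in the quoted answer is a small linear program, so it can be handed to a general-purpose LP solver. The sketch below is one possible way to do that with `scipy.optimize.linprog`, not the method of the original paper or of any particular library. The function name `wmd`, the flattening of the transfer matrix into a vector, and the toy 2-D vectors are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(weights1, weights2, vectors1, vectors2):
    """Return (minimum total moving cost, optimal transfer matrix T)."""
    n, m = len(weights1), len(weights2)
    # cost[i, j] = Euclidean distance between word i of doc 1 and word j of doc 2
    cost = np.linalg.norm(vectors1[:, None, :] - vectors2[None, :, :], axis=2)

    # Unknowns: T flattened row by row into a vector of length n * m.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # outflow of word i equals its weight
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # inflow of word j equals its weight
    b_eq = np.concatenate([weights1, weights2])

    result = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.fun, result.x.reshape(n, m)


# Toy data (assumed): doc 1 = {obama, speaks}, doc 2 = {president, greets}.
w1 = np.array([0.5, 0.5])
w2 = np.array([0.5, 0.5])
v1 = np.array([[1.0, 0.0], [0.0, 1.0]])
v2 = np.array([[1.0, 0.5], [0.0, 1.5]])

distance, T = wmd(w1, w2, v1, v2)
print(distance)
print(T)
```

Both weight vectors must sum to the same total (1 when normalized frequencies are used), otherwise the equality constraints cannot all hold and the solver reports the problem as infeasible. A general LP solver is convenient for small examples; dedicated optimal-transport solvers are the usual choice for larger vocabularies.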


Origin www.cnblogs.com/rise0111/p/11440365.html