Search engine (2) - Query understanding - Word segmentation

Word segmentation is one of the most basic and most important functions in search; correct segmentation is a necessary condition for good search results.

1. Segmentation granularity

Segmentation is mainly a question of granularity. Take the title "射雕英雄传" (The Legend of the Condor Heroes) as an example. It can be segmented in several ways; which one is the most correct?

  1. Finest-grained segmentation: [射雕, 英雄, 传]
  2. Normal granularity: [射雕, 英雄传]
  3. Coarsest-grained segmentation: [射雕英雄传]
  4. Mixed granularity: [射雕, 射雕英雄传, 英雄, 英雄传, 传]

None of the four is wrong; which one to use depends on the specific application scenario.
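The granularities above can be reproduced with a small dictionary-based sketch. It uses the Chinese title 射雕英雄传 (The Legend of the Condor Heroes); the dictionary contents and the function names `segment` and `mixed_granularity` are assumptions made for illustration, not the API of any particular segmenter. Greedy forward matching is only one possible strategy, and it does not produce the "normal" granularity, which usually needs a statistical model.

```python
# Toy dictionary (an assumption for illustration; real segmenters use large lexicons).
DICTIONARY = {"射雕", "英雄", "传", "英雄传", "射雕英雄传"}


def _word_ends(text, start, dictionary):
    """End offsets of every dictionary word that begins at `start`, shortest first."""
    return [end for end in range(start + 1, len(text) + 1)
            if text[start:end] in dictionary]


def segment(text, dictionary=DICTIONARY, longest=True):
    """Greedy forward matching: take the longest (coarse) or shortest (fine) word."""
    tokens, i = [], 0
    while i < len(text):
        ends = _word_ends(text, i, dictionary)
        if not ends:
            end = i + 1  # unknown character: emit it on its own
        else:
            end = ends[-1] if longest else ends[0]
        tokens.append(text[i:end])
        i = end
    return tokens


def mixed_granularity(text, dictionary=DICTIONARY):
    """Every dictionary word found anywhere in the text, overlaps included."""
    return [text[start:end]
            for start in range(len(text))
            for end in _word_ends(text, start, dictionary)]


title = "射雕英雄传"
print(segment(title, longest=False))  # ['射雕', '英雄', '传']
print(segment(title, longest=True))   # ['射雕英雄传']
print(mixed_granularity(title))       # ['射雕', '射雕英雄传', '英雄', '英雄传', '传']
```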

  • Index construction

When building the index, the goal is to expand recall, so the index should generally contain both coarse and fine terms; the fourth option is best. In the index, [射雕, 射雕英雄传, 英雄, 英雄传, 传] is represented as five terms, so even if the user's input is incomplete, for example only "射雕", the document can still be found.

If the index uses only coarse-grained segmentation, for example the third option [射雕英雄传], it holds the single term 射雕英雄传. A user who searches for only part of the title, such as "射雕", matches nothing, so this result cannot be found.
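A minimal sketch of the difference, assuming a plain in-memory inverted index: the token lists stand in for the output of a segmenter, and `build_index` and `lookup` are hypothetical helper names. The document id 1 represents a page titled 射雕英雄传 (The Legend of the Condor Heroes).

```python
from collections import defaultdict


def build_index(tokenized_docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in tokenized_docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Return the sorted posting list for a term (empty if the term is not indexed)."""
    return sorted(index.get(term, set()))


# The same document indexed with mixed granularity and with coarse granularity only.
mixed_index = build_index({1: ["射雕", "射雕英雄传", "英雄", "英雄传", "传"]})
coarse_index = build_index({1: ["射雕英雄传"]})

print(lookup(mixed_index, "射雕"))        # [1]  partial input still matches
print(lookup(coarse_index, "射雕"))       # []   partial input finds nothing
print(lookup(coarse_index, "射雕英雄传"))  # [1]  only the full title matches
```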

  • Online queries
    At query time, coarse-grained and fine-grained segmentation each have advantages and disadvantages.
    • Coarse-grained segmentation:
      • Lower recall. For example, if the online query is segmented as [射雕英雄传], it cannot find content such as "射雕英雄后传".
      • Higher precision. Only content containing the complete "射雕英雄传" is returned, not content such as "英雄传".
      • Better performance. A coarse-grained term has a relatively short inverted list (posting list), only one posting list has to be fetched, and no other terms take part in scoring.
    • Fine-grained segmentation (the pros and cons are the opposite of coarse-grained segmentation):
      • Higher recall. It can find content such as "射雕英雄后传" and "英雄传" (if the terms are combined with OR, so that not every term has to be hit).
      • Lower precision. Some of the results are only partially relevant.
      • More complex processing logic. After segmentation, how should the relationship between multiple terms be handled: take the intersection or the union?
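The trade-off between intersection and union can be sketched with a few made-up posting lists. The document ids and their contents are assumptions for illustration only: 1 is 射雕英雄传 (The Legend of the Condor Heroes), 2 is a title with an extra character inserted (射雕英雄后传), 3 contains only 英雄传, and 4 is an unrelated title that happens to end in 传.

```python
# Hypothetical posting lists: term -> ids of documents containing it.
POSTINGS = {
    "射雕英雄传": {1},
    "射雕": {1, 2},
    "英雄": {1, 2, 3},
    "传": {1, 2, 3, 4},
}


def search_and(terms):
    """Intersection: a document must contain every query term."""
    lists = [POSTINGS.get(term, set()) for term in terms]
    return set.intersection(*lists) if lists else set()


def search_or(terms):
    """Union: a document may contain any of the query terms."""
    return set().union(*(POSTINGS.get(term, set()) for term in terms))


coarse_query = ["射雕英雄传"]
fine_query = ["射雕", "英雄", "传"]

print(sorted(search_and(coarse_query)))  # [1]          precise, low recall
print(sorted(search_and(fine_query)))    # [1, 2]       also finds document 2
print(sorted(search_or(fine_query)))     # [1, 2, 3, 4] highest recall, lowest precision
```

Note that the coarse query touches one posting list, while the fine query touches three and then has to combine them, which is the performance cost described above.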

If the content is a fixed expression, such as an idiom, a person's name, or a place name, further subdivision is not recommended. Otherwise the search results will drift noticeably from what the user intended.

If both coarse and fine segmentation are acceptable, consider a compromise: first search with coarse-grained segmentation. If there are enough results and their quality is good enough, skip the fine-grained search. Otherwise, if there are relatively few results or their quality is poor, segment more finely and run a further query.
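A minimal sketch of this compromise follows. It checks only the number of results, not their quality, and the threshold `min_results`, the function name `search_with_fallback`, and the two stand-in search functions are all assumptions for illustration. Keeping the coarse results first when merging is a design choice of this sketch, on the reasoning that exact matches are the most likely to be relevant.

```python
def search_with_fallback(query_text, coarse_search, fine_search, min_results=3):
    """Search coarsely first; add fine-grained results only if too few were found."""
    results = list(coarse_search(query_text))
    if len(results) >= min_results:
        return results  # enough results: skip the more expensive fine-grained search
    seen = set(results)
    for doc_id in fine_search(query_text):
        if doc_id not in seen:  # keep coarse results first, append the new ones
            seen.add(doc_id)
            results.append(doc_id)
    return results


# Hypothetical stand-ins for real retrieval functions.
def coarse_search(query_text):
    return [1] if query_text == "射雕英雄传" else []


def fine_search(query_text):
    return [1, 2, 3]


print(search_with_fallback("射雕英雄传", coarse_search, fine_search))                 # [1, 2, 3]
print(search_with_fallback("射雕英雄传", coarse_search, fine_search, min_results=1))  # [1]
```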

For example, take the name "周杰伦" (Jay Chou). When indexing, several granularities are possible, e.g. [周, 周杰, 周杰伦, 杰伦].

  • A user searches for "周杰伦": segment only coarsely, as [周杰伦], so that exactly matching content is found. If the query is subdivided into [周杰], content about "周杰" (Zhou Jie) is returned, which clearly violates the user's primary intent.

  • A user searches for "周" or "周杰": content about "周杰伦" can also be returned, because the user may have wanted Jay Chou but clicked the search button before typing the full name. Incomplete input is a common problem in search.

2. Lemmatization & stemming

English and many other languages inflect words for tense and for singular and plural; Chinese does not have this problem. If tokenization ignores lemmatization and stemming, relevant results will be missed and recall suffers.

  • Lemmatization

Lemmatization reduces a word to its base (dictionary) form. For example, inflected verb forms such as the past tense, past participle, and present participle are reduced to the base verb (running -> run), and plurals are reduced to the singular (dogs -> dog).

That is, a user who searches for "dog" can also find content containing "dogs", and a search for "dogs" can also find content containing "dog". So during tokenization, "dogs" must be recognized as having the base form "dog".

Lemmatization is generally implemented with a dictionary to achieve high accuracy. It can also be done with rules, but English has too many irregular plurals and tenses for rules alone to handle.
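The dictionary-first approach with a rule as fallback can be sketched as follows. The tiny `LEMMA_DICT` and the single plural rule are assumptions for illustration; a production lemmatizer relies on a full lexicon and usually on part-of-speech information as well.

```python
# Toy lemma dictionary (an assumption; real systems use a full lexicon).
LEMMA_DICT = {
    "ran": "run",
    "running": "run",
    "went": "go",
    "mice": "mouse",
    "children": "child",
}


def lemmatize(word):
    """Look the word up in the dictionary first, then fall back to one naive rule."""
    word = word.lower()
    if word in LEMMA_DICT:
        return LEMMA_DICT[word]
    # Naive rule: strip a trailing "s" from longer words that do not end in "ss".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


print(lemmatize("dogs"))     # dog    handled by the rule
print(lemmatize("running"))  # run    handled by the dictionary
print(lemmatize("mice"))     # mouse  irregular plural, dictionary only
print(lemmatize("buses"))    # buse   the naive rule gets this one wrong
```

The last line shows why rules alone are not enough: any word the dictionary does not cover is at the mercy of the rule.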

  • Stemming

Stemming removes the suffix of a word to obtain its stem. An obvious difference from lemmatization is that the result of lemmatization is still a meaningful word, whereas a stem may not be a word at all, only part of one. For example, the stem of "electricity" is "electr".
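To make the idea concrete, here is a deliberately simplified suffix stripper. It is not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length are assumptions chosen only so that related words collapse to the same non-word stem.

```python
# Toy suffix list (an assumption for illustration, not a real stemming algorithm).
SUFFIXES = sorted(["icity", "ical", "ic", "ing", "ed", "s"], key=len, reverse=True)
MIN_STEM_LENGTH = 3


def stem(word):
    """Strip the longest matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


print(stem("electricity"))  # electr
print(stem("electrical"))   # electr
print(stem("electric"))     # electr
print(stem("dogs"))         # dog
```

All three "electr-" words map to one stem that is not itself a word, so a query for any of them can match documents containing the others.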

Compared with lemmatization, stemming recalls more results, but precision also drops.

In search, if lemmatization or stemming is used, the user's original input word and the reduced lemma or stem should be given different weights. Otherwise some off-target results will be returned.
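One way to keep the two apart is to expand the query into weighted terms. In the sketch below the weights 1.0 and 0.5, the choice to favor the original word, the scoring function, and the `reduce_word` callback are all assumptions for illustration; the documents are assumed to be indexed under both their surface forms and their reduced forms.

```python
ORIGINAL_WEIGHT = 1.0  # assumed weight for the word exactly as the user typed it
REDUCED_WEIGHT = 0.5   # assumed lower weight for the lemma or stem


def expand_query(word, reduce_word):
    """Return {term: weight} for the original word and its reduced form."""
    weighted = {word: ORIGINAL_WEIGHT}
    reduced = reduce_word(word)
    if reduced != word:
        weighted[reduced] = REDUCED_WEIGHT
    return weighted


def score(doc_terms, weighted_terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(weight for term, weight in weighted_terms.items() if term in doc_terms)


def reduce_word(word):
    """Hypothetical reducer standing in for a real lemmatizer or stemmer."""
    return {"dogs": "dog"}.get(word, word)


query = expand_query("dogs", reduce_word)
doc_with_dogs = {"dogs", "dog"}  # text contains "dogs"; indexed with its lemma too
doc_with_dog = {"dog"}           # text contains only "dog"

print(query)                        # {'dogs': 1.0, 'dog': 0.5}
print(score(doc_with_dogs, query))  # 1.5  exact match ranks higher
print(score(doc_with_dog, query))   # 0.5  still recalled, but ranked lower
```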

Origin www.cnblogs.com/grindge/p/11968557.html