Full-text indexing: a bug where some words can be found and others cannot

The concept of full-text indexing: a technique for finding any piece of content in a whole book or article stored in a database. It can retrieve chapters, sections, paragraphs, sentences, and words from the full text and, if needed, also produce various statistics and analyses.

The principle is to define a vocabulary (dictionary), then record the frequency and positions at which each entry (term) appears in the articles, and store this information in vocabulary order. This is equivalent to building an index over the documents with the vocabulary as its table of contents, so a lookup for a word can jump straight to the places where that term occurs.
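
The principle above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used by any particular database: the function name build_inverted_index and the sample documents are made up for the example, and terms are split on whitespace, which only works for space-separated text.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; the frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Whitespace splitting is an assumption that only suits English-like text.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical sample documents.
docs = {
    1: "full text index finds any word",
    2: "an index maps each word to its positions",
}
index = build_inverted_index(docs)
print(index["index"])  # {1: [2], 2: [1]}
print(index["word"])   # {1: [5], 2: [4]}
```

Looking up a term is then a single dictionary access, which is why the search can go directly to the matching locations.
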
For English documents this approach clearly works well, because English words are naturally separated by spaces, so a large enough vocabulary handles them well. Asian scripts, however, have no spaces to mark word boundaries, so it is hard to decide what counts as a word. The words people use also change constantly, and maintaining an extensible vocabulary is very costly, which is where the problem arises.
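
To make the word-boundary problem concrete, here is a rough sketch of dictionary-based forward maximum matching, one common way to segment text that has no spaces. The function name, the max_len limit, and the tiny vocabulary are assumptions for illustration only; real tokenizers use far larger dictionaries and extra rules.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary; a word missing from it falls apart into characters.
vocab = {"全文", "索引", "全文索引", "技术"}
print(forward_max_match("全文索引技术", vocab))  # ['全文索引', '技术']
print(forward_max_match("新词", vocab))          # ['新', '词']
```

The second call shows the maintenance cost mentioned above: a word that is not in the vocabulary is broken into single characters, so it cannot be found as a whole term.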
 
 

1. Build a search index on the name attribute of the Trojans edgelabel;
2. Insert a Trojans record whose name is "Trojan - remote control -Trojan.Win32.FakeLPK.7cfa on-line package";
3. Search on name: the four terms "Trojan", "remote control", "on-line", and "package" are all found; the three terms "Win32", "FakeLPK", and "7cfa" are not.

 

 

After investigation, the cause is the tokenizer selected in the default configuration file: some tokenizers segment Chinese well, while others segment English well.

word MaximumMatching: [Trojan, far control, trojan,, win32, fakelpk, 7cfa, the line pack.]
word MaximumMatching: [Trojan, far control, trojan,, win32, fakelpk, 7cfa, the line package.]

jieba SEARCH: [Trojan, -, remote control, trojan,, win32, fakelpk, 7cfa, the line pack.]
jieba INDEX: [Trojan, -, remote control, trojan,, win32, fakelpk, 7cfa, the line package.] (this one currently fits our needs best)

smartcn: [Trojan, far control, trojan, win, 32, fakelpk, 7, cfa, on line, including]

mmseg4j Simple: [Trojan, far control, trojan, win32, fakelpk, 7cfa, on-line, the package]
mmseg4j Complex: [Trojan, far control, trojan, win32, fakelpk, 7cfa, on line, including]

jcseg Simple: [Trojan, -, far control, trojan.win32.fakelpk.7cfa, trojan, win, 32, fakelpk, 7, cfa, on-line, the package]
jcseg Complex: [Trojan, -, far control, trojan.win32.fakelpk.7cfa, trojan, win, 32, fakelpk, 7, cfa, on-line, the package]

hanlp standard: [Trojan, -, far control, -Trojan,, Win, 32, FakeLPK, 7, cfa, on line package.]
hanlp NLP: [Trojan, -, remote control, -Trojan,, Win. , 32, FakeLPK, 7, cfa , on line, including]

ansj BaseAnalysis: [Trojan, -, far control, trojan,, win, 32, fakelpk, 7, cfa, on-line, the package.]
ansj IndexAnalysis: [Trojan, -, far control, trojan,, win, 32, . fakelpk, 7, cfa, on-line, the package]

ik smart: [Trojan, far control, trojan.win32.fakelpk.7cfa, on-line, the package] (this is the default)
ik max_word: [Trojan, far control, trojan.win32.fakelpk.7cfa, trojan, win, 32, fakelpk, 7, cfa, on-line, the package]
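
The token lists above already explain the symptom. As a rough check, the sketch below copies the ik smart and jieba INDEX lists and tests the three failing search terms against them. It assumes that a search only hits when the query equals an indexed token exactly, which is a simplification; the real backend may also analyze the query string. The helper name hits is made up for the example.

```python
# Token lists copied from the results above (lowercased; the empty token
# and the trailing period in the jieba output are dropped).
IK_SMART = ["trojan", "far control", "trojan.win32.fakelpk.7cfa",
            "on-line", "the package"]
JIEBA_INDEX = ["trojan", "-", "remote control", "trojan", "win32",
               "fakelpk", "7cfa", "the line package"]

def hits(query, tokens):
    """Assumed matching rule: the query must equal one indexed token."""
    return query.lower() in set(tokens)

for query in ["Win32", "FakeLPK", "7cfa"]:
    print(query, "ik smart:", hits(query, IK_SMART),
          "jieba INDEX:", hits(query, JIEBA_INDEX))
# Each query prints False for ik smart and True for jieba INDEX.
```

With ik smart, "trojan.win32.fakelpk.7cfa" is stored as a single token, so none of its parts can be matched on its own, which is consistent with the three terms that were not found.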

 


Origin www.cnblogs.com/tarzen213/p/11982432.html