Elasticsearch usage summary (part 8)

Elasticsearch has many built-in analyzers. Let's compare the system default tokenizers with the commonly used Chinese tokenizers.

System default tokenizers:

1. Standard tokenizer
StandardAnalyzer handles English about the same way as StopAnalyzer; Chinese is supported by splitting it into single characters. It converts lexical units to lowercase and removes stop words and punctuation.

/** StandardAnalyzer analyzer */
    public void standardAnalyzer(String msg){
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }
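
Each of these snippets delegates to a getTokens helper that is not shown in the original article. A minimal sketch of such a helper, assuming the Lucene 3.6 TokenStream API, might look like this:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Prints the tokens an analyzer produces for the given text (hypothetical helper). */
    private void getTokens(Analyzer analyzer, String msg){
        try {
            TokenStream stream = analyzer.tokenStream("content", new StringReader(msg));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print(term.toString() + "|");
            }
            stream.end();
            stream.close();
            System.out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }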

2. Simple tokenizer
SimpleAnalyzer first splits the text on non-letter characters and then lowercases the lexical units. This analyzer also strips out numeric characters.

/** SimpleAnalyzer analyzer */
    public void simpleAnalyzer(String msg){
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

3. Whitespace tokenizer
WhitespaceAnalyzer splits text only on whitespace. It does not lowercase characters, does not support Chinese, and performs no other normalization on the generated lexical units.

/** WhitespaceAnalyzer analyzer */
    public void whitespaceAnalyzer(String msg){
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

4. Stop tokenizer
StopAnalyzer goes beyond SimpleAnalyzer: on top of SimpleAnalyzer's behavior it removes common English words (such as "the", "a", etc.), and you can also supply your own stop-word list as needed. Chinese is not supported.

/** StopAnalyzer analyzer */
   public void stopAnalyzer(String msg){
       StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
       this.getTokens(analyzer, msg);
   }

5. Keyword tokenizer
KeywordAnalyzer treats the entire input as a single lexical unit, which makes it convenient to index and retrieve special types of text. It is very handy for creating index entries for text such as zip codes and addresses.
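
A matching snippet in the style of the examples above (a sketch; in Lucene 3.x KeywordAnalyzer takes no Version argument):

/** KeywordAnalyzer analyzer */
    public void keywordAnalyzer(String msg){
        KeywordAnalyzer analyzer = new KeywordAnalyzer();
        this.getTokens(analyzer, msg);
    }
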
6. Pattern tokenizer
A pattern analyzer splits text into terms (the units produced after token filtering) using a regular expression. It accepts the following settings:

lowercase: whether terms are lowercased. Defaults to true.
pattern: the regular expression pattern. Defaults to \W+.
flags: regular expression flags.
stopwords: a list of stop words used to initialize the stop filter. Defaults to an empty list.
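
Put together, such an analyzer might be declared in the index settings like this (a sketch; my_pattern_index and my_pattern_analyzer are hypothetical names):

PUT /my_pattern_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_pattern_analyzer": {
                    "type":      "pattern",
                    "pattern":   "\\W+",
                    "lowercase": true,
                    "stopwords": [ "the", "a" ]
                }
            }
        }
    }
}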

7. Language tokenizers
A collection of analyzers for specific languages (arabic, armenian, basque, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai). Unfortunately, Chinese is not among them, so they are not considered here.
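
For reference, a language analyzer is selected simply by its type name and can take a custom stop-word list. A minimal sketch (my_english is a hypothetical analyzer name):

"analyzer": {
    "my_english": {
        "type":      "english",
        "stopwords": [ "the", "a" ]
    }
}
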
8. Snowball tokenizer
A snowball analyzer consists of the standard tokenizer plus four filters: the standard filter, lowercase filter, stop filter, and snowball filter.

The snowball analyzer is generally deprecated in Lucene.

Chinese tokenizers:

1. ik-analyzer
IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit developed in Java.

It uses a distinctive "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes, fine-grained and maximum word length, with a throughput of about 830,000 characters per second (1600 KB/s).

It uses a multi-subprocessor analysis mode and can segment English letters, digits, and Chinese vocabulary, and it is compatible with Korean and Japanese characters.

Its dictionary storage is optimized for a smaller memory footprint, and user-defined dictionary extensions are supported.

IKQueryParser (recommended by the author) is a query analyzer optimized for Lucene full-text retrieval. It introduces a simple search-expression syntax and uses an ambiguity-analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve Lucene's retrieval hit rate.

Maven usage:

<dependency>
    <groupId>org.wltea.ik-analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>3.2.8</version>
</dependency>

IK Analyzer has not been published to the Maven Central Repository, so before the dependency above will resolve you need to install the jar manually into your local repository or upload it to your own Maven repository server.
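
Once the jar is available, IK Analyzer plugs into Lucene just like the built-in analyzers above. A minimal sketch, assuming the IKAnalyzer(boolean) constructor of the 3.x line (the boolean is assumed to toggle maximum-word-length mode):

/** IKAnalyzer analyzer */
    public void ikAnalyzer(String msg){
        // true: maximum-word-length mode, false: fine-grained mode (assumed semantics)
        IKAnalyzer analyzer = new IKAnalyzer(true);
        this.getTokens(analyzer, msg);
    }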

2. Jieba Chinese word segmentation
Features:
1. Supports three segmentation modes:
precise mode, which tries to segment the sentence as accurately as possible and is suitable for text analysis;
full mode, which scans out every word in the sentence that can form a word; it is very fast but cannot resolve ambiguity;
search-engine mode, which re-segments long words on top of precise mode to improve recall, and is suitable for search-engine segmentation.
2. Supports traditional Chinese segmentation
3. Supports custom dictionaries

3. THULAC
THULAC (THU Lexical Analyzer for Chinese) is a Chinese lexical-analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging. THULAC has the following characteristics:

Strong capability. It is trained on the world's largest manually segmented and POS-tagged Chinese corpus integrated by the lab (about 58 million characters), giving the model strong tagging ability.

High accuracy. On the standard Chinese Treebank (CTB5) dataset, its F1 score reaches 97.3% for word segmentation and 92.9% for part-of-speech tagging, comparable to the best methods on this dataset.

Fast. Simultaneous segmentation and POS tagging runs at 300 KB/s, about 150,000 characters per second; segmentation alone reaches 1.3 MB/s.

The Chinese word-segmentation tool thulac4j (a Java implementation of THULAC) has been released. It:

1. Normalizes the segmentation dictionary and removes some useless words;

2. Rewrites the construction algorithm for the DAT (double-array trie); the generated DAT is about 8% smaller, saving memory;

3. Optimizes the segmentation algorithm to improve segmentation speed.

Maven:

<dependency>
  <groupId>io.github.yizhiru</groupId>
  <artifactId>thulac4j</artifactId>
  <version>${thulac4j.version}</version>
</dependency>

thulac4j supports two word segmentation modes:

SegOnly mode, only word segmentation without part-of-speech tagging;

SegPos mode, word segmentation and part-of-speech tagging.

// SegOnly mode
String sentence = "滔滔的流水,向着波士顿湾无声逝去";
SegOnly seg = new SegOnly("models/seg_only.bin");
System.out.println(seg.segment(sentence));
// [滔滔, 的, 流水, ,, 向着, 波士顿湾, 无声, 逝去]

// SegPos mode
SegPos pos = new SegPos("models/seg_pos.bin");
System.out.println(pos.segment(sentence));
//[滔滔/a, 的/u, 流水/n, ,/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]

4. NLPIR
NLPIR, from the Institute of Computing Technology, Chinese Academy of Sciences, provides online Chinese analysis.
Download address: https://github.com/NLPIR-team/NLPIR
5. Ansj word segmenter
This is something I found on GitHub, and it works quite well. Its segmentation speed reaches roughly 2 million characters per second (tested on a MacBook Air), with an accuracy above 96%.
It currently implements Chinese word segmentation, Chinese name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging.
It can be applied to natural language processing and is suitable for projects with high demands on segmentation quality.

Maven dependency:

<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.1</version>
</dependency>

Call demo:

import org.ansj.splitWord.analysis.ToAnalysis;

String str = "欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!";
System.out.println(ToAnalysis.parse(str));

Output:

欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,),在/p,这里/r,如果/c,你/r,遇到/v,什么/r,问题/n,都/d,可以/v,联系/v,我/r,./m,我/r,一定/d,尽我所能/l,./m,帮助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,准/a,,,更/d,自由/a,!

6. LTP from Harbin Institute of Technology
LTP defines an XML-based representation for language-processing results and, on that basis, provides a complete set of bottom-up, rich, and efficient Chinese language processing modules (six core Chinese processing technologies covering lexical, syntactic, and semantic analysis), an application programming interface based on dynamic link libraries (DLLs), visualization tools, and the ability to be used as a web service.

……

Custom analyzers

Although Elasticsearch comes with some analyzers out of the box, their real power lies in the fact that we can combine character filters, tokenizers, and token filters in a configuration that suits our specific data, creating custom analyzers.
Character filters:
A character filter is used to tidy up a string before it is tokenized. For example, if our text is in HTML, it will contain HTML tags such as <p> or <div> that we do not want indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities such as &Aacute; into the corresponding Unicode character Á.

An analyzer may have zero or more character filters.
Tokenizers:
An analyzer must have exactly one tokenizer. The tokenizer breaks the string into individual terms or lexical units. The standard tokenizer, used by the standard analyzer, splits a string into individual terms on word boundaries and removes most punctuation; other tokenizers behave differently.
Token filters:
After tokenization, the resulting token stream passes through the specified token filters in the specified order.

Token filters can modify, add, or remove tokens. We have already mentioned the lowercase and stop filters, but Elasticsearch offers many more. Stemming filters reduce words to their root form. The ascii_folding filter removes diacritics, turning a word like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocompletion, as sketched below.
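
For instance, an edge_ngram token filter for autocompletion could be declared like this (a sketch; my_edge_ngram is a hypothetical filter name):

"filter": {
    "my_edge_ngram": {
        "type":     "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
    }
}
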
Create a custom analyzer
We can configure character filters, tokenizers, and token filters in the corresponding sections under analysis in the index settings:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

As an example, let's build an analyzer that does the following:

1. Remove the HTML portion with the html_strip character filter.

2. Use a custom mapping character filter to replace & with "and":

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

3. Use the standard tokenizer to segment words.

4. Lowercase terms with the lowercase token filter.

5. Use the custom stop word filter to remove the words contained in the custom stop word list:

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

Our analyzer definition combines the built-in tokenizer and filters with the custom filters we configured above:

"analyzer": {
    "my_analyzer": {
        "type":           "custom",
        "char_filter":  [ "html_strip", "&_to_and" ],
        "tokenizer":      "standard",
        "filter":       [ "lowercase", "my_stopwords" ]
    }
}

Taken together, a complete create index request should look like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}

After the index is created, use the analyze API to test the new analyzer:

GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox

The abbreviated results below show that our analyzer is working correctly:

{
  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}

This analyzer is of little use right now unless we tell Elasticsearch where to use it. We can apply this analyzer to a string field like this:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}
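
As a quick check (a sketch with a hypothetical document), remember that the query string of a match query is analyzed with my_analyzer as well, so "quick & brown" matches the indexed "The quick & brown fox":

PUT /my_index/my_type/1
{ "title": "The quick & brown fox" }

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "quick & brown"
        }
    }
}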
