solr analyzer

Excerpted, with many thanks, from:
http://matieli.iteye.com/blog/1011149

IX. SOLR builds an enterprise search platform -- field boosting
In many cases we need to increase the weight of a field so that search results are displayed in a sensible order.
For example, suppose a schema has three fields: chapterId, title, and content.
We want a keyword that matches in the title to be displayed first, and a match only in the content to appear later in the results; if both match, the document naturally ranks higher still. Here is how to do it in Solr:
title:(test1 test2)^4 content:(test1 test2)
This boosts the title field, so title matches rank first.
As for the number 4 after ^: in my tests, a good value seems to be n+1 when there are n fields, but I encourage everyone to test this further for themselves!
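The same per-field weighting can also be baked into a request handler instead of being written into every query. A minimal solrconfig.xml sketch, assuming the dismax query parser and the title/content fields from the example above (the handler name "/weighted" is illustrative):

```xml
<!-- Illustrative request handler: boost title 4x over content via qf -->
<requestHandler name="/weighted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- query fields with boosts: title^4, content with the default boost of 1 -->
    <str name="qf">title^4 content</str>
  </lst>
</requestHandler>
```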

X. SOLR builds an enterprise search platform -- Solr tokenizers, filters, and analyzers
For Lucene's analyzer, tokenizer, and filter, see: http://lianj-lee.iteye.com/blog/501247

    When a document is indexed, the data in each field is analyzed (as the blog above explains, analysis is the combination of tokenizing and filtering). The text is finally split into individual tokens: whitespace is removed, characters are lowercased, plurals are reduced to singular, stop words are dropped, synonyms are substituted, and so on.
  For example, given "This is a blog!": "this", "is", and "a" are removed, leaving "blog"; the "!" symbol is removed as well.
  This process runs during both indexing and querying, and the two are usually configured identically, so that the built index and the query terms match correctly.
Analyzer
  An analyzer consists of two parts: a tokenizer and filters. The tokenizer splits a sentence into individual tokens, and the filters then transform or drop those tokens.
  Solr ships with a number of tokenizers. To use a custom tokenizer, you need to modify the schema.xml file.
  schema.xml allows two ways of customizing how text is analyzed; normally only the content of a field of type solr.TextField allows a custom analyzer.

    Method 1: Set any subclass of org.apache.lucene.analysis.Analyzer directly.
<fieldType name="text" class="solr.TextField">
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
    Method 2: Specify a TokenizerFactory followed by a series of TokenFilterFactories (applied in the order listed). The factories create the tokenizers and token filters and hold their configuration; this avoids the overhead of creating them via reflection.

<analyzer type="index">
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
...

    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
    ...
    </analyzer>

  Note that any Analyzer, TokenizerFactory, or TokenFilterFactory must be specified by its fully qualified class name (package included), and must be on Solr's classpath. Only classes under the org.apache.solr.analysis package may be abbreviated as solr.*.

   If you need your own tokenizers and filters, you must write a factory for them, which must be a subclass of BaseTokenizerFactory (for tokenizers) or BaseTokenFilterFactory (for filters), like the one below.

public class MyFilterFactory extends BaseTokenFilterFactory {
  // Wrap the incoming token stream in our custom filter.
  public TokenStream create(TokenStream input) {
    return new MyFilter(input);
  }
}

As of version 3.1.5, IK fully supports Solr tokenization, so you don't have to write this yourself; for Chinese word segmentation, IK's support for Solr works well.

What TokenizerFactories does Solr offer?

1. solr.LetterTokenizerFactory
   Creates org.apache.lucene.analysis.LetterTokenizer.
   Tokenization example:
   "I can't" ==> "I", "can", "t" (splits on non-letter characters).

2. solr.WhitespaceTokenizerFactory
   Creates org.apache.lucene.analysis.WhitespaceTokenizer, which splits the text on whitespace characters.

3. solr.LowerCaseTokenizerFactory
   Creates org.apache.lucene.analysis.LowerCaseTokenizer.
   Tokenization example:
   "I can't" ==> "i", "can", "t" (splits on non-letters and lowercases).

4. solr.StandardTokenizerFactory
   Creates org.apache.lucene.analysis.standard.StandardTokenizer.
   It recognizes token types, e.g. ACRONYM: "IBM", APOSTROPHE: "cat's", APOSTROPHE: "can't".
   Description: this tokenizer tags each token with a type so that type-sensitive filters later in the chain can act on it. Currently only StandardFilter is sensitive to the token type.

5. solr.HTMLStripWhitespaceTokenizerFactory
Strips HTML tags from the input and passes the result to WhitespaceTokenizer.
Examples (input ==> output):
my <a href="www.foo.bar">link</a> ==> my link
<?xml?><br>hello<!--comment--> ==> hello
hello<script><!-- f('<!--internal--></script>'); --></script> ==> hello
if a<b then print a; ==> if a<b then print a;
hello <td height=22 nowrap align="left" > ==> hello
HTML character entities are decoded as well, e.g. a&lt;b ==> a<b and &Omega; ==> Ω.



6. solr.HTMLStripStandardTokenizerFactory
Strips HTML tags from the input and passes the result to StandardTokenizer.

7. solr.PatternTokenizerFactory
Splits the input into tokens according to a regular expression.
Example: the input is "mice; kittens; dogs", where the items are separated by a semicolon plus one or more spaces.
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer>
     <tokenizer class="solr.PatternTokenizerFactory" pattern="; *" />
  </analyzer>
</fieldType>


What TokenFilterFactories does Solr have?

1. solr.StandardFilterFactory
Creates org.apache.lucene.analysis.standard.StandardFilter, which removes the dots in acronyms and the trailing 's from tokens. It only works on typed tokens, i.e. tokens produced by StandardTokenizer.
Example: StandardTokenizer + StandardFilter
"I.B.M. cat's can't" ==> "IBM", "cat", "can't"

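For reference, a fieldType wiring StandardTokenizer and StandardFilter together might look like the following sketch (the type name "text_std" is illustrative):

```xml
<!-- Illustrative fieldType: StandardTokenizer followed by StandardFilter -->
<fieldType name="text_std" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>
```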
6. solr.LengthFilterFactory
Removes tokens whose length is outside the configured range; the example below keeps only tokens of 2 to 5 characters:
<fieldType name="lengthfilt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="5" />
  </analyzer>
</fieldType>

7. solr.PorterStemFilterFactory
Creates org.apache.lucene.analysis.PorterStemFilter, which applies the Porter stemming algorithm to strip word suffixes: plurals become singular, third-person verb forms become first person, present participles become the simple present tense, and so on.
8. solr.EnglishPorterFilterFactory
Creates solr.EnglishPorterFilter, which stems English words; its "protected" attribute names a file of words that must not be modified.
9. solr.SnowballPorterFilterFactory
Stemming for various languages.
10. solr.WordDelimiterFilterFactory
Handles delimiters within words.
11. solr.SynonymFilterFactory
Handles synonyms.
12. solr.RemoveDuplicatesTokenFilterFactory
Removes duplicate tokens.
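To tie several of these filters together, here is a sketch of an analyzer chain using the filters above. It assumes protwords.txt and synonyms.txt files exist next to schema.xml; the type name "text_filtered" is illustrative:

```xml
<!-- Illustrative chain: whitespace tokenizing, synonyms, protected stemming, de-duping -->
<fieldType name="text_filtered" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand synonyms listed in synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <!-- stem English words, except those listed in protwords.txt -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <!-- drop duplicate tokens -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```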

XI. SOLR builds an enterprise search platform -- using Solr highlighting
1. The SolrQuery class has a setHighlight(true) method; setting it to true turns highlighting on.
2. The SolrQuery class also has these methods:
    // Highlight the following two fields, name and description
    query.addHighlightField("name");
    query.addHighlightField("description");
    // These two methods add HTML markup before and after each highlighted keyword
    query.setHighlightSimplePre("<font color=\"red\">");
    query.setHighlightSimplePost("</font>");
3. Retrieving the highlighted content:
Map<String,Map<String,List<String>>> map = response.getHighlighting();
The outer Map's key is the document ID, i.e. the uniqueKey field you defined in schema.xml.
So when doing the result-handling logic, just take things out level by level. If a highlighted value comes back empty, fall back to getFieldValue(fieldName) on the SolrDocument in the QueryResponse.
Also remember to enable the highlighting component in solrconfig.xml; see the official wiki or the comments in solrconfig.xml!
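The same highlighting settings can also be supplied as request-handler defaults in solrconfig.xml, so clients don't have to set them per query. A sketch, with an illustrative handler name and the field names from the example above:

```xml
<!-- Illustrative: highlighting defaults on a search handler -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">name,description</str>
    <str name="hl.simple.pre">&lt;font color="red"&gt;</str>
    <str name="hl.simple.post">&lt;/font&gt;</str>
  </lst>
</requestHandler>
```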

XII. SOLR builds an enterprise search platform -- Solr's search operators
1. ":" searches a specified field for a specified value, e.g. *:* returns all documents
2. "?" is a wildcard for a single arbitrary character
3. "*" is a wildcard for multiple arbitrary characters (neither * nor ? may appear at the start of a search term)
4. "~" performs a fuzzy search: roam~ finds similarly spelled words such as foam and roams; roam~0.8 returns only records with a similarity above 0.8
5. Proximity search: "jakarta apache"~10 finds "apache" and "jakarta" within 10 words of each other
6. "^" boosts relevance: when searching for jakarta apache, to make "jakarta" more relevant, append "^" and a boost value, i.e. jakarta^4 apache
7. Boolean operator AND, &&
8. Boolean operator OR, ||
9. Boolean operator NOT, !, - (an exclusion operator cannot form a query by itself)
10. "+" required operator: the term after "+" must exist in the corresponding field of the document
11. ( ) groups clauses into a subquery
12. [ ] inclusive range search, e.g. date:[200707 TO 200710] retrieves records in that period, endpoints included
13. { } exclusive range search, e.g. date:{200707 TO 200710} retrieves records in that period, endpoints excluded
14. \ escapes special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
Supplement: the Paoding tokenizer
<fieldType name="text" class="solr.TextField">
  <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"></analyzer>
</fieldType>
Note: the fieldType must not have a positionIncrementGap attribute.
The paoding-dic-home.properties file is configured as follows:

#values are "system-env" or "this";
#if value is "this", using the paoding.dic.home as dicHome if configed!
#paoding.dic.home.config-fisrt=system-env

#dictionary home (directory)
#"classpath:xxx" means dictionary home is in classpath.
#e.g. "classpath:dic" means dictionaries are in "classes/dic" directory or any other classpath directory
#paoding.dic.home=dic

#seconds for dic modification detection
#paoding.dic.detector.interval=60
paoding.dic.home=C://solr-tomcat//solr//dic
Alternatively, set the environment variable paoding.dic.home.
Then, in schema.xml, configure the type of your field to be the "text" type defined above.
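For instance, a field declaration using that "text" type might look like this (the field name "content" is illustrative):

```xml
<!-- Illustrative: a field indexed and stored with the Paoding-analyzed "text" type -->
<field name="content" type="text" indexed="true" stored="true"/>
```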
