Solr Text Analysis [text analysis, tokenizers in detail, and custom text-analysis field types]

I. Overview

  The goal of Solr text analysis is to eliminate surface differences between the terms in the index and the terms in a user's search, so that, for example, a user searching for "buying a new house" can find a document containing "purchasing a new home". When configured properly, text analysis lets users search in natural language without having to consider every possible form of their search terms. After all, nobody wants to construct a query expression like: buying house OR purchase home OR buying a home OR purchasing a house ....

  Letting users find the information they need with natural language is the foundation of a good search experience. Given the widespread use of search engines such as Google and Baidu, users have come to expect search to be very smart, and an intelligent search engine starts with excellent text analysis. Text analysis does more than eliminate surface differences between terms; it is also used to solve harder problems such as language-specific parsing, part-of-speech tagging, and lemmatization.

  Solr includes an extensible text analysis framework that can remove common stop words [such as "is"] and perform more complex text analysis tasks. The framework is powerful, flexible, and extensible, but for beginners it can seem too complex and intimidating. It cuts both ways: Solr can solve very complex text analysis problems, but it can also make an originally simple analysis task feel troublesome. To help with this, Solr presets a number of field types in schema.xml, so that beginners get out-of-the-box text analysis and can start using it more easily.

II. A microblogging case study

  Suppose we want to design and implement a content-search solution for a public social-media microblogging site [such as Weibo or Twitter]. Because we care about the microblog content, we will focus our text analysis on the text field of microblog documents and look at what text analysis does to it.

  For example, suppose one microblog post to be indexed reads: "Elasticsearch and Solr are highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites."

  As mentioned above, the main objective of text analysis is to let users search in natural language without worrying about all possible forms of their terms. When a user submits the natural-language query "solr index", they expect to find the document above, which contains "Solr" and "indexing"; an exact match against the raw text would fail. The following is the result after tokenization:

  

  Therefore, the task before us is to use Solr's text analysis framework to convert microblog text into an easily searchable form.

III. Basic text analysis

  The <types> section of schema.xml uses <fieldType> elements to define all the possible field types for a document. Each <fieldType> defines the format of a field and how the field is analyzed at index time and at query time. The tokenization above used Solr's built-in text_general field type, a relatively simple type, defined as follows:

    <!-- A general text field that has reasonable, generic cross-language defaults:
         it tokenizes with StandardTokenizer, removes stop words from
         case-insensitive "stopwords.txt" (empty by default), and down cases.
         At query time only, it also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

  1. Analyzer

    A <fieldType> element defines at least one <analyzer>, which determines how the text is analyzed. A common practice is to define two separate <analyzer> elements: one for analyzing text being indexed, and one for analyzing the text a user enters when searching [as in text_general above]. Why two analyzers instead of one shared one? Query-time processing often requires additional analysis that index-time processing does not. For example, query analysis usually adds synonyms, while index analysis does not, because indexing synonyms would increase index size; synonym handling is therefore generally placed in the query analyzer.

    Although two separate analyzers are defined, the way the query analyzer processes terms must remain compatible with the index analyzer [for example, both must use the same tokenizer; if the same text is tokenized differently at index time and query time, queries will fail to match].

  2. Tokenizers

    In Solr, the text analysis performed by each <analyzer> is divided into two phases: tokenization and token filtering. Strictly speaking there is a third, pre-tokenization phase, in which character filters may be applied. In the tokenization phase, the text is split into a stream of tokens. WhitespaceTokenizer is the most basic tokenizer, splitting text only on whitespace. StandardTokenizer is more common: it splits tokens on whitespace and punctuation, and also handles URLs, e-mail addresses, and acronyms. A tokenizer is defined by specifying its Java factory class; to use the common StandardTokenizer, specify the class solr.StandardTokenizerFactory, as in the text_general field type above.

    In Solr, most tokenizers take constructor parameters, so the factory class must be specified rather than the underlying tokenizer implementation class. Using factory classes gives Solr a standard way of defining tokenizers in XML. Behind the scenes, each factory class knows how to translate its XML configuration attributes into a constructed instance of the particular tokenizer implementation class. Every tokenizer produces a token stream, which filters can then process to transform the tokens.
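    To build intuition, here is a small Python sketch (not Solr or Lucene code) approximating the difference between whitespace-only splitting and StandardTokenizer-style splitting. The regex is an illustrative stand-in, not the real tokenizer's grammar:

```python
import re

def whitespace_tokenize(text):
    # Like WhitespaceTokenizer: split only on runs of whitespace,
    # so punctuation stays attached to the tokens.
    return text.split()

def standard_tokenize(text):
    # Rough stand-in for StandardTokenizer: also split on punctuation.
    # (The real tokenizer additionally special-cases URLs, e-mails, acronyms.)
    return [t for t in re.split(r"[^\w]+", text) if t]

print(whitespace_tokenize("Solr is reliable, fault-tolerant."))
# ['Solr', 'is', 'reliable,', 'fault-tolerant.']
print(standard_tokenize("Solr is reliable, fault-tolerant."))
# ['Solr', 'is', 'reliable', 'fault', 'tolerant']
```

    Note how the whitespace version leaves "reliable," and "fault-tolerant." intact, which would prevent an exact match on "reliable" at query time.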

  3. Token filters

    A token filter performs one of three operations on the token stream:

      1. Token transformation

        Changes the form of a token, for example lowercasing or stemming.

      2. Token injection

        Adds tokens to the stream, as a synonym filter does.

      3. Token removal

        Deletes unneeded tokens, as a stop-word filter does.

      Filters can be chained, applying a series of transformations to the tokens. Filter order matters: the filter listed first runs first.
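    The operations above, and why order matters, can be sketched as plain Python functions over a token list (an illustrative simulation, not the Lucene API; injection works analogously by appending extra tokens):

```python
def lowercase(tokens):                    # 1. transformation: change token form
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stops):      # 3. removal: drop unneeded tokens
    return [t for t in tokens if t not in stops]

tokens = ["Solr", "IS", "Fast"]
# Lowercasing first means the stop list only needs the form "is":
print(remove_stopwords(lowercase(tokens), {"is"}))   # ['solr', 'fast']
# Reversed order misses the capitalized stop word:
print(lowercase(remove_stopwords(tokens, {"is"})))   # ['solr', 'is', 'fast']
```

    This is why the text_general chain puts StopFilterFactory with ignoreCase="true" before the lowercase filter: it tolerates either ordering.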

  4. StandardTokenizer

    <tokenizer class="solr.StandardTokenizerFactory"/>

    The first step of text analysis is deciding how the tokenizer will parse the text into a stream of tokens. Let's start with StandardTokenizer, the preferred choice of many Solr and Lucene projects; it splits text on whitespace and punctuation. Let's look at what this tokenizer does through an example:

    

    NOTE: ST stands for the StandardTokenizer, SF for the StopFilterFactory stop-word filter, and LCF for the LowerCaseFilterFactory lowercasing filter.

  5. StopFilterFactory

    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />

    During text analysis, the stop filter removes stop words from the token stream; such words are of little value in helping users locate relevant documents. Removing stop words from the index effectively reduces index size and improves search performance: Solr has less document data to process, and when a query contains stop words, there are fewer terms to include in the relevance calculation.

    The stop filter above specifies a list of English stop words [words="stopwords.txt"]. Out of the box, Solr provides a basic stop-word list, which we can customize to our requirements. In general, stop-word removal is language-specific: different languages have different common stop words. If you analyze German, the stop-word list needs tokens such as die and ein. Solr provides customizable stop-word list files for many languages, located in each Solr core's conf/lang/ directory.

  

  Note: in newer versions of Solr, the stop-word files under the lang/ directory are not wired into the default configuration; instead, the stopwords.txt in the conf directory is used directly, and by default it contains no stop words!

  6. LowerCaseFilterFactory

    LCF converts all letters in each token to lowercase, so that differences in case do not interfere with indexing and searching.

    <filter class="solr.LowerCaseFilterFactory"/>

    As with stop words, whether to lowercase every term is sometimes a hard call. For example, a capitalized word in the middle of a sentence usually indicates a proper noun. For users who search precisely, keeping proper nouns in their original form improves result accuracy; but for users who habitually type all-lowercase queries, that is unfriendly. So whether to apply lowercasing depends on whether you value precision or convenience.
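    Putting the pieces together, the following hedged Python simulation of the text_general index-time chain (StandardTokenizer-like split, case-insensitive stop removal, lowercasing) shows why a query for "solr" matches the sample post. The stop list here is an assumed sample; the default stopwords.txt may be empty:

```python
import re

STOPWORDS = {"is", "and", "the", "of"}  # assumed sample list for illustration

def analyze(text):
    tokens = [t for t in re.split(r"[^\w]+", text) if t]        # ~StandardTokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # StopFilter (ignoreCase)
    return [t.lower() for t in tokens]                          # LowerCaseFilter

print(analyze("Solr is highly reliable, scalable and fault tolerant"))
# ['solr', 'highly', 'reliable', 'scalable', 'fault', 'tolerant']
```

    Because the same chain runs on the query text, the query term "solr" now matches the indexed token "solr" exactly.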

IV. Custom text analysis fields

  Because Solr's predefined field types cannot meet all of our needs, we combine other built-in Solr text analysis tools to solve the remaining problems and define a new field type in the schema. The text_microblog field type is added to the <types> element of managed-schema; the code is as follows:

    <fieldType name="text_microblog" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])\1+" replacement="$1$1"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="1"
            preserveOriginal="0"
            catenateWords="1"
            generateNumberParts="1"
            catenateNumbers="0"
            catenateAll="0"
            types="wdfftypes.txt"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.KStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([a-zA-Z])\1+" replacement="$1$1"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="1"
            preserveOriginal="0"
            catenateWords="1"
            generateNumberParts="1"
            catenateNumbers="0"
            catenateAll="0"
            types="wdfftypes.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.KStemFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>

  Notes:

  

  1. PatternReplaceCharFilterFactory

    In Solr, character filters pre-process the incoming character stream before tokenization. Like token filters, CharFilters form part of the analysis chain and can add, modify, or remove characters from the text. Solr's three commonly used CharFilters are:

    1.1 solr.MappingCharFilterFactory: replaces characters based on an external configuration file.

    1.2 solr.PatternReplaceCharFilterFactory: replaces characters using a regular expression.

    1.3 solr.HTMLStripCharFilterFactory: strips HTML markup from the text.
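    The pattern used in the text_microblog type above, ([a-zA-Z])\1+ with replacement $1$1, collapses any run of a repeated letter down to exactly two occurrences, normalizing exaggerated spellings before tokenization. The same rule in Python's regex syntax (a simulation of the char filter, not Solr code):

```python
import re

def collapse_repeats(text):
    # Equivalent of PatternReplaceCharFilter: ([a-zA-Z])\1+ -> $1$1
    # Any run of two or more of the same letter becomes exactly two,
    # so "yummm" and "yummmmm" both normalize to "yumm".
    return re.sub(r"([a-zA-Z])\1+", r"\1\1", text)

print(collapse_repeats("soooo goooood"))  # 'soo good'
```

    Collapsing to two letters (rather than one) keeps legitimate double letters such as the "oo" in "good" intact.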

  2. WordDelimiterFilterFactory

    This filter applies a variety of parsing rules at a higher level to split tokens into subwords. As the custom field type defined above shows, the filter is configured with a wdfftypes.txt file. You must create this file yourself and fill it in according to your data; here its contents are:

    

    These settings map characters such as - to the ALPHA class, which means WordDelimiterFilter instances will not treat them as delimiters. The effect is as follows:

    

    The configuration options of WordDelimiterFilterFactory are explained below:

    
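    As a very rough sketch of what generateWordParts="1" plus catenateWords="1" produce for a delimited token (ignoring token positions and the other options; this is an illustration, not the Lucene implementation):

```python
import re

def word_delimiter(token):
    # Split on non-alphanumeric delimiters (generateWordParts=1) and,
    # if the token actually split, also emit the concatenation of the
    # parts (catenateWords=1).
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = list(parts)
    if len(parts) > 1:
        out.append("".join(parts))
    return out

print(word_delimiter("fault-tolerant"))  # ['fault', 'tolerant', 'faulttolerant']
```

    Emitting both the parts and the concatenation lets queries for "fault", "tolerant", or "faulttolerant" all match.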

  3. ASCIIFoldingFilterFactory: removing diacritics

    <filter class="solr.ASCIIFoldingFilterFactory"/>. It is best to apply this filter after the lowercase filter, so that only lowercase characters need to be handled. ASCIIFoldingFilter only works for Latin characters; for other scripts, use solr.ICUFoldingFilterFactory, a factory available since Solr 3.1.
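    The core idea of diacritic folding can be approximated in Python with Unicode decomposition (the real filter maps many more characters via explicit tables; this is only a sketch):

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters (é -> e + combining acute),
    # then drop the combining marks, approximating ASCIIFoldingFilter
    # for common Latin characters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("café naïve"))  # 'cafe naive'
```

    After folding, a query typed without accents still matches accented terms in the index.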

  4. KStemFilterFactory: stemming

    Stemming converts words to a base form according to language-specific rules. Solr provides many stemming filters, each with its own strengths and weaknesses. Compared with other popular stemmers such as PorterStemmer, KStem is less aggressive in its transformations. It is used here to remove the "ing" from terms such as "indexing" and "querying".
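    As a toy illustration only of the kind of normalization stemming provides (this is *not* the KStem algorithm, which is dictionary-based and avoids mangling words like "ring" or "thing"):

```python
def crude_stem(word):
    # Strip a trailing "ing" from sufficiently long words; a real stemmer
    # consults a dictionary and applies many more rules.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

print([crude_stem(w) for w in ["indexing", "querying", "thing"]])
# ['index', 'query', 'thing']
```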

  5. SynonymFilterFactory: synonyms

    In most cases, synonyms are added only during query-time analysis. This helps keep the index smaller and makes maintaining changes to the synonym list easier. Also consider the filter's position in the chain: it is usually most useful as the last filter of the query analyzer, so that the synonym list can assume all other token transformations have already been applied.
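    A minimal sketch of query-time synonym injection as the last step of the query chain (the synonym entries below are made-up examples; Solr reads them from synonyms.txt):

```python
SYNONYMS = {"buy": ["purchase"], "home": ["house"]}  # hypothetical entries

def expand_synonyms(tokens):
    # Applied last in the query analyzer, after lowercasing and stemming,
    # so the map only needs to list normalized forms.
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

print(expand_synonyms(["buy", "home"]))  # ['buy', 'purchase', 'home', 'house']
```

    Because only the query side is expanded, the index stays small while "buy home" still matches documents containing "purchase house".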

V. Advanced text analysis [custom tokenizers]

  1. Advanced field attributes

    

    

  2. Extending text analysis with Solr plugins

    2.1 Custom TokenFilter class

      

 

    2.2 Custom TokenFilterFactory class

      

    

    Note: to use a custom token filter, you must package the corresponding code into a jar, add it to Solr, and register the configuration in the solrconfig.xml of the corresponding core!


Origin www.cnblogs.com/yszd/p/12129952.html