Chapter 7: Text Analyzer & Tokenizer

1. Introduction

The analyzer is a core component of Solr: it plays the central role both in building the index and in finding the desired results at query time. Below is an example of a fieldType shipped with Solr whose definition contains analyzers.

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.EnglishPossessiveFilterFactory"/>
		<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
		<filter class="solr.PorterStemFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
		<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.EnglishPossessiveFilterFactory"/>
		<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
		<filter class="solr.PorterStemFilterFactory"/>
	</analyzer>
</fieldType>


1. Analyzer

The analyzer tells Solr how to process document content when indexing and searching. Its type attribute is usually index or query.


2. Tokenizer

The tokenizer determines how a complete piece of content is split into separate words or phrases. Solr defines many text field types with ready-made analyzers, but Chinese has to be handled by a tokenizer you configure yourself; there are several excellent Chinese tokenizers to choose from (mmseg4j, IK, etc.).


3. Filter

A filter post-processes the token stream produced by the tokenizer, for example converting all tokens to lowercase.
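To make the filter chain concrete, here is roughly how the index analyzer above would transform the text "The Dog's Owners" (a sketch of the expected behavior, not output captured from Solr; KeywordMarkerFilterFactory is omitted since it only marks protected words so the stemmer skips them):

StandardTokenizer:        The | Dog's | Owners
StopFilter:               Dog's | Owners        ("the" is a stop word)
LowerCaseFilter:          dog's | owners
EnglishPossessiveFilter:  dog | owners
PorterStemFilter:         dog | owner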


4. Why Tokenize?

Why do we need word segmentation at all? Suppose a field contains the text "James: NBA superstar, first-round No. 1 pick, nicknamed Little Emperor" and its field type defines no tokenizer. If we query "No. 1 pick", this document is not found. We could of course add wildcards, but what if we also want to highlight the query terms in the result?

If instead we add a tokenizer to the corresponding field's type, the content is split appropriately when the index is built, and the query string is tokenized the same way at query time. Then even a query like "小皇帝状元秀" ("Little Emperor No. 1 pick") returns the results we want.
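For example, once the field type tokenizes its content, matching and highlighting can be combined in a single request (a sketch; the host, port, and core name follow the examples later in this chapter, and hl/hl.fl are Solr's standard highlighting parameters):

http://localhost:10001/solr/core_test/select?q=short_desc:状元秀&hl=true&hl.fl=short_desc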


2. Analyzer Configuration (Chinese Word Segmentation with mmseg4j)

The Solr version used here is 6.5.0, and the mmseg4j version is 2.4.0.

1. Common Chinese Tokenizers

Paoding: supports an unlimited number of user-defined dictionaries in plain-text format, one word per line; a background thread detects dictionary updates, automatically compiles an updated dictionary to a binary version, and loads it.

mmseg4j: ships with the Sogou dictionary and supports user-defined dictionaries named wordsxxx.dic (UTF-8 text, one word per line). Automatic update detection is not supported; the dictionary path can be set with the -Dmmseg.dic.path JVM property (see the example after this list).

IK: supports loading user dictionaries at the API level and specifying dictionary files via configuration. Files must be UTF-8 encoded without a BOM, with entries separated by \r\n. Automatic update detection is not supported.
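As an example, mmseg4j's dictionary directory can be passed as a JVM system property when starting Solr (a sketch; the path is a placeholder, and -a is the bin/solr option for extra JVM parameters):

bin/solr start -p 10001 -a "-Dmmseg.dic.path=/path/to/dic"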


2. mmseg4j Tokenizer Configuration

Copy the downloaded mmseg4j JARs into Solr's WEB-INF/lib directory, then add the following to the schema configuration file:

<fieldType name="text_mmseg4j_Complex" class="solr.TextField">  
	<analyzer>
		<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="dic"/>
	</analyzer>
</fieldType>


Now redefine the short_desc field of the previously created core_test core as type text_mmseg4j_Complex. A minimal sketch of the field definition (the indexed/stored attributes are assumptions based on the earlier chapters):
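<field name="short_desc" type="text_mmseg4j_Complex" indexed="true" stored="true"/>

After rebuilding the index, a query for "状元秀小皇帝" ("No. 1 pick Little Emperor") returns only one record: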

{
	"responseHeader" : {
		"status" : 0,
		"QTime" : 2,
		"params" : {
			"q" : "short_desc:状元秀小皇帝",
			"indent" : "on",
			"wt" : "json",
			"_" : "1493135606415"
		}
	},
	"response" : {
		"numFound" : 1,
		"start" : 0,
		"docs" : [{
					"name" : "LeBron James",
					"id" : "12b3e4b030314f55a89006769b8b743a",
					"short_desc" : "NBA巨星,首轮状元秀,克里夫兰骑士队当家球星,3届总冠军得主,场上司职小前锋,绰号“小皇帝”",
					"age" : 32,
					"tags" : ["小皇帝", "吾皇"],
					"_version_" : 1565666158942617600
				}]
	}
}


Solr's built-in Analysis page also shows that this tokenization matches the natural meaning much better than the original standard tokenizer, which split the text into individual Chinese characters and ignored the semantics:

// mmseg4j is based on forward maximum matching; tokenization result:
MMST  状元  秀  小  皇帝


3. Differences Between the Three mmseg4j Modes

mmseg4j offers three segmentation methods: Simple, Complex, and max-word. The first two are based on forward maximum matching; the third is a maximal-tokenization method built on top of the Complex algorithm.

1) Forward maximum matching

The string is split into candidate segments of bounded length, and each candidate is matched against the dictionary. If the match succeeds, processing moves on to the next round until the whole string has been consumed; otherwise one character is dropped from the end of the candidate and matching is retried, over and over.
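A minimal sketch of forward maximum matching in Java (for illustration only, not mmseg4j's actual implementation; the dictionary contents and maximum word length are assumptions):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
	// Greedily take the longest dictionary word starting at the current position;
	// if no candidate matches, fall back to emitting a single character.
	public static List<String> segment(String text, Set<String> dict, int maxLen) {
		List<String> tokens = new ArrayList<>();
		int pos = 0;
		while (pos < text.length()) {
			int end = Math.min(pos + maxLen, text.length());
			// shrink the candidate from the right until it matches a dictionary word
			while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
				end--;
			}
			tokens.add(text.substring(pos, end));
			pos = end;
		}
		return tokens;
	}

	public static void main(String[] args) {
		Set<String> dict = new HashSet<>(Arrays.asList("阿根廷", "足球", "足球巨星", "巨星"));
		// prints [阿根廷, 足球巨星], matching the Simple/Complex result shown below
		System.out.println(segment("阿根廷足球巨星", dict, 4));
	}
}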

2) Tokenization examples for the three modes

Original string: 阿根廷足球巨星 ("Argentine football superstar")

Simple and Complex result: 阿根廷  足球巨星

Max-Word result: 阿  根  廷  足球  巨星
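To use one of the other modes, change the tokenizer's mode attribute, e.g. for max-word (a sketch following the fieldType defined earlier; "max-word" is the mode name accepted by MMSegTokenizerFactory):

<fieldType name="text_mmseg4j_MaxWord" class="solr.TextField">
	<analyzer>
		<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="dic"/>
	</analyzer>
</fieldType>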


3. Custom Dictionaries

You may have noticed that the tokenizer above was configured with a dicPath attribute (dic). It points to the custom dictionary directory, as either an absolute or a relative path; a relative path resolves to "{solr.home}/core_name/<dictionary directory>". Dictionary file names must start with "words" and end with ".dic", e.g. words-myown.dic.
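For example, a custom dictionary file could look like this (a sketch; the file name and entries are placeholders):

{solr.home}/core_test/dic/words-myown.dic  (UTF-8, one word per line)
状元秀
小皇帝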

Note that if your dictionary file is itself named words.dic, it will override the predefined dictionary file bundled in the corresponding core jar (mmseg4j-core.jar).


1. Dynamic Dictionary Loading

A simple piece of configuration lets mmseg4j reload its dictionaries dynamically.

First, configure a requestHandler in solrconfig.xml, where dicPath takes the same value as the dicPath attribute configured on the fieldType:

<requestHandler name="/mmseg4j" class="com.chenlb.mmseg4j.solr.MMseg4jHandler" >  
    <lst name="defaults">
        <str name="dicPath">dic</str>
    </lst>
</requestHandler>

We can then notify mmseg4j to reload its dictionaries via the URL http://localhost:10001/solr/core_test/mmseg4j?reload=true.
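For example, from the command line (a sketch; adjust host, port, and core name to your deployment):

curl "http://localhost:10001/solr/core_test/mmseg4j?reload=true"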

