Spring Cloud Learning Route (12) - Distributed Search Elasticsearch: Data Aggregation, Auto-Completion, Data Synchronization

1. Data Aggregation

Aggregations: statistics and analysis computed over document data.

(1) Common types of aggregation

  • Bucket aggregation: groups documents into buckets.
    • terms aggregation: group by a field's value
    • date histogram: group by date interval, e.g. one bucket per week or per month
  • Metric aggregation: computes values over documents, such as the maximum, minimum, or average.
    • avg: average
    • max: maximum value
    • min: minimum value
    • stats: min, max, avg, sum, etc. computed together
  • Pipeline aggregation: aggregates over the results of other aggregations.

Field types that can participate in aggregations:

  • keyword
  • numeric
  • date
  • boolean
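Conceptually, a bucket aggregation groups documents by a field value, and a metric aggregation computes statistics per group. A minimal plain-Java sketch of the idea (an illustration only, not how ES computes aggregations internally; the `Hotel` class and sample data are made up):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java illustration of what bucket and metric aggregations compute.
public class AggregationSketch {
    static final class Hotel {
        final String brand;
        final double score;
        Hotel(String brand, double score) { this.brand = brand; this.score = score; }
    }

    // Bucket (terms) aggregation: group documents by brand and count each
    // bucket, ordered by document count descending (ES's default _count order).
    static LinkedHashMap<String, Long> termsAgg(List<Hotel> docs) {
        return docs.stream()
                .collect(Collectors.groupingBy(h -> h.brand, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    // Metric (stats) aggregation: per-brand min/max/avg/sum over the score field.
    static Map<String, DoubleSummaryStatistics> statsAgg(List<Hotel> docs) {
        return docs.stream().collect(Collectors.groupingBy(h -> h.brand,
                Collectors.summarizingDouble(h -> h.score)));
    }

    public static void main(String[] args) {
        List<Hotel> docs = List.of(
                new Hotel("Hilton", 4.0), new Hotel("Hilton", 5.0),
                new Hotel("Marriott", 4.8));
        System.out.println(termsAgg(docs));                            // {Hilton=2, Marriott=1}
        System.out.println(statsAgg(docs).get("Hilton").getAverage()); // 4.5
    }
}
```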

(2) Implementing aggregations with DSL

1. Bucket aggregation

To count the number of hotels under each brand across the whole dataset, we can aggregate on the hotel brand field.

(1) Basic implementation

GET /hotel/_search
{
	"size": 0,	// size 0: the response contains only the aggregation, no documents
	"aggs": {	// define the aggregation
		"brandAgg": {	// give the aggregation a name
			"terms": {	// aggregation type: terms, grouping by the brand value
				"field": "brand",	// field to aggregate on
				"size": 20	// number of aggregation results to return
			}
		}
	}
}

(2) Bucket aggregation result sorting

By default, a bucket aggregation counts the documents in each bucket, records the count as _count, and sorts the buckets by _count in descending order.

To change this ordering, add an order clause:

GET /hotel/_search
{
	"size": 0,	// size 0: the response contains only the aggregation, no documents
	"aggs": {	// define the aggregation
		"brandAgg": {	// give the aggregation a name
			"terms": {	// aggregation type: terms, grouping by the brand value
				"field": "brand",	// field to aggregate on
				"order": {	// sort order
					"_count": "asc"	// ascending by document count
				},
				"size": 20	// number of aggregation results to return
			}
		}
	}
}

(3) Limit the scope of aggregation

By default, a bucket aggregation covers all documents in the index; adding a query clause limits the set of documents being aggregated.

GET /hotel/_search
{
	"query": {
		"range": {
			"price": {
				"lte": 200	// only aggregate documents priced at 200 or less
			}
		}
	},
	"size": 0,	// size 0: the response contains only the aggregation, no documents
	"aggs": {	// define the aggregation
		"brandAgg": {	// give the aggregation a name
			"terms": {	// aggregation type: terms, grouping by the brand value
				"field": "brand",	// field to aggregate on
				"order": {	// sort order
					"_count": "asc"
				},
				"size": 20	// number of aggregation results to return
			}
		}
	}
}

2. Metrics aggregation

Requirement: obtain the min, max, avg, and other statistics of the user ratings for each brand.

GET /hotel/_search
{
	"size": 0,
	"aggs": {	// define the aggregation
		"brandAgg": {	// give the aggregation a name
			"terms": {	// aggregation type: terms, grouping by the brand value
				"field": "brand",
				"order": {	// sort brands by their average score, descending
					"score_stats.avg": "desc"
				},
				"size": 20
			},
			"aggs": {	// sub-aggregation of brandAgg, computed for each bucket
				"score_stats": {	// sub-aggregation name
					"stats": {	// stats computes min, max, avg, and sum at once
						"field": "score"	// field to aggregate on
					}
				}
			}
		}
	}
}

(3) Implementing aggregations with RestClient

1. Bucket aggregation

// 1. Create the request object
SearchRequest request = new SearchRequest("hotel");

// 2. Assemble the DSL
request.source().size(0);
request.source().aggregation(
	AggregationBuilders.terms("brand_agg").field("brand").size(20)
);

// 3. Send the request
SearchResponse response = client.search(request, RequestOptions.DEFAULT);

// 4. Parse the response
Aggregations aggregations = response.getAggregations();

// 5. Get the aggregation result by name
Terms brandTerms = aggregations.get("brand_agg");

// 6. Get the buckets
List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();

// 7. Iterate over the buckets
for (Terms.Bucket bucket : buckets) {
	// get the brand name
	String brandName = bucket.getKeyAsString();
}


2. Auto-completion

(1) Pinyin tokenizer

1. Install the pinyin analyzer plugin offline

2. Restart ES

(2) Custom analyzer

1. Problems with using the pinyin tokenizer directly:

  • it does not segment the text into words; it only outputs pinyin
  • every single character produces its own pinyin
  • the original Chinese characters are not kept

2. The components of an analyzer

  • character filters: process the text before the tokenizer, e.g. deleting or replacing characters
  • tokenizer: cuts the text into terms according to certain rules, e.g. keyword, ik_smart
  • token filters: further process the terms output by the tokenizer, e.g. case conversion, synonym handling, pinyin conversion
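The three-stage pipeline can be illustrated with a toy analyzer in plain Java (the tag-stripping, whitespace-splitting, and lowercasing steps are simplistic stand-ins for real ES components):

```java
import java.util.*;
import java.util.stream.*;

public class ToyAnalyzer {
    // character filter: processes raw text before tokenization (here: strip HTML-like tags)
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    // tokenizer: cuts the text into terms (here: split on whitespace)
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // token filter: post-processes the terms (here: lowercase them)
    static List<String> tokenFilter(List<String> terms) {
        return terms.stream().map(String::toLowerCase).collect(Collectors.toList());
    }

    // full pipeline: character filter -> tokenizer -> token filter
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("<b>Hello</b> Spring Cloud")); // [hello, spring, cloud]
    }
}
```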

3. Custom analyzer configuration

When creating an index, configure the custom analyzer through the index settings:

PUT /test
{
	"settings": {	# index settings
		"analysis": {	# analysis configuration
			"analyzer": {	# custom analyzers
				"my_analyzer": {	# analyzer name; applied in character filter > tokenizer > token filter order
					"tokenizer": "ik_max_word",
					"filter": "pinyin"
				}
			}
		}
	}
}

Extending the custom analyzer with a configured token filter:

PUT /test
{
	"settings": {	# index settings
		"analysis": {	# analysis configuration
			"analyzer": {	# custom analyzers
				"my_analyzer": {	# analyzer name; applied in character filter > tokenizer > token filter order
					"tokenizer": "ik_max_word",
					"filter": "py"
				}
			},
			"filter": {	# custom token filters
				"py": {	# filter name
					"type": "pinyin",	# filter type: pinyin
					"keep_full_pinyin": false,	# whether to emit pinyin for each single character
					"keep_joined_full_pinyin": true,	# whether to emit the joined full pinyin
					"keep_original": true,	# whether to keep the original Chinese terms
					"limit_first_letter_length": 16,
					"remove_duplicated_term": true,
					"none_chinese_pinyin_tokenize": false
				}
			}
		}
	}
}

A remaining problem with the pinyin analyzer:

If two words share the same pinyin but have different meanings, a search for one of them will also match its homophone, which is clearly wrong.

So the pinyin analyzer should be used when building the index, while a Chinese analyzer should be used when searching it:

"mappings":	{
	"properties": {
		"name": {
			"type": "text",
			"analyzer": "my_analyzer",	# custom (pinyin) analyzer used at index time
			"search_analyzer": "ik_smart"	# Chinese analyzer used at search time
		}
	}
}
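The index-time vs search-time analyzer split can be sketched with a toy inverted index in plain Java. Here the pinyin romanizations are supplied by hand; in ES they would come from the pinyin token filter:

```java
import java.util.*;

public class AnalyzerChoice {
    // inverted index: term -> ids of the documents containing it
    static final Map<String, Set<Integer>> index = new HashMap<>();

    // index-time "analyzer": store both the original word and its pinyin
    // (romanizations are hand-supplied in this sketch)
    static void addDoc(int id, String word, String pinyin) {
        for (String term : List.of(word, pinyin))
            index.computeIfAbsent(term, t -> new HashSet<>()).add(id);
    }

    // search-time "analyzer": keep the query as-is, with no pinyin expansion
    // (this is the role ik_smart plays as the search_analyzer)
    static Set<Integer> search(String query) {
        return index.getOrDefault(query, Set.of());
    }

    public static void main(String[] args) {
        addDoc(1, "狮子", "shizi"); // "lion"
        addDoc(2, "虱子", "shizi"); // "louse", a homophone of "lion"
        System.out.println(search("狮子"));  // [1] -> only the exact word matches
        System.out.println(search("shizi")); // typing pinyin still matches both docs
    }
}
```

Because the search analyzer never produces pinyin terms, the homophone document is no longer hit when the user types Chinese.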

(3) Auto-complete query

ES provides the Completion Suggester query to implement auto-completion.
This query matches and returns terms that begin with the user's input.
To keep completion queries efficient, there are constraints on the field types in the document:

  • fields participating in a completion query must be of the completion type
  • the field content is generally an array holding the multiple entries to complete
# create the index
PUT test
{
	"mappings": {
		"properties": {
			"title": {
				"type": "completion"
			}
		}
	}
}

#	sample data
POST test/_doc
{
	"title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
	"title": ["SK-II", "PITERA"]
}
POST test/_doc
{
	"title": ["Nintendo", "switch"]
}

Query example:

GET /test/_search
{
	"suggest": {
		"title_suggest": {
			"text": "s",	# the keyword typed by the user
			"completion": {
				"field": "title",	# field to complete on
				"skip_duplicates": true,	# skip duplicate suggestions
				"size": 10	# return the top 10 results
			}
		}
	}
}
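A plain-Java sketch of what this query does with the sample data above: a case-insensitive prefix match over the title entries, with skip_duplicates and size applied (an illustration only, not ES's internal FST-based implementation):

```java
import java.util.*;
import java.util.stream.*;

public class CompletionSketch {
    // Each document's "title" field is an array of completion entries.
    static final List<List<String>> docs = List.of(
            List.of("Sony", "WH-1000XM3"),
            List.of("SK-II", "PITERA"),
            List.of("Nintendo", "switch"));

    // Completion-suggester-style query: entries starting with the prefix,
    // case-insensitively, de-duplicated, capped at `size` results.
    static List<String> suggest(String prefix, int size) {
        return docs.stream()
                .flatMap(List::stream)
                .filter(t -> t.toLowerCase().startsWith(prefix.toLowerCase()))
                .distinct()
                .limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(suggest("s", 10)); // [Sony, SK-II, switch]
    }
}
```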

(4) Implementing auto-completion with RestClient

// 1. Prepare the request
SearchRequest request = new SearchRequest("hotel");

// 2. Request parameters
request.source().suggest(new SuggestBuilder().addSuggestion(
	"mySuggestion",
	SuggestBuilders
		.completionSuggestion("title")
		.prefix("h")
		.skipDuplicates(true)
		.size(10)
));

// 3. Send the request
SearchResponse response = client.search(request, RequestOptions.DEFAULT);

// 4. Parse the response
Suggest suggest = response.getSuggest();

// 5. Get the completion result by the suggestion name
CompletionSuggestion suggestion = suggest.getSuggestion("mySuggestion");

// 6. Get the options and iterate over them
for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
	// get the option's text
	String text = option.getText().string();
}

3. Data synchronization

ES's data comes from the database, so when the database data changes, ES must change with it. This is data synchronization between ES and the database.

In a microservice architecture, the business that writes data and the business that searches it may live in two different services. How do we keep their data in sync?

(1) Data synchronization idea

Method 1: Synchronous call

New data → data management service (writes directly to the database) → calls the update-index interface → data search service (updates ES)

  • Advantages: simple and straightforward to implement
  • Disadvantages: data coupling, business coupling, degraded performance

Method 2: Asynchronous notification (the most recommended method at this stage)

New data → data management service (writes to the database and sends a message to MQ) → MQ (subscribed to by the search service) → data search service (updates ES)

  • Advantages: low coupling; implementation difficulty is moderate
  • Disadvantages: depends on the reliability of MQ

Method 3: Monitor binlog

New data → data management service (writes directly to the MySQL database) → MySQL binlog → canal (middleware that monitors the binlog and notifies the search service of data changes) → data search service (updates ES)

  • Advantages: completely decouples the services
  • Disadvantages: enabling binlog adds load to the database, and the implementation is complex

(2) Implementing ES-database data synchronization

We use asynchronous notification to synchronize the data.

Steps:

  • Declare the exchange, queue, and RoutingKey
  • Send a message to MQ from the add, delete, and update business logic in the admin service
  • Listen for the messages in the search service and update the ES data accordingly
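The steps above can be sketched as a minimal in-memory flow, with a BlockingQueue standing in for MQ and a Map standing in for the ES index (method names like saveHotel are hypothetical):

```java
import java.util.*;
import java.util.concurrent.*;

public class SyncSketch {
    // the "MQ": the admin service publishes change events, the search service consumes them
    static final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();
    // the "ES index", kept in sync by the consumer
    static final Map<String, String> esIndex = new ConcurrentHashMap<>();

    // admin side: write to the database (omitted here), then send a message to MQ
    static void saveHotel(String id, String name) {
        queue.add(new String[]{"save", id, name});
    }

    static void deleteHotel(String id) {
        queue.add(new String[]{"delete", id, null});
    }

    // search side: consume one message and apply it to the index
    static void consumeOne() {
        String[] msg = queue.poll();
        if (msg == null) return;
        if (msg[0].equals("save")) esIndex.put(msg[1], msg[2]);
        else esIndex.remove(msg[1]);
    }

    public static void main(String[] args) {
        saveHotel("1", "Hilton Shanghai");
        consumeOne();
        System.out.println(esIndex); // {1=Hilton Shanghai}
        deleteHotel("1");
        consumeOne();
        System.out.println(esIndex); // {}
    }
}
```

The admin service never touches ES directly; only the consumer does, which is the low-coupling property the asynchronous approach buys.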

Origin blog.csdn.net/Zain_horse/article/details/131925655