Distributed search engine 03: Elasticsearch data aggregation (statistical queries, DSL & Java REST API), auto-completion, MySQL/ES data synchronization (via RabbitMQ), and clusters (setup, common problems)

Distributed search engine 03

0. Learning Objectives

1. Data aggregation

Aggregations let us perform statistics and analysis over our data with great convenience. For example:

  • Which mobile phone brand is the most popular?
  • What are the average, highest, and lowest prices of these phones?
  • How do the monthly sales of these phones look?

Implementing such statistics with aggregations is far more convenient than SQL against a database, and the queries are fast enough to give a near-real-time effect.

1.1. Types of Aggregation

There are three common categories of aggregation:

  • Bucket aggregations: group documents into buckets
    • Term aggregation: group by a field value, e.g. by brand or by country
    • Date histogram: group by date interval, e.g. one bucket per week or per month

  • Metric aggregations: compute values over documents, such as maximum, minimum, average, etc.
    • Avg: average
    • Max: maximum value
    • Min: minimum value
    • Stats: max, min, avg, sum, etc. in one go

  • Pipeline aggregations: aggregations computed from the results of other aggregations

Note: fields participating in aggregations must be of type keyword, date, numeric, or Boolean. The Date Histogram type does not reappear in this chapter; see the sketch below.
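As a quick illustration, here is a minimal Date Histogram sketch in DSL. The date field checkInDate is hypothetical: the hotel index used later has no date field, so treat the field name as an assumption.

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "checkinByMonth": {
      "date_histogram": {
        "field": "checkInDate",       // hypothetical date field
        "calendar_interval": "month"  // one bucket per month
      }
    }
  }
}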

1.2. Implementing aggregations in DSL

Now we want to count the hotel brands across all the data. In effect, we group the documents by brand, aggregating on the brand value: a Bucket aggregation.

1.2.1. Bucket aggregation syntax

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,   // size 0: return no documents, only aggregation results
  "aggs": {    // define the aggregations
    "brandAgg": {  // aggregation name, chosen freely
      "terms": {   // aggregation type: term, since we group by brand value
        "field": "brand", // the field to aggregate on
        "size": 10        // how many buckets to return
      }
    }
  }
}

We used size before in paged queries, where it meant documents per page; setting it to 0 here means: return no documents, only the aggregation results.
The size inside "terms" means something else: even if 200 brands are counted, only the first 10 buckets are returned (10 is also the default).

The result contains one bucket per brand, each carrying the brand value as its key and a doc_count.

1.2.2. Aggregation result sorting

By default, a Bucket aggregation counts the documents in each bucket (recorded as _count) and sorts the buckets by _count in descending order.

We can specify the order attribute to customize the sorting method of the aggregation:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "order": {
          "_count": "asc" // sort buckets by _count, ascending
        },
        "size": 20
      }
    }
  }
}

1.2.3. Limit the scope of aggregation

By default, a Bucket aggregation covers all documents in the index. In real scenarios, however, users enter search conditions, so the aggregation should apply only to the search results; the aggregation must be scoped. (Indexes often hold hundreds of millions of documents; aggregating them all directly is not realistic.)

We can limit the range of documents to be aggregated by adding query conditions:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "lte": 200 // aggregate only documents priced at 200 or less
      }
    }
  },
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      }
    }
  }
}

This time far fewer brands appear in the aggregation result, because only the documents matching the query were aggregated.

1.2.4. Metric aggregation syntax

Earlier we grouped the hotels by brand into buckets. Now we need to run computations on the hotels inside each bucket, obtaining the min, max, and avg of the user scores per brand.

This calls for a Metric aggregation, e.g. the stats aggregation, which yields min, max, avg, and more at once.

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": { // sub-aggregation of brandAgg: computed per bucket
        "score_stats": {     // aggregation name
          "stats": {         // stats computes min, max, avg, etc.
            "field": "score" // the field to aggregate
          }
        }
      }
    }
  }
}

Here the score_stats aggregation is a sub-aggregation nested inside brandAgg, because the computation has to run separately within each bucket.

In addition, the buckets can be sorted using the sub-aggregation's results, for example by each bucket's average hotel score:
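A sketch of that ordering: the order clause can reference the sub-aggregation's result by its path, here score_stats.avg:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "score_stats.avg": "desc" // sort buckets by the sub-aggregation's avg
        }
      },
      "aggs": {
        "score_stats": {
          "stats": { "field": "score" }
        }
      }
    }
  }
}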

1.2.5. Summary

aggs stands for aggregations and sits at the same level as query. What, then, does query do here?

  • it scopes the documents being aggregated

The three required elements of an aggregation:

  • aggregation name
  • aggregation type
  • aggregation field

Configurable aggregation properties:

  • size: the number of buckets to return
  • order: how the buckets are sorted
  • field: the field to aggregate on

1.3. Implementing aggregations with the RestAPI

1.3.1. API Syntax

Aggregation conditions are at the same level as query conditions, so request.source() needs to be used to specify aggregation conditions.

The aggregation condition is built through request.source().aggregation(...). The aggregation response differs from an ordinary query response and has its own API, but it is still parsed layer by layer, just like the JSON. Both are shown in the test below:

@Test
public void testAggregation() throws IOException {
    // 1. prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. prepare the DSL
    request.source().size(0); // return no documents, only aggregation results
    request.source().aggregation(AggregationBuilders
            .terms("brandAgg") // the name matters: it is used again, and its result type matches terms
            .field("brand")
            .size(20));
    // 3. send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. parse the result
    handleAggregationResponse(response);
}

private void handleAggregationResponse(SearchResponse response) {
    Aggregations aggregations = response.getAggregations();
    // the result type matches the aggregation type written in the query: terms, so Terms
    Terms brandTerm = aggregations.get("brandAgg");
    List<? extends Terms.Bucket> buckets = brandTerm.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        String key = bucket.getKeyAsString();
        long docCount = bucket.getDocCount();
        System.out.println(key + " " + docCount);
    }
}
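The DSL features from 1.2.2 and 1.2.4 map onto the same builder. Below is a sketch combining bucket ordering with a stats sub-aggregation, assuming the same client field as the test above (BucketOrder and the Stats result type are part of the 7.x high-level client):

@Test
public void testSubAggregation() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    request.source().size(0);
    request.source().aggregation(AggregationBuilders
            .terms("brandAgg")
            .field("brand")
            .size(20)
            // order buckets by the sub-aggregation's average score, descending
            .order(BucketOrder.aggregation("score_stats.avg", false))
            // per-bucket stats over the score field
            .subAggregation(AggregationBuilders.stats("score_stats").field("score")));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    Terms brandTerm = response.getAggregations().get("brandAgg");
    for (Terms.Bucket bucket : brandTerm.getBuckets()) {
        // every bucket carries its own sub-aggregation results
        Stats stats = bucket.getAggregations().get("score_stats");
        System.out.println(bucket.getKeyAsString() + " avg=" + stats.getAvg());
    }
}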


1.3.2. Business requirements

Requirement: the brand, city, and other filter lists on the search page should not be hard-coded in the page; they should be obtained by aggregating the hotel data in the index.

Analysis:

At present the city, star, and brand lists on the page are hard-coded and do not change with the search results. But when the user's search conditions change, the search results change with them.

For example, if a user searches for "Oriental Pearl", the matching hotels are all near the Shanghai Oriental Pearl Tower, so the only possible city is Shanghai; Beijing, Shenzhen, and Hangzhou should not appear in the city list.

In other words: whatever cities appear in the search results are the cities the page should list, and whatever brands appear in the search results are the brands the page should list.

How do we know which brands and cities the search results contain?

Bucket aggregation over the search results, grouping the documents by brand and by city, tells us exactly which brands and cities are included.

Because this aggregates the search results, it is a scoped aggregation: the aggregation's limiting conditions must match the document-search conditions.

Looking at the browser, we can see that the front end already sends such a request; its request parameters are exactly the same as those of the earlier document-search request (the search method).

The return value is the final result to be displayed on the page, a Map structure:

  • key is a string: city, star, brand, price
  • value is a collection, e.g. the names of multiple cities

1.3.3. Business implementation

Add a method to HotelController in the cn.whu.hotel.web package, following these conventions:

  • Request method: POST
  • Request path: /hotel/filters
  • Request parameters: RequestParams, consistent with the search endpoint's parameters
  • Return value type: Map<String, List<String>>

Code:

@PostMapping("filters")
public Map<String, List<String>> getFilters(@RequestBody RequestParams params) {
    log.info(params.toString());
    try {
        return hotelService.getFilters(params);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

The getFilters method of IHotelService is called here; it does not exist yet.

Define the new method in cn.whu.hotel.service.IHotelService:

Map<String, List<String>> getFilters(RequestParams params) throws IOException;

Implement the method in cn.whu.hotel.service.impl.HotelService:

@Override
public Map<String, List<String>> getFilters(RequestParams params) throws IOException {
    // 1. prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. prepare the DSL: the filter conditions must be applied too
    buildBasicQuery(params, request);
    // three aggregations
    request.source().aggregation(AggregationBuilders.terms("brandAgg").field("brand").size(20));
    request.source().aggregation(AggregationBuilders.terms("cityAgg").field("city").size(20));
    request.source().aggregation(AggregationBuilders.terms("starAgg").field("starName").size(20));
    // 3. send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. package the three results, one by one
    Map<String, List<String>> map = new HashMap<>();
    map.put("brand", getAggByName(response, "brandAgg"));
    map.put("city", getAggByName(response, "cityAgg"));
    map.put("starName", getAggByName(response, "starAgg"));

    return map;
}

private List<String> getAggByName(SearchResponse response, String aggName) {
    List<String> list = new ArrayList<>();
    Terms terms = response.getAggregations().get(aggName);
    for (Terms.Bucket bucket : terms.getBuckets()) {
        list.add(bucket.getKeyAsString());
    }
    return list;
}
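The endpoint then returns a map along these lines (the values are purely illustrative of the course's hotel data set):

{
  "city": ["上海", "北京", "深圳"],
  "brand": ["如家", "7天酒店", "皇冠假日"],
  "starName": ["二钻", "三钻", "五星级"]
}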


2. Auto-completion

When the user types a character into the search box, we should suggest search terms related to that character.

This feature of suggesting complete entries based on the letters the user has typed is auto-completion.

Because the completions must be inferred from pinyin letters, a pinyin tokenizer is required.

2.1. Pinyin tokenizer

To complete based on letters, documents must be tokenized by pinyin. There happens to be a pinyin analysis plugin for elasticsearch on GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin

https://github.com/medcl/elasticsearch-analysis-pinyin/releases/tag/v7.12.1

The latest version is currently 8.8.1, but do not simply take the newest release; use the same version as the tutorial, because our es version is 7.12.1 and the plugin version must match it exactly.

The installation package of the pinyin tokenizer is also provided in the pre-class materials:

Link: https://pan.baidu.com/s/1akBs8WuTip9Rvwnj6dLvvw
Extraction code: hzan

The installation method is the same as for the IK tokenizer:

① Decompress

② Upload to the elasticsearch plugin directory on the virtual machine, /var/lib/docker/volumes/es-plugins/_data (this directory was mounted earlier)

③ Restart elasticsearch

④ Test

For detailed installation steps, refer to the IK tokenizer installation process (installed the same way: search engine elasticsearch - 3. Install the IK tokenizer)

The test usage is as follows (back in dev tools in the browser):

POST /_analyze
{
  "text": ["如家酒店还不错"],
  "analyzer": "pinyin"
}

Note that the text does not need to be pinyin: write Chinese, and the analyzer splits it into pinyin automatically.

Result: the default pinyin analyzer turns every single Chinese character into its pinyin.
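Roughly, the output contains one full-pinyin token per character plus a first-letter token; treat the exact token set as an assumption, since it depends on the plugin's default options:

ru, jia, jiu, dian, hai, bu, cuo, rjjdhbc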

2.2. Custom tokenizer

The default pinyin tokenizer turns every single Chinese character into pinyin, but we want one set of pinyin per term, so we need to customize the pinyin behavior inside a custom analyzer.

In addition, with the pinyin tokenizer alone the original Chinese characters are lost; we want to keep the Chinese as well.

An analyzer in elasticsearch consists of three parts:

  • character filters: process the text before the tokenizer, e.g. deleting or replacing characters
  • tokenizer: cuts the text into terms according to certain rules, e.g. keyword (no splitting) or ik_smart
  • tokenizer filters: further process the terms output by the tokenizer, e.g. case conversion, synonyms, pinyin conversion

When a document is tokenized, it passes through these three parts in order.

The syntax for declaring a custom analyzer is as follows. The declaration happens while creating an index, so the custom analyzer is only valid for that index:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // custom analyzer
        "my_analyzer": { // analyzer name
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // custom tokenizer filter
        "py": { // filter name
          "type": "pinyin", // filter type: pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": { // index mappings (the "table structure")
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",    // index with the custom analyzer
        "search_analyzer": "ik_smart" // search with ik_smart: searching the inverted index by pinyin
                                      // would match homophones, e.g. "虱子" when searching "狮子",
                                      // since both map to the same term shizi
      }
    }
  }
}

Test: Chinese, full pinyin, and first-letter abbreviations all match.

  • Test 2

POST /test/_doc/1
{
  "id": 1,
  "name": "狮子"
}

POST /test/_doc/2
{
  "id": 2,
  "name": "柿子"
}

GET /test/_search
{
  "query": {
    "match": {
      "name": "shizi"
    }
  }
}

analyzer: my_analyzer is used when indexing
search_analyzer: ik_smart is used when searching

The full set of pinyin filter options is documented at https://github.com/medcl/elasticsearch-analysis-pinyin

Summary:

How to use the pinyin tokenizer?

  • ① Download the pinyin tokenizer

  • ② Unzip it into the elasticsearch plugin directory

  • ③ Restart

How to customize an analyzer?

  • When creating an index, configure it under settings; it can contain three parts:
    • ① character filter
    • ② tokenizer
    • ③ filter

Precautions for the pinyin tokenizer?

  • To avoid matching homophones, do not apply the pinyin tokenizer at search time: use it when indexing, and use a plain ik_smart analyzer when searching for words.

2.3. Autocomplete query

Elasticsearch provides the Completion Suggester query for auto-completion. It matches and returns terms that begin with the user's input. To keep completion queries efficient, there are constraints on the field being queried:

  • The field must be of type completion (a field type dedicated to auto-completion queries).

  • The field content is generally an array of the entries to be completed.

For example, an index like this:

// create the index
PUT test2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

Then insert the following data:

// example data
POST test2/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test2/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test2/_doc
{
  "title": ["Nintendo", "switch"]
}

The query DSL statement is as follows:

// auto-completion query
GET /test2/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // the keyword typed so far
      "completion": {
        "field": "title",        // the field to complete against
        "skip_duplicates": true, // skip duplicate suggestions
        "size": 10               // return at most 10 results
      }
    }
  }
}

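Trimmed to the relevant fields, the response groups the matches under the suggestion name, roughly like this (abbreviated; each option also carries _index, _id, and _source):

{
  "suggest": {
    "title_suggest": [
      {
        "text": "s",
        "options": [
          { "text": "SK-II", ... },
          { "text": "Sony", ... },
          { "text": "switch", ... }
        ]
      }
    ]
  }
}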

2.4. Implementing auto-completion for the hotel search box

Our hotel index was created without the pinyin analyzer, so its configuration must change. But an index cannot be modified in place; it can only be deleted and recreated.

In addition, we need a dedicated auto-completion field and must put brand, city, business district, etc. into it as completion suggestions.

To summarize, the things to do are:

  1. Modify the hotel index structure and configure a custom pinyin analyzer

  2. Make the name and all relevant fields of the index use the custom analyzer

  3. Add a new suggestion field of type completion to the index, using a custom analyzer

  4. Add a suggestion field to the HotelDoc class, containing brand and business

  5. Re-import the data into the hotel index

2.4.1. Modify the hotel mapping structure

View the existing hotel index structure: GET /hotel/_mapping

The code is as follows:

# hotel data index
DELETE hotel
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion": {
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}

2.4.2. Modify the HotelDoc entity

A field must be added to HotelDoc for auto-completion; its content can be the hotel brand, city, business district, and so on. As completion fields require, it is best stored as an array of these values.

So we add a suggestion field of type List<String> to HotelDoc and fill it with the brand, business, and similar information.

The code is as follows:

package cn.whu.hotel.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private Boolean isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // assemble suggestion: the auto-completion entries come from these fields
        if (this.business.contains("/")) {
            // business holds multiple values and needs splitting
            String[] arr = this.business.split("/");
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, arr);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}

Because the data is imported with Java code, the suggestion field is filled in during the import, and completion queries can then run against it.

2.4.3. Reimport

Re-run the bulk-import test written earlier; the newly imported hotel documents now contain suggestion entries.

Re-run: cn.whu.hotel.HotelDocumentTest#testBulkRequest

  • Test auto-completion

# test auto-completion
GET /hotel/_search
{
  "suggest": {
    "suggestions_test": {
      "text": "h",
      "completion": {
        "field": "suggestion",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}


2.4.4. Java API for auto-completion query

We learned the DSL for auto-completion queries earlier, but not the corresponding Java API. The request is built through request.source().suggest(...), and the result, which is also rather special, is parsed by walking the Suggest object layer by layer.
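A condensed sketch of both steps follows; the full service implementation is in 2.4.5 below. The suggestion name "mySuggestion" is arbitrary, and client is the same RestHighLevelClient field used throughout:

// build the suggest request
SearchRequest request = new SearchRequest("hotel");
request.source().suggest(new SuggestBuilder().addSuggestion(
        "mySuggestion",
        SuggestBuilders.completionSuggestion("suggestion") // the completion field
                .prefix("h")
                .skipDuplicates(true)
                .size(10)));
SearchResponse response = client.search(request, RequestOptions.DEFAULT);

// parse: fetch the suggestion by name, then walk its options
CompletionSuggestion suggestion = response.getSuggest().getSuggestion("mySuggestion");
for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
    System.out.println(option.getText().string());
}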

2.4.5. Realize the automatic completion of the search box

Looking at the front-end page, we can see that as we type into the input box, the front end issues an ajax request.

The return value is a collection of completion entries, of type List<String>.

1) Add a new endpoint in HotelController under the cn.whu.hotel.web package to receive the request:

@GetMapping("suggestion")
public List<String> getSuggestion(String key) throws IOException {
    log.info("auto-completion: " + key);
    return hotelService.getSuggestion(key);
}

2) Add the method in IHotelService under the cn.whu.hotel.service package:

List<String> getSuggestion(String key) throws IOException;

3) Implement the method in cn.whu.hotel.service.impl.HotelService:

@Override
public List<String> getSuggestion(String key) throws IOException {
    // 1. prepare the request
    SearchRequest request = new SearchRequest("hotel");
    // 2. prepare the DSL
    request.source().suggest(new SuggestBuilder().addSuggestion(
            "my_suggestion", // suggestion name, used again when reading the result
            SuggestBuilders
                    .completionSuggestion("suggestion") // the completion field
                    .prefix(key)
                    .skipDuplicates(true)
                    .size(10)
    ));
    // 3. send the request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. handle the result
    // 4.1 fetch the completion result by the suggestion name
    Suggest suggest = response.getSuggest();
    // note the result type: CompletionSuggestion ★
    CompletionSuggestion suggestions = suggest.getSuggestion("my_suggestion");
    // 4.2 get the options
    List<CompletionSuggestion.Entry.Option> options = suggestions.getOptions();
    // 4.3 iterate
    List<String> ans = new ArrayList<>(options.size()); // the exact capacity is known
    for (CompletionSuggestion.Entry.Option option : options) {
        // with the CompletionSuggestion type, reading the text is simple
        String text = option.getText().toString();
        ans.add(text);
    }
    return ans;
}


3. Data synchronization

The hotel data in elasticsearch comes from the mysql database, so whenever the mysql data changes, elasticsearch must change with it. This is data synchronization between elasticsearch and mysql.

3.1. Thinking analysis

There are three common data synchronization schemes:

  • synchronous call
  • asynchronous notification
  • monitor binlog

3.1.1. Synchronous call

Solution 1: synchronous call

The basic steps are as follows:

  • hotel-demo exposes an interface for modifying the data in elasticsearch
  • after hotel-admin completes its database operation, it directly calls the interface provided by hotel-demo

3.1.2. Asynchronous notification

Solution 2: asynchronous notification

The process is as follows:

  • hotel-admin sends an MQ message after adding, deleting, or modifying mysql data
  • hotel-demo listens to MQ and updates the elasticsearch data after receiving the message

3.1.3. Monitor binlog

Solution 3: monitor the binlog

The process is as follows:

  • enable the binlog feature in mysql
  • mysql's insert, update, and delete operations are then recorded in the binlog
  • hotel-demo listens for binlog changes via canal and updates elasticsearch in real time (canal is another piece of middleware, not covered in this course)

3.1.4. Selection

Method 1: synchronous call

  • Advantages: simple, even crude, to implement
  • Disadvantages: high business coupling

Method 2: asynchronous notification

  • Advantages: low coupling, moderate implementation difficulty
  • Disadvantages: relies on the reliability of MQ

Method 3: monitor the binlog

  • Advantages: complete decoupling between services
  • Disadvantages: enabling the binlog adds load to the database, and implementation complexity is high

3.2. Realize data synchronization

3.2.1. Ideas

Link: https://pan.baidu.com/s/1-H2B3MKP9k1ZhWoUfJ1g0Q
Extraction code: hzan

Use the hotel-admin project provided in the pre-class materials as the hotel management microservice. Whenever hotel data is added, deleted, or modified, the same operation must be applied to the data in elasticsearch.

Steps:

  • Import the hotel-admin project provided by the pre-course materials, start and test the CRUD of hotel data

  • Declare exchange, queue, RoutingKey

  • Complete message sending in the add, delete, and change business in hotel-admin

  • Complete message monitoring in hotel-demo and update data in elasticsearch

  • Start and test the data sync function

3.2.2. Import demo

Import the hotel-admin project provided in the pre-class materials, adjusting its configuration (such as the database connection) to your environment as needed.

After it starts, visit http://localhost:8099, which contains the CRUD functionality for hotels.

3.2.3. Declare the exchange and queues

The MQ structure: in es, insert and update can be unified into a single PUT full-replace operation (a PUT with an existing id replaces the old document, deleting it first and re-adding it; a PUT with a new id simply creates one). So one queue for insert/update plus one queue for delete, two queues in total, are enough.

1) Introduce dependencies

Introduce the dependency of rabbitmq in hotel-admin and hotel-demo:

<!--amqp-->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-amqp</artifactId>
</dependency>

2) Declare the exchange and queue names

Create a new class MqConstants under the cn.whu.hotel.constants package in both hotel-admin and hotel-demo:

package cn.whu.hotel.constants;

public class MqConstants {
    /**
     * exchange
     */
    public final static String HOTEL_EXCHANGE = "hotel.topic";
    /**
     * queue listening for inserts and updates
     */
    public final static String HOTEL_INSERT_QUEUE = "hotel.insert.queue";
    /**
     * queue listening for deletes
     */
    public final static String HOTEL_DELETE_QUEUE = "hotel.delete.queue";
    /**
     * RoutingKey for insert or update
     */
    public final static String HOTEL_INSERT_KEY = "hotel.insert";
    /**
     * RoutingKey for delete
     */
    public final static String HOTEL_DELETE_KEY = "hotel.delete";

}

3) Configure the MQ address

Configure the MQ address in the application.yaml of hotel-admin and hotel-demo:

spring:
  rabbitmq:
    host: 192.168.141.100
    port: 5672
    username: whuer
    password: 123321
    virtual-host: /

Remember to start RabbitMQ on the Linux side first:

# find the container id of rabbitmq:3.8-management
docker ps -a
# e.g. the container id found is cd6a833208d2
docker start cd6a833208d2

Management UI: http://192.168.141.100:15672/#/

4) Declare the queues and exchange

In hotel-demo, define a configuration class that declares the queues, the exchange, and their bindings.

hotel-demo is the consumer. Declaring these via annotations is very simple;
here, to practice the underlying principle, we declare them in code as @Bean definitions.

cn.whu.hotel.config.MqConfig


package cn.whu.hotel.config;

import cn.whu.hotel.constants.MqConstants;
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.TopicExchange;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration // do not forget this annotation, or no beans are created
public class MqConfig {

    @Bean
    public TopicExchange topicExchange() {
        return new TopicExchange(MqConstants.HOTEL_EXCHANGE, true, false); // durable, not auto-delete
    }

    @Bean
    public Queue insertQueue() {
        return new Queue(MqConstants.HOTEL_INSERT_QUEUE, true);
    }

    @Bean
    public Queue deleteQueue() {
        return new Queue(MqConstants.HOTEL_DELETE_QUEUE, true);
    }

    @Bean
    public Binding insertQueueBinding() {
        return BindingBuilder.bind(insertQueue()).to(topicExchange()).with(MqConstants.HOTEL_INSERT_KEY);
    }

    @Bean
    public Binding deleteQueueBinding() {
        return BindingBuilder.bind(deleteQueue()).to(topicExchange()).with(MqConstants.HOTEL_DELETE_KEY);
    }
}
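For comparison, the annotation-based declaration mentioned above can bind queue, exchange, and key directly on the listener method. A sketch using spring-amqp's @QueueBinding (not what this project uses; ExchangeTypes comes from org.springframework.amqp.core):

@RabbitListener(bindings = @QueueBinding(
        value = @Queue(name = MqConstants.HOTEL_INSERT_QUEUE),
        exchange = @Exchange(name = MqConstants.HOTEL_EXCHANGE, type = ExchangeTypes.TOPIC),
        key = MqConstants.HOTEL_INSERT_KEY
))
public void listenHotelInsertOrUpdate(Long id) {
    // queue, exchange, and binding are declared implicitly by the annotation
    hotelService.insertById(id);
}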

3.2.4. Send MQ messages

Send MQ messages in the add, delete, and update operations of hotel-admin.

First inject the RabbitTemplate into cn.whu.hotel.web.HotelController (the amqp dependency was introduced above):

@Autowired
private RabbitTemplate rabbitTemplate;

Then add message sending after each add, update, and delete method:

// send the insert/update message: sending just the id is enough;
// the consumer queries the database itself, which saves queue memory
rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_INSERT_KEY, hotel.getId());

// send the delete message, again carrying only the id
rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_DELETE_KEY, id);
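For context, a sketch of how the insert message fits into a controller method; everything except the last line is assumed from a typical hotel-admin CRUD controller (method and service names are illustrative):

@PostMapping
public void saveHotel(@RequestBody Hotel hotel) {
    hotelService.save(hotel); // write to mysql first
    // notify the consumer only after the database operation succeeds
    rabbitTemplate.convertAndSend(MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_INSERT_KEY, hotel.getId());
}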

3.2.5. Receive MQ message

What hotel-demo must do upon receiving MQ messages:

  • insert/update message: query the hotel by the passed id, then write the document to the index
  • delete message: delete the document from the index by the passed id

1) First, add insert and delete methods to IHotelService under the cn.whu.hotel.service package of hotel-demo:

void deleteById(Long id);

void insertById(Long id);

2) Implement the business in HotelService under the cn.whu.hotel.service.impl package of hotel-demo:

@Override
public void deleteById(Long id) {
    // 1. prepare the request
    DeleteRequest request = new DeleteRequest("hotel", id.toString());
    // 2. send the request
    try {
        client.delete(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

@Override
public void insertById(Long id) {
    // insert or update
    // 0. query the hotel by id from the database, via the inherited IService method
    //    (two microservices now query one database; this design is not ideal)
    Hotel hotel = getById(id);
    // convert to the document type
    HotelDoc hotelDoc = new HotelDoc(hotel);
    // 1. prepare the request
    IndexRequest request = new IndexRequest("hotel").id(id.toString());
    // 2. prepare the DSL (the JSON document body)
    request.source(JSON.toJSONString(hotelDoc), XContentType.JSON);
    // 3. send the request
    try {
        client.index(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

3) Write a listener

Add a new class cn.whu.hotel.mq.HotelListener in hotel-demo:

package cn.whu.hotel.mq;

import cn.whu.hotel.constants.MqConstants;
import cn.whu.hotel.service.IHotelService;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class HotelListener {

    // the business logic lives in the service; in this consumer module
    // the service methods are exactly the es operations we need
    @Autowired
    private IHotelService hotelService;

    /**
     * listen for hotel insert or update messages
     * @param id hotel id
     */
    @RabbitListener(queues = MqConstants.HOTEL_INSERT_QUEUE)
    public void listenHotelInsertOrUpdate(Long id) {
        hotelService.insertById(id);
    }

    /**
     * listen for hotel delete messages
     * @param id hotel id
     */
    @RabbitListener(queues = MqConstants.HOTEL_DELETE_QUEUE)
    public void listenHotelDelete(Long id) {
        hotelService.deleteById(id);
    }
}
  • Test
    Restart both projects.

    Modify a hotel at http://localhost:8099/; the change also shows up in es at http://localhost:8089/.

    The same holds for deletion.
    (Insertion had a minor issue in the demo and was not tested for the time being.)

4. Cluster

Storing data on a single elasticsearch node inevitably runs into two problems: massive data storage and single points of failure.

  • Massive data storage: logically split the index into N shards and store them across multiple nodes
  • Single point of failure: back up shard data on different nodes (replicas)

Concepts related to ES clusters:

  • Cluster: a group of nodes sharing a common cluster name (all doing exactly the same job)

  • Node: an Elasticsearch instance in the cluster

  • Shard: an index can be split into different parts for storage, called shards. In a cluster, the shards of one index can be distributed across different nodes.

    This solves the problem of a single node's limited storage when the data volume is too large.

Here we split the data into 3 shards: shard0, shard1, shard2.

  • Primary shard: defined relative to replica shards

  • Replica shard: each primary shard can have one or more replicas holding the same data as the primary

Backing up the data guarantees high availability, but if every shard were backed up on its own node, the number of nodes (Elasticsearch instances) would double, which costs too much!

To strike a balance between high availability and cost, we can do this:

  • first shard the data and store the shards on different nodes
  • then back up each shard and place the copy on another node, so the nodes back each other up

This greatly reduces the number of nodes required. As an example, take 3 shards with one backup copy per shard.

Now every shard has 1 backup, stored across 3 nodes:

  • node0: holds shards 0 and 1
  • node1: holds shards 0 and 2
  • node2: holds shards 1 and 2

4.1. Building an ES cluster

Refer to the documentation in the pre-class materials, or directly to this blog: search engine elasticsearch - 4. Deploy an es cluster

4.2. Cluster split-brain problem

4.2.1. Division of Cluster Responsibilities

Cluster nodes in elasticsearch can take different roles: master-eligible, data, ingest, and coordinating.

coordinating: a coordinating node stores no data itself; it routes requests, load-balances, and merges results for the user.

By default, every node in an es cluster holds all four roles at once.

A real cluster, however, should separate responsibilities (the roles are not separated by default, so this must be configured):

  • master node: high CPU requirements, but low memory requirements
  • data node: high CPU and memory requirements
  • coordinating node: high network bandwidth and CPU requirements

Separating duties lets us deploy each node type on hardware matching its needs, and avoids interference between the services.
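Role separation is configured per node in elasticsearch.yml. A sketch in the legacy 7.x syntax (newer versions use node.roles instead):

# a dedicated master-eligible node: keep the master role, drop the rest
node.master: true
node.data: false
node.ingest: false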

A typical division of es cluster responsibilities deploys dedicated groups of master, data, and coordinating nodes.

4.2.2. Split brain problem

A split brain is caused by nodes in the cluster losing contact with each other.

For example, suppose the master node node1 loses its connection to the other nodes. node2 and node3 then believe node1 is down and elect a new master, say node3. After the election the cluster keeps serving requests, but node2 and node3 now form one cluster while node1 forms another, and since the two clusters do not synchronize, their data diverges.

When the network recovers, there are two master nodes and two inconsistent cluster states: a split brain.

The solution to split brain is to require that a node win at least (number of master-eligible nodes + 1) / 2 votes to be elected master, which is why the number of master-eligible nodes is best kept odd. The corresponding setting is discovery.zen.minimum_master_nodes; since es 7.0 this is handled by default, so split brain generally no longer occurs.

For example, in a cluster of 3 nodes, election requires (3 + 1) / 2 = 2 votes. node3 wins the votes of node2 and node3 and becomes master; node1 has only its own vote and is not elected (the old master is stripped of its authority and reduced to an ordinary node). The cluster still has exactly one master, and no split brain.
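Before es 7.0 this quorum had to be set by hand in elasticsearch.yml; for the 3-node example above it would be:

# minimum master-eligible votes needed to elect a master: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2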

4.2.3. Summary

What does a master-eligible node do?

  • participates in master elections
  • as master, manages the cluster state and shard information, and handles requests to create or delete indices

What does a data node do?

  • CRUD on documents

What does a coordinating node do?

  • routes requests to other nodes

  • merges the query results and returns them to the user

4.3. Cluster distributed storage

When a new document is added, it should be saved across the different shards to keep the data balanced. How, then, does the coordinating node decide which shard a document should be stored in?

4.3.1. Shard storage test

Insert three documents. (You can use a REST client for this instead of starting kibana.)

From the docker-compose yml file: es01 listens on port 9200, es02 on 9201, and es03 on 9202. The inserts above went through port 9200, i.e. the es01 node.

The test shows that the three documents landed in different shards:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 3,
        "successful": 3,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_shard": "[whu][0]",
                "_node": "gGPmt5m6RsmtFYSwwW5zqw",
                "_index": "whu",
                "_type": "_doc",
                "_id": "5",
                "_score": 1.0,
                "_source": {
                    "title": "试着插入一条 id=5"
                },
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            },
            {
                "_shard": "[whu][1]",
                "_node": "-GmaostMRsSvOxGJWr8viw",
                "_index": "whu",
                "_type": "_doc",
                "_id": "3",
                "_score": 1.0,
                "_source": {
                    "title": "试着插入一条 id=3"
                },
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            },
            {
                "_shard": "[whu][2]",
                "_node": "-GmaostMRsSvOxGJWr8viw",
                "_index": "whu",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "title": "试着插入一条 id=1"
                },
                "_explanation": {
                    "value": 1.0,
                    "description": "*:*",
                    "details": []
                }
            }
        ]
    }
}

http://192.168.141.100:9200/whu/_search
http://192.168.141.100:9201/whu/_search
http://192.168.141.100:9202/whu/_search

Querying any of these three ports, i.e. es01, es02, or es03, finds all 3 inserted documents. Adding the explain: true parameter shows which shard each hit lives on. The result above shows that (the whu index having been created with 3 shards in the cerebro visualization tool) the 3 documents are spread over the 3 shards, one per shard, even though all were inserted through port 9200. Both facts show the coordinating node at work. (It is also expected that all three nodes expose the same index: a cluster is, after all, multiple identical instances.)

4.3.2. Shard storage principle

Elasticsearch uses a hash algorithm to decide which shard a document is stored in:

shard = hash(_routing) % number_of_shards

Notes:

  • _routing defaults to the document id
  • the algorithm depends on the number of shards, so once an index is created, its shard count can never be changed!

The process of adding a new document:

  • 1) a document with id=1 is added
  • 2) a hash of the id is computed; say the result is 2, so it belongs in shard-2
  • 3) the primary of shard-2 is on node3, so the data is routed to node3
  • 4) node3 saves the document
  • 5) the document is synchronized to the replica of shard-2, on node2
  • 6) the result is returned to the coordinating node

4.4. Cluster distributed query

An elasticsearch query runs in two phases:

  • scatter phase: the coordinating node distributes the request to every shard (the id is unknown, so every shard must be asked; only then is the result complete)

  • gather phase: the coordinating node collects the search results from the data nodes, merges them into the final result set, and returns it to the user

The aggregating node here is the coordinating node, and every node can act as a coordinating node.

  • That is why complete results can be fetched from any node: whichever node receives the query forwards it to all shards, merges the results, and returns them to the user.
  • Consequently no single node needs to store all shards of an index; distributed querying still retrieves complete results.
  • Every node behaves the same from the outside, which is exactly the idea of a cluster (many instances of one service doing the same job).

4.5. Cluster failover

The master node monitors the status of the nodes in the cluster. If it finds that a node has gone down, it immediately migrates that node's shard data to other nodes to keep the data safe. This is called failover.

1) Take a cluster where node1 is the master and the other two nodes are followers.

2) Suddenly node1 fails. The first thing that happens is a new master election, say node2 is chosen.

3) After node2 becomes master, it inspects the cluster state and finds that shard-1 and shard-0 are missing copies, so the data that lived on node1 is migrated to node2 and node3.

Demo:

  1. 3 healthy nodes, with es03 as the master

  2. Stop the es03 node instance (stop the container):

    docker-compose stop es03

    A new master is elected first (es02 wins), and the cluster turns yellow, meaning unhealthy.

    After the data migration finishes, query the data in the browser or postman; nothing is lost:
    http://192.168.141.100:9200/whu/_search
    http://192.168.141.100:9201/whu/_search

  3. Restart es03:

    docker-compose start es03

    The node comes back, the cluster returns to 3 healthy nodes, and the shards are spread out again (migrated back to keep the cluster balanced).

Origin: blog.csdn.net/hza419763578/article/details/131817446