Distributed Search Engine 01 – elasticsearch introduction: the inverted index principle, core concepts (document and field, index and mapping), installation, index library CRUD, document CRUD, and the RestAPI (Java code implementing CRUD against es)

Distributed search engine 01 – elasticsearch basics

0. Learning Objectives

1. Understand the principle of the inverted index
2. Understand the concepts of index library, type, mapping, document, and field
3. Be able to install and use the IK tokenizer
4. Be able to use Kibana to operate index libraries, type mappings, and documents
5. Be able to use the RestClient to operate index libraries, type mappings, and documents

1. Getting to know elasticsearch first

1.1. Understanding ES

1.1.1. The role of elasticsearch

Elasticsearch is a very powerful open-source search engine. It offers many capabilities that help us quickly find what we need in massive amounts of data.

For example:

  • Searching code on GitHub

  • Searching for products on e-commerce sites

  • Searching for answers on Baidu

  • Searching for nearby cars in a taxi app

1.1.2. ELK technology stack

Elasticsearch combined with Kibana, Logstash, and Beats forms the Elastic Stack (ELK). It is widely used in log data analysis, real-time monitoring, and other fields.

Elasticsearch is the core of the Elastic Stack, responsible for storing, searching, and analyzing data.

1.1.3. Elasticsearch and Lucene

The bottom layer of elasticsearch is implemented on top of Lucene.

Lucene is a Java search engine class library (a jar package), a top-level Apache project originally developed by Doug Cutting in 1999. Official website: https://lucene.apache.org/.

Lucene on its own has some limitations:

  • Because it is a Java class library, it can only be used from Java
  • Its API is complex, and the learning curve is steep
  • It only considers search; it does not address high concurrency, and horizontal scaling requires secondary development

Elasticsearch builds on Lucene and fixes many of these shortcomings of the raw library.

The development history of elasticsearch:

  • In 2004, Shay Banon developed Compass based on Lucene
  • In 2010, Shay Banon rewrote Compass and named it Elasticsearch

Elasticsearch exposes a RESTful HTTP interface, so it is language-agnostic: any language that can issue HTTP requests can call it.

1.1.4. Why not other search techniques?

Looking at the current rankings of well-known search engine technologies: Apache Solr was the most important search engine technology in the early days, but elasticsearch has gradually developed, surpassed Solr, and taken the lead.

1.1.5. Summary

What is elasticsearch?

  • An open source distributed search engine that can be used to implement functions such as search, log statistics, analysis, and system monitoring

What is elastic stack (ELK)?

  • A technology stack with elasticsearch at its core, consisting of Beats, Logstash, Kibana, and elasticsearch

What is Lucene?

  • Apache's open-source search engine class library, which provides the core APIs of a search engine

1.2. Inverted index

The inverted index is best understood by contrast with a forward index such as MySQL's:

the index of a MySQL database is a forward index,
while the index of elasticsearch is an inverted index.
They sound very different. What exactly is the difference? Consider the following example.

1.2.1. Forward index

So what is a forward index? For example, create an index on the id column of the following table, tb_goods (imagine columns such as id, title, and price):

If you query by id, you hit the index directly and the lookup is very fast (a B+ tree lookup is O(log N)).

But a fuzzy query on the title can only scan the data row by row. The process is as follows:

1) The user searches for data, with the condition that the title matches "%手机%" ("%phone%")

2) Fetch the data row by row, e.g. the row with id 1

3) Check whether the title in that row meets the user's search condition

4) If it matches, put it into the result set; if not, discard it. Then go back to step 2 for the next row

This row-by-row scan is a full table scan, and as the amount of data grows its query efficiency gets lower and lower. When the data volume reaches the millions, it is a disaster: the dreaded O(n).
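For instance, assuming tb_goods has id and title columns, a query like the following cannot use the primary-key index and forces exactly this full scan:

-- The leading % wildcard defeats any B+ tree index,
-- so MySQL has to examine every row of tb_goods
SELECT *
FROM tb_goods
WHERE title LIKE '%手机%';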

1.2.2. Inverted index

There are two very important concepts in an inverted index:

  • Document: the data being searched; each piece of data is a document, e.g. a web page or a product record
  • Term: segmenting a document (or a user's search input) with some algorithm yields meaningful words, and each such word is a term. For example, "我是中国人" ("I am Chinese") can be split into the terms 我, 是, 中国人, 中国, 国人

Creating an inverted index is a special post-processing of the forward index. The process is as follows:

  • Segment each document's data with the tokenizing algorithm to obtain the terms
  • Build a table in which each row records a term, the id of the document containing it, its position, and so on
  • Because terms are unique, an index can be created on the term column, e.g. a hash-table index

The result is an additional term table that supports fast lookups from term to document ids.

The search process of the inverted index is as follows (take the search for "Huawei mobile phone" as an example):

1) The user enters the search condition "华为手机" (Huawei mobile phone).

2) The input is segmented into the terms 华为 and 手机.

3) Each term is looked up in the inverted index, yielding the document ids that contain it: 1, 2, and 3. (Document 2 contains both terms, so it ranks higher.)

4) Take the document id to find the specific document in the forward index.


Although this requires querying the inverted index first and then the forward index, both the terms and the document ids are indexed, so the lookups are very fast and no full table scan is needed.

Because the document ids are found from the content (the terms), this is called an inverted index.
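To make the two lookups concrete, here is a minimal, illustrative sketch in Java (an assumption-level toy, not how Lucene actually stores its index): a forward map from id to document plus an inverted map from term to document ids.

import java.util.*;

// A toy in-memory inverted index: term -> set of document ids,
// alongside a forward index: id -> document. Illustration only.
public class ToyInvertedIndex {
    private final Map<String, Set<Integer>> inverted = new HashMap<>();
    private final Map<Integer, String> forward = new HashMap<>();

    // "Tokenize" by whitespace; a real engine would use an analyzer such as IK.
    public void add(int docId, String content) {
        forward.put(docId, content);
        for (String term : content.split("\\s+")) {
            inverted.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    // Search: segment the query, look up each term, then fetch documents by id.
    public List<String> search(String query) {
        Set<Integer> ids = new TreeSet<>();
        for (String term : query.split("\\s+")) {
            ids.addAll(inverted.getOrDefault(term, Collections.emptySet()));
        }
        List<String> results = new ArrayList<>();
        for (int id : ids) {
            results.add(forward.get(id)); // step 4: forward lookup by id
        }
        return results;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "小米 手机");
        idx.add(2, "华为 手机");
        idx.add(3, "华为 充电器");
        // "华为 手机" hits documents 1, 2 and 3 via the two terms
        System.out.println(idx.search("华为 手机"));
    }
}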

1.2.3. Forward and reverse

So why is one called a forward index and the other an inverted index?

  • Forward indexing is the most traditional way of indexing, by id. When querying by terms, however, you must fetch each document one by one and then check whether it contains the required term: it is the process of finding terms from documents.

  • The inverted index is the opposite: it first finds the terms the user is searching for, obtains the ids of the documents that contain those terms, and then fetches the documents by id: it is the process of finding documents from terms.

Is it just the other way around?

So what are the pros and cons of both approaches?

Forward index:

  • Advantages:
    • Indexes can be created for multiple fields
    • Searching and sorting by an indexed field is very fast
  • Disadvantages:
    • Searching by a non-indexed field, or by partial terms inside an indexed field, requires a full table scan

Inverted index:

  • Advantages:
    • Searching by terms, including fuzzy matches on content, is very fast
  • Disadvantages:
    • Indexes can only be created for terms, not for whole fields
    • Cannot sort by field

1.3. Some concepts of es

There are many unique concepts in elasticsearch, which are slightly different from mysql, but there are also similarities.

1.3.1. Documents and fields

Elasticsearch storage is **document (Document)** oriented: a document can be a product record or an order record from the database. Document data is serialized into JSON format before being stored in elasticsearch.

JSON documents often contain many fields (Field), which are similar to columns in a database.

1.3.2. Indexes and Mappings

Index (Index) is a collection of documents of the same type.

For example:

  • All user documents can be grouped together, called the user index
  • All product documents can be grouped together, called the product index
  • All order documents can be grouped together, called the order index

Therefore, we can think of an index as a table in a database.

A database table carries constraint information that defines the table's structure, the field names and types, and so on. In the same way, an index library has a mapping: the field constraints for the documents in the index, similar to a table's structural constraints.

1.3.3. MySQL and elasticsearch

Let's compare the concepts of mysql and elasticsearch in a unified way:

  • MySQL Table → Elasticsearch Index: an index is a collection of documents, similar to a database table
  • MySQL Row → Elasticsearch Document: a document is one piece of data, similar to a row in a database; documents are in JSON format
  • MySQL Column → Elasticsearch Field: a field is a key in a JSON document, similar to a column in a database
  • MySQL Schema → Elasticsearch Mapping: a mapping is the set of constraints on the documents in an index, such as field type constraints, similar to a table schema
  • MySQL SQL → Elasticsearch DSL: DSL is the JSON-style request language provided by elasticsearch, used to operate elasticsearch and implement CRUD

Does it mean that we no longer need mysql after learning elasticsearch?

Not really, both have their own pros and cons:

  • MySQL: good at transactional operations; it can guarantee data safety and consistency

  • Elasticsearch: good at searching, analyzing, and computing over massive amounts of data

Therefore, in enterprises, the two are often used in combination:

  • For write operations with high security requirements, use mysql to implement
  • For search requirements that require high query performance, use elasticsearch to achieve
  • The two are based on a certain method to achieve data synchronization and ensure consistency


1.4. Install es, kibana

1.4.1. Install es, kibana

Link: https://pan.baidu.com/s/1LRpd6xncRhxHIgK13gHu4g
Extraction code: hzan

Refer to the pre-class materials, or directly refer to this blog: search engine elasticsearch

1.4.2. Install tokenizer

Refer to the pre-class materials, or directly refer to this blog: search engine elasticsearch

1.4.3. Summary

What is the function of the tokenizer?

  • Word segmentation for documents when creating an inverted index
  • When the user searches, word segmentation of the input content

How many modes does the IK tokenizer have?

  • ik_smart: intelligent segmentation, coarse-grained
  • ik_max_word: the finest segmentation, fine-grained

How does the IK tokenizer extend its vocabulary? How are entries disabled (stop words)? (See the example below.)

  • In the IKAnalyzer.cfg.xml file under the config directory, configure an extended dictionary and a stopword dictionary
  • Add the extended or disabled entries to those dictionary files
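A quick way to compare the two modes is the _analyze API in the Kibana console (assuming the IK plugin is installed; the sample sentence is arbitrary):

# Returns the tokens produced by the chosen analyzer;
# swap in "ik_smart" to see the coarser-grained result
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "武汉大学程序员"
}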

2. Index library operation

The index library is similar to the database table , and the mapping mapping is similar to the table structure .

If we want to store data in es, we must first create "library" and "table".

2.1. Mapping properties

Mapping is a constraint on the documents in the index library. Common mapping attributes include:

  • type: the field's data type; common simple types are:
    • String: text (text that can be tokenized) and keyword (an exact value, e.g. brand, country, IP address)
    • Numeric: long, integer, short, byte, double, float
    • Boolean: boolean
    • Date: date
    • Object: object
    • (There is no array type; instead, every type allows multiple values. In other words, es ignores the array itself and only cares about the type of the array's elements)
  • index: whether to create an inverted index for the field; defaults to true (once set to false, no inverted index is built and the field can no longer be searched; of course, some non-content fields do not need to participate in search)
  • analyzer: which tokenizer to use (only the text type needs tokenizing, so this is used together with text)
  • properties: the subfields of this field (used for nesting)

For example the following json document:

{
    "age": 21,
    "weight": 52.1,
    "isMarried": false,
    "info": "whuer程序员Java讲师",
    "email": "[email protected]",
    "score": [99.1, 99.5, 98.9],
    "name": {
        "firstName": "云",
        "lastName": "赵"
    }
}

The corresponding mapping for each field:

  • age: type integer; participates in search, so index must be true; no tokenizer needed
  • weight: type float; participates in search, so index must be true; no tokenizer needed
  • isMarried: type boolean; participates in search, so index must be true; no tokenizer needed
  • info: a string that needs tokenizing, so text; participates in search, so index must be true; the tokenizer can be ik_smart
  • email: a string, but no tokenizing is needed, so keyword; it does not participate in search, so index should be false; no tokenizer needed
  • score: although it is an array, we only look at the element type, which is float; participates in search, so index must be true; no tokenizer needed
  • name: type object, with several sub-properties to define
    • name.firstName: a string with no tokenizing needed, so keyword; participates in search, so index must be true; no tokenizer needed
    • name.lastName: a string with no tokenizing needed, so keyword; participates in search, so index must be true; no tokenizer needed

2.2. CRUD of the index library

Here we uniformly use Kibana to write DSL to demonstrate.

2.2.1. Create index library and mapping

Basic syntax:

  • Request method: PUT
  • Request path: /index library name, which can be customized
  • Request parameter: mapping mapping

Creating an index library (with its mapping) is equivalent to creating a table in a MySQL database; MySQL uses SQL statements, while es uses JSON-style DSL statements.

Format:

PUT /index_name
{
  "mappings": {
    "properties": {
      "field_name1": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "field_name2": {
        "type": "keyword",
        "index": false
      },
      "field_name3": {
        "properties": {
          "sub_field": {
            "type": "keyword"
          }
        }
      }
      // ...
    }
  }
}

"mappings": mapping, representing the structure
"properties": representing this is a nesting
Note that each field has a type that cannot be missed, especially "name": {"type": "object"}
The rest of the description above is already very detailed

Example:

# Create the index library (including the mapping)
PUT /whu
{
  "mappings": {
    "properties": {
      "info": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "email": {
        "type": "keyword",
        "index": false
      },
      "name": {
        "type": "object",
        "properties": {
          "firstName": {
            "type": "keyword"
          },
          "lastName": {
            "type": "keyword"
          }
        }
      }
    }
  }
}


2.2.2. Query index library

Basic syntax:

  • Request method: GET

  • Request path: /index_name

  • Request parameters: none

Format:

GET /index_name

Example:

GET /whu

2.2.3. Modify the index library

Although the inverted index structure is not complicated, once the data structure changes (for example, the tokenizer is changed), the inverted index must be rebuilt, which is a disaster. Therefore, once an index library is created, its existing mapping cannot be modified.

Although existing fields in the mapping cannot be modified, adding new fields to the mapping is allowed, because it does not affect the existing inverted index.

Syntax:

PUT /{index_name}/_mapping
{
  "properties": {
    "new_field_name": {
      "type": "integer"
    }
  }
}

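Example (an illustrative assumption rather than the original screenshot: adding an age field to the whu index created earlier):

# Add a new field to an existing index library
PUT /whu/_mapping
{
  "properties": {
    "age": {
      "type": "integer"
    }
  }
}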

2.2.4. Delete the index library

Syntax:

  • Request method: DELETE

  • Request path: /index_name

  • Request parameters: none

Format:

DELETE /index_name

Test it in Kibana:

# Delete the index library
DELETE /whu

2.2.5. Summary

What are the index library operations?

  • Create an index library: PUT /index_name
  • Query an index library: GET /index_name
  • Delete an index library: DELETE /index_name
  • Add a field: PUT /index_name/_mapping

3. Document operation

A document in es is similar to a row in a MySQL table,
and Kibana plays a role similar to Navicat.

3.1. New document

Syntax:

POST /index_name/_doc/document_id
{
    "field1": "value1",
    "field2": "value2",
    "field3": {
        "sub_field1": "value3",
        "sub_field2": "value4"
    }
    // ...
}

If you do not supply an id yourself, es generates a random one, which is usually not what we want, so always write the id explicitly.
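For illustration (a hypothetical request against the whu index; the generated value differs every run), omitting the id looks like this:

# No id in the path: es generates a random one,
# returned in the response as "_id"
POST /whu/_doc
{
  "info": "a document with an auto-generated id"
}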

Example:

# Insert a document
POST /whu/_doc/1
{
  "info": "whuer学不会java",
  "email": "[email protected]",
  "name": {
    "firstName": "波",
    "lastName": "波"
  }
}

The id is specified as 1; the rest is the document body written as JSON.

The response reports "result": "created", together with the document's _index, _id, and _version.

3.2. Query documents

Following REST conventions, creation uses POST, so querying should use GET; a query generally needs a condition, though, and here we bring the document id.

Syntax:

GET /{index_name}/_doc/{id}

View the data through Kibana:

GET /whu/_doc/1

The result carries the document in its _source attribute.

3.3. Delete document

Deletion uses the DELETE request. Similarly, it needs to be deleted according to the id:

Syntax:

DELETE /{index_name}/_doc/{id}

Example:

# Delete a document by id
DELETE /whu/_doc/1

The response reports "result": "deleted".

Each write to a document increments its "_version" field, the version-control field, by one.
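For example, deleting /whu/_doc/1 returns a response shaped roughly like this (illustrative; the exact _version depends on how many writes the document has seen):

{
  "_index" : "whu",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 },
  "_seq_no" : 1,
  "_primary_term" : 1
}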

3.4. Modify the document

There are two ways to modify:

  • Full modification: directly overwrites the original document (the old document is deleted entirely and a new one is added)
  • Incremental modification: modifies only some fields in the document (the original document is not deleted; the specified fields are changed in place)

3.4.1. Full modification

Full modification overwrites the original document. In essence it:

  • Deletes the document with the specified id
  • Adds a new document with the same id

Note: if the id does not exist, the delete step finds nothing but the add step still executes, so the operation turns from a modification into an addition.

Syntax:

PUT /{index_name}/_doc/{document_id}
{
    "field1": "value1",
    "field2": "value2"
    // ...
}

Example:

# Full modification of a document
PUT /whu/_doc/1
{
  "info": "whuer学不会java",
  "email": "[email protected]",
  "name": {
    "firstName": "波",
    "lastName": "波"
  }
}

Compared with the document inserted earlier, the email has changed; after running this, query the document again to see that it has been overwritten.

3.4.2. Incremental modification

Incremental modification is to modify only some fields in the document matching the specified id.

Syntax:

POST /{index_name}/_update/{document_id}
{
    "doc": {
        "field_name": "new value"
    }
}

A partial modification uses POST, and the JSON body must wrap the fields in an outer "doc" object.
A full modification uses PUT, and the JSON body is the plain document, with no "doc" wrapper.

Example:

# Partially modify a document's fields
POST /whu/_update/1
{
  "doc": {
    "email": "[email protected]"
  }
}

To modify only the email field, you only need to describe that one field.
Be careful: the path segment is _update, not _doc. (Writing _doc would turn this into a full modification: the other fields would be deleted, leaving only this one.)

3.5. Summary

What are the document operations?

  • Create a document: POST /{index_name}/_doc/{id}  { json document }
  • Query a document: GET /{index_name}/_doc/{id}
  • Delete a document: DELETE /{index_name}/_doc/{id}
  • Modify a document:
    • Full modification: PUT /{index_name}/_doc/{id}  { json document }
    • Incremental modification: POST /{index_name}/_update/{id}  { "doc": { fields } }

4.RestAPI

ES officially provides clients in various languages for operating ES. In essence, these clients all assemble DSL statements and send them to ES over HTTP. Official documentation: https://www.elastic.co/guide/en/elasticsearch/client/index.html

The Java Rest Client includes two types:

  • Java Low Level Rest Client
  • Java High Level Rest Client

What we will learn here is the Java High Level Rest Client API.

4.0. Import Demo project

4.0.1. Import data

Link: https://pan.baidu.com/s/1LRpd6xncRhxHIgK13gHu4g
Extraction code: hzan

First create a new database:

CREATE DATABASE es01;

Then import the table data provided by the pre-class materials, and refresh the es01 database to check it.

The data structure is as follows:

CREATE TABLE `tb_hotel` (
  `id` bigint(20) NOT NULL COMMENT '酒店id',
  `name` varchar(255) NOT NULL COMMENT '酒店名称;例:7天酒店',
  `address` varchar(255) NOT NULL COMMENT '酒店地址;例:航头路',
  `price` int(10) NOT NULL COMMENT '酒店价格;例:329',
  `score` int(2) NOT NULL COMMENT '酒店评分;例:45,就是4.5分',
  `brand` varchar(32) NOT NULL COMMENT '酒店品牌;例:如家',
  `city` varchar(32) NOT NULL COMMENT '所在城市;例:上海',
  `star_name` varchar(16) DEFAULT NULL COMMENT '酒店星级,从低到高分别是:1星到5星,1钻到5钻',
  `business` varchar(255) DEFAULT NULL COMMENT '商圈;例:虹桥',
  `latitude` varchar(32) NOT NULL COMMENT '纬度;例:31.2497',
  `longitude` varchar(32) NOT NULL COMMENT '经度;例:120.3925',
  `pic` varchar(255) DEFAULT NULL COMMENT '酒店图片;例:/img/1.jpg',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

4.0.2. Import project

Then import the project provided by the pre-class materials.

Then write a small demo in the test class cn.whu.hotel.HotelDemoApplicationTests to see whether the project runs:

@Autowired
private HotelService hotelService;

@Test
public void test() {
    int id = 36934;
    Hotel hotel = hotelService.getById(id);
    System.out.println(hotel);
}

4.0.3.mapping mapping analysis

The key to creating an index library is the mapping, and the information to be considered for the mapping includes:

  • The field name
  • The field's data type
  • Whether it participates in search
  • Whether it needs tokenizing
  • If tokenized, which tokenizer to use

Specifically:

  • The field name and field data type can be taken from the data table's column names and types
  • Whether a field participates in search is a business judgment; an image address, for example, has no need to be searchable
  • Whether to tokenize depends on the content: if the content is an indivisible whole, no tokenizing is needed; otherwise it is
  • For the tokenizer, we can uniformly use ik_max_word

Let's look at the index library structure for the hotel data:

  1. In es the document id is always a string, so it cannot be defined as long. es has only two string types: text (tokenizable text) and keyword (an exact, non-tokenized value); the id here is clearly keyword
  2. We judge that users will not search hotels by address, so address is keyword (simply marking it a string) and does not participate in search; the pic field is treated the same way
  3. The price needs sorting and filtering, so it must participate in search (other numeric fields are similar)
  4. Coordinate points have a dedicated type, geo_point
# The hotel mapping (similar to defining a table structure)
PUT /hotel
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

index: false means not to participate in the search

Description of several special fields:

  • location: geographic coordinates, i.e. longitude and latitude
  • all: a combined field whose purpose is to merge the values of multiple fields via copy_to, providing a single field for users to search (an auxiliary field that makes multi-condition search simple)

Geographic coordinates use the geo_point type: a latitude and a longitude, which can be written as a single "lat, lon" string (as the HotelDoc class does later).

copy_to description: it lets the content of multiple fields be searched through one field, implementing multi-field search simply and elegantly. The field value is not actually copied; an inverted index is just built over it as if it were.

Remember to execute the statement in Kibana after writing it.
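As an illustration of copy_to (using the match query, which later chapters cover in detail): once documents are imported, a single query on all hits name, brand, and business at the same time.

# "如家" is a brand value, yet it is found through the all field
GET /hotel/_search
{
  "query": {
    "match": {
      "all": "如家"
    }
  }
}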

4.0.4. Initialize RestClient

In the API provided by elasticsearch, all interactions with elasticsearch are encapsulated in a class called RestHighLevelClient, and the initialization of this object must be completed first to establish a connection with elasticsearch.

Divided into three steps:

1) Introduce the RestHighLevelClient dependency of es:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>

2) Because Spring Boot's default-managed ES version is 7.6.2, we need to override it to match our server:

<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

3) Initialize RestHighLevelClient:

The initialization code is as follows:

RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(
        HttpHost.create("http://192.168.150.101:9200")
));

Here, for the convenience of unit testing, we create a test class HotelIndexTest, and then write the initialization code in the @BeforeEach method:

package cn.whu.hotel;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import java.io.IOException;

public class HotelIndexTest {

    private RestHighLevelClient client;

    @BeforeEach
    public void setUp() {
        client = new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://192.168.141.100:9200") // note: no trailing '/'
                // for a cluster, list multiple hosts separated by commas
        ));
    }

    @AfterEach
    public void tearDown() throws IOException {
        client.close();
    }

    @Test
    void testInit() {
        System.out.println(client);
        // org.elasticsearch.client.RestHighLevelClient@12d2ce03
    }
}

4.1. Create an index library

4.1.1. Code Interpretation

The Java API for creating an index library follows the pattern below (the complete code is in 4.1.2).

The code is divided into three steps:

  • 1) Create a Request object. Since this creates an index library, the Request is a CreateIndexRequest.
  • 2) Add the request parameters, i.e. the JSON body of the DSL. Because the JSON string is very long, a static string constant MAPPING_TEMPLATE is defined to keep the code tidy.
  • 3) Send the request. The client.indices() method returns an IndicesClient, which encapsulates all index-library operations.

4.1.2. Complete example

Under the cn.whu.hotel.constants package of hotel-demo, create a class to define the JSON string constant of the mapping mapping:

package cn.whu.hotel.constants;

public class HotelConstants {
    public static final String MAPPING_TEMPLATE = "{\n" +
            "  \"mappings\": {\n" +
            "    \"properties\": {\n" +
            "      \"id\": {\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"name\":{\n" +
            "        \"type\": \"text\",\n" +
            "        \"analyzer\": \"ik_max_word\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"address\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"index\": false\n" +
            "      },\n" +
            "      \"price\":{\n" +
            "        \"type\": \"integer\"\n" +
            "      },\n" +
            "      \"score\":{\n" +
            "        \"type\": \"integer\"\n" +
            "      },\n" +
            "      \"brand\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"city\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"starName\":{\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"business\":{\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"location\":{\n" +
            "        \"type\": \"geo_point\"\n" +
            "      },\n" +
            "      \"pic\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"index\": false\n" +
            "      },\n" +
            "      \"all\":{\n" +
            "        \"type\": \"text\",\n" +
            "        \"analyzer\": \"ik_max_word\"\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}

In the HotelIndexTest test class in hotel-demo, write a unit test to create an index:

@Test
void createHotelIndex() throws IOException {
    // 1. Create the Request object
    CreateIndexRequest request = new CreateIndexRequest("hotel");
    // 2. Prepare the request body: the DSL statement
    request.source(MAPPING_TEMPLATE, XContentType.JSON);
    // 3. Send the request (passing in the request object)
    client.indices().create(request, RequestOptions.DEFAULT);
}

Note 1: import static cn.whu.hotel.constants.HotelConstants.MAPPING_TEMPLATE;
The DSL constant MAPPING_TEMPLATE is brought in with a static import.

Note 2: import org.elasticsearch.client.indices.CreateIndexRequest;
There are two classes with this same name; make sure you import this one.

Finally, query the index library to verify:

# Query the index library to verify
GET /hotel

4.2. Delete the index library

The DSL statement to delete an index library is very simple:

DELETE /hotel

Compared to creating an index library:

  • Request method changed from PUT to DELETE
  • The request path remains unchanged
  • no request parameters

Therefore, the difference in code should be reflected in the Request object. It is still three steps:

  • 1) Create a Request object. This time it is the DeleteIndexRequest object
  • 2) Prepare parameters. Here is no parameter
  • 3) Send the request. Use the delete method instead

In the HotelIndexTest test class in hotel-demo, write a unit test to delete the index:

@Test
void testDeleteHotelIndex() throws IOException {
    // 1. Create the Request object: the argument is the index library name
    DeleteIndexRequest request = new DeleteIndexRequest("hotel");

    // 2. Prepare request parameters: none (deleting only needs the name)

    // 3. Send the request (passing in the request object)
    client.indices().delete(request, RequestOptions.DEFAULT);
}

4.3. Determine whether the index library exists

The essence of judging whether the index library exists is query, and the corresponding DSL is:

GET /hotel

So the Java code flow is similar to deletion. It is still three steps:

  • 1) Create a Request object. This time the GetIndexRequest object
  • 2) Prepare parameters. Here is no parameter
  • 3) Send the request. Use the exists method instead
@Test
void testExistHotelIndex() throws IOException {
    // 1. Create the Request object: the argument is the index library name
    GetIndexRequest request = new GetIndexRequest("hotel");
    // 2. Send the request (passing in the request object)
    boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
    // 3. Print the result
    System.out.println("exists = " + exists);
}

4.4. Summary

The process of JavaRestClient operating elasticsearch is basically similar. The core is the client.indices() method to obtain the operation object of the index library.

The basic steps of index library operation:

  • Initialize RestHighLevelClient
  • Create an XxxIndexRequest, where Xxx is Create, Get, or Delete
  • Prepare the DSL (required for Create; the other operations take no parameters)
  • Send the request by calling RestHighLevelClient#indices().xxx(), where xxx is create, exists, or delete

5. RestClient operation document

To keep it separate from the index library operations, we add another test class, which does two things:

  • Initialize the RestHighLevelClient
  • Our hotel data lives in the database, and we need IHotelService to query it, so we inject that interface
package cn.whu.hotel;

import cn.whu.hotel.service.IHotelService;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.IOException;

@SpringBootTest
public class HotelDocumentTest {

    @Autowired
    private IHotelService service;

    private RestHighLevelClient client;

    @BeforeEach
    public void setUp() {
        client = new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://192.168.141.100:9200") // note: no trailing '/'
                // for a cluster, list multiple hosts separated by commas
        ));
    }

    @AfterEach
    public void tearDown() throws IOException {
        client.close();
    }

    @Test
    public void testInit() {
        System.out.println(service.getById(36934));
    }
}

5.1. New document

We want to query the hotel data from the database and write it into elasticsearch.

5.1.1. Index library entity class

The result of the database query is a Hotel type object. The structure is as follows:

@Data
@TableName("tb_hotel")
public class Hotel {
    @TableId(type = IdType.INPUT)
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String longitude;
    private String latitude;
    private String pic;
}

There are differences with our index library structure:

  • longitude and latitude need to be merged into location

Therefore, we need to define a new type that matches the structure of the index library:

package cn.whu.hotel.pojo;

import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
	
    // Convert a Hotel into a HotelDoc via this constructor
    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
    }
}

5.1.2. Syntax Description

The DSL statement of the newly added document is as follows:

POST /{index_name}/_doc/1
{
    "name": "Jack",
    "age": 21
}

The corresponding Java code follows the same three-step pattern. You can first write a small demo to try it:

@Test
void testAddDocument() throws IOException {
    // 1. Prepare the request object: arguments are the index name and the document id (must be a String)
    IndexRequest request = new IndexRequest("hotel").id("1");

    // 2. Prepare the JSON document
    request.source("{\"name\":\"jack\",\"price\":21}", XContentType.JSON);

    // 3. Send the request
    client.index(request, RequestOptions.DEFAULT);
}

You can see that it is similar to creating an index library, and it is also a three-step process:

  • 1) Create a Request object
  • 2) Prepare the request parameters, which is the JSON document in the DSL
  • 3) Send request

What changes is that here we call the client.xxx() API directly; client.indices() is no longer involved.

5.1.3. Complete code

When we import hotel data, the basic process is the same, but there are a few changes that need to be considered:

  • The hotel data comes from the database, we need to query it first to get the hotel object
  • The hotel object needs to be converted to a HotelDoc object
  • HotelDoc needs to be serialized into json format

Therefore, the overall steps of the code are as follows:

  • 1) Query hotel data Hotel according to id
  • 2) Package Hotel as HotelDoc
  • 3) Serialize HotelDoc to JSON
  • 4) Create IndexRequest, specify the index library name and id
  • 5) Prepare the request parameters, which is the JSON document
  • 6) Send request

In the HotelDocumentTest test class of hotel-demo, write unit tests:

@Test
void testAddDocument() throws IOException {
    // Prepare the data
    // 1. Query the hotel data by id
    Hotel hotel = service.getById(36934L);
    // 2. Convert it to the document type matching the es mapping
    HotelDoc hotelDoc = new HotelDoc(hotel);
    // 3. Serialize it to JSON
    String json = JSON.toJSONString(hotelDoc);

    // 1. Prepare the request object: arguments are the index name and the document id (must be a String)
    IndexRequest request = new IndexRequest("hotel").id(hotel.getId().toString());
    // 2. Prepare the JSON document
    request.source(json, XContentType.JSON);
    // 3. Send the request
    client.index(request, RequestOptions.DEFAULT);
}

5.2. Query documents

5.2.1. Syntax Description

The query DSL statement is as follows:

GET /hotel/_doc/{id}

It's very simple, so the code is roughly divided into two steps:

  • Prepare the Request object
  • send request

However, the purpose of the query is to get the result and parse it into a HotelDoc, so the difficulty lies in parsing the result.

The response is a JSON in which the document sits in the _source attribute, so parsing means taking _source and deserializing it into a Java object.

Similar to before, it is also a three-step process:

  • 1) Prepare the Request object. This time it's a query, so GetRequest
  • 2) Send the request and get the result. Because it is a query, the client.get() method is called here
  • 3) Parse the result by deserializing the JSON

5.2.2. Complete code

In the HotelDocumentTest test class of hotel-demo, write unit tests:

@Test
public void testGetDocumentById() throws IOException {
    // 1. Prepare the request
    GetRequest request = new GetRequest("hotel", "36934");
    // 2. Send the request and get the response
    GetResponse response = client.get(request, RequestOptions.DEFAULT);
    // 3. Parse the response to get the JSON
    String json = response.getSourceAsString();
    // 4. Print the results
    // 4.1 the raw JSON
    System.out.println(json);
    // 4.2 deserialize the JSON into an object and print it
    HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
    System.out.println(hotelDoc);
}

5.3. Delete document

The DSL for removal is something like this:

DELETE /hotel/_doc/{id}

Compared with a query, only the request method changes, from GET to DELETE. As you might guess, the Java code again takes three steps:

  • 1) Prepare the Request object, because it is deleted, this time it is the DeleteRequest object. To specify the index library name and id
  • 2) Prepare parameters, no parameters (no need for this step)
  • 3) Send the request. Because it is deleted, it is the client.delete() method

In the HotelDocumentTest test class of hotel-demo, write unit tests:

@Test
void testDeleteDocument() throws IOException {
    // 1. Prepare the request
    DeleteRequest request = new DeleteRequest("hotel", "36934");
    // 2. Send the request
    client.delete(request, RequestOptions.DEFAULT); // this only deletes from es, not from the database
}

5.4. Modify the document

5.4.1. Syntax Description

We have covered two ways to modify a document:

  • Full modification: the essence is to delete according to the id first, and then add
  • Incremental modification: Modify the specified field value in the document

In the RestClient API, full modification uses exactly the same API as adding; the judgment is based on the id (see the sketch after this list):

  • If the ID already exists when adding, modify it
  • If the ID does not exist when adding, add it
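For instance, a minimal sketch of a full modification (an assumption-level illustration reusing the pattern from testAddDocument, with document 36934 assumed to exist):

@Test
void testFullUpdateDocument() throws IOException {
    // Indexing to an existing id deletes the old document and adds this one
    Hotel hotel = service.getById(36934L);
    hotel.setPrice(500); // hypothetical change before re-indexing
    HotelDoc hotelDoc = new HotelDoc(hotel);

    IndexRequest request = new IndexRequest("hotel").id(hotelDoc.getId().toString());
    request.source(JSON.toJSONString(hotelDoc), XContentType.JSON);
    client.index(request, RequestOptions.DEFAULT); // same call as adding
}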

We won't go into more detail on full modification here; the focus is incremental modification.

The code for an incremental modification again takes three steps (the complete code is in 5.4.2).

Similar to before, it is also a three-step process:

  • 1) Prepare the Request object. This time it is a modification, so an UpdateRequest
  • 2) Prepare the parameters, i.e. the fields to be modified
  • 3) Send the request by calling the client.update() method

5.4.2. Complete code

In the HotelDocumentTest test class of hotel-demo, write unit tests:

@Test
void testUpdateDocument() throws IOException {
    // 1. Prepare the request
    UpdateRequest request = new UpdateRequest("hotel", "36934");
    // 2. Prepare the parameters: varargs of key, value pairs
    request.doc(
            "price", "345",
            "starName", "三钻"
    );
    // 3. Send the request
    client.update(request, RequestOptions.DEFAULT);
}

5.5. Batch import documents

Case requirements: Use BulkRequest to import database data into the index library in batches.

Proceed as follows:

  • Use mybatis-plus to query hotel data

  • Convert the queried hotel data (Hotel) to document type data (HotelDoc)

  • Add documents in batches by using BulkRequest batch processing in JavaRestClient

5.5.1. Syntax Description

The essence of BulkRequest batch processing is to combine multiple ordinary CRUD requests and send them together.

It provides an add method for attaching other requests. The requests that can be added include:

  • IndexRequest, i.e. add
  • UpdateRequest, i.e. modify
  • DeleteRequest, i.e. delete

Therefore, adding multiple IndexRequests to a Bulk performs a batch add; the complete example is in 5.5.2.

In fact, there are still three steps:

  • 1) Create a Request object: here a BulkRequest
  • 2) Prepare the parameters: for batch processing these are other Request objects, here multiple IndexRequests
  • 3) Send the request: here the method called is client.bulk()

When importing the hotel data, we simply wrap the IndexRequest construction in a for loop.

5.5.2. Complete code

In the HotelDocumentTest test class of hotel-demo, write unit tests:

@Test
public void testBulkRequest() throws IOException {
    // 0. Prepare the data: query all hotels from the database
    List<Hotel> list = service.list();
    // 1. Create the Request
    BulkRequest request = new BulkRequest();
    // 2. Prepare the parameters: add one IndexRequest per hotel
    for (Hotel hotel : list) {
        HotelDoc doc = new HotelDoc(hotel);
        request.add(new IndexRequest("hotel")
                .id(doc.getId().toString())
                .source(JSON.toJSONString(doc), XContentType.JSON));
    }
    // 3. Send the request
    client.bulk(request, RequestOptions.DEFAULT);
}

Verify on the Kibana side by querying the documents:

# Search all documents
GET /hotel/_search

For more DSL query features, refer to other blogs or to later chapters.

5.6. Summary

The basic steps of document operation:

  • Initialize the RestHighLevelClient
  • Create an XxxRequest, where Xxx is Index, Get, Update, Delete, or Bulk
  • Prepare the parameters (required for Index, Update, and Bulk)
  • Send the request by calling RestHighLevelClient#xxx(), where xxx is index, get, update, delete, or bulk
  • Parse the result (required for Get)

Origin: https://blog.csdn.net/hza419763578/article/details/131736726