The first glimpse of Spring Cloud's comprehensive study notes

Table of contents

foreword

This article records my notes from studying Dark Horse's SpringCloud course. It is the follow-up to the practical article and covers the principles and usage of service components such as Docker, MQ, and ES. Thank you for reading, and I hope you learn something from it.


Docker

Docker mainly solves the problems and pain points of deployment


Seeing a baby whale for the first time

scene description

Although microservices have many advantages, splitting a system into many fine-grained services brings a lot of trouble to deployment.

  • A distributed system has many dependent components, and conflicts often occur when different components are deployed together
  • Repeating the deployment across hundreds or thousands of services, the environments are not always consistent, and all sorts of problems come up

Due to the large number of components in large-scale projects and the complexity of the operating environment, some problems may be encountered during deployment:

  • Dependencies are complex, and compatibility problems arise easily

  • Development, test, and production environments differ

Among them, with so many service components the dependencies are intricate and conflicts occur easily

For example, a project's deployment may depend on node.js, Redis, RabbitMQ, MySQL, and so on. The function libraries and dependencies these services need are different and may even conflict, which makes deployment very difficult.


Docker's solution to dependency compatibility issues

  • ① Package the application's Libs (function library), Deps (dependency), configuration and application together

  • ② Put each application in an isolated container to run to avoid mutual interference

insert image description here

The packaged application not only contains the application itself but also carries the Libs and Deps it needs, so there is no need to install them on the operating system, and naturally there are no compatibility problems between different applications.

Although the compatibility problem between applications is solved, there are still differences between development, test, and other environments, as well as differences in operating system distributions.

For example, some services run on Ubuntu while others run on CentOS.


The little whale has only just shown its ability, and these problems are naturally no trouble for it either

Docker resolves operating system environment differences

To solve the problem of differences between operating system environments, you first need to understand the structure of an operating system. Take Ubuntu as an example

insert image description here

Structures include:

  • Computer hardware: such as CPU, memory, disk, etc.
  • System kernel: The kernel of all Linux distributions is Linux, such as CentOS, Ubuntu, Fedora, etc. The kernel can interact with computer hardware and provide kernel instructions to operate computer hardware.
  • System applications: applications and function libraries provided by the operating system itself. These function libraries are packages of kernel instructions, which are more convenient to use.

The interaction between an application and the computer works as follows:

1) The application calls the operating system's applications (function libraries) to implement various functions

2) The system function libraries are encapsulations of the kernel instruction set and in turn call kernel instructions

3) Kernel instructions operate the computer hardware


Both Ubuntu and CentOS are based on the Linux kernel; the difference is only in the system applications and the function libraries they provide

If you install an Ubuntu version of the MySQL application to the CentOS system, when MySQL calls the Ubuntu function library, it will find that it cannot find or does not match, and it will report an error

The solution is as follows

  • Docker packages the user program with the system (such as Ubuntu) function library that needs to be called
  • When Docker runs to different operating systems, it is directly based on the packaged function library and runs with the help of the Linux kernel of the operating system

insert image description here


Summary of the problems

How does Docker solve the compatibility problems of complex dependencies and dependencies of different components in large-scale projects?

  • Docker allows applications, dependencies, function libraries, and configurations to be packaged together during development to form a portable image
  • Docker applications run in containers and use the sandbox mechanism to isolate them from each other

How does Docker solve the problem of differences in development, testing, and production environments?

  • The Docker image contains a complete operating environment, including system function libraries, and only depends on the Linux kernel of the system, so it can run on any Linux operating system

Docker is a technology for quickly delivering and running applications, with the following advantages:

  • The program, its dependencies, and its runtime environment can be packaged together into an image that can be migrated to any Linux operating system
  • A sandbox mechanism forms isolated containers at runtime, so applications do not interfere with each other
  • Both startup and removal can be done with a single command, which is convenient and quick

Differences between Docker and virtual machines:

  • Docker only encapsulates the function library and does not simulate a complete operating system

  • docker is a system process; a virtual machine is an operating system within an operating system

  • docker is small in size, fast in startup speed and good in performance; the virtual machine is large in size, slow in startup speed and average in performance


Docker Architecture

There are several important concepts in Docker:

Image : Docker packages applications and their required dependencies, function libraries, environments, configurations, and other files together, called images.

Container : The process formed after the application in the image runs is a container , but Docker will isolate the container process, which is invisible to the outside world.

insert image description here


DockerHub

There are so many open source applications that packaging them is often a duplication of effort. In order to avoid these duplication of efforts, people will put their own packaged application images, such as Redis and MySQL images, on the network for shared use, just like GitHub's code sharing.

  • DockerHub: DockerHub is an official Docker image hosting platform. Such a platform is called a Docker Registry.

  • There are also public services similar to DockerHub in China, such as NetEase Cloud Mirror Service , Alibaba Cloud Mirror Library , etc.

On the one hand, you can share your own image to DockerHub, on the other hand, you can also pull the image from DockerHub:

insert image description here


If we want to use Docker to operate images and containers, we must install Docker.

Docker is a program of CS architecture, which consists of two parts:

  • Server (server): Docker daemon process, responsible for processing Docker instructions, managing images, containers , etc.

  • Client (client): Send instructions to the Docker server through commands or RestAPI. Commands can be sent to the server locally or remotely .

insert image description here


Docker installation

uninstall (optional)

If you have installed an old version of Docker before, you can use the following command to uninstall it:

yum remove docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-selinux \
                  docker-engine-selinux \
                  docker-engine \
                  docker-ce

install docker

First of all, you need to connect the virtual machine to the Internet and install the yum tool

yum install -y yum-utils \
           device-mapper-persistent-data \
           lvm2 --skip-broken

Then update the local mirror source:

# 设置docker镜像源
yum-config-manager \
    --add-repo \
    https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
    
sed -i 's/download.docker.com/mirrors.aliyun.com\/docker-ce/g' /etc/yum.repos.d/docker-ce.repo

yum makecache fast

Then enter the command:

yum install -y docker-ce

docker-ce is a free version for the community. Wait for a while, docker will be installed successfully.


start docker

Docker applications use many different ports, and modifying the firewall rules one by one is very troublesome, so it is recommended to simply disable the firewall!

Before starting docker, be sure to disable the firewall!! Or open its port 2375

// 永久开放指定端口
firewall-cmd --add-port=2375/tcp --permanent
//重启防火墙
firewall-cmd --reload
# 关闭
systemctl stop firewalld
# 禁止开机启动防火墙
systemctl disable firewalld

(Choose one: either disable the firewall or open the port)

Start docker by command:

systemctl start docker  # 启动docker服务

systemctl stop docker  # 停止docker服务

systemctl restart docker  # 重启docker服务

Then enter the command to view the docker version:

docker -v

As shown in the picture:

insert image description here


Configuring an image accelerator

Like GitHub, Docker's default image registry is hosted abroad and access is very slow. You can switch to Alibaba Cloud's or NetEase's mirror instead.

For example, Alibaba Cloud's mirror is used here.
Refer to Alibaba Cloud's image accelerator documentation: https://cr.console.aliyun.com/cn-hangzhou/instances/mirrors

Simply copy, paste, and press Enter to configure the accelerator.
For users whose Docker client version is greater than 1.10.0:

sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
    
    
  "registry-mirrors": ["https://97wchhmj.mirror.aliyuncs.com"]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker

Docker Basics

Mirror related operations

Image name composition

  • A mirror name generally consists of two parts: [repository]:[tag].
  • When no tag is specified, the default is latest, representing the latest version of the image
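
For example, mysql:5.7.25 refers to the 5.7.25 tag of the mysql repository, while plain mysql is equivalent to mysql:latest — a quick illustration:

docker pull mysql:5.7.25   # pull a specific tag
docker pull mysql          # no tag specified, equivalent to mysql:latest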

insert image description here


Basic image commands

insert image description here


Case 1: Pull the nginx image from DockerHub and view it
1) First search for the nginx image in an image registry, for example DockerHub:
(it being a little slow is normal)
insert image description here

insert image description here


2) According to the viewed image name, pull the image you need, and use the command: docker pull nginx

If no tag is specified, the latest version will be defaulted

insert image description here


3) Use the command: docker images to view the pulled images

insert image description here


Case 2 Use docker save to export the nginx image to the disk, and then load it back through load

Command format:

docker save -o [target file name] [image name]

Use docker save to export the image to disk

Run the command:

docker save -o nginx.tar nginx:latest

insert image description here

First delete the local nginx image:

docker rmi nginx:latest

Then run the command to load the local file:

docker load -i nginx.tar

result:
insert image description here


Container related operations

Containers have three states:

  • Running: the process is running normally
  • Paused: the process is suspended, the CPU no longer schedules it, but its memory is not released
  • Stopped: the process has terminated, and the memory, CPU and other resources it occupied are reclaimed
    insert image description here
    docker run: create and run a container (it ends up in the running state)
    docker pause: pause a running container
    docker unpause: resume a container from the paused state
    docker stop: stop a running container
    docker start: start a stopped container again
    docker rm: delete a container
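
Put together, a typical lifecycle for an nginx container named mn (mn is the container name used in the case below) might look like this:

docker run --name mn -p 80:80 -d nginx   # create and run (running state)
docker pause mn                          # running -> paused
docker unpause mn                        # paused -> running
docker stop mn                           # running -> stopped
docker start mn                          # stopped -> running
docker rm -f mn                          # force-remove the container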

Case: Create and run an nginx container

insert image description here
The meaning of the last two parameters:

  • -d: run the container in the background
  • nginx: the image name, here nginx

The container port is mapped to a host port. The mapped host port can vary and does not have to match the container port; it could also be 8080 here.
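
The full command from the screenshot is not reproduced in the text; based on the parameters explained here and on the later data-volume case, it is presumably:

docker run --name mn -p 80:80 -d nginx

--name mn names the container mn, and -p 80:80 maps host port 80 to container port 80 (the host side could just as well be 8080:80).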

insert image description here

docker ps  # list containers; add -a to also show stopped ones
docker logs <container name>  # view a container's logs

View the container log, add the -f parameter to view the log continuously


Case - entering a container

To enter the nginx container we just created, the command is:

docker exec -it mn bash

Command interpretation:

  • docker exec: enter the container and execute a command

  • -it : Create a standard input and output terminal for the currently entered container, allowing us to interact with the container

  • mn : the name of the container to enter

  • bash: the command executed after entering the container, bash is a linux terminal interactive command

insert image description here

It is not recommended to modify files in the container


Hands-on demo: running Redis in a container

insert image description here

insert image description here

insert image description here

Enter the redis client

insert image description here

store key value

insert image description here

Connect RDM to the server's IP to reach Redis
and view the stored key values

insert image description here
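
The commands for this demo are only shown in the screenshots; a sketch of what they presumably look like (the container name mr and the key/value used are assumptions):

docker pull redis                           # pull the redis image
docker run --name mr -p 6379:6379 -d redis  # create and run the container
docker exec -it mr redis-cli                # enter the redis client inside the container
set num 666                                 # store a key/value pair
get num                                     # read it back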


data volume

From the case above we can see the problem of coupling between the container and its data: there is no vi editor inside the container, so modifying files is very troublesome.

To solve this, the data must be decoupled from the container, which is what data volumes are for.


What is a data volume

A data volume (volume) is a virtual directory that points to a directory in the host file system.

The data volume is the bridge between the server host and the container

insert image description here
Once a data volume is mounted, every operation on the container's directory acts on the corresponding host directory of the data volume.
So operating on the host directory /var/lib/docker/volumes/html is equivalent to operating on /usr/share/nginx/html inside the container.
And when the container is deleted, the data volume is not deleted; a new container mounted on the same data volume is connected to the data again.


Basic data volume syntax
Data volume commands are second-level commands of docker.

The basic syntax of data volume operations is as follows:

docker volume [COMMAND]

docker volume is the data volume command; the command that follows determines the operation:

  • create: create a volume
  • inspect: show information about one or more volumes
  • ls: list all volumes
  • prune: delete unused volumes
  • rm: delete one or more specified volumes
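
As a quick illustration, a typical session might look like the following (the volume name html is just an example; the mount point shown is where Docker stores named volumes by default):

docker volume create html     # create a volume named html
docker volume ls              # list volumes; html should appear
docker volume inspect html    # shows the host mount point, e.g. /var/lib/docker/volumes/html/_data
docker volume rm html         # remove it again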

insert image description here

Summary

Purpose of data volumes:

Separate containers from data and decouple them, making it easy to operate on data inside containers and keeping the data safe.

Data volume operations:

  • docker volume create: create a data volume
  • docker volume ls: list all data volumes
  • docker volume inspect: show details of a data volume, including the associated host directory
  • docker volume rm: delete the specified data volume(s)
  • docker volume prune: delete all unused data volumes

Mounting a data volume in practice
When creating a container, the -v parameter mounts a data volume to a directory inside the container. The command format is as follows:

docker run \
  --name mn \
  -v html:/root/html \
  -p 8080:80 \
  nginx

Here -v is the option that mounts the data volume:

  • -v html:/root/html : mount the html data volume to the /root/html directory inside the container

Case - mounting a data volume for nginx
Requirement: create an nginx container and modify the content of index.html in the html directory inside the container

Analysis : In the previous case, we entered the nginx container and already knew the location of the nginx html directory /usr/share/nginx/html, we need to mount this directory to the html data volume to facilitate the operation of its contents.

Tip : Use the -v parameter to mount the data volume when running the container

step:

① Create a container and mount the data volume to the HTML directory in the container

docker run --name mn -v html:/usr/share/nginx/html -p 80:80 -d nginx

② Enter the location of the html data volume and modify the HTML content

# 查看html数据卷的位置
docker volume inspect html
# 进入该目录
cd /var/lib/docker/volumes/html/_data
# 修改文件
vi index.html

insert image description here

There is no need to restart the container; the change takes effect immediately.

Note: when mounting a data volume into a container, if the data volume does not exist yet, Docker will create it automatically.


Case - Mount local directory for MySQL
The container can not only mount data volumes, but also directly mount to the host directory. The relationship is as follows:

  • With data volume mode: host directory --> data volume --> container directory
  • Direct mount mode: host directory —> container directory

insert image description here

Syntax:

The syntax of directory mount and data volume mount is similar:

  • -v [host directory]:[container directory]
  • -v [host file]:[file in container]

Requirement : Create and run a MySQL container, mount the host directory directly to the container

The implementation idea is as follows:

1) Upload the mysql.tar file from the course materials to the virtual machine and load it as an image with the docker load command

2) Create directory /tmp/mysql/data

3) Create a directory /tmp/mysql/conf, and upload the hmy.cnf file provided by the pre-class materials to /tmp/mysql/conf

4) Go to DockerHub to check the information, create and run the MySQL container, and require:

① Mount /tmp/mysql/data to the data storage directory in the mysql container

② Mount /tmp/mysql/conf/hmy.cnf to the configuration file of the mysql container

③ Set MySQL password
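
The course leaves the final command as an exercise; a run command satisfying the three requirements would look roughly like the sketch below (the MySQL data directory /var/lib/mysql and config directory /etc/mysql/conf.d are the defaults documented on DockerHub; the password 123 matches the later Compose file and is just an example):

docker run \
  --name mysql \
  -e MYSQL_ROOT_PASSWORD=123 \
  -p 3306:3306 \
  -v /tmp/mysql/conf/hmy.cnf:/etc/mysql/conf.d/hmy.cnf \
  -v /tmp/mysql/data:/var/lib/mysql \
  -d \
  mysql:5.7.25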


Summarize

In the command of docker run, the file or directory is mounted into the container through the -v parameter:

  • -v volume name: directory inside the container
  • -v host file: file in the container
  • -v host directory: container directory

Data volume mounting vs. direct directory mounting:

  • Data volume mounting has lower coupling; Docker manages the directory, but the directory is buried deep and hard to find
  • Directory mounting has higher coupling; we have to manage the directory ourselves, but it is easy to find and inspect

Dockerfile custom images

Common images can be found on DockerHub, but for projects we write ourselves we have to build the images ourselves.

Image structure
An image is a package of an application together with the system function libraries, environment, configuration, and dependencies it needs.
Simply put, an image is a file formed by taking the system function libraries and runtime environment as a base, adding the application's files, configuration files, and dependency files, plus a startup script, and packaging them all together.

An image has a layered structure, and each layer is called a Layer:
BaseImage layer: contains the basic system function libraries, environment variables, and file system
Entrypoint: the entry point, i.e. the command that starts the application in the image
Other layers: add dependencies, install programs, and complete the installation and configuration of the whole application on top of the BaseImage

Take MySQL as an example to see the composition structure of the image:
insert image description here


When building a custom image, there is no need to copy and package each file.

We only need to add the composition of our image, which BaseImages are needed, what files need to be copied, what dependencies need to be installed, and what the startup script is to the Dockerfile.

Use Dockerfile to describe build information

Dockerfile is a text file, which contains instructions (Instructions) one by one , using instructions to explain what operations to perform to build the image. Each instruction will form a Layer

insert image description here

Build a java project as a demonstration below

① Step 1: Create an empty folder docker-demo and put the project's jar package, the JDK archive, and a newly created Dockerfile into that directory

② Write the Dockerfile, adding all the information needed to build the image to it

For example as follows

# 指定基础镜像
FROM ubuntu:16.04
# 配置环境变量,JDK的安装目录
ENV JAVA_DIR=/usr/local

# 拷贝jdk和java项目的包
COPY ./jdk8.tar.gz $JAVA_DIR/
COPY ./docker-demo.jar /tmp/app.jar

# 安装JDK
RUN cd $JAVA_DIR \
 && tar -xf ./jdk8.tar.gz \
 && mv ./jdk1.8.0_144 ./java8

# 配置环境变量
ENV JAVA_HOME=$JAVA_DIR/java8
ENV PATH=$PATH:$JAVA_HOME/bin

# 暴露端口
EXPOSE 8090
# 入口,java项目的启动命令
ENTRYPOINT java -jar /tmp/app.jar

③ In the docker-demo directory, run the command to build the image

docker build -t javaweb:1.0 .

insert image description here

Run the built image

docker run --name web -p 8090:8090 -d javaweb:1.0

insert image description here


Building Java projects based on java8

Although you can add whatever installation packages you need when building an image, that is rather troublesome. So in most cases we build on top of base images that already have some software installed.

For example, an image for a Java project can be built on top of a base image that already has the JDK prepared.
insert image description here

For example, the following requirement: based on the java:8-alpine image, build a Java project into an image

The implementation idea is as follows:

  • ① Create a new empty directory, and then create a new file in the directory, named Dockerfile

  • ② Copy the docker-demo.jar provided by the pre-class materials to this directory

  • ③ Write Dockerfile:

    • a) Based on java:8-alpine as the base image

    • b) Copy app.jar into the image

    • c) Expose the port

    • d) Write entry ENTRYPOINT

      The content is as follows:

      FROM java:8-alpine
      COPY ./app.jar /tmp/app.jar
      EXPOSE 8090
      ENTRYPOINT java -jar /tmp/app.jar
      
  • ④ Use the docker build command to build the image

  • ⑤ Use docker run to create a container and run it

summary

  1. The essence of Dockerfile is a file that describes the construction process of the image through instructions

  2. The first line of the Dockerfile must be FROM to build from a base image

  3. The base image can be a base OS like Ubuntu, CentOS. It can also be an image made by others, for example: java:8-alpine


Docker-Compose

When there are many microservices, it is impossible for us to create and run containers one by one. To quickly deploy application services, we need to use Compose files

Docker Compose can help us quickly deploy distributed applications based on Compose files without manually creating and running containers one by one

Compose file is a text file that defines how each container in the cluster runs through instructions (similar to Dockerfile)

insert image description here

The Compose file above describes a project that contains two containers:

  • mysql: a container built from the mysql:5.7.25 image, with two directories mounted
  • web: a container built ad hoc from an image produced by docker build, with port 8090 mapped
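
The file in the screenshot is not reproduced in the text; a minimal sketch matching that description might look like this (the service names, host paths, and the build context ./web are assumptions for illustration):

version: "3.2"

services:
  mysql:
    image: mysql:5.7.25
    environment:
      MYSQL_ROOT_PASSWORD: 123
    volumes:
      - "/tmp/mysql/data:/var/lib/mysql"
      - "/tmp/mysql/conf/hmy.cnf:/etc/mysql/conf.d/hmy.cnf"
  web:
    build: ./web
    ports:
      - "8090:8090"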

Install Compose

1. Download compose

# 安装
curl -L https://github.com/docker/compose/releases/download/1.23.1/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose

But because downloading from the external network is very slow, the compose package is provided here directly, which is faster:
Compose download address
extraction code: 8tfx

Upload the downloaded file to the /usr/local/bin/ directory.


2. Modify file permissions

# 修改权限
chmod +x /usr/local/bin/docker-compose

insert image description here


3. Add bash auto-completion
Run the following commands in order:

echo "199.232.68.133 raw.githubusercontent.com" >> /etc/hosts

 systemctl restart docker

curl -L http://raw.githubusercontent.com/docker/compose/1.29.1/contrib/completion/bash/docker-compose > /etc/bash_completion.d/docker-compose

Just copy, paste, and run them in order. This is the order I settled on after stepping on some pits; otherwise the download keeps hanging.

Final result:
insert image description here


Deploy a microservice cluster

Requirement : Deploy the previously learned cloud-demo microservice cluster using Docker Compose

Implementation idea :

① Write the docker-compose file to build the project image

② Modify your own cloud-demo project, and name the database and nacos address as the service name in docker-compose

③ Use the maven packaging tool to package each microservice in the project as app.jar

④ Copy the packaged app.jar to each corresponding subdirectory in cloud-demo

⑤ Upload cloud-demo to the virtual machine and use docker-compose up -d to deploy


Compose file writing

Each microservice prepares a separate directory:

insert image description here
(A Dockerfile describes how to build a custom image, while docker-compose describes how to build the images and run the containers.)

The build parameter in docker-compose points to the location of a Dockerfile, and the image is built from that Dockerfile.

version: "3.2"

services:
  nacos:
    image: nacos/nacos-server
    environment:
      MODE: standalone
    ports:
      - "8848:8848"
  mysql:
    image: mysql:5.7.25
    environment:
      MYSQL_ROOT_PASSWORD: 123
    volumes:
      - "$PWD/mysql/data:/var/lib/mysql"
      - "$PWD/mysql/conf:/etc/mysql/conf.d/"
  userservice:
    build: ./user-service
  orderservice:
    build: ./order-service
  gateway:
    build: ./gateway
    ports:
      - "10010:10010"

Modify the microservice configuration

Because the microservices will later be deployed as Docker containers, and containers reach each other not by IP address but by container name, we change the mysql and nacos addresses of the order-service, user-service, and gateway services to be accessed by container name.

As follows:

spring:
  datasource:
    url: jdbc:mysql://mysql:3306/cloud_order?useSSL=false
    username: root
    password: 123
    driver-class-name: com.mysql.jdbc.Driver
  application:
    name: orderservice
  cloud:
    nacos:
      server-addr: nacos:8848 # nacos服务地址

Replace the original localhost with the container name nacos, and the MySQL localhost also needs to be replaced


Packaging

Next we need to package each microservice. Because the jar name in the Dockerfile is app.jar, each microservice must use this name.

It can be achieved by modifying the package name in pom.xml, which needs to be modified for each microservice:

<build>
  <!-- 服务打包的最终名称 -->
  <finalName>app</finalName>
  <plugins>
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
    </plugin>
  </plugins>
</build>

insert image description here


Copy the jar package to the deployment directory

The compiled and packaged app.jar file needs to be placed in the same directory as the Dockerfile. Note: The app.jar of each microservice is placed in the directory corresponding to the service name, don't make a mistake.

For example user-service:
insert image description here


deploy

Finally, we need to upload the entire cloud-demo folder to the virtual machine and deploy it through DockerCompose.

Then enter the cloud-demo directory, and run the following command:

docker-compose up -d

insert image description here

Finally, note that if nacos starts too slowly, the other services will fail to connect and report errors. In that case, restart the services other than nacos and they will connect successfully.


Docker image registries

Image registries (Docker Registry), such as DockerHub, come in two forms: public and private. Enterprises generally build their own private image registries.

The following shows how to build a private Docker Registry locally,
based on the official Docker Registry image provided by Docker.


Simplified registry
The official Docker Registry is a basic version of an image registry with complete repository-management functions but no graphical interface.

The construction method is relatively simple, the command is as follows:

docker run -d \
    --restart=always \
    --name registry	\
    -p 5000:5000 \
    -v registry-data:/var/lib/registry \
    registry

The command mounts a data volume registry-data to the /var/lib/registry directory in the container, which is the directory where the private mirror library stores data.


The version with a graphical interface
Use DockerCompose to deploy DockerRegistry with a graphical interface, the command is as follows:

version: '3.0'
services:
  registry:
    image: registry
    volumes:
      - ./registry-data:/var/lib/registry
  ui:
    image: joxit/docker-registry-ui:static
    ports:
      - 8080:80
    environment:
      - REGISTRY_TITLE=本地私有仓库
      - REGISTRY_URL=http://registry:5000
    depends_on:
      - registry

Configure the Docker trust address
Our private server uses the http protocol, which is not trusted by Docker by default, so a configuration is required:

# 打开要修改的文件
vi /etc/docker/daemon.json
# 添加内容:
"insecure-registries":["http://本机ip:8080"]
# 重加载
systemctl daemon-reload
# 重启docker
systemctl restart docker
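
If you configured the Alibaba Cloud accelerator earlier, daemon.json already exists and must stay valid JSON, so the new key is added alongside the existing one. A sketch of the combined file (the IP 192.168.150.101 is only a placeholder for your own host IP):

{
  "registry-mirrors": ["https://97wchhmj.mirror.aliyuncs.com"],
  "insecure-registries": ["http://192.168.150.101:8080"]
}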

insert image description here

insert image description here

Finally, when the browser visits the warehouse address configured by itself, you will see the graphical interface


Push and pull images
To push an image to a private image service, you must first tag it. The steps are as follows:

① Re-tag the local image, prefixing the name with the private registry address: <registry IP>:8080/

docker tag nginx:latest <your-ip>:8080/nginx:1.0

② Push the image

docker push <your-ip>:8080/nginx:1.0

③ Pull the image

docker pull <your-ip>:8080/nginx:1.0

insert image description here


asynchronous communication

Getting to know MQ

There are two ways of communication between microservices: synchronous and asynchronous:

Synchronous communication: Like making a phone call , real-time response is required.

Asynchronous communication: Just like sending a message , there is no need to reply immediately.


synchronous communication

Let's take the user shopping business as an example
insert image description here

The Feign call is a synchronous method. Although the call can get the result in real time, there are the following problems:
insert image description here


Advantages of synchronous calls :

  • Time-sensitive, results can be obtained immediately

Problems with synchronous calls :

  • High coupling
  • Degraded performance and throughput
  • Additional resource consumption
  • Risk of cascading failures

asynchronous communication

Asynchronous calls can avoid the above problems:

Still take the above user shopping business as an example.
In order to decouple the event publisher and subscriber, the two do not communicate directly, but have a middleman (Broker). Publishers publish events to Broker, regardless of who subscribes to events. Subscribers subscribe to events from Broker and don't care who sends the message.

insert image description here


Another important function is solving high-concurrency problems by peak shaving.
insert image description here
A flood of requests arriving at one point in time is put into the Broker, and the backend processes them at its normal pace, preventing the requests from hitting the service directly. The pressure is carried by the Broker, which acts as a buffer layer.


The Broker is something like a data bus: every service sends and receives its data through this bus. The bus acts like a protocol, making communication between services standard and controllable.
insert image description here


advantage:

  • Throughput improvement: there is no need to wait for subscribers to complete processing, and the response is faster

  • Fault isolation: the service is not called directly, and there is no cascading failure problem

  • There is no blocking between calls, which will not cause invalid resource occupation

  • The degree of coupling is extremely low, each service can be flexibly plugged and replaced

  • Traffic peak clipping: No matter how much the traffic of the published event fluctuates, it will be received by Broker, and subscribers can process events at their own speed

Disadvantages:

  • The architecture is more complex, and the business has no obvious flow line, making it harder to trace and manage
  • It depends on the Broker's reliability, security, and performance

Synchronous calls are used in most scenarios because their results are needed immediately; asynchronous calls are mainly used for high-concurrency business.


MQ common framework

MQ is short for Message Queue, literally a queue that stores messages; it plays the role of the Broker in an event-driven architecture.

More common MQ implementations:

  • ActiveMQ
  • RabbitMQ
  • RocketMQ
  • Kafka

Comparison of several common MQs:

|                        | RabbitMQ                | ActiveMQ                          | RocketMQ        | Kafka               |
| ---------------------- | ----------------------- | --------------------------------- | --------------- | ------------------- |
| Company/Community      | Rabbit                  | Apache                            | Alibaba         | Apache              |
| Development language   | Erlang                  | Java                              | Java            | Scala & Java        |
| Protocol support       | AMQP, XMPP, SMTP, STOMP | OpenWire, STOMP, REST, XMPP, AMQP | custom protocol | custom protocol     |
| Availability           | high                    | average                           | high            | high                |
| Single-node throughput | average                 | poor                              | high            | very high           |
| Message latency        | microseconds            | milliseconds                      | milliseconds    | within milliseconds |
| Message reliability    | high                    | average                           | high            | average             |

Pursuit of availability : Kafka, RocketMQ, RabbitMQ

Pursuit of reliability : RabbitMQ, RocketMQ

Pursuit of throughput : RocketMQ, Kafka

Pursue low message latency : RabbitMQ, Kafka


RabbitMQ Quick Start

RabbitMQ overview and installation

① Pull the RabbitMQ image

docker pull rabbitmq:3-management

② Run the MQ container

The account and password for the management interface are set by yourself; any value works.

docker run \
 -e RABBITMQ_DEFAULT_USER=<management username> \
 -e RABBITMQ_DEFAULT_PASS=<management password> \
 --name mq \
 --hostname mq1 \
 -p 15672:15672 \
 -p 5672:5672 \
 -d \
 rabbitmq:3-management

③ Open port
Open MQ port 5672 and its management port 15672

 sudo firewall-cmd --zone=public --permanent --add-port=15672/tcp
 sudo firewall-cmd --zone=public --permanent --add-port=5672/tcp
 firewall-cmd --reload

insert image description here

If you are on a cloud server, also add the open-port rules to its firewall / security group.


Then visit <ip>:15672 to access the MQ management interface; log in with the account and password you just set.

insert image description here

The basic structure of MQ:
insert image description here


Common Message Model

Two queue models (basic queue and work queue) and three publish/subscribe models (fanout, direct, topic)

insert image description here


insert image description here
The official HelloWorld is implemented based on the most basic message queue model, including only three roles:

  • publisher: message publisher, send the message to the queue queue
  • queue: message queue, responsible for accepting and caching messages
  • consumer: Subscribe to the queue and process messages in the queue

quick start

Next, implement the HelloWorld basic message queue

Ideas:

  • establish connection
  • Create Channels
  • declare queue
  • Send a message
  • Close connections and channels

publisher implementation

Code ideas:

  • establish connection
  • Create Channels
  • declare queue
  • Send a message
  • Close connections and channels
public class PublisherTest {
    
    
    @Test
    public void testSendMessage() throws IOException, TimeoutException {
    
    
        // 1.建立连接
        ConnectionFactory factory = new ConnectionFactory();
        // 1.1.设置连接参数,分别是:主机名、端口号、vhost、用户名、密码
        factory.setHost("192.168.150.101");
        factory.setPort(5672);
        factory.setVirtualHost("/");
        factory.setUsername("itcast");
        factory.setPassword("123321");
        // 1.2.建立连接
        Connection connection = factory.newConnection();

        // 2.创建通道Channel
        Channel channel = connection.createChannel();

        // 3.创建队列
        String queueName = "simple.queue";
        channel.queueDeclare(queueName, false, false, false, null);

        // 4.发送消息
        String message = "hello, rabbitmq!";
        channel.basicPublish("", queueName, null, message.getBytes());
        System.out.println("发送消息成功:【" + message + "】");

        // 5.关闭通道和连接
        channel.close();
        connection.close();

    }
}

Visit the management interface of rabbitmq after sending, and you can see the sent message

insert image description here


consumer implementation

Code idea:

  • establish connection
  • Create Channels
  • declare queue
  • subscribe news
public class ConsumerTest {
    
    

    public static void main(String[] args) throws IOException, TimeoutException {
    
    
        // 1.建立连接
        ConnectionFactory factory = new ConnectionFactory();
        // 1.1.设置连接参数,分别是:主机名、端口号、vhost、用户名、密码
        factory.setHost("192.168.150.101");
        factory.setPort(5672);
        factory.setVirtualHost("/");
        factory.setUsername("itcast");
        factory.setPassword("123321");
        // 1.2.建立连接
        Connection connection = factory.newConnection();

        // 2.创建通道Channel
        Channel channel = connection.createChannel();

        // 3.创建队列
        String queueName = "simple.queue";
        channel.queueDeclare(queueName, false, false, false, null);

        // 4.订阅消息
        channel.basicConsume(queueName, true, new DefaultConsumer(channel){
    
    
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) throws IOException {
    
    
                // 5.处理消息
                String message = new String(body);
                System.out.println("接收到消息:【" + message + "】");
            }
        });
        System.out.println("等待接收消息。。。。");
    }
}

Once consumed, the message no longer exists; it is read once and then gone.

insert image description here


The message sending process of the basic message queue :
1. Establish a connection
2. Create a channel
3. Use the channel to declare the queue
4. Use the channel to send messages to the queue


The message receiving process of the basic message queue :
1. Establish a connection
2. Create a channel
3. Use the channel to declare the queue
4. Define the consumer's consumption behavior handleDelivery()
5. Use the channel to bind the consumer to the queue


Both the sender and the receiver declare the connection, channel, and queue as a kind of double insurance: it is not known which side's code runs first, but whichever runs, the connection, channel, and queue must already exist.


SpringAMQP

What is Spring AMQP

insert image description here

Spring AMQP provides three functions:

  • Automatic declaration of queues, exchanges and their bindings
  • Annotation-based listener mode to receive messages asynchronously
  • Encapsulates the RabbitTemplate tool for sending messages

Basic Queue simple queue model

The process is as follows:
1. Introduce the spring-amqp dependency in the parent project (once the parent introduces it, the publisher and consumer submodules do not need to introduce it again)
2. In the publisher service, use RabbitTemplate to send messages to the simple.queue queue
3. In the consumer service, write the consumption logic and bind it to the simple.queue queue


message sending

① Introduce the dependency in the parent project mq-demo

<!--AMQP依赖,包含RabbitMQ-->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-amqp</artifactId>
</dependency>

② Add the MQ connection settings to the yml configuration file

spring:
  rabbitmq:
    host: 192.168.150.101 # 主机名
    port: 5672 # 端口
    virtual-host: / # 虚拟主机
    username: rabbitmq # 用户名
    password: 123456 # 密码

③ In the publisher service, write a test class SpringAmqpTest and use RabbitTemplate to send the message

The premise is that the queue (simple.queue) in the code must have been created

@RunWith(SpringRunner.class)
@SpringBootTest
public class SpringAmqpTest {
    
    

    @Autowired
    private RabbitTemplate rabbitTemplate;

    @Test
    public void testSimpleQueue() {
    
    
        // 队列名称
        String queueName = "simple.queue";
        // 消息
        String message = "hello, spring amqp!";
        // 发送消息
        rabbitTemplate.convertAndSend(queueName, message);
    }
}

message reception

① Add the connection settings to the yml configuration file

spring:
  rabbitmq:
    host: 192.168.150.101 # 主机名
    port: 5672 # 端口
    virtual-host: / # 虚拟主机
    username: rabbitmq # 用户名
    password: 123456 # 密码

② In the consumer service's listener package, create a new class SpringRabbitListener

@Component
public class SpringRabbitListener {
    
    

    @RabbitListener(queues = "simple.queue")
    public void listenSimpleQueueMessage(String msg) throws InterruptedException {
    
    
        System.out.println("spring 消费者接收到消息:【" + msg + "】");
    }
}

Finally start the test

Start the consumer service, then run the test code in the publisher service, and send MQ messages


Summarize

What is AMQP?
A protocol for message communication between applications, independent of language and platform.

How does SpringAMQP send messages?
①Introduce the starter dependency of amqp
②Configure the RabbitMQ address
③Use the convertAndSend method of RabbitTemplate

How does SpringAMQP receive messages?
①Introduce the starter dependency of amqp
②Configure the RabbitMQ address
③Define the class, add the @Component annotation
④Declare the method in the class, add the @RabbitListener annotation, the method parameter is the message

Note: Once the message is consumed, it will be deleted from the queue. RabbitMQ has no message backtracking function


Work Queue work queue model

When message processing is time-consuming, messages may be produced far faster than they are consumed. Over time, more and more messages pile up and cannot be handled in time.

In this situation the work queue model can be used: multiple consumers jointly process the messages, and the speed improves greatly.
insert image description here
Simply put, multiple consumers are bound to one queue and consume the messages in that queue together.


send message

We send it in a loop to simulate the accumulation of a large number of messages.

Add a test method to the SpringAmqpTest class in the publisher service:

/**
     * workQueue
     * 向队列中不停发送消息,模拟消息堆积。
     */
@Test
public void testWorkQueue() throws InterruptedException {
    
    
    // 队列名称
    String queueName = "simple.queue";
    // 消息
    String message = "hello, message_";
    for (int i = 0; i < 50; i++) {
    
    
        // 发送消息
        rabbitTemplate.convertAndSend(queueName, message + i);
        Thread.sleep(20);
    }
}

message reception

To simulate multiple consumers binding to the same queue, we add two new methods to the SpringRabbitListener of the consumer service:

@RabbitListener(queues = "simple.queue")
public void listenWorkQueue1(String msg) throws InterruptedException {
    
    
    System.out.println("消费者1接收到消息:【" + msg + "】" + LocalTime.now());
    Thread.sleep(20);
}

@RabbitListener(queues = "simple.queue")
public void listenWorkQueue2(String msg) throws InterruptedException {
    
    
    System.err.println("消费者2........接收到消息:【" + msg + "】" + LocalTime.now());
    Thread.sleep(200);
}

Note that consumer 2 sleeps 200 milliseconds per message (consumer 1 only 20), simulating tasks that take time.


run test

After starting the ConsumerApplication, execute the sending test method testWorkQueue just written in the publisher service.

It can be seen that consumer 1 quickly completed its 25 messages. Consumer 2 is slowly processing its own 25 messages.

That is to say, the message is evenly distributed to each consumer, without taking into account the processing power of the consumer. This is obviously problematic.


Let the capable do more work
By default, messages are prefetched: they are distributed to consumers first and only processed afterwards, regardless of whether the consumer has finished its previous messages.
Spring has a simple configuration that solves this. Modify the consumer service's application.yml and add the configuration:

spring:
  rabbitmq:
    listener:
      simple:
        prefetch: 1 # 每次只能获取一条消息,处理完成才能获取下一个消息

Restart the consumer application's startup class, otherwise the configuration will not take effect.


Summarize

Use of the Work model:

  • Multiple consumers are bound to a queue, and the same message will only be processed by one consumer
  • Control the number of messages prefetched by consumers by setting prefetch

publish/subscribe

insert image description here

As you can see, in the publish/subscribe model there is an additional exchange role, and the process changes slightly:

  • Publisher: the producer, i.e. the program that sends messages, but it no longer sends them to a queue; it sends them to X (the exchange)
  • Exchange: the exchange, X in the figure. On one side it receives messages sent by producers; on the other side it knows how to handle each message, e.g. deliver it to a particular queue, deliver it to all queues, or discard it. How it behaves depends on the type of exchange. There are 3 types:
    • Fanout: broadcast, hand the message to every queue bound to the exchange
    • Direct: routing, deliver the message to the queues whose routing key matches the specified one
    • Topic: wildcard, deliver the message to the queues whose routing pattern matches
  • Consumer: the consumer subscribes to a queue, same as before
  • Queue: the message queue is the same as before, receiving and buffering messages

Publish, subscribe model - Fanout

Fanout literally means fan out; in MQ it is more intuitive to call it broadcast.

insert image description here

In broadcast mode, messages flow as follows:

  • 1) There can be multiple queues
  • 2) Each queue must be bound to the Exchange
  • 3) The producer can only send messages to the exchange; the exchange decides which queues to send them to, the producer cannot
  • 4) The exchange sends the message to all bound queues
  • 5) Consumers subscribed to those queues can then get the message

Code plan

  • Create an exchange itcast.fanout of type Fanout
  • Create two queues fanout.queue1 and fanout.queue2 and bind them to the exchange itcast.fanout

insert image description here


Declare exchanges and queues

Spring provides an interface Exchange to represent the different types of exchanges

insert image description here

Create a class in consumer to declare the queues and the exchange:

@Configuration
public class FanoutConfig {
    
    
    /**
     * 声明交换机
     * @return Fanout类型交换机
     */
    @Bean
    public FanoutExchange fanoutExchange(){
    
    
        return new FanoutExchange("itcast.fanout");
    }

    /**
     * 第1个队列
     */
    @Bean
    public Queue fanoutQueue1(){
    
    
        return new Queue("fanout.queue1");
    }

    /**
     * 绑定队列和交换机
     */
    @Bean
    public Binding bindingQueue1(Queue fanoutQueue1, FanoutExchange fanoutExchange){
    
    
        return BindingBuilder.bind(fanoutQueue1).to(fanoutExchange);
    }

    /**
     * 第2个队列
     */
    @Bean
    public Queue fanoutQueue2(){
    
    
        return new Queue("fanout.queue2");
    }

    /**
     * 绑定队列和交换机
     */
    @Bean
    public Binding bindingQueue2(Queue fanoutQueue2, FanoutExchange fanoutExchange){
    
    
        return BindingBuilder.bind(fanoutQueue2).to(fanoutExchange);
    }
}

send message

Add a test method to the SpringAmqpTest class of the publisher service:

@Test
public void testFanoutExchange() {
    
    
    // 交换机名称
    String exchangeName = "itcast.fanout";
    // 消息
    String message = "hello, everyone!";
    rabbitTemplate.convertAndSend(exchangeName, "", message);
}

message reception

Add two methods to the SpringRabbitListener of the consumer service as a consumer:

@RabbitListener(queues = "fanout.queue1")
public void listenFanoutQueue1(String msg) {
    
    
    System.out.println("消费者1接收到Fanout消息:【" + msg + "】");
}

@RabbitListener(queues = "fanout.queue2")
public void listenFanoutQueue2(String msg) {
    
    
    System.out.println("消费者2接收到Fanout消息:【" + msg + "】");
}

summary

What is the role of the exchange?

  • Receive messages sent by publishers
  • Route messages to the queues bound to it according to the rules
  • It cannot cache messages: if routing fails, the message is lost
  • A FanoutExchange routes each message to every bound queue

Which beans declare queues, exchanges, and bindings?

  • Queue
  • FanoutExchange
  • Binding

Publish/subscribe model - Direct

In Fanout mode, a message is consumed by all subscribed queues. But in some scenarios we want different messages to be consumed by different queues, and that is where the Direct exchange type comes in.

insert image description here

Under the Direct model:

  • The binding between a queue and the exchange cannot be arbitrary; a RoutingKey (routing key) must be specified
  • The sender of a message must also specify the message's RoutingKey when sending it to the exchange
  • The exchange no longer delivers the message to every bound queue; it decides based on the message's RoutingKey, and only queues whose binding key exactly matches the message's RoutingKey receive the message

The case requirements are as follows :

  1. Use @RabbitListener to declare Exchange, Queue, RoutingKey

  2. In the consumer service, write two consumer methods to listen to direct.queue1 and direct.queue2 respectively

  3. Write a test method in the publisher and send a message to itcast.direct

Declare queues and exchanges with annotations

Declaring queues and exchanges with @Bean is cumbersome, so Spring also provides annotation-based declarations.

Add two consumers to the consumer's SpringRabbitListener and declare the queues and exchange with annotations:

@RabbitListener(bindings = @QueueBinding(
    value = @Queue(name = "direct.queue1"),
    exchange = @Exchange(name = "itcast.direct", type = ExchangeTypes.DIRECT),
    key = {
    
    "red", "blue"}
))
public void listenDirectQueue1(String msg){
    
    
    System.out.println("消费者接收到direct.queue1的消息:【" + msg + "】");
}

@RabbitListener(bindings = @QueueBinding(
    value = @Queue(name = "direct.queue2"),
    exchange = @Exchange(name = "itcast.direct", type = ExchangeTypes.DIRECT),
    key = {
    
    "red", "yellow"}
))
public void listenDirectQueue2(String msg){
    
    
    System.out.println("消费者接收到direct.queue2的消息:【" + msg + "】");
}

send message

Add a test method to the SpringAmqpTest class of the publisher service:

@Test
public void testSendDirectExchange() {
    
    
    // 交换机名称
    String exchangeName = "itcast.direct";
    // 消息
    String message = "红警";
    // 发送消息
    rabbitTemplate.convertAndSend(exchangeName, "red", message);
}

Summarize

Describe the difference between a Direct exchange and a Fanout exchange:

  • A Fanout exchange routes the message to every queue bound to it
  • A Direct exchange decides which queues to route to according to the RoutingKey
  • If multiple queues have the same RoutingKey, it behaves like Fanout

Which annotations are commonly used to declare queues and exchanges with @RabbitListener?

  • @Queue
  • @Exchange

Publish/subscribe model - Topic

Like the Direct type, a TopicExchange can route messages to different queues according to the RoutingKey. The difference is that a Topic exchange allows the queue's binding key to use wildcards!

A RoutingKey generally consists of one or more words separated by ".", for example: item.insert

Wildcard rules:

#: matches zero or more words

*: matches exactly 1 word

For example, in the figure below

insert image description here
Explanation:

  • Queue1: bound with china.#, so every routing key starting with china. will match, including china.news and china.weather
  • Queue2: bound with #.news, so every routing key ending with .news will match, including china.news and japan.news

The idea of ​​code implementation is as follows:

  1. Use @RabbitListener to declare the Exchange, Queue, and RoutingKey

  2. In the consumer service, write two consumer methods to listen to topic.queue1 and topic.queue2 respectively

  3. Write a test method in the publisher and send a message to itcast.topic

insert image description here

message reception

Add the method in the SpringRabbitListener of the consumer service:

@RabbitListener(bindings = @QueueBinding(
    value = @Queue(name = "topic.queue1"),
    exchange = @Exchange(name = "itcast.topic", type = ExchangeTypes.TOPIC),
    key = "china.#"
))
public void listenTopicQueue1(String msg){
    
    
    System.out.println("消费者接收到topic.queue1的消息:【" + msg + "】");
}

@RabbitListener(bindings = @QueueBinding(
    value = @Queue(name = "topic.queue2"),
    exchange = @Exchange(name = "itcast.topic", type = ExchangeTypes.TOPIC),
    key = "#.news"
))
public void listenTopicQueue2(String msg){
    
    
    System.out.println("消费者接收到topic.queue2的消息:【" + msg + "】");
}

send message

Add a test method to the SpringAmqpTest class of the publisher service:

/**
     * topicExchange
     */
@Test
public void testSendTopicExchange() {
    
    
    // 交换机名称
    String exchangeName = "itcast.topic";
    // 消息
    String message = "喜报!孙悟空大战哥斯拉,胜!";
    // 发送消息
    rabbitTemplate.convertAndSend(exchangeName, "china.news", message);
}

summary

Describe the difference between a Direct exchange and a Topic exchange:

  • The RoutingKey of messages received by a Topic exchange must consist of multiple words separated by .
  • The binding key used when binding a Topic exchange to a queue can contain wildcards
  • #: stands for 0 or more words
  • *: stands for 1 word

message converter

Spring will serialize the message you send into bytes and send it to MQ, and when receiving the message, it will also deserialize the bytes into Java objects.

However, by default, the serialization method used by Spring is JDK serialization. As we all know, JDK serialization has the following problems:

  • Data size is too large
  • has a security hole
  • poor readability

Test the default converter

Modify the code for message sending and send a Map object:

@Test
public void testSendMap() throws InterruptedException {
    
    
    // 准备消息
    Map<String,Object> msg = new HashMap<>();
    msg.put("name", "Jack");
    msg.put("age", 21);
    // 发送消息
    rabbitTemplate.convertAndSend("simple.queue","", msg);
}

Stop the consumer service

Check the console after sending the message:

insert image description here


Configure the JSON converter

Obviously, JDK serialization is not suitable. We want the message body to be smaller and more readable, so we use JSON for serialization and deserialization.

Introduce the dependency in both the publisher and consumer services (adding it to the parent project avoids repeating it):

<dependency>
    <groupId>com.fasterxml.jackson.dataformat</groupId>
    <artifactId>jackson-dataformat-xml</artifactId>
    <version>2.9.10</version>
</dependency>

Configure message converters.

Just add a Bean to the startup class:

@Bean
public MessageConverter jsonMessageConverter(){
    
    
    return new Jackson2JsonMessageConverter();
}

The message obtained in this way is the original content.

insert image description here


distributed search

Getting to know elasticsearch

ES introduction

Elasticsearch is a very powerful open source search engine that helps us quickly find what we need from massive amounts of data
insert image description here

The ELK technology stack
elasticsearch combines kibana, Logstash, and Beats, which is the elastic stack (ELK). It is widely used in log data analysis, real-time monitoring and other fields

Elasticsearch is the core of the elastic stack, responsible for storing, searching, and analyzing data.


Compared with lucene, elasticsearch has the following advantages:

  • Support distributed, horizontal expansion
  • Provide a Restful interface that can be called by any language

What is elasticsearch?

  • An open source distributed search engine that can be used to implement functions such as search, log statistics, analysis, and system monitoring

What is elastic stack (ELK)?

  • A technology stack with elasticsearch as the core, including beats, Logstash, kibana, elasticsearch

What is Lucene?

  • It is Apache's open source search engine class library, which provides the core API of the search engine

Inverted index

The concept of an inverted index is defined relative to a forward index like MySQL's. Search engines generally use inverted indexes to search by keyword.

A forward index is built on the id column of a database table. Searching by keyword then requires fuzzy matching, and fuzzy matching can invalidate the index; once the index is invalidated, the whole table is scanned, as shown below
insert image description here
When the data volume is large, the efficiency is painful.


Inverted index

Creating an inverted index is a special treatment for a forward index, and the process is as follows:

  • Use the algorithm to segment the data of each document to get each entry
  • Create a table, each row of data includes information such as the entry, the document id where the entry is located, and the location
  • Because of the uniqueness of the entry, you can create an index for the entry, such as a hash table structure index
    insert image description here

ES basic concept

There are many unique concepts in elasticsearch, which are slightly different from mysql, but there are also similarities.

文档和字段
elasticsearch是面向 文档Document存储的,可以是数据库中的一条商品数据,一个订单信息。文档数据会被序列化为json格式后存储在elasticsearch中:
insert image description here
而Json文档中往往包含很多的 字段(Field),类似于数据库中的列。

索引和映射

索引(Index),就是相同类型的文档的集合

例如:

  • 所有用户文档,就可以组织在一起,称为用户的索引;
  • 所有商品的文档,可以组织在一起,称为商品的索引;
  • 所有订单的文档,可以组织在一起,称为订单的索引;

insert image description here

Therefore, we can think of an index as a table in a database.

数据库的表会有约束信息,用来定义表的结构、字段的名称、类型等信息。因此,索引库中就有映射(mapping),是索引中文档的字段约束信息,类似表的结构约束


mysql与elasticsearch对比

insert image description here

Each has its own strengths, and the two complement each other:

  • Mysql:擅长事务类型操作,可以确保数据的安全和一致性

  • Elasticsearch:擅长海量数据的关键词搜索、分析、计算


使用时一般是二者结合使用

  • 对安全性要求较高的写操作,使用mysql实现
  • 对查询性能要求较高的搜索需求,使用elasticsearch实现
  • 两者再基于某种方式,实现数据的同步,保证一致性
    insert image description here

安装es、kibana

安装elasticsearch

① Load the es image
Load the image tarball into Docker as a local image.

es and kibana image compression package download: es and kibana image
extraction code: icnb

docker load -i es.tar

We also need to deploy the kibana container, so we need to interconnect the es and kibana containers. Here first create a network:

docker network create es-net

② Run the image. If the docker service is not started, start docker first:

systemctl start docker

Run the docker command to deploy single point es:

docker run -d \
    --name es \
    -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
    -e "discovery.type=single-node" \
    -v es-data:/usr/share/elasticsearch/data \
    -v es-plugins:/usr/share/elasticsearch/plugins \
    --privileged \
    --network es-net \
    -p 9200:9200 \
    -p 9300:9300 \
    elasticsearch:7.12.1

Command explanation:

  • -e "cluster.name=es-docker-cluster": set the cluster name
  • -e "http.host=0.0.0.0": The listening address, which can be accessed from the external network
  • -e "ES_JAVA_OPTS=-Xms512m -Xmx512m":memory size
  • -e "discovery.type=single-node": non-cluster mode
  • -v es-data:/usr/share/elasticsearch/data: Mount the logical volume, bind the data directory of es
  • -v es-logs:/usr/share/elasticsearch/logs: Mount the logical volume, bind the log directory of es
  • -v es-plugins:/usr/share/elasticsearch/plugins: Mount the logical volume, bind the plug-in directory of es
  • --privileged: grant logical volume access
  • --network es-net: join a network named es-net
  • -p 9200:9200: port mapping configuration

③Browser access test
First open port 9200 of the firewall

sudo firewall-cmd --zone=public --permanent --add-port=9200/tcp

firewall-cmd --reload

insert image description here


Deploy kibana

Kibana can provide us with an elasticsearch visual interface for us to learn.

①Import the image compression package and build the image

docker load -i kibana.tar

② Run the image
Run the docker command to deploy kibana

docker run -d \
--name kibana \
-e ELASTICSEARCH_HOSTS=http://es:9200 \
--network=es-net \
-p 5601:5601  \
kibana:7.12.1
  • --network es-net: join a network named es-net, in the same network as elasticsearch
  • -e ELASTICSEARCH_HOSTS=http://es:9200: Set the address of elasticsearch. Because kibana is in the same network as elasticsearch, it can access elasticsearch directly by the container name
  • -p 5601:5601: port mapping configuration

Kibana is generally slow to start and needs to wait for a while. You can use the command:

docker logs -f kibana

Check the running log. When you see the following log, it means success:

insert image description here

③Browser access test

Open the port. If it is a server, add an open rule for Kibana's port 5601 to the firewall:

sudo firewall-cmd --zone=public --permanent --add-port=5601/tcp

sudo firewall-cmd --reload

insert image description here


Click Dev tools to write DSL to operate elasticsearch in this interface. And there is an automatic completion function for DSL statements.
insert image description here

insert image description here


However, the default es analyzer works well for English but poorly for Chinese: Chinese text is simply split into individual characters, which is obviously not the effect we want. So below we install the ik tokenizer, which handles Chinese word segmentation much better.

Install ik plugin online (slower)

# 进入容器内部(本文容器名为 es)
docker exec -it es /bin/bash

# 在线下载并安装
./bin/elasticsearch-plugin  install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.12.1/elasticsearch-analysis-ik-7.12.1.zip

#退出
exit
#重启容器
docker restart es

Install the ik plugin offline (recommended)

View the data volume directory

To install the plug-in, you need to know the location of the plugins directory of elasticsearch, and we use the data volume mount, so we need to view the data volume directory of elasticsearch, and check it by the following command:

docker volume inspect es-plugins

Show results:

[
    {
    
    
        "CreatedAt": "2022-05-06T10:06:34+08:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/es-plugins/_data",
        "Name": "es-plugins",
        "Options": null,
        "Scope": "local"
    }
]

It shows that the plugins directory is mounted to: /var/lib/docker/volumes/es-plugins/_data this directory.
Unzip the compressed package of the ik tokenizer into this directory (the compressed package is downloaded in the link of the es image download above)

Finally restart the container

docker restart es

test

The IK tokenizer contains two modes:

  • ik_smart: coarsest-grained segmentation (fewer, longer terms)

  • ik_max_word: finest-grained segmentation (more, shorter terms)

Coarsest-grained segmentation first; the screenshot below shows the result clearly.

insert image description here


Finest-grained segmentation

insert image description here
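To try the two modes yourself, you can run an analyze request in Kibana's Dev Tools; a minimal sketch (the sample sentence is arbitrary, and you can swap ik_max_word for ik_smart to compare):

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "黑马程序员学习java太棒了"
}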


extended dictionary

With the development of the Internet, "word-making movements" have become more and more frequent. Many new words appeared that did not exist in the original vocabulary list. For example: "Olige", "Chicken You Are So Beautiful", "Rick Thimble" and so on.

Therefore, our vocabulary also needs to be constantly updated, and the IK tokenizer provides the function of expanding vocabulary.

To expand the ik tokenizer's vocabulary, you only need to modify the IkAnalyzer.cfg.xml file in the config directory under the ik plugin folder.
insert image description here

insert image description here
Then create a new ext.dic file in the same directory as IkAnalyzer.cfg.xml (you can copy one of the existing files in the config directory and modify it), and write the new terms into it.

Note that the encoding of the current file must be in UTF-8 format, and editing with Windows Notepad is strictly prohibited (the encoding format of Windows Notepad defaults to gbk)
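For reference, the relevant part of IkAnalyzer.cfg.xml looks roughly like this after registering the dictionaries (a sketch; the file names ext.dic and stopword.dic match the files described in this section):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- 扩展词典:一行一个词条 -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- 停用词词典,见下文 stop words 部分 -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>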

insert image description here
Restart elasticsearch to take effect

docker restart es

The effect is as follows
insert image description here


stop words

Besides adding words, we sometimes need to disable them: sensitive terms involving religion, politics and the like should be ignored when indexing and searching.

The IK tokenizer also provides a powerful stop word feature that lets us ignore the contents of a stop word list when building the inverted index.
In the same way as the extension above, fill in the stopword.dic entry in IkAnalyzer.cfg.xml, create that file, and add the forbidden sensitive words to it.

The steps are the same as for the extended dictionary, so they are not repeated here.


Summarize

What is the function of the tokenizer?
Word segmentation of documents when creating an inverted index
When users search, word segmentation of input content

How many modes does the IK tokenizer have?
ik_smart: intelligent segmentation, coarse-grained
ik_max_word: the finest segmentation, fine-grained

How does the IK tokenizer expand entries? How to deactivate an entry?
Use the IkAnalyzer.cfg.xml file in the config directory to add extended dictionaries and disabled dictionaries
Add extended entries or disabled entries to the dictionary


Index library operations

The index library is similar to the database table, and the mapping mapping is similar to the table structure.

If we want to store data in es, we must first create "library" and "table".

mapping mapping properties

Mapping is a constraint on documents in the index library. Common mapping attributes include:

  • type: field data type, common simple types are:
    • String: text (text that will be segmented), keyword (exact value, e.g. brand, country, ip address — text that does not need segmentation)
    • Numeric: long, integer, short, byte, double, float
    • Boolean: boolean
    • date: date
    • Object: object
  • index: Whether to create an index, the default is true (not all need to create an inverted index, such as the url of the picture, creating an inverted index is not useful)
  • analyzer: which tokenizer to use
  • properties: subfields of this field

For example the following json document:

{
    "age": 21,
    "weight": 52.1,
    "isMarried": false,
    "info": "什么是快乐星球",
    "email": "[email protected]",
    "score": [99.1, 99.5, 98.9],
    "name": {
        "firstName": "云",
        "lastName": "赵"
    }
}

Corresponding to each field mapping (mapping):

  • age: The type is integer; participate in the search, so the index needs to be true; no tokenizer is required
  • weight: The type is float; participate in the search, so the index needs to be true; no tokenizer is required
  • isMarried: The type is boolean; participate in the search, so the index needs to be true; no tokenizer is required
  • info: The type is a string, word segmentation is required, so it is text; to participate in the search, so the index needs to be true; the word segmenter can use ik_smart
  • email: The type is a string, but word segmentation is not required, so it is a keyword; it does not participate in the search, so the index needs to be false; no word segmentation is required
  • score: Although it is an array, we only look at the type of the element, which is float; participate in the search, so the index needs to be true; no tokenizer is required
  • name: The type is object, and multiple sub-attributes need to be defined
    • name.firstName; the type is a string, but word segmentation is not required, so it is a keyword; it participates in the search, so the index needs to be true; no word segmentation is required
    • name.lastName; the type is a string, but word segmentation is not required, so it is a keyword; it participates in the search, so the index needs to be true; no word segmentation is required

summary

What are the common attributes of mapping ?
type: data type
index: whether to index
analyzer: tokenizer
properties: subfield

What are the common types of type?
String: text, keyword
Number: long, integer, short, byte, double, float
Boolean: boolean
Date: date
Object: object


CRUD for the index library

All of the following DSL demonstrations are written in Kibana's Dev Tools.

Create index repository and mapping

Take the following code as an example to create the basic template of the index library

PUT /test
{
  "mappings": {
    "properties": {
      "info":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "email":{
        "type": "keyword",
        "index": false
      },
      "name":{
        "properties": {
          "firstName":{
            "type": "keyword"
          },
          "lastName":{
            "type": "keyword"
          }
        }
      }
    }
  }
}

Query and delete index library

GET /索引库名

insert image description here


DELETE /索引库名

insert image description here


modify index library

Although the inverted index structure is not complicated, once the data structure changes (for example, the tokenizer is changed), the inverted index needs to be recreated, which is a disaster. Therefore, once the index library is created, the mapping cannot be modified .

虽然无法修改mapping中已有的字段,但是却允许添加新的字段到mapping中,因为不会对倒排索引产生影响。
如下示例

PUT /索引库名/_mapping
{
  "properties": {
    "新字段名":{
      "type": "integer"
    }
  }
}

insert image description here


Summary

What are the index library operations?

Create an index library and its mapping: PUT /索引库名 { "mappings": ... }
Query an index library: GET /索引库名
Delete an index library: DELETE /索引库名
Modify an index library (new fields only): PUT /索引库名/_mapping { "properties": ... }

文档操作

新增文档

语法如下

POST /索引库名/_doc/文档id
{
    "字段1": "值1",
    "字段2": "值2",
    "字段3": {
        "子属性1": "值3",
        "子属性2": "值4"
    }
    // ...
}

下面来个具体实现来看

insert image description here


查询和删除文档

查询文档

根据rest风格,新增是post,查询应该是get,不过查询一般都需要条件,这里我们把文档id带上。
文档也就是es中的一条数据,是JSON格式的(类似数据库中的row,一行数据)

语法:

GET /{索引库名称}/_doc/{id}

通过kibana查看数据:

GET /test/_doc/1

查看结果:

insert image description here


删除文档

删除使用DELETE请求,同样,需要根据id进行删除:

语法:

DELETE /{索引库名}/_doc/id值

示例:

# 根据id删除数据
DELETE /test/_doc/1

修改文档

修改有两种方式:

  • 全量修改:直接覆盖原来的文档
  • 增量修改:修改文档中的部分字段

全量修改

全量修改是覆盖原来的文档,其本质是:

  • 根据指定的id删除文档
  • 新增一个相同id的文档

注意:如果根据id删除时,id不存在,第二步的新增也会执行,也就从修改变成了新增操作了。

语法:

PUT /{索引库名}/_doc/文档id
{
    "字段1": "值1",
    "字段2": "值2",
    // ... 略
}

示例:

PUT /test/_doc/1
{
    "info": "什么是快乐星球",
    "email": "[email protected]",
    "name": {
        "firstName": "云",
        "lastName": "赵"
    }
}

增量修改

增量修改是只修改指定id匹配的文档中的部分字段。

语法:

POST /{索引库名}/_update/文档id
{
    "doc": {
        "字段名": "新的值"
    }
}

示例:

POST /test/_update/1
{
  "doc": {
    "email": "[email protected]"
  }
}

小结

文档操作有哪些?

  • 创建文档:POST /{索引库名}/_doc/文档id { json文档 }
  • 查询文档:GET /{索引库名}/_doc/文档id
  • 删除文档:DELETE /{索引库名}/_doc/文档id
  • 修改文档:
    • Full modification: PUT /{index library name}/_doc/document id { json document }
    • Incremental modification: POST /{index library name}/_update/document id { “doc”: {field}}

RestAPI

ES officially provides clients in various languages ​​to operate ES. The essence of these clients is to assemble DSL statements and send them to ES through http requests. Official document address: https://www.elastic.co/guide/en/elasticsearch/client/index.html

The Java Rest Client includes two types:

  • Java Low Level Rest Client
  • Java High Level Rest Client

Generally the Java High Level Rest Client is the one used most, so it is also the client API we learn here.


quick start

Here is a quick-start demo case, step by step.

1. Import the sql file into the database and create hotel related table data

insert image description here

The data structure is as follows:

CREATE TABLE `tb_hotel` (
  `id` bigint(20) NOT NULL COMMENT '酒店id',
  `name` varchar(255) NOT NULL COMMENT '酒店名称;例:7天酒店',
  `address` varchar(255) NOT NULL COMMENT '酒店地址;例:航头路',
  `price` int(10) NOT NULL COMMENT '酒店价格;例:329',
  `score` int(2) NOT NULL COMMENT '酒店评分;例:45,就是4.5分',
  `brand` varchar(32) NOT NULL COMMENT '酒店品牌;例:如家',
  `city` varchar(32) NOT NULL COMMENT '所在城市;例:上海',
  `star_name` varchar(16) DEFAULT NULL COMMENT '酒店星级,从低到高分别是:1星到5星,1钻到5钻',
  `business` varchar(255) DEFAULT NULL COMMENT '商圈;例:虹桥',
  `latitude` varchar(32) NOT NULL COMMENT '纬度;例:31.2497',
  `longitude` varchar(32) NOT NULL COMMENT '经度;例:120.3925',
  `pic` varchar(255) DEFAULT NULL COMMENT '酒店图片;例:/img/1.jpg',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

2. Project construction

insert image description here


3.mapping mapping analysis

To create an index library (that is, to create a table), the most important thing is the mapping (table constraint), and the information to be considered by the mapping mapping includes:

  • field name
  • field data type
  • Whether to participate in the search
  • Do you need word segmentation
  • If word segmentation, what is the tokenizer?

Among them:

  • Field name, field data type, you can refer to the name and type of the data table structure
  • Whether to participate in the search should be judged by analyzing the business, such as the image address, there is no need to participate in the search
  • Whether word segmentation depends on the content, if the content is a whole, there is no need for word segmentation, otherwise, word segmentation is required
  • Tokenizer, we can use ik_max_word uniformly

Let’s take a look at the index library structure of the hotel data:

PUT /hotel
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword",
        "copy_to": "all"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

Description of several special fields:

  • location: geographic coordinates, containing latitude and longitude
  • all: a combination field, its purpose is to combine the values ​​of multiple fields using copy_to, and provide it to users for searching

Description of geographic coordinates:

insert image description here

copy_to description:
insert image description here


4. Initialize RestClient
In the API provided by elasticsearch, all interactions with elasticsearch are encapsulated in a class called RestHighLevelClient. You must first complete the initialization of this object and establish a connection with elasticsearch.

Divided into three steps:

1) Introduce the RestHighLevelClient dependency of es:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>

2) Because the default ES version of SpringBoot is 7.6.2, we need to override the default ES version:

<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

3) Initialize RestHighLevelClient:

The initialization code is as follows:

RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(
        HttpHost.create("http://ip:9200")
));

For the convenience of unit testing, we create a test class HotelIndexTest, and then write the initialization code in the @BeforeEach method:

    @BeforeEach
    void setUp() {
        this.client = new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://ip:9200")
        ));
    }

    @AfterEach
    void tearDown() throws IOException {
        this.client.close();
    }

Operation index library

Create an index library

insert image description here

The code is divided into three steps:

  • 1) Create a Request object. Because it is an operation to create an index library, the Request is CreateIndexRequest.
  • 2) Adding request parameters is actually the JSON parameter part of the DSL. Because the json string is very long, the static string constant MAPPING_TEMPLATE is defined here to make the code look more elegant.
  • 3) To send a request, the return value of the client.indices() method is the IndicesClient type, which encapsulates all methods related to index library operations.

Under the cn.hotel.constants package of hotel-demo, create a class to define the JSON string constant of mapping mapping:
it is the JSON statement to build the index library, as follows

public class HotelConstants {
    public static final String MAPPING_TEMPLATE = "{\n" +
            "  \"mappings\": {\n" +
            "    \"properties\": {\n" +
            "      \"id\": {\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"name\":{\n" +
            "        \"type\": \"text\",\n" +
            "        \"analyzer\": \"ik_max_word\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"address\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"index\": false\n" +
            "      },\n" +
            "      \"price\":{\n" +
            "        \"type\": \"integer\"\n" +
            "      },\n" +
            "      \"score\":{\n" +
            "        \"type\": \"integer\"\n" +
            "      },\n" +
            "      \"brand\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"city\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"copy_to\": \"all\"\n" +
            "      },\n" +
            "      \"starName\":{\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"business\":{\n" +
            "        \"type\": \"keyword\"\n" +
            "      },\n" +
            "      \"location\":{\n" +
            "        \"type\": \"geo_point\"\n" +
            "      },\n" +
            "      \"pic\":{\n" +
            "        \"type\": \"keyword\",\n" +
            "        \"index\": false\n" +
            "      },\n" +
            "      \"all\":{\n" +
            "        \"type\": \"text\",\n" +
            "        \"analyzer\": \"ik_max_word\"\n" +
            "      }\n" +
            "    }\n" +
            "  }\n" +
            "}";
}

In the HotelIndexTest test class in hotel-demo, write a unit test to create an index:

@Test
void createHotelIndex() throws IOException {
    // 1.创建Request对象
    CreateIndexRequest request = new CreateIndexRequest("hotel");
    // 2.准备请求的参数:DSL语句
    request.source(MAPPING_TEMPLATE, XContentType.JSON);
    // 3.发送请求
    client.indices().create(request, RequestOptions.DEFAULT);
}

delete index library

The DSL statement to delete the index store is very simple:

DELETE /hotel

Compared to creating an index library:

  • Request method changed from PUT to DELETE
  • The request path remains unchanged
  • no request parameters

Therefore, the difference in code should be reflected in the Request object. It is still three steps:

  • 1) Create a Request object. This time it is the DeleteIndexRequest object
  • 2) Prepare parameters. Here is no parameter
  • 3) Send the request. Use the delete method instead

In the HotelIndexTest test class in hotel-demo, write a unit test to delete the index:

@Test
void testDeleteHotelIndex() throws IOException {
    // 1.创建Request对象
    DeleteIndexRequest request = new DeleteIndexRequest("hotel");
    // 2.发送请求
    client.indices().delete(request, RequestOptions.DEFAULT);
}

Determine whether the index library exists

The essence of judging whether the index library exists is query, and the corresponding DSL is:

GET /hotel

So it is similar to the deleted Java code flow. It is still three steps:

  • 1) Create a Request object. This time the GetIndexRequest object
  • 2) Prepare parameters. Here is no parameter
  • 3) Send the request. Use the exists method instead
@Test
void testExistsHotelIndex() throws IOException {
    // 1.创建Request对象
    GetIndexRequest request = new GetIndexRequest("hotel");
    // 2.发送请求
    boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
    // 3.输出
    System.err.println(exists ? "索引库已经存在!" : "索引库不存在!");
}

Summarize

The process of JavaRestClient operating elasticsearch is basically similar. The core is the client.indices() method to obtain the operation object of the index library.

The basic steps of index library operation:

  • Initialize RestHighLevelClient
  • Create XxxIndexRequest. XXX is Create, Get, Delete
  • Prepare DSL (required when Create, others are no parameters)
  • send request. Call the RestHighLevelClient#indices().xxx() method, where xxx is create, exists, delete

Operational Documentation

new document

It is equivalent to inserting a piece of data in the database table, but it is inserted into the index library

The DSL statement of the newly added document is as follows:

POST /{索引库名}/_doc/1
{
    "name": "Jack",
    "age": 21
}

The corresponding java code is shown in the figure:

insert image description here
You can see that it is similar to creating an index library , and it is also a three-step process:

  • 1) Create a Request object
  • 2) Prepare the request parameters, which is the JSON document in the DSL
  • 3) Send request

The change is that the API of client.xxx() is directly used here, and client.indices() is no longer needed.


The result of the database query is a Hotel type object, which is not consistent with the fields in the index library (for example, longitude and latitude need to be merged into location). Here, a new type needs to be defined, which matches the structure of the index library:

@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
    }
}

To insert the hotel object in the database into the index library, there are three points to note

  • The hotel data comes from the database, we need to query it first to get the hotel object
  • The hotel object needs to be converted to a HotelDoc object
  • HotelDoc needs to be serialized into json format

The overall steps of the code are as follows:

  • 1) Query hotel data Hotel according to id
  • 2) Package Hotel as HotelDoc
  • 3)将HotelDoc序列化为JSON
  • 4)创建IndexRequest,指定索引库名和id
  • 5)准备请求参数,也就是JSON文档
  • 6)发送请求
@Test
void testAddDocument() throws IOException {
    // 1.根据id查询酒店数据
    Hotel hotel = hotelService.getById(61083L);
    // 2.转换为文档类型
    HotelDoc hotelDoc = new HotelDoc(hotel);
    // 3.将HotelDoc转json
    String json = JSON.toJSONString(hotelDoc);

    // 1.准备Request对象
    IndexRequest request = new IndexRequest("hotel").id(hotelDoc.getId().toString());
    // 2.准备Json文档
    request.source(json, XContentType.JSON);
    // 3.发送请求
    client.index(request, RequestOptions.DEFAULT);
}

insert image description here


查询文档

查询的DSL语句如下:

GET /hotel/_doc/{id}

非常简单,因此代码大概分两步:

  • 准备Request对象
  • 发送请求

不过查询的目的是得到结果,解析为HotelDoc,因此难点是结果的解析。完整代码如下:

insert image description here
可以看到,结果是一个JSON,其中文档放在一个_source属性中,因此解析就是拿到_source,反序列化为Java对象即可。

与之前类似,也是三步走:

  • 1)准备Request对象。这次是查询,所以是GetRequest
  • 2)发送请求,得到结果。因为是查询,这里调用client.get()方法
  • 3)解析结果,就是对JSON做反序列化
@Test
void testGetDocumentById() throws IOException {
    // 1.准备Request
    GetRequest request = new GetRequest("hotel", "61083");
    // 2.发送请求,得到响应
    GetResponse response = client.get(request, RequestOptions.DEFAULT);
    // 3.解析响应结果
    String json = response.getSourceAsString();

    HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
    System.out.println(hotelDoc);
}

insert image description here


删除文档

删除的DSL为是这样的:

DELETE /hotel/_doc/{id}

Compared with the query, only the request method changes from GET to DELETE, so you can imagine the Java code still follows the same three steps:

  • 1)准备Request对象,因为是删除,这次是DeleteRequest对象。要指定索引库名和id
  • 2)准备参数,无参
  • 3)发送请求。因为是删除,所以是client.delete()方法

在hotel-demo的HotelDocumentTest测试类中,编写单元测试:

@Test
void testDeleteDocument() throws IOException {
    // 1.准备Request
    DeleteRequest request = new DeleteRequest("hotel", "61083");
    // 2.发送请求
    client.delete(request, RequestOptions.DEFAULT);
}

insert image description here


修改文档

文档修改有两种方式:

  • 全量修改:本质是先根据id删除,再新增
  • 增量修改:修改文档中的指定字段值

在RestClient的API中,全量修改与新增的API完全一致,判断依据是ID:

  • 如果新增时,ID已经存在,则修改
  • 如果新增时,ID不存在,则新增

Here we focus on incremental modification, because a full modification simply overwrites the document and its code is identical to adding a new document.
代码示例如图:

insert image description here
与之前类似,也是三步走:

  • 1)准备Request对象。这次是修改,所以是UpdateRequest
  • 2)准备参数。也就是JSON文档,里面包含要修改的字段
  • 3)更新文档。这里调用client.update()方法
@Test
void testUpdateDocument() throws IOException {
    // 1.准备Request
    UpdateRequest request = new UpdateRequest("hotel", "61083");
    // 2.准备请求参数
    request.doc(
        "price", "952",
        "starName", "四钻"
    );
    // 3.发送请求
    client.update(request, RequestOptions.DEFAULT);
}

把要修改的参数和值写在doc中,其间都是逗号隔开

结果如下insert image description here


批量导入文档

案例需求:利用BulkRequest批量将数据库数据导入到索引库中。

步骤如下:

  • 利用mybatis-plus查询酒店数据

  • 将查询到的酒店数据(Hotel)转换为文档类型数据(HotelDoc)

  • 利用JavaRestClient中的BulkRequest批处理,实现批量新增文档

批量处理BulkRequest,其本质就是将多个普通的CRUD请求组合在一起发送。

其中提供了一个add方法,用来添加其他请求:

insert image description here
可以看到,能添加的请求包括:

  • IndexRequest,也就是新增
  • UpdateRequest,也就是修改
  • DeleteRequest,也就是删除

因此Bulk中添加了多个IndexRequest,就是批量新增功能了。示例:
insert image description here
其实还是三步走:

  • 1)创建Request对象。这里是BulkRequest
  • 2)准备参数。批处理的参数,就是其它Request对象,这里就是多个IndexRequest
  • 3)发起请求。这里是批处理,调用的方法为client.bulk()方法

在导入酒店数据时,将上述代码改造成for循环处理即可。

@Test
void testBulkRequest() throws IOException {
    // 批量查询酒店数据
    List<Hotel> hotels = hotelService.list();

    // 1.创建Request
    BulkRequest request = new BulkRequest();
    // 2.准备参数,添加多个新增的Request
    for (Hotel hotel : hotels) {
        // 2.1.转换为文档类型HotelDoc
        HotelDoc hotelDoc = new HotelDoc(hotel);
        // 2.2.创建新增文档的Request对象
        request.add(new IndexRequest("hotel")
                    .id(hotelDoc.getId().toString())
                    .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
    }
    // 3.发送请求
    client.bulk(request, RequestOptions.DEFAULT);
}

Then verify the result with a search query in Kibana's Dev Tools:

GET /索引库名/_search

insert image description here


小结:
文档操作的基本步骤:

  • 初始化RestHighLevelClient
  • 创建XxxRequest。XXX是Index、Get、Update、Delete、Bulk
  • 准备参数(Index、Update、Bulk时需要)
  • 发送请求。调用RestHighLevelClient#.xxx()方法,xxx是index、get、update、delete、bulk
  • 解析结果(Get时需要)

elasticsearch搜索功能

DSL查询文档

elasticsearch的查询依然是基于JSON风格的DSL来实现的。
(DSL类似于数据库的DQL查询语句)


DSL查询分类

Elasticsearch provides a JSON-based DSL ( Domain Specific Language ) to define queries. Common query types include:

  • Query all : Query all data, for general testing. For example: match_all

  • Full-text search (full text) query : Use the word segmenter to segment the user input content, and then match it in the inverted index database. For example:

    • match_query
    • multi_match_query
  • Precise query : Find data based on precise entry values, generally searching for keyword, numeric, date, boolean and other types of fields. For example:

    • ids
    • range
    • term
  • Geographic (geo) query : query based on latitude and longitude. For example:

    • geo_distance
    • geo_bounding_box
  • Compound (compound) query : compound query can combine the above-mentioned various query conditions and merge query conditions. For example:

    • bool
    • function_score

query all

The query syntax is basically the same:

GET /indexName/_search
{
  "query": {
    "查询类型": {
      "查询条件": "条件值"
    }
  }
}

Let's take the query all as an example, where:

  • The query type is match_all
  • no query condition
// 查询所有
GET /indexName/_search
{
  "query": {
    "match_all": {}
  }
}

Other queries are nothing more than changes in query types and query conditions .

insert image description here


Full text search query

Usage scenario:
The basic process of full-text search query is as follows:

  • Segment the content of the user's search and get the entry
  • According to the entry to match in the inverted index library, get the document id
  • Find the document according to the document id and return it to the user

That is, when searching, results are returned by matching the keywords. Because terms are used for matching, the fields participating in the search must be segmentable text-type fields.

basic grammar

Common full-text search queries include:

  • match query: single field query
  • multi_match query: multi-field query; a document matches as long as any one of the fields meets the condition

The match query syntax is as follows:

GET /indexName/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT"
    }
  }
}

insert image description here


The multi_match syntax is as follows:

GET /indexName/_search
{
  "query": {
    "multi_match": {
      "query": "TEXT",
      "fields": ["FIELD1", "FIELD2"]
    }
  }
}

insert image description here

It can be seen that the results of the two queries are the same, why?

Because we copied the brand, name, and business values ​​into the all field using copy_to . Therefore, you can search based on three fields, and of course the same effect as searching based on all fields.

However, the more search fields, the greater the impact on query performance, so it is recommended to use copy_to and then single-field query.
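For example, with the hotel index built earlier, a single-field query against the copied all field could look like this (a sketch; the search text is arbitrary):

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "外滩如家"
    }
  }
}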

insert image description here
The difference between match and multi_match

  • match: query based on a field
  • multi_match: Query based on multiple fields, the more fields involved in the query, the worse the query performance

Accurate query

Precise queries generally target keyword, numeric, date and boolean fields, so the search terms are not segmented. Common ones are:

  • term: query based on the exact value of the term
  • range: query based on the range of values

term query

Because exact queries run against fields that are not segmented, the query condition must also be an unsegmented term. A document matches only when the user input is exactly equal to the stored value; if the user types more than the stored term, nothing will be found.

Grammar description:

// term查询
GET /indexName/_search
{
  "query": {
    "term": {
      "FIELD": {
        "value": "VALUE"
      }
    }
  }
}
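As a concrete illustration (a sketch against the hotel index; field and value are assumed):

GET /hotel/_search
{
  "query": {
    "term": {
      "city": {
        "value": "上海"
      }
    }
  }
}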

The screenshot below shows an exact term match.
insert image description here
However, when the search content is not a single term but a phrase made up of several words, nothing is found:

insert image description here


range query

Range query is generally used when performing range filtering on numeric types. For example, do price range filtering.

Basic syntax:

// range查询
GET /indexName/_search
{
  "query": {
    "range": {
      "FIELD": {
        "gte": 10, // 这里的gte代表大于等于,gt则代表大于
        "lte": 20  // lte代表小于等于,lt则代表小于
      }
    }
  }
}

insert image description here


Summarize

What are the common types of precise query?

  • Term query: Exact match based on terms, general search keyword type, numeric type, Boolean type, date type fields
  • range query: query based on the range of values, which can be ranges of values ​​and dates

Geographical coordinate query

所谓的地理坐标查询,其实就是根据经纬度查询,官方文档:https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html

常见的使用场景包括:

  • 携程:搜索我附近的酒店
  • 滴滴:搜索我附近的出租车
  • 微信:搜索我附近的人

矩形范围查询

矩形范围查询,也就是geo_bounding_box查询,查询坐标落在某个矩形范围的所有文档:

insert image description here

When querying, specify the coordinates of the rectangle's top-left and bottom-right corners; every point that falls inside the rectangle matches.

语法如下:

// geo_bounding_box查询
GET /indexName/_search
{
  "query": {
    "geo_bounding_box": {
      "FIELD": {
        "top_left": {     // 左上点
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": { // 右下点
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}

但这种并不符合“附近的人”这样的需求,且用之甚少,不过多解析


附近查询

附近查询,也叫做距离查询(geo_distance):查询到指定中心点小于某个距离值的所有文档。

换句话来说,在地图上找一个点作为圆心,以指定距离为半径,画一个圆,落在圆内的坐标都算符合条件:
insert image description here

// geo_distance 查询
GET /indexName/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km",    // 半径
      "FIELD": "31.21,121.5" // 圆心
    }
  }
}

示例
我们搜索以陆家嘴坐标为圆心,附近15km的酒店:
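A concrete query for this requirement could be written as follows (a sketch; the coordinates are the ones used above, and location is the geo_point field from the earlier hotel mapping):

GET /hotel/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km",
      "location": "31.21,121.5"
    }
  }
}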

insert image description here
发现共有47家酒店。


复合查询

复合(compound)查询:复合查询可以将其它简单查询组合起来,实现更复杂的搜索逻辑。常见的有两种:

  • fuction score:算分函数查询,可以控制文档相关性算分,控制文档排名
  • bool query:布尔查询,利用逻辑关系组合多个其它的查询,实现复杂搜索

相关性算分

当我们利用match查询时,文档结果会根据与搜索词条的关联度打分(_score),返回结果时按照分值降序排列

insert image description here


The TF-IDF algorithm has a flaw, that is, the higher the term frequency, the higher the document score, and a single term has a greater impact on the document. However, BM25 will have an upper limit for the score of a single entry, and the curve will be smoother:
insert image description here

From version 5.1 onwards, elasticsearch upgraded the scoring algorithm to BM25.


Calculation function query

Scoring based on relevance is a reasonable requirement, but reasonable ones are not necessarily what product managers need .

Taking Baidu as an example, in your search results, it is not that the higher the relevance, the higher the ranking, but the higher the ranking is for who pays more. As shown in the picture:
insert image description here

Grammar Description

insert image description here

The function score query contains four parts:

  • Original query condition: query part, search for documents based on this condition, and score the document based on the BM25 algorithm, the original score (query score)
  • Filter condition : the filter part, documents that meet this condition will be recalculated
  • Calculation function : Documents that meet the filter conditions need to be calculated according to this function, and the obtained function score (function score), there are four functions
    • weight: the result of the function is a constant
    • field_value_factor: Use a field value in the document as the function result
    • random_score: Use random numbers as the result of the function
    • script_score: custom scoring function algorithm
  • Calculation mode : the result of the calculation function, the correlation calculation score of the original query, and the calculation method between the two, including:
    • multiply: Multiply
    • replace: replace query score with function score
    • Others, such as: sum, avg, max, min

The operation process of function score is as follows:

  • 1) Query and search documents according to the original conditions , and calculate the relevance score, called the original score (query score)
  • 2) According to filter conditions , filter documents
  • 3) For documents that meet the filter conditions , the function score is obtained based on the calculation of the score function
  • 4) The original score (query score) and function score (function score) are calculated based on the operation mode , and the final result is obtained as a correlation score.

The key points are:

  • Filter conditions: determine which documents have their scores modified
  • Scoring function: the algorithm to determine the score of the function
  • Calculation mode: determine the final calculation result

example

Requirements: Rank hotels with the brand "Home Inn" higher

Translate this requirement into the four points mentioned before:

  • Original condition: Uncertain, can change arbitrarily
  • Filter condition: brand = "Home Inn"
  • Calculation function: It can be simple and rude, and directly give a fixed calculation result, weight
  • Operation mode: such as summation

So the final DSL statement is as follows:

GET /hotel/_search
{
  "query": {
    "function_score": {
      "query": { .... },  // 原始查询,可以是任意条件
      "functions": [      // 算分函数
        {
          "filter": {     // 满足的条件,品牌必须是如家
            "term": {
              "brand": "如家"
            }
          },
          "weight": 2     // 算分权重为2
        }
      ],
      "boost_mode": "sum" // 加权模式,求和
    }
  }
}

Test, when the scoring function is not added, Home Inn's score is as follows:
insert image description here
After adding the scoring function, Home Inn's score is improved:

insert image description here

summary

  • Filter criteria: which documents should be added points
  • Calculation function: how to calculate function score
  • Weighting method: how to calculate function score and query score

Boolean query

A Boolean query is a combination of one or more query clauses, each of which is a subquery . Subqueries can be combined in the following ways:

  • must: must match each subquery, similar to "and"
  • should: Selective matching subquery, similar to "or"
  • must_not: must not match, does not participate in scoring , similar to "not"
  • filter: must match, do not participate in scoring

For example, when searching for hotels, in addition to keyword search, we may also filter according to fields such as brand, price, city, etc. At this time, we need to combine queries:
each different field has different query conditions and methods. It must be multiple different queries, and to combine these queries, you must use bool queries.
It should be noted that when searching, the more fields involved in scoring, the worse the query performance will be . Therefore, it is recommended to do this when querying with multiple conditions:

  • The keyword search in the search box is a full-text search query, use must query, and participate in scoring
  • For other filter conditions, use filter query. Do not participate in scoring

Grammatical explanation
The query city is Shanghai, the brand is Crowne Plaza or Ramada; the price is greater than 500, and the hotel score is greater than or equal to 4.5 points

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "city": "上海" }}
      ],
      "should": [
        { "term": { "brand": "皇冠假日" }},
        { "term": { "brand": "华美达" }}
      ],
      "must_not": [
        { "range": { "price": { "lte": 500 } }}
      ],
      "filter": [
        { "range": { "score": { "gte": 45 } }}
      ]
    }
  }
}

Example
Requirement: search for hotels whose name contains "Home Inn", the price is not higher than 400, and within 10km around the coordinates 31.21, 121.5.

analyze:

  • Name search is a full-text search query and should be involved in scoring. put in must
  • If the price is not higher than 400, use range to query, which belongs to the filter condition and does not participate in the calculation of points. put in must_not
  • Within the range of 10km, use geo_distance to query, which belongs to the filter condition and does not participate in the calculation of points. put in filter
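Putting the analysis together, the DSL might look like this (a sketch intended to match the screenshot below; location is the geo_point field from the earlier mapping):

GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "如家" }}
      ],
      "must_not": [
        { "range": { "price": { "gt": 400 } }}
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": "31.21,121.5"
          }
        }
      ]
    }
  }
}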

insert image description here


summary

How many logical relationships does bool query have?

  • must: conditions that must be matched, can be understood as "and"
  • should: The condition for selective matching, which can be understood as "or"
  • must_not: conditions that must not match, do not participate in scoring
  • filter: conditions that must be matched, do not participate in scoring

Search result processing

to sort

elasticsearch默认是根据相关度算分(_score)来排序,但是也支持自定义方式对搜索结果排序。可以排序字段类型有:keyword类型、数值类型、地理坐标类型、日期类型等。

普通字段排序

keyword、数值、日期类型排序的语法基本一致。

语法

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "FIELD": "desc"  // 排序字段、排序方式ASC、DESC
    }
  ]
}

排序条件是一个数组,也就是可以写多个排序条件。按照声明的顺序,当第一个条件相等时,再按照第二个条件排序,以此类推

示例

需求描述:酒店数据按照用户评价(score)降序排序,评价相同的按照价格(price)升序排序
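The DSL for this requirement could be written as follows (a sketch matching the screenshot below):

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "score": "desc" },
    { "price": "asc" }
  ]
}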

insert image description here


地理坐标排序

这个场景,我们并不陌生,打车,点外卖,去游玩,app总是会把据我们位置,距离最近的商家排在前面

地理坐标排序略有不同。

语法说明

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance" : {
          "FIELD" : "纬度,经度", // 文档中geo_point类型的字段名、目标坐标点
          "order" : "asc",      // 排序方式
          "unit" : "km"         // 排序的距离单位
      }
    }
  ]
}

这个查询的含义是:

  • 指定一个坐标,作为目标点
  • 计算每一个文档中,指定字段(必须是geo_point类型)的坐标 到目标点的距离是多少
  • 根据距离排序

示例:

需求描述:实现对酒店数据按照到你的位置坐标的距离升序排序

提示:获取你的位置的经纬度的方式:https://lbs.amap.com/demo/jsapi-v2/example/map/click-to-get-lnglat/
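A possible query for this requirement (a sketch; the coordinates below merely stand in for your own location):

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "location": "31.034661,121.612282",
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}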

insert image description here


分页

elasticsearch 默认情况下只返回top10的数据。而如果要查询更多数据就需要修改分页参数了。elasticsearch中通过修改from、size参数来控制要返回的分页结果:

  • from:从第几个文档开始
  • size:总共查询几个文档

类似于mysql中的limit

基本的分页

分页的基本语法如下:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,  // 分页开始的位置,默认为0
  "size": 10, // 期望获取的文档总数
  "sort": [
    { "price": "asc" }
  ]
}

深度分页

要查询990~1000的数据,查询逻辑这么写:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 990, // 分页开始的位置,默认为0
  "size": 10,  // 期望获取的文档总数
  "sort": [
    { "price": "asc" }
  ]
}

这里是查询990开始的数据,也就是 第990~第1000条 数据。

However, when paging inside elasticsearch, you must first query 0~1000 entries, and then intercept the 10 entries of 990~1000

insert image description here
Query TOP1000, if es is a single-point mode, this does not have much impact.

But elasticsearch must be a cluster in the future. For example, my cluster has 5 nodes, and I want to query TOP1000 data. It is not enough to query 200 items per node.

Because the TOP200 of node A may be ranked beyond 10,000 on another node.

Therefore, if you want to obtain the TOP1000 of the entire cluster, you must first query the TOP1000 of each node. After summarizing the results, re-rank and re-intercept the TOP1000.

insert image description here
If I want to query the data of 9900~10000, I need to query TOP10000 first, then each node needs to query 10000 items, which are summarized into the memory. There are too many data and the pressure on the memory is too large, so elasticsearch will prohibit requests with from+ size exceeding 10000

For deep paging, ES provides two solutions, official documents :

  • search after: sorting is required when paging, the principle is to query the next page of data starting from the last sorting value. The official recommended way to use.
  • scroll: The principle is to form a snapshot of the sorted document ids and store them in memory. It is officially deprecated.
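As a rough illustration of search after (a sketch, not tied to this project's data): each page is sorted, and the request for the next page passes the sort values of the last hit of the previous page in search_after:

GET /hotel/_search
{
  "query": { "match_all": {} },
  "size": 10,
  "sort": [
    { "price": "asc" },
    { "id": "asc" }
  ],
  "search_after": [ 459, "2056126831" ] // sort values of the last document on the previous page (example values)
}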

summary

Common implementation schemes and advantages and disadvantages of pagination query:

  • from + size

    • Advantages: Support random page turning
    • Disadvantages: deep paging problem, the default query upper limit (from + size) is 10000
    • Scenario: Random page-turning searches such as Baidu, JD.com, Google, and Taobao
  • after search

    • Advantages: no query upper limit (the size of a single query does not exceed 10000)
    • Disadvantage: can only query backward page by page, does not support random page turning
    • Scenario: Search without random page turning requirements, such as mobile phone scrolling down to turn pages
  • scroll

    • Advantages: no query upper limit (the size of a single query does not exceed 10000)
    • Disadvantages: There will be additional memory consumption, and the search results are not real-time
    • Scenario: Acquisition and migration of massive data. It is not recommended starting from ES7.1. It is recommended to use the after search solution.

highlight

When we search on Baidu and Jingdong, the keywords will turn red, which is more eye-catching. This is called highlighting

The implementation of highlighting is divided into two steps:

  • 1) Add a label to all keywords in the document, such as <em>label
  • 2)页面给<em>标签编写CSS样式

实现高亮

GET /hotel/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT" // 查询条件,高亮一定要使用全文检索查询
    }
  },
  "highlight": {
    "fields": {       // 指定要高亮的字段
      "FIELD": {
        "pre_tags": "<em>",  // 用来标记高亮字段的前置标签(可以不加,默认就是它)
        "post_tags": "</em>" // 用来标记高亮字段的后置标签
      }
    }
  }
}

注意:

  • 高亮是对关键字高亮,因此搜索条件必须带有关键字,而不能是范围这样的查询。
  • 默认情况下,高亮的字段,必须与搜索指定的字段一致,否则无法高亮
  • 如果要对非搜索字段高亮,则需要添加一个属性:require_field_match=false

示例
insert image description here
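The example in the screenshot is along these lines (a sketch; field names follow the hotel index used earlier, and require_field_match is set because the search field all differs from the highlighted field name):

GET /hotel/_search
{
  "query": {
    "match": {
      "all": "如家"
    }
  },
  "highlight": {
    "fields": {
      "name": {
        "require_field_match": "false"
      }
    }
  }
}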


总结

查询的DSL是一个大的JSON对象,包含下列属性:

  • query:查询条件
  • from和size:分页条件
  • sort:排序条件
  • highlight:高亮条件

示例:
insert image description here


RestClient查询文档

快速入门

操作几乎和前面的CRUD步骤基本相同
1.导入RestClient的依赖

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>

因为SpringBoot默认的ES版本是7.6.2,所以我们需要覆盖默认的ES版本:

<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

2.初始化RestClient
为了单元测试方便,创建一个测试类,将初始化的代码编写在@BeforeEach方法中

private RestHighLevelClient client;

@BeforeEach
void setUp() {
    this.client = new RestHighLevelClient(RestClient.builder(
            HttpHost.create("http://47.100.200.177:9200")
    ));
}

@AfterEach
void tearDown() throws IOException {
    this.client.close();
}

3.编写java代码,代替DSL查询语句

基本步骤包括:

  • 1)准备Request对象
  • 2)准备请求参数
  • 3)发起请求
  • 4)解析响应

原DSL的查询请求格式
insert image description here
代码解读:

  • 第一步,创建SearchRequest对象,指定索引库名

  • 第二步,利用request.source()构建DSL,DSL中可以包含查询、分页、排序、高亮等

    • query():代表查询条件,利用QueryBuilders.matchAllQuery()构建一个match_all查询的DSL
  • 第三步,利用client.search()发送请求,得到响应

这里关键的API有两个,一个是request.source(),其中包含了查询、排序、分页、高亮等所有功能:

insert image description here

The other is QueryBuilders, which contains all the various queries such as match, term, function_score, bool, etc.:

insert image description here

Finally parse the returned result
insert image description here

The result returned by elasticsearch is a JSON string, the structure contains:

  • hits: the result of the hit
    • total: The total number of entries, where value is the specific total entry value
    • max_score: the relevance score of the highest scoring document across all results
    • hits: An array of documents for search results, each of which is a json object
      • _source: the original data in the document, also a json object

Therefore, we parse the response result, which is to parse the JSON string layer by layer. The process is as follows:

  • SearchHits: Obtained through response.getHits(), which is the outermost hits in JSON, representing the result of the hit
    • SearchHits.getTotalHits().value: Get the total number of information
    • SearchHits.getHits(): Get the SearchHit array, which is the document array
      • SearchHit.getSourceAsString(): Get the _source in the document result, which is the original json document data

The code is implemented as follows

    @Test
    void testMatchAll() throws IOException {
        //准备Request
        SearchRequest request = new SearchRequest("hotel");
        //组织DSL参数
        request.source().query(QueryBuilders.matchAllQuery());
        //发送请求,得到相应结果
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);

        /**
         * 解析查询返回的json字符串
         */
        handleResponse(response);
    }

    private void handleResponse(SearchResponse response) {
        SearchHits searchHits = response.getHits();
        //获取总条数
        TotalHits total = searchHits.getTotalHits();
        //获取查询结果的数组
        SearchHit[] hits = searchHits.getHits();
        for (SearchHit hit : hits) {
            //获取文档的source的json串
            String json = hit.getSourceAsString();
            //反序列化为HotelDoc对象
            HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
            System.out.println(hotelDoc);
        }
    }

summary

The basic steps of a query are:

  1. Create a SearchRequest object

  2. Prepare Request.source(), which is DSL.

    ① QueryBuilders to build query conditions

    ② Pass in the query() method of Request.source()

  3. send request, get result

  4. Parsing results (refer to JSON results, from outside to inside, parse layer by layer)


match query

The match and multi_match queries of full-text search are basically the same as the API of match_all. The difference is the query condition, which is the query part.
insert image description here
Therefore, the difference in the Java code is mainly the parameters in request.source().query(). The method provided by QueryBuilders is also used
insert image description here
, and the result parsing code is completely consistent, which can be extracted and shared.

The complete code is as follows:

@Test
void testMatch() throws IOException {
    // 1.准备Request
    SearchRequest request = new SearchRequest("hotel");
    // 2.准备DSL
    request.source()
        .query(QueryBuilders.matchQuery("all", "如家"));
    // 3.发送请求
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4.解析响应
    handleResponse(response);
}

Exact query

Exact queries are mainly two:

  • term: term exact match
  • range: range query

Compared with the previous query, the difference is also in the query condition, and everything else is the same.

The API for query condition construction is as follows:
insert image description here

    @Test
    void testExact() throws IOException {
        //准备request
        SearchRequest request = new SearchRequest("hotel");
        //准备DSL
        request.source().query(QueryBuilders.termQuery("city", "杭州"));//精确匹配城市杭州的酒店
//        request.source().query(QueryBuilders.rangeQuery("price").gte(100).lte(200));//范围查询价格大于等于100小于等于200
        //发送请求
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        //解析结果
        handleResponse(response);
    }

Boolean query

Boolean query is to combine other queries with must, must_not, filter, etc. The code example is as follows:
insert image description here
As you can see, the difference between API and other queries is also in the construction of query conditions, QueryBuilders, result analysis and other codes are completely unchanged.

@Test
void testBool() throws IOException {
    // 1.准备Request
    SearchRequest request = new SearchRequest("hotel");
    // 2.准备DSL
    // 2.1.准备BooleanQuery
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.2.添加term
    boolQuery.must(QueryBuilders.termQuery("city", "杭州"));
    // 2.3.添加range
    boolQuery.filter(QueryBuilders.rangeQuery("price").lte(250));

    request.source().query(boolQuery);
    // 3.发送请求
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4.解析响应
    handleResponse(response);
}

sorting, pagination

The sorting and paging of search results are parameters at the same level as query, so they are also set using request.source().

The corresponding API is as follows:
insert image description here
complete code example:

@Test
void testPageAndSort() throws IOException {
    // 页码,每页大小
    int page = 1, size = 5;

    // 1.准备Request
    SearchRequest request = new SearchRequest("hotel");
    // 2.准备DSL
    // 2.1.query
    request.source().query(QueryBuilders.matchAllQuery());
    // 2.2.排序 sort
    request.source().sort("price", SortOrder.ASC);
    // 2.3.分页 from、size
    request.source().from((page - 1) * size).size(size);
    // 3.发送请求
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4.解析响应
    handleResponse(response);
}

highlight

The highlighted code is quite different from the previous code, there are two points:

  • Query DSL: In addition to query conditions, you also need to add highlight conditions, which are also at the same level as query.
  • Result parsing: In addition to parsing the _source document data, the result also needs to parse the highlighted result

Highlight request build

insert image description here

The above code omits the query condition part, but please don’t forget: the highlight query must use full-text search query, and there must be a search keyword, so that keywords can be highlighted in the future.

The complete code is as follows:

@Test
void testHighlight() throws IOException {
    // 1.准备Request
    SearchRequest request = new SearchRequest("hotel");
    // 2.准备DSL
    // 2.1.query
    request.source().query(QueryBuilders.matchQuery("all", "如家"));
    // 2.2.高亮
    request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));
    // 3.发送请求
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4.解析响应
    handleResponse(response);
}

Highlight result analysis
insert image description here
Code interpretation:

  • Step 1: Get the source from the result. hit.getSourceAsString(), this part is the non-highlighted result, json string. It also needs to be deserialized into a HotelDoc object
  • Step 2: Obtain the highlighted result. hit.getHighlightFields(), the return value is a Map, the key is the highlight field name, and the value is the HighlightField object, representing the highlight value
  • Step 3: Obtain the highlighted field value object HighlightField from the map according to the highlighted field name
  • Step 4: Get Fragments from HighlightField and convert them to strings. This part is the real highlighted string
  • Step 5: Replace non-highlighted results in HotelDoc with highlighted results

The response code for parsing is modified as follows:

private void handleResponse(SearchResponse response) {
    // 4.解析响应
    SearchHits searchHits = response.getHits();
    // 4.1.获取总条数
    long total = searchHits.getTotalHits().value;
    System.out.println("共搜索到" + total + "条数据");
    // 4.2.文档数组
    SearchHit[] hits = searchHits.getHits();
    // 4.3.遍历
    for (SearchHit hit : hits) {
        // 获取文档source
        String json = hit.getSourceAsString();
        // 反序列化
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
        // 获取高亮结果
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        if (!CollectionUtils.isEmpty(highlightFields)) {
            // 根据字段名获取高亮结果
            HighlightField highlightField = highlightFields.get("name");
            if (highlightField != null) {
                // 获取高亮值
                String name = highlightField.getFragments()[0].string();
                // 覆盖非高亮结果
                hotelDoc.setName(name);
            }
        }
        System.out.println("hotelDoc = " + hotelDoc);
    }
}

Tourism project case

First import the dependencies of RestClient

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>

Because the default ES version of SpringBoot is 7.6.2, we need to override the default ES version:

<properties>
    <java.version>1.8</java.version>
    <elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

Next, register the RestHighLevelClient as a bean in the startup class to complete its initialization:

    @Bean
    public RestHighLevelClient client(){
        return new RestHighLevelClient(RestClient.builder(
                HttpHost.create("http://47.100.200.177:9200")
        ));
    }

Define the front-end request parameter entity class

insert image description here

insert image description here

@Data
public class RequestParams {
    private String key;
    private Integer page;
    private Integer size;
    private String sortBy;
}

Define the response result entity class that the server should return

The paging query returns the paging result PageResult, which contains two attributes:

  • total: total number
  • List<HotelDoc>: Data of the current page
@Data
public class PageResult {
    private Long total;
    private List<HotelDoc> hotels;

    public PageResult() {
    }

    public PageResult(Long total, List<HotelDoc> hotels) {
        this.total = total;
        this.hotels = hotels;
    }
}

Hotel Search and Pagination

Define a HotelController, declare the query interface, and meet the following requirements:

  • Request method: Post
  • Request path: /hotel/list
  • Request parameter: an object of type RequestParams
  • Return value: PageResult, which contains two attributes
    • Long total: total number
    • List<HotelDoc> hotels: hotel data
@Slf4j
@RestController
@RequestMapping("/hotel")
public class HotelController {
    @Autowired
    private IHotelService hotelService;
    
    //搜索酒店数据
    @PostMapping("/list")
    public PageResult search(@RequestBody RequestParams params){
        return hotelService.search(params);
    }
}

Then implement the search business in the service layer
1. Define a method in the IHotelService interface:

/**
 * 根据关键字搜索酒店信息
 * @param params 请求参数对象,包含用户输入的关键字 
 * @return 酒店文档列表
 */
PageResult search(RequestParams params);

2. Implement the search method in cn.itcast.hotel.service.impl.HotelService:

@Override
public PageResult search(RequestParams params) {
    try {
        // 1.准备Request
        SearchRequest request = new SearchRequest("hotel");
        // 2.准备DSL
        // 2.1.query
        String key = params.getKey();
        if (key == null || "".equals(key)) {
            request.source().query(QueryBuilders.matchAllQuery());
        } else {
            request.source().query(QueryBuilders.matchQuery("all", key));
        }

        // 2.2.分页
        int page = params.getPage();
        int size = params.getSize();
        request.source().from((page - 1) * size).size(size);

        // 3.发送请求
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4.解析响应
        return handleResponse(response);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

// 结果解析
private PageResult handleResponse(SearchResponse response) {
    // 4.解析响应
    SearchHits searchHits = response.getHits();
    // 4.1.获取总条数
    long total = searchHits.getTotalHits().value;
    // 4.2.文档数组
    SearchHit[] hits = searchHits.getHits();
    // 4.3.遍历
    List<HotelDoc> hotels = new ArrayList<>();
    for (SearchHit hit : hits) {
        // 获取文档source
        String json = hit.getSourceAsString();
        // 反序列化
        HotelDoc hotelDoc = JSON.parseObject(json, HotelDoc.class);
		// 放入集合
        hotels.add(hotelDoc);
    }
    // 4.4.封装返回
    return new PageResult(total, hotels);
}

It should be noted that the method of processing the returned result needs to be modified, and the final returned value is encapsulated into the PageResult object we defined

insert image description here

insert image description here
insert image description here


Hotel results filter

Requirements: Add filter functions such as brand, city, star rating, price, etc.
insert image description here
The passed parameters are as shown in the figure:
insert image description here

Included filters are:

  • brand: brand value
  • city: city
  • minPrice~maxPrice: price range
  • starName: star

We need to do two things:

  • ①Modify the object RequestParams of the request parameters and receive the above parameters
  • ②Modify the business logic and add some filter conditions in addition to the search conditions

Modify entity class
Entity class RequestParams to add city, brand, star rating, price parameters

@Data
public class RequestParams {
    private String key;
    private Integer page;
    private Integer size;
    private String sortBy;
    // 下面是新增的过滤条件参数
    private String city;
    private String brand;
    private String starName;
    private Integer minPrice;
    private Integer maxPrice;
}

Modify the search service
In the search method of HotelService, only one place needs to be modified: the query condition in request.source().query( ... ).

In the previous business, there was only match query, and it was searched according to keywords. Now it is necessary to add conditional filtering, including:

  • Brand filtering: keyword type, query by term
  • Star filter: keyword type, use term query
  • Price filtering: it is a numeric type, query with range
  • City filter: keyword type, query with term

The combination of multiple query conditions must be combined with boolean queries:

  • Put the keyword search in the must, and participate in the score calculation
  • Other filter conditions are placed in the filter and do not participate in the calculation of points

Because the logic of conditional construction is more complicated, it is encapsulated as a function first:

insert image description here
The code for buildBasicQuery is as follows

private void buildBasicQuery(RequestParams params, SearchRequest request) {
    // 1.构建BooleanQuery
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.关键字搜索
    String key = params.getKey();
    if (key == null || "".equals(key)) {
        boolQuery.must(QueryBuilders.matchAllQuery());
    } else {
        boolQuery.must(QueryBuilders.matchQuery("all", key));
    }
    // 3.城市条件
    if (params.getCity() != null && !params.getCity().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("city", params.getCity()));
    }
    // 4.品牌条件
    if (params.getBrand() != null && !params.getBrand().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("brand", params.getBrand()));
    }
    // 5.星级条件
    if (params.getStarName() != null && !params.getStarName().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("starName", params.getStarName()));
    }
	// 6.价格
    if (params.getMinPrice() != null && params.getMaxPrice() != null) {
        boolQuery.filter(QueryBuilders
                         .rangeQuery("price")
                         .gte(params.getMinPrice())
                         .lte(params.getMaxPrice())
                        );
    }
	// 7.放入source
    request.source().query(boolQuery);
}

Hotels near me

On the right side of the hotel list page there is a small map. Click the map's location button and the map locates you,
and the front end then sends a query request carrying your coordinates to the server:
insert image description here

insert image description here

What we have to do is to sort the surrounding hotels according to the distance based on the location coordinates. The implementation idea is as follows:

  • Modify the RequestParams parameter to receive the location field
  • Modify the business logic of the search method, if the location has a value, add the function of sorting according to geo_distance
  • Modify the method of processing the response result, and parse the distance value from the JSON string

So far we have only covered the DSL syntax for sorting by geographic coordinates, which looks like this:

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": "asc"
    },
    {
      "_geo_distance" : {
          "FIELD" : "纬度,经度",
          "order" : "asc",
          "unit" : "km"
      }
    }
  ]
}

Corresponding java code example:
insert image description here
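Since that image is not available here, a minimal sketch of the same Java API (the identical sort call appears in the complete search method below; the coordinate string is just an example value, in "纬度, 经度" format) is:

// 按location字段做geo_distance排序,坐标为示例值,实际来自前端传入的location参数
request.source().sort(SortBuilders
        .geoDistanceSort("location", new GeoPoint("31.21, 121.5"))
        .order(SortOrder.ASC)
        .unit(DistanceUnit.KILOMETERS));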


Add distance sorting

insert image description here

@Override
public PageResult search(RequestParams params) {
    try {
        // 1.准备Request
        SearchRequest request = new SearchRequest("hotel");
        // 2.准备DSL
        // 2.1.query
        buildBasicQuery(params, request);

        // 2.2.分页
        int page = params.getPage();
        int size = params.getSize();
        request.source().from((page - 1) * size).size(size);

        // 2.3.排序
        String location = params.getLocation();
        if (location != null && !location.equals("")) {
            request.source().sort(SortBuilders
                                  .geoDistanceSort("location", new GeoPoint(location))
                                  .order(SortOrder.ASC)
                                  .unit(DistanceUnit.KILOMETERS)
                                 );
        }

        // 3.发送请求
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4.解析响应
        return handleResponse(response);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

After sorting is done, the page also needs the concrete distance of each nearby hotel. This value sits in its own part of the response:
insert image description here
So in the result-parsing stage, besides the source part we also need to read the sort part, which holds the sorted distance, and put it into the response result.

We do two things:

  • Modify HotelDoc, add a sorting distance field for page display
  • Modify the handleResponse method in the HotelService class to add the acquisition of the sort value

1) Modify the HotelDoc class and add the distance field distance
insert image description here

2) Modify the handleResponse method in HotelService

insert image description here
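The two images are not available here, so as a rough sketch: assuming HotelDoc gains an Object distance field with a setter, the extra parsing added inside the for loop of handleResponse (right after the source is deserialized) might look like this:

// 获取排序值,即geo_distance排序得到的距离
Object[] sortValues = hit.getSortValues();
if (sortValues != null && sortValues.length > 0) {
    Object sortValue = sortValues[0];  // 第一个排序值就是距离
    hotelDoc.setDistance(sortValue);   // 假设HotelDoc新增了distance字段
}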

The final result is as shown below
insert image description here


Hotel PPC

Requirement: make specified hotels rank at the top of the search results by modifying their relevance score; the higher the score, the higher the ranking.

To make the specified hotel rank at the top of the search results, the effect is as shown in the figure:

insert image description here
The page adds an ad tag to the specified hotels.

The function_score query learned before can affect the calculation score. The higher the calculation score, the higher the natural ranking. And function_score contains 3 elements:

  • Filter criteria: which documents should be added points
  • Calculation function: how to calculate function score
  • Weighting mode: how the function score is combined with the query score

The demand here is: to make the designated hotel rank high. Therefore, we need to add a mark to these hotels, so that in the filter condition, we can judge according to this mark whether to increase the score .

For example, we add a field to the hotel: isAD, Boolean type:

  • true: is an advertisement
  • false: not an ad

In this way, function_score contains 3 elements and it is easy to determine:

  • Filter condition: check whether isAD is true
  • Score function: the simplest choice is weight, a fixed weighting value
  • Weighting mode: the default multiplication is enough to greatly boost the score

Therefore, the implementation steps of the business include:

  1. Add isAD field to HotelDoc class, Boolean type

  2. Pick a few hotels you like, add the isAD field to its document data, and the value is true

  3. Modify the search method, add the function score function, and add weight to the hotel whose isAD value is true


Modify HotelDoc entity
HotelDoc class to add isAD field
insert image description here
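The image is missing here; the change itself is just one extra field on HotelDoc (sketch, field name as described above):

// HotelDoc中新增的广告标记字段(示意)
private Boolean isAD;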

Add an advertisement tag
Next, we pick a few hotels, add the isAD field, and set it to true:

POST /hotel/_update/1902197537
{
    "doc": {
        "isAD": true
    }
}
POST /hotel/_update/2056126831
{
    "doc": {
        "isAD": true
    }
}
POST /hotel/_update/1989806195
{
    "doc": {
        "isAD": true
    }
}
POST /hotel/_update/2056105938
{
    "doc": {
        "isAD": true
    }
}

Add calculation function query

Next, we will modify the query conditions. The boolean query was used before; now it needs to be wrapped in a function_score query.
The function_score query structure is as follows:
insert image description here
The corresponding Java API is as follows:
insert image description here
You can put the previously written boolean query into the function_score query as the original query condition, and then add the filter condition, score function and weighting mode, so the original code can still be reused.

private void buildBasicQuery(RequestParams params, SearchRequest request) {
    // 1.构建BooleanQuery
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 关键字搜索
    String key = params.getKey();
    if (key == null || "".equals(key)) {
        boolQuery.must(QueryBuilders.matchAllQuery());
    } else {
        boolQuery.must(QueryBuilders.matchQuery("all", key));
    }
    // 城市条件
    if (params.getCity() != null && !params.getCity().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("city", params.getCity()));
    }
    // 品牌条件
    if (params.getBrand() != null && !params.getBrand().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("brand", params.getBrand()));
    }
    // 星级条件
    if (params.getStarName() != null && !params.getStarName().equals("")) {
        boolQuery.filter(QueryBuilders.termQuery("starName", params.getStarName()));
    }
    // 价格
    if (params.getMinPrice() != null && params.getMaxPrice() != null) {
        boolQuery.filter(QueryBuilders
                         .rangeQuery("price")
                         .gte(params.getMinPrice())
                         .lte(params.getMaxPrice())
                        );
    }

    // 2.算分控制
    FunctionScoreQueryBuilder functionScoreQuery =
        QueryBuilders.functionScoreQuery(
        // 原始查询,相关性算分的查询
        boolQuery,
        // function score的数组
        new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
            // 其中的一个function score 元素
            new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                // 过滤条件
                QueryBuilders.termQuery("isAD", true),
                // 算分函数
                ScoreFunctionBuilders.weightFactorFunction(10)
            )
        });
    request.source().query(functionScoreQuery);
}

data aggregation

type of aggregation

There are three common types of aggregation:

  • Bucket aggregation: used to group documents, just like garbage classification in life, put different garbage into different trash bins

    • TermAggregation: group by document field value, such as group by brand value, group by country
    • Date Histogram: Group by date ladder, for example, a week as a group, or a month as a group
  • Metric aggregation: used to calculate some values, such as: maximum value, minimum value, average value, etc.

    • Avg: Average
    • Max: find the maximum value
    • Min: Find the minimum value
    • Stats: Simultaneously seek max, min, avg, sum, etc.
  • Pipeline aggregation: aggregation based on the results of other aggregations

Note: the fields participating in aggregation must be of keyword, date, numeric, or Boolean type


DSL for Aggregation

Now we want to count the hotel brands that appear in the data, which means grouping the documents by brand. Aggregating on the hotel brand name is a Bucket aggregation.

Bucket (bucket) aggregation syntax

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,  // 设置size为0,结果中不包含文档,只包含聚合结果
  "aggs": {   // 定义聚合
    "brandAgg": {  // 给聚合起个名字
      "terms": {   // 聚合的类型,按照品牌值聚合,所以选择term
        "field": "brand", // 参与聚合的字段
        "size": 20 // 希望获取的聚合结果数量
      }
    }
  }
}

The result is shown in the figure:
insert image description here


Sort aggregated results

By default, Bucket aggregation will count the number of documents in the Bucket, record it as _count, and sort in descending order of _count .

We can specify the order attribute to customize the sorting method of the aggregation:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "order": {
          "_count": "asc" // 按照_count升序排列
        },
        "size": 20
      }
    }
  }
}

Limit aggregation scope

By default, Bucket aggregation aggregates all documents in the index library, but in real scenarios users enter search conditions, so the aggregation should cover only the search results. The scope of the aggregation therefore has to be limited.

We can limit the range of documents to be aggregated by adding query conditions:

GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "lte": 200 // 只对200元以下的文档聚合
      }
    }
  },
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      }
    }
  }
}

This time, the aggregated brands are significantly less:

insert image description here


Metric (metric) aggregation syntax

We group hotels by brand to form buckets. Now we need to run calculations on the hotels in each bucket to obtain the min, max, and avg of the user ratings for each brand.

This requires the use of Metric aggregation, such as stat aggregation: you can get results such as min, max, and avg.

The syntax is as follows:

GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": {  // 是brands聚合的子聚合,也就是分组后对每组分别计算
        "score_stats": {  // 聚合名称
          "stats": {      // 聚合类型,这里stats可以计算min、max、avg等
            "field": "score" // 聚合字段,这里是score
          }
        }
      }
    }
  }
}

This time the score_stats aggregation is a sub-aggregation nested inside the brandAgg aggregation, because the calculation has to be done separately in each bucket.

In addition, we can also sort the aggregation results, for example, according to the average hotel score of each bucket:

insert image description here


summary

aggs stands for aggregation, which is at the same level as query. What is the function of query at this time?

  • Scope the aggregated documents

The three elements necessary for aggregation:

  • aggregate name
  • aggregation type
  • aggregate field

Aggregate configurable properties are:

  • size: specify the number of aggregation results
  • order: specify the sorting method of aggregation results
  • field: specify the aggregation field

RestAPI implements aggregation

API syntax

Aggregation conditions are at the same level as query conditions, so request.source() needs to be used to specify aggregation conditions.

Syntax for aggregate conditions:

insert image description here
The aggregation result differs from the query result and its API is special too, but as before, the JSON is parsed layer by layer:
insert image description here
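Since both API images are missing here, the following is a minimal sketch of the two sides, building a terms aggregation on the brand field and walking the returned buckets (it assumes the hotel index and the same client bean used earlier); the business code below follows exactly this pattern:

// 1.构建聚合请求:size为0,不要文档,只要聚合结果
SearchRequest request = new SearchRequest("hotel");
request.source().size(0);
request.source().aggregation(
        AggregationBuilders.terms("brandAgg").field("brand").size(20));
// 2.发送请求
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// 3.逐层解析:Aggregations -> 按名称取聚合 -> buckets -> key
Aggregations aggregations = response.getAggregations();
Terms brandTerms = aggregations.get("brandAgg");
for (Terms.Bucket bucket : brandTerms.getBuckets()) {
    System.out.println(bucket.getKeyAsString() + " : " + bucket.getDocCount());
}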


Business needs

Requirements: the brand, city and other filter options on the search page should not be hard-coded into the page, but obtained by aggregating the hotel data in the index library. That is, every time a condition is selected, the contents of the filter columns change: relevant options stay, irrelevant ones are removed.

For example, if I first filter for prices below 100 yuan, then the five-diamond and four-diamond options should disappear from the star column, because the data contains no four- or five-star hotels under 100.

insert image description here

analyze:

At present, the city list, star list, and brand list on the page are all hard-coded, and will not change as the search results change. But when the user's search conditions change, the search results will change accordingly.

For example, if a user searches for "Oriental Pearl", the searched hotel must be near the Shanghai Oriental Pearl Tower. Therefore, the city can only be Shanghai. At this time, Beijing, Shenzhen, and Hangzhou should not be displayed in the city list.

That is to say, which cities are included in the search results, which cities should be listed on the page; which brands are included in the search results, which brands should be listed on the page.

Use the aggregation function and Bucket aggregation to group the documents in the search results based on brands and cities, and you can know which brands and cities are included.

Because we aggregate over the search results, this is a limited-scope aggregation: the limiting conditions of the aggregation are the same as the conditions of the document search.
Looking at the browser, you can see the front end has actually sent such a request:
insert image description here
Its request parameters are exactly the same as those of the document search.

The return value type is the final result to be displayed on the page:

insert image description here
The result is a Map structure:

  • key is a string, city, star, brand, price
  • value is a collection, such as the names of multiple cities

business realization

Add a method to HotelController that meets the following requirements:

  • Request method: POST
  • Request path: /hotel/filters
  • Request parameters: RequestParams, consistent with the parameters of the search document
  • Return value type: Map<String, List<String>>

code:

    @PostMapping("filters")
    public Map<String, List<String>> getFilters(@RequestBody RequestParams params){
        return hotelService.getFilters(params);
    }

The getFilters method in IHotelService is called here, which has not been implemented yet.

Define the new method in IHotelService:

Map<String, List<String>> getFilters(RequestParams params);

Then implement it in HotelService:

@Override
public Map<String, List<String>> getFilters(RequestParams params) {
    try {
        // 1.准备Request
        SearchRequest request = new SearchRequest("hotel");
        // 2.准备DSL
        // 2.1.query
        buildBasicQuery(params, request);
        // 2.2.设置size
        request.source().size(0);
        // 2.3.聚合
        buildAggregation(request);
        // 3.发出请求
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // 4.解析结果
        Map<String, List<String>> result = new HashMap<>();
        Aggregations aggregations = response.getAggregations();
        // 4.1.根据品牌名称,获取品牌结果
        List<String> brandList = getAggByName(aggregations, "brandAgg");
        result.put("品牌", brandList);
        // 4.2.根据城市名称,获取城市结果
        List<String> cityList = getAggByName(aggregations, "cityAgg");
        result.put("城市", cityList);
        // 4.3.根据星级名称,获取星级结果
        List<String> starList = getAggByName(aggregations, "starAgg");
        result.put("星级", starList);

        return result;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private void buildAggregation(SearchRequest request) {
    request.source().aggregation(AggregationBuilders
                                 .terms("brandAgg")
                                 .field("brand")
                                 .size(100)
                                );
    request.source().aggregation(AggregationBuilders
                                 .terms("cityAgg")
                                 .field("city")
                                 .size(100)
                                );
    request.source().aggregation(AggregationBuilders
                                 .terms("starAgg")
                                 .field("starName")
                                 .size(100)
                                );
}

private List<String> getAggByName(Aggregations aggregations, String aggName) {
    // 4.1.根据聚合名称获取聚合结果
    Terms brandTerms = aggregations.get(aggName);
    // 4.2.获取buckets
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    // 4.3.遍历
    List<String> brandList = new ArrayList<>();
    for (Terms.Bucket bucket : buckets) {
        // 4.4.获取key
        String key = bucket.getKeyAsString();
        brandList.add(key);
    }
    return brandList;
}

Autocomplete

When the user enters a character in the search box, we should prompt the search item related to the character, as shown in the figure:

insert image description here
This function of prompting complete entries based on the letters entered by the user is automatic completion.

Because it needs to be inferred based on the pinyin letters, the pinyin word segmentation function is used.

Pinyin word breaker

To achieve completion based on letters, it is necessary to segment the document according to pinyin. There happens to be a pinyin word segmentation plugin for elasticsearch on GitHub. Address: https://github.com/medcl/elasticsearch-analysis-pinyin

The installation method is the same as the IK tokenizer, in three steps:

① Decompression

② Upload to the virtual machine, the plugin directory of elasticsearch

③ Restart elasticsearch

The test usage is as follows:

POST /_analyze
{
  "text": "如家酒店还不错",
  "analyzer": "pinyin"
}

result:
insert image description here


custom tokenizer

The default pinyin tokenizer turns each Chinese character into pinyin individually, but we want each term to form one group of pinyin, so we need to customize the pinyin tokenizer as part of a custom analyzer.

The composition of the analyzer in elasticsearch consists of three parts:

  • Character filters: Process the text before the tokenizer. e.g. delete characters, replace characters
  • tokenizer: cuts the text into terms according to certain rules, e.g. keyword (no real tokenization) or ik_smart
  • tokenizer filter: further process the entries output by the tokenizer. For example, case conversion, synonyms processing, pinyin processing, etc.

When a document is tokenized, it is processed by these three parts in turn:

insert image description here
The syntax for declaring a custom tokenizer is as follows:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { // 自定义分词器
        "my_analyzer": { // 分词器名称
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": { // 自定义tokenizer filter
        "py": { // 过滤器名称
          "type": "pinyin", // 过滤器类型,这里是pinyin
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

test:
insert image description here


Summarize:

How to use Pinyin tokenizer?

  • ①Download the pinyin tokenizer

  • ② Unzip and put it in the plugin directory of elasticsearch

  • ③Restart

How to customize the tokenizer?

  • ① When creating an index library, configure it in settings, which can contain three parts

  • ②character filter

  • ③tokenizer

  • ④filter

Precautions for pinyin word breaker?

  • In order to avoid searching for homophones, do not use the pinyin word breaker when searching

autocomplete query

Elasticsearch provides Completion Suggester query to achieve automatic completion. This query will match terms beginning with the user input and return them. In order to improve the efficiency of the completion query, there are some constraints on the types of fields in the document:

  • The fields participating in the completion query must be of completion type.

  • The content of the field is generally an array formed by multiple entries for completion.

For example, an index library like this:

// 创建索引库
PUT test
{
  "mappings": {
    "properties": {
      "title":{
        "type": "completion"
      }
    }
  }
}

Then insert the following data:

// 示例数据
POST test/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}

The query DSL statement is as follows:

// 自动补全查询
GET /test/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s", // 关键字
      "completion": {
        "field": "title", // 补全查询的字段
        "skip_duplicates": true, // 跳过重复的
        "size": 10 // 获取前10条结果
      }
    }
  }
}
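The same query can also be issued through the RestHighLevelClient; a minimal sketch against the test index above (assuming the same client bean as before) could look like this:

// 1.构建自动补全请求
SearchRequest request = new SearchRequest("test");
request.source().suggest(new SuggestBuilder().addSuggestion(
        "title_suggest",
        SuggestBuilders.completionSuggestion("title") // 补全字段
                .prefix("s")                          // 用户输入的关键字
                .skipDuplicates(true)
                .size(10)));
// 2.发送请求
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// 3.解析suggest结果
Suggest suggest = response.getSuggest();
CompletionSuggestion suggestion = suggest.getSuggestion("title_suggest");
for (CompletionSuggestion.Entry.Option option : suggestion.getOptions()) {
    System.out.println(option.getText().toString()); // 补全出的词条
}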

data synchronization

There are three common data synchronization schemes:

  • synchronous call
  • asynchronous notification
  • monitor binlog

synchronization policy

Solution 1: Synchronous call

It is only suitable for monolithic projects; for microservice projects it is inefficient, hard to maintain and manage, and tightly coupled.
insert image description here
The basic steps are as follows:

  • hotel-demo provides an interface to modify the data in elasticsearch
  • After the hotel management service completes the database operation, it directly calls the interface provided by hotel-demo

As long as the database is updated, elasticsearch will be updated, which is equivalent to adding these two operations to a transaction


Solution 2: Asynchronous notification

insert image description here

The process is as follows:

  • Hotel-admin sends MQ message after adding, deleting and modifying mysql database data
  • Hotel-demo listens to MQ and completes elasticsearch data modification after receiving the message

Solution 3: Monitor binlog

insert image description here
The process is as follows:

  • Enable the binlog function for mysql
  • The addition, deletion, and modification operations of mysql will be recorded in the binlog
  • Hotel-demo monitors binlog changes based on canal, and updates the content in elasticsearch in real time

summary

Method 1: Synchronous call

  • Advantages: simple to implement, rough
  • Disadvantages: high degree of business coupling

Method 2: Asynchronous notification

  • Advantages: low coupling, moderate implementation difficulty
  • Disadvantages: rely on the reliability of mq

Method 3: Monitor binlog

  • Advantages: Complete decoupling between services
  • Disadvantages: Enabling binlog increases database burden and high implementation complexity

Realize data synchronization

code idea

Here we use a technology learned earlier, MQ as an intermediate listener.
When hotel data is added, deleted, or changed, the same operation is required for the data in elasticsearch.

step:

  • Import the hotel-admin project provided by the pre-course materials, start and test the CRUD of hotel data

  • Declare exchange, queue, RoutingKey

  • Complete message sending in the add, delete, and change business in hotel-admin

  • Complete message monitoring in hotel-demo and update data in elasticsearch

  • Start and test the data sync function


Import demo project

Import the hotel-admin project provided by the pre-class materials, and modify the configuration information of the database in yml.
After running, visit http://localhost:8099

insert image description here
It contains the CRUD function of the hotel:
insert image description here
these are all MyBatis-Plus (mp) APIs and can be called directly


Declare queues and exchanges

The MQ structure is shown in the figure:
insert image description here
import dependencies
Introduce rabbitmq dependencies in hotel-admin and hotel-demo:

<!--amqp-->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-amqp</artifactId>
</dependency>

Start the mq container
If you have run the mq container before, you only need to

docker start 容器名  

If you have not run the mq container before, you need

docker run \
 -e RABBITMQ_DEFAULT_USER=管理界面的账号 \
 -e RABBITMQ_DEFAULT_PASS=管理界面的密码 \
 --name mq \
 --hostname mq1 \
 -p 15672:15672 \
 -p 5672:5672 \
 -d \
 rabbitmq:3-management

After startup, you can access ip:15672
insert image description here

Add configuration information
Both hotel-admin and hotel-demo need to add the MQ configuration
insert image description here

  rabbitmq:
    host: IP
    port: 5672
    username: 你mq管理界面的账号
    password: 你mq管理界面的密码
    virtual-host: /

Declare the queue and exchange names
To avoid mistyping the queue and exchange names, define them in a constant class: create a new MqConstants class under the constants package

public class MqConstants {
    /**
     * 交换机
     */
    public final static String HOTEL_EXCHANGE = "hotel.topic";
    /**
     * 监听新增和修改的队列
     */
    public final static String HOTEL_INSERT_QUEUE = "hotel.insert.queue";
    /**
     * 监听删除的队列
     */
    public final static String HOTEL_DELETE_QUEUE = "hotel.delete.queue";
    /**
     * 新增或修改的RoutingKey
     */
    public final static String HOTEL_INSERT_KEY = "hotel.insert";
    /**
     * 删除的RoutingKey
     */
    public final static String HOTEL_DELETE_KEY = "hotel.delete";
}

Declare the queue and exchange
In hotel-demo, define a configuration class to declare the queue and exchange:


@Configuration
public class MqConfig {
    @Bean
    public TopicExchange topicExchange(){
        return new TopicExchange(MqConstants.HOTEL_EXCHANGE, true, false);
    }

    @Bean
    public Queue insertQueue(){
        return new Queue(MqConstants.HOTEL_INSERT_QUEUE, true);
    }

    @Bean
    public Queue deleteQueue(){
        return new Queue(MqConstants.HOTEL_DELETE_QUEUE, true);
    }

    @Bean
    public Binding insertQueueBinding(){
        return BindingBuilder.bind(insertQueue()).to(topicExchange()).with(MqConstants.HOTEL_INSERT_KEY);
    }

    @Bean
    public Binding deleteQueueBinding(){
        return BindingBuilder.bind(deleteQueue()).to(topicExchange()).with(MqConstants.HOTEL_DELETE_KEY);
    }
}

Send MQ message

Send MQ messages respectively in the add, delete, and modify services in hotel-admin:

Three parameters: exchange, RoutingKey, message
insert image description here

Every time hotel-admin performs a CRUD operation on the database, it sends a message to the exchange; hotel-demo, which subscribes to these messages, is notified and updates the corresponding document in the es index library, keeping the data in sync
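The image with the sending code is missing here. As a hedged sketch, the calls inside hotel-admin's add/update and delete business would look roughly like this (only the convertAndSend calls and the MqConstants names come from the notes above; where exactly they sit in the controller or service is up to the project):

@Autowired
private RabbitTemplate rabbitTemplate;

// 新增或修改酒店成功后,发送"新增/修改"消息,消息内容只带酒店id
rabbitTemplate.convertAndSend(
        MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_INSERT_KEY, hotel.getId());

// 删除酒店成功后,发送"删除"消息
rabbitTemplate.convertAndSend(
        MqConstants.HOTEL_EXCHANGE, MqConstants.HOTEL_DELETE_KEY, id);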


Receive MQ message

Things to do when hotel-demo receives MQ messages include:

  • New message: Query hotel information according to the passed hotel id, and then add a piece of data to the index library
  • Delete message: Delete a piece of data in the index library according to the passed hotel id

1) First, add insert and delete methods to IHotelService under the service package of hotel-demo:

void deleteById(Long id);

void insertById(Long id);

2) Implement the business in HotelService under the service.impl package of hotel-demo:

@Override
public void deleteById(Long id) {
    try {
        // 1.准备Request
        DeleteRequest request = new DeleteRequest("hotel", id.toString());
        // 2.发送请求
        client.delete(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

@Override
public void insertById(Long id) {
    try {
        // 0.根据id查询酒店数据
        Hotel hotel = getById(id);
        // 转换为文档类型
        HotelDoc hotelDoc = new HotelDoc(hotel);

        // 1.准备Request对象
        IndexRequest request = new IndexRequest("hotel").id(hotel.getId().toString());
        // 2.准备Json文档
        request.source(JSON.toJSONString(hotelDoc), XContentType.JSON);
        // 3.发送请求
        client.index(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

3) Write a listener

Add a new class under the cn.itcast.hotel.mq package in hotel-demo:

@Component
public class HotelListener {

    @Autowired
    private IHotelService hotelService;

    /**
     * 监听酒店新增或修改的业务
     * @param id 酒店id
     */
    @RabbitListener(queues = MqConstants.HOTEL_INSERT_QUEUE)
    public void listenHotelInsertOrUpdate(Long id){
        hotelService.insertById(id);
    }

    /**
     * 监听酒店删除的业务
     * @param id 酒店id
     */
    @RabbitListener(queues = MqConstants.HOTEL_DELETE_QUEUE)
    public void listenHotelDelete(Long id){
        hotelService.deleteById(id);
    }
}

es cluster construction

Stand-alone elasticsearch for data storage will inevitably face two problems: massive data storage and single point of failure.

  • Massive data storage problem: Logically split the index library into N shards (shards) and store them in multiple nodes
  • Single point of failure problem: back up shard data on different nodes (replica)

ES cluster related concepts :

  • Cluster (cluster): A group of nodes with a common cluster name.

  • Node: an Elasticsearch instance in the cluster

  • Shard: an index can be split into different parts for storage, called shards. In a cluster, the different shards of one index can be placed on different nodes

This solves the problem of data volumes that are too large for the limited storage of a single node.
insert image description here

Here we split the data into 3 shards: shard0, shard1, shard2

  • Primary shard: defined relative to the replica shard.

  • Replica shard: each primary shard can have one or more replicas, whose data is identical to the primary.

Backing up data guarantees high availability, but if every shard is backed up separately, the number of nodes needed doubles and the cost is simply too high!

To strike a balance between high availability and cost, we can do this:

  • First shard the data and store the shards on different nodes
  • Then back up each shard and place the copy on another node, so the nodes back each other up

This greatly reduces the number of service nodes needed. As shown in the figure, take 3 shards with one replica each as an example:

insert image description here
Now every shard has 1 replica, and they are stored across 3 nodes:

  • node0: holds shards 0 and 1
  • node1: holds shards 0 and 2
  • node2: holds shards 1 and 2

Build the ES cluster

An es cluster can be deployed directly with docker-compose, but it requires your Linux virtual machine to have at least 8G of memory.

I tried 4G: a 4G cloud server froze right after deployment, so at least 8G is needed.

First write a docker-compose file with the following content:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    networks:
      - elastic
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.12.1
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  data01:
    driver: local
  data02:
    driver: local
  data03:
    driver: local

networks:
  elastic:
    driver: bridge

Upload the finished compose file to the virtual machine.
insert image description here

Running es requires adjusting some Linux system settings; edit the /etc/sysctl.conf file:

vi /etc/sysctl.conf

Add the following content:

vm.max_map_count=262144

Then run this command to make the configuration take effect:

sysctl -p

Start the cluster with docker-compose:

docker-compose up -d

If the docker-compose command is not found, install the docker compose plugin first; once it is installed this will work.

insert image description here


Cluster status monitoring

Kibana can monitor an es cluster, but newer versions depend on es's x-pack feature and the configuration is relatively complex.

It is recommended to use cerebro to monitor the status of the es cluster. Official site: https://github.com/lmenezes/cerebro
Download and unzip the package. The decompressed directory looks like this:
insert image description here
Enter the bin directory:
insert image description here
Double-click the cerebro.bat file to start the service.
Visit http://localhost:9000 to enter the management interface:
insert image description here
Connect to the ip and port of the deployed es service:
insert image description here

A green bar indicates that the cluster is green (healthy).


Create an index library

① Enter the command in DevTools:

PUT /test
{
  "settings": {
    "number_of_shards": 3, // 分片数量
    "number_of_replicas": 1 // 副本数量
  },
  "mappings": {
    "properties": {
      // mapping映射定义 ...
    }
  }
}

② You can also use cerebro to create an index library.
insert image description here
Fill in the index library information:
insert image description here
click the create button in the lower right corner:

insert image description here
View the sharding effect
insert image description here


cluster split brain problem

Cluster nodes in elasticsearch have different responsibilities:
insert image description here

By default, any node in the cluster has the above four roles at the same time.

But a real cluster must separate cluster responsibilities:

  • master node: high CPU requirement, low memory requirement
  • data node: high requirements for CPU and memory
  • Coordinating node: high requirements for network bandwidth and CPU

Separation of duties allows us to allocate different hardware for deployment according to the needs of different nodes. And avoid mutual interference between services.

A typical es cluster responsibility division is shown in the figure:

insert image description here
The node marked with an asterisk is the master node; the rest are master-eligible candidates

split brain problem

A split-brain is caused by the disconnection of nodes in the cluster.

For example, in a cluster, the master node loses connection with other nodes:
insert image description here
at this time, node2 and node3 think that node1 is down, and they will re-elect the master:
insert image description here
After node3 is elected, the cluster continues to serve requests, but node2 and node3 now form one cluster while node1 forms another on its own. The data of the two clusters is no longer synchronized and starts to diverge.

When the network recovers, the cluster contains two master nodes and its state is inconsistent: this is the split-brain situation.
insert image description here
The fix for split brain is to require more than (number of master-eligible nodes + 1) / 2 votes to be elected master, so the number of eligible nodes is preferably odd. The corresponding setting is discovery.zen.minimum_master_nodes, which has become the default since es 7.0, so split brain generally no longer occurs.

For example: for a cluster formed by 3 nodes, the votes must exceed (3 + 1) / 2, which is 2 votes. node3 gets the votes of node2 and node3, and is elected as the master. node1 has only 1 vote for itself and was not elected. There is still only one master node in the cluster, and there is no split brain.

summary

What is the role of the master eligible node?

  • Participate in group election
  • The master node can manage the cluster state, manage sharding information, and process requests to create and delete index libraries

What is the role of the data node?

  • CRUD of data

What is the role of the coordinator node?

  • Route requests to other nodes

  • Combine the query results and return them to the user


Cluster Distributed Storage

When a new document is added, it should be saved in different shards to ensure data balance, so how does the coordinating node determine which shard the data should be stored in?

Insert three pieces of data (run the insert three times):
insert image description here

You can see from the test that the three pieces of data are in different shards:
insert image description here

The result is as follows:

insert image description here

In the end, all of the data can be found no matter which of the three nodes you query

Shard storage principle

Elasticsearch will use the hash algorithm to calculate which shard the document should be stored in:
insert image description here
Description:

  • _routing defaults to the id of the document
  • The algorithm depends on the number of shards; therefore, once the index library is created, the number of shards cannot be changed!
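Concretely, the documented routing rule is shard = hash(_routing) % number_of_primary_shards, which is exactly why the primary shard count is fixed once the index library is created.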

The process of adding new documents is as follows:
insert image description here

Explanation:

  • 1) Add a document with id=1
  • 2) Do a hash operation on the id, if the result is 2, it should be stored in shard-2
  • 3) The primary shard of shard-2 is on node3, and the data is routed to node3
  • 4) Save the document
  • 5) Synchronize to replica-2 of shard-2, on the node2 node
  • 6) Return the result to the coordinating-node node

Cluster Distributed Query

The elasticsearch query is divided into two stages:

  • scatter phase: the coordinating node distributes the request to every shard

  • gather phase: the coordinating node collects the search results from the data nodes, merges them into the final result set, and returns it to the user

insert image description here


Cluster failover

The cluster's master node monitors the status of the nodes in the cluster. If it finds that a node has gone down, it immediately migrates that node's shard data to other nodes to keep the data safe. This is called failover.

1) For example, a cluster structure is shown in the figure:
insert image description here
now, node1 is the master node, and the other two nodes are slave nodes.

2) Suddenly, node1 fails:

insert image description here
The first thing after the downtime is to re-elect the master. For example, node2 is selected:
insert image description here
After node2 becomes the master node, it will check the cluster monitoring status and find that: shard-1 and shard-0 have no replica nodes. Therefore, the data on node1 needs to be migrated to node2 and node3:
insert image description here


Origin blog.csdn.net/giveupgivedown/article/details/129323438