ElasticSearch study notes summary

Introduction to ElasticSearch

What is ElasticSearch

ElasticSearch, abbreviated as ES, is an open-source, highly scalable distributed full-text search engine that can store and retrieve data in near real time; it scales out to hundreds of servers and can process PB-level data. ES is written in Java and uses Lucene as its core to implement all indexing and search functions, but its purpose is to hide the complexity of Lucene behind a simple RESTful API, thereby making full-text search simple.

ElasticSearch vs. Solr

  • Solr uses Zookeeper for distributed management, while Elasticsearch itself has distributed coordination and management functions;
  • Solr supports more formats of data, while Elasticsearch only supports json file format;
  • Solr officially provides more functions, while Elasticsearch itself pays more attention to core functions, and most advanced functions are provided by third-party plug-ins;
  • Solr performs better than Elasticsearch in traditional search applications, but it is significantly less efficient than Elasticsearch when dealing with real-time search applications.

What is full text search

Full-text search: extract part of the information from unstructured data and reorganize it so that it has a certain structure, then search that structured data, which makes searching relatively fast.

For example, a computer indexing program scans each word in an article and builds an index entry for each word, recording how many times and at which positions the word appears in the article. When the user queries, the search program looks the words up in the index built in advance and returns the results to the user. This process is similar to looking up a character through the index table of a dictionary.
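The indexing idea described above can be sketched in a few lines of Java. This is only a minimal illustration: the class name InvertedIndexSketch is hypothetical, and splitting on whitespace with lower-casing is a simplifying assumption, not the way Lucene actually analyzes text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal inverted-index sketch: maps each word to the documents and positions where it occurs.
// Whitespace tokenization and lower-casing are simplifying assumptions.
class InvertedIndexSketch {

    // word -> (document id -> positions of the word inside that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Scan a document word by word and record where each word occurs.
    void add(int docId, String text) {
        if (text.trim().isEmpty()) {
            return; // nothing to index
        }
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int position = 0; position < words.length; position++) {
            postings.computeIfAbsent(words[position], w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    // Look the word up in the prebuilt index instead of rescanning every document.
    List<Integer> search(String word) {
        Map<Integer, List<Integer>> hits = postings.get(word.toLowerCase());
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits.keySet());
    }

    // Positions of a word within one document (empty if the word does not occur there).
    List<Integer> positions(String word, int docId) {
        Map<Integer, List<Integer>> hits = postings.get(word.toLowerCase());
        if (hits == null || !hits.containsKey(docId)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits.get(docId));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "full text search is fast");
        index.add(2, "search engines build an index");
        System.out.println(index.search("search"));       // ids of documents containing "search"
        System.out.println(index.positions("search", 1)); // word positions of "search" in document 1
    }
}
```

The query never touches the original text; it only reads the word-to-document map, which is why a prebuilt index makes search fast.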

ElasticSearch application scenarios

  1. Wikipedia (similar to Baidu Baike): full-text search, highlighting, and search suggestions, for example when looking up the entry for "toothpaste".
  2. The Guardian (a foreign news website, similar to Sohu News): combines user behavior logs (clicks, views, favorites, comments) with social network data (opinions about a given news item) for data analysis, so that the author of each article learns the public's feedback on it (good, bad, popular, junk, contempt, admiration).
  3. Stack Overflow (a foreign forum for discussing program errors): people post IT questions and program errors, and others discuss and answer them. Full-text search finds related questions and answers, so when a program reports an error you can paste the error message in and search for a matching answer.
  4. GitHub (open-source code hosting): searches hundreds of billions of lines of code.
  5. Domestic (China): site search (e-commerce, recruitment, portals, etc.), IT system search (OA, CRM, ERP, etc.), and data analysis (a popular use case for ES).

Installation and startup of ElasticSearch

Install ES

The official address of ElasticSearch: https://www.elastic.co/products/elasticsearch

  1. Download the compressed package from the official website

  2. Unzip it, enter the bin directory, and double-click elasticsearch.bat to start; log information is then printed to the console.

    Note: 9300 is the TCP transport port, used for communication between cluster nodes and by the Java TransportClient; 9200 is the HTTP port that serves the RESTful API.

  3. Open localhost:9200 in a browser to access the ElasticSearch server; if JSON information about the node is returned, the service started successfully.

Note: ElasticSearch is written in Java, and this version of ES requires JDK 1.8 or higher. Before installing ElasticSearch, make sure that JDK 1.8+ is installed and the JDK environment variables are configured correctly; otherwise ElasticSearch will fail to start.

Install ES's graphical interface plug-in

Unlike Solr, ElasticSearch does not ship with its own graphical interface. Installing the head plug-in for ElasticSearch provides one and lets us view index data. There are two ways to install plug-ins: online installation and local installation. This document uses the local installation method to install the head plug-in. Installing head for elasticsearch-5.* and later versions requires node and grunt to be installed.

Plug-in official website address: https://github.com/mobz/elasticsearch-head

Installing with Docker is recommended because it is more convenient. Run the command:

docker run -p 9100:9100 --name es-head mobz/elasticsearch-head:5

After startup, open localhost:9100 in the browser to view the visual interface.

Note:

CORS must be enabled in elasticsearch, otherwise the browser will reject elasticsearch-head requests because they violate the same-origin policy.

If you cannot connect to the ES service, edit the configuration file config/elasticsearch.yml in the ElasticSearch directory and add the following two settings:

Enable CORS in elasticsearch

http.cors.enabled: true
http.cors.allow-origin: "*"

Then restart the ElasticSearch service, and you can connect normally.

ElasticSearch related concepts (terms)

Overview

Elasticsearch is document-oriented, which means it can store entire objects or documents. However, it is not only storage, it also indexes the content of each document so that it can be searched. In Elasticsearch, you can index, search, sort, and filter documents.

Elasticsearch compares with a traditional relational database as follows:

Relational DB -> Databases -> Tables -> Rows      -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

Elasticsearch core concepts

Index

An index is a collection of documents with similar characteristics.

For example: you can have an index for customer data, an index for a product catalog, and an index for order data.

An index is identified by a name (which must be all lowercase), and we must use this name whenever we index, search, update, or delete documents in this index.

In a cluster, any number of indexes can be defined. You can understand the index as a database.

Type

In an index, you can define one or more types.

A type is a logical classification/partition of your index, and its semantics are entirely up to you.

For example, suppose you run a blogging platform and store all your data in one index. In this index, you can define a type for user data, a type for blog data, and a type for comment data.

Usually, a type is defined for documents that share a set of common fields. A type can be understood as a table.

Document

A document is the basic unit of information that can be indexed and is the smallest data unit in ES. A document is represented in JSON format.

For example, you can have a document for a certain customer, a document for a certain product, and a document for a certain order.

In an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, it must be indexed into (assigned to) a type within that index.

A document can be understood as a row in a database table.

Field

It is equivalent to the field of the data table, which classifies and identifies the document data according to different attributes.

It can be understood as a column in the database.

Mapping

A mapping defines how data is processed and which rules apply to it, such as a field's data type, default value, analyzer, and whether it is indexed. All of these can be set in the mapping, and the other rule settings for handling data in ES are also called mappings. Processing data according to well-chosen rules can greatly improve performance, so it is worth defining a mapping and thinking about how to define it for better performance.

Near real-time NRT

Elasticsearch is a near real-time search platform. This means that there is a slight delay (usually within 1 second) between indexing a document and being able to search it.

Cluster

A cluster consists of one or more nodes, which together hold all of your data and jointly provide indexing and search functions. A cluster is identified by a unique name, which is "elasticsearch" by default.

This name is important because a node can only join the cluster by specifying the name of a cluster.

Node

A node is a server in the cluster. As part of the cluster, it stores data and participates in the cluster's indexing and search functions. Like a cluster, a node is identified by a name, which is assigned when the node starts. In versions before 5.0 the default was the name of a random Marvel comic character; later versions derive the default from the node's ID or host name instead. This name matters for administration, because it lets you determine which servers in your network correspond to which nodes in the Elasticsearch cluster.

A node can join a specific cluster by configuring the cluster name. By default, each node is set up to join a cluster called "elasticsearch", which means that if you start several nodes in your network and they can discover each other, they will automatically form and join a cluster called "elasticsearch".

A cluster can have as many nodes as you want. Moreover, if no other Elasticsearch nodes are running in your network when you start a node, that node will by default create and join a single-node cluster called "elasticsearch".

Shards and replicas (shards & replicas)

An index can store an amount of data that exceeds the hardware limits of a single node. For example, an index with 1 billion documents may occupy 1TB of disk space while no single node has that much disk, or a single node may respond too slowly when it serves search requests alone. To solve this problem, Elasticsearch can divide an index into multiple parts, which are called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent "index" that can be placed on any node in the cluster.

Sharding is important for two main reasons:

1) Allows you to split/expand your content capacity horizontally.

2) Allows you to perform distributed and parallel operations on shards (potentially on multiple nodes) to improve performance/throughput.

As for how a shard is distributed and how its documents are aggregated back to search requests, it is completely managed by Elasticsearch, which is transparent to you as a user.
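To make the idea of distributing documents over shards more concrete, here is a minimal Java sketch. Older Elasticsearch documentation describes the rule as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The class name ShardRoutingSketch is hypothetical, and String.hashCode() is only a stand-in assumption for the hash function that Elasticsearch really uses.

```java
// Sketch of how a document could be assigned to a primary shard.
// Rule being illustrated: shard = hash(routing) % number_of_primary_shards.
// String.hashCode() is a stand-in assumption, not the hash Elasticsearch actually uses.
class ShardRoutingSketch {

    static int shardFor(String routing, int primaryShards) {
        if (primaryShards <= 0) {
            throw new IllegalArgumentException("primaryShards must be positive");
        }
        // floorMod keeps the result in [0, primaryShards) even for negative hash codes
        return Math.floorMod(routing.hashCode(), primaryShards);
    }

    public static void main(String[] args) {
        String[] ids = {"1", "2", "3"};
        for (String id : ids) {
            System.out.println("doc " + id
                    + " -> shard " + shardFor(id, 5) + " of 5"
                    + ", shard " + shardFor(id, 6) + " of 6");
        }
    }
}
```

The same id lands on a different shard when the shard count changes from 5 to 6, which gives some intuition for why the number of primary shards is fixed once an index has been created.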

In a network/cloud environment, failure can happen at any time: a shard or node may go offline or disappear for any reason. A failover mechanism is therefore very useful and highly recommended. For this purpose, Elasticsearch allows you to create one or more copies of a shard. These copies are called replica shards, or simply replicas.

Replication is important for two main reasons:

1) It provides high availability when a shard/node fails. For this reason, a replica shard is never placed on the same node as the original/primary shard it was copied from.

2) It scales out your search volume/throughput, because searches can run in parallel on all replicas.

In short, each index can be divided into multiple shards. An index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the original shards that serve as the source of replication) and replica shards (copies of the primary shards). The number of shards and replicas can be specified when the index is created. After the index is created, you can change the number of replicas dynamically at any time, but you cannot change the number of shards afterwards.

By default (in versions before 7.0), each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if your cluster has at least two nodes, your index will have 5 primary shards and another 5 replica shards (1 complete copy), for a total of 10 shards per index.
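The arithmetic behind "5 primary shards and 1 replica give 10 shards" can be written down as a small Java sketch. The class and method names are hypothetical. The second method assumes the only placement constraint is the one stated above, namely that copies of the same shard never share a node; real allocation considers more factors.

```java
// Shard-count arithmetic for one index (names are illustrative only).
class ShardCountSketch {

    // Every primary shard plus all of its replicas.
    static int totalShards(int primaryShards, int replicasPerPrimary) {
        return primaryShards * (1 + replicasPerPrimary);
    }

    // Copies that can be placed when no two copies of the same shard may share a node.
    // Assumption: this is the only constraint taken into account.
    static int assignableShards(int primaryShards, int replicasPerPrimary, int nodes) {
        int copiesPerShard = Math.min(1 + replicasPerPrimary, nodes);
        return primaryShards * copiesPerShard;
    }

    public static void main(String[] args) {
        System.out.println(totalShards(5, 1));         // 10
        System.out.println(assignableShards(5, 1, 1)); // 5: replicas stay unassigned on a single node
        System.out.println(assignableShards(5, 1, 2)); // 10
    }
}
```

This is why the "at least two nodes" condition matters: on a single node only the 5 primary shards can be placed, and the 5 replicas wait until a second node joins.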

ElasticSearch and database analogy

Relational database (such as MySQL)    Non-relational database (Elasticsearch)
Database                               Index
Table                                  Type (from 6.0 an index can have only one type; types are deprecated from 7.0)
Row                                    Document (JSON format)
Column                                 Field
Constraint (schema)                    Mapping

Integrated use of IK tokenizer and ElasticSearch

Install IK tokenizer

Download link: https://github.com/medcl/elasticsearch-analysis-ik/releases

  1. Go to the official website to download the compressed package
  2. Unzip, copy the unzipped elasticsearch folder to elasticsearch-5.6.8\plugins, and rename the folder to analysis-ik
  3. Restart ElasticSearch to load the IK tokenizer

IK tokenizer test

IK provides two word segmentation modes: ik_smart and ik_max_word

  • ik_smart performs the coarsest-grained segmentation (the fewest splits),

  • ik_max_word performs the finest-grained segmentation

Test the coarsest-grained segmentation:

Browser input address:

http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=我是程序员


Test the finest-grained segmentation: enter the address in the browser address bar:

http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=我是程序员
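The same two tests can also be sent from code. The sketch below uses the Java 11+ HttpClient and puts analyzer and text into a JSON request body instead of URL parameters, since newer Elasticsearch versions no longer accept them in the query string. The class name AnalyzeRequestSketch and its helper methods are hypothetical, the JSON escaping is deliberately minimal, and the escaped string in main is the sample sentence from the URLs above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: call the _analyze endpoint with a JSON body (assumes ES on 127.0.0.1:9200 with IK installed).
class AnalyzeRequestSketch {

    // Minimal escaping for the two characters that would break a JSON string literal.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String jsonBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // Builds the POST request without sending it.
    static HttpRequest analyzeRequest(String baseUrl, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze?pretty=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(analyzer, text)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String text = "\u6211\u662f\u7a0b\u5e8f\u5458"; // sample sentence, written with Unicode escapes
        for (String analyzer : new String[] {"ik_smart", "ik_max_word"}) {
            HttpResponse<String> response = client.send(
                    analyzeRequest("http://127.0.0.1:9200", analyzer, text),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(analyzer + " -> " + response.body());
        }
    }
}
```

Comparing the two responses should show fewer, longer tokens for ik_smart and more, overlapping tokens for ik_max_word.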


Related concepts of ElasticSearch cluster

An ES cluster is a peer-to-peer (P2P) distributed system that uses a gossip protocol. Except for cluster state management, requests can be sent to any node in the cluster; that node works out which nodes the request must be forwarded to and communicates with them directly. As a result, the network architecture and service configuration needed to build a cluster are extremely simple. Before Elasticsearch 2.0, on an unrestricted network, all nodes configured with the same cluster.name automatically formed one cluster. Starting with 2.0, for safety and to avoid the trouble caused by development machines joining clusters too casually, the default discovery method changed to unicast: the configuration lists the addresses of several nodes, and ES treats them as gossip routers to complete cluster discovery. Because this is only a small function in ES, the gossip router role does not need to be configured separately and any ES node can take it on. Therefore, in a unicast cluster, every node can simply be configured with the same node list as its routers.

There is no limit to the number of nodes in a cluster. Two or more nodes can generally be regarded as a cluster, but for high performance and high availability a cluster usually has 3 or more nodes.

Cluster construction

  1. Prepare three elasticsearch servers

Create an elasticsearch-cluster folder and put three copies of the elasticsearch installation in it

  2. Modify the configuration of each server

Modify the elasticsearch-cluster\node*\config\elasticsearch.yml configuration file of each node

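node1:

# Configuration for node 1:
# Cluster name; must be unique (all nodes of this cluster use the same name)
cluster.name: my-elasticsearch
# Node name; must differ between nodes
node.name: node-1
# Must be the IP address of this machine
network.host: 127.0.0.1
# HTTP service port; must differ for nodes on the same machine
http.port: 9200
# Port for communication between cluster nodes; must differ for nodes on the same machine
transport.tcp.port: 9300
# List of host addresses used for automatic cluster discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]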

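node2:

# Configuration for node 2:
# Cluster name; must be unique (all nodes of this cluster use the same name)
cluster.name: my-elasticsearch
# Node name; must differ between nodes
node.name: node-2
# Must be the IP address of this machine
network.host: 127.0.0.1
# HTTP service port; must differ for nodes on the same machine
http.port: 9201
# Port for communication between cluster nodes; must differ for nodes on the same machine
transport.tcp.port: 9301
# List of host addresses used for automatic cluster discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]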

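node3:

# Configuration for node 3:
# Cluster name; must be unique (all nodes of this cluster use the same name)
cluster.name: my-elasticsearch
# Node name; must differ between nodes
node.name: node-3
# Must be the IP address of this machine
network.host: 127.0.0.1
# HTTP service port; must differ for nodes on the same machine
http.port: 9202
# Port for communication between cluster nodes; must differ for nodes on the same machine
transport.tcp.port: 9302
# List of host addresses used for automatic cluster discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9300","127.0.0.1:9301","127.0.0.1:9302"]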

  3. Start the three nodes.
  4. Use elasticsearch-head to view the cluster status
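Besides the head plug-in, the cluster status can be checked from code. The following Java 11+ sketch asks each of the three nodes configured above (HTTP ports 9200 to 9202) for the _cluster/health endpoint and prints the raw JSON answer. The class name ClusterHealthSketch and the helper healthUris are hypothetical, and the sketch assumes the nodes run on 127.0.0.1 as in the configuration files.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Sketch: query the cluster health endpoint of several local nodes.
class ClusterHealthSketch {

    // One health URI per node, assuming consecutive HTTP ports starting at firstHttpPort.
    static List<URI> healthUris(String host, int firstHttpPort, int nodeCount) {
        List<URI> uris = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            uris.add(URI.create("http://" + host + ":" + (firstHttpPort + i) + "/_cluster/health?pretty=true"));
        }
        return uris;
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        for (URI uri : healthUris("127.0.0.1", 9200, 3)) {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            try {
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(uri + " -> " + response.body());
            } catch (IOException e) {
                System.out.println(uri + " -> not reachable: " + e.getMessage());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop polling
                return;
            }
        }
    }
}
```

If the cluster formed correctly, every node should report the same cluster name and a node count of 3 in its answer.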


Programming with ElasticSearch

Using Spring Data ElasticSearch

Spring Data ElasticSearch simplifies ElasticSearch operations on top of the Spring Data API by wrapping the native ElasticSearch client API. It is the Spring Data project's integration with the Elasticsearch search engine: it offers a POJO-centric model for interacting with Elasticsearch documents and makes it easy to write a repository-style data access layer.

How to use

  1. Add the Maven dependency
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
        </dependency>
  2. Configure application.yml

    server:
      port: 8080
    spring:
      application:
        name: elastic-search-service
      # Elasticsearch configuration
      data:
        elasticsearch:
          cluster-name: elasticsearch
          cluster-nodes: localhost:9300
    
    
  3. Write the POJO

    package com.hg.model;
    
    import lombok.Data;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.elasticsearch.annotations.Document;
    import org.springframework.data.elasticsearch.annotations.Field;
    import org.springframework.data.elasticsearch.annotations.FieldType;
    
    /**
     * @Author skh
     * @Date 2020/3/25 19:40
     * @Desc
     */
    @Data
    // Specify the index name and the type name
    @Document(indexName = "index_blog", type = "article")
    public class Article {
    
        @Id
        // Define the field type, whether it is stored, and whether it is analyzed (tokenized)
        @Field(type = FieldType.Long, store = true)
        private Long id;
    
        @Field(type = FieldType.Text, store = true, analyzer = "ik_smart")
        private String title;
    
        @Field(type = FieldType.Text, store = true, analyzer = "ik_smart")
        private String content;
    }
    
    
  4. Write a repository interface that extends ElasticsearchRepository

    import java.util.List;
    import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

    public interface ArticleRepository extends ElasticsearchRepository<Article, Long> {

        // Derived query: Spring Data builds the search from the method name
        List<Article> findByTitle(String title);
    }

  5. Use ArticleRepository to create, delete, update, and query documents in ES.

  6. You can also inject ElasticsearchTemplate to create, delete, update, and query documents.


Origin blog.csdn.net/kaihuishang666/article/details/107838836