Getting started with ElasticSearch (nanny-level tutorial)

This chapter introduces: the role of ElasticSearch, setting up the Elasticsearch environment (Windows/Linux), building an ElasticSearch cluster, installing and using the visual client plug-in elasticsearch-head, and installing and using the IK tokenizer. All ElasticSearch operations in this chapter are in RESTful form (i.e. via HTTP requests); see the next part for operating ElasticSearch from Java code.


1. Introduction to ElasticSearch

Elasticsearch, es for short, is an open-source, highly scalable, distributed full-text search engine that can store and retrieve data in near real time; it can scale out to hundreds of servers and handle PB-level data. es is also developed in Java and uses Lucene as its core to implement all indexing and search functions, but its purpose is to hide the complexity of Lucene behind a simple RESTful API, thus making full-text search easy.

1.1 Use cases of ElasticSearch

  • In early 2013, GitHub abandoned Solr and adopted ElasticSearch for PB-level search: "GitHub uses ElasticSearch to search 20TB of data, including 1.3 billion files and 130 billion lines of code."
  • Wikipedia: launched its core search architecture based on Elasticsearch
  • SoundCloud: "SoundCloud uses ElasticSearch to provide instant and accurate music search services for 180 million users"
  • Baidu: Baidu makes extensive use of ElasticSearch for text data analysis. It collects all kinds of metric data and user-defined data from all Baidu servers and performs multi-dimensional analysis and display of the data to help locate and analyze instance-level or business-level exceptions. It currently covers more than 20 business lines inside Baidu (including casio, cloud analysis, network alliance, forecasting, library, direct number, wallet, risk control, etc.); a single cluster has up to 100 machines and 200 ES nodes, and imports 30TB+ of data every day
  • Sina uses ES to analyze and process 3.2 billion real-time log entries
  • Alibaba: uses ES to build Wacai's own log collection and analysis system

1.2 Comparison between ElasticSearch and Solr

  • Solr uses Zookeeper for distributed management, while Elasticsearch itself has a distributed coordination management function;
  • Solr supports data in more formats, while Elasticsearch only supports the json file format;
  • Solr officially provides more functions, while Elasticsearch itself focuses more on core functions, and most advanced functions are provided by third-party plug-ins;
  • Solr performs better than Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch in real-time search applications

2. ElasticSearch installation (Windows)

Download the compressed package:

Official ElasticSearch site (Elasticsearch: The Official Distributed Search & Analytics Engine | Elastic): https://www.elastic.co

2.1 Installation

Note: es is developed in Java, with Lucene as its core, so a Java environment must be configured first! (JDK 1.8 or above)

Like Tomcat, there is no installer; just unzip it. The directory structure is as follows:

[Screenshot: directory structure of the unpacked ElasticSearch distribution]

2.2 Modify the configuration file

  • Modify the config\jvm.options file
Change:
#-Xms2g
#-Xmx2g
to:
-Xms340m
-Xmx340m
Otherwise ES may fail to start because the virtual machine does not have enough memory.
  • Modify the config\elasticsearch.yml file
Append at the end of elasticsearch-5.6.8\config\elasticsearch.yml:
http.cors.enabled: true
http.cors.allow-origin: "*"
network.host: 127.0.0.1
The purpose is to make ES accept cross-origin (CORS) requests.

2.3 Start

Double-click elasticsearch.bat in the bin directory of ElasticSearch to start it; the log output on the console looks like this:

[Screenshot: startup log in the console]

Note: 9300 is the TCP communication port (ES cluster nodes communicate over TCP), and 9200 is the HTTP protocol port.

We can then visit http://127.0.0.1:9200 in the browser:

[Screenshot: JSON response from http://127.0.0.1:9200 in the browser]
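You can do the same check from the command line with curl. A sketch of a typical response follows; the exact name, uuid, and version values depend on your installation:

curl http://127.0.0.1:9200
# typical response (abridged; your node name will differ):
# {
#   "name": "node-1",
#   "cluster_name": "elasticsearch",
#   "version": { "number": "5.6.8" },
#   "tagline": "You Know, for Search"
# }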

2.4 Install the graphical plug-in

As seen above, unlike Solr, ElasticSearch has no built-in graphical interface. We can get the effect of a graphical interface, and view index data, by installing ElasticSearch's head plug-in. Plug-ins can be installed online or locally; this document installs the head plug-in locally. Elasticsearch 5.x and above require node and grunt in order to install head.

Download Node.js, double-click the installer to install it, then check the version number by entering node -v in cmd:

[Screenshot: node -v output]

  • Install grunt as a global command; Grunt is a Node.js-based project build tool

Enter in cmd:

npm install -g grunt-cli

[Screenshot: grunt-cli installation output]

Since npm downloads from servers abroad, if the download is slow you can switch to the Taobao mirror:

npm install -g cnpm --registry=https://registry.npm.taobao.org

For subsequent use, just replace npm xxx with cnpm xxx

Check whether the registry was switched successfully:

npm config get registry 

[Screenshot: npm config get registry output]

  • start head

Enter the head plugin directory, open cmd, and enter:

>npm install
>grunt server

[Screenshot: npm install / grunt server output]

Open the browser and enter http://localhost:9100

[Screenshot: elasticsearch-head page at http://localhost:9100]

3. ES related concepts

3.1 Overview (important)

Elasticsearch is document-oriented, meaning it can store entire objects or documents. It not only stores them, it also indexes the content of each document so that it can be searched. In Elasticsearch you index, search, sort, and filter documents, rather than rows and columns of data. Compared with a traditional relational database:

Relational DB -> Databases -> Tables -> Rows      -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields
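As a concrete, purely illustrative example of this analogy, a row in a MySQL user table becomes a JSON document under a user type; the names below are made up:

# MySQL:         database mydb, table user, row (id=1, name='Tom', age=20)
# Elasticsearch: index "mydb", type "user", document:
{
    "id": 1,
    "name": "Tom",
    "age": 20
}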

3.2 Core Concepts

1) Index (index)

An index is a collection of documents with somewhat similar characteristics. For example, you could have an index for customer data, another for a product catalog, and another for order data. An index is identified by a name (which must be all lowercase), and that name is used whenever we index, search, update, or delete documents in it. A cluster can define any number of indexes. An index can be compared to a database in MySQL.

2) Type (type)

In an index, you can define one or more types. A type is a logical classification/partition of your index, whose semantics are entirely up to you. Typically, a type is defined for documents that share a common set of fields. For example, say you run a blogging platform and store all your data in one index; you could define one type for user data, another for blog data, and another for comment data. A type can be compared to a table in MySQL.

3) Field (field)

It is equivalent to a column of a data table; fields classify and identify document data according to different attributes.

4) Mapping (mapping)

A mapping places restrictions and rules on how data is processed, such as a field's data type, default value, analyzer, and whether it is indexed. These and other rules for how es handles data are all configured in the mapping. Processing data according to optimal rules can greatly improve performance, so it is worth establishing the mapping deliberately and thinking about how to define it for best performance. It is equivalent to creating a table in MySQL and setting primary keys, foreign keys, and so on.

5) Document (document)

A document is the basic unit of information that can be indexed. For example, you can have a document for a certain customer, a document for a certain product, and a document for a certain order. Documents are expressed in JSON (JavaScript Object Notation), a ubiquitous Internet data-interchange format. Within an index/type, you can store as many documents as you like. Note that although a document physically lives in an index, it must be indexed under (assigned to) a type within that index. Content is inserted into the index library one document at a time, analogous to a row of data in a database.

6) Cluster (cluster)

A cluster is organized by one or more nodes, which jointly hold the entire data set and together provide indexing and search functions. A cluster is identified by a unique name, which defaults to "elasticsearch". This name is important, because a node can only join a cluster by specifying that cluster's name.

7) Node (node)

A node is a server in the cluster that, as part of the cluster, stores data and participates in the cluster's indexing and search functions. Like a cluster, a node is identified by a name; by default this is the name of a random Marvel Comics character assigned to the node at startup. This name matters for administration, because it is how you determine which servers in the network correspond to which nodes in the Elasticsearch cluster.

A node can join a specified cluster by configuring the cluster name. By default, each node is set up to join a cluster called "elasticsearch"; this means that if you start several nodes on your network and they can discover each other, they will automatically form and join a cluster called "elasticsearch".

In a cluster, you can have as many nodes as you want. Moreover, if you do not currently have any Elasticsearch nodes running in your network, starting a node at this time will create and join a cluster called "elasticsearch" by default.

8) Shards and replicas (shards & replicas)

An index can store large amounts of data, beyond the hardware limits of a single node. For example, an index with 1 billion documents may occupy 1TB of disk space, and no single node has that much disk; or a single node handling all search requests would respond too slowly. To solve this, Elasticsearch can divide an index into multiple parts called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster. Sharding matters for two main reasons: 1) it lets you split/scale your content capacity horizontally; 2) it lets you run distributed, parallel operations across shards (potentially on multiple nodes), improving performance/throughput.

As for how a shard is distributed and how its documents are aggregated back to search requests, it is completely managed by Elasticsearch, which is transparent to you as a user.

In a network/cloud environment where failures can happen at any time and a shard or node may somehow go offline or disappear, a failover mechanism is very useful and highly recommended. For this purpose, Elasticsearch lets you create one or more copies of a shard; these copies are called replica shards, or simply replicas.

Replication is important for two main reasons. First, it provides high availability in case a shard or node fails; for this reason, note that a replica shard is never placed on the same node as its original/primary shard. Second, it scales your search volume/throughput, since searches can run on all replicas in parallel. In short, each index can be divided into multiple shards, and an index can be replicated zero times (no replicas) or more. Once replicated, each index has primary shards (the originals) and replica shards (copies of the primaries). The number of shards and replicas can be specified when the index is created. You can change the number of replicas dynamically at any time afterwards, but you cannot change the number of shards.

By default, each index in Elasticsearch gets 5 primary shards and 1 replica, which means that if your cluster has at least two nodes, the index will have 5 primary shards plus 5 replica shards (one full copy), for a total of 10 shards.
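For example, the shard and replica counts can be set at index-creation time, and the replica count (only) changed later. A hedged curl sketch; the index name my_index is made up and the numbers are arbitrary:

# create an index with 3 primary shards and 2 replicas
curl -XPUT 'http://127.0.0.1:9200/my_index' -d '
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2
    }
}'

# later, lower the replica count; the shard count cannot be changed after creation
curl -XPUT 'http://127.0.0.1:9200/my_index/_settings' -d '{ "number_of_replicas": 1 }'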

4. ElasticSearch client operation

The above was the theoretical part. In actual development, there are three main ways to act as a client of the es service:

  • Use the elasticsearch-head plugin
  • Use the Restful interface provided by elasticsearch to directly access
  • Use the API provided by elasticsearch to access

4.1 Direct access using the RESTful interface

We will issue HTTP requests directly, so we introduce two interface-testing tools: Postman and Talend API Tester.

  • Talend API tester installation:

This is a Chrome plug-in; no separate download is required:

[Screenshot: Talend API Tester in Chrome]

  • Postman installation:

Postman official website: https://www.getpostman.com

[Screenshot: Postman]

4.2 Using Talend API Tester for es client operations

1) Interface syntax of Elasticsearch

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

where:

  • VERB: the HTTP method, e.g. GET, POST, PUT, HEAD, DELETE
  • PROTOCOL: http or https
  • HOST: the host name or IP of any node in the cluster, e.g. 127.0.0.1
  • PORT: the HTTP service port, 9200 by default
  • PATH: the API endpoint path (may include index, type, and document id)
  • QUERY_STRING: optional query-string parameters, e.g. pretty
  • BODY: a JSON-encoded request body, if the request needs one
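As a concrete instance of the template (a sketch; _cluster/health is a standard endpoint and pretty just formats the output):

# VERB=GET, PROTOCOL=http, HOST=127.0.0.1, PORT=9200, PATH=_cluster/health, QUERY_STRING=pretty
curl -XGET 'http://127.0.0.1:9200/_cluster/health?pretty'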

2) Create an index library (index) and add mapping ---- PUT

PUT   http://127.0.0.1:9200/hello

Request body:

article: the type; its properties define the fields of this type, much like the column definitions of a table in the index library created above;

store: whether the original value of the field is stored;

index: whether and how the field is indexed (not_analyzed = indexed as a whole without tokenizing; analyzed = tokenized first);

analyzer: which tokenizer to use; here the standard tokenizer

{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true,
                    "index": "not_analyzed"
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": "analyzed",
                    "analyzer": "standard"
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": "analyzed",
                    "analyzer": "standard"
                }
            }
        }
    }
}
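If you prefer the command line to Postman/Talend, the same request can be sent with curl; a sketch, assuming the JSON body above was saved to a file named article-mapping.json (a file name chosen here purely for illustration):

curl -XPUT 'http://127.0.0.1:9200/hello' -d @article-mapping.json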

[Screenshot: index creation response]

View in the visualization tool elasticsearch-head:

[Screenshot: the hello index in elasticsearch-head]

3) First create the index, then add the mapping ---- PUT

We can set the mapping when creating the index, or create the index first and set the mapping afterwards.
Unlike the previous step, here the index is first created with a plain PUT and no mapping information (PUT http://127.0.0.1:9200/hello2), and the mapping is then set separately.
Requested URL:

PUT   http://127.0.0.1:9200/hello2/article/_mapping

Request body:

{
     "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true,
                    "index": "not_analyzed"
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": "analyzed",
                    "analyzer": "standard"
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": "analyzed",
                    "analyzer": "standard"
                }
            }
        }
}

[Screenshot: mapping creation response]
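To confirm the mapping took effect, you can read it back; a sketch (the same works for hello):

curl -XGET 'http://127.0.0.1:9200/hello2/_mapping?pretty'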

4) Delete an index ---- DELETE

Request URL:

DELETE  http://127.0.0.1:9200/hello2

[Screenshot: index deletion response]

5) Create a document (add content to the index library) ---- POST

Request URL:

POST  http://127.0.0.1:9200/hello/article/1

Request body:

{
    "id": 1,
    "title": "ElasticSearch是一个基于Lucene的搜索服务器",
    "content": "它提供了一个分布式多用户能力的全文搜索引擎,基于RESTful web接口。Elasticsearch是用Java开发的,并作为Apache许可条款下的开放源码发布,是当前流行的企业级搜索引擎。设计用于云计算中,能够达到实时搜索,稳定,可靠,快速,安装使用方便。"
}

[Screenshot: document creation response]

View in elasticsearch-head:

[Screenshot: the document in elasticsearch-head]

Note: generally we give _id and the document's own id field the same value.
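If the URL omits the id, ES auto-generates a random _id, which would then differ from the document's own id field; that is why we usually supply it ourselves. A sketch (the field values here are illustrative):

# no document id in the URL, so ES generates the _id
curl -XPOST 'http://127.0.0.1:9200/hello/article' -d '
{
    "id": 2,
    "title": "自动生成_id的示例文档",
    "content": "URL中未指定id,ES会自动生成随机的_id"
}'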

6) Modify document content ---- POST

Request URL:

POST http://127.0.0.1:9200/hello/article/1
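The request body appears only in the screenshot below; a representative body (the modified values are purely illustrative) simply re-submits the whole document with the changed fields:

{
    "id": 1,
    "title": "【修改】ElasticSearch是一个基于Lucene的搜索服务器",
    "content": "【修改】它提供了一个分布式多用户能力的全文搜索引擎,基于RESTful web接口。"
}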

[Screenshot: update request and response]

View in elasticsearch-head:

[Screenshot: the updated document in elasticsearch-head]

7) Delete a document ---- DELETE

Request URL:

DELETE http://127.0.0.1:9200/hello/article/2

[Screenshot: document deletion response]

8) Query documents ---- GET

There are three ways to query documents:

  • query by id;
  • query by keyword (term query);
  • segment the input first, then query (query_string query)
i. Query by id

Request URL:

GET http://127.0.0.1:9200/hello/article/1

[Screenshot: query-by-id response]
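A successful lookup wraps the stored document in metadata; roughly (a sketch, content abridged):

{
    "_index": "hello",
    "_type": "article",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "id": 1,
        "title": "ElasticSearch是一个基于Lucene的搜索服务器",
        "content": "它提供了一个分布式多用户能力的全文搜索引擎..."
    }
}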

ii. Query based on keywords - term query

Request URL:

POST http://127.0.0.1:9200/hello/article/_search

Request body:

{
    "query": {
        "term": {
            "title": "搜"
        }
    }
}
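Note that because title was analyzed with the standard tokenizer, the Chinese text was indexed character by character, and a term query does not analyze its input; that is why only a single character such as 搜 matches here. The same request with curl (a sketch):

curl -XPOST 'http://127.0.0.1:9200/hello/article/_search' -d '
{
    "query": { "term": { "title": "搜" } }
}'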

[Screenshot: term query result]

iii. Query document - querystring query

Request URL:

POST   http://127.0.0.1:9200/hello/article/_search

Request body:

{
    "query": {
        "query_string": {
            "default_field": "title",
            "query": "搜索服务器"
        }
    }
}

This specifies:
which field to query (default_field);
what content to query for (query).

The query content is segmented first, and the query then runs on the resulting terms.
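With curl the same request looks like this (a sketch; the standard analyzer splits 搜索服务器 into single characters, so documents containing any of those terms can match):

curl -XPOST 'http://127.0.0.1:9200/hello/article/_search' -d '
{
    "query": {
        "query_string": { "default_field": "title", "query": "搜索服务器" }
    }
}'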

[Screenshot: query_string query result]

4.3 Use elasticsearch-head for es client operation

An HTTP request tool is integrated into elasticsearch-head, which offers a composite query interface:
[Screenshot: composite query in elasticsearch-head]

5. Integrated use of IK tokenizer and Elasticsearch

The tokenizer used above is the standard tokenizer, which is not very friendly to Chinese. For example, analyzing 我是程序员 ("I am a programmer") yields:

GET http://127.0.0.1:9200/_analyze?analyzer=standard&pretty=true&text=我是程序员
"tokens":[
{
    
    "token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>",…},
{
    
    "token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>",…},
{
    
    "token": "程", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>",…},
{
    
    "token": "序", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>",…},
{
    
    "token": "员", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>",…}
]

The segmentation we would like is: 我, 是, 程序, 程序员 (I / am / program / programmer).

There are many tokenizers that support Chinese: the word tokenizer, Paoding (庖丁解牛), the Ansj tokenizer, and others. Below we look at using the IK tokenizer.

5.1 Installation of IK tokenizer

1) Download address: https://github.com/medcl/elasticsearch-analysis-ik/releases

2) Unzip it, copy the unpacked elasticsearch folder to elasticsearch-5.6.8\plugins, and rename the folder to analysis-ik (other names also work; the rename itself is not essential)

3) Restart ElasticSearch to load the IK tokenizer

[Screenshot: ElasticSearch startup log showing the IK plug-in loaded]

5.2 IK tokenizer test

IK provides two analyzers: ik_smart and ik_max_word.

ik_smart does the coarsest-grained segmentation (fewest tokens), while ik_max_word does the finest-grained segmentation.

Let's test it out:

  • Coarsest-grained segmentation (ik_smart): enter the address in the browser:
GET   http://127.0.0.1:9200/_analyze?analyzer=ik_smart&pretty=true&text=我是程序员

return result:

"tokens":[
{
    
    "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR",…},
{
    
    "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR",…},
{
    
    "token": "程序员", "start_offset": 2, "end_offset": 5, "type": "CN_WORD",…}
]
  • Finest-grained segmentation (ik_max_word): enter the address in the browser:
GET   http://127.0.0.1:9200/_analyze?analyzer=ik_max_word&pretty=true&text=我是程序员

return result:

"tokens":[
{
    
    "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR",…},
{
    
    "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR",…},
{
    
    "token": "程序员", "start_offset": 2, "end_offset": 5, "type": "CN_WORD",…},
{
    
    "token": "程序", "start_offset": 2, "end_offset": 4, "type": "CN_WORD",…},
{
    
    "token": "员", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR",…}
]
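To actually index with IK, name it as the field's analyzer in the mapping. A hedged sketch (the index name hello3 is made up; the cluster example in section 6 below does the same with ik_smart):

curl -XPUT 'http://127.0.0.1:9200/hello3' -d '
{
    "mappings": {
        "article": {
            "properties": {
                "title":   { "type": "text", "analyzer": "ik_max_word" },
                "content": { "type": "text", "analyzer": "ik_max_word" }
            }
        }
    }
}'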

6. ElasticSearch cluster

An ES cluster is a P2P-style distributed system (using a gossip protocol). Apart from cluster-state management, any request can be sent to any node in the cluster; that node works out which nodes the request needs to be forwarded to and communicates with them directly. So from the perspective of network architecture and service configuration, very little configuration is needed to build a cluster. Before Elasticsearch 2.0, on an unobstructed network, all nodes configured with the same cluster.name automatically belonged to one cluster. From version 2.0 on, for security reasons and to avoid the trouble caused by overly casual development environments, the default discovery method was changed to unicast: the configuration lists the addresses of a few nodes, which ES treats as gossip routers to complete cluster discovery. Since this is only a small function in ES, the gossip-router role needs no separate configuration; every ES node can play it. So in a unicast-mode cluster, each node can be configured with the same small list of nodes to act as routers.

There is no limit on the number of nodes in a cluster; generally, two or more nodes can be regarded as a cluster. For high performance and high availability, however, a cluster usually has 3 or more nodes.

6.1 Cluster Construction (Windows)

1) Prepare three elasticsearch servers:

[Screenshot: three ElasticSearch server directories]

2) Modify the configuration of each server

Modify the config\elasticsearch.yml configuration file of each copy:

#Node 1:

http.cors.enabled: true
http.cors.allow-origin: "*"
#node 1 configuration:
#cluster name; must be identical on every node of this cluster
cluster.name: my-elasticsearch
#node name; must be unique
node.name: node-1
#must be the local machine's IP address
network.host: 127.0.0.1
#HTTP service port; must differ between nodes on the same machine
http.port: 9201
#inter-node transport port; must differ between nodes on the same machine
transport.tcp.port: 9301
#the set of host addresses used for cluster auto-discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9301","127.0.0.1:9302","127.0.0.1:9303"]

#Node 2:

http.cors.enabled: true
http.cors.allow-origin: "*"
#node 2 configuration:
#cluster name; must be identical on every node of this cluster
cluster.name: my-elasticsearch
#node name; must be unique
node.name: node-2
#must be the local machine's IP address
network.host: 127.0.0.1
#HTTP service port; must differ between nodes on the same machine
http.port: 9202
#inter-node transport port; must differ between nodes on the same machine
transport.tcp.port: 9302
#the set of host addresses used for cluster auto-discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9301","127.0.0.1:9302","127.0.0.1:9303"]

#Node 3:

http.cors.enabled: true
http.cors.allow-origin: "*"
#node 3 configuration:
#cluster name; must be identical on every node of this cluster
cluster.name: my-elasticsearch
#node name; must be unique
node.name: node-3
#must be the local machine's IP address
network.host: 127.0.0.1
#HTTP service port; must differ between nodes on the same machine
http.port: 9203
#inter-node transport port; must differ between nodes on the same machine
transport.tcp.port: 9303
#the set of host addresses used for cluster auto-discovery
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9301","127.0.0.1:9302","127.0.0.1:9303"]

3) Start each node server

You can start elasticsearch.bat in each copy separately. Here I use a Windows batch file instead:

Create a new elasticsearch_cluster_start.bat file and add the following content:

The format is: start "window title" "path of the file"; the trailing & means continue with the next command after starting the previous one.

start "elasticsearch.bat" "F:\Soft\ES-cluster\cluster01\bin\elasticsearch.bat" &
start "elasticsearch.bat" "F:\Soft\ES-cluster\cluster02\bin\elasticsearch.bat" &
start "elasticsearch.bat" "F:\Soft\ES-cluster\cluster03\bin\elasticsearch.bat" 

The batch processing on Windows will not be elaborated in this chapter.

4) Cluster test

Once connected to any node of the cluster, operation is basically the same as with the standalone version; only the underlying storage distribution changes.
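You can confirm the three nodes found each other by asking any one of them for cluster health; a sketch (status should become green once all shards are allocated):

curl -XGET 'http://127.0.0.1:9201/_cluster/health?pretty'
# look for: "cluster_name": "my-elasticsearch", "number_of_nodes": 3, "status": "green"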

Add an index and mapping:

PUT  http://127.0.0.1:9201/hello

Request body:

{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true,
                    "index": "not_analyzed"
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_smart"
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_smart"
                }
            }
        }
    }
}

return result:

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "hello"
}

Add a document:

POST   http://127.0.0.1:9201/hello/article/1

Request body:

{
    "id": 1,
    "title": "ElasticSearch是一个基于Lucene的搜索服务器",
    "content": "它提供了一个分布式多用户能力的全文搜索引擎,基于RESTful web接口。Elasticsearch是用Java开发的,并作为Apache许可条款下的开放源码发布,是当前流行的企业级搜索引擎。设计用于云计算中,能够达到实时搜索,稳定,可靠,快速,安装使用方便。"
}

return value:

{
    "_index": "hello",
    "_type": "article",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "created": true
}

View in elasticsearch-head:

[Screenshots: cluster overview and shard distribution in elasticsearch-head]

7. ElasticSearch installation (Linux)

1) Make sure the Java environment is installed

There is nothing to install on Linux either; you can copy the archive straight over from Windows and just unpack it:

tar -zxf elasticsearch-6.3.2.tar.gz

2) Modify the configuration file

  • Modify the config/jvm.options file
Change:
#-Xms2g
#-Xmx2g
to:
-Xms340m
-Xmx340m
Otherwise ES may fail to start because the virtual machine does not have enough memory.
  • Modify the config/elasticsearch.yml file
Append at the end of config/elasticsearch.yml:
http.cors.enabled: true
http.cors.allow-origin: "*"
network.host: 127.0.0.1
The purpose is to make ES accept cross-origin (CORS) requests.

3) Start

Note: on Linux, starting directly as the root user is not supported (for security reasons)

  • Add user:
[root@coderxz bin]# useradd rxz -p rongxianzhao
[root@coderxz bin]# chown -R rxz:rxz /usr/local/elasticsearch/*
[root@coderxz bin]# su rxz
  • Run (note: execute this as the non-root user):
[rxz@coderxz bin]$ ./elasticsearch
  • Check the running status:
jps                                  #lists all Java process PIDs
ps aux | grep elasticsearch
[root@coderxz ~]# curl -X GET 'http://localhost:9200'

4) To allow access to port 9200 from outside the network, the server's port must be opened and ES bound to all interfaces

Modify the configuration file config/elasticsearch.yml

network.host: 0.0.0.0
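How you open the port itself depends on your distribution; with firewalld (for example on CentOS 7, an assumption about your environment) it might look like:

firewall-cmd --zone=public --add-port=9200/tcp --permanent
firewall-cmd --reload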

5) Background start

If you installed Elasticsearch on a server and develop on your local machine, you probably want Elasticsearch to keep running after you close the terminal. The easiest way is nohup: first press Ctrl+C to stop the currently running Elasticsearch, then restart it with:

nohup ./bin/elasticsearch &
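Afterwards you can confirm it is still running and watch the log; the log file name follows cluster.name (elasticsearch.log with the default name):

curl http://localhost:9200
tail -f logs/elasticsearch.log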
