Elasticsearch (ES) is an open-source, distributed, RESTful full-text search engine built on Lucene. ES is also a distributed document database in which every field is indexed and searchable; it can scale out to hundreds of servers to store and process petabytes of data, and it can store, search, and analyze large volumes of data in near real time. It is typically used as the core engine in applications with complex search requirements.
ES is designed for high availability and scalability. Scaling can be done vertically, by buying a more powerful server, or horizontally, by adding more servers to the cluster.
Elasticsearch features
- Horizontal scalability: to grow the cluster you only need to add a server, do a little configuration, and start ES; the new node is incorporated into the cluster automatically.
- Sharding for better distribution: an index is divided into multiple shards, similar to HDFS's block mechanism; like MapReduce's divide-and-conquer approach, this improves processing efficiency.
- High availability: ES provides a replication (replica) mechanism, so a shard can have multiple replicas. When a server goes down the cluster keeps running as usual, and the data lost with that server is restored from replicas on the remaining available nodes.
Elasticsearch application scenarios
- Large-scale distributed log analysis with the ELK stack: Elasticsearch (stores logs) + Logstash (collects logs) + Kibana (visualizes data)
- Search systems for e-commerce sites, online-disk search engines, etc.
Elasticsearch storage structure
Elasticsearch is a document-oriented database: a record here is a document, and JSON is used as the document serialization format. For example, the following piece of user data:
{
"user":"zfl",
"sex":"0",
"age":"23"
}
Relational database: Database -> Table -> Row -> Column
Elasticsearch: Index -> Type -> Document -> Field
Install ES in Linux environment
- Install the JDK and configure environment variables in /etc/profile:
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
source /etc/profile
- Download the Elasticsearch installation package
Official downloads: https://www.elastic.co/downloads/elasticsearch
- Upload the Elasticsearch installation package to the server
- Decompress Elasticsearch; the directory structure is as follows:
bin: scripts, including those to start ES and install plugins
config: elasticsearch.yml (ES configuration), jvm.options (JVM configuration), log configuration files, etc.
jdk: the bundled JDK
lib: class libraries
logs: log files
modules: all ES modules, including X-Pack
plugins: installed plugins (none by default)
data: created when ES starts, at the same level as logs by default; stores document data, and its path can be configured in elasticsearch.yml
- Modify elasticsearch.yml
- Start Elasticsearch
An error is reported: ES is not allowed to run directly as the root user.
Solution:
# create a group
groupadd lnhg
# create a user and add it to the group
useradd lnhu -g lnhg -p 123456
# grant ownership of the ES installation directory
chown -R lnhu:lnhg <ES installation directory>
# switch to the new user
su lnhu
Start again; the next error is reported:
Edit /etc/sysctl.conf and add the following parameters:
vi /etc/sysctl.conf
vm.max_map_count=655360
sysctl -p
Starting may still fail with an error about resource limits:
Edit /etc/security/limits.conf and add the following parameters:
vi /etc/security/limits.conf
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
The stand-alone setup is now complete. Log in again (or reboot) so the new limits take effect, then restart ES.
- Visit Elasticsearch
First close the firewall: systemctl stop firewalld.service
Then visit http://192.168.15.130:9200
- Port difference (9200 vs 9300)
Port 9200: HTTP port, used for communication between ES and external clients
Port 9300: transport port, used for communication between ES nodes
Kibana installation (on Windows)
- Download the kibana-6.2.4-windows-x86_64 package and unzip it.
- Modify config/kibana.yml, changing the defaults to the following:
server.port: 5601 (default)
server.host: local IP (default localhost)
elasticsearch.url: the ES address
- Start Kibana: double-click kibana.bat in the bin directory.
- Visit http://<ip address>:5601
Create, query, and delete with Kibana
- Create an index
PUT /lnh
- Query an index
GET /lnh
- Add a document (/index/type/id)
PUT /lnh/user/1
{
"user":"lnh",
"sex":"0",
"age":"23"
}
- Query a document
GET /lnh/user/1
- Delete the index, then query it again
DELETE /lnh
Querying documents via the RESTful API
- Query a specific document
http://192.168.15.130:9200/lnh/user/1
- Query all documents of a type
http://192.168.15.130:9200/lnh/user/_search
Advanced Search
- Query by id
GET /lnh/user/1
- Query all documents of the current type
GET /lnh/user/_search
- Batch query by multiple ids (here, ids 1 and 2)
GET /lnh/user/_mget
{ "ids":["1","2"] }
- Query with conditions
Query for age 23:
GET /lnh/user/_search?q=age:23
Query for age between 20 and 30:
GET /lnh/user/_search?q=age:[20 TO 30]
Note: TO must be capitalized.
Query for age between 20 and 30, returning 1 result starting from offset 0:
GET /lnh/user/_search?q=age:[20 TO 30]&from=0&size=1
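As a sketch of how such query-string requests are assembled, the snippet below builds the range-query URL in Python. The host, index, and type are the ones used in this tutorial; note that Lucene query-string syntax puts a colon between the field name and the range, and `urlencode` percent-encodes the brackets and spaces, which is safer than concatenating the raw query by hand.

```python
from urllib.parse import urlencode

# Host/index/type from this tutorial's setup.
base = "http://192.168.15.130:9200/lnh/user/_search"

# Range query: age between 20 and 30, first page, one hit per page.
# Lucene query-string syntax: field:[lower TO upper], TO in capitals.
params = {"q": "age:[20 TO 30]", "from": 0, "size": 1}
url = base + "?" + urlencode(params)
print(url)
```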
DSL language query and filtering
What is the DSL language
es supports two query styles: one is the simplified query-string syntax; the other uses a complete JSON request body and is called structured query (the Query DSL).
Example
- Exact query by name
GET /lnh/user/_search
{
"query":{
"term":{
"user":"zfl"
}
}
}
Note: term means exact matching; the search term is not analyzed by a tokenizer, and the document must contain the entire term. Use it for precise searches.
- Fuzzy query by name
GET /lnh/user/_search
{
"from":0,
"size":2,
"query":{
"match":{
"user":"partner"
}
}
}
Note: match analyzes the search keywords into tokens and then matches token by token; it is generally used for fuzzy queries.
The difference between term and match
- A term query does not analyze the search term; it matches exactly.
- A match query analyzes the search term with the field's tokenizer and then matches by token.
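The contrast can be illustrated with a toy sketch; these functions are simplified stand-ins of my own, not the real Lucene analyzers or query implementations.

```python
def standard_analyzer(text):
    """Toy analyzer: lowercase and split on whitespace,
    roughly what ES's standard analyzer does to English text."""
    return text.lower().split()

def term_query(stored_value, term):
    # term: the search term is NOT analyzed; it must equal
    # the stored value exactly.
    return stored_value == term

def match_query(stored_value, query):
    # match: the query is analyzed into tokens, and any token
    # found among the field's indexed tokens is a hit.
    tokens = standard_analyzer(stored_value)
    return any(t in tokens for t in standard_analyzer(query))

print(term_query("Quick Fox", "quick"))   # False: not an exact match
print(match_query("Quick Fox", "quick"))  # True: token "quick" is indexed
```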
Use filter to filter by age
GET /lnh/user/_search
{
"query":{
"bool":{
"must":[
{
"match_all":{
}
}
],
"filter":{
"range":{
"age":{
"gte":20,
"lte":30
}
}
}
}
},
"from":0,
"size":10,
"_source":["user","age"]
}
Note: _source specifies which fields to return.
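Such request bodies can also be built programmatically before being sent to ES. A minimal sketch follows; the helper name `bool_filter_query` is mine, not an ES API.

```python
import json

def bool_filter_query(field, gte, lte, source_fields, from_=0, size=10):
    """Build a match_all query with a range filter and _source
    field selection, mirroring the filter example above."""
    return {
        "query": {
            "bool": {
                "must": [{"match_all": {}}],
                "filter": {"range": {field: {"gte": gte, "lte": lte}}},
            }
        },
        "from": from_,
        "size": size,
        "_source": source_fields,
    }

body = bool_filter_query("age", 20, 30, ["user", "age"])
print(json.dumps(body, indent=2))
```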
Tokenizer
What is a tokenizer
Elasticsearch's default standard tokenizer is not very friendly to Chinese: it splits Chinese text into individual characters. Therefore a Chinese tokenizer, the ik plugin, is introduced.
- Default standard tokenizer
Install ik tokenizer
Download address: https://github.com/medcl/elasticsearch-analysis-ik/releases
Note: the ik tokenizer plugin version must match the ES version.
- File directory after decompression
- Create a new ik directory under the plugins directory under the es installation directory
- Copy the decompressed file to the plugins/ik directory
- Restart es
Instructions:
- ik_max_word: splits the text at the finest granularity. For example, "National Anthem of the People's Republic of China" is split into pieces such as "People's Republic of China", "Chinese People", "China", "People", "Republic", and "National Anthem", exhausting all possible combinations.
- ik_smart: does the coarsest-grained split; the same text is split into "People's Republic of China" and "National Anthem".
Custom extension dictionary
- Under the plugins/ik/config directory in the ES installation directory, create a custom directory and a new_word.dic file inside it
- Add custom words to the new_word.dic file
- Modify the config/IKAnalyzer.cfg.xml file, filling in the location of the custom extension dictionary
Document mapping
Introduction
We have already compared Elasticsearch's core concepts with a relational database: an index corresponds to a database, a type to a table, and a mapping to the table schema. A mapping in Elasticsearch defines a document: the fields it contains, their types, tokenizers, and other attributes.
Document mapping means specifying the field types and tokenizers for the fields in a document.
To view a mapping: GET /lnh2/stu/_mapping
Classification of mapping
- Dynamic mapping
In a relational database you must create the database, then create tables under it, and only then insert rows. In Elasticsearch the mapping does not need to be defined in advance: when a document is written, field types are identified automatically. This mechanism is called dynamic mapping.
- Static mapping
In Elasticsearch you can also define the mapping up front, including the document's fields and their types. This is called static mapping.
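A rough sketch of the type inference behind dynamic mapping is shown below. It is deliberately simplified; real ES has more rules (date detection, float vs. long heuristics, and in recent versions strings map to text with a keyword sub-field).

```python
def infer_mapping(doc):
    """Guess an ES field type for each field of a JSON document,
    loosely mimicking dynamic mapping."""
    mapping = {}
    for field, value in doc.items():
        if isinstance(value, bool):      # check bool before int:
            mapping[field] = "boolean"   # bool is a subclass of int
        elif isinstance(value, int):
            mapping[field] = "long"
        elif isinstance(value, float):
            mapping[field] = "float"
        elif isinstance(value, str):
            mapping[field] = "text"
        else:
            mapping[field] = "object"
    return mapping

print(infer_mapping({"user": "zfl", "age": 23, "vip": True}))
# → {'user': 'text', 'age': 'long', 'vip': 'boolean'}
```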
Basic data types
- String
The string type includes text and keyword.
text: used to index long text. Before indexing, the text is analyzed into terms, and ES can then search on those terms; text fields cannot be used for sorting or aggregation.
keyword: not analyzed; can be used for search, filtering, sorting, and aggregation. A keyword field matches only as a whole (no fuzzy matching on analyzed tokens).
- Numeric
long, integer, short, byte, double, float
- Date
date
- Boolean
boolean
- Binary
binary
- Array
Array datatype
Example
- Create a document type and specify the type
PUT /lnh2
POST /lnh2/_mapping/stu
{
"properties":{
"age":{
"type":"integer"
},
"sex":{
"type":"integer"
},
"name":{
"type":"text",
"analyzer":"ik_smart",
"search_analyzer":"ik_smart"
},
"home":{
"type":"keyword"
}
}
}
- Get the mapping information of the specified type
- Get index information
http://192.168.15.130:9200/lnh2/_settings
Elasticsearch cluster management
- How ES handles high concurrency
ES is a distributed full-text search framework that hides complex internal mechanisms, including sharding, cluster discovery, shard load balancing, and request routing.
- Basic concepts
- cluster
A cluster contains many nodes, one of which is the master, chosen by election; the master/slave distinction only matters inside the cluster. A core idea of es is decentralization, literally meaning no central node, which holds when viewed from outside the cluster: logically an es cluster is a whole, and communicating with any one node is equivalent to communicating with the entire cluster.
- shards
An index's shards. es can split a complete index into multiple shards, so a large index can be broken into pieces distributed across different nodes, forming a distributed search. The number of shards can only be specified before the index is created and cannot be changed afterwards.
- replicas
Copies of an index's shards. es can keep multiple replicas of an index. Replicas improve fault tolerance: when a shard on some node is damaged or lost, it can be recovered from a replica. They also improve query throughput: es automatically load-balances search requests across them.
- recovery
Data recovery, or redistribution: when a node joins or leaves, es rebalances index shards across machines according to load; data recovery also runs when a failed node restarts.
- Core principle analysis
1. Each index is split into multiple shards for storage; by default an index is created with five shards, distributed across different nodes. These are the primary shards. Once defined, the number of primary shards cannot be changed, because at query time the shard position is derived from the document id modulo the primary shard count.
Routing algorithm: shard = hash(routing) % number_of_primary_shards.
2. For high availability, each primary shard has its own replica shards. A primary shard and its own replicas must not be stored on the same server, but a primary shard may share a node with replicas of other shards.
Example: 3 primary shards with 1 replica each: 3 * 2 = 6 shards in total.
3 primary shards with 2 replicas each: 3 * 3 = 9 shards in total.
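The routing rule and the shard arithmetic can be sketched as follows. Here zlib.crc32 stands in for the Murmur3 hash that ES actually uses; the point is only that a fixed hash modulo the primary shard count pins each id to one shard.

```python
import zlib

def route(routing_value, num_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.
    The routing value defaults to the document id."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# With 5 primary shards (the classic default), every id maps to a
# fixed shard -- which is why the primary shard count cannot be
# changed after the index is created.
for doc_id in ["1", "2", "3"]:
    print(doc_id, "->", route(doc_id, 5))

# Total shards = primaries * (1 + replicas):
assert 3 * (1 + 1) == 6   # 3 primaries, 1 replica each
assert 3 * (1 + 2) == 9   # 3 primaries, 2 replicas each
```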
ES cluster environment setup
- Prepare a cluster of three servers
node-0: 192.168.15.130
node-1: 192.168.15.131
node-2: 192.168.15.132
- Modify the cluster configuration (on each server)
vi elasticsearch.yml
cluster.name: lnh ### the cluster name must be identical on all three servers
node.name: node-0 #### each node name must be unique; the other two are node-1 and node-2
network.host: 192.168.15.130 #### the server's actual IP address
discovery.zen.ping.unicast.hosts: ["192.168.15.130", "192.168.15.131","192.168.15.132"] ## the IPs of the cluster's servers
discovery.zen.minimum_master_nodes: 1
- Start
Start ES on each machine, closing the firewall first: systemctl stop firewalld.service
- Verify
Visit: http://192.168.15.130:9200/_cat/nodes?pretty
* marks the master node
Note: if the machines are cloned VMs, delete each machine's data directory first. By default the data directory is under the installation directory; it is created when ES starts, and its location can be configured.
This material was gathered from the Ant Classroom; as the saying goes, a good memory is no match for a worn pen.