Introduction to ES
What is ES?
- Elasticsearch is an open source search engine based on the full-text search engine library Apache Lucene.
- Written in Java, it uses Lucene internally for indexing and searching, and aims to make full-text search simple by hiding Lucene's complexity behind a simple, consistent RESTful API.
- Elasticsearch is more than a full-text search engine. It can be described more precisely as:
- A distributed real-time document store in which every field can be indexed and searched
- A distributed real-time analytics and search engine
- Able to scale to hundreds of server nodes and to handle petabytes of structured or unstructured data
The development history of ES
- Elasticsearch later came to be run by a company (Elastic), which positions it as a data search and analytics platform. In June 2014 the company received 70 million US dollars in financing, bringing its total funding to more than 100 million US dollars. ES now has clients for many languages, including Java, Ruby, Python, PHP, Perl and .NET, and integrates with big data platforms such as Hadoop and Spark.
- A series of open source products has grown up around Elasticsearch, collectively called the Elastic Stack (see below).
- To avoid version confusion, Elastic has unified the version numbers of all components since 5.0. The components you use together should have matching version numbers (format: x.y.z, where z may differ).
- At the time of writing, the latest version of Elasticsearch is 6.2.4, released on April 17, 2018.
Elastic Stack
- Elasticsearch: distributed search engine
- Logstash: log collection and analysis tool
- Kibana: visual analysis platform
- Beats: family of data collection tools (can replace Logstash for collection)
- X-Pack: feature pack
ES popularity
Search engine ranking on DB-Engines (a site that collects information and statistics on database management systems): https://db-engines.com/en/ranking/search+engine
ES also ranks well when compared with database management systems in general.
Features of ES
See the introduction on the official website: https://www.elastic.co/cn/products/elasticsearch
Fast, easy to scale, flexible, simple to operate, clients in many languages, X-Pack, Hadoop/Spark integration, works out of the box.
- Distributed: scales horizontally with great flexibility
- Full-text search: powerful full-text search capabilities based on Lucene
- Near real-time search and analysis: data becomes searchable in near real time after it enters ES, and can also be aggregated and analyzed
- High availability: fault-tolerant, with automatic discovery of new or failed nodes and reorganization and rebalancing of data
- Schema-free: ES's dynamic mapping mechanism automatically detects the structure and types of the data, creates the index and makes the data searchable
- RESTful API: JSON over HTTP
ES application scenarios
- Site search
- NoSQL database
- Log analysis
- Data analysis
ES architecture
ES architecture introduction
- The Gateway is the file system ES uses to store indexes. Several types are supported.
- Above the Gateway is a distributed Lucene framework.
- On top of Lucene are the ES modules, including the index module, the search module and the mapping module.
- Above the ES modules are Discovery, Scripting and third-party plugins. Discovery is the node discovery module of ES. To form a cluster, ES nodes on different machines must exchange messages and elect a master node, and the Discovery module does this work. It supports several discovery mechanisms, such as Zen, EC2, GCE and Azure. Scripting supports embedding scripting languages such as JavaScript and Python in queries. The scripting module parses these scripts, and queries that use scripts perform slightly worse. ES also supports a variety of third-party plugins.
- The next layer up contains the ES transport module and JMX. The transport module supports several protocols, such as Thrift, memcached and HTTP, with HTTP as the default. JMX is a Java management framework used to manage ES applications.
- The top layer is the interface ES offers to users, who interact with the ES cluster through RESTful APIs.
Core concepts of ES
- Near real time (NRT): after data is indexed, it becomes searchable after a short delay (about one second by default) rather than instantly.
- Cluster: a cluster is identified by a unique name, which defaults to "elasticsearch". The cluster name matters because only nodes with the same cluster name form a cluster. It can be set in the configuration file.
- Node: a node stores part of the cluster's data and takes part in the cluster's indexing and search. Like the cluster, each node has a name. By default the first seven characters of a random UUID are used as the node name at startup, but you can assign any name. Nodes discover their peers on the network through the cluster name and form a cluster. A single node can also be a cluster.
- Index: an index is a collection of documents (equivalent to a collection in Solr). Each index has a unique name and is operated on by that name. A cluster can hold any number of indexes.
- Type: a type allowed different kinds of documents, such as user data and blog data, to be stored in one index. Types are deprecated since version 6.0.0, and an index can now hold only one type.
- Document: a piece of data to be indexed, the basic unit of information in an index, expressed as JSON.
- Shard: when creating an index, you can specify how many shards it is split into. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster. Benefits of sharding:
- Horizontal splitting and scaling of capacity. The number of primary shards is set when the index is created and cannot be changed afterwards; the number of replicas can be changed at any time.
- Operations can be distributed across shards and run in parallel, which improves performance and throughput.
- Replica: a shard can have multiple copies (replicas). Benefits of replicas:
- High availability: if a primary shard fails, a replica shard can take over.
- Higher search concurrency and throughput, because a search can run in parallel on all copies.
When "index" is used as a verb, it means storing a document in an index so that it can be searched.
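As a rough illustration of the shard and replica settings described above, the following sketch creates an index with an explicit number of primary shards and replicas. The index name blog, the counts 3 and 1, and the ES_URL variable are assumptions made for the example; number_of_shards and number_of_replicas are the index settings that control these values.

```shell
# Assumed address of a local node; override ES_URL for another cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body that sets the primary shard and replica counts.
index_settings() {
  printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}' "$1" "$2"
}

# Create a hypothetical index "blog" with 3 primary shards and 1 replica of each.
curl -s --max-time 5 -XPUT "$ES_URL/blog" \
  -H 'Content-Type: application/json' \
  -d "$(index_settings 3 1)" \
  || echo "could not reach $ES_URL"
```

The replica count can be changed later through the index's _settings endpoint, while the primary shard count stays fixed once the index exists.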
Comparison with RDBMS
| RDBMS | Elasticsearch |
| --- | --- |
| Database | Index |
| Table | Type (obsolete since 6.0.0) |
| Row | Document |
| Column | Field |
| Table structure (schema) | Mapping |
| Index | Inverted index |
| SQL | Query DSL |
| SELECT * FROM table | GET http://.... |
| UPDATE table SET | PUT http://.... |
| DELETE | DELETE http://... |
ES learning resources
- The official documentation is the best learning resource: it is detailed and comprehensive. The official website also provides some videos. Reference documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
- The official website also hosts a Chinese edition of the Definitive Guide, which is worth reading although it covers an older version: https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html
Installation & configuration
ES installation package
Official website download address: https://www.elastic.co/downloads/elasticsearch
Java requirements
Java version: 1.8
Installation example on Linux
1. Obtain the installation package
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
2. Unzip to the installation directory
tar -xvf elasticsearch-6.2.4.tar.gz -C /opt
3. Configuration
4. Start
cd /opt/elasticsearch-6.2.4/bin
./elasticsearch
Possible failures when running on a Linux virtual machine
1. Not enough memory. By default ES is configured with a 1 GB heap. If the virtual machine you are learning on has less memory than that, lower the heap size in config/jvm.options.
2. The following errors may be reported at startup. Each is listed below with its solution:
Problem 1: max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]
Solution: switch to the root user and add the following two lines to limits.conf.
Command: vi /etc/security/limits.conf
* hard nofile 65536
* soft nofile 65536
Problem 2: max number of threads [1024] for user [lish] likely too low, increase to at least [2048]
Solution: switch to the root user and edit the configuration file in the limits.d directory.
vi /etc/security/limits.d/90-nproc.conf
Change the following line:
* soft nproc 1024
# change to
* soft nproc 2048
Problem 3: max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
Solution: switch to the root user and edit sysctl.conf.
vi /etc/sysctl.conf
Add the following setting:
vm.max_map_count=655360
Then run:
sysctl -p
Switch back to the user that runs ES (log in again so that the new limits apply).
Then restart Elasticsearch. It should now start successfully.
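Before restarting, it can help to confirm that the new limits are actually in effect for the user that runs ES. The sketch below compares the current values with the minimums quoted in the three error messages above. It assumes bash on Linux, and the helper name check_min is made up for this example.

```shell
# Report whether a current limit meets the minimum that Elasticsearch checks for.
check_min() {
  name="$1"; actual="$2"; required="$3"
  if [ "$actual" = "unlimited" ]; then
    echo "OK: $name is unlimited"
  elif [ "$actual" -ge "$required" ] 2>/dev/null; then
    echo "OK: $name=$actual (>= $required)"
  else
    echo "LOW: $name=$actual (need >= $required)"
  fi
}

# Run as the user that starts ES; thresholds are taken from the error messages.
check_min "max file descriptors" "$(ulimit -n)" 65536
check_min "max user processes" "$(ulimit -u 2>/dev/null || echo 0)" 2048
check_min "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)" 262144
```

Any line that starts with LOW points to the corresponding fix above.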
ES port description
1. 9200: HTTP port that serves external clients
2. 9300: TCP port for communication between nodes
Run ES in the background
./elasticsearch -d
Stop ES
Running in the foreground: press Ctrl+C
Running in the background: kill the ES process
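Killing the background process is easier if its PID is recorded at startup. The elasticsearch script accepts -p to write a pid file; the sketch below assumes /tmp/es.pid as that path, and the helper name stop_by_pidfile is made up for this example.

```shell
# Stop the process whose PID is stored in the given file.
stop_by_pidfile() {
  pidfile="$1"
  if [ ! -f "$pidfile" ]; then
    echo "no pid file: $pidfile" >&2
    return 1
  fi
  # A plain kill sends SIGTERM, which lets the process shut down cleanly.
  kill "$(cat "$pidfile")"
}

# Usage, run from the bin directory (the pid file path is an assumption):
#   ./elasticsearch -d -p /tmp/es.pid
#   stop_by_pidfile /tmp/es.pid
```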
Start on Windows
elasticsearch.bat
ES software directory description
ES configuration instructions
Configuration file separation
yml format description
ES important configuration parameters
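In a production environment, the data directory (path.data) and the log directory (path.logs) should be kept separate from the software installation
cluster.name: the name of the cluster the node belongs to; the default is elasticsearch and it can be customized
node.name: the node name; the default is the first 7 characters of a random UUID and it can be customized
network.host: the IP address the node binds to
http.port: the HTTP port; the default range is 9200-9300
transport.tcp.port: the port for communication between nodes; the default range is 9300-9400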
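Put together, the parameters above might look like the following in elasticsearch.yml. This is only a sketch: every value (cluster name, node name, paths, IP address) is a placeholder, and the file is written to a scratch directory so that it can be reviewed before being copied into config/.

```shell
# Write a sample config to a scratch directory for review.
conf_dir="$(mktemp -d)"
cat > "$conf_dir/elasticsearch.yml" <<'EOF'
# All values below are placeholders for illustration.
cluster.name: my-cluster
node.name: node-1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.1.10
http.port: 9200
transport.tcp.port: 9300
EOF
echo "wrote $conf_dir/elasticsearch.yml"
```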
Discovery configuration (node discovery)
The default node discovery mechanism in ES is Zen. Older versions supported both multicast and unicast; recent versions, including 6.x, use unicast only. Two parameters must be configured before going to production:
- discovery.zen.ping.unicast.hosts: ["host1","host2:port","host3[portX-portY]"] In unicast mode, lists the master-eligible nodes. A newly started node contacts the nodes in this list to ask to join the cluster.
- discovery.zen.minimum_master_nodes: 1 The minimum number of master-eligible nodes a node must see before it can operate in the cluster. The officially recommended value is (N/2)+1 with integer division, where N is the number of master-eligible nodes.
Other commonly used parameters:
- transport.tcp.compress: false Whether to compress data sent over TCP, default false.
- http.enabled: true Whether to serve external requests over HTTP, default true.
- http.max_content_length: 100mb The maximum size of an HTTP request body, default 100mb.
- node.master: true Whether the node is eligible to become master, default true. If the current master fails, a new master is elected from the eligible nodes.
- node.data: true Whether the node stores index data, default true.
- discovery.zen.ping_timeout: 3s The ping timeout used when discovering other nodes in the cluster, default 3 seconds. On a poor network, raising this value makes a node wait longer for responses, which reduces false failure detections to some extent.
- discovery.zen.ping.multicast.enabled: false Whether to discover nodes by multicast. This setting applies only to old versions that still support multicast.
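The (N/2)+1 rule can be worked through with a short sketch. It assumes a hypothetical cluster with three master-eligible nodes named host1 to host3 and prints the two discovery lines that would go into elasticsearch.yml on every node; the function name quorum is made up for this example.

```shell
# Quorum of master-eligible nodes: (N / 2) + 1, using integer division.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Hypothetical list of master-eligible nodes.
masters="host1 host2 host3"

# Build the quoted host list and count the nodes.
hosts=""
count=0
for h in $masters; do
  hosts="${hosts:+$hosts, }\"$h\""
  count=$((count + 1))
done

echo "discovery.zen.ping.unicast.hosts: [$hosts]"
echo "discovery.zen.minimum_master_nodes: $(quorum "$count")"
```

With three master-eligible nodes the quorum is 2, so the cluster can lose one of them and still elect a master, while two isolated halves can never both reach a quorum.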
JVM heap size setting
In a production environment, be sure to increase the JVM heap size in jvm.options.
JVM heap dump path setting
In a production environment, specify where the heap is dumped when an OutOfMemoryError occurs, so that the problem can be analyzed. Configure it in jvm.options:
-XX:HeapDumpPath=/var/lib/elasticsearch
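As a sketch of the two jvm.options settings discussed above, the helper below prints the relevant lines for a chosen heap size. The 4g value is an assumption for illustration and the function name jvm_heap_options is made up. The initial and maximum heap are set to the same value, as they are in the default jvm.options file.

```shell
# Print jvm.options lines for a given heap size and heap dump directory.
jvm_heap_options() {
  heap="$1"; dump_dir="$2"
  printf '%s\n' "-Xms$heap" "-Xmx$heap" \
    "-XX:+HeapDumpOnOutOfMemoryError" "-XX:HeapDumpPath=$dump_dir"
}

# 4g is an assumed size; the dump path is the one shown above.
jvm_heap_options 4g /var/lib/elasticsearch
```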
There are also important operating system configurations, please refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html
Kibana
Kibana is a visual management tool for ES.
Download the installation package
https://www.elastic.co/downloads/kibana
Installation
Unzip to the installation directory
Configuration
In config/kibana.yml, set elasticsearch.url to the access address of ES
Start
./bin/kibana
Integrating IKAnalyzer
IKAnalyzer Chinese word segmentation integration
Get the ES-IKAnalyzer plug-in
Address: https://github.com/medcl/elasticsearch-analysis-ik/releases
Install plugin
Unzip the IK package into the plugins/ directory under the ES installation directory, then restart ES. It is best to rename the extracted directory so that it does not clash with other plugins that unpack to the same name.
Extending the dictionary
Configuration file: config/IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
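<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
<!-- Users can configure their own extension stopword dictionaries here -->
<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Users can configure a remote extension stopword dictionary here -->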
<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
Test IK
1. Create an index
curl -XPUT http://localhost:9200/index
2. Create a mapping
curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
}
}
}'
3. Index some documents (the content is Chinese text, since IK is a Chinese analyzer and the search in step 4 looks for "中国")
curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -XPOST http://localhost:9200/index/fulltext/2 -H 'Content-Type:application/json' -d'{"content":"公安部:各地校车将享最高路权"}'
curl -XPOST http://localhost:9200/index/fulltext/3 -H 'Content-Type:application/json' -d'{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'
4. Try searching
curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type:application/json' -d'
{
"query" : { "match" : { "content" : "中国" }},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : { "content" : {} }
}
}'
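To see how IK actually segments a piece of text, the _analyze API can be called directly. The sketch below is a minimal example under a few assumptions: a node at localhost:9200, the IK plugin installed, and a made-up sample text; the helper name analyze_body is invented for this example.

```shell
# Assumed address of a local node; override ES_URL for another cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build an _analyze request body; the text must not contain double quotes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Ask ES to tokenize the sample text with the ik_max_word analyzer.
curl -s --max-time 5 -XPOST "$ES_URL/_analyze" \
  -H 'Content-Type: application/json' \
  -d "$(analyze_body ik_max_word "中国渔船")" \
  || echo "could not reach $ES_URL"
```

The response lists the tokens produced by the analyzer, which is a quick way to confirm that the plugin is loaded and that custom dictionary entries take effect.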