One. What is Elasticsearch
Elasticsearch is a highly scalable, open-source full-text search and analytics engine. It can store, search, and analyze large volumes of data quickly and in near real time. It typically serves as the underlying engine for applications with complex search features and requirements.
- ES is a distributed index store and can be used as a NoSQL data store.
- It exposes search as an external service over HTTP (a RESTful JSON API) or over the transport protocol (the Java transport client was deprecated in 7.0 and removed in 8.0).
- Internally, it behaves like a NoSQL document database.
Two. Elasticsearch use cases
- E-commerce website. Customers need to search for products. You can use ES to store the entire product catalog and inventory and to provide search and autocomplete on top of it.
- Log or transaction data collection. The collected data is analyzed and mined for trends, statistics, summaries, or anomalies.
- Price alert platform. Customers specify rules such as "If the price of a product drops below X within the next month, notify me".
- Business intelligence analysis. You want to quickly investigate, analyze, visualize, and ask ad hoc questions of large amounts of data.
Three. Elasticsearch features
Elasticsearch wraps Lucene and works out of the box, which makes it easier to use. It supports clustering and dynamic scaling of cluster nodes, and it puts considerable work into high availability. It is a search engine, not an index library.
- Elasticsearch is built on Lucene and uses Lucene to do the actual indexing and search work.
- Each shard in Elasticsearch is a separate Lucene instance.
- Elasticsearch provides a distributed, JSON-based REST API on top of Lucene, which makes Lucene's functionality easier to use.
- Elasticsearch provides additional supporting features, such as thread pools, queues, node/cluster monitoring APIs, data monitoring APIs, and cluster management.
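To make the JSON-based REST API concrete, here is a minimal sketch that builds a full-text search request with the Python standard library. The node address `http://localhost:9200`, the index name `products`, and the field name `name` are assumptions for illustration only; the request is built but not sent unless you call `send`.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a full-text search request for one index."""
    # A "match" query runs an analyzed full-text search on one field.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running node and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical node address, index, and field; adjust to your cluster.
request = build_search_request("http://localhost:9200", "products", "name", "phone")
print(request.get_method(), request.full_url)
```

With a node running, `send(request)` would return the search response as a Python dictionary.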
Four. Basic Concepts of Elasticsearch
1. Cluster
A cluster is a collection of one or more nodes (servers) that together hold all of the data and provide indexing and search across all nodes. A cluster is identified by a unique name, which is "elasticsearch" by default. This name matters because a node can only join a cluster if it is configured with that cluster's name.
2. Node
A node is a single server that is part of a cluster; it stores data and participates in the cluster's indexing and searching. Like a cluster, a node is identified by a name, which defaults to a random unique identifier (UUID) assigned to the node at startup. If you do not want the default, you can set the node name yourself. The name is important for managing the cluster, so choose names that make it easy to tell which server corresponds to which node in the cluster.
3. Index
An index is a collection of documents with similar characteristics. For example, you can have one index for customer data, another for a product catalog, and another for order data.
4. Type
Within an index, you can define one or more types. A type is a logical category or partition of the index. Generally, a type is defined for documents that share a common set of fields.
5. Document
A document is the basic unit of information that can be indexed.
Elasticsearch | Relational Database |
---|---|
Index (Indices) | Database |
Type (only one type per index in 6.x, several allowed before that, deprecated in 7.x) | Table |
Document | Row |
Field | Column |
Mapping | Schema |
Query DSL (structured query language) | SQL |
GET http:9200/index... | SELECT * FROM table... where ... |
A major drawback of ES: as a non-relational store, it has no way to run join queries, that is, queries that relate documents across indices the way SQL joins tables.
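As a rough illustration of the last two rows of the comparison table, the sketch below builds a Query DSL body that plays the role of `SELECT * FROM table WHERE field = value`. It assumes the field holds exact values (a keyword field); the field name `brand` and the value `acme` are made up for the example.

```python
import json


def equals_filter(field, value):
    """Query DSL body roughly matching: SELECT * FROM <index> WHERE field = value."""
    # A term query in filter context matches exact values without scoring.
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}


# Hypothetical field and value, for illustration only.
print(json.dumps(equals_filter("brand", "acme"), indent=2))
```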
6. Shard
In real projects, an index may hold more data than the hardware of a single node can handle. For example, a single index of one billion documents that occupies 1TB of disk space may not fit on the disk of one node, or a single node may be too slow to serve search requests against it.
To solve this problem, Elasticsearch can subdivide an index into multiple shards. When creating an index, you only need to define the number of shards you want. Each shard is itself a fully functional, independent "index" that can be hosted on any node in the cluster.
7. Replicas
Because a shard or node can go offline or disappear for any reason, a failover mechanism is strongly recommended to keep the data highly available. To this end, Elasticsearch lets you make one or more copies of an index's shards, called replica shards (replicas for short).
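The shard and replica counts are given as index settings when the index is created. The following sketch builds such a request body and works out how many shard copies the cluster would have to host; the values 5 and 1 are example numbers, not recommendations.

```python
import json


def index_settings(primary_shards, replicas_per_primary):
    """Request body for creating an index with explicit shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas_per_primary,
            }
        }
    }


def total_shard_copies(primary_shards, replicas_per_primary):
    """Primaries plus all replica copies that the cluster must host."""
    return primary_shards * (1 + replicas_per_primary)


body = index_settings(primary_shards=5, replicas_per_primary=1)
print(json.dumps(body))
print(total_shard_copies(5, 1))  # 5 primaries + 5 replicas = 10
```

The body would be sent with a PUT request to the index name. The number of replicas can be changed later, whereas the number of primary shards is fixed when the index is created.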
8. Near real-time search
With the development of per-segment search, the delay between indexing a document and making it searchable dropped dramatically. New documents could be made searchable within minutes, but that is still not fast enough.
The bottleneck is the disk. Committing a new segment to disk requires an fsync to ensure that the segment is physically written, so that no data is lost in the event of a power failure. But fsync is expensive; running it every time a document is indexed would cause a serious performance problem.
What is needed is a more lightweight way to make a document searchable, which means taking fsync out of that path.
Between Elasticsearch and the disk sits the file system cache. As described previously, documents in the in-memory index buffer are written to a new segment. The new segment is written to the file system cache first, which is cheap, and only later flushed to disk, which is expensive. But once a file is in the cache, it can be opened and read like any other file.
Lucene allows new segments to be written and opened, making the documents they contain visible to search, without performing a full commit. This is much lighter than a commit and can be done frequently without hurting performance.
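In Elasticsearch, this lightweight write-and-open step is called a refresh. The toy model below separates the three states a document passes through: buffered, searchable, and committed. It is a simplified sketch for intuition only; the class and method names are invented here and do not mirror Lucene's implementation.

```python
class ToyShard:
    """Toy model of near real-time search; not how Lucene is implemented."""

    def __init__(self):
        self.buffer = []    # in-memory index buffer: not yet searchable
        self.segments = []  # opened segments: searchable, maybe not yet on disk

    def index(self, doc):
        """Accept a document; it stays invisible to search until a refresh."""
        self.buffer.append(doc)

    def refresh(self):
        """Cheap step: turn the buffer into a new searchable segment (no fsync)."""
        if self.buffer:
            self.segments.append({"docs": list(self.buffer), "committed": False})
            self.buffer.clear()

    def commit(self):
        """Expensive step: stands in for fsyncing every segment to disk."""
        self.refresh()
        for segment in self.segments:
            segment["committed"] = True

    def search(self, term):
        """Return documents containing the term, from opened segments only."""
        return [d for s in self.segments for d in s["docs"] if term in d]


shard = ToyShard()
shard.index("quick brown fox")
print(shard.search("fox"))  # nothing yet: the document is only in the buffer
shard.refresh()
print(shard.search("fox"))  # now visible, although not yet committed
```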
Five. Elasticsearch installation and configuration
Version selection
The major Elasticsearch versions in use are 5, 6, and 7. Machine performance is one factor in choosing between them, and version 5 is still running in many production environments.
ES is written in Java, so make sure a JDK is installed on the machine before installing ES. The Java requirements are as follows:
- ES 5.x or higher requires Java 8 or higher
- Java 11 is supported starting with ES 6.5
- Starting with ES 7.0, ES ships with a bundled JDK
Version: elasticsearch-6.6.0.tar.gz
JDK: the minimum version is JDK 1.8.0_133. Do not use the system's default OpenJDK, which tends to cause unexpected problems.
Download link: https://elasticsearch.cn/download/
Alternate download link: https://pan.baidu.com/s/14Xk0gv128mRVjsk89xFy9Q (extraction code: fdx7)
# Download elasticsearch-6.6.0
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.6.0.tar.gz
# Extract the archive
$ tar xvzf elasticsearch-6.6.0.tar.gz
# Move it to the target directory
$ mv ./elasticsearch-6.6.0 /usr/path/to
Configuration (Detailed configuration file)
Configuration item | Description |
---|---|
cluster.name | Name of the Elasticsearch cluster. The default is elasticsearch. |
node.name | Node name. By default, Elasticsearch uses the first seven characters of the node ID, a randomly generated UUID, as the node name. The node ID is persisted and does not change across restarts, so the default node name does not change either. |
node.attr.rack | Information about the rack that hosts the node's server. |
path.data | Storage path for index data. The default is the data folder under the Elasticsearch root directory. Multiple paths can be given, separated by commas. |
path.logs | Storage path for log files. The default is the logs folder under the Elasticsearch root directory. |
bootstrap.memory_lock | Setting it to true locks the process memory so that no JVM memory is swapped to disk, which keeps swapping from degrading Elasticsearch performance. |
network.host | IP address of the Elasticsearch node. |
http.port | HTTP port on which Elasticsearch serves external requests. The default is 9200. |
discovery.zen.ping.unicast.hosts | Initial list of master-eligible nodes in the cluster. These hosts are used to discover new nodes that join the cluster. |
discovery.zen.minimum_master_nodes | Guards against "split brain" by defining the minimum number of master-eligible nodes that must be visible in order to form a cluster. The default is 1. |
gateway.recover_after_nodes | Recovery of local shards starts as soon as this many nodes have joined the cluster. The default is 0. |
action.destructive_requires_name | When set to true, indices can only be deleted by explicit name; deleting matching indices with _all or wildcards is not allowed. This setting can be updated dynamically through the REST API. |
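For discovery.zen.minimum_master_nodes, the usual recommendation for 6.x clusters is a quorum of the master-eligible nodes, that is, (master-eligible nodes / 2) + 1 with integer division. The short sketch below works this out for a few cluster sizes; the function name is a hypothetical helper, not part of Elasticsearch.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

With three master-eligible nodes the quorum is 2, so a single isolated node cannot elect itself master and form a second cluster.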
System configuration file
memory_lock configuration
If bootstrap.memory_lock is set to true and Elasticsearch reports an error at startup, you need to modify the following two Linux system files:
Modify /etc/security/limits.conf (each entry starts with a domain field; * applies the limit to all users)
* soft nofile 65536
* hard nofile 65536
* soft nproc 32000
* hard nproc 32000
* hard memlock unlimited
* soft memlock unlimited
Modify /etc/systemd/system.conf
DefaultLimitNOFILE=65536
DefaultLimitNPROC=32000
DefaultLimitMEMLOCK=infinity
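After editing these files and logging in again, you can check whether the new limits took effect for the current process. The following is a minimal sketch using Python's standard `resource` module on Linux; the `REQUIRED` values simply mirror the settings above, and the helper names are made up for this example.

```python
import resource

# Targets mirroring the limits.conf entries above.
REQUIRED = {"nofile": 65536, "nproc": 32000, "memlock": resource.RLIM_INFINITY}


def limit_ok(current, required):
    """True if the current limit is unlimited or at least the required value."""
    if current == resource.RLIM_INFINITY:
        return True
    if required == resource.RLIM_INFINITY:
        return False
    return current >= required


def current_soft_limits():
    """Soft limits of the current process (Linux resource names)."""
    names = {
        "nofile": resource.RLIMIT_NOFILE,
        "nproc": resource.RLIMIT_NPROC,
        "memlock": resource.RLIMIT_MEMLOCK,
    }
    return {key: resource.getrlimit(const)[0] for key, const in names.items()}


def report(limits):
    """Map each limit name to whether it meets the required value."""
    return {key: limit_ok(limits[key], REQUIRED[key]) for key in REQUIRED}
```

Calling `report(current_soft_limits())` as the user that runs Elasticsearch shows which limits still fall short.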
max_map_count warning
If the following warning appears at startup, raise vm.max_map_count as described below:
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
Temporary setting
A temporary setting reverts to the previous value when Linux restarts.
$ sysctl -w vm.max_map_count=262144
# Check the result:
$ sysctl -a|grep vm.max_map_count
# Output:
vm.max_map_count = 262144
Permanent setting
Add the following line at the end of the /etc/sysctl.conf file, then run sysctl -p to apply it without rebooting:
vm.max_map_count=262144
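To verify the value from a script, one option is to parse the sysctl output or read the kernel's value directly from /proc. This is a small sketch under the assumption of a Linux host; the function names are illustrative.

```python
REQUIRED_MAX_MAP_COUNT = 262144  # threshold quoted in the startup warning


def parse_sysctl_line(line):
    """Parse 'vm.max_map_count = 262144' style output into (key, int value)."""
    key, _, value = line.partition("=")
    return key.strip(), int(value.strip())


def is_high_enough(value, required=REQUIRED_MAX_MAP_COUNT):
    """True if the setting satisfies the Elasticsearch bootstrap check."""
    return value >= required


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the live kernel value (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


key, value = parse_sysctl_line("vm.max_map_count = 262144")
print(key, value, is_high_enough(value))
```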