Elasticsearch series (1): introduction and setup from zero to one

 

One. What is Elasticsearch

Elasticsearch is a highly scalable, open source full-text search and analytics engine. It can store, search and analyze large amounts of data quickly and in near real-time. It is usually used as the underlying engine for applications with complex search features and requirements.

  • ES is a distributed index store and can be treated as a NoSQL store
  • It exposes search to clients over HTTP (RESTful JSON) or the transport protocol (the Java transport client is deprecated as of 7.0 and removed in 8.0)
  • Internally it behaves like a NoSQL document database
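
As a rough illustration of the RESTful JSON interface mentioned above, the following Python sketch builds (but does not send) a search request using only the standard library. The host `http://localhost:9200`, the index name `products` and the field `name` are assumptions made up for this example; `_search` and the `match` query are part of the standard search API.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) an HTTP search request for one index."""
    # A full-text "match" query expressed as a JSON body.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index and field names, for illustration only.
request = build_search_request("http://localhost:9200", "products", "name", "phone")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the search, the request object could be passed to `urllib.request.urlopen` against a running node.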

Two. Elasticsearch use cases

  1. E-commerce website. Customers search for products; ES can store the entire product catalog and inventory and provide search and auto-complete over them.

  2. Log or transaction data collection. The data is collected, then analyzed and mined for trends, statistics, summaries or anomalies.

  3. Price alert platform. Customers specify rules such as "If the price of a product drops below X within the next month, notify me".

  4. Business intelligence analysis. Quickly investigate, analyze, visualize and ask ad hoc questions about large amounts of data.

Three. Elasticsearch features

Elasticsearch wraps Lucene and works out of the box, which makes it easier to use. It supports clustering, including adding nodes to a running cluster, and puts considerable effort into high availability. It is a search engine, not an index library.

  • Elasticsearch is built on Lucene and uses Lucene to do the actual indexing and searching.

  • Each shard in Elasticsearch is a separate Lucene instance.

  • Elasticsearch provides a distributed, JSON-based REST API on top of Lucene, which makes Lucene's functionality easier to use.

  • Elasticsearch provides additional supporting features, such as thread pools, queues, node/cluster monitoring APIs, data monitoring APIs and cluster management.

 

Four. Basic Concepts of Elasticsearch 

1. Cluster

A cluster is a collection of one or more nodes (servers) that together store all the data and provide indexing and search across all nodes. The cluster is identified by a unique name, which is "elasticsearch" by default. This name matters because a node can only be part of a cluster if it is configured to join that cluster by name.

2. Node

A node is a single server that is part of a cluster, stores data, and participates in the cluster's indexing and searching. Like a cluster, a node is identified by a name, which by default is derived from a random unique identifier (UUID) assigned to the node at startup. If you do not want the default node name, you can set your own. The name is important for managing the cluster: choose names that make it easy to tell which server corresponds to which node in the cluster.

3. Index 

An index is a collection of documents with similar characteristics. For example, you can create one index for customer data, another for a product catalog, and another for order data.

4. Type

Within an index, you can define one or more types. A type is a logical category or partition of the index. Generally, a type is defined for documents that share a set of common fields.

5. Document

A document is the basic unit of information that can be indexed.

Elasticsearch | Relational database
Index (indices) | Database
Type (only one per index in 6.x, several allowed before that, gone in 7.x) | Table
Document | Row
Field | Column
Mapping | Schema
Query DSL | SQL (structured query language)
GET http:9200/index... | SELECT * FROM table... where ...

The fatal disadvantage of ES: as a non-relational store it has no way to join queries, that is, it cannot do relational joins across indices.
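
To make the SQL-to-query-DSL row of the comparison more concrete, here is a minimal Python sketch that builds query DSL bodies roughly equivalent to two simple SELECT statements. The field name `status` and the helper names are hypothetical; `term`, `match_all` and `size` are standard query DSL elements.

```python
import json


def select_all(size=10):
    """Roughly: SELECT * FROM table LIMIT size."""
    return {"query": {"match_all": {}}, "size": size}


def select_where_equals(field, value, size=10):
    """Roughly: SELECT * FROM table WHERE field = value LIMIT size.

    A "term" query matches exact values, so it suits keyword or
    numeric fields rather than analyzed full-text fields.
    """
    return {"query": {"term": {field: value}}, "size": size}


# Hypothetical field name and value, for illustration only.
body = select_where_equals("status", "paid", size=5)
print(json.dumps(body, indent=2))
```

Either body would be sent as the JSON payload of a search request against one index; there is no counterpart for a SQL JOIN.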

6. Shard 

In real projects, an index may hold more data than the hardware of a single node can handle. For example, a single index of one billion documents may occupy 1TB of disk space. The disk of a single node may not be able to hold it, or a single node may be too slow to serve search requests.

To solve this problem, Elasticsearch can subdivide an index into multiple shards. When creating an index, you only need to define the required number of shards. Each shard is itself a fully functional and independent "index", and can be hosted on any node in the cluster.
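
The idea of spreading documents over shards can be sketched as a hash of a routing key taken modulo the number of primary shards. The Python below is only an illustration of that idea: it uses `zlib.crc32` as a stand-in hash (Elasticsearch uses its own hash function internally), and the document IDs and the shard count of 5 are assumptions made up for the example.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Map a routing key (by default the document ID) to a shard number."""
    # Stand-in hash for illustration; not the hash Elasticsearch uses.
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % number_of_primary_shards


# Hypothetical document IDs spread over 5 primary shards.
doc_ids = [f"doc-{n}" for n in range(8)]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)
```

Because the shard number depends on the shard count, changing the count would send existing documents to different shards, which is one way to see why the number of shards is defined when the index is created.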

7. Replicas 

Shards and nodes can go offline or disappear for any number of reasons, so a failover mechanism is strongly recommended for high availability. For this purpose, Elasticsearch allows one or more copies of each index shard to be made; these copies are called replica shards (replicas for short).
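
As a small sketch of how shards and replicas fit together, the Python below builds an index settings body and counts how many shard copies the cluster would have to place. The values 5 and 1 are assumptions for the example; `number_of_shards` and `number_of_replicas` are the standard index settings.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Settings body that could be sent when creating an index."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard exists once, plus once per replica."""
    return number_of_shards * (1 + number_of_replicas)


settings_body = index_settings(5, 1)
print(json.dumps(settings_body, indent=2))
print("shard copies to place:", total_shard_copies(5, 1))
```

With 5 primaries and 1 replica there are 10 shard copies in total. A replica is never placed on the same node as its primary, so a single-node cluster leaves the replicas unassigned.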

8. Near real-time search 

With the development of per-segment search, the delay between indexing a new document and being able to search it dropped significantly. New documents could become searchable within minutes, but that is still not fast enough.

The disk is the bottleneck here. Committing a new segment to disk requires an fsync to ensure that the segment is physically written, so that no data is lost in the event of a power failure. But fsync is expensive; running it every time a document is indexed would cause a serious performance problem.

What is needed is a lighter way to make a document searchable, which means taking fsync out of that path.

Between Elasticsearch and the disk sits the file system cache. Documents in the in-memory indexing buffer are written to a new segment, but the new segment is written to the file system cache first, which is cheap, and only flushed to disk later, which is expensive. Once a file is in the cache, it can be opened and read like any other file.

Lucene allows new segments to be written and opened, making the documents they contain visible to search, without a full commit. This is much cheaper than a commit and can be done frequently without hurting performance. In Elasticsearch this step is called a refresh; in 6.x each shard is refreshed once per second by default, which is why search is described as near real-time.
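
The write path described above can be pictured with a toy model. The Python class below is only a sketch for intuition, not how Lucene or Elasticsearch is implemented: the class and method names are made up, segments are plain lists, and the transaction log is left out entirely.

```python
class ToyShard:
    """Toy model: buffer -> searchable segment (cache) -> durable segment (disk)."""

    def __init__(self):
        self.buffer = []            # in-memory indexing buffer, not searchable
        self.cached_segments = []   # in the file system cache, searchable
        self.durable_segments = []  # fsynced to disk, searchable

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Cheap step: turn the buffer into a new searchable segment.
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Expensive step: stands in for the fsync that makes segments durable.
        self.refresh()
        self.durable_segments.extend(self.cached_segments)
        self.cached_segments.clear()

    def search(self, term):
        segments = self.durable_segments + self.cached_segments
        return [doc for segment in segments for doc in segment if term in doc]


shard = ToyShard()
shard.index("red shoes")
before_refresh = shard.search("shoes")  # still only in the buffer
shard.refresh()
after_refresh = shard.search("shoes")   # visible without any fsync
print(before_refresh, after_refresh)
```

The document is invisible to search until `refresh` runs, and becomes visible without waiting for the expensive `flush`.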

Five. Elasticsearch installation and configuration 

Version selection 

The major versions of Elasticsearch in common use are 5, 6 and 7. The choice should take the performance of the machine into account, and version 5 is still in use in many production environments.

ES is written in Java, so make sure a JDK is installed on the machine before installing. The Java requirements are as follows:

  • ES 5.x and later require Java 8 or later
  • Java 11 is supported from ES 6.5
  • From ES 7.0, ES ships with a bundled Java environment

Version: elasticsearch-6.6.0.tar.gz

JDK: The minimum version is JDK 1.8.0_131. Avoid the system's default OpenJDK, which tends to cause unexpected problems.

Download link: https://elasticsearch.cn/download/

Alternate download link: https://pan.baidu.com/s/14Xk0gv128mRVjsk89xFy9Q (extraction code: fdx7)

# Download elasticsearch-6.6.0
$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.6.0.tar.gz
 
# Extract the archive
$ tar xvzf elasticsearch-6.6.0.tar.gz
 
# Move it to the target directory
$ mv ./elasticsearch-6.6.0 /usr/path/to

Configuration (Detailed configuration file)

Configuration item | Description
cluster.name | Elasticsearch cluster name. The default is elasticsearch.
node.name | Node name. By default, Elasticsearch uses the first seven characters of a randomly generated UUID as the node ID. The node ID is persistent and does not change on restart, so the default node name does not change either.
node.attr.rack | Information about the rack in which the node's server is located.
path.data | Storage path for index data. The default is the data folder under the Elasticsearch root directory. Multiple paths can be given, separated by commas.
path.logs | Storage path for log files. The default is the logs folder under the Elasticsearch root directory.
bootstrap.memory_lock | When set to true, JVM memory is locked and never swapped to disk, which prevents swapping from hurting Elasticsearch performance.
network.host | The IP address the Elasticsearch node binds to.
http.port | HTTP port on which Elasticsearch serves external requests. The default is 9200.
discovery.zen.ping.unicast.hosts | Initial list of master-eligible nodes in the cluster; these nodes are used to discover new nodes that join the cluster.
discovery.zen.minimum_master_nodes | Guards against "split brain" by defining the minimum number of master-eligible nodes that must be visible in order to form a cluster. The default is 1; the usual recommendation is a quorum, (number of master-eligible nodes / 2) + 1.
gateway.recover_after_nodes | Recovery of local shards starts once this many nodes have joined the cluster. The default is 0.
action.destructive_requires_name | When enabled, an index can only be deleted by its explicit name; _all and wildcard expressions are not allowed for deletion. This setting can be updated dynamically through the REST API.
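
To show how these settings might look together, the Python sketch below renders a few of them as elasticsearch.yml text and derives minimum_master_nodes from the quorum rule, (master-eligible nodes / 2) + 1. The cluster name, node name and IP addresses are assumptions made up for the example, and the helper functions are hypothetical, not part of any Elasticsearch tooling.

```python
def quorum(master_eligible_nodes):
    """Majority of master-eligible nodes: (n / 2) + 1 with integer division."""
    return master_eligible_nodes // 2 + 1


def render_config(settings):
    """Render flat settings as elasticsearch.yml text."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):  # checked before other types: YAML wants true/false
            rendered = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            rendered = "[" + ", ".join(f'"{item}"' for item in value) + "]"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


# Hypothetical three-node cluster; all names and addresses are made up.
hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
settings = {
    "cluster.name": "demo-cluster",
    "node.name": "node-1",
    "bootstrap.memory_lock": True,
    "network.host": "10.0.0.1",
    "http.port": 9200,
    "discovery.zen.ping.unicast.hosts": hosts,
    "discovery.zen.minimum_master_nodes": quorum(len(hosts)),
}
config_text = render_config(settings)
print(config_text)
```

With three master-eligible nodes the quorum is 2, so the cluster keeps working if one node is lost but cannot split into two independent halves.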

System configuration file 

memory_lock configuration 

If bootstrap.memory_lock is set to true and Elasticsearch fails to start with a memory locking error, you need to modify the following two Linux system files:

Modify /etc/security/limits.conf  

# <domain> <type> <item> <value>; "*" applies the limit to all users
* soft nofile 65536
* hard nofile 65536
* soft nproc 32000
* hard nproc 32000
* hard memlock unlimited
* soft memlock unlimited

Modify /etc/systemd/system.conf 

DefaultLimitNOFILE=65536
DefaultLimitNPROC=32000
DefaultLimitMEMLOCK=infinity
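
After changing these files and logging in again, the effective limits can be checked from the account that runs Elasticsearch. The following Python sketch reads them with the standard `resource` module (Unix only); the helper names are made up for this example, and the values it prints are those of the current process, not necessarily of a service started by systemd.

```python
import resource

# Limit names as they appear in limits.conf, mapped to resource constants.
LIMITS = {
    "nofile": resource.RLIMIT_NOFILE,
    "nproc": resource.RLIMIT_NPROC,
    "memlock": resource.RLIMIT_MEMLOCK,
}


def describe(value):
    """Show the 'no limit' sentinel as the word used in limits.conf."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the current process."""
    report = {}
    for name, limit in LIMITS.items():
        soft, hard = resource.getrlimit(limit)
        report[name] = (describe(soft), describe(hard))
    return report


limits_report = current_limits()
for name, (soft, hard) in limits_report.items():
    print(f"{name}: soft={soft} hard={hard}")
```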

 max_map_count warning 

At startup, if a warning appears: 

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Temporary setting 

This setting is temporary: when Linux restarts, the value reverts to what it was before the change.

$ sysctl -w vm.max_map_count=262144
 
# Check the result:
$ sysctl -a|grep vm.max_map_count
 
# Output:
vm.max_map_count = 262144

Permanent setting

Add the following line at the end of the /etc/sysctl.conf file, then run sysctl -p (or reboot) for it to take effect:

vm.max_map_count=262144
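
The value can also be checked before starting Elasticsearch. The Python sketch below reads it from /proc/sys/vm/max_map_count, the file behind the sysctl key on Linux, and compares it with the 262144 minimum from the warning above; the function names are made up for this example.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum quoted in the startup warning
PROC_FILE = Path("/proc/sys/vm/max_map_count")


def read_max_map_count(path=PROC_FILE):
    """Read the current vm.max_map_count value (Linux only)."""
    return int(Path(path).read_text().strip())


def is_sufficient(value, required=REQUIRED_MAX_MAP_COUNT):
    return value >= required


if PROC_FILE.exists():
    current = read_max_map_count()
    status = "ok" if is_sufficient(current) else "too low"
    print(f"vm.max_map_count = {current} ({status})")
else:
    print("vm.max_map_count is not available on this system")
```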

 


Origin blog.csdn.net/qq_38130094/article/details/108101316