Elasticsearch study notes (1) ES overall introduction and installation examples

Table of Contents

Introduction to ES

What is ES?

The development history of ES

Elastic Stack

ES popularity

Features of ES

ES application scenarios

ES architecture

ES architecture introduction

The core concept of ES

Compare RDBMS

ES learning resources

Installation & configuration

ES installation package

JAVA requirements

Installation example on linux

Possible failures when running on a linux virtual machine

ES port description

Run ES in the background

Close ES

Start in windows

ES software directory description

ES configuration instructions

Configuration file separation

yml format description

ES important configuration parameters

The data directory and log directory should be kept separate from the software directory in the production environment

The name of the cluster to which it belongs, the default is elasticsearch, which can be customized

Node name, the default is the first 7 characters of UUID, can be customized

network.host IP binding

http.port: 9200-9300  

transport.tcp.port: 9300-9400  

Discovery Config node discovery configuration

Jvm heap size setting

JVM heap dump path setting

Kibana

Download the installation package

Installation

Configuration

Start up

Integrating IKAnalyzer

IKAnalyzer Chinese word segmentation integration

Get the ES-IKAnalyzer plug-in

Install plugin

Expanded thesaurus

Test IK


Introduction to ES

What is ES?

  • Elasticsearch is an open source search engine based on the full-text search engine library Apache Lucene.
  • Written in Java, it uses Lucene internally for indexing and searching, but its purpose is to make full-text retrieval simple, by hiding the complexity of Lucene, and instead providing a set of simple and consistent RESTful APIs.
  • Elasticsearch is more than just a full-text search engine. It can be described more precisely as:
    • A distributed real-time document store in which every field can be indexed and searched
    • A distributed real-time analytics search engine
    • Capable of scaling to hundreds of server nodes and supporting petabytes (PB) of structured or unstructured data

The development history of ES

  • Elasticsearch was later run as a company (Elastic), which positions it as a data search and analysis platform. In June 2014 it received 70 million US dollars in financing, bringing its total funding to more than 100 million US dollars. ES now has clients for many languages, such as Java, Ruby, Python, PHP, Perl and .NET, and it can also be integrated with big data platforms such as Hadoop and Spark.
  • A series of open source software has grown up around Elasticsearch, collectively referred to as the Elastic Stack (see the next section).
  • To avoid version confusion, Elastic has unified the version numbers of all components starting from 5.0. The components you use together should have matching version numbers (version number format: x.y.z; z may differ).
  • At the time of writing, the latest version of Elasticsearch is 6.2.4, which was released on April 17, 2018.
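The version rule above can be sketched as a small check. This is only an illustration: the function name same_release_line is hypothetical, and it assumes plain x.y.z version strings.

```python
def same_release_line(version_a: str, version_b: str) -> bool:
    """Return True when two x.y.z versions share the same x and y."""
    major_a, minor_a, _patch_a = version_a.split(".")
    major_b, minor_b, _patch_b = version_b.split(".")
    return (major_a, minor_a) == (major_b, minor_b)


# Elasticsearch 6.2.4 with Kibana 6.2.3: only z differs.
print(same_release_line("6.2.4", "6.2.3"))
# Elasticsearch 6.2.4 with Kibana 6.1.4: y differs.
print(same_release_line("6.2.4", "6.1.4"))
```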

Elastic Stack

  • Elasticsearch: distributed search engine
  • Logstash: log collection and analysis tool
  • Kibana: visual analysis platform
  • Beats: family of data collection tools (can take over data collection from Logstash)
  • X-Pack: feature pack

ES popularity

Search engine ranking on DB-Engines (a site that collects and publishes statistics on database management systems; website: https://db-engines.com/en/ranking/search+engine)

ES also ranks well among database management systems overall.

Features of ES

See the introduction on the official website: https://www.elastic.co/cn/products/elasticsearch

Fast, scalable, resilient, flexible, simple to operate, multi-language clients, X-Pack, Hadoop/Spark integration, and usable out of the box.

  • Distributed: horizontal scaling is very flexible
  • Full-text search: powerful full-text search capabilities based on Lucene
  • Near real-time search and analysis: once data enters ES it becomes searchable in near real time, and it can also be aggregated and analyzed
  • High availability: fault tolerance, automatic discovery of new or failed nodes, and reorganization and rebalancing of data
  • Schema-free: the dynamic mapping mechanism of ES automatically detects the structure and types of the data, indexes it, and makes it searchable
  • RESTful API: JSON over HTTP
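To make the dynamic mapping point concrete, here is a minimal sketch of how a field type might be guessed from a JSON value. The names guess_field_type and guess_mapping are hypothetical, and the rules are deliberately simplified: real dynamic mapping also detects dates, handles arrays, and adds a keyword sub-field to strings.

```python
def guess_field_type(value):
    """Map a JSON value to the field type dynamic mapping would typically pick."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    # Arrays, nulls and anything else are out of scope for this sketch.
    return "unknown"


def guess_mapping(document):
    """Build a field -> type table for the top-level fields of one document."""
    return {field: guess_field_type(value) for field, value in document.items()}


doc = {"title": "ES notes", "views": 42, "score": 4.5, "published": True}
print(guess_mapping(doc))
```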

ES application scenarios

  • Site Search
  • NoSQL database
  • Log analysis
  • Data analysis

ES architecture

ES architecture introduction

  • Gateway is the storage layer that ES uses to persist indexes; it supports multiple types of file system.
  • Above Gateway sits a distributed Lucene framework.
  • On top of Lucene are the ES modules, including the index module, search module, mapping parsing module, and so on.
  • Above the ES modules are Discovery, Scripting and third-party plug-ins. Discovery is the node discovery module of ES: to form a cluster, ES nodes on different machines need to exchange messages and elect a master node, and both tasks are handled by the Discovery module. It supports multiple discovery mechanisms, such as Zen, EC2, GCE and Azure. Scripting supports embedding scripting languages such as JavaScript and Python in queries. The scripting module parses these scripts, and queries that use scripts perform somewhat worse. ES also supports a variety of third-party plug-ins.
  • The next layer up is the ES transport module and JMX. The transport module supports multiple protocols, such as Thrift, memcached and HTTP, with HTTP used by default. JMX is the Java management framework used to manage the ES application.
  • The top layer is the interface that ES exposes to users, who interact with the ES cluster through RESTful APIs.

The core concept of ES

  • Near real-time (NRT): after data is submitted to an index, it becomes searchable after a short delay (about one second by default) rather than instantly.
  • Cluster: a cluster is identified by a unique name, which defaults to "elasticsearch". The cluster name is important because only nodes with the same cluster name form a cluster. It can be specified in the configuration file.
  • Node: a node stores the cluster's data and takes part in the cluster's indexing and search functions. Like the cluster, each node has its own name. By default the first seven characters of a random UUID are used as the node name at startup, but you can assign any name. Nodes discover their peers on the network through the cluster name and form a cluster. A single node can also be a cluster.
  • Index: An index is a collection of documents (equivalent to a collection in Solr). Each index has a unique name, and it is operated by this name. There can be any number of indexes in a cluster.
  • Type: a type allows different kinds of documents, such as user data and blog data, to be indexed in the same index. Types are deprecated since version 6.0.0, where an index stores only one type of document.
  • Document: A piece of data to be indexed, the basic information unit of the index, expressed in JSON format.
  • Shard: when creating an index, you can specify how many shards it is split into for storage. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster. Benefits of sharding:
    • Allows horizontal splitting and scaling of capacity. The number of shards is specified when the index is created and cannot be changed afterward; the number of replicas can be changed at any time.
    • Operations can be distributed and run in parallel across multiple shards, which improves performance and throughput.
  • Replica: a shard can have multiple copies (replicas). Benefits of replicas:
    • High availability: shards are divided into primary shards and replica shards, and a replica can take over if its primary fails.
    • Greater search concurrency and throughput: searches can run in parallel on all copies.
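One way to see why the number of shards is fixed at creation time is hash-based routing, where a document's shard is derived from a hash of its ID modulo the number of primary shards. The sketch below illustrates that idea only: shard_for is a hypothetical name, and zlib.crc32 is a stand-in hash chosen for the example, not the hash function Elasticsearch uses.

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Pick a primary shard for a document ID (stand-in hash, illustration only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(100)]
with_5 = {d: shard_for(d, 5) for d in doc_ids}
with_6 = {d: shard_for(d, 6) for d in doc_ids}

# Count how many documents would be looked up on a different shard
# if the shard count were changed from 5 to 6 after indexing.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because most documents would land on a different shard under the new count, existing data could no longer be found where routing expects it. Replicas are whole copies of a shard, so their number does not affect routing.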

When "index" is used as a verb, it means indexing data, that is, adding documents to an index so that they can be searched.

Compare RDBMS

 

 

RDBMS                       Elasticsearch
--------------------------  ----------------------------
Database                    Index
Table                       Type (deprecated since 6.0.0)
Row                         Document
Column                      Field
Table structure (schema)    Mapping
Index                       Inverted index
SQL                         Query DSL
SELECT * FROM table         GET http://....
UPDATE table SET            PUT http://....
DELETE                      DELETE http://...
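The "Inverted index" entry in the comparison is the least familiar one for readers coming from an RDBMS, so here is a toy sketch of the idea: each term points to the documents that contain it. The function names and sample documents are made up for illustration, and the whitespace tokenizer is far simpler than a real Lucene analyzer.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of the documents containing the term."""
    return sorted(index.get(term.lower(), set()))


docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
    3: "Kibana visualizes data",
}
index = build_inverted_index(docs)
print(search(index, "search"))  # [1, 2]
print(search(index, "kibana"))  # [3]
```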

ES learning resources

  • The official documentation is the best learning resource, detailed and comprehensive, and the official website also provides some videos: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
  • The official website also provides the Chinese edition of the Definitive Guide, which is useful for study (it covers a somewhat older version): https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html

Installation & configuration

ES installation package

Official website download address: https://www.elastic.co/downloads/elasticsearch

JAVA requirements

Java version: 1.8

Installation example on linux

1. Obtain the installation package

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz

2. Unzip to the installation directory

tar -xvf elasticsearch-6.2.4.tar.gz -C /opt

3. Configuration

4. Start

cd /opt/elasticsearch-6.2.4/bin

./elasticsearch

Possible failures when running on a linux virtual machine

1. Not enough memory. The default ES configuration uses a 1 GB heap. If the virtual machine you are learning on has less memory than that, reduce the heap size in config/jvm.options.

2. The following errors may be reported:

Solution:

Problem 1: max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]
Fix: switch to the root user and add the following two lines to limits.conf
Command: vi /etc/security/limits.conf
*        hard    nofile           65536
*        soft    nofile           65536

Problem 2: max number of threads [1024] for user [lish] likely too low, increase to at least [2048]
Fix: switch to the root user and edit the configuration file in the limits.d directory.
vi /etc/security/limits.d/90-nproc.conf
Change the following line:
* soft nproc 1024
# to
* soft nproc 2048

Problem 3: max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
Fix: switch to the root user and edit sysctl.conf
vi /etc/sysctl.conf
Add the following setting:
vm.max_map_count=655360

Then run:
sysctl -p

Switch back to the ES user.
Then restart elasticsearch; it should now start successfully.
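Before restarting, it can help to confirm that the new limits are actually in effect for the ES user. The sketch below is one possible check, assuming a Linux host; the names REQUIRED, find_low_limits and read_current_limits are hypothetical, and the minimums are the ones quoted in the three error messages above.

```python
# Minimums taken from the three error messages above.
REQUIRED = {
    "nofile": 65536,          # max file descriptors
    "nproc": 2048,            # max number of threads
    "max_map_count": 262144,  # vm.max_map_count
}


def find_low_limits(current):
    """Return the names of the known limits that are below the required minimum."""
    return [name for name, minimum in REQUIRED.items()
            if name in current and current[name] < minimum]


def read_current_limits():
    """Read the current soft limits; values that cannot be read are omitted."""
    current = {}
    try:
        import resource  # available on Unix-like systems only
        for name, kind in (("nofile", resource.RLIMIT_NOFILE),
                           ("nproc", resource.RLIMIT_NPROC)):
            soft, _hard = resource.getrlimit(kind)
            # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
            current[name] = float("inf") if soft == resource.RLIM_INFINITY else soft
    except (ImportError, AttributeError):
        pass
    try:
        with open("/proc/sys/vm/max_map_count") as handle:
            current["max_map_count"] = int(handle.read().strip())
    except (OSError, ValueError):
        pass
    return current


# The values reported in the error messages: all three are too low.
print(find_low_limits({"nofile": 4096, "nproc": 1024, "max_map_count": 65530}))
# The values of the current process (run this as the ES user).
print(find_low_limits(read_current_limits()))
```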

ES port description

1. 9200 http port for external services

2. 9300 tcp port for communication between nodes

Run ES in the background

./elasticsearch -d

Close ES

Not running in the background: ctrl + c

Running in the background: kill the ES process

Start in windows

elasticsearch.bat

ES software directory description

ES configuration instructions

Configuration file separation

yml format description

ES important configuration parameters

The data directory and log directory should be kept separate from the software directory in the production environment

The name of the cluster to which it belongs, the default is elasticsearch, which can be customized

Node name, the default is the first 7 characters of UUID, can be customized

network.host IP binding

http.port: 9200-9300  

transport.tcp.port: 9300-9400  
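When a port setting is given as a range, as in the two settings above, the node binds to the first available port in that range. The sketch below only illustrates that selection rule; first_free_port is a hypothetical helper and the set of busy ports is supplied by hand rather than probed from the system.

```python
def first_free_port(port_range: str, ports_in_use: set) -> int:
    """Return the first port in 'low-high' (inclusive) that is not in use."""
    low, high = (int(part) for part in port_range.split("-"))
    for port in range(low, high + 1):
        if port not in ports_in_use:
            return port
    raise RuntimeError(f"no free port in {port_range}")


# First node on a host gets the start of the range.
print(first_free_port("9200-9300", set()))         # 9200
# A third node on the same host gets the next free port.
print(first_free_port("9200-9300", {9200, 9201}))  # 9202
```

This is why a second node started on the same machine typically answers on 9201 for HTTP and 9301 for transport.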

Discovery Config node discovery configuration

The default node discovery mechanism in ES is zen (based on multicast and unicast). Two important parameters must be configured before production use, discovery.zen.ping.unicast.hosts and discovery.zen.minimum_master_nodes; other related settings are listed after them:

  • discovery.zen.ping.unicast.hosts: ["host1","host2:port","host3[portX-portY]"] In unicast mode, this sets the list of master-eligible nodes; a newly started node sends its request to join the cluster to the nodes in this list.
  • discovery.zen.minimum_master_nodes: 1 This parameter controls the minimum number of master-eligible nodes that a node must see before it can operate in the cluster. The officially recommended value is (N/2)+1, where N is the number of master-eligible nodes.
  • transport.tcp.compress: false Whether to compress the data sent over TCP between nodes; the default is false.
  • http.enabled: true Whether to provide external services over HTTP; the default is true. (http.cors.enabled is a different setting: it controls cross-origin requests and defaults to false.)
  • http.max_content_length: 100mb The maximum size of HTTP request content; the default is 100mb.
  • node.master: true Whether the node is eligible to become the master node; the default is true. By default the first node in the cluster becomes the master, and if it fails a new master is elected.
  • node.data: true Whether the node stores index data; the default is true.
  • discovery.zen.ping_timeout: 3s The ping timeout used when discovering other nodes in the cluster; the default is 3 seconds. On a poor network, increasing this value makes a node wait longer for responses, which reduces false failure detection to some extent.
  • discovery.zen.ping.multicast.enabled: false Whether to use multicast to discover nodes. This setting belongs to older releases; multicast discovery is not available in recent versions, so check the documentation for your version before using it.
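The recommended value (N/2)+1 for discovery.zen.minimum_master_nodes uses integer division. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum of master-eligible nodes: (N / 2) + 1 using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible nodes ->", minimum_master_nodes(n))
```

With three master-eligible nodes the value is 2, so two halves of a partitioned cluster cannot both elect a master. With only two nodes the value is also 2, which means losing either node leaves the cluster without a quorum.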

Jvm heap size setting

In the production environment, increase the JVM heap size in config/jvm.options (the -Xms and -Xmx settings).
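The heap is controlled by the -Xms (initial size) and -Xmx (maximum size) lines in jvm.options, which are usually set to the same value. As a rough sketch, the edit could be scripted as below; set_heap_size is a hypothetical helper and the sample text is a shortened stand-in for a real jvm.options file.

```python
import re


def set_heap_size(jvm_options_text: str, size: str) -> str:
    """Replace the values of the -Xms and -Xmx lines with the given size."""
    # Matches whole lines such as "-Xms1g" or "-Xmx512m"; other options are untouched.
    pattern = re.compile(r"^-(Xms|Xmx)\S+$", flags=re.MULTILINE)
    return pattern.sub(lambda match: f"-{match.group(1)}{size}", jvm_options_text)


sample = "## JVM configuration\n-Xms1g\n-Xmx1g\n-XX:+UseConcMarkSweepGC\n"
print(set_heap_size(sample, "4g"))
```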

JVM heap dump path setting

In the production environment, specify where the heap dump is written when an out-of-memory (OOM) error occurs, so that the problem can be analyzed. Configure it in jvm.options:

-XX:HeapDumpPath=/var/lib/elasticsearch

There are also important operating system configurations, please refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html

Kibana

Kibana is a visual management tool for ES.

Download the installation package

https://www.elastic.co/downloads/kibana

Installation

Unzip to the installation directory

Configuration

Configure the value of elasticsearch.url in config/kibana.yml as the access address of ES

Start up

./bin/kibana

Integrating IKAnalyzer

IKAnalyzer Chinese word segmentation integration

Get the ES-IKAnalyzer plug-in

Address: https://github.com/medcl/elasticsearch-analysis-ik/releases

Install plugin

Unzip the ik package to the plugins/ directory of the ES installation directory (it is better to change the name of the extracted directory to prevent conflicts with the same name when installing other plugins), and then restart ES.

Expanded thesaurus

Configuration file config/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
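	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Users can configure their own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Users can configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Users can configure a remote extension stop-word dictionary here -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>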
</properties>
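To double-check which dictionary files a configuration like the one above refers to, the entries can be read with a short script. This is only a sketch: list_dictionaries is a hypothetical helper, and CONFIG is a trimmed copy of the file with the remote entries left out.

```python
import xml.etree.ElementTree as ET

CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>
"""


def list_dictionaries(xml_bytes):
    """Return {entry key: [paths]} with the ';'-separated values split out."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for entry in root.findall("entry"):
        text = (entry.text or "").strip()
        result[entry.get("key")] = [path for path in text.split(";") if path]
    return result


print(list_dictionaries(CONFIG))
```

Each listed path can then be checked for existence relative to the plugin's config directory before restarting ES.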

Test IK

1. Create an index

curl -XPUT http://localhost:9200/index

2. Create a mapping

curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}'

3. Index some documents (the content is Chinese text, so that the IK analyzer and the search for 中国 in step 4 have something to match)

curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'

curl -XPOST http://localhost:9200/index/fulltext/2 -H 'Content-Type:application/json' -d'{"content":"公安部:各地校车将享最高路权"}'

curl -XPOST http://localhost:9200/index/fulltext/3 -H 'Content-Type:application/json' -d'{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}'

4. Try searching

curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type:application/json' -d'
{
  "query": { "match": { "content": "中国" }},
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": { "content": {} }
  }
}'
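The same search can be issued from a program instead of curl. The sketch below builds the request with the Python standard library and prints it without sending it; build_search_request is a hypothetical helper, and the URL, index name and type are the ones used in the steps above.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, field, text):
    """Build (but do not send) a POST _search request with a match query."""
    body = {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {field: {}},
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "http://localhost:9200", "index", "fulltext", "content", "中国")
print(request.get_method(), request.full_url)
# With a node running, the request could be sent with:
# urllib.request.urlopen(request)
```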

Origin blog.csdn.net/qq_34050399/article/details/112639226