Big Data Technology: ELK Real-Time Retrieval

1. Introduction to Elasticsearch

Elasticsearch is a high-performance, Lucene-based full-text search service: a distributed, RESTful search and data analysis engine that can also serve as a NoSQL database.

  • Extends Lucene
  • Prototype and production environments switch seamlessly
  • Scales horizontally
  • Supports structured and unstructured data

Elasticsearch extends Lucene with a richer query language, configurability, scalability, optimized query performance, and a comprehensive management interface.
The prototype environment and the production environment can be switched seamlessly: whether Elasticsearch runs on a single node or on a 300-node cluster, you talk to it in the same way. It scales horizontally to handle massive numbers of events per second, automatically managing how indexes and queries are distributed across the cluster. It supports numerical, textual, and geolocation data, i.e. both structured and unstructured data.
Lucene is an open-source full-text search engine toolkit from the Apache Software Foundation. It is a full-text search engine architecture providing a complete query engine and index engine, plus several text analysis engines. Its purpose is to give software developers an easy-to-use toolkit for adding full-text search to a target system, or a foundation on which to build a complete full-text search engine.
Lucene is the most advanced and powerful search library, but developing directly against it is complicated: the API is complex (even simple features require a lot of Java code) and demands a deep understanding of its internals (the various index structures). Elasticsearch builds on Lucene, hides that complexity, and exposes easy-to-use RESTful and Java APIs.
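As a taste of that simplicity, indexing and searching each take a single HTTP call; a minimal sketch with curl (assuming an ES node listening on localhost:9200):

# index one document
curl -X PUT "localhost:9200/movie_index/movie/1" -H 'Content-Type: application/json' -d '{"name": "operation red sea"}'
# full-text search for it
curl "localhost:9200/movie_index/movie/_search?q=name:red"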

2. Usage scenarios of Elasticsearch

It is used in scenarios such as log search and analysis, spatiotemporal retrieval, time series retrieval, and intelligent search.
Complex data types: when the data to query includes structured, semi-structured, and unstructured data, Elasticsearch can clean it, segment it into words, and build inverted indexes over all of these types, then provide full-text retrieval over them.
Diversified search conditions: full-text search conditions can include words or phrases.
Read-while-write: written data can be retrieved in near real time (by default Elasticsearch refreshes the index once per second).

Structured data: relational databases, etc. Semi-structured data: web pages, XML, etc. Unstructured data: logs, pictures, images, etc. The "series of operations" above refers to indexing.
Diversified search conditions may involve many fields; full-text searches (queries) can include words and phrases, or multiple forms of words or phrases, whether retrieving or querying.

Spatiotemporal retrieval, for spatiotemporal data:

Spatiotemporal data has both time and space dimensions; more than 80% of real-world data is related to geographic location.
Spatiotemporal big data carries three kinds of information (time, space, and thematic attributes) and is characteristically multi-source, massive, and rapidly updated.

Time series retrieval, for time series data:

Time series data are columns of data recorded in chronological order for the same metric. Every entry in a column must be measured on the same basis, so values are comparable; the series may be indexed by periods or by points in time. The purpose of time series analysis is to discover the statistical characteristics and regularities of the series in the sample, build a time series model, and forecast beyond the sample.
Examples: sensor data, temperature readings, etc.

3. Elasticsearch features


High performance / speed

Get search results instantly. Elasticsearch implements inverted indexes for full-text retrieval with finite state transducers, BKD trees for storing numeric and geolocation data, and a column store for analytics. Because every field is indexed by default, you never have to worry about some data being unsearchable.

ElasticSearch inverted index

  • Forward index: find the Value by the Key, i.e. start from the key and use it to locate the matching details. The traditional search method (forward index) finds keywords through the document number.
  • Inverted index: the ordering Elasticsearch uses is to find the Key by the Value. In full-text retrieval the Value is the keyword being searched, and documents are found through it: the keyword maps to document numbers, and the documents are then fetched by number, much like looking up a dictionary, or finding a page through a book's table of contents. See the illustration below.

Elasticsearch inverted index (figure)
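For example, with the sample movie documents used later in this article ("operation red sea" as document 1 and "incident red sea" as document 3), the inverted index maps each term to the documents containing it:

operation → [1]
red       → [1, 3]
sea       → [1, 3]
incident  → [3]

A search for "red" looks the term up once and immediately yields document numbers 1 and 3; the documents are then fetched by number.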

Scalability

It can run on a laptop, or on hundreds or thousands of servers hosting petabytes of data.
Prototype and production environments switch seamlessly; whether you're running on one node or a 300-node cluster, you'll be able to communicate with Elasticsearch in the same way.
It can scale horizontally to handle massive events per second, while automatically managing how indexes and queries are distributed across the cluster for extremely smooth operation.

Relevance

Search everything, and find exactly what you need. Sort results by signals ranging from word frequency or recency to popularity, and mix and match these with other features to fine-tune how results are presented to users. Elasticsearch is full-featured and tolerates human error, including complexities such as spelling mistakes.

Reliability/Resilience

Hardware fails. Networks partition. Elasticsearch detects these failures for you and keeps your cluster (and your data) safe and available. With cross-cluster replication, a secondary cluster can stand by as a hot backup, ready to take over at any time. Elasticsearch runs in distributed environments and was designed from the ground up for them, so you can always rest easy.

4. Elasticsearch ecosystem


ELK/ELKB provides a complete set of solutions, all open-source software. Used together, the components connect seamlessly and efficiently cover a wide range of applications.

Plug-in and extension layer:

  • Logstash: a pipeline with real-time data transport capability, focused on log-related processing; it carries data from the pipeline's input to its output, and filters can be flexibly inserted in between as needed. Logstash ships many powerful filters covering a wide range of scenarios (a minimal pipeline sketch follows this list).
  • Kibana: an open-source analysis and visualization platform whose data mainly comes from ES; building on ES's search and analysis capabilities, it produces the results needed for higher-level analysis and visualization. Developers and operations staff can easily perform advanced data analysis and visualize data in a variety of charts, tables, and maps.
  • Beats: a platform of lightweight data shippers that transmit data seamlessly to Logstash or ES; installed as lightweight agents (similar to how ambari or cdh manager install agents across a hadoop cluster), they ship data from hundreds or thousands of machines to Logstash or ES.
  • es-hadoop: a project that deeply integrates Hadoop and ES, officially maintained as an ES subproject; it moves data in both directions between Hadoop and ES, giving Hadoop data real-time search capability.
  • es-sql: operate ES with SQL, replacing the complex JSON queries that previously had to be written by hand. Two versions exist: the long-standing Chinese open-source nlpchina es-sql plugin, and the official es-sql support shipped with ES 6.3.0 in June 2018.
  • elasticsearch-head: a client tool specifically for Elasticsearch; a web-based cluster operation and management tool for simple point-and-click administration; a front-end project based on node.js.
  • Bigdesk: a cluster monitoring tool for Elasticsearch; it shows the various states of the ES cluster, such as CPU and memory usage, index data, search status, number of HTTP connections, etc.
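To show how these pieces connect, here is a minimal Logstash pipeline sketch (the port and hostname are illustrative) that receives events from Beats, parses log lines, and indexes them into ES:

input {
  beats { port => 5044 }                     # receive events shipped by a Beats agent
}
filter {
  grok {                                     # parse raw log lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch { hosts => ["houda:9200"] }  # index the parsed events into ES
}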

5. Comparison with other data stores

Capacity / expansion — redis: low; mysql: medium; elasticsearch: large; hbase: massive; hadoop/hive: massive
Query timeliness — redis: very high; mysql: medium; elasticsearch: high; hbase: medium; hadoop/hive: low
Query flexibility — redis: poor, key-value only; mysql: very good, supports SQL; elasticsearch: good, joins are weak but full-text search works and the DSL covers filtering, matching, sorting, aggregation, and more; hbase: poor, relies mainly on the rowkey, and scans perform badly unless secondary indexes are built; hadoop/hive: very good, supports SQL
Write speed — redis: extremely fast; mysql: medium; elasticsearch: fast; hbase: fast; hadoop/hive: slow
Consistency / transactions — redis: weak; mysql: strong; elasticsearch: weak; hbase: weak; hadoop/hive: weak

6. Installation of Elasticsearch

1. Unzip the es compressed package to the opt directory

[root@houda software]# tar -zxvf /software/elasticsearch-6.6.0.tar.gz -C /opt/

2. Add es related configuration

[root@houda opt]# vim /opt/elasticsearch-6.6.0/config/elasticsearch.yml
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
# Set the ES cluster name
cluster.name: my-es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
# Set the name of this node
node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
# Disable the bootstrap self-checks
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
# Change to this machine's IP or hostname
network.host: houda
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
# Discovery: hosts a new node contacts to join the cluster
discovery.zen.ping.unicast.hosts: ["houda", "houda02", "houda03"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes:
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#action.destructive_requires_name: true

3. Raise the system resource limits

[root@houda ~]# vim /etc/security/limits.conf

Add the following (note: the leading * cannot be omitted):

* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
    


4. Reduce the heap size used by ES to avoid exhausting the machine's memory

[root@houda ~]# vim /opt/elasticsearch-6.6.0/config/jvm.options

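The screenshot showed the heap flags in jvm.options; a minimal sketch of the change (the 512m value is illustrative — size it to your machine) is:

# jvm.options: shrink the default heap for a small test machine
-Xms512m
-Xmx512m

Keep -Xms and -Xmx equal so the heap is not resized at runtime.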

Raise the kernel's vm.max_map_count limit required by ES (the sysctl -w command applies it immediately; the /etc/sysctl.conf entry makes it persist across reboots):

[root@hd01 elasticsearch-6.6.0]# sysctl -w vm.max_map_count=262144

[root@hd01 elasticsearch-6.6.0]# vim /etc/sysctl.conf

vm.max_map_count=262144


5. Change the owner and group of the ES installation directory; ES cannot be started as the root user

[root@houda opt]# chown -R laohou:laohou /opt/elasticsearch-6.6.0/

6. Switch users and start the es service

[root@houda opt]# su laohou
# Start in the foreground
[laohou@houda opt]$ /opt/elasticsearch-6.6.0/bin/elasticsearch
# Start in the background
[laohou@houda opt]$ /opt/elasticsearch-6.6.0/bin/elasticsearch -d

7. Verify that ES is running normally

[root@houda ~]# curl http://houda:9200/_cat/nodes?v
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.8.120           36          44  20    1.22    0.41     0.35 mdi       *      node-1

7. Kibana installation

# Extract Kibana to /opt
[root@houda opt]# tar -zxvf /software/kibana-6.6.0-linux-x86_64.tar.gz -C /opt/
# Enter /opt
[root@houda opt]# cd /opt/
# Rename the Kibana directory
[root@houda opt]# mv kibana-6.6.0-linux-x86_64/ kibana-6.6.0
# Enter the config directory
[root@houda kibana-6.6.0]# cd /opt/kibana-6.6.0/config/
# Edit the parameters
[root@houda config]# vim kibana.yml

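The screenshot showed the edited kibana.yml; a minimal sketch for this setup (values follow the hostname used above) is:

# kibana.yml
server.host: "houda"                     # address Kibana listens on
elasticsearch.url: "http://houda:9200"   # ES endpoint (newer 6.x/7.x releases use elasticsearch.hosts)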

# Change Kibana's owner and group
[root@houda opt]# chown -R laohou:laohou /opt/kibana-6.6.0/
# Start Kibana in the foreground (start ES first)
[laohou@houda opt]$ /opt/kibana-6.6.0/bin/kibana
# Start in the background
[laohou@houda opt]$ nohup /opt/kibana-6.6.0/bin/kibana >/dev/null 2>&1 &

The Kibana UI includes the following sections:

  • Discover: data search view

  • Visualize: chart creation

  • Dashboard: dashboard creation

  • Timelion: advanced visual analysis of time series data

  • DevTools: Developer Tools

  • Management: kibana related configuration

8. Basic concepts of Elasticsearch

cluster — Elasticsearch runs in cluster mode by default; the cluster as a whole holds one complete, mutually replicated set of the data.
node — A node in the cluster; generally one process is one node.
shard — A shard. Even the data on a single node is hash-partitioned into multiple shards for storage; the default is 5 shards (changed to 1 in 7.0).
index — Equivalent to a database in an RDBMS (in 5.x): a logical database for the user, physically stored across multiple shards and possibly multiple nodes. In 6.x and 7.x an index is effectively equivalent to a table.
type — Similar to an RDBMS table, though really more like a class in object-oriented terms: a collection of documents sharing the same JSON format. (6.x allows only one type per index, and 7.0 drops types, which is why the index now sits at table level.)
document — Similar to a row in an RDBMS, or an object in object-oriented terms.
field — Equivalent to a column or attribute.

GET /_cat/nodes?v — check the status of each node
GET /_cat/indices?v — check the status of each index
GET /_cat/shards/xxxx — check the shard layout of a given index

9. Elasticsearch RESTful API (DSL)

DSL stands for Domain Specific Language: a language specialized for a particular domain.

1. The data structure saved in es

public class Movie {
    String id;
    String name;
    Double doubanScore;
    List<Actor> actorList;
}

public class Actor {
    String id;
    String name;
}

If these two objects were stored in a relational database, they would be split into two tables, but Elasticsearch represents the whole thing as one JSON document.
Stored in ES, it looks like this:

{
  "id": "1",
  "name": "operation red sea",
  "doubanScore": "8.5",
  "actorList": [
    {"id": "1", "name": "zhangyi"},
    {"id": "2", "name": "haiqing"},
    {"id": "3", "name": "zhanghanyu"}
  ]
}

2. Operations on data

2.1 Check which indexes are in es

GET /_cat/indices?v
  • There will be an index named .kibana by default in es

The meaning of the header columns:

health — green (cluster complete), yellow (single node normal, cluster incomplete), red (a node is abnormal)
status — whether the index can be used
index — index name
uuid — unique index id
pri — number of primary shards
rep — number of replicas
docs.count — number of documents
docs.deleted — number of deleted documents
store.size — total storage used
pri.store.size — storage used by the primary shards

2.2 Add an index

PUT /movie_index

2.3 Delete an index

Internally, ES never deletes or modifies data in place; changes are written as new versions with an incremented version number.

DELETE /movie_index

2.4 Create documents

Format: PUT /index/type/id

PUT /movie_index/movie/1
{
  "id": 1,
  "name": "operation red sea",
  "doubanScore": 8.5,
  "actorList": [
    {"id": 1, "name": "zhang yi"},
    {"id": 2, "name": "hai qing"},
    {"id": 3, "name": "zhang han yu"}
  ]
}

PUT /movie_index/movie/2
{
  "id": 2,
  "name": "operation meigong river",
  "doubanScore": 8.0,
  "actorList": [
    {"id": 3, "name": "zhang han yu"}
  ]
}

PUT /movie_index/movie/3
{
  "id": 3,
  "name": "incident red sea",
  "doubanScore": 5.0,
  "actorList": [
    {"id": 4, "name": "zhang chen"}
  ]
}
  • If the index or type does not already exist, ES creates it automatically.

2.5 Find directly by id

GET movie_index/movie/1

2.6 Modify: replace the whole document

The request is the same as creating the document: it must contain all fields, because the existing document is replaced wholesale.

PUT /movie_index/movie/3
{
  "id": "3",
  "name": "incident red sea",
  "doubanScore": "5.0",
  "actorList": [
    {"id": "1", "name": "zhang chen"}
  ]
}

2.7 Modify a single field

POST movie_index/movie/3/_update
{
  "doc": {
    "doubanScore": "7.0"
  }
}

2.8 Delete a document

DELETE movie_index/movie/3

2.9 Search all documents of a type

GET movie_index/movie/_search

Result:

{
  "took": 2,            // time taken, in milliseconds
  "timed_out": false,   // whether the request timed out
  "_shards": {
    "total": 5,         // sent to all 5 shards
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,         // 3 documents matched
    "max_score": 1,     // highest score
    "hits": [           // the results
      {
        "_index": "movie_index",
        "_type": "movie",
        "_id": 2,
        "_score": 1,
        "_source": {
          "id": "2",
          "name": "operation meigong river",
          "doubanScore": 8.0,
          "actorList": [
            {
              "id": "1",
              "name": "zhang han yu"
            }
          ]
        }
      },
      ...   // remaining hits omitted
    ]
  }
}

2.10 Query by condition (all)

GET movie_index/movie/_search
{
  "query": {
    "match_all": {}
  }
}

2.11 word segmentation query

GET movie_index/movie/_search
{
  "query": {
    "match": { "name": "red" }
  }
}

2.12 Query by word segmentation sub-attribute

GET movie_index/movie/_search
{
  "query": {
    "match": { "actorList.name": "zhang" }
  }
}

2.13 match phrase

GET movie_index/movie/_search
{
  "query": {
    "match_phrase": { "name": "operation red" }
  }
}

Phrase queries do not apply word segmentation to the query; the whole phrase is matched directly against the original data.

2.14 fuzzy queries

GET movie_index/movie/_search
{
  "query": {
    "fuzzy": { "name": "rad" }
  }
}

Fuzzy matching tolerates imprecise spelling: when a word has no exact match, ES still assigns a score to very similar words via an edit-distance algorithm, so they can be found, at the cost of extra performance.

2.15 Filtering – post-query filtering

GET movie_index/movie/_search
{
  "query": {
    "match": { "name": "red" }
  },
  "post_filter": {
    "term": { "actorList.id": 3 }
  }
}

2.16 Filtering – pre-query filtering (recommended)

GET movie_index/movie/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "actorList.id": "1" } },
        { "term": { "actorList.id": "3" } }
      ],
      "must": {
        "match": { "name": "red" }
      }
    }
  }
}

2.17 Filter – filter by range

GET movie_index/movie/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "doubanScore": { "gte": 8 }
        }
      }
    }
  }
}

  • Range operators:
    gt — greater than
    lt — less than
    gte — greater than or equal to
    lte — less than or equal to

2.18 Sorting

GET movie_index/movie/_search
{
  "query": {
    "match": { "name": "red sea" }
  },
  "sort": [
    {
      "doubanScore": { "order": "desc" }
    }
  ]
}

2.19 Pagination query

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "from": 1,
  "size": 1
}

from is the offset of the first result to return (counting from 0) and size is the page size, so page n with page size s uses from = (n - 1) * s.

2.20 Specify the fields of the query

GET movie_index/movie/_search
{
  "query": { "match_all": {} },
  "_source": ["name", "doubanScore"]
}

2.21 Highlight

GET movie_index/movie/_search
{
  "query": {
    "match": { "name": "red sea" }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}

2.22 Aggregation

Count how many movies each actor has appeared in:

GET movie_index/movie/_search
{
  "aggs": {
    "groupby_actor": {
      "terms": {
        "field": "actorList.name.keyword"
      }
    }
  }
}

What is the average score of each actor's movies, sorted by score:

GET movie_index/movie/_search
{
  "aggs": {
    "groupby_actor_id": {
      "terms": {
        "field": "actorList.name.keyword",
        "order": {
          "avg_score": "desc"
        }
      },
      "aggs": {
        "avg_score": {
          "avg": {
            "field": "doubanScore"
          }
        }
      }
    }
  }
}
  • Why append the .keyword suffix when aggregating? (see the sketch below)
    .keyword is a sub-field that stores an unanalyzed (non-word-segmented) copy of the string. Some operations only accept the unanalyzed form, such as aggregations (aggs) and term filters, so the .keyword suffix is appended to the field.
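A quick sketch of the difference, assuming the default dynamic mapping that indexed name as text with a name.keyword sub-field: a term query on the analyzed field finds nothing, because the index stores the separate terms "operation", "red", "sea" rather than the whole string, while the keyword sub-field matches the exact, unanalyzed value:

GET movie_index/movie/_search
{
  "query": {
    "term": { "name.keyword": "operation red sea" }
  }
}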

3. Chinese word segmentation

Elasticsearch's built-in Chinese tokenization simply splits Chinese text into individual characters, with no concept of words at all. In practice, however, users match queries against words; if documents are segmented into word units, they match user queries more closely and queries run faster.
Tokenizer download URL: https://github.com/medcl/elasticsearch-analysis-ik

3.1 Installation

  • Unzip the downloaded zip package into .../elasticsearch/plugins/ik, then restart ES.

3.2 Test use

Using the default analyzer:

GET movie_index/_analyze
{
  "text": "我是中国人"
}

Observe the result. Now using the ik_smart tokenizer:

GET movie_index/_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}

And the other tokenizer, ik_max_word:

GET movie_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}

Comparing the results, different tokenizers segment the text very differently. Therefore, when defining a type later, you can no longer rely on the default mapping: you must create the mapping manually, because you need to choose a tokenizer.

3.3 Custom Thesaurus

Modify IKAnalyzer.cfg.xml in /opt/elasticsearch-6.6.0/plugins/ik/config/

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- configure your own extension dictionary here -->
        <entry key="ext_dict"></entry>
        <!-- configure your own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://192.168.67.163/fenci/myword.txt</entry>
        <!-- configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Use nginx to publish the dictionary as a static resource at the remote_ext_dict URL configured above. In nginx.conf:

server {
    listen       80;
    server_name  192.168.67.163;
    location /fenci/ {
        root es;
    }
}

Then create the es/fenci/ directory under /usr/local/nginx/, add myword.txt in that directory, and write your keywords into myword.txt, one word per line.
Restart the ES server and nginx, and test the segmentation effect in Kibana.
After the update, ES applies the new words only to newly written data; historical data is not re-segmented. To re-segment historical data, run:

POST movies_index/_update_by_query?conflicts=proceed

4. About mapping

We said earlier that a type can be understood as a table, so how is the data type of each field defined?

4.1 View mapping

GET movie_index/_mapping/movie

In fact, the data type of each field in a type is defined by its mapping.
If no mapping is set, the system automatically infers a suitable data type from the format of the first piece of data:

  • true/false → boolean
  • 1020 → long
  • 20.1 → double
  • "2018-02-01" → date
  • "hello world" → text + keyword

By default only text fields are word-segmented (analyzed); keyword is a string type that is stored without analysis.
Besides this automatic inference, mappings can also be defined manually, but only for newly added fields that hold no data yet; once a field contains data, its mapping can no longer be modified.
Note: although each field's data lives under different types, fields with the same name may have only one mapping definition within an index.
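For example, a new field can be mapped manually before any data is written to it (6.x syntax; the release_date field here is purely illustrative):

PUT movie_index/_mapping/movie
{
  "properties": {
    "release_date": { "type": "date" }
  }
}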

4.2 Build an index based on Chinese word segmentation

Create the mapping:

PUT movie_chn
{
  "mappings": {
    "movie": {
      "properties": {
        "id": { "type": "long" },
        "name": { "type": "text", "analyzer": "ik_smart" },
        "doubanScore": { "type": "double" },
        "actorList": {
          "properties": {
            "id": { "type": "long" },
            "name": { "type": "keyword" }
          }
        }
      }
    }
  }
}

Insert data:

PUT /movie_chn/movie/1
{
  "id": 1,
  "name": "红海行动",
  "doubanScore": 8.5,
  "actorList": [
    {"id": 1, "name": "张译"},
    {"id": 2, "name": "海清"},
    {"id": 3, "name": "张涵予"}
  ]
}

PUT /movie_chn/movie/2
{
  "id": 2,
  "name": "湄公河行动",
  "doubanScore": 8.0,
  "actorList": [
    {"id": 3, "name": "张涵予"}
  ]
}

PUT /movie_chn/movie/3
{
  "id": 3,
  "name": "红海事件",
  "doubanScore": 5.0,
  "actorList": [
    {"id": 4, "name": "张晨"}
  ]
}

Query test:

GET /movie_chn/movie/_search
{
  "query": {
    "match": { "name": "红海战役" }
  }
}

GET /movie_chn/movie/_search
{
  "query": {
    "term": { "actorList.name": "张译" }
  }
}

5. Index aliases (_aliases)

An index alias is like a shortcut or soft link, which can point to one or more indexes, and can also be used by any API that requires an index name. Aliases give us a great deal of flexibility, allowing us to do things like:

1. Group multiple indexes (for example, last_three_months)
2. Create views on a subset of indexes
3. Seamlessly switch from one index to another in a running cluster

5.1 Create an index alias

# Declare the alias directly when creating the index
PUT movie_chn_2020
{
  "aliases": {
    "movie_chn_2020-query": {}
  },
  "mappings": {
    "movie": {
      "properties": {
        "id": { "type": "long" },
        "name": { "type": "text", "analyzer": "ik_smart" },
        "doubanScore": { "type": "double" },
        "actorList": {
          "properties": {
            "id": { "type": "long" },
            "name": { "type": "keyword" }
          }
        }
      }
    }
  }
}

# Add an alias to an existing index
POST _aliases
{
  "actions": [
    { "add": { "index": "movie_chn_xxxx", "alias": "movie_chn_2020-query" } }
  ]
}

# A filter can also narrow the query scope, creating a subset view
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movie_chn_xxxx",
        "alias": "movie_chn2019-query-zhhy",
        "filter": {
          "term": { "actorList.id": "3" }
        }
      }
    }
  ]
}

5.2 There is no difference between querying an alias and using a normal index

GET movie_chn_2020-query/_search

5.3 Delete the alias of an index

POST _aliases
{
  "actions": [
    { "remove": { "index": "movie_chn_xxxx", "alias": "movie_chn_2020-query" } }
  ]
}

5.4 Seamless switching for an alias

POST /_aliases
{
  "actions": [
    { "remove": { "index": "movie_chn_xxxx", "alias": "movie_chn_2020-query" } },
    { "add":    { "index": "movie_chn_yyyy", "alias": "movie_chn_2020-query" } }
  ]
}

5.5 Query alias list

GET  _cat/aliases?v

6. Index templates

An index template is, as the name suggests, a mold for creating indexes. It defines a set of rules that build the mappings and settings of indexes matching a particular business need. With index templates, our indexes gain predictable consistency.

6.1 Common scenario: splitting indexes

Splitting an index means dividing one business index into multiple indexes by time interval, for example turning order_info into order_info_20200101, order_info_20200102, and so on.
This has two advantages:
1. Flexibility for structure changes: Elasticsearch does not allow modifying an existing data structure, yet the structure and settings of an index inevitably change during real use. With split indexes, only the index for the next interval needs to change, while existing indexes stay as they are.
2. Query-range optimization: queries rarely cover the entire time range, so splitting the index physically reduces the amount of data scanned, which is also a performance optimization.

6.2 Create a template

PUT _template/template_movie2020
{
  "index_patterns": ["movie_test*"],
  "settings": {
    "number_of_shards": 1
  },
  "aliases": {
    "{index}-query": {},
    "movie_test-query": {}
  },
  "mappings": {
    "_doc": {
      "properties": {
        "id": { "type": "keyword" },
        "movie_name": { "type": "text", "analyzer": "ik_smart" }
      }
    }
  }
}

"index_patterns": ["movie_test*"] means that when data is written to an index whose name starts with movie_test and that index does not yet exist, ES automatically creates it from this template.
{index} inside "aliases" resolves to the name of the index actually created.
Test:

POST movie_test_2020xxxx/_doc
{
  "id": "333",
  "name": "zhang3"
}

6.3 View the list of existing templates in the system

GET  _cat/templates

6.4 Check the details of a template

GET  _template/template_movie2020
or
GET  _template/template_movie*

10. Elasticsearch processes

Elasticsearch indexing process

After a client sends an index request to any node, that node determines which shard the document belongs to and forwards the request to the node holding that shard's primary. The primary executes the request and then sends it to the replica nodes in parallel; once all replicas succeed, a success message is returned to the user.


  • R stands for replica shard; P stands for primary shard.
  • Phase 1: The client sends an index request to any node, assume Node 1.
  • Phase 2: Node 1 determines from the request that the document belongs to shard 0, so it forwards the request to Node 3, which holds shard 0's primary P0.
  • Phase 3: Node 3 executes the request on primary shard P0. If it succeeds, Node 3 sends the request in parallel to shard 0's replica shards R0 on Node 1 and Node 2. When every replica replies with a success confirmation, Node 3 returns a success message to the user.

ElasticSearch batch indexing process

After a client sends a batch index request to any node, that node forwards sub-requests to the relevant primary shard nodes. Each primary shard applies its operations in order, forwarding each completed operation to its replicas before moving on. When the replicas finish, they report back to the primary, the primary reports to the requesting node, and the requesting node replies to the client.


  • Phase 1: The client sends a bulk request to Node 1 (a sketch of such a request follows this list).
  • Phase 2: Node 1 builds a batch request for each shard and forwards them to the primary shards those requests need.
  • Phase 3: Each primary shard executes its operations sequentially, one after another. As each operation completes, the primary forwards the new document (or the deletion) to its replica nodes, then moves on to the next operation. Once the replicas report that all operations are complete, the requesting node assembles the response and returns it to the client.
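For reference, a bulk request in 6.x interleaves one action line with one optional source line per operation; the new document below is invented purely for illustration:

POST /_bulk
{ "index": { "_index": "movie_index", "_type": "movie", "_id": "4" } }
{ "id": 4, "name": "red sea 2", "doubanScore": 7.0, "actorList": [] }
{ "update": { "_index": "movie_index", "_type": "movie", "_id": "3" } }
{ "doc": { "doubanScore": "7.0" } }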

ElasticSearch search process

The node that receives a search request forwards it to the shards holding the relevant data; each shard node reads its data and returns it to the coordinating node, which merges the results and returns them to the client.


  • The search runs in two phases: the query phase locates the data to retrieve, and the fetch phase reads the located documents and returns them to the client.
    Query phase:
  • Phase 1: The client sends a search request to any node, assume Node 3.
  • Phase 2: Node 3 sends the request to every shard of the index; for each shard, one copy is chosen by round-robin from the primary and its replicas to balance the read load. Each shard executes the search locally and sorts its local results.
  • Phase 3: Each shard returns its locally matched records to Node 3, which merges them and performs a global sort.
    Fetch phase:
  • Phase 1: Once Node 3 knows where all the data to retrieve lives, it sends requests to the shards holding that data.
  • Phase 2: Each shard that receives a request reads the contents of the relevant documents and returns them to Node 3.
  • Phase 3: After all shards have replied, Node 3 merges the documents into a final result and returns it to the client.

ElasticSearch batch search process

The client sends a batch retrieval request to any node; that node builds a per-shard retrieval request and forwards them to the primary or replica shards. When all requests have completed, the requesting node collects the records and returns them to the client.


  • Phase 1: The client sends an mget request to Node 1 (a sample request follows below).
  • Phase 2: Node 1 builds a data retrieval request per shard and forwards them to the primary or replica shards those requests need. When all replies have been received, Node 1 assembles the response and returns it to the client.
  • Other Elasticsearch features: single-node multi-instance deployment, automatic cross-node replica distribution, routing algorithms, balancing algorithms.
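A minimal mget request against the movie_index built earlier might look like this:

GET /_mget
{
  "docs": [
    { "_index": "movie_index", "_type": "movie", "_id": "1" },
    { "_index": "movie_index", "_type": "movie", "_id": "2" }
  ]
}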

11. Elasticsearch Java API

1. Dependency files

<!-- ElasticSearch -->
<!-- https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>6.6.0</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>6.6.0</version>
</dependency>
<!-- Avoids missing-method errors when ES runs -->
<dependency>
    <groupId>org.locationtech.spatial4j</groupId>
    <artifactId>spatial4j</artifactId>
    <version>0.6</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.13</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.10.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api -->
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.10.0</version>
</dependency>

2. CRUD operations (create, read, update, delete) against ES

package utils;

import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.IndicesAdminClient;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.QueryStringQueryBuilder;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class ElasticSearchUtil {

    private Settings settings = Settings.builder()
            .put("cluster.name", "my-es")
            .put("client.transport.sniff", false)
            .build();
    private TransportClient client;

    /**
     * Get the ES client.
     */
    public TransportClient getClient() {
        if (client == null) {
            synchronized (TransportClient.class) {
                try {
                    // Note: the TransportClient talks to the transport port (9300 by default),
                    // not the HTTP port 9200
                    client = new PreBuiltTransportClient(settings)
                            .addTransportAddress(
                                    new TransportAddress(InetAddress.getByName("houda"), 9300));
                } catch (UnknownHostException e) {
                    e.printStackTrace();
                }
            }
        }
        return client;
    }

    /**
     * Get the IndicesAdminClient for index administration.
     */
    public IndicesAdminClient getAdminClient() {
        return getClient().admin().indices();
    }

    /**
     * Check whether an index exists.
     */
    public boolean isExistsIndex(String indexName) {
        IndicesExistsResponse response = getAdminClient().prepareExists(indexName).get();
        return response.isExists();
    }

    /**
     * Create an index.
     */
    public boolean createIndex(String indexName) {
        CreateIndexResponse response = getAdminClient().prepareCreate(indexName.toLowerCase()).get();
        return response.isAcknowledged();
    }

    /**
     * Delete an index.
     */
    public boolean deleteIndex(String indexName) {
        AcknowledgedResponse response = getAdminClient().prepareDelete(indexName).execute().actionGet();
        return response.isAcknowledged();
    }

    /**
     * Set the mapping of a type.
     */
    public void setMapping(String indexName, String typeName, String mapping) {
        getAdminClient().preparePutMapping(indexName).setType(typeName).setSource(mapping, XContentType.JSON).get();
    }

    /**
     * Add a document to an index/type.
     */
    public long addDocument(String indexName, String typeName, String id, Map<String, Object> document) throws IOException {
        IndexRequestBuilder builder = getClient().prepareIndex(indexName, typeName, id);
        Set<Map.Entry<String, Object>> entries = document.entrySet();
        XContentBuilder xContentBuilder = jsonBuilder().startObject();
        for (Map.Entry<String, Object> entry : entries) {
            xContentBuilder = xContentBuilder.field(entry.getKey(), entry.getValue());
        }
        IndexResponse indexResponse = builder.setSource(xContentBuilder.endObject()).get();
        return indexResponse.getVersion();
    }

    /**
     * Query: full-text search with a query string.
     */
    public List<Map<String, Object>> queryStringQuery(String text) {
        QueryStringQueryBuilder match = QueryBuilders.queryStringQuery(text);
        SearchRequestBuilder search = getClient().prepareSearch().setQuery(match);
        SearchResponse response = search.get();
        // the matched documents
        SearchHits hits = response.getHits();
        // number of hits
        long totalHits = hits.getTotalHits();
        SearchHit[] searchHits = hits.getHits();
        ArrayList<Map<String, Object>> maps = new ArrayList<>();
        for (SearchHit hit : searchHits) {
            // document metadata
            String index = hit.getIndex();
            Map<String, Object> sourceAsMap = hit.getSourceAsMap();
            maps.add(sourceAsMap);
        }
        return maps;
    }

    public static void main(String[] args) {
        ElasticSearchUtil util = new ElasticSearchUtil();
        IndicesAdminClient adminClient = util.getAdminClient();
        System.out.println(adminClient);
//        util.createIndex("houda");
//        System.out.println(util.getAdminClient());
//        System.out.println(util.isExistsIndex("houda"));
    }
}

Origin: blog.csdn.net/weixin_38620636/article/details/130435222