Elasticsearch distributed search

Introduction to Elasticsearch

1 Elasticsearch Background

1.1 How to retrieve data at large scale

When a system holds 1 billion or 10 billion records, we usually approach the architecture from the following angles:
1) Which database to use (MySQL, Oracle, MongoDB, HBase, ...);
2) How to eliminate single points of failure (LVS, F5, A10, ZooKeeper, MQ);
3) How to keep data safe (hot standby, cold standby, multi-site active-active deployment);
4) How to solve the retrieval problem (database proxy middleware: mysql-proxy, Cobar, MaxScale, etc.);
5) How to solve the statistical analysis problem (offline, near real-time).

1.2 Traditional database solutions

For relational databases, we usually use an architecture along the following lines to relieve query bottlenecks and write bottlenecks.
Key points:
1) primary-replica (master-slave) backup addresses data safety;
2) database proxy middleware with heartbeat monitoring removes the single point of failure;
3) the proxy middleware distributes query statements to the slave nodes and aggregates the results.

1.3 Non-relational database solutions

For NoSQL databases, take MongoDB as an example; others work on similar principles.
Key points:
1) replica backups ensure data safety;
2) a node election mechanism removes the single point of failure;
3) for retrieval, the shard information is first looked up in the config servers, the request is then dispatched to each shard node, and the router node finally merges the aggregated results.

1.4 In-memory database solutions

Keeping all data in memory is unreliable and, in practice, unrealistic. Once the data reaches PB scale, assuming 96 GB of memory per node and memory completely filled with data, the number of machines needed is: 1 PB = 1024 TB = 1,048,576 GB
nodes = 1,048,576 / 96 ≈ 10,923 (10,922.67 rounded up)
In fact, once data backups are taken into account, the number of nodes is often around 25,000. The huge cost makes this approach unrealistic.
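
The arithmetic above can be reproduced with a short Python sketch. The 96 GB per node figure comes from the text; the `copies` parameter and the choice of two copies are assumptions added only to show how backups multiply the node count.

```python
import math

GB_PER_TB = 1024
TB_PER_PB = 1024
MEMORY_PER_NODE_GB = 96  # per-node memory figure taken from the text


def nodes_needed(data_pb: float, memory_per_node_gb: int, copies: int = 1) -> int:
    """Nodes required to hold `copies` full copies of the data entirely in RAM."""
    total_gb = data_pb * TB_PER_PB * GB_PER_TB * copies
    # Round up: a partially filled node is still a whole machine.
    return math.ceil(total_gb / memory_per_node_gb)


single_copy = nodes_needed(1, MEMORY_PER_NODE_GB)
# Assumption for illustration: one extra backup copy of every byte.
two_copies = nodes_needed(1, MEMORY_PER_NODE_GB, copies=2)

print(single_copy)  # 10923
print(two_copies)   # 21846
```

Even with a single backup copy the count passes twenty thousand machines, which is the cost argument the section makes.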

So whether or not the data is kept in memory, neither choice alone solves the problem.
Putting everything in memory solves the speed problem, but the cost goes up.
To address these problems we go back to the source, and usually look for solutions in the following directions:
1. store the data in sorted order;
2. separate the data from the index;
3. compress the data.
This leads us to Elasticsearch.

2 Introduction to Elasticsearch

### 2.1 What is Elasticsearch

Elasticsearch is a Lucene-based distributed search and analysis engine.

ES is short for Elasticsearch, an open-source, highly scalable, distributed full-text search engine. It stores and retrieves data in near real time and scales well: it can be extended to hundreds of servers and handle PB-level data.
Elasticsearch is developed in Java and was originally released under the Apache open-source license. It is a popular enterprise search engine, designed for the cloud, that offers real-time search and is stable, reliable, fast, and easy to install.

It uses Lucene as its core for all indexing and search functionality, but aims to hide Lucene's complexity behind a simple RESTful API, making full-text search easy.

Design goals: distributed full-text search, data indexed as JSON over HTTP, and speed.

2.2 Relationship between Lucene and Elasticsearch

1) Lucene is just a library. To use it, you must develop in Java and integrate it directly into your application. Worse, Lucene is very complex: you need a solid grounding in information retrieval to understand how it works.

2) Elasticsearch is also developed in Java and uses Lucene as its core for all indexing and search functionality, but it hides Lucene's complexity behind a simple RESTful API, which makes full-text search much easier.

2.3 Elasticsearch vs Solr

1) Solr is the open-source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and handling of rich documents (such as Word and PDF).

2) Solr is highly scalable and provides distributed search and index replication. Solr is a very popular enterprise search engine, and Solr 4 also added NoSQL support.

3) Solr is a standalone full-text search server written in Java that runs in a Servlet container (such as Apache Tomcat or Jetty). It uses the Lucene Java search library as the core of its full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs.

4) Solr's powerful external configuration means it can be adapted to many kinds of applications without Java coding. Solr also has a plugin architecture to support more advanced customization.

Summary of the Elasticsearch and Solr comparison

  1. Both are simple to install.

  2. Solr relies on ZooKeeper for distributed management, while Elasticsearch has distributed coordination built in.

  3. Solr supports more data formats, while Elasticsearch only supports JSON.

  4. Solr officially provides more features, while Elasticsearch focuses on core functionality and leaves most advanced features to third-party plugins.

  5. In traditional search applications Solr performs better than Elasticsearch, but in real-time search applications its efficiency is markedly lower than Elasticsearch's.

  6. Solr is a powerful solution for traditional search applications, while Elasticsearch is better suited to emerging real-time search applications.

 

2.4 Elasticsearch core concepts

#### 2.4.1 Cluster
ES can run as a standalone search server. However, to handle large data sets and to provide fault tolerance and high availability, ES can run on many cooperating servers. The collection of these servers is called a cluster.

#### 2.4.2 Node
Each server that forms part of the cluster is called a node.

#### 2.4.3 Shard
When there are a large number of documents, a single node may not be enough because of memory limits, insufficient disk capacity, or the inability to respond to client requests quickly enough. In that case the data can be divided into smaller pieces called shards, and each shard is placed on a different server.
When the index you query is spread over multiple shards, ES sends the query to each relevant shard and merges the results; the application is unaware that shards exist. In other words, the process is transparent to the user.
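
The scatter-and-merge behaviour described above can be illustrated with a toy Python sketch. It is not how Elasticsearch is implemented: `zlib.crc32` stands in for Elasticsearch's own routing hash, and the shard count, document ids, and substring matching are all assumptions chosen for brevity.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard for a document id.

    crc32 is only a stand-in hash; Elasticsearch uses its own routing function.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_docs(docs: dict) -> list:
    """Distribute documents over the shards according to their id."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search(shards: list, keyword: str) -> list:
    """Send the query to every shard (scatter) and merge the hits (gather)."""
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if keyword in text)
    return sorted(hits)


docs = {
    "1": "distributed search engine",
    "2": "relational database",
    "3": "search at scale",
    "4": "log analysis",
}
shards = index_docs(docs)
result = search(shards, "search")
print(result)  # ['1', '3']
```

The caller only sees the merged list, which is the sense in which sharding is transparent to the application.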

#### 2.4.4 Replica

To improve query throughput or achieve high availability, you can use shard replicas.
A replica is an exact copy of a shard, and each shard can have zero or more replicas. ES can therefore hold many copies of the same shard; one of them is chosen to handle index-changing operations, and this special copy is called the primary shard.
When the primary shard is lost, for example because the node holding it becomes unavailable, the cluster promotes a replica to be the new primary shard.
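
As a rough illustration of the promotion step, here is a minimal Python sketch. The `ShardCopy` class, the node names, and the "promote the first surviving replica" policy are all hypothetical simplifications; the real cluster applies its own rules when choosing the new primary.

```python
from dataclasses import dataclass


@dataclass
class ShardCopy:
    node: str      # name of the node holding this copy (hypothetical names)
    primary: bool  # True for the primary shard, False for a replica


def handle_node_failure(copies: list, failed_node: str) -> list:
    """Drop copies on the failed node; promote a replica if the primary was lost."""
    survivors = [c for c in copies if c.node != failed_node]
    if survivors and not any(c.primary for c in survivors):
        # Simplified policy: promote the first surviving replica.
        survivors[0].primary = True
    return survivors


copies = [
    ShardCopy("node-a", True),
    ShardCopy("node-b", False),
    ShardCopy("node-c", False),
]
after = handle_node_failure(copies, "node-a")
print([(c.node, c.primary) for c in after])
# [('node-b', True), ('node-c', False)]
```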

#### 2.4.5 Full-text search
Full-text search means indexing the content of an article so that it can be searched by keyword, similar to the LIKE statement in MySQL.
A full-text index splits the content into words (tokens) according to their meaning and then builds an index over those tokens. For example, "Today is Sunday, we are going out to play" might be split into tokens such as "today", "Sunday", "we", and "go out to play", so that a search for "Sunday" or "go out to play" finds the sentence.
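
The idea of tokenizing text and indexing the tokens can be sketched as a tiny inverted index in Python. This is an illustration only: the regular-expression tokenizer splits on non-alphanumeric characters and knows nothing about word meaning, whereas Lucene's analyzers are far more elaborate. The second sample document is made up.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict) -> dict:
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Today is Sunday, we are going out to play",
    2: "Monday is a working day",
}
index = build_inverted_index(docs)
print(search(index, "Sunday"))    # {1}
print(search(index, "is"))        # {1, 2}
print(search(index, "out play"))  # {1}
```

Looking a token up in the index is a dictionary access, which is why this structure answers keyword queries without scanning every document.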

2.5 Comparison with the relational database MySQL


1) A database (Database) in a relational database corresponds to an index (Index) in ES.
2) A database contains N tables (Table); correspondingly, an index contains N types (Type). Note that mapping types are deprecated in Elasticsearch 7.x.
3) The data in a table consists of multiple rows (Row) and multiple columns (Column, i.e. attributes); correspondingly, a type consists of multiple documents (Document), each with multiple fields (Field).
4) In a relational database, the schema defines the tables, the fields of each table, and the relationships between tables and fields. Correspondingly, in ES the mapping defines how the fields of a type under an index are handled: how the index is built, the index type, whether to store the original JSON document, whether to compress it, whether tokenization is needed, how to tokenize, and so on.
5) In a database the operations are insert, delete, update, and select; in ES they correspond to PUT/POST for insert, DELETE for delete, _update for update, and GET for search.
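
The operation mapping in item 5 can be sketched in Python by building (but not sending) the corresponding HTTP requests. The index name `articles`, the document id, and the field values are hypothetical; the paths follow the typeless `_doc` and `_update` endpoints of Elasticsearch 7.x, and the address assumes a local node on port 9200.

```python
import json
from urllib.request import Request

BASE_URL = "http://127.0.0.1:9200"  # assumed local node address
INDEX = "articles"                  # hypothetical index name


def build_request(method: str, path: str, body=None) -> Request:
    """Build an HTTP request for the ES REST API without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


requests_by_operation = {
    # SQL INSERT -> index a document
    "insert": build_request("PUT", f"/{INDEX}/_doc/1", {"title": "hello"}),
    # SQL SELECT -> fetch a document by id
    "select": build_request("GET", f"/{INDEX}/_doc/1"),
    # SQL UPDATE -> partial update through the _update endpoint
    "update": build_request("POST", f"/{INDEX}/_update/1", {"doc": {"title": "hi"}}),
    # SQL DELETE -> delete a document by id
    "delete": build_request("DELETE", f"/{INDEX}/_doc/1"),
}

for operation, req in requests_by_operation.items():
    print(operation, req.get_method(), req.full_url)

# To actually send one against a running node:
#   from urllib.request import urlopen
#   urlopen(requests_by_operation["select"])
```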

2.6 What is ELK

ELK = Elasticsearch + Logstash + Kibana
Elasticsearch: distributed storage and full-text retrieval in the back end
Logstash: log processing, the "porter" that moves the data
Kibana: visual display of the data
The ELK stack builds a powerful management chain for distributed data storage, visual querying, and log analysis. The three components complement each other and together handle large-scale distributed data processing.

2.7 Elasticsearch Features and Benefits

1) Distributed real-time document storage in which every field can be indexed and therefore searched.
2) A distributed search engine with real-time analytics.
Distributed: an index is split into multiple shards, and each shard can have zero or more replicas. Every data node in the cluster can host one or more shards and coordinates and handles the various operations;
load balancing and rerouting happen automatically in most cases.
3) It scales to hundreds of servers and handles PB-level structured or unstructured data. It can also run on a single PC (tested).
4) It supports a plugin mechanism: tokenizer plugins, synchronization plugins, Hadoop plugins, visualization plugins, and so on.
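
The shard and replica layout from item 2 is normally chosen when an index is created, through the `number_of_shards` and `number_of_replicas` index settings. The Python sketch below builds such a settings body and counts the resulting shard copies; the values 3 and 1 are arbitrary example choices, not recommendations.

```python
import json


def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Each primary shard plus its replicas must be hosted somewhere in the cluster."""
    return number_of_shards * (1 + number_of_replicas)


# Example values only; this dict would be the JSON body of an index-creation request.
index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}

body = json.dumps(index_settings)
settings = index_settings["settings"]
total = total_shard_copies(settings["number_of_shards"], settings["number_of_replicas"])

print(body)
print(total)  # 6
```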

3 Why use Elasticsearch

3.1 Notable use cases at home and abroad

1) In early 2013, GitHub dropped Solr and adopted Elasticsearch for PB-scale search. "GitHub uses Elasticsearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code."

2) Wikipedia: its core search architecture is built on Elasticsearch.
3) SoundCloud: "SoundCloud uses Elasticsearch to provide instant and accurate music search for 180 million users."
4) Baidu: Baidu currently uses Elasticsearch widely for text data analysis. It collects the various metric data and user-defined data from Baidu's servers and, through multi-dimensional analysis and display, helps locate and analyze instance-level or business-level anomalies. It currently covers more than 20 internal business lines (including casio, cloud analysis, network alliance, prediction, Wenku, Zhidahao, wallet, risk control, etc.), with a single cluster of up to 100 machines and 200 ES nodes, importing 30 TB+ of data per day.

5) Sina: uses ES to analyze and process 3.2 billion real-time log entries.
6) Ali: ES used to build Wacai's own log collection and analysis system.
7) Youzan: uses ES for business log processing.

3.2 Our business scenarios

In real project development, almost every system has a search feature. Once search reaches a certain scale, it becomes increasingly hard to maintain and extend, so many companies split search out into a separate module implemented with Elasticsearch or a similar engine.

Elasticsearch has developed rapidly in recent years and has gone beyond its initial role as a pure search engine: it now offers data aggregation analysis (aggregations) and visualization. If you have millions of documents that need to be located by keyword, Elasticsearch is a strong choice. And if your documents are JSON, you can also treat Elasticsearch as a kind of "NoSQL database" and use its aggregation features for multi-dimensional data analysis.

Consider using ES in place of a traditional NoSQL store; its scaling mechanism is very convenient.

Scenarios:

1) a newly developed system tries ES as its storage and retrieval server;
2) an existing system is upgraded and needs full-text search, which calls for ES.

 

Installing Elasticsearch on Mac

## 1 Install the JDK

Because Elasticsearch is written in Java, a JDK must be installed, version 1.8 or later; search online for the detailed installation steps.

After installation, check the Java version:

java -version

 

2 Download the latest version from the official website

 

Download from https://www.elastic.co/cn/downloads/elasticsearch and select the appropriate version.


3 Downloading other versions

Past releases are available at https://www.elastic.co/cn/downloads/past-releases#elasticsearch


## 3 After the download completes, start Elasticsearch

Extract the archive, switch into the extracted directory, and run:

cd elasticsearch-<version>  # switch to the extracted directory
./bin/elasticsearch  # start ES
# To run Elasticsearch in the background as a daemon, append the -d flag.
# On Windows, run bin\elasticsearch.bat instead of bin\elasticsearch.

 

4 Verify that startup succeeded

Enter the following address in a browser: http://127.0.0.1:9200/

You should see content like the following:

{
  "name" : "lqzMacBook.local",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "G1DFg-u6QdGFvz8Z-XMZqQ",
  "version" : {
    "number" : "7.5.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "e9ccaed468e2fac2275a3761849cbee64b39519f",
    "build_date" : "2019-11-26T01:06:52.518245Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
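
The same check can be scripted. The Python sketch below parses a response of the shape shown above and pulls out the fields that matter when verifying an installation; `summarize_root_response` and `fetch_root` are hypothetical helper names, and the sample string is an abbreviated copy of the output above. `fetch_root` is defined but not called, since it needs a running node.

```python
import json
from urllib.request import urlopen


def summarize_root_response(text: str) -> dict:
    """Extract the cluster name and version numbers from the root endpoint's JSON."""
    info = json.loads(text)
    return {
        "cluster_name": info["cluster_name"],
        "version": info["version"]["number"],
        "lucene_version": info["version"]["lucene_version"],
    }


def fetch_root(url: str = "http://127.0.0.1:9200/") -> str:
    """Fetch the root endpoint of a running node (assumes the default local address)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Abbreviated copy of the response shown above.
sample = """
{
  "name" : "lqzMacBook.local",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "7.5.0",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0"
  },
  "tagline" : "You Know, for Search"
}
"""

summary = summarize_root_response(sample)
print(summary)
# With a node running locally: summarize_root_response(fetch_root())
```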

 

5 Stopping ES

# list the ES processes
ps -ef | grep elastic
# kill the process (2382 is the process ID reported by ps)
kill -9 2382
# start ES as a daemon
./bin/elasticsearch -d

 

 

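 Introduction to Elasticsearch plugins

    ES plugins are a way to enhance the core functionality of Elasticsearch. They can add custom mapping types, custom tokenizers, native scripts, auto-scaling, and other extensions to ES.

An ES plugin contains JAR files and may also contain scripts and configuration files. It must be installed on every node in the cluster, and each node must be restarted after installation for the plugin to take effect.
ES plugins fall into two kinds: core plugins and third-party plugins.

## 2 Core plugins
Core plugins are official plugins provided by the Elasticsearch project, and all are open source. They are upgraded along with Elasticsearch releases, so a version matching your Elasticsearch is always available. They are developed jointly by the official team and community members.

Official plugins: https://github.com/elastic/elasticsearch/tree/master/plugins

## 3 Third-party plugins
    Third-party plugins are developed independently by individual developers or third-party organizations to extend Elasticsearch. They carry their own licenses, so make sure you understand a plugin's license before using it. They are not necessarily updated with each Elasticsearch release, so users must check the compatibility between the plugin and their ES version themselves.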

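4 Plugin installation

Installing Elasticsearch plugins is convenient.

Three methods are available: command line, URL, and offline installation.

Core plugins can be installed with any of the three; offline installation is recommended for third-party plugins.
Method 1: command line

bin/elasticsearch-plugin install [plugin_name]
# example: install the Smart Chinese analysis (tokenizer) plugin
# bin/elasticsearch-plugin install analysis-smartcn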

 


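Method 2: URL installation

bin/elasticsearch-plugin install [url]
# bin/elasticsearch-plugin install https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-6.4.0.zip

Method 3: offline installation

# https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-6.4.0.zip
# download the analysis-smartcn offline package from the URL above
# extract the package into the plugins directory under the Elasticsearch installation directory
# restart ES; a newly installed plugin only takes effect after a restart

 

Note: the plugin version must match the Elasticsearch version.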
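
A small pre-flight check can catch a version mismatch before installing. The Python sketch below assumes the artifact file name ends in `-<major>.<minor>.<patch>.zip`, as in the download URL above; the helper names are hypothetical, and real plugins may follow other naming schemes.

```python
import re


def plugin_version_from_filename(filename: str) -> str:
    """Extract 'x.y.z' from a name like 'analysis-smartcn-6.4.0.zip' (assumed naming)."""
    match = re.search(r"-(\d+\.\d+\.\d+)\.zip$", filename)
    if match is None:
        raise ValueError(f"cannot find a version in {filename!r}")
    return match.group(1)


def is_compatible(plugin_filename: str, es_version: str) -> bool:
    """The plugin version has to equal the Elasticsearch version exactly."""
    return plugin_version_from_filename(plugin_filename) == es_version


# The 6.4.0 package from the URL above does not fit the 7.5.0 node installed earlier.
print(is_compatible("analysis-smartcn-6.4.0.zip", "7.5.0"))  # False
print(is_compatible("analysis-smartcn-7.5.0.zip", "7.5.0"))  # True
```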

 

 

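Installing Node.js

1 Introduction to Node.js

Node.js is JavaScript running on the server side.

Node.js is a platform built on Chrome's JavaScript runtime.

Node.js is an event-driven I/O server-side JavaScript environment based on Google's V8 engine, which executes JavaScript very quickly and performs very well.

Why install Node.js? Because the Grunt tool used below runs on Node.js.

Download: https://nodejs.org/en/download/releases/

Choose a version to download and click Next through the installer. After installation, open a command line and enter:

node -v
# a version number means the installation succeeded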

 

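2 Check the current registry address

npm (node package manager) is the Node.js package manager, used to manage Node packages (installing, uninstalling, managing dependencies, and so on).

npm get registry
# output: https://registry.npmjs.org/

3 Switch npm to the Alibaba (Taobao) mirror

# switch to the Alibaba mirror
npm config set registry https://registry.npm.taobao.org/
# check that it worked
npm config get registry
# or
npm get registry
# expected output:
# https://registry.npm.taobao.org/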

 

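4 Install cnpm

cnpm: npm downloads packages from servers outside China, so it is strongly affected by network conditions and may fail. A registry hosted in China would help, and the Taobao team has kindly provided one. From its official site: "This is a complete mirror of npmjs.org that you can use in place of the official registry (read-only). It currently syncs every 10 minutes to stay as close as possible to the official service."

npm install -g cnpm --registry=https://registry.npm.taobao.org
# check that the installation succeeded
cnpm -v
# once installed, cnpm can be used in place of the npm command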

 

 

5 Change the default npm paths

1. First configure the storage path for npm global modules and the cache path:

npm config set prefix "path"
npm config set cache "path"

 

 

 

 


Origin www.cnblogs.com/Gaimo/p/12127221.html