Elasticsearch: Building a Full-Text Search Engine

— Setting out with a question: how did ES come about?

(1) The question: how do we retrieve data at large scale?

For example, when a system holds 1 billion or 10 billion records, we usually approach the architecture from the following angles:
1) Which database to use? (MySQL, Sybase, Oracle, Dameng, Shentong, MongoDB, HBase ...)
2) How to handle single points of failure? (LVS, F5, A10, ZooKeeper, MQ)
3) How to keep data safe? (hot standby, cold standby, geo-redundant active-active)
4) How to solve retrieval? (database middleware proxies: mysql-proxy, Cobar, MaxScale, etc.)
5) How to handle statistical analysis? (offline, near real-time)

(2) The traditional relational-database solution

For relational data, we usually adopt an architecture like the following to relieve read and write bottlenecks.
Key points:
1) master-slave replication addresses data safety;
2) a database middleware proxy with heartbeat monitoring addresses the single point of failure;
3) the proxy middleware distributes queries to each slave node and merges the results.
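The key points above can be sketched in a few lines of Python. This is purely an illustration of the read/write-splitting idea, not a real middleware: the master and slaves are plain in-memory dicts, and replication is simplified to a synchronous copy.

```python
# Illustrative sketch (not a real proxy): read/write splitting in a
# master/slave setup. Writes go to the master and are copied to slaves;
# reads are spread across the slaves round-robin.
import itertools

class Proxy:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._rr = itertools.cycle(range(len(slaves)))

    def write(self, key, value):
        self.master[key] = value
        for slave in self.slaves:           # simplified synchronous replication
            slave[key] = value

    def read(self, key):
        node = self.slaves[next(self._rr)]  # round-robin over the slaves
        return node.get(key)

proxy = Proxy(master={}, slaves=[{}, {}])
proxy.write("user:1", "alice")
print(proxy.read("user:1"))  # served by a slave, not the master
```

A real proxy would replicate asynchronously via the binlog and monitor slave health with heartbeats, as the text describes.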

(3) The NoSQL database solution

Taking MongoDB as an example of a NoSQL database (others follow similar principles):
Key points:
1) replica copies ensure data safety;
2) an election mechanism handles single-node failure;
3) a query first consults the config servers for shard information, is then dispatched to each relevant node, and finally the routing node merges the aggregated results.

Another approach — what if we put all the data in memory?

We know that data held purely in memory is unreliable, and in fact the idea is unrealistic. At the PB scale, with 96 GB of memory per node and memory completely filled with data, the machines required are:
1 PB = 1024 TB = 1,048,576 GB
nodes = 1,048,576 / 96 ≈ 10,922
In practice, once data replication is accounted for, the node count often lands around 25,000 machines or more. The enormous cost makes this approach unrealistic!
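The arithmetic above is easy to check in a few lines of Python (the 2.5× replication factor is a rough allowance for backup copies, as in the text):

```python
# Back-of-the-envelope check of the node count for holding 1 PB in RAM.
PB_IN_GB = 1024 * 1024        # 1 PB = 1024 TB = 1,048,576 GB
MEM_PER_NODE_GB = 96          # usable memory per node

nodes = PB_IN_GB / MEM_PER_NODE_GB
print(f"raw nodes needed: {nodes:.0f}")        # ~10,923 machines

REPLICATION_FACTOR = 2.5      # rough allowance for data backup copies
print(f"with replicas: {nodes * REPLICATION_FACTOR:.0f}")
```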

From this discussion we can see that neither all-on-disk nor all-in-memory completely solves the problem by itself.
Putting everything in memory solves the speed problem, but the cost shoots up.
To solve the problems above, starting the analysis from the source, we usually look for solutions along these lines:
1. store data in sorted order at write time;
2. separate the data from the index;
3. compress the data.
This is where Elasticsearch comes in.
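A toy illustration of the first two ideas — my own sketch, not Elasticsearch's real on-disk layout: keep records sorted by key so lookups become binary searches, and keep a compact, separate index so a search never scans the bulky data.

```python
# Toy sketch of ideas 1 and 2: sorted storage plus a separated index.
import bisect

# Data records, already stored sorted by key (idea 1).
records = [("apple", "a red fruit"), ("kiwi", "a green fruit"), ("pear", "a yellow fruit")]
# The separated index (idea 2): just the keys, small enough to keep in memory.
keys = [k for k, _ in records]

def lookup(key):
    i = bisect.bisect_left(keys, key)   # O(log n) thanks to sorted storage
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    return None

print(lookup("kiwi"))   # found via binary search on the index
print(lookup("mango"))  # not present
```

Idea 3, compression, would then apply to the bulky `records` side without touching the search path.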

1. ES basics: a complete overview

1.1 What is ES?

ES is short for Elasticsearch: an open-source, highly scalable, distributed full-text search engine that can store and retrieve data in near real time. It scales out very well, to hundreds of servers and PB-scale data. 
Elasticsearch is written in Java and uses Lucene at its core for all indexing and search functionality, but its purpose is to hide Lucene's complexity behind a simple RESTful API and make full-text search easy.

1.2 What is the relationship between ES and Lucene?

1) Lucene is just a library. To use it, you must develop in Java and embed it directly in your application. Worse still, Lucene is very complex: you need some background in information retrieval just to understand how it works.

2) Elasticsearch, also written in Java, uses Lucene at its core for all indexing and search, but hides Lucene's complexity behind a simple RESTful API so that full-text search becomes easy.

1.3 The problems ES mainly solves:

1) retrieving data; 
2) returning statistical results; 
3) doing both quickly.

1.4 How ES works

When an Elasticsearch node starts, it uses multicast (or unicast, if the user changes the configuration) to look for other nodes in the cluster and connect to them. The process is shown below:

1.5 ES core concepts

1) Cluster

ES can run as a standalone single search server. However, to handle large data sets and achieve fault tolerance and high availability, ES can run on many cooperating servers. The collection of these servers is called a cluster.

2) Node

Each server that makes up the cluster is called a node.

3) Shard

When there are many documents, a single node may not be enough, due to memory limits, insufficient disk capacity, inability to answer client requests fast enough, and so on. In that case the data can be split into smaller shards, with each shard placed on a different server. 
When a query targets an index distributed across multiple shards, ES sends the query to each relevant shard and merges the results, and the application never knows the shards exist. In other words, the process is transparent to the user.
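A minimal sketch of that transparent scatter-gather, with plain Python lists standing in for shards (real ES also scores, sorts, and pages the merged hits):

```python
# Sketch of scatter-gather search: the query is sent to every relevant
# shard, each shard returns its local hits, and the coordinating node
# merges them before answering the client, who never sees the shards.
shards = [
    ["doc1: elasticsearch intro", "doc2: mysql tuning"],
    ["doc3: elasticsearch shards"],
    ["doc4: kibana dashboards"],
]

def search(keyword):
    hits = []
    for shard in shards:                                # scatter
        hits.extend(d for d in shard if keyword in d)   # each shard searches locally
    return hits                                         # gather and merge

print(search("elasticsearch"))  # hits come from two different shards
```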

4) Replica

To improve query throughput or achieve high availability, shard replicas can be used. 
A replica is an exact copy of a shard, and each shard may have zero or more replicas. ES can hold many identical copies of a shard; the one chosen to apply index (write) operations is called the primary shard. 
When the primary shard is lost, for example when the node holding it becomes unavailable, the cluster promotes a replica to be the new primary.
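The promotion step can be sketched as below. This is a deliberately simplified illustration: in real ES the elected master node chooses a new primary from the in-sync replica copies, not simply the first healthy one.

```python
# Simplified sketch of replica promotion after the primary's node fails.
shard_copies = [
    {"node": "node-1", "role": "primary", "alive": False},  # node-1 just failed
    {"node": "node-2", "role": "replica", "alive": True},
    {"node": "node-3", "role": "replica", "alive": True},
]

def promote(copies):
    """Pick a surviving replica and make it the new primary."""
    for c in copies:
        if c["role"] == "replica" and c["alive"]:
            c["role"] = "primary"        # first healthy replica wins (simplified)
            return c["node"]
    raise RuntimeError("no replica available: the shard's data is lost")

print(promote(shard_copies))
```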

5) Full-text search

Full-text search indexes an article so that it can be searched by keyword, much like MySQL's LIKE statement. 
A full-text index segments the content into words by meaning and builds an index entry for each one. For example, "what does your passion come from" might be tokenized into "you", "passion", "what", "come" and other tokens, so that searching for "you" or "passion" will match the sentence.
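A toy inverted index makes the idea concrete. The whitespace tokenizer below is a stand-in: real analyzers handle stemming, stop words, and word segmentation for languages like Chinese.

```python
# Toy inverted index: tokenize each document, then map every token to
# the set of documents containing it -- the core idea of full-text search.
from collections import defaultdict

docs = {
    1: "your passion is what drives you",
    2: "full text search with elasticsearch",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():   # crude whitespace tokenizer
        index[token].add(doc_id)

print(sorted(index["passion"]))  # documents matching the keyword "passion"
print(sorted(index["search"]))   # documents matching the keyword "search"
```

Unlike MySQL's LIKE, which scans every row, a query here is a single dictionary lookup on the token.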

1.6 ES data-architecture concepts compared with a relational database (MySQL)

(1) A database (DataBase) in a relational database corresponds to an index (Index) in ES. 
(2) The N tables (Table) under one database correspond to the N types (Type) under one index (note: mapping types have since been removed in ES 7.x and later). 
(3) A row of data in a table corresponds to a document (Document) in ES. 
(4) In a relational database, a schema defines the tables, the fields of each table, and the relationships among tables and fields. Correspondingly, in ES a mapping defines the field-handling rules of a type under an index: how the index is built, the index type, whether to store the original JSON document, whether to compress it, whether and how to analyze (tokenize) it, and so on. 
(5) The database operations insert, delete, update, and select correspond in ES to PUT/POST for create, DELETE for delete, _update for update, and GET for query.
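Point (5) can be made concrete by spelling out the REST requests behind each verb. The index name "blog", type "article", and id 1 below are hypothetical examples; the typed `/index/type/id` URL scheme matches the pre-7.x API described in this article.

```python
# Sketch: how relational CRUD verbs map onto Elasticsearch's REST API.
BASE = "http://127.0.0.1:9200"

def es_request(op, index, doc_type, doc_id):
    """Return the (HTTP method, URL) pair ES expects for each CRUD verb."""
    doc_url = f"{BASE}/{index}/{doc_type}/{doc_id}"
    return {
        "insert": ("PUT", doc_url),                # create or replace a document
        "delete": ("DELETE", doc_url),             # remove a document
        "update": ("POST", doc_url + "/_update"),  # partial update
        "select": ("GET", doc_url),                # fetch a document by id
    }[op]

for op in ("insert", "delete", "update", "select"):
    print(op, "->", es_request(op, "blog", "article", 1))
```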

1.7 What is ELK?

ELK = Elasticsearch + Logstash + Kibana 
Elasticsearch: back-end distributed storage and full-text retrieval. 
Logstash: log processing, the "porter" of data. 
Kibana: data visualization. 
ELK forms a powerful management chain for distributed data storage, visualized queries, and log analysis. The three complement one another and together handle large-scale distributed data processing.

2. ES features and advantages

  • 1. Distributed real-time document store, in which every field can be indexed and made searchable. 

  • 2. A distributed search engine with real-time analytics. 

  Distributed: an index is split into multiple shards, and each shard may have zero or more replicas. Each data node in the cluster hosts one or more shards and coordinates and handles the various operations; load rebalancing and request re-routing are done automatically in most cases. 

  • 3. Scales out to hundreds of servers and handles PB-scale structured or unstructured data. It can also run on a single PC (tested). 

  • 4. Supports a plugin mechanism: analyzer (word-segmentation) plugins, synchronization plugins, Hadoop plugins, visualization plugins, and more.

3. ES performance

3.1 Measured performance results

(1) Hardware configuration: 
CPU: 16 AuthenticAMD cores 
Total memory: 32 GB 
Total disk: 500 GB, non-SSD

(2) Performance measured on the hardware above: 
1) average indexing throughput: 12,307 docs/s (document size: 40 B/doc); 
2) average CPU utilization: 887.7% (16 cores, 55.48% per core); 
3) size of the built index: 3.30111 GB; 
4) total amount written: 20.2123 GB; 
5) total test duration: 28 m 54 s.

3.2 Performance testing with the esrally tool (recommended)

Reference: http://blog.csdn.net/laoyang360/article/details/52155481

4. Why ES?

4.1 Notable ES deployments at home and abroad

1) In early 2013, GitHub dropped Solr and adopted Elasticsearch for PB-scale search: "GitHub uses Elasticsearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code."

2) Wikipedia: uses an Elasticsearch-based core search architecture. 
3) SoundCloud: "SoundCloud uses Elasticsearch to provide instant and precise music search services to 180 million users." 
4) Baidu: Baidu now uses Elasticsearch extensively for text-data analytics, collecting all kinds of metric data and user-defined data from Baidu's servers; through multi-dimensional analysis and display of the data, it helps locate and analyze instance-level or business-level anomalies. It currently covers more than 20 business lines inside Baidu (including CASIO, cloud analytics, the ad network, prediction, Wenku, Zhida, Wallet, risk control, etc.); the largest single cluster has 100 machines and 200 ES nodes, importing 30 TB+ of data per day.

4.2 We need it too

In real-world project development, almost every system has a search feature. Once search reaches a certain scale, it gradually becomes harder to maintain and extend, so many companies split search out as a single independent module and use something like Elasticsearch for it.

In recent years Elasticsearch has developed rapidly and gone far beyond its original role as a pure search engine. It has added data aggregation/analytics and visualization features. If you need to locate documents by keyword among millions of them, Elasticsearch is certainly the best choice. Of course, if your documents are JSON, you can also use Elasticsearch as a kind of "NoSQL database" and apply its aggregations to analyze multi-dimensional data.

[From Zhihu: architect Pan Fei on ES] Replacing a traditional DB in some scenarios: 
"Personally I think Elasticsearch works well as internal storage and can basically meet our efficiency requirements; replacing a traditional DB in some scenarios is also feasible, provided your business operations make no special demands and you do not need fine-grained access control, since ES's permission system is not yet mature. 
Our own scenario only runs aggregations over ES data within a time window, with no heavy single-document requests (such as fetching a document by userid, a NoSQL-like access pattern), so whether ES can substitute for NoSQL is something you need to test yourself. 
If the choice were mine, I would try to use ES to replace traditional NoSQL, because its horizontal-scaling mechanism is just too convenient."

5. What are ES's usage scenarios?

We usually face one of two situations:

1) a newly developed system uses ES as its storage and retrieval server; 
2) an existing system needs to add full-text search and adopts ES for it. 
The two architectures above are described in detail at the link below: 
http://blog.csdn.net/laoyang360/article/details/52227541

Production ES usage scenarios at various companies:

1) How Sina analyzes 3.2 billion real-time log entries with ES: http://dockone.io/article/505 
2) How Alibaba built its own log collection and analysis platform with ES: http://afoo.me/columns/tec/logging-spec.html 
3) Youzan's log processing with ES: http://tech.youzan.com/you-zan-tong-ri-zhi-ping-tai-chu-tan/ 
4) On-site search implemented with ES: http://www.wtoutiao.com/p/13bkqiZ.html

6. Installing ES (RTF edition)

1. Download the installation package.

2. Go to the bin directory, open a command line, type elasticsearch, and press Enter.

3. Open a browser and visit http://127.0.0.1:9200/. If a page with the cluster's basic JSON information appears, the installation succeeded.

7. Installing elasticsearch-head

  • git clone git://github.com/mobz/elasticsearch-head.git
  • cd elasticsearch-head
  • npm install
  • npm run start

Note: if npm install fails, delete the node_modules folder, run npm install grunt-cli -g and npm install grunt -g, then run npm install again.

Open a browser and visit 127.0.0.1:9100.

8. Installing Kibana

Search for Kibana on the official download page https://www.elastic.co/cn/downloads/kibana; note that the version must match your Elasticsearch version.

After downloading, go to the bin directory and run:

kibana.bat

Open a browser and visit 127.0.0.1:5601.


Origin blog.csdn.net/bjniujw1024/article/details/104312123