A systematic study of Elasticsearch (1): a brief introduction to Elasticsearch and its core concepts

1, A brief introduction to ES

Elasticsearch is built on Lucene. It hides Lucene's complexity and exposes easy-to-use interfaces: a RESTful API, a Java API, and API clients for other languages.
(1) A distributed document store
(2) A distributed search engine and analytics engine
(3) Distributed, with support for PB-level data

1.1, ES functions

(1) Distributed search engine and data analysis engine

Search: Baidu, on-site search within a website, search within IT systems
Data analysis: on an e-commerce site, which 10 merchants sold the most toothpaste in the last 7 days; on a news site, which 3 sections had the most visits in the last month
Distributed, search, data analysis

(2) Full-text search, structured search, data analysis

Full-text search: I want to find products whose name contains "toothpaste"; in SQL terms, select * from products where product_name like "%toothpaste%"
Structured search: I want to find the products that belong to a given category, select * from products where category_id = 'cosmetic products'
Partial matching, auto-completion, search error correction, search recommendations
Data analysis: count how many products each category contains, select category_id, count(*) from products group by category_id
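
The three SQL analogies above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory products table; the table layout and the fourth row ("kids toothpaste") are assumptions made for illustration only, and this shows the relational queries the text compares ES against, not ES itself.

```python
import sqlite3

def build_products_db():
    """Create an in-memory products table with a few sample rows (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table products (product_id text, product_name text, category_id text)"
    )
    conn.executemany(
        "insert into products values (?, ?, ?)",
        [
            ("1", "Colgate toothpaste", "cosmetic products"),
            ("2", "Changhong TV", "electrical goods"),
            ("3", "shrimp", "fresh produce"),
            ("4", "kids toothpaste", "cosmetic products"),  # hypothetical extra row
        ],
    )
    return conn

conn = build_products_db()

# "Full-text search" analogy: substring match on the product name
full_text = conn.execute(
    "select product_name from products where product_name like ? order by product_id",
    ("%toothpaste%",),
).fetchall()

# "Structured search" analogy: exact match on a category value
structured = conn.execute(
    "select product_name from products where category_id = ? order by product_id",
    ("cosmetic products",),
).fetchall()

# "Data analysis" analogy: number of products per category
per_category = conn.execute(
    "select category_id, count(*) from products group by category_id order by category_id"
).fetchall()

print(full_text)
print(structured)
print(per_category)
```

A like query with a leading wildcard has to scan every row, which is one reason a dedicated full-text engine is used once the data grows.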

(3) Near-real-time processing of massive data

Distributed: ES can automatically spread massive amounts of data across multiple servers for storage and retrieval
Massive data processing: once the data is distributed, a large number of servers can store and retrieve it, so processing massive data follows naturally
Near real time: if retrieving data takes an hour, that is not near real time but offline batch processing; near real time means data can be searched and analyzed within seconds
The opposite of distributed / massive data: Lucene on its own runs on a single machine, so it can only handle as much data as that one server can handle

1.2, ES usage scenarios

Abroad:

(1) Wikipedia (similar to Baidu Baike): full-text search, highlighting, and search recommendations, for example when looking up "toothpaste"
(2) The Guardian (a foreign news site, similar to Sohu News): user behavior logs (clicks, views, favorites, comments) plus social network data (opinions about a given news item) are analyzed so that the author of each article can see the public feedback on it (good, bad, popular, junk, scorned, admired)
(3) Stack Overflow (a foreign forum for discussing program errors): you post an IT or programming problem and other people answer and discuss it with you; full-text search finds related questions and answers, so when a program fails you can paste in the error message and search for a matching answer
(4) GitHub (open source code hosting): searches billions of lines of code
(5) E-commerce websites: product search
(6) Log data analysis: Logstash collects the logs and ES performs complex data analysis (the ELK stack: Elasticsearch + Logstash + Kibana)
(7) Commodity price monitoring websites: a user sets a price threshold for a product and receives a notification when the price falls below it; for example, subscribe to toothpaste monitoring and ask to be notified when a family pack of Colgate toothpaste drops below 50 dollars, then go and buy it
(8) BI systems (business intelligence): for example, a large shopping group uses BI to analyze how user spending in a given region has trended over the last three years and how the user base is composed, and produces several reports; if in ** District spending has grown by 100% every year for the last three years and 85% of the users are senior white-collar workers, the group opens a new mall there. ES performs the data analysis and mining, Kibana the data visualization

Domestic:

(9) Site search (e-commerce, recruitment, portals, etc.), IT system search (OA, CRM, ERP, etc.), and data analysis (one of the popular ES use cases)

1.3, ES Features

(1) It can run as a large distributed cluster (hundreds of servers) that processes PB-level data and serves large companies; it can also run on a single machine and serve small companies
(2) Elasticsearch is not a brand-new technology. It mainly merges full-text search, data analysis, and distributed technology into one product, which is what makes ES unique; each part already existed separately: Lucene (full-text search), commercial data analysis software, distributed databases (myCat)
(3) For users it works out of the box and is very simple: a small or medium-sized application can deploy ES in about three minutes and use it directly in production, provided the data volume is small and the operations are not too complex
(4) Database features fall short in many areas (databases focus on transactions and online transaction-type operations). Special features such as full-text search, synonym handling, relevance ranking, complex data analysis, and near-real-time processing of massive data are where Elasticsearch comes in: it complements a traditional database by providing many features the database cannot

1.4, ES core concepts

(1) Near Realtime (NRT): near real time has two meanings. First, there is a small delay (about 1 second) between writing data and being able to search it. Second, search and analysis on ES return within seconds
(2) Cluster: a cluster contains multiple nodes. Which cluster a node belongs to is determined by configuration (the cluster name, "elasticsearch" by default). For small and medium-sized applications, starting out with a one-node cluster is normal
(3) Node: a node in the cluster. A node also has a name (randomly assigned by default), and the name matters a lot for operations and maintenance. By default a node joins the cluster named "elasticsearch", so if you start a bunch of nodes they automatically form one elasticsearch cluster; a single node can also form a cluster on its own
(4) Document & Field: a document is the smallest unit of data in ES. A document can be one customer record, one product category record, or one order record, and is usually represented as a JSON data structure. Each type under an index can store many documents. A document has multiple fields, and each field is one data field.

# Example: a product document

{
  "product_id": "1",
  "product_name": "Colgate toothpaste",
  "product_desc": "Efficient white",
  "category_id": "2",
  "category_name": "cosmetic products"
}

# Here product_id, product_name, and so on are fields

(5) Index: an index holds a set of documents with a similar data structure, for example a customer index, a product category index, or an order index. Every index has a name. An index contains many documents and represents one class of similar or identical documents. For example, you might create a product index that stores all product data, that is, all product documents.
(6) Type: each index can have one or more types. A type is a logical classification of the data inside an index, and the documents under one type share the same fields. For example, a blog system with a single index could define a user data type, a blog data type, and a comment data type.

A product index could have several types: a daily commodity type, an electrical goods type, and a fresh produce type

daily commodity type: product_id, product_name, product_desc, category_id, category_name
electrical goods type: product_id, product_name, product_desc, category_id, category_name, service_period
fresh produce type: product_id, product_name, product_desc, category_id, category_name, eat_period

Each type contains a set of documents


{
  "product_id": "2",
  "product_name": "Changhong TV",
  "product_desc": "4K HD",
  "category_id": "3",
  "category_name": "electrical goods",
  "service_period": "1 year"
}


{
  "product_id": "3",
  "product_name": "shrimp",
  "product_desc": "natural, produced in Iceland",
  "category_id": "4",
  "category_name": "fresh produce",
  "eat_period": "7 days"
}
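
To make the index / type / document / field hierarchy concrete, here is a minimal Python sketch that models a product index as plain nested dictionaries and works out which fields each type adds on top of the common ones. The underscored type names and the helper function are hypothetical names chosen for this illustration; this is a mental model only, not how ES stores data internally.

```python
# index -> type -> list of documents; each document is a dict of fields
product_index = {
    "daily_commodity": [
        {"product_id": "1", "product_name": "Colgate toothpaste",
         "product_desc": "Efficient white", "category_id": "2",
         "category_name": "cosmetic products"},
    ],
    "electrical_goods": [
        {"product_id": "2", "product_name": "Changhong TV",
         "product_desc": "4K HD", "category_id": "3",
         "category_name": "electrical goods", "service_period": "1 year"},
    ],
    "fresh_produce": [
        {"product_id": "3", "product_name": "shrimp",
         "product_desc": "natural, produced in Iceland", "category_id": "4",
         "category_name": "fresh produce", "eat_period": "7 days"},
    ],
}

def fields_of(type_name):
    """Return the sorted field names shared by every document of one type."""
    docs = product_index[type_name]
    common = set(docs[0])
    for doc in docs[1:]:
        common &= set(doc)
    return sorted(common)

# Fields each type has beyond those of the daily commodity type
base = set(fields_of("daily_commodity"))
extra_fields = {t: sorted(set(fields_of(t)) - base) for t in product_index}

print(extra_fields)
```

The three types share five fields and differ only in service_period and eat_period, which is the sense in which a type is a logical grouping of documents with the same fields.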

(7) Shard: a single machine cannot store massive amounts of data, so ES can split the data of an index into multiple shards and store them across multiple servers. With shards you can scale out to store more data, and operations such as search and analysis are distributed across multiple servers, which improves throughput and performance. Each shard is a Lucene index.
(8) Replica: any server can fail or go down at any time, and the shards on it may then be lost, so you can create multiple replica copies of each shard. A replica serves as a backup when a shard fails, so that no data is lost, and multiple replicas also improve the throughput and performance of search operations. The number of primary shards is set when the index is created and cannot be changed afterwards (default 5); the number of replica shards can be changed at any time (default 1 per primary). By default, therefore, each index has 10 shards: 5 primary shards and 5 replica shards. The minimum high-availability configuration is 2 servers.
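
The shard arithmetic above, and the reason two servers are the minimum for high availability, can be sketched in a few lines of Python. The round-robin placement below is a toy model written for this illustration, not Elasticsearch's actual allocation algorithm, and the node names are made up; the only rule it enforces is that a replica never sits on the same node as its primary.

```python
def total_shards(primaries=5, replicas_per_primary=1):
    """Total shard count for one index: every primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)

def allocate(primaries, replicas_per_primary, nodes):
    """Toy round-robin placement: copies of the same shard land on different nodes."""
    if replicas_per_primary + 1 > len(nodes):
        raise ValueError("need at least as many nodes as copies of each shard")
    layout = {node: [] for node in nodes}
    for shard in range(primaries):
        for copy in range(replicas_per_primary + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "P" if copy == 0 else "R"  # P = primary, R = replica
            layout[node].append(f"{role}{shard}")
    return layout

print(total_shards())  # 5 primaries + 5 replicas
layout = allocate(5, 1, ["node-1", "node-2"])
for node, shards in layout.items():
    print(node, shards)
```

With two nodes, each node ends up holding one copy of all five shards, so losing either node loses no data. With a single node the replicas have nowhere to go, which the sketch reports as an error.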

1.5, Elasticsearch core concepts vs. database core concepts

Elasticsearch	Database
Document	Row
Type	Table
Index	Database