2019 Elasticsearch common interview questions, explained (Part 1)

Foreword

Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant-capable full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the terms of the Apache License, and it is a popular enterprise search engine. Designed for cloud computing, it delivers real-time search and is stable, reliable, fast, and easy to install and use. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.

Elasticsearch interview questions

1. How much do you know about Elasticsearch? Describe your company's ES cluster architecture, index data size, number of shards, and some tuning measures.
2. What is the Elasticsearch inverted index?
3. What do you do when Elasticsearch holds a lot of index data? How do you tune and deploy?
4. How does Elasticsearch implement master election?
5. Describe in detail the process by which Elasticsearch indexes a document.
6. Describe in detail the Elasticsearch search process.
7. When deploying Elasticsearch, which optimizations do you apply to the Linux settings?
8. What is the internal structure of Lucene?
9. How does Elasticsearch implement master election?
10. Among the nodes in Elasticsearch (say 20 in total), 10 elect one master and the other 10 elect a different master. What do you do?
11. When a client connects to the cluster, how does it choose a specific node to execute the request?
12. Describe in detail the process by which Elasticsearch indexes a document.

1. How much do you know about Elasticsearch? Describe your company's ES cluster architecture, index data size, number of shards, and some tuning measures.

Interviewer: wants to learn the scenarios and scale at which the candidate's previous company used ES, and whether the candidate has done index design, planning, and tuning at a fairly large scale.
Answer: answer truthfully, based on your own hands-on scenarios.
For example: the ES cluster has 13 nodes; indices are split by channel into 20+ indices and created by date, so 20+ new indices are added per day; each index has 10 shards; 100 million+ documents are added daily; the daily index size of each channel is kept within 150 GB.
Index-level tuning measures only:

1.1 Design-phase tuning

(1) Based on the incremental growth of the business, create indices from a date-based template and roll them over with the rollover API;
(2) Manage indices through aliases;
(3) Run a scheduled force_merge on the indices in the early morning every day to free space;
(4) Use hot/cold separation: store hot data on SSD to improve retrieval efficiency, and periodically shrink cold data to reduce storage;
(5) Use Curator for index lifecycle management;
(6) Set an appropriate analyzer only for the fields that need tokenizing;
(7) In the mapping phase, fully consider the attributes of each field, such as whether it needs to be searchable and whether it needs to be stored. ......

1.2 Write tuning

(1) Before writing, set the number of replicas to 0;
(2) Before writing, set refresh_interval to -1 to disable the refresh mechanism;
(3) During writing, use bulk writes;
(4) After writing, restore the number of replicas and the refresh interval;
(5) Use auto-generated ids wherever possible.
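The write-tuning steps above can be sketched as the request bodies they translate into. This is only a minimal sketch under assumptions: the helper names, the index name `blog_index_2019.12.25`, and the restored values (1 replica, `1s` refresh) are made up for illustration, and the bulk action omits `_type`, which assumes a typeless (7.x-style) cluster. The settings bodies would be sent to `PUT /<index>/_settings` and the bulk body to `POST /_bulk`.

```python
import json

def bulk_load_settings():
    """Settings to apply before a large import: no replicas, no refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}

def restored_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply after the import (the values are assumptions)."""
    return {"index": {"number_of_replicas": replicas,
                      "refresh_interval": refresh_interval}}

def bulk_body(index_name, docs):
    """Build an NDJSON _bulk body; omitting _id lets ES auto-generate ids."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

# Hypothetical index name and documents, for illustration only.
body = bulk_body("blog_index_2019.12.25", [{"title": "a"}, {"title": "b"}])
```

Leaving `_id` out of each action line is what item (5) refers to: ES can then skip the check for an existing document with the same id.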

1.3 Query tuning

(1) Disable wildcard queries;
(2) Disable bulk terms queries (scenarios with hundreds or thousands of terms);
(3) Make full use of the inverted index mechanism: use the keyword type wherever you can;
(4) When the data volume is large, first narrow down the indices by time and then search;
(5) Set up a sensible routing mechanism.

1.4 Other tuning

Deployment tuning, business tuning, and so on.
Having covered the points above, you will have given the interviewer a basic picture of your previous hands-on or operations experience.

2. What is the Elasticsearch inverted index?

Interviewer: wants to check your grasp of the basic concepts.
Answer: a plain-language explanation is enough.
Traditional retrieval goes through the articles one by one, scanning each for the keyword to find the matching positions.
An inverted index uses a tokenization strategy to build a mapping table between terms and articles; this dictionary plus mapping table is the inverted index. With an inverted index, articles can be retrieved with O(1) time complexity, which greatly improves retrieval efficiency.
A more academic answer:
An inverted index, as opposed to recording which terms an article contains, starts from the term and records which documents the term appears in. It consists of two parts: the dictionary and the postings list (inverted list).
Bonus point: the underlying implementation of the inverted index is based on the FST (Finite State Transducer) data structure.
Lucene has used FST extensively since version 4+. FST has two advantages:
(1) Small footprint. By reusing the prefixes and suffixes of the terms in the dictionary, it compresses the storage space;
(2) Fast queries. Query time complexity is O(len(str)).
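The dictionary-plus-postings idea described above can be illustrated with a small sketch. It uses a plain Python dict as the term dictionary and a naive whitespace tokenizer, both simplifications chosen for illustration; it is not how Lucene stores terms (Lucene uses an FST, as noted above).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents.
docs = {
    1: "Elasticsearch is based on Lucene",
    2: "Solr is also based on Lucene",
    3: "Elasticsearch provides full-text search",
}
index = build_inverted_index(docs)
hits = index.get("elasticsearch", [])  # one dictionary lookup, no document scan
```

A lookup goes from the term straight to its postings list, instead of scanning every document for the keyword.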

3. What do you do when Elasticsearch holds a lot of index data? How do you tune and deploy?

Interviewer: wants to gauge your ability to operate and maintain large data volumes.
Answer: index data should be planned well at an early stage ("design first, code later"). This effectively prevents a sudden surge of data from exceeding the cluster's processing capacity and affecting online customer searches or other business.
As for how to tune, as mentioned in question 1, here are the details:

3.1 Dynamic index level

Create indices by rolling them over with template + time + rollover API. For example, at the design phase, define the blog index template format as blog_index_ followed by a timestamp, with new data added every day. The benefit is that a surge in data volume will not make a single index extremely large, approaching the upper limit of 2^32 - 1, with index storage reaching TB+ or even more.
Once a single index becomes very large, storage and other risks follow, so think ahead and avoid the problem early.
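The date-based naming scheme above can be sketched as follows. The `blog_index_` prefix comes from the text; the `%Y.%m.%d` suffix format and the rollover thresholds are assumptions picked for illustration. The second helper returns a body of the kind accepted by `POST /<alias>/_rollover`.

```python
from datetime import date

def daily_index_name(day, prefix="blog_index_"):
    """Return the index name for a given day, e.g. blog_index_2019.12.25."""
    return f"{prefix}{day:%Y.%m.%d}"

def rollover_body(max_age="1d", max_docs=100_000_000):
    """Rollover conditions; the thresholds here are assumed values."""
    return {"conditions": {"max_age": max_age, "max_docs": max_docs}}

name = daily_index_name(date(2019, 12, 25))
```

Because each day gets its own index, old data can be merged, shrunk, or deleted per index without touching the indices still being written.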

3.2 Storage level

Store hot and cold data separately: hot data is, for example, the last 3 days or the last week of data, and the rest is cold data.
For cold data that will no longer receive new writes, consider periodic force_merge plus shrink compression to save storage space and improve retrieval efficiency.

3.3 Deployment level

If no planning was done beforehand, this is the contingency strategy.
Taking advantage of ES's support for dynamic scaling, adding machines dynamically can relieve the pressure on the cluster. Note: if the master nodes and so on were planned sensibly beforehand, the new machines can be added dynamically without restarting the cluster.

4. How does Elasticsearch implement master election?

Interviewer: wants to know whether you understand the underlying principles of an ES cluster, rather than focusing only on the operational level.
Answer:
Prerequisites:
(1) Only master-eligible nodes (node.master: true) can become the master.
(2) The purpose of the minimum number of master nodes (min_master_nodes) is to prevent split-brain.
I checked the code: the core entry point is findMaster, which returns the corresponding master if a master node is elected successfully and null otherwise. The election process is roughly as follows:
Step one: confirm that the number of master-eligible nodes meets the value set in elasticsearch.yml:
discovery.zen.minimum_master_nodes;
Step two: comparison. First check whether a node is master-eligible; master-eligible nodes are returned with priority.
If both nodes are master-eligible, the one with the smaller id becomes the master. Note that the id here is a string.
Aside: how to obtain the node ids.
GET /_cat/nodes?v&h=ip,port,heapPercent,heapMax,id,name

ip port heapPercent heapMax id name
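The two election steps above can be sketched as follows. This is a simplified illustration of the comparison rule described in the text, not the actual findMaster source; the `Node` class and `pick_master` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: str           # ids are strings, so comparison is lexicographic
    master_eligible: bool

def pick_master(candidates, minimum_master_nodes):
    """Return the preferred master, or None when too few candidates are seen."""
    eligible = [n for n in candidates if n.master_eligible]
    if len(eligible) < minimum_master_nodes:
        return None  # step one failed: not enough master-eligible nodes
    # Step two: among master-eligible nodes, the smallest id wins.
    return min(eligible, key=lambda n: n.node_id)

nodes = [Node("b9", True), Node("a1", False), Node("c3", True), Node("b10", True)]
winner = pick_master(nodes, minimum_master_nodes=2)
```

In this example `"b10"` sorts before `"b9"` because the ids are compared as strings, which is the point of the note about the id type.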

5. Describe in detail the process by which Elasticsearch indexes a document.

Interviewer: wants to know whether you understand the underlying principles of ES, rather than focusing only on the operational level.
Answer:
"Indexing a document" here should be understood as the process of writing a document into ES and creating the index for it.
Document writes include single-document writes and bulk writes; only the single-document write flow is explained here.
Remember this figure from the official documentation.
Step one: the client sends a write request to a node in the cluster. (If no routing/coordinating node is specified, the node that receives the request acts as the routing node.)
Step two: after node 1 receives the request, it uses the document's _id to determine that the document belongs to shard 0. The request is forwarded to another node, say node 3, because the primary of shard 0 is allocated to node 3.
Step three: node 3 performs the write on the primary shard. If it succeeds, node 3 forwards the request in parallel to the replica shards on node 1 and node 2 and waits for the results. Once all replica shards report success, node 3 reports success to the coordinating node (node 1), and node 1 reports the successful write to the requesting client.
If the interviewer follows up: in step two, how is the document's shard obtained?
Answer: through the routing algorithm, which is the process of computing the target shard id from the routing value and the document id.
shard = hash(_routing) % (num_of_primary_shards)
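The routing formula can be tried out with a short sketch. `zlib.crc32` is only a stand-in hash chosen so the example is deterministic and self-contained (Elasticsearch itself uses a Murmur3 hash), and `route_to_shard` is a hypothetical helper name.

```python
import zlib

def route_to_shard(routing, num_primary_shards):
    """Pick a primary shard from the routing value (the _id by default)."""
    # Stand-in hash for this sketch; not the hash function ES uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

shard = route_to_shard("doc-42", 5)
```

The same routing value always maps to the same shard, but only while the number of primary shards stays the same, since that number is the modulus of the formula.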

6. Describe in detail the Elasticsearch search process.

Interviewer: wants to know whether you understand the underlying principles of ES search, rather than focusing only on the operational level.
Answer:
A search is broken down into two phases, "query then fetch".
The purpose of the query phase: locate the positions of the matches, but do not fetch the documents yet.
The steps break down as follows:
(1) Suppose an index has 5 primaries + 1 replica, 10 shards in total; a request hits one copy (the primary or the replica) of each shard.
(2) Each shard queries locally and puts its results into a local ordered priority queue.
(3) The results of step (2) are sent to the coordinating node, which produces a globally ordered list.
The purpose of the fetch phase: retrieve the data.
The routing node fetches all the documents and returns them to the client.
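The two phases above can be sketched with in-memory "shards". This is a toy model under assumptions: each shard is a dict mapping a document id to a precomputed `(score, source)` pair, and the function names are made up for illustration.

```python
import heapq

def query_phase(shard_docs, size):
    """Each shard returns only (score, doc_id) for its local top hits."""
    return heapq.nlargest(
        size,
        ((score, doc_id) for doc_id, (score, _source) in shard_docs.items()),
    )

def search(shards, size):
    """Coordinator: merge per-shard hits, then fetch the winning documents."""
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, size):
            candidates.append((score, doc_id, shard_no))
    top = heapq.nlargest(size, candidates)  # the global ordered list
    # Fetch phase: only now are the document bodies read from the shards.
    return [shards[shard_no][doc_id][1] for _score, doc_id, shard_no in top]

# Hypothetical data: two shards, scores already computed.
shards = [
    {"a": (0.9, {"title": "A"}), "b": (0.2, {"title": "B"})},
    {"c": (0.7, {"title": "C"}), "d": (0.8, {"title": "D"})},
]
results = search(shards, 2)
```

Only ids and scores travel to the coordinator in the query phase; document bodies are read for the final top hits alone.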

7. When deploying Elasticsearch, which optimizations do you apply to the Linux settings?

Interviewer: wants to gauge your ability to operate and maintain an ES cluster.
Answer:
(1) Disable swap;
(2) Set the heap size to min(node memory / 2, 32 GB);
(3) Raise the maximum number of file handles;
(4) Adjust the thread pool and queue sizes according to business needs;
(5) Disk RAID mode: where conditions allow, use RAID 10 for storage, which improves single-node performance and guards against storage failure on a single node.
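The heap rule in item (2) is simple arithmetic, sketched below. The function names are hypothetical; `-Xms` and `-Xmx` are the standard JVM flags for initial and maximum heap size, set to the same value here as an assumption of this sketch.

```python
def recommended_heap_gb(node_memory_gb, cap_gb=32):
    """Heap size from the rule above: min(node memory / 2, 32 GB)."""
    return min(node_memory_gb / 2, cap_gb)

def jvm_heap_flags(node_memory_gb):
    """JVM flags for the computed heap, rounded down to whole GB."""
    heap = int(recommended_heap_gb(node_memory_gb))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]

flags = jvm_heap_flags(16)
```

The other half of the memory is left to the operating system, whose filesystem cache Lucene relies on.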

8. What is the internal structure of Lucene?

Interviewer: wants to gauge the breadth and depth of your knowledge.
Answer:
Lucene has two processes, indexing and searching, which cover three key points: index creation, indexing, and searching. You can expand on the topic along this outline.

9. How does Elasticsearch implement master election?

(1) Master election in Elasticsearch is handled by the ZenDiscovery module, which mainly consists of two parts: Ping (nodes discover each other via RPC) and Unicast (the unicast module holds a host list that controls which nodes need to be pinged);
(2) All master-eligible nodes (node.master: true) are sorted lexicographically by nodeId. In each election, every node sorts the nodes it knows about and picks the first one (position 0), provisionally treating it as the master.
(3) If the number of votes for a node reaches a certain value (n/2 + 1, where n is the number of master-eligible nodes) and the node also votes for itself, that node becomes the master. Otherwise the election is repeated until these conditions are met.
(4) Additional note: the master node's responsibilities mainly include managing the cluster, nodes, and indices; it is not responsible for document-level management. Data nodes can turn off the HTTP function.
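Steps (2) and (3) above can be sketched as a vote tally. This is a simplified model of the rule as described in the text, not the ZenDiscovery implementation; `views` and `elect` are hypothetical names, and each node's view is assumed to include itself.

```python
from collections import Counter

def elect(views):
    """views maps each master-eligible node id to the node ids it can see.

    Every node votes for the first id in its sorted view; a node wins when
    it has at least n // 2 + 1 votes and also voted for itself.
    """
    quorum = len(views) // 2 + 1
    votes = Counter(min(visible) for visible in views.values())
    for node_id, count in votes.items():
        if count >= quorum and node_id in views and min(views[node_id]) == node_id:
            return node_id
    return None  # no winner yet: the election is retried

full_view = {"n1", "n2", "n3"}
winner = elect({"n1": full_view, "n2": full_view, "n3": full_view})
isolated = elect({"n1": {"n1"}, "n2": {"n2"}, "n3": {"n3"}})
```

When every node sees the same set of nodes, they all vote for the same first id; when the nodes cannot see each other, no node reaches the quorum and the election has to be repeated.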

10. Among the nodes in Elasticsearch (say 20 in total), 10 elect one master and the other 10 elect a different master. What do you do?

(1) When the cluster has at least 3 master-eligible nodes, the split-brain problem can be solved by setting the minimum number of votes (discovery.zen.minimum_master_nodes) to more than half of all master-eligible nodes;
(2) When there are only two candidates, only one of them can be configured as master-eligible and the other acts as a data node, which avoids the split-brain problem.
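The majority rule in item (1) can be checked with a little arithmetic. The sketch assumes that all 20 nodes in the question are master-eligible; the function names are hypothetical.

```python
def minimum_master_nodes(master_eligible_count):
    """Majority of the master-eligible nodes: n // 2 + 1."""
    return master_eligible_count // 2 + 1

def can_elect(partition_size, master_eligible_count):
    """A network partition may elect a master only if it holds a majority."""
    return partition_size >= minimum_master_nodes(master_eligible_count)

quorum = minimum_master_nodes(20)
```

With 20 master-eligible nodes the quorum is 11, so neither half of a 10/10 split can elect a master, and two masters cannot coexist.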

11. When a client connects to the cluster, how does it choose a specific node to execute the request?

TransportClient uses the transport module to connect remotely to an Elasticsearch cluster. It does not join the cluster; it simply obtains one or more initial transport addresses and communicates with them in round-robin fashion.
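The round-robin behavior described above can be sketched in a few lines. This only illustrates the selection strategy and is not the Java TransportClient API; the class name and the addresses are made up for illustration.

```python
from itertools import cycle

class RoundRobinAddresses:
    """Hand out transport addresses in turn, one per request."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one transport address is required")
        self._next = cycle(list(addresses))

    def pick(self):
        return next(self._next)

# Hypothetical transport addresses.
picker = RoundRobinAddresses(["10.0.0.1:9300", "10.0.0.2:9300", "10.0.0.3:9300"])
order = [picker.pick() for _ in range(4)]
```

After the last address has been used, the next request goes back to the first one, so requests are spread evenly over the configured nodes.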

12. Describe in detail the process by which Elasticsearch indexes a document.

By default the coordinating node uses the document ID in the calculation (routing is also supported) to find the appropriate shard to route to.
shard = hash(document_id) % (num_of_primary_shards)
(1) When the node holding the shard receives the request from the coordinating node, it writes the request to the Memory Buffer, which is then written to the Filesystem Cache periodically (every 1 second by default). This move from the Memory Buffer to the Filesystem Cache is called refresh;
(2) In some cases the data in the Memory Buffer and the Filesystem Cache may be lost, so ES relies on the translog mechanism to ensure data reliability. When a request is received, it is also written to the translog; once the data in the Filesystem Cache has been written to disk, the translog is cleared. This process is called flush;
(3) During a flush, the memory buffer is cleared and its contents are written to a new segment; the segment is fsynced, a new commit point is created, and the contents are flushed to disk; the old translog is deleted and a new translog is started.
(4) A flush is triggered on a timer (every 30 minutes by default) or when the translog becomes too large (512 MB by default).
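The refresh and flush steps above can be modeled with a toy class. It is a sketch for following the data flow only: plain lists stand in for the memory buffer, translog, filesystem cache, and disk, and the class and method names are hypothetical.

```python
class ShardWriter:
    """Toy model of one shard's write path: buffer, translog, segments."""

    def __init__(self):
        self.memory_buffer = []  # written but not yet searchable
        self.translog = []       # replayed after a crash
        self.fs_cache = []       # searchable segments, not yet fsynced
        self.disk = []           # committed segments

    def index(self, doc):
        """A write goes to both the memory buffer and the translog."""
        self.memory_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Move buffered docs into a new searchable segment."""
        if self.memory_buffer:
            self.fs_cache.append(list(self.memory_buffer))
            self.memory_buffer.clear()

    def flush(self):
        """Commit segments to disk and start a new translog."""
        self.refresh()
        self.disk.extend(self.fs_cache)
        self.fs_cache.clear()
        self.translog = []

    def searchable_docs(self):
        return [doc for seg in self.disk + self.fs_cache for doc in seg]

writer = ShardWriter()
writer.index({"id": 1})
before_refresh = writer.searchable_docs()
writer.refresh()
after_refresh = writer.searchable_docs()
```

A document becomes searchable after the refresh, while the translog keeps its copy until the flush has committed the segment to disk.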

Supplement: about Lucene segments:
(1) A Lucene index consists of multiple segments, and each segment is itself a fully functional inverted index.
(2) Segments are immutable, which allows Lucene to add new documents to the index incrementally instead of rebuilding the index from scratch.
(3) Every search request searches all the segments in the index, and each segment consumes CPU cycles, file handles, and memory. This means that the more segments there are, the lower the search performance.
(4) To solve this problem, Elasticsearch merges small segments into a larger segment, commits the new merged segment to disk, and deletes the old small segments.
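Points (3) and (4) above can be illustrated with a toy model in which each segment is a small inverted index (a dict from term to document ids). The function names and sample data are assumptions for illustration; real Lucene merges also handle deleted documents and much more.

```python
def search_segments(segments, term):
    """A search visits every segment, so cost grows with the segment count."""
    return sorted(doc_id for seg in segments for doc_id in seg.get(term, []))

def merge_segments(segments):
    """Combine several small segments into one larger segment."""
    merged = {}
    for seg in segments:
        for term, doc_ids in seg.items():
            merged.setdefault(term, []).extend(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}

# Hypothetical segments: three small inverted indexes.
segments = [{"es": [1], "lucene": [1]}, {"es": [2]}, {"lucene": [3]}]
merged = merge_segments(segments)
```

Searching the single merged segment returns the same hits as searching the three small ones, with only one segment to visit.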


Origin juejin.im/post/5e0348d8e51d45582512a59f