2019 Common Elasticsearch Interview Questions with Detailed Answers (Part 2)

Foreword

1. Elasticsearch is a distributed, RESTful search and data analytics engine.

(1) Query: Elasticsearch lets you perform and combine many types of searches (structured, unstructured, geo, metric) in whatever way you want.
(2) Analytics: finding the ten best-matching documents for a query is one thing. But how do you make sense of, say, billions of lines of logs? Elasticsearch aggregations let you zoom out and explore trends and patterns in your data.
(3) Speed: Elasticsearch is fast. Really, really fast.
(4) Scalability: it runs on a laptop, and it also runs on hundreds of servers carrying petabytes of data.
(5) Resilience: Elasticsearch runs in distributed environments and was designed with that in mind from the start.
(6) Flexibility: many use cases. Numbers, text, geo, structured, unstructured: all data types are welcome.
(7) Hadoop & Spark: Elasticsearch-Hadoop integrates Elasticsearch with the Hadoop ecosystem.

2. Elasticsearch is a highly scalable open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time.

Here are some use cases for Elasticsearch:
(1) You run an online store and let your customers search the products you sell. In this case, you can use Elasticsearch to store the entire product catalog and inventory, and to provide search and autocomplete suggestions.
(2) You want to collect log or transaction data and analyze and mine it for trends, statistics, summaries, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse the data, then have Logstash feed it into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine whatever information interests you.
(3) You run a price-alerting platform that lets customers specify rules such as: "I am interested in buying a particular electronic device, and I want to be notified if any vendor offers it for less than $X within the next month." In this case, you can scrape vendor prices, push them into Elasticsearch, and use its reverse search (Percolator) feature to match prices against customer queries, finally pushing the matching alerts to customers.
(4) You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions of large amounts of data (think millions or billions of records). In this case, you can use Elasticsearch to store the data and Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards visualizing whatever aspects of the data matter to you. You can also use Elasticsearch's aggregations to run complex business-intelligence queries against the data.

Elasticsearch interview questions

1. Describe in detail how Elasticsearch updates and deletes documents.
2. Describe in detail how an Elasticsearch search executes.
3. In Elasticsearch, how is the inverted index for a term located?
4. What Linux settings do you optimize when deploying Elasticsearch?
5. What should you watch out for regarding GC when using Elasticsearch?
6. How does Elasticsearch implement aggregations on very large data sets (hundreds of millions of documents)?
7. How does Elasticsearch ensure read and write consistency under concurrency?
8. How do you monitor the status of an Elasticsearch cluster?
9. Describe the overall technical architecture of your e-commerce search.
10. Tell us about your personalized search solution.
11. Do you understand tries (prefix trees)?
12. How is spelling correction implemented?

1. Describe in detail how Elasticsearch updates and deletes documents.

(1) Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place;
(2) every segment on disk has a corresponding .del file. When a delete request arrives, the document is not actually removed; it is marked as deleted in the .del file. It can still match queries, but it is filtered out of the result set. When segments are merged, documents marked as deleted in the .del file are not written into the new segment.
(3) when a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version can still match queries, but it is filtered out of the result set.
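The mechanics above can be sketched with a toy model (illustrative Python, not Elasticsearch internals): segments are immutable, deletes only record tombstones (standing in for the .del files), and a merge is what finally drops tombstoned documents.

```python
class ToyIndex:
    """Toy model of immutable segments with .del-style tombstones.

    An illustration of the mechanism described above, not real
    Elasticsearch code.
    """

    def __init__(self):
        self.segments = []      # each segment: a list of (doc_id, version, body)
        self.deleted = set()    # stands in for the per-segment .del files
        self.versions = {}      # latest live version per doc_id

    def index(self, doc_id, body):
        # An update marks the old version deleted and writes the new
        # version into a new segment; nothing is modified in place.
        version = self.versions.get(doc_id, 0) + 1
        if doc_id in self.versions:
            self.deleted.add((doc_id, self.versions[doc_id]))
        self.versions[doc_id] = version
        self.segments.append([(doc_id, version, body)])
        return version

    def delete(self, doc_id):
        # A delete only marks the document; the segment is untouched.
        if doc_id in self.versions:
            self.deleted.add((doc_id, self.versions[doc_id]))
            del self.versions[doc_id]

    def search(self, doc_id):
        # Old/deleted versions still sit in the segments but are
        # filtered out of the results.
        for segment in self.segments:
            for d, v, body in segment:
                if d == doc_id and (d, v) not in self.deleted:
                    return body
        return None

    def merge(self):
        # Segment merging drops documents marked in the .del files.
        live = [(d, v, b) for seg in self.segments for d, v, b in seg
                if (d, v) not in self.deleted]
        self.segments = [live]
        self.deleted = set()
```

For example, indexing the same doc_id twice leaves two physical copies on "disk" (the old one tombstoned) until `merge()` runs, after which only the latest version remains.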

2. Describe in detail how an Elasticsearch search executes.

(1) Search executes as a two-phase process known as Query Then Fetch;
(2) in the query phase, the query is broadcast to a copy (primary or replica shard) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents of size from + size.
PS: the query phase searches the Filesystem Cache, but some data may still sit in the Memory Buffer, so search is near real-time rather than real-time.
(3) each shard returns the IDs and sort values of all documents in its own priority queue to the coordinating node, which merges them into its own priority queue to produce a globally sorted list of results.
(4) next comes the fetch phase: the coordinating node determines which documents actually need to be retrieved and issues multi-GET requests to the relevant shards. Each shard loads and enriches the documents as needed, then returns them to the coordinating node. Once all documents have been fetched, the coordinating node returns the results to the client.
(5) supplement: with Query Then Fetch, relevance scoring uses only the term and document statistics of each individual shard, which can be inaccurate when the number of documents is small. DFS Query Then Fetch adds a pre-query phase that gathers term and document frequencies across shards; the scores are more accurate, but performance is worse.
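The query-phase merge can be sketched like this (illustrative Python with hypothetical per-shard data): each shard returns only the IDs and sort values of its own top from + size hits, and the coordinating node merges those small queues into one globally sorted page before the fetch phase retrieves the actual documents.

```python
import heapq

def shard_query(shard_docs, frm, size):
    """Query phase: a shard returns only the (id, score) pairs of its
    own top (frm + size) documents, not the documents themselves."""
    return heapq.nlargest(frm + size, shard_docs, key=lambda d: d[1])

def coordinate(shards, frm, size):
    """The coordinating node merges the per-shard priority queues into
    a globally sorted list and keeps just the requested page of IDs."""
    merged = heapq.merge(
        *(shard_query(s, frm, size) for s in shards),
        key=lambda d: d[1], reverse=True)
    page = list(merged)[frm:frm + size]
    return [doc_id for doc_id, score in page]

# Hypothetical per-shard results as (doc_id, score) pairs.
shard_a = [("a1", 0.9), ("a2", 0.4), ("a3", 0.7)]
shard_b = [("b1", 0.8), ("b2", 0.95), ("b3", 0.1)]
print(coordinate([shard_a, shard_b], frm=0, size=3))
# The fetch phase would then multi-GET these IDs from their shards.
```

Note why each shard must return from + size entries rather than size: the coordinating node cannot know in advance how the requested page is distributed across shards.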

3. In Elasticsearch, how is the inverted index for a term located?

(1) Lucene's indexing process follows the standard full-text indexing procedure and writes the inverted index out in Lucene's index file format.
(2) Lucene's search process reads the index information back out of files in that format, then computes a score for each matching document.
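A minimal sketch of the idea (plain Python, vastly simpler than Lucene's actual file format): indexing builds a term-to-postings mapping, and searching reads the postings back to find candidate documents and score them.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Indexing: map each term to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search: look up each query term's postings, then score each
    candidate document by how many query terms it contains."""
    terms = query.lower().split()
    candidates = set().union(*(index.get(t, set()) for t in terms))
    scores = {d: sum(d in index.get(t, set()) for t in terms)
              for d in candidates}
    # Best score first; ties broken by doc ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
idx = build_inverted_index(docs)
print(search(idx, "quick dog"))  # doc 3 matches both terms, so it ranks first
```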

4. What Linux settings do you optimize when deploying Elasticsearch?

(1) Machines with 64 GB of memory are ideal, but 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive.
(2) If you have to choose between faster CPUs and more cores, choose more cores. The extra concurrency that multiple cores provide far outweighs a slightly faster clock frequency.
(3) If you can afford SSDs, they far outperform any rotating media. SSD-backed nodes see improvements in both query and indexing performance. If you can afford them, SSDs are a good choice.
(4) Avoid clusters that span multiple data centers, even if those data centers are close together. Absolutely avoid clusters that span large geographic distances.
(5) Make sure the JVM running your application server is exactly the same as the JVM running Elasticsearch. In several places, Elasticsearch uses Java's native serialization.
(6) By setting gateway.recover_after_nodes, gateway.expected_nodes, and gateway.recover_after_time, you can avoid excessive shard swapping when the cluster restarts, which can shrink data recovery time from hours to seconds.
(7) Elasticsearch is configured to use unicast discovery by default, to prevent nodes from accidentally joining a cluster. Only nodes running on the same machine form a cluster automatically. Prefer unicast over multicast.
(8) Do not arbitrarily change the garbage collector (CMS by default) or the sizes of the various thread pools.
(9) Give half your memory (or less) to the Elasticsearch heap, but no more than 32 GB, set via the ES_HEAP_SIZE environment variable; the rest is left for Lucene to use through the filesystem cache.
(10) Memory being swapped to disk is fatal to server performance. If memory gets swapped out, an operation that took 100 microseconds can take 10 milliseconds. Now add up all of those 10-millisecond delays, and it is easy to see how terrible swapping is for performance.
(11) Lucene uses a very large number of files. At the same time, Elasticsearch uses a large number of sockets to communicate between nodes and with HTTP clients. All of this requires enough file descriptors. Increase your file descriptor limit to a large value, such as 64,000.
Supplement: ways to improve indexing performance
(1) Use bulk requests and tune their size: 5-15 MB of data per bulk request is a good starting point.
(2) Storage: use SSDs.
(3) Segments and merging: Elasticsearch's default merge throttle is 20 MB/s, which should be a good setting for spinning disks. If you use SSDs, consider raising it to 100-200 MB/s. If you are doing a bulk import and do not care about search at all, you can disable merge throttling entirely. You can also raise index.translog.flush_threshold_size from its default of 512 MB to something larger, such as 1 GB, which lets larger segments accumulate in the translog before a flush is triggered.
(4) If your search results do not need near-real-time freshness, consider raising each index's index.refresh_interval to 30s.
(5) If you are doing a large bulk import, consider turning off replicas by setting index.number_of_replicas: 0.
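For example, points (3)-(5) could be applied together through an index's settings endpoint (a PUT to /my_index/_settings, where my_index is a placeholder; exact setting names vary across Elasticsearch versions, so treat this as a sketch rather than a copy-paste recipe):

```json
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 0,
    "translog": { "flush_threshold_size": "1gb" }
  }
}
```

After the bulk import finishes, remember to restore refresh_interval and number_of_replicas to production values.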

5. What should you watch out for regarding GC when using Elasticsearch?

(1) The term dictionary of the inverted index needs to stay resident in memory and cannot be GC'd, so the growth trend of segment memory on data nodes needs to be monitored.
(2) Know all the caches: field cache, filter cache, indexing cache, bulk queue, and so on. Set reasonable sizes for them, and check heap adequacy against the worst case: when all the caches are full, is there still heap space left for other tasks? Avoid "self-deceiving" ways of freeing memory such as clear cache.
(3) Avoid searches and aggregations that return very large result sets. For scenarios that genuinely need to pull large amounts of data, use the scan & scroll API instead.
(4) Cluster stats are held resident in memory and do not scale horizontally; a very large cluster can be split into several clusters connected through tribe nodes.
(5) To know whether the heap is sufficient, you must combine the actual usage scenario with continuous monitoring of the cluster's heap usage.
(6) Based on the monitoring data, understand the memory requirements and configure the various circuit breakers sensibly to minimize the risk of running out of memory.

6. How does Elasticsearch implement aggregations on very large data sets (hundreds of millions of documents)?

Elasticsearch provides the cardinality metric aggregation, which is approximate. It computes the cardinality of a field, i.e., the number of distinct (unique) values, and is based on the HLL (HyperLogLog) algorithm: HLL hashes the input and estimates the cardinality probabilistically from the bits of the hash values. Its characteristics are: configurable precision that controls the trade-off between accuracy and memory (more accuracy = more memory); very high accuracy on small data sets; and a fixed, configurable upper bound on memory. Whether there are thousands or billions of unique values, memory usage depends only on the precision you configure.
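The mechanism can be sketched in a few lines of Python (a toy HyperLogLog, not Elasticsearch's implementation): each value is hashed, the low bits of the hash pick a register, and each register remembers the longest run of leading zeros it has seen; a harmonic mean over the registers yields the estimate. Memory is fixed by the register count, no matter how many distinct values stream through.

```python
import hashlib
import math

def toy_hll_estimate(values, p=8):
    """Toy HyperLogLog: estimate the distinct count using 2**p registers.
    Memory stays fixed regardless of how many values are seen."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Deterministic 64-bit hash of the value.
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)                    # low p bits choose a register
        w = h >> p                           # remaining 64-p bits supply the pattern
        rank = (64 - p) - w.bit_length() + 1 # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction via linear counting.
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)
    return estimate

# 20,000 values but only 5,000 distinct ones; the error is typically a few percent.
est = toy_hll_estimate(i % 5000 for i in range(20000))
print(round(est))
```

In Elasticsearch itself this is simply the cardinality aggregation; its precision_threshold parameter is what trades memory for accuracy.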

7. How does Elasticsearch ensure read and write consistency under concurrency?

(1) Optimistic concurrency control via version numbers can be used to ensure that a new version is never overwritten by an old one; the specific conflicts are left to the application layer to handle;
(2) for write operations, the consistency level supports quorum/one/all, defaulting to quorum, i.e., write operations are allowed only when a majority of the shards are available. But even when a majority is available, a write to a replica can still fail for reasons such as network failure; in that case the replica is considered failed, and the shard is rebuilt on a different node.
(3) for read operations, replication defaults to sync, so a write returns only after both the primary and replica shards have completed; if replication is set to async, you can set the search request parameter _preference to primary so that the query hits the primary shard, ensuring that the document returned is the latest version.
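Point (1), version-based optimistic concurrency control, can be sketched like this (toy Python, not the Elasticsearch API): a write carries the version it was based on and is rejected if the stored version has moved on, leaving the conflict for the application to resolve.

```python
class VersionConflict(Exception):
    pass

class ToyStore:
    """Toy optimistic concurrency control with per-document version numbers."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self.docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        # Reject writes based on a stale version, so an old version can
        # never silently overwrite a newer one.
        if expected_version is not None and expected_version != current:
            raise VersionConflict(
                f"expected version {expected_version}, current is {current}")
        self.docs[doc_id] = (current + 1, body)
        return current + 1

store = ToyStore()
v1 = store.put("p1", {"price": 10})                       # version 1
v2 = store.put("p1", {"price": 12}, expected_version=v1)  # ok, version 2
try:
    store.put("p1", {"price": 11}, expected_version=v1)   # stale: conflict
except VersionConflict:
    print("conflict: reread and retry at the application layer")
```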

8. How do you monitor the status of an Elasticsearch cluster?

Marvel makes it easy to monitor Elasticsearch through Kibana. You can view your cluster's health and performance in real time, and also analyze past cluster, index, and node metrics.

9. Describe the overall technical architecture of your e-commerce search.

10. Tell us about your personalized search solution.

Personalized search based on word2vec and Elasticsearch:
(1) Based on word2vec, an Elasticsearch plugin, and custom scoring scripts, we implemented a personalized search service; compared with the original implementation, the new version showed a significant improvement in CTR and conversion rate;
(2) since word2vec also yields a vector for each product, it can be used to recommend similar products;
(3) using word2vec for personalized search or personalized recommendation has certain limitations: it can only handle time-series data such as the user's click history and cannot fully capture the user's preferences, so there is still plenty of room for improvement;

11. Do you understand tries (prefix trees)?

(The original article shows a diagram of common dictionary data structures here.)
The core idea of the trie is to trade space for time, using the common prefixes of strings to reduce query time. It has three basic properties:
1) the root node contains no character; every node except the root contains exactly one character.
2) the string corresponding to a node is the concatenation of the characters along the path from the root to that node.
3) the children of any given node all contain different characters.
(1) As you can see, layer i of a trie over the lowercase alphabet can hold on the order of 26^i nodes. To save space, we can use dynamic linked lists, or use arrays to simulate them dynamically. The space cost is no more than word length × number of words.
(2) Implementation options: allocate an alphabet-sized array of child pointers for each node; hang a linked list of children off each node; or use the left-child right-sibling representation of the tree;
(3) for a Chinese dictionary tree, store each node's children in a hash table; this avoids wasting too much space while keeping the per-character lookup complexity at the O(1) of a hash table.
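A minimal trie with hash-table children, as point (3) suggests (illustrative Python):

```python
class Trie:
    """Prefix tree: the root holds no character; each edge adds one
    character, and a node's string is the path from the root to it."""

    def __init__(self):
        self.children = {}   # char -> Trie; hash table gives O(1) per step
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

t = Trie()
for w in ["book", "books", "boo"]:
    t.insert(w)
print(t.contains("book"), t.contains("bo"), t.starts_with("bo"))
# True False True: "bo" is only a prefix, not an inserted word
```

Shared prefixes are stored once ("boo", "book", and "books" reuse one b-o-o path), which is exactly the space-for-time trade-off described above.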

12. How is spelling correction implemented?

(1) Spelling correction is implemented with edit distance, a standard method that expresses the minimum number of insert, delete, and replace operations needed to transform one string into another;
(2) computing the edit distance: for example, to compute the edit distance between "batyu" and "beauty", create a 7 × 8 table ("batyu" has length 5 and "beauty" has length 6; add 2 to each), then fill in the boundary numbers. Every other cell takes the minimum of the following three values:
if the character at the top of the column equals the character at the left of the row, the number in the upper-left cell; otherwise, the upper-left number + 1;
the number in the cell to the left + 1;
the number in the cell above + 1.
The value finally obtained in the lower-right corner is the edit distance, which is 3.
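The table-filling procedure described above is the standard Levenshtein dynamic program (illustrative Python):

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insert, delete, and
    replace operations turning string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # boundary: delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # boundary: insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if same else 1),  # match / replace
                d[i][j - 1] + 1,                       # insert
                d[i - 1][j] + 1)                       # delete
    return d[rows - 1][cols - 1]

print(edit_distance("batyu", "beauty"))  # 3, the lower-right cell of the table
```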

For spelling correction, consider constructing a metric space (Metric Space), in which the distances between any points satisfy three basic conditions:
d(x, y) = 0: the distance from x to y is 0 if and only if x = y
d(x, y) = d(y, x): the distance from x to y equals the distance from y to x
d(x, y) + d(y, z) >= d(x, z): the triangle inequality
(1) From the triangle inequality it follows that if the query is at distance d from some word B, then any other word within distance n of the query is at distance between d - n and d + n from B.
(2) A BK-tree is constructed as follows: each node can have any number of children, and each edge is labeled with an edit distance. All children connected to a parent by an edge labeled n are at edit distance exactly n from the parent. For example, suppose we have a tree whose parent node is "book" with two children "cake" and "books"; the edge from "book" to "books" is labeled 1 and the edge from "book" to "cake" is labeled 4. After the tree is built from the dictionary, whenever we want to insert a new word, we compute its edit distance to the root and look for an edge labeled d(newword, root). If one exists, we recursively compare against that child, until there is no matching child, at which point we create a new child node and store the new word there. For example, to insert "boo" into the tree above: we check the root, find d("book", "boo") = 1, follow the edge labeled 1 to the child "books", then compute d("books", "boo") = 2; since "books" has no edge labeled 2, the new word "boo" is inserted after "books" with an edge labeled 2.
(3) Querying for similar words works as follows: compute the edit distance d between the query word and the root node, then recursively visit every child along edges labeled d - n through d + n (inclusive). If the distance between the node being examined and the query word is at most n, return that node and keep searching. For example, with the query "cape" and a maximum tolerated distance of 1: first compute the edit distance to the root, d("book", "cape") = 4, then examine children at edit distance 3 through 5 from the root, finding the node "cake"; compute d("cake", "cape") = 1, so "cake" satisfies the condition and is returned; then examine the children of "cake" along edges labeled 0 through 2, finding the nodes "cape" and "cart", of which "cape" also satisfies the condition.
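The construction and query steps above can be sketched as a small BK-tree (illustrative Python; edit_distance is the Levenshtein function from the previous question, written compactly here so the sketch is self-contained):

```python
def edit_distance(a, b):
    # Standard Levenshtein DP with a rolling row.
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(prev + (ca != cb), d[j] + 1, d[j - 1] + 1)
    return d[-1]

class BKTree:
    def __init__(self, word):
        self.word = word
        self.children = {}   # edge label (edit distance) -> subtree

    def insert(self, word):
        d = edit_distance(word, self.word)
        if d in self.children:
            self.children[d].insert(word)    # follow the edge labeled d
        else:
            self.children[d] = BKTree(word)  # no such edge: new child

    def query(self, word, n):
        d = edit_distance(word, self.word)
        matches = [self.word] if d <= n else []
        # Triangle inequality: only children on edges d-n .. d+n can match.
        for label in range(d - n, d + n + 1):
            if label in self.children:
                matches += self.children[label].query(word, n)
        return matches

# The tree from the example above: root "book", then the other words.
tree = BKTree("book")
for w in ["books", "cake", "boo", "cape", "cart"]:
    tree.insert(w)
print(tree.query("cape", 1))  # ['cake', 'cape'], as in the walkthrough
```

The payoff is that a query only computes edit distance along the narrow band of edges allowed by the triangle inequality instead of against every word in the dictionary.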





Origin juejin.im/post/5e0463d9f265da33cf1ae4a1