Demystifying Elasticsearch: an in-depth look at the search and analytics engine | JD Cloud technical team

Author: JD Insurance Guan Shunli

Opening

Recently we used Elasticsearch to build a user portrait system, realizing the data middle-platform capability of a DMP. Along the way we surveyed the architecture choices of competing products and revisited the internals of Redis. This article is a summary and review of ES. I have not seen anyone on the Internet use Elasticsearch to build a portrait system, so this is a first attempt.

With that background, let's first think about one thing: when a memory-based system is used as a database, what are its strengths? What are its pain points?

1. Principle

This is not the whole picture; we will only cover communication, memory, and persistence.

Communication

The smallest ES cluster unit is three nodes: a master combined with two more master-eligible nodes guarantees high availability, and this is the basis of clustering. So what do nodes use for RPC communication? Netty, naturally: ES implements Netty4Transport, its communication layer, on top of Netty. After the Transport is initialized a Bootstrap is created, and receiving and forwarding are completed through MessageChannelHandler. ES distinguishes server from client, as shown in Figure 1. Serialization uses JSON. In its RPC design ES favors ease of use, generality, and readability over raw performance.

Figure 1

With Netty handling the transport, ES can comfortably use JSON serialization.

Memory

Figure 2

ES memory is divided into two parts: on heap and off heap. The on-heap part is managed by the ES JVM; the off-heap part is managed by Lucene. The on-heap part is further divided into a portion that GC can reclaim and a portion it cannot.

The reclaimable part is the index buffer, which stores newly indexed documents. When it fills up, the buffer's documents are written to a disk segment. It is shared by all shards on the node.

The non-reclaimable parts are the node query cache, shard request cache, field data cache, and segments cache.

The node query cache is node-level: filter results are cached on each node and shared by all of its shards. It uses a bitset data structure (described as an optimized, Bloom-filter-like form) with scoring turned off. It uses an LRU eviction policy, and GC cannot reclaim it.

The shard request cache is shard-level: every shard has one. By default, only requests whose result size is 0 are cached, so hits are not cached, but hits.total, aggregations, and suggestions are. It can be cleared through the clear-cache API. It uses an LRU eviction policy, and GC cannot reclaim it.
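For example, the shard request cache can be cleared with the clear-cache API (the index name here is hypothetical):

```
POST /portrait_index/_cache/clear?request=true
```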

The field data cache holds data used for aggregation and sorting. Early ES had no doc values, so after aggregation and sorting, fielddata had to be cached in memory to avoid disk IO. If there was not enough memory to hold it, ES would keep loading data from disk and evicting old entries, causing disk IO and triggering GC. So versions after 2.x introduced the doc values feature: the structures are built at index time, stored on disk, and accessed through memory-mapped files. If you only care about hits.total, return only doc ids and turn off doc values. Doc values support keyword and numeric types; text fields still create fielddata.
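As a sketch of the mapping side (index and field names are hypothetical): keyword and numeric fields carry doc values by default, and they can be disabled per field when a column will never be aggregated or sorted on:

```json
PUT portrait_index
{
  "mappings": {
    "properties": {
      "age_tag":  { "type": "keyword" },
      "city_tag": { "type": "keyword", "doc_values": false }
    }
  }
}
```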

The segments cache speeds up queries, and the FST used to reside in heap memory forever. An FST can be understood as a prefix tree that accelerates term lookup. But!! Starting with ES 7.3, the FST was handed over to off-heap memory, which lets a node support more data. The FST also has a corresponding persistent file on disk.

Off heap is Segments Memory, the off-heap memory used by Lucene. That is why it is recommended to leave at least half of the machine's memory to Lucene.

Starting with ES 7.3, the tip (terms index) is loaded through mmap and handed over to the system page cache to manage. Besides tip, the nvd (norms), dvd (doc values), tim (term dictionary), and cfs (compound) files are all loaded via mmap; the rest use NIO. The effect of moving the tip off heap is that JVM memory usage drops by about 78%. You can use the _cat/segments API to view the memory usage reported as segments.memory.
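For instance, a request like the following lists per-segment memory (column names may vary slightly across versions):

```
GET _cat/segments?v&h=index,segment,size,size.memory
```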

Because off-heap memory is managed by the operating system's page cache, if pages get reclaimed, FST lookups will involve disk IO, which badly hurts query efficiency. For reference, Linux page cache reclamation uses a two-list (active/inactive) strategy.

Persistence

ES persistence has two parts. One is snapshot-like: the segments in the filesystem cache are fsynced to disk. The other is the translog, which appends the operation log every second and, by default, is flushed to disk every 30 minutes. ES persistence is very similar to Redis's RDB+AOF model, as shown below.
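The timings above map to per-index settings that can be tuned (a sketch; the index name is hypothetical and the exact defaults depend on the ES version):

```json
PUT portrait_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s",
  "index.translog.flush_threshold_size": "512mb"
}
```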

Figure 3

The figure above shows the complete write path. On disk, data is also recorded in segments, which again is very similar to Redis. But the internal mechanism does not use COW (copy-on-write), which is why the load maxes out when queries and writes run in parallel.

Summary

The memory and disk design of ES is very clever. mmap is used for zero-copy, mapping disk data into off-heap memory, i.e., to Lucene. To speed up data access, each ES segment keeps some index data resident off heap; so the more segments there are, the more off-heap memory is occupied, and none of it can be reclaimed by GC!

Combining the two points above, it is clear why ES consumes so much memory.

2. Application

The user portrait system needs to solve the following difficulties.

1. Crowd estimation: select a group of people based on tags, for example 20-25-year-old men who like e-commerce and social apps: 20-25 years old ∩ e-commerce/social ∩ male. The number of clientIds matching the features is computed through AND/OR/NOT operations; the result is a crowd.

On top of crowds we can also do intersection and difference operations between crowds. For example: someone who is both a 20-25-year-old man who likes e-commerce and social apps, and a man in Beijing who likes lifting weights: (20-25 years old ∩ e-commerce/social ∩ male) ∩ (Beijing ∩ weightlifting ∩ male). Such recursive combinations must return the estimated head count within seconds over a portrait library of more than 1.7 billion records.
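The set algebra behind crowd estimation can be sketched with plain bitsets, which is conceptually what ES's filter cache keeps per segment (the doc ids below are made up for illustration):

```java
import java.util.BitSet;

public class CrowdEstimate {
    // Each tag's matching clientIds are represented as a bitset over doc ids,
    // conceptually what the ES filter cache holds per segment.
    static BitSet tag(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) b.set(id);
        return b;
    }

    public static void main(String[] args) {
        BitSet age2025 = tag(1, 2, 3, 5, 8);
        BitSet eSocial = tag(2, 3, 5, 9);
        BitSet male    = tag(1, 2, 5, 8, 9);

        // (20-25 years old ∩ e-commerce/social ∩ male)
        BitSet crowd = (BitSet) age2025.clone();
        crowd.and(eSocial);
        crowd.and(male);

        // cardinality() is the estimated crowd size
        System.out.println(crowd.cardinality()); // 2
    }
}
```

Intersections over bitsets are word-at-a-time AND operations, which is why filter-style queries stay fast even as the number of labels grows.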

2. Crowd package selection: materialize the crowd circled above into a package. Builds must complete at the minute level.

3. Crowd package membership: determine whether a clientId exists in several crowd packages. The result must return within 10 milliseconds.

We first try to solve all of the above with ES.

For crowd estimation, the easiest solution to think of is doing the logic operations in the application server's memory. But selecting tens of millions of people and returning within seconds is very costly on the server side. Instead, push the computing pressure down to the ES storage side, just like querying a database: use one statement to fetch exactly the data we want.

For example, in MySQL:

```sql
select a.client_id from a where a.client_id in (select b.client_id from b);
```

The corresponding ES DSL is roughly:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              { "term": { "a9aa8uk0": { "value": "age18-24", "boost": 1.0 } } },
              { "term": { "a9ajq480": { "value": "male", "boost": 1.0 } } }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
          }
        },
        { "bool": { "adjust_pure_negative": true, "boost": 1.0 } }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
```

This way, ES's high retrieval performance meets the business need: no matter how many crowds there are, or how many labels each crowd contains, everything is compiled into a single DSL statement, ensuring results return within seconds.

Using the officially recommended RestHighLevelClient, there are three implementation approaches: concatenating the JSON string by hand, calling the API to assemble the string, or the third way, BoolQueryBuilder, which I use and which is the most elegant. It provides filter, must, should, and mustNot methods, e.g.:

     /**
     * Adds a query that <b>must not</b> appear in the matching documents.
     * No {@code null} value allowed.
     */
    public BoolQueryBuilder mustNot(QueryBuilder queryBuilder) {
        if (queryBuilder == null) {
            throw new IllegalArgumentException("inner bool query clause cannot be null");
        }
        mustNotClauses.add(queryBuilder);
        return this;
    }

    /**
     * Gets the queries that <b>must not</b> appear in the matching documents.
     */
    public List<QueryBuilder> mustNot() {
        return this.mustNotClauses;
    }

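For example, the nested DSL shown earlier could be assembled like this (a sketch assuming the 7.x RestHighLevelClient; the hashed field names come from the example above and the index name is hypothetical):

```java
// Requires org.elasticsearch.client:elasticsearch-rest-high-level-client (7.x).
BoolQueryBuilder crowd = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termQuery("a9aa8uk0", "age18-24"))
        .filter(QueryBuilders.termQuery("a9ajq480", "male"));

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(crowd)
        .size(0)               // crowd estimation only needs hits.total
        .trackTotalHits(true); // do not cap the total count at 10,000

SearchRequest request = new SearchRequest("portrait_index").source(source);
// SearchResponse resp = client.search(request, RequestOptions.DEFAULT);
```

Note that filter (rather than must) skips scoring, which matches the filter/bitset optimization discussed later.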
Building queries through the API keeps the code far cleaner and more maintainable than string concatenation.

Building the crowd package: the largest package we have circled so far holds more than 70 million clientIds. To finish the build at the minute level (70 million records within 35 minutes under our constraints), two things need attention: ES deep pagination and batch writing.

ES has three pagination methods, two of which handle deep paging; both of the latter scroll with a cursor: scroll and search_after.

Scroll must maintain cursor state: every query creates a unique scroll id that has to be carried on each subsequent request, and multiple threads mean maintaining multiple cursors. search_after is similar to scroll, but it is stateless: each request is resolved against the latest version of the searcher, so its sort order can change while scrolling. The principle of scroll is to keep the doc id result set in the context of the coordinating node and fetch it in batches on each scroll; within each shard, results only need to be returned in order according to size.
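A search_after request looks roughly like this (index, field names, and the sort key are hypothetical; the sort value of the last hit from the previous page is passed back on the next request):

```json
GET portrait_index/_search
{
  "size": 10000,
  "query": { "term": { "a9ajq480": "male" } },
  "sort": [ { "clientId": "asc" } ],
  "search_after": ["c1000345"]
}
```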

When writing, use a thread pool; watch the size of its blocking queue and choose an appropriate rejection policy (a policy that throws exceptions is not what we want here). If the batch is also written back into ES (for example, with read-write separation), then besides multithreading, the refresh policy can be tuned to optimize writes.
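A minimal sketch of such a write pool (the task body stands in for a bulk request; the queue size and CallerRunsPolicy are assumptions to tune — CallerRunsPolicy applies back-pressure instead of throwing when the queue is full):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BulkWritePool {
    // Bounded pool for batch writes: the queue caps memory use, and
    // CallerRunsPolicy slows the producer down rather than rejecting work.
    static ThreadPoolExecutor newWritePool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return new ThreadPoolExecutor(
                cores, cores,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = newWritePool();
        AtomicInteger written = new AtomicInteger();
        for (int batch = 0; batch < 100; batch++) {
            pool.execute(written::incrementAndGet); // stand-in for one bulk request
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(written.get()); // 100
    }
}
```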

Since the whole business chain of the membership-check interface is very long, the upstream service set a 10 ms circuit-breaker timeout for this lookup. So the optimization targets the ES (or Redis) query itself, which after all does no extra logic processing. After using a thread pool to optimize this IO-bound work, latency reached about 1 ms, with tp99 peaking at 4 ms.

3. Optimization, bottlenecks and solutions

The above is how we used ES to meet the business needs. Response times also had to be optimized, and along the way we hit ES's bottlenecks.

1. The first is mapping optimization. In the portrait mapping, fields are typed keyword; in the crowd-package mapping, doc values are turned off on the fields. The portrait only needs exact matching, and the package check only needs a yes/no answer, not the stored values.

2. On the ES API side, crowd computation uses filter to skip scoring; inside filter, the bitset (Bloom-style) structure is used, but the data needs to be warmed up before it pays off.

3. When writing, do not use too many threads; keep the count about the same as the number of cores. Also adjust the refresh policy: during a build, set index.refresh_interval to -1 and refresh manually. Note that pausing refresh increases heap memory usage, so the refresh frequency must be tuned to the business.

4. To build a very large crowd package, split the index into several; spreading the storage improves response time.

5. Dozens of crowd packages can currently be supported this way. If that grows to several hundred, bitmaps should be used to build and store the packages. ES is excellent at retrieval, but parallel writes and queries are not its strength: crowd-package data changes daily, and during those windows ES memory and disk IO spike. Hundreds of packages could instead be stored in Redis, or MongoDB could be chosen to store the package data.
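The refresh_interval tuning mentioned above can be sketched as follows (the index name is hypothetical; 1s shown here is the usual default being restored after the build):

```json
PUT crowd_package_index/_settings
{ "index.refresh_interval": "-1" }
```

and after the build completes:

```json
PUT crowd_package_index/_settings
{ "index.refresh_interval": "1s" }
```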

4. Summary

The above is how we used Elasticsearch to solve our business difficulties. We also found that its persistence does not use COW (copy-on-write), so retrieval performance degrades during real-time writes.

The strength of using an in-memory system as a data source is obvious: retrieval is fast! Especially in real-time scenarios it is a sharp weapon. The pain point is just as obvious: real-time writing reduces retrieval performance. Of course, read-write separation, index splitting, and similar solutions can mitigate this.

Besides Elasticsearch, we could also choose ClickHouse (CK also supports the bitmap data structure), or even go to Pilosa, which is a bitmap database.

References

Construction practice of Keike DMP platform

Mapping parameters | Elasticsearch Reference [7.10] | Elastic

The offheap principle of Elasticsearch 7.3



Origin my.oschina.net/u/4090830/blog/8704613