Elasticsearch series --- paginated search and the deep paging problem

Overview

This article starts from paginated search syntax, briefly discusses the difference in thinking between searching centralized data and searching distributed data, analyzes the deep paging problem, and finally presents a top N paging case.

Paginated search syntax

Elasticsearch implements pagination with two search parameters, from and size:

  • size: the number of results to return, default 10.
  • from: the offset into the result set, i.e. how many results to skip, default 0.

These two parameters mean the same thing as the offset and row count of MySQL's limit clause.

For example, the requests for pages 1 to 3:

GET /music/children/_search?size=10
GET /music/children/_search?size=10&from=10
GET /music/children/_search?size=10&from=20
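
The same parameters can also be sent in the request body; for example, page 2 again:

GET /music/children/_search
{
  "from": 10,
  "size": 10
}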

Differences between distributed and centralized data

Centralized data storage: from the earliest monolithic applications to the early SOA service model, data was mostly stored centrally, landing in relational databases such as MySQL. Even with read/write splitting and multiple database instances deployed as primary and replicas, this is still essentially a centralized storage structure.

Running a paged, sorted or other statistical query against a single primary or replica is conceptually straightforward: all the data sits in one place, so you simply take what you need. A single instance may hit capacity limits, but the worst that happens is a slower result.

The classic distributed storage solution for relational databases is sharding across databases and tables: the data of one logical table is split across different database instances according to some routing logic. When doing statistics, you can no longer look at just one instance.

With distributed data storage, the way we think about search starts to change. In Elasticsearch, for example, an index's data is split across shards, and the shards may be spread over the nodes of the ES cluster. When querying or doing statistical analysis, even though ES encapsulates the technical details very well, we still need to understand that this is a query against distributed storage.

Personally, I think that although mature frameworks, whether relational databases or ES, already encapsulate the differences between distributed and centralized data processing, users still need to understand the change in thinking that distribution brings in order to get correct results.

The deep paging problem

Simply put, deep paging means paging very deep into the result set, such as displaying data from page several hundred. Why is deep paging a problem?

Assume the index holds 20,000 documents in 5 shards, and we send a query with conditions and a sort field. To fetch the first page, each shard returns its top 10 documents, which are aggregated on the coordinating node: 50 in total. The coordinating node re-sorts these 50 documents, discards the trailing 40, keeps only the top 10, and returns them to the client.

What about page 1000?
Following the same routine, would each shard just fetch documents 10,001-10,010, send 50 in total to the coordinating node, and return them to the client?

Wrong. That is not how distributed data is queried. For page 1000, each shard does not fetch only documents 10,001-10,010; instead each shard returns its top 10,010 documents, so the 5 shards send 50,050 documents to the coordinating node. After the coordinating node finishes sorting this combined data, it takes documents 10,001-10,010 and returns those 10 to the client.

All that effort to collect 50,050 documents, of which only 10 actually reach the client; the other 50,040 are thrown away, at a considerable cost in memory.

What about page 10,000? By the same arithmetic, each shard would return roughly 100,010 documents and the coordinating node would have to sort about 500,050 just to hand back 10; the numbers are too painful to look at.

The more shards an index has, the more data must be gathered: the coordinating node handles roughly (from + size) multiplied by the number of shards, so the cost of sorting distributed results climbs steeply with depth, and the two dimensions that matter most are paging depth and shard count. A heavyweight query like this can easily drag down the entire Elasticsearch cluster, which is why search engines generally refuse to return more than 1,000 results for any query.
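
Elasticsearch itself guards against runaway deep paging with the index-level setting index.max_result_window (default 10,000): any request where from + size exceeds it is rejected. A minimal sketch of inspecting and tightening that limit (the value 1000 here is purely illustrative):

GET /music/_settings?include_defaults=true&filter_path=**.max_result_window

PUT /music/_settings
{
  "index": {
    "max_result_window": 1000
  }
}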

Extension: the top N problem

The deep paging problem can be improved to some degree by refining search keywords and limiting paging depth, but how do we solve the top N problem?

In aggregation queries, the analysis requirement we meet most often is "the 10 records with the highest XX"; this pattern is the top N problem.

Scenarios with a perfect solution

Let's start with a familiar case: find the 10 English songs with the highest play counts.

document data structure:

{
  "_index": "music",
  "_type": "children",
  "_id": "2",
  "_version": 6,
  "found": true,
  "_source": {
    "name": "wake me, shark me",
    "content": "don't let me sleep too late, gonna get up brightly early in the morning",
    "language": "english",
    "length": "55",
    "likes": 0
  }
}

ES handles this kind of requirement with ease, for the following reasons:

  • A document exists in only one shard.
  • Each document carries its own play-count statistic.

These two points guarantee that, at query time, ES can safely take the 10 most-played documents from each shard, and the coordinating node only has to aggregate 50 documents, so performance is very high.
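
A minimal sketch of such a query, assuming the play count is stored in a field like likes in the document above (that field choice is an assumption here):

GET /music/children/_search
{
  "size": 10,
  "query": {
    "match": { "language": "english" }
  },
  "sort": [
    { "likes": { "order": "desc" } }
  ]
}

Each shard returns its own top 10 by likes, and the coordinating node merges the 50 candidates into the final 10.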

Scenarios that should not be queried directly

The previous section relied on a pre-designed document structure and the way data sits in shards to avoid a full index scan, which is why performance was so high. Now suppose the system needs per-day play statistics and keeps a play log that records the song ID, the listener, the click time, the listening duration, and the completion percentage (how much of the song was actually heard; someone who quits halfway through gets 50%). An example document:

{
  "_index": "playlog-20191121",
  "_type": "music",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "music_id": 1,
    "listener": "tony wang",
    "listen_date": "2019-11-21 15:35:00",
    "music_length": 52,
    "isten_percentage": 0.95
  }
}

Assume 2 million songs in total and 100 million play-log entries per day, with one log index per day, 10 primary shards, and index names in the format playlog-yyyyMMdd. The requirement is to query the day's play ranking and return the top 10 records.

If we compute the statistics directly, we can only brute-force it. The basic process is as follows (a query sketch follows the list):

  1. Each shard groups its documents by music_id; in theory a single shard can produce up to 2 million groups.
  2. The coordinating node collects and merges the data from the 10 shards, handling up to 20 million entries and merging them down to 2 million.
  3. From those 2 million entries, take the top 10 and return them to the client.
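
A minimal sketch of that brute-force aggregation, using the field names from the play-log document above:

GET /playlog-20191121/_search
{
  "size": 0,
  "aggs": {
    "top_played": {
      "terms": {
        "field": "music_id",
        "size": 10
      }
    }
  }
}

Even though only 10 buckets come back, every shard still has to bucket its full share of the day's 100 million log entries before it can answer, which is where the cost lies.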

This process is unquestionably heavyweight; if the statistics were computed in real time on every request, the pressure on the ES cluster is easy to imagine.

Improvement proposals
  1. Add data-update logic to the playback feature
    On top of the per-day log index described above, every time a user clicks play, additionally send a message that updates a pre-aggregated statistics index; queries then read the results directly from that index instead of recomputing them every time (see the sketch after this list).

  2. Scheduled statistics task
    Compute the statistics with a scheduled task and store the results, trading real-time freshness for relief from the pressure of full-index scan calculations.
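
A minimal sketch of proposal 1, assuming a hypothetical per-day statistics index (music-play-count-20191121, one document per song, keyed by song ID) that is upserted on every play event; the index name, type and play_count field are illustrative only:

POST /music-play-count-20191121/stats/1/_update
{
  "script": {
    "source": "ctx._source.play_count += 1",
    "lang": "painless"
  },
  "upsert": {
    "music_id": 1,
    "play_count": 1
  }
}

The day's top 10 then becomes a simple size-10 query sorted by play_count descending, just like the English-songs example earlier.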

Simple comparison:

  • What they have in common: both trade space for time, avoiding a full index scan.
  • Where they differ: the former changes business logic and adds cascading data updates, so there is some coupling with the business, but the data stays fresh; the latter turns real-time calculation into a scheduled task, which is more flexible and less coupled to the business, but less fresh.

One more point

Good data-structure design can greatly reduce query pressure on ES and improve real-time query performance, but one thing has to be accepted: however thoughtful the design, it is hard for it to adapt to ever-changing requirements; requirement changes are inevitable, and there is no once-and-for-all solution.

Summary

This article started from paged queries, explained the causes of the deep paging problem, briefly described along the way how thinking differs between distributed and centralized systems, and finally extended the discussion to top N scenarios. The improvements above only suit relatively simple scenarios; real production cases are certainly more complex, for example using a distributed computing component such as Storm to solve the top N problem. This is just meant to start the conversation, and you are welcome to share your own views.

For more hands-on sharing on Java concurrency and distributed architecture, please follow the public account: Java Architecture Community
