A Real Online Queuing System Refactoring Case: Practice

Preface

In the previous article (the theory part), we discussed how to design a feasible refactoring technical solution. This article walks through a complete technical solution for a system refactoring, based on a recent online refactoring project: the passenger queuing system.

Detailed technical solution

1. Background

1. Current state:

* Online passenger queuing has an obvious performance bottleneck. It relies mainly on the Redis List structure, so as the number of queued orders grows, the RT of operations such as querying, inserting, and checking whether an order is in the queue rises sharply.

* The current passenger queuing structure cannot support business expansion. To support rapid business iteration in the future, refactoring passenger queuing is urgent.

2. Research items

* Feasibility analysis of using MySQL to store queuing information (stress test in the offline environment)

* Review of the external interfaces and their scope of influence (analysis of the roughly 20 external interfaces currently provided)

The form is as follows:

| Interface name | Interface path | Caller | SWC | RT (995) | Average RT | Remark |
| --- | --- | --- | --- | --- | --- | --- |
| enqueue | /queue/enter | XXX | XXX | XXX | XXX | |

2. Goals

1. Keep the external interfaces unchanged and rework only the underlying storage, stay compatible with the current online display scenarios, and decouple passenger ranking display from dequeuing.

The ranking display keeps the ordinary queue, channel queue, and priority queue (including absolute priority), sorted by enqueue time.

The sort factor is computed by fixed rules at enqueue time, while the priority used for dequeuing is computed by a more flexible strategy algorithm.

2. Use a Redis sorted set to store the ranking data, and add MySQL storage for the queue information, sharded into 128 tables.

3. Solve the current performance bottleneck and support rapid iteration and expansion for later requirements.

3. The overall plan

1. Comparison of the old and new solutions

Storage architecture before refactoring: Redis List data structure, key: hive center point + car model + queue type

Storage architecture after refactoring:

* Ranking queue: Redis sorted set, key: hive center point + car model + queue type (kept for compatibility with the old key)

* Queue information table: queue_info_xxx, stored in MySQL, sharded by the hash of the hive center point, with a composite unique index on order number + car model

Old vs. new comparison of some interfaces:

| Interface | Before refactoring | After refactoring |
| --- | --- | --- |
| View ranking | 1. Loop over all elements in all queues to work out the position. 2. Query the algorithm team for the estimated time. | 1. Query the "queue information table" by order number. If a queuing record exists, check whether it has a ranking: if not, display M+ (the ranking queue has an upper limit); otherwise query the "ranking queue" and return the rank directly. 2. Query the algorithm team for the estimated time. |
| Is in the queue | Fetch and traverse all elements of the queue, checking in a loop whether the order is contained. | Query the "queue information table" directly to see whether a record exists. |
| Enqueue | First check whether the order already exists in the queue, then decide, according to the queue type hit, whether to write it into the Redis queue (List). | Write to the "queue information table" first; if the ranking threshold is not exceeded, also write to the corresponding "ranking queue". |
| Dequeue | Loop over car models and over queue types to dequeue, and record a log. | Update the status in the "queue information table". If the order has a ranking, remove it from the ranking queue, notify the candidate asynchronously, and record a log. |
| Jump in line | Queue-jumping with a benefit card. | Change the queue order directly by updating the "order_by" field of the queue information table. |

Bottleneck analysis before refactoring: every request fetches all elements of the queue and loops over them, so as the number of queued orders grows, RT rises sharply, which is a serious problem. (Can you work out why?)

Advantage of the refactored storage architecture: operations that used to cost O(n) become O(1) or close to it (for example, the Redis sorted set commands ZADD and ZRANK are O(log N), and ZCARD is O(1)).
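The "after refactoring" flow in the comparison above can be sketched roughly as follows. This is only an in-memory illustration, not the project's code: a dict stands in for the "queue information table", a sorted list stands in for the Redis sorted set, and the class name, method names, and threshold value are all hypothetical. It also assumes a lower score means an earlier position, and it leaves out the asynchronous candidate notification and logging.

```python
import bisect

RANK_THRESHOLD = 3  # hypothetical cap M on the ranking queue


class QueueSketch:
    """In-memory stand-in: a dict plays the "queue information table",
    a sorted list plays the Redis sorted set ("ranking queue")."""

    def __init__(self, threshold=RANK_THRESHOLD):
        self.threshold = threshold
        self.info = {}      # order_id -> score (the "table")
        self.ranking = []   # sorted (score, order_id) pairs (the "sorted set")

    def _rank_index(self, order_id, score):
        # Binary search for the entry; returns None if it is not ranked.
        i = bisect.bisect_left(self.ranking, (score, order_id))
        if i < len(self.ranking) and self.ranking[i] == (score, order_id):
            return i
        return None

    def enqueue(self, order_id, score):
        if order_id in self.info:            # keyed lookup, no full scan
            return False
        self.info[order_id] = score          # write the table first
        if len(self.ranking) < self.threshold:
            bisect.insort(self.ranking, (score, order_id))
        return True

    def is_in_queue(self, order_id):
        return order_id in self.info

    def view_ranking(self, order_id):
        score = self.info.get(order_id)
        if score is None:
            return None                      # no queuing record
        i = self._rank_index(order_id, score)
        if i is None:
            return f"{self.threshold}+"      # queued, but beyond the cap
        return str(i + 1)                    # 1-based rank

    def dequeue(self, order_id):
        score = self.info.pop(order_id, None)
        if score is None:
            return False
        i = self._rank_index(order_id, score)
        if i is not None:
            del self.ranking[i]              # remove from the ranking queue
        return True
```

None of the four operations walks the whole queue to find an order, which is the point of the refactoring. In the sorted list used here, inserting or deleting an element still shifts its neighbours; a real sorted set does not have that cost.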

2. Architecture diagram after refactoring

[Figure: architecture diagram after refactoring]

1) Queue size statistics:

Ranking queue without rate limiting: read the size directly with ZCARD (O(1) time complexity).

Rate-limited ranking queue: read the total length from a counter (O(1)), and fall back to ZCARD when degraded.

2) New capacity matching: the list query (the orange part of the diagram) may become a bottleneck. There are two optimization directions for a later stage: take only the top N by ranking, or extract from a buffer collection queue.

3) Other flowcharts: the enqueue and dequeue flowcharts are omitted here.

3. Table structure design

queue_info_[001 ~ 128]: the queuing information table, sharded by hive center point hash % 128, with data archived by day.

queue_manager: the ranking queue management table. It mainly controls whether the queue is in the rate-limited state and holds the hive queue information.

queue_log_[001 ~ 128]: the order enqueue and dequeue record table, sharded by hive center point hash % 128. Archiving will be considered later.

Detailed table structures are omitted.
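As a rough illustration of the hash % 128 sharding rule, the table name could be derived like this. The article does not say which hash function is used or how the remainder 0..127 maps onto the suffixes 001..128, so both the CRC32 hash and the "+ 1" mapping below are assumptions, and the function names are hypothetical.

```python
import zlib

TABLE_COUNT = 128


def shard_suffix(hive_center_point: str) -> str:
    # Assumption: CRC32 as a stable hash. Python's built-in hash() is
    # salted per process for strings, so it is not suitable for routing.
    index = zlib.crc32(hive_center_point.encode("utf-8")) % TABLE_COUNT
    # Assumption: remainders 0..127 map onto suffixes 001..128.
    return f"{index + 1:03d}"


def queue_info_table(hive_center_point: str) -> str:
    return f"queue_info_{shard_suffix(hive_center_point)}"


def queue_log_table(hive_center_point: str) -> str:
    return f"queue_log_{shard_suffix(hive_center_point)}"
```

Because both tables are routed by the same key, the queuing record and the log records for one hive land on the same suffix.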

4. Design of the sort field (order_by)

In a queuing scenario, the earlier the enqueue time, the higher the ranking. The time difference can be inverted so that it sorts in reverse order, using the formula: ~(-1L << 39L) & (~(millisecond time difference))

Other rules are omitted here.
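To see what the formula does, here is a small Python equivalent. The mask ~(-1 << 39) is 2^39 - 1, so for a difference between 0 and 2^39 - 1 milliseconds (roughly 17 years) the result is simply (2^39 - 1) minus the difference: a larger time difference gives a smaller value. The function name is hypothetical, and the base time the difference is measured from is not stated in the article.

```python
MASK_39 = ~(-1 << 39)   # 2**39 - 1, i.e. the low 39 bits set


def inverted_time_diff(diff_ms: int) -> int:
    """Python equivalent of ~(-1L << 39L) & (~diff_ms)."""
    return MASK_39 & ~diff_ms


# A larger time difference produces a smaller sort value.
print(inverted_time_diff(0))      # 549755813887
print(inverted_time_diff(1000))   # 549755812887
```

Keeping the result within 39 bits leaves the higher bits of a 64-bit value free, presumably for the other rules mentioned in this section.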

5. Compatibility with historical queue scenarios

Ranking display: ordinary queue, channel queue, priority queue.

Dequeue order: different weight coefficient configurations produce the final ordering.

6. Grayscale plan

Roll out gradually by city, starting with a low-traffic city.

7. Rollback plan

Turn off the city grayscale switch. Data already in the queue will be affected, so the migration tool must be used to refresh the data.

8. Data archiving plan & fallback plan

Data archiving: passenger queuing information is archived by day.

Fallback strategy: if an order's queuing status has not changed for a long (configurable) time, it may be abnormal and is forced out of the queue.
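A minimal sketch of the fallback check, assuming each queuing record carries the time of its last status change in milliseconds. The function name, the parameter names, and the dict layout are hypothetical.

```python
def find_stale_orders(last_change_ms, now_ms, max_idle_ms):
    """Return the order ids whose queuing status has not changed
    for longer than max_idle_ms (the configurable limit)."""
    return [
        order_id
        for order_id, changed_at in last_change_ms.items()
        if now_ms - changed_at > max_idle_ms
    ]


# Orders returned here would then go through the normal dequeue path.
stale = find_stale_orders({"a": 0, "b": 9_000}, now_ms=10_000, max_idle_ms=5_000)
print(stale)  # ['a']
```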

9. Data monitoring & alarms

Passenger queuing Grafana monitoring:

* Monitoring indicators: city, hive, car model, number of orders in the ordinary queue, number of orders in the channel queue, number of orders in the priority queue

* Alarm: a DingTalk alarm is sent when the number of queued orders exceeds the threshold

10. Time planning

For the interfaces covered in the research (20 interfaces), add the retrofit plan, the person responsible, and the schedule:

| Interface name | Interface path | Caller | SWC | RT (995) | Average RT | Remark | Retrofit plan | Responsible | Schedule |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| enqueue | /queue/enter | XXX | XXX | XXX | XXX | | | | |

[Figure: time plan]

Note: Interface self-testing and code review (CR) are completed in the development phase. Monitoring and alarms do not block testing, so they can be developed during the testing phase.

11. Related teams

Omitted.

12. Required resources

Omitted.

Summary

Refactoring involves many details. Every possible bottleneck has to be considered, along with later optimization and expansion.

Every change must have a clearly responsible owner (to avoid omissions), and all self-tests (unit tests) must pass before handing over to testing.

The code for this solution is now basically complete. The next article will continue with the queuing system refactoring as the scenario and discuss how to design the stress test plan for the grayscale stage. Stay tuned.

You are welcome to follow the official account "Talking about Architecture", which shares original technical articles from time to time and may cover more technical details of system refactoring in the future.


Origin blog.csdn.net/weixin_38130500/article/details/125252678