X system high availability and high concurrency solution

PPT download: https://download.csdn.net/download/love254443233/12334991

2.1 Primary/standby data storage

The primary machine does all the work while the standby machine stays in a monitoring, ready state; when the primary goes down, the standby takes over all of the primary's work.
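A minimal sketch of this primary/standby idea, assuming a heartbeat-style health probe; every class and method name here is hypothetical, for illustration only:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical standby-side monitor: probe the primary periodically and
// promote the standby once several consecutive heartbeats are missed.
public class StandbyMonitor {
    private static final int MAX_MISSED = 3;   // tolerated consecutive failures
    private int missed = 0;
    private volatile boolean active = false;   // has this node taken over?

    // Placeholder for a real health probe (TCP ping, HTTP health check, etc.)
    private boolean primaryAlive() {
        return false; // assumed unreachable in this sketch
    }

    private void takeOverPrimaryWork() {
        active = true;
        System.out.println("Standby promoted: taking over the primary's work");
    }

    public void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            if (active) return;
            if (primaryAlive()) {
                missed = 0;                    // healthy: reset the failure streak
            } else if (++missed >= MAX_MISSED) {
                takeOverPrimaryWork();         // declared dead: fail over
            }
        }, 0, 5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        new StandbyMonitor().start();
    }
}
```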

2.2 Dual machine room deployment

The service is deployed in two machine rooms (data centers) at the same time. When one room fails, for example due to a power or network outage, traffic is switched to the other room so the service stays available.

 

2.3 Cluster deployment

Multiple hosts work together, each running one or more services, and each service has one or more designated backup hosts. When a host fails, the services running on it are taken over by its backup hosts.

 

3. Service Stability

Stability here refers to the system's ability to keep providing service correctly over time, without degradation or failure.

3.1 Traffic peak clipping (phenomenon 1)

A promotional push causes traffic to suddenly increase 2x or more.

3.1 Traffic peak clipping (phenomenon 2)

QPM rose from 2,500 to 5,000, and at times even 10,000.

 

3.1 Traffic peak clipping (impact)

Reference article: Designing a local cache based on the LRU-K algorithm to implement traffic peak clipping

1) Interface response time increased 5x (while QPS only doubled);

2) Bandwidth alarms on the machine room's LAN switch (more than 900 Mbps in use on a 1,000 Mbps link);

3) Increased response time for interfaces that fetch data from Redis; and so on.

3.1 Traffic peak clipping (solution)

1. Each instance of the service automatically identifies its own hotspot data

2. Only a limited amount of hotspot data is stored locally

3. Hot data is read from the local cache first, instead of from Redis or the database, which greatly reduces request response time and raises the concurrency the service can handle (see the sketch below)
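As a rough illustration of this read path, here is a minimal sketch assuming a hypothetical local cache, Redis client, and database accessor; none of these names come from the original system:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical read path: local hot cache -> Redis -> database.
public class HotDataReader {
    // Stand-in for the LRU-K local cache described in this section.
    private final Map<String, String> localCache = new ConcurrentHashMap<>();

    // Placeholders for real Redis / database access.
    private String getFromRedis(String key) { return null; }
    private String getFromDb(String key)    { return "db-value-of-" + key; }

    public String get(String key) {
        // 1. Hot data is served from the local cache, avoiding a network hop.
        String value = localCache.get(key);
        if (value != null) return value;

        // 2. Fall back to Redis.
        value = getFromRedis(key);

        // 3. Finally fall back to the database, then backfill the local cache.
        //    (Simplified: a real LRU-K policy decides whether the key is hot
        //    enough to keep locally at all.)
        if (value == null) value = getFromDb(key);
        localCache.put(key, value);
        return value;
    }
}
```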

 

3.1 Traffic peak clipping (LRU-K)

1) When a piece of data is accessed for the first time, it is added to the access history table (the "record table"): its last-access time is set to the current time and its access count is set to 1;

2) While the access count has not yet reached K, each access increments the count by 1; but if the interval between the last access and the current access exceeds a preset value (for example, 30 seconds), the count is reset to 0 and then set to 1;

3) Once the access count reaches K (count >= K), the count keeps incrementing and the data is moved into the LRU cache queue, which is ordered by access time;

4) When data already in the LRU cache queue is accessed again, the queue is reordered accordingly;

5) When the LRU cache queue needs to evict data, it evicts from the tail of the queue, that is, the entry whose K-th most recent access lies furthest in the past (a minimal code sketch follows this list).
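A minimal, single-threaded sketch of the bookkeeping in steps 1) to 5), with K, the 30-second reset window, and the capacity taken as parameters. For simplicity, eviction uses the queue's plain last-access order via LinkedHashMap, which approximates the "K-th most recent access" rule in step 5; this is an illustration, not the article's actual implementation:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU-K cache: a key enters the real cache only after K accesses.
public class LruKCache<K, V> {
    private static final long RESET_WINDOW_MS = 30_000; // preset interval from step 2

    private static final class History { int count; long lastAccess; }

    private final int k;                                     // promotion threshold K
    private final Map<K, History> history = new HashMap<>(); // the "record table"
    private final LinkedHashMap<K, V> cache;                 // LRU queue, access-ordered

    public LruKCache(int k, int capacity) {
        this.k = k;
        // accessOrder=true: every hit moves the entry to the tail (steps 3 and 4);
        // removeEldestEntry evicts the least recently used entry (step 5).
        this.cache = new LinkedHashMap<K, V>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Record one access. Returns the cached value if the key is hot,
    // or null, meaning the caller should fall back to Redis / the database.
    public V get(K key, V valueIfPromoted) {
        if (cache.containsKey(key)) {
            return cache.get(key);               // step 4: a hit reorders the queue
        }
        long now = System.currentTimeMillis();
        History h = history.computeIfAbsent(key, x -> new History()); // step 1
        if (h.count > 0 && now - h.lastAccess > RESET_WINDOW_MS) {
            h.count = 0;                         // step 2: stale streak, restart at 1
        }
        h.count++;
        h.lastAccess = now;
        if (h.count >= k) {                      // step 3: hot enough, promote
            history.remove(key);
            cache.put(key, valueIfPromoted);
            return valueIfPromoted;
        }
        return null;                             // not hot yet
    }
}
```

A null return means the key is not yet hot; the caller then fetches from Redis or the database, as in this section's solution.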

3.1 Traffic peak clipping (effect 1)

1. Before parameter tuning, response time did not improve noticeably when QPS rose (see column 1 of the left and right charts).

2. After the tuned parameters went live, overall response time stayed flat even as QPS rose significantly and the Redis request response time increased (see the last column of the left and right charts).

3.1 Traffic peak clipping (effect 2)

QPS increased 4x while response time remained unchanged.

 

Before (problem):

1. The access system pushed data to us with multiple threads

2. In extreme cases, a network timeout triggered a re-push (the same data was pushed 6 times within 1 second)

3. Pushing too fast affected the stability of the C system and the X service

After (solution):

1. The access system still pushes with multiple threads

2. The push now only persists the data's state, which performs well and has almost no side effects on the system

3. A single-threaded timed task processes the persisted data, keeping concurrency under control; the downstream system is under no pressure

Several modes for synchronizing data from the big data platform:

1. HTTP push: real-time, but high concurrency affects the service

2. Message push: partially rate-limited, but high concurrency still affects the service

3. Actively pulling data: single-threaded, does not affect the service (see the sketch below)
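A minimal sketch of mode 3, which is also the single-threaded timed task described above: one worker thread pulls a bounded batch at a fixed interval, so downstream concurrency is capped at 1 by construction. The pull source and handler here are hypothetical:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Single-threaded timed puller: throughput is bounded by batch size x tick rate.
public class DataPullTask {
    private static final int BATCH_SIZE = 100;

    // Placeholder for pulling up to `limit` pending records from the big data
    // platform, e.g. reading from an export table or an offset cursor.
    private List<String> pullPendingRecords(int limit) {
        return List.of();
    }

    private void process(String record) {
        // Persist or forward the record; a failure can simply be retried on
        // the next tick because the source keeps the pending state.
    }

    public void start() {
        ScheduledExecutorService worker = Executors.newSingleThreadScheduledExecutor();
        worker.scheduleWithFixedDelay(() -> {
            for (String record : pullPendingRecords(BATCH_SIZE)) {
                process(record);
            }
        }, 0, 1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        new DataPullTask().start();
    }
}
```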

1. Divert part of the traffic from services with high overall concurrency to other services with low overall traffic

2. Y's check that the landlord-store relationship is correct was switched from querying X to querying the permissions platform

 

 

Average concurrency dropped by 90,000 and peak concurrency dropped by up to 150,000; at the time, the total concurrency spread across all services was 85,000.

 

Traffic sources were identified through traceId or IP statistics.

 

 

Results and impact:

1) Response time of the core single-item query interface (landlord + store + house): reduced from the original 2ms-7ms to 0.04ms-0.8ms, a reduction of roughly 90%

2) Redis overall concurrency: originally 729K, now 266.469K at peak, a reduction of about 64%

3) Redis overall traffic: originally 25K, now 17K, a reduction of about 32%

Average response time: 2.25 ms; maximum response time: 13 ms
Average response time: 1.712 ms / 18 ≈ 0.09 ms; maximum response time: 1.2 ms
Average response time: 0.585 ms / 18 ≈ 0.032 ms; maximum response time: 1.0 ms
