High-Availability Guarantee Plan for Sudden Online Traffic Surges into the Millions

One: Prevention in advance

1. Estimate system bottlenecks

1.1 Sort out the core interfaces

  • Rank interfaces by call volume and take the top 100
  • The PM reviews, by business importance, interfaces that rank lower but are still critical, and adds them to the list
  • Output the final version of the core-interface document

1.2 Evaluate the peak TPS of each core interface

  • Populate the test environment with data that simulates production
  • Run full-link stress tests against the core interfaces
  • Output a core-interface performance report

2. Measures to guarantee high system availability

2.1 Rate limiting for core interfaces (for interfaces with no external dependencies)

  • Integrate Sentinel into the application
  • Wire the corresponding interfaces into rate limiting
  • Log rate-limit events (to make alarm configuration easy)
  • Configure each interface's rate-limit threshold in the Sentinel dashboard (reference value: 20% of the interface's peak TPS); a code-level sketch of an equivalent rule follows this list
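
For reference, the dashboard rule above can also be expressed in code. The following is a minimal sketch assuming Sentinel's Java client (sentinel-core); the resource name `queryCoupon`, the peak TPS of 1000, and the resulting 200 QPS threshold are illustrative values, not taken from the original system:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class CoreInterfaceFlowLimit {

    static void initFlowRule() {
        // Hypothetical core interface whose measured peak TPS is 1000;
        // reference threshold from the text: peak TPS * 20% = 200 QPS.
        FlowRule rule = new FlowRule();
        rule.setResource("queryCoupon");
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(200);
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    static String queryCoupon(long userId) {
        try (Entry entry = SphU.entry("queryCoupon")) {
            return doQuery(userId); // normal business call
        } catch (BlockException e) {
            // Request was rate-limited: print a dedicated log line so an alarm can be set on it.
            System.out.println("RATE_LIMITED resource=queryCoupon userId=" + userId);
            return "SYSTEM_BUSY";
        }
    }

    private static String doQuery(long userId) {
        return "coupon-list-for-" + userId;
    }

    public static void main(String[] args) {
        initFlowRule();
        for (int i = 0; i < 5; i++) {
            System.out.println(queryCoupon(42L));
        }
    }
}
```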

2.2 Timeout and circuit-breaker management for core interfaces (for interfaces that depend on external services)

  • Integrate Sentinel into the application
  • Set timeouts on calls to external services (default: 3s)
  • Log timeouts (to make alarm configuration easy)
  • Wire the corresponding interfaces into circuit breaking
  • Confirm the fallback plan with the business side
  • Log fallback invocations (to make alarm configuration easy)
  • Configure each interface's circuit-breaker thresholds in the Sentinel dashboard (based on response time RT, exception count, and exception ratio); see the sketch after this list
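
Correspondingly, a minimal sketch of the circuit-breaker side, again assuming Sentinel's Java client; the resource name, the 1000 ms RT threshold, the 10 s recovery window, and the fallback value are illustrative, and the external HTTP client setup (with its 3s timeout) is omitted:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

public class ExternalCallCircuitBreaker {

    static void initDegradeRule() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("callPricingService");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT); // trip on slow responses (RT)
        rule.setCount(1000);                          // RT threshold in milliseconds
        rule.setTimeWindow(10);                       // keep the circuit open for 10 s before probing again
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    static String queryPrice(String sku) {
        try (Entry entry = SphU.entry("callPricingService")) {
            // Call the external pricing service here, using a client configured with a 3 s timeout.
            return callExternalPricing(sku);
        } catch (BlockException e) {
            // Circuit is open: print a dedicated fallback log line so an alarm can be set on it,
            // then return the fallback value agreed with the business side.
            System.out.println("FALLBACK resource=callPricingService sku=" + sku);
            return "DEFAULT_PRICE";
        }
    }

    private static String callExternalPricing(String sku) {
        return "price-of-" + sku; // stand-in for the real remote call
    }
}
```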

2.3 Monitoring and alerting

  • Alerts on sudden increases in interface call volume (based on platform monitoring capabilities)
  • Rate-limit alerts (configured on the printed rate-limit logs)
  • Timeout alerts for calls to external interfaces (configured on the printed timeout logs)
  • Business monitoring (for example, campaign issuance, coupon issuance, points issuance, and mall point-redemption monitoring)
  • Application jobs that involve transactional operations must support idempotency (see the sketch after this list)
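
For the last point, a minimal sketch of one common way to make a job's transactional step idempotent, by claiming a unique business key in Redis before executing it; the Jedis client, the key format, and the 24-hour TTL are assumptions for illustration:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class IdempotentPointsJob {

    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Processes one record at most once, keyed by a unique business id. */
    public void process(String orderId) {
        String idempotencyKey = "job:issue-points:" + orderId; // hypothetical key format
        // SET key value NX EX 86400: only the first worker to claim the key proceeds.
        String claimed = jedis.set(idempotencyKey, "1", SetParams.setParams().nx().ex(86400));
        if (!"OK".equals(claimed)) {
            return; // already processed (or in progress) -- skip to stay idempotent
        }
        issuePoints(orderId); // the transactional operation guarded by the key
    }

    private void issuePoints(String orderId) {
        System.out.println("points issued for order " + orderId);
    }
}
```

In practice the claim should be released (or backed by a database unique constraint) if the guarded operation fails, so that a failed record is not silently skipped.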

2.4 Emergency plan preparation

  • Sort out the abnormal-risk points of the core interfaces (mainly identify the key places that can be degraded)
  • Manual degrade of non-business functionality (via a disconf dynamic configuration switch)
  • Direct degrade at the interface level (via a disconf dynamic configuration switch); see the sketch after this list
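
A minimal sketch of the manual-degrade switch pattern behind the last two items; the switch is shown as a plain in-process flag with a setter, and the wiring that would flip it from a disconf configuration update is assumed rather than shown:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class DiscountDegradeSwitch {

    // Flipped at runtime by the configuration center (a disconf update callback would call setDegraded).
    private static final AtomicBoolean DEGRADED = new AtomicBoolean(false);

    public static void setDegraded(boolean on) {
        DEGRADED.set(on);
        // Dedicated log line so an alarm can be configured on switch changes.
        System.out.println("DEGRADE_SWITCH resource=discount-list degraded=" + on);
    }

    /** Interface-level degrade: skip the real logic and return the agreed fallback data. */
    public static List<String> queryDiscounts(long userId) {
        if (DEGRADED.get()) {
            return Collections.emptyList(); // fallback agreed with the business side
        }
        return loadDiscountsFromBackend(userId);
    }

    private static List<String> loadDiscountsFromBackend(long userId) {
        return List.of("discount-A", "discount-B"); // stand-in for the real query
    }
}
```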

2.5 Daily health check (every day at 9:00 am)

2.5.1 Error log check

2.5.2 Application metric check (CPU, memory, full GC, number of busy threads)

2.5.3 Middleware metrics

  • redis: CPU, memory usage, write commands per second, slow queries

  • kafka: production rate, consumption rate, message backlog

2.5.4 Database metrics

  • CPU, memory usage, IO, slow SQL

2.5.5 Interface metrics

  • Interfaces whose response time in APM exceeds 1s

  • Whether the peak call volume of core interfaces is spiking

2.5.6 Check of core business functions (such as the membership center, coupon verification, and discount inquiry)

2.5.7 Investigate the cause of any abnormal monitoring alarms in the logs

Two: Handling during the incident

1. Application restart

  • The CPU is maxed out

  • The thread pool is exhausted

2. Specification upgrade (scale up)

Covers application specification upgrades, middleware specification upgrades, and database specification upgrades:

  • CPU upgrade (for example, from 8 cores to 16 cores)
  • Memory upgrade (for example, from 8 GB to 16 GB)
  • JVM parameter tuning after an application memory upgrade

3. Scaling out (expansion)

  • Application scaling: add application nodes (for example, from 8 nodes to 12 nodes)
  • Middleware scaling: Redis scaling (add Redis master-slave nodes); Kafka scaling (add Kafka partitions)

4. Flow control

  • Trigger the emergency plan in disconf: flip the dynamic configuration switch to close the affected interface and cut off its traffic quickly
  • Lower the interface rate-limit threshold in the Sentinel dashboard (for example, from 100 down to 50) to throttle traffic

5. Version rollback

  • Configuration rollback (mounted configuration and disconf configuration)
  • Application version rollback

6. Take the business that is hurting system performance offline

  • Take the business activity offline
  • Notify the product or operations team

Three: Post-incident review

The following uses the review of a P0-level incident as an example.

1. Incident timeline

  • 1. At xx:xx on month x, day x (for example, at 20:20 on March 14), operations launched the XXX strategy, and many of its preferential policies took effect for all users on the network;

  • 2. At xx:xx on month x, day x, R&D suddenly received a large number of interface alarms from the system and immediately began querying logs to troubleshoot;

  • 3. At xx:xx on month x, day x, O&M reported that some application nodes were unavailable; O&M began to expand the application's CPU and node count and serially restarted all nodes of the dynamic-discount system;

  • 4. At xx:xx on month x, day x, the xx system reported a sharp drop in the xx business, and the P0 fault process was activated;

  • 5. At xx:xx on month x, day x, some nodes of the system restarted automatically; R&D and O&M continued to monitor and check the system's metrics;

  • 6. At xx:xx on month x, day x, monitoring showed that large keys had appeared in Redis and traffic was very high, reaching 1.8G/s. Investigation found that the newly launched business policy messages were too large, which produced the large keys in Redis;

  • 7. At xx:xx on month x, day x, operations took the relevant business strategies offline, and the system returned to normal.

    The entire fault lasted nearly 20 minutes, indirectly affected revenue of XXX million, and was classified as a P0 fault.

2. Problem analysis and solutions

Problem analysis:

  1. On the day of the incident, operations configured XX strategies, each with about XXX rules;
  2. Policy data is stored in Redis, so once the XX strategies configured by operations took effect, the messages written to Redis became very large instantly. The policy key quickly grew into a large key, from 1 KB to 60 KB. Clients then requested this large Redis key frequently, and traffic climbed as high as 1.8G/s. The system's HTTP thread pool soon filled up, requests could no longer be served, and the system entered a state of suspended animation.

Solutions:

  1. Optimize the large Redis key: on the night of the incident, a local cache was introduced as a first-level cache for the large-key scenario, so that data is read from local storage first, reducing the pressure on Redis. After the optimization, Redis ran relatively smoothly (a sketch follows this list).
  2. Add system exception handling: add a switch in the configuration center that can dynamically shut down the interface. As soon as the system becomes unavailable, turn the switch on, degrade manually and return the agreed fallback data, and turn the switch off again only after system resources return to normal.
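
A minimal sketch of the first fix, a local first-level cache in front of Redis; the Caffeine cache library, the key format, the cache size, and the 60-second TTL are assumptions for illustration, not the team's actual implementation:

```java
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import redis.clients.jedis.Jedis;

public class PolicyCache {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // L1: small in-process cache, so a hot or large policy key is fetched from Redis
    // roughly once per node per minute instead of on every request.
    private final Cache<String, String> localCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(60, TimeUnit.SECONDS)
            .build();

    public String getPolicy(String policyId) {
        String redisKey = "policy:" + policyId;      // hypothetical key format
        return localCache.get(redisKey, jedis::get); // fall back to Redis (L2) only on a local miss
    }
}
```

The trade-off is a short staleness window (here up to 60 seconds) in exchange for removing most of the read traffic on the large key.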

3. Summary and reflection

1. R&D side

  1. The system depends heavily on Redis: even slight Redis jitter causes large fluctuations in service interface response time, so the system risk is high;
  2. Redis usage is not standardized, and the large Redis key was not anticipated in advance, so the system could not handle the sudden traffic;
  3. No standard review process for high-throughput, high-stability architecture design has been established. Since the xxx system is a core system that must sustain high OPS every day, its stability-design requirements are extremely demanding, and even slight negligence in a design plan can seriously affect the overall stability of the system;
  4. The performance stress tests did not realistically simulate online data, so the stress-test results were skewed and could not give R&D accurate support for performance optimization;

2. Base-platform O&M side:

At present, all of our applications have been migrated to the cloud, and we rely heavily on the group's base-platform capabilities. However, the troubleshooting tools the group provides are relatively rough, and no convenient tools are available. A few points are summarized below:

  • JVM level:

    Obtaining thread-dump files: each time, we have to ask O&M to dump the files manually, and that manual intervention is relatively slow.

    Locating detailed performance bottlenecks: it is currently difficult to analyze application performance problems online in real time. The Grafana metric monitoring provided by the base platform supports only coarse analysis and cannot pinpoint what is causing CPU fluctuations, frequent full GCs, or a saturated HTTP thread pool, so troubleshooting takes longer.

  • Middleware level:

    Lack of Redis tools: the base platform currently provides no monitoring tools for Redis slow queries, hot keys, or large keys, which makes online Redis problems hard to troubleshoot.

    O&M has not opened the monitoring alarms of the underlying middleware and databases to business R&D, so business R&D cannot receive alarms in time and respond appropriately.

3. Product and operations side:

  • Product side: the product owners responsible for this system change frequently, so knowledge is not retained, yet the system's logic is highly complex. It usually takes newcomers a long time to become familiar with the underlying logic, which leads to incomplete analysis and missed core logic when requirements are developed.
  • Operations side: operations launches strategies at too fast a pace. Once launched, a strategy often targets all users on the network; if anything goes wrong, it triggers a large number of customer complaints, so the risk is high.

4. Actions and results

1 R&D and test side

  • Produce a set of standard Redis usage specifications, covering Redis storage conventions and solutions for Redis hot keys and large keys, and promote them within the department so that other teams avoid the same pitfalls
  • Develop Redis hot-key and large-key monitoring and alarm tools and integrate them into the Eagle Eye monitoring platform, so that R&D staff in the department can locate problems immediately
  • Before any future product requirement enters development, hold a strict technical design review in advance. R&D must produce the architecture design diagram, detailed interface design, database design, performance and security design, service rate-limit and degrade configuration, monitoring and alarm configuration, and contingency plan design. The documents are archived in CF, the work is led by each team's PM, and the relevant R&D and test staff must participate
  • Performance stress testing must follow a unified standard, with test scenarios that closely reproduce online conditions. If an interface does not meet the standard required for the business to go live, the test team reports it by email and discusses with the PM whether to postpone the requirement; R&D goes live only after optimization meets the specified performance standards

2 Base-platform O&M side

  • JVM level: it is recommended to automate the retrieval of thread-dump files. O&M could provide a visual page on the group cloud so that R&D can download dump files at any time when the system behaves abnormally.

    It is recommended that the group cloud integrate the Arthas performance-analysis tool into the k8s containers. Many Internet companies use it; it can analyze application metrics online in real time, which is very convenient.

  • Middleware level: it is recommended that O&M open middleware and database monitoring to business R&D, so that R&D receives the information and handles it immediately, eliminating potential system risks in time;

    It is recommended that the base platform develop Redis slow-query detection and alarm functions, output together with the SkyWalking distributed tracing tool, which would be very useful for troubleshooting performance problems.

3 Product and operations side

  • Product side: it is recommended that product owners stay as fixed as possible and have designated backups. A stable team benefits long-term, continuous planning of the system and the gradual pay-down of the business debt left by the system and its many previous product owners;

    It is recommended that product owners reserve enough time to deal with this debt first; otherwise the more that is built, the more debt accumulates. Being slow now is what makes speed possible later; more haste, less speed.

  • Operations side: going forward, send the strategy details by email before launch, stating the number of targeted users and the effective time, so that R&D can monitor the data in advance and decide, based on the maximum QPS the system supports, whether to scale out. Also add an approval process: for example, a strategy can go live only after the operations lead approves it;

    It is recommended that operations launch strategies via gray release: on release day, operations watches closely whether the data is normal, and only after confirming there are no problems is the strategy pushed to all users on the network the next day.


Origin blog.csdn.net/quanzhan_King/article/details/130649574