How Ele.me Does Technical Operations

The Ele.me platform covers not only food delivery but also Hummingbird, Breakfast, Future Restaurant, and many other businesses, all in a stage of rapid expansion. The takeaway product chain is long: the time from the user placing an order to final delivery is about 30 minutes, so the timeliness requirements are very demanding.

From a technical point of view, the biggest challenge Ele.me has faced is incidents. This article centers on those incidents and is divided into two parts: technical operations experience and operations lessons. The first part covers three stages: refined division of labor, stability (capacity and change), and efficiency improvement. The second part is the author's understanding of operations as a service.

1. Technical operations experience

The responsibility of technical operations is to coordinate as many people as possible toward the goal of stability. The work can be divided into two stages: operations assurance and operations as a service. Ele.me is now in the operations-as-a-service stage: the technical operations team, acting as the service provider (Party B), maintains the products delivered by development, serves development and testing, ensures stability, optimizes performance, and improves resource utilization.

What does the technical team need to do during a stage of rapid business expansion?

The first stage is refined division of labor.

Refined division of labor enables parallel speed-up: specialists apply their expertise and the most effective working methods to improve efficiency and code throughput, while communication channels are established to speed up decision-making and keep information flowing steadily.

Refined division of labor has three parts:

The first part is database splitting and code decoupling. The technical work concentrates on splitting the database: vertical splitting first, with horizontal splitting only as a last resort. To support business expansion faster, some code-decoupling work is mixed in along the way.

Code decoupling means treating the original codebase as a big ball of mud and gradually splitting it into many pieces. Today there are more than ten business modules, each maintained by a dedicated team and further divided into internal domains.

Ele.me carried out database splitting and code splitting in parallel. After that came mandatory onboarding to the new release system and single-instance, single-purpose deployment, that is, physical splitting.

Throughout the process of code decoupling and refined division of labor, the team ran into many problems. Two typical types of incidents were:

  • Incident 1: timeouts. A slow back-end service triggered a chain reaction that led to an avalanche in the front-end services. A user request's latency depends on the response time of every service on the RPC call path; when one node slows down, the whole cluster can become unavailable. The usual emergency measure is to stop services in order from the front of the call chain to the back, then restart them from the back to the front. Without a circuit-breaker mechanism, the front-end services avalanche because of their dependencies and cannot recover on their own. Once a circuit breaker is added, when the problematic back-end node restarts or the network jitter subsides, the front-end services recover by themselves (see the circuit-breaker sketch after this list).
  • Incident 2: for three consecutive days, merchants had to retry repeatedly to accept orders, which turned out to be a Redis governance issue. A switch bug caused network jitter, and Redis was hit hardest. During the jitter, backlogged requests opened too many Redis connections, and the excess connections pushed Redis response latency from 1 ms up to 300 ms. Because Redis requests were slow, services processed work more slowly while external requests kept piling up, which finally caused an avalanche. When the failure first occurred, Zabbix's long monitoring interval meant the operations engineers could not see it; it took three days of stress testing and reproduction to pin down the fault. Afterwards, the operations engineers built a new infrastructure monitoring tool that collects all the metrics under the /proc directory every 10 seconds, which lets them locate this class of problem within roughly 3 minutes (see the /proc collector sketch after this list). In addition, packet-loss retransmission also severely hurts Redis performance: a single HTTP request may fan out into dozens or even hundreds of Redis requests on the backend, and if just one of them is lost and retried, the impact on the service can be deadly.
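
The article does not show how the circuit breaker from Incident 1 is implemented; purely as a minimal sketch of the idea, the wrapper below trips after repeated failures, fails fast while open, and probes the backend again after a cooldown. All names and thresholds are illustrative, not Ele.me's actual code.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive failures the
    circuit opens, calls fail fast, and one probe is allowed after `reset_timeout`."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (healthy)

    def call(self, func, *args, fallback=None, **kwargs):
        # While open, fail fast until the cooldown expires, then allow a probe.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback
            self.opened_at = None   # half-open: let one request probe the backend

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback
        else:
            self.failures = 0       # backend recovered, keep the circuit closed
            return result
```

Wrapping each RPC dependency this way is what lets a front-end service return a degraded response instead of piling up on a slow backend, and recover automatically once the backend comes back.
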
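The article says only that the new infrastructure monitor collects everything under /proc every 10 seconds; a bare-bones sketch of such a collection loop might look like the following. Which files to sample and how to ship the data are assumptions for illustration.

```python
import time


def read_proc_file(path: str) -> str:
    with open(path) as f:
        return f.read()


def collect_once() -> dict:
    """Sample a few /proc metrics relevant to the Redis/network incidents:
    load, memory, TCP counters (including retransmits), and socket state."""
    return {
        "ts": time.time(),
        "loadavg": read_proc_file("/proc/loadavg"),
        "meminfo": read_proc_file("/proc/meminfo"),
        "netstat": read_proc_file("/proc/net/netstat"),    # extended TCP stats
        "snmp": read_proc_file("/proc/net/snmp"),          # RetransSegs lives here
        "sockstat": read_proc_file("/proc/net/sockstat"),  # open socket counts
    }


if __name__ == "__main__":
    while True:
        sample = collect_once()
        # In production the sample would be shipped to a time-series store;
        # printing it stands in for that step here.
        print(sample["ts"], sample["loadavg"].strip())
        time.sleep(10)   # the article's stated 10-second collection interval
```
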

The second part of refined division of labor is forming horizontal teams. For example, big data is a horizontal team, while each business line is a vertical team. After the split, the growth curve of the overall business remained very steep, from which one can infer that technology was not holding the business back: technology throughput and the efficiency of new product development stayed healthy.

During this period the operations engineers also did several other things: monitoring was divided into four parts (Metric, Log, Trace, and Infrastructure); an NOC team was formed to handle emergency response, so that whenever a problem is found, everyone involved is notified promptly through on-call; and various cleanup efforts were carried out, covering onboarding to the release system, SOA work, and the development of downgrade and circuit-breaker mechanisms.

Cleanup

What does a general cleanup mean? After analyzing historical incidents, engineers write up a technical summary, turn the commonly made mistakes into actionable procedures, and promote them to the key people in each department. The specific content includes:

  • SOA service governance, where the main emphasis is domain division, high cohesion, and low coupling.
  • Governance of common components. For databases and Redis there are two dedicated teams, DA and DBA. The DA governance plan is mainly to collect information from business partners, plan capacity, manage how developers use the components, and solidify that experience into the R&D process.
  • Sorting out business indicators, including defining TPS (counted by the status returned after a state transition), the dwell time of each state, and the backlog depth of each state. The backlog depth mainly concerns state transitions in back-end services.
  • Reasonable settings for the timeout chain and the retry mechanism (a sketch follows this list).
  • External dependencies and switches. Why emphasize external dependencies? They fall into two categories: cooperation with other companies, such as calling another company's payment interface, and dependencies between internal teams. Do not trust anyone's service; bugs can happen at any time.
  • The critical path. Why define a critical path? It enables circuit breaking and downgrading: when a non-critical path has a problem, it can simply be cut off without affecting the critical path. Another advantage is that subsequent compensation can be done in a targeted way.
  • Logging. The team has had many log-related incidents, which can be taught case by case.
  • Blind drills. The code produced by eight or nine hundred engineers interacting forms a complex system, and the business chain is very long: the critical path alone involves more than 100 services. Simple functional testing works, but at high capacity it is hard to locate problems that arise between teams, such as code coupling between team A and team B. The solution is blind drills. Blind drills can validate not only the business side but also the infrastructure, including Redis clusters, MySQL clusters, and the network. I once ran a test injecting a 1% packet-loss rate on a single Redis instance, and the whole site's business dropped to the floor. At the time the Redis cluster had 12 machines running hundreds of instances, yet a problem in just one instance had that large an impact. Through blind drills, the technical team looks for solutions that minimize the impact of a single node going down.
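
The cleanup list only names "reasonable timeout-chain settings and a retry mechanism"; one common shape for it is a per-call time budget plus capped, jittered retries, so retries at one hop cannot blow the timeout of the services above it. The sketch below is illustrative only; the HTTP client, function names, and numbers are assumptions.

```python
import random
import time

import requests  # assumed HTTP client for the example


def call_with_budget(url: str, budget_s: float = 1.0, max_retries: int = 2):
    """Spend at most `budget_s` seconds across all attempts to this dependency."""
    deadline = time.monotonic() + budget_s
    attempt = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"budget exhausted calling {url}")
        try:
            # Each attempt's timeout is bounded by what is left of the budget.
            return requests.get(url, timeout=remaining)
        except requests.RequestException:
            attempt += 1
            if attempt > max_retries:
                raise
            # Jittered backoff, also capped by the remaining budget.
            time.sleep(min(random.uniform(0.05, 0.2) * attempt,
                           max(deadline - time.monotonic(), 0)))
```
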

The second stage is stability, and enemy number one is capacity.

During rapid business expansion, the biggest enemy of system stability is capacity, whether it creeps up like a frog in slowly boiling water or hits as a sudden avalanche. Different languages determine capacity in different ways, and with a complex system of more than 1,000 services, rapidly changing business scenarios, and frequent service changes, capacity problems plagued Ele.me for nearly a year.

In the end, regular online full-link stress testing was adopted: a campaign involving about 100 people was launched, and it took more than a month to fix nearly 200 hidden risks, which basically solved the capacity problem. Full-link stress tests are also run during off-peak hours, and they can be combined with pre-launch load tests so the data can be analyzed together.
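
The article does not describe Ele.me's stress-testing platform itself; purely as an illustration of what a simple online load script can look like, here is a sketch using the open-source Locust tool. The endpoints, weights, and stress-traffic header are hypothetical.

```python
from locust import HttpUser, task, between


class TakeawayUser(HttpUser):
    """Simulated user mixing read-heavy browsing with occasional ordering."""
    wait_time = between(1, 3)   # think time between actions, in seconds

    @task(5)
    def browse_restaurants(self):
        # Hypothetical read endpoint.
        self.client.get("/restaurant/list?city=shanghai")

    @task(1)
    def place_order(self):
        # Hypothetical write path; stress traffic is tagged so the backend can
        # route it to shadow tables instead of creating real orders.
        self.client.post("/order/create",
                         json={"restaurant_id": 123, "items": [{"id": 1, "qty": 2}]},
                         headers={"X-Stress-Test": "1"})


# Run with:  locust -f locustfile.py --host https://staging.example.com
```
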

The flash-sale incident

In preparation for the 5.17 flash-sale promotion, the technical operations plan was to serve the flash sale from the everyday service cluster, and overall capacity was more than doubled before the event. On the day itself, however, order volume soared: in the few seconds after the flash sale started, instantaneous concurrent requests reached 50 times the usual level, and the traffic peak directly saturated the front-end Nginx network.

In retrospect, the problems were limited experience with flash-sale scenarios, an underestimate of the traffic peak the promotion would bring, the lack of prioritized per-URL rate limiting, and so on. The improvement was to build a dedicated flash-sale system, focused on tiered protection, client-side caching, swim lanes, cloud clusters, a competitive cache, and similar measures (a rate-limiter sketch follows).
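
The article does not detail how the per-URL rate limiting works; as a minimal sketch, a token-bucket limiter like the one below is one common way to cap instantaneous request rates at the entry layer. The URLs, rates, and handler are illustrative assumptions.

```python
import threading
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False    # over the limit: reject or downgrade the request


# Hypothetical per-URL budgets: the flash-sale endpoint gets a tighter one.
limiters = {
    "/seckill/order": TokenBucket(rate=500, capacity=1000),
    "/restaurant/list": TokenBucket(rate=5000, capacity=10000),
}


def handle(url: str) -> str:
    limiter = limiters.get(url)
    if limiter and not limiter.allow():
        return "429 Too Many Requests"   # shed load before it reaches the backend
    return "200 OK"
```
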

The third stage is improving efficiency, through tools, resources, and architectural transformation.

Incident 1: Hummingbird delivery suffered a string of incidents for two consecutive weeks

The causes were continuous batch retries of messages leading to RMQ backlogs, exhaustion of UDP handles, and incorrect use of the circuit-breaker logic. Clearly, when new business is delivered quickly, code quality and the way external components are used are high-risk sources of incidents.

Incident 2: MySQL

Slow SQL queries used to appear two or three times a week and have recently been reduced to very few. The solution is component governance: first, manage the component's own resources and capacity; second, set rate limits and downgrade paths; third, constrain how developers are allowed to use the component.

Once these three points are in place, the team moves on to automation work, mainly around information, standardization, and orchestration. Another item is up-front KPI indicators: when a component is first adopted, quantitative targets should be set. With these items done, the team can basically avoid major failures.

Governance of usage patterns brings the greatest gains in stability. A few key points in particular:

  • There must be someone proficient in the component who has read the source code, understands the pitfalls reported by the community, and works closely with front-line business development to understand the business scenarios and determine, early on, how the component will be used in the business.
  • Engineers carry out knowledge transfer, spreading standardization, development conventions, clustering practices, and correct usage patterns through every available channel.
  • Solidify experience and red lines into processes and tools, such as resource application and architecture review, as early as possible.

Incident 3: RMQ

At Ele.me, RMQ is used in many scenarios, from both Python and Java. At the beginning of 2016, although the engineers reviewed the technology and configuration, there were still many scenarios they had not thought of. The main issues were as follows:

  • Network partitions. During a cutover, the core switches were being upgraded and replaced. Once the network cutover finished, the configuration in the RMQ clusters was supposed to recover on its own, but many clusters did not. The team therefore keeps a cold-standby RMQ cluster with all the production configuration deployed to it, so that if any of the more than 20 online RMQ clusters goes down, traffic can be switched over in time.
  • Queue blockage. The main thing is to track consumption capacity: when the business is growing fast and consumers cannot keep up, queues back up easily.
  • Usage scenarios. For example, when sending and receiving messages, some code rebuilt the connection or re-declared the queue for every single message. Each rebuild triggers RMQ's internal event mechanism, and once the request rate is high enough it directly hurts RMQ throughput, dropping its capacity to a tenth of the original (see the connection-reuse sketch after this list).
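
The article names the anti-pattern but no code; the sketch below contrasts the per-message rebuild with connection reuse, using the Python `pika` client as an assumed example (the article does not name the client library; host and queue names are hypothetical).

```python
import pika


# Anti-pattern: building a new connection and re-declaring the queue per message
# triggers RMQ's internal event machinery on every send and kills throughput.
def publish_slow(body: bytes) -> None:
    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="rmq.example.internal"))
    channel = conn.channel()
    channel.queue_declare(queue="order_events", durable=True)
    channel.basic_publish(exchange="", routing_key="order_events", body=body)
    conn.close()


# Better: declare once, keep the connection and channel, and reuse them.
class Publisher:
    def __init__(self, host: str, queue: str):
        self._conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self._channel = self._conn.channel()
        self._channel.queue_declare(queue=queue, durable=True)
        self._queue = queue

    def publish(self, body: bytes) -> None:
        self._channel.basic_publish(exchange="", routing_key=self._queue, body=body)

    def close(self) -> None:
        self._conn.close()
```
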

A persistent difficulty: fault localization and recovery efficiency

The main reason fault localization is slow is that the system produces too much information. When a problem occurs, the engineer leading the investigation is flooded with signals; even with just three pieces of information in hand, it is hard to decide what the fault actually is and how to confirm it.

The current approach is a carpet-style sweep, split across people. What is a carpet sweep? Gather enough information first, divide the work, and require every engineer involved to check their own area. The scope covers food delivery, merchants, payment, and logistics, as well as basic business and network monitoring, external network traffic, server load, and so on.

At this point, having engineers clear their own areas in an orderly way becomes very important. What can be done today is that everyone can see whether the service they are responsible for has a problem; what operations needs to provide is tooling, for example for switch packet loss and server packet loss, so that engineers can spot issues promptly, though this process still takes time.

Also, when clearing your own area, check carefully. Each engineer is responsible for their own part of the system, and if mistakes slip through due to personal negligence or a sloppy self-check, they have to take the blame themselves. Once the fault is located, the key becomes recovering quickly and fixing the problem.

Emergency drills also matter. They are directly related to recovery efficiency: when a cluster fails, can the team bring it back quickly?

2. Operations lessons

Most of this sharing has been about incidents. No incident is accidental: many problems can be avoided through correct usage patterns, capacity estimation in advance, and gray releases. If the team only fixes each case as it comes, the same kind of accident will simply recur at another time. This requires engineers to work systematically: hold incident reviews, review incident reports, and form acceptance teams, then repeatedly raise the key points of each incident at every stage, continually summarizing them into workable operating norms.

Solving problems often requires a change in mindset. It requires people to carve time out of the important and urgent daily work to think, and to dare to tinker. What does tinkering mean? Constantly drilling and deliberately breaking things. Engineers must be intimately familiar with the systems they maintain so that they can locate and fix faults precisely.

The last point is the blind spot right under the lamp, especially in infrastructure. It used to be a headache: checking a single infrastructure problem took anywhere from ten minutes to an hour. Later a colleague changed the approach and built a system that solved this problem for the team very well. Daring to think and being diligent in trying is one of the most important lessons of Ele.me's technical team.

Origin blog.csdn.net/yaxuan88521/article/details/132416289