Didi's Go-Based DevOps Transformation

R&D efficiency and system stability are topics no R&D team can avoid: the former determines how fast the business can iterate, the latter determines the quality of what is delivered. Over the years, Didi has kept exploring more efficient technical means while guaranteeing stability, and has accumulated a great deal of practical experience. Recently, Wei Jingwu, head of R&D efficiency and stability for Didi's ride-hailing business, shared Didi's practice in these areas at Gopher China.

The following is the transcript of the speech.

Many engineers have probably had a similar experience: while handling a failure, on top of the enormous pressure of the incident itself, they also face pressure from the business side to ship new requirements. Stability and R&D efficiency seem hard to reconcile. We have been exploring how to keep improving delivery efficiency while guaranteeing stability. Since this is a technical problem, we had to find a technical solution, and DevOps became our point of breakthrough, because it covers almost the entire R&D delivery process.

Of course, improving efficiency while ensuring stability is a huge topic, and the whole industry, even society at large, is searching for the optimal solution. Today's talk is only a starting point; I hope our practice can give you some inspiration.

DevOps - New Challenges

As the figure below shows, the DevOps challenges Didi faces are much the same as those of other companies in the industry: business challenges from above and technical-architecture challenges from below.

[Figure: DevOps challenges - business complexity above, technical architecture below]

Let's take these one at a time. On the business side, Didi has multiple products such as private car, express, luxury car, and carpooling, and as the business develops the products keep getting more refined; carpooling alone, for example, includes express carpooling and special carpooling. At the same time, Didi has opened up its transportation capacity, allowing third-party ride-hailing platforms to access Didi's supply. All of this business complexity is projected into a technically complex microservice architecture.

In terms of architecture challenges, Didi has fully migrated to the cloud and runs active-active within a city and multi-active across regions. These upgrades bring obvious benefits, but they undeniably add to the complexity of the technical architecture: the original microservices now have to be deployed across multiple data centers. With only dozens or hundreds of microservices this might not be a big problem, but once the count reaches thousands or even tens of thousands, every issue we once thought small gets infinitely magnified. Specifically, it breaks down into the following three aspects:


Development

As the business develops, microservices get smaller and their responsibilities narrower. Compared with a monolith, each microservice carries less business logic, but it introduces many new concerns such as service discovery, RPC calls, and distributed tracing. As the saying goes, a microservice may be small, but it has all the vital organs - and the repetitive work is heavy.

Testing

With a small microservice architecture, an individual or a team can quickly stand up a test environment. Now imagine thousands or even tens of thousands of microservices: it is hard for any individual or team to build a stable test environment, let alone one set per person. The second problem is regression testing under a microservice architecture: nobody can say precisely what a given change affects, and even if they could, could every affected scenario really be prepared for regression? It's hard.

Operation and maintenance

As the number of microservices grows, we have to build more monitoring and alerting to avoid online incidents - alerts on business indicators, technical indicators, and other dimensions. But building more creates two further problems. First, there are too many alerts, and many false ones. Second, alert strategies may need constant adjustment because the business keeps changing, and keeping alerts effective demands a huge investment of labor. On top of that, localization is hard: a problem in any link can bring down the whole system, and quickly finding the root cause and stopping the loss is another huge challenge in the operations stage.

Next, I will walk through the challenges at each stage and how Didi responded to them.

Development - Cloud Native Scaffolding


The first challenge in the development stage is the heavy repetitive work mentioned above: service discovery, distributed tracing, RPC calls, and so on. But there are others. When Didi first chose its technology stack, it picked a weakly typed language for business development - PHP. PHP's biggest advantage is speed of development; it helped Didi ship products quickly and seize the market. However, as business complexity grew and microservices interacted more and more, the problems of weak typing were exposed. On one hand, loosely typed data often crashes the strongly typed services it talks to, such as those written in C++ or Java. On the other hand, there are performance problems. Didi's traffic has pronounced peaks and troughs - morning and evening rush hours, holidays, and especially the evening before long holidays such as May Day and National Day, when traffic is very high. At those times the performance overhead of a weakly typed language consumes a lot of machine resources; elastic scaling cuts a large part of the cost, but peak-hour consumption is still high.

Based on this, Didi made three "unifications" during the development phase:

Unified Go technology stack

As for why we unified on Go: besides the problems above, a more important reason was the migration cost for the whole technical team. Didi adopted Go early - before I joined in 2016, many teams inside Didi were already using it - so for all these reasons we finally decided to unify on the Go technology stack.

To be honest, though, choosing Go only got us agreement in principle. In practice we could not demand that everyone migrate to Go across the board, or pause business development to finish the migration - the business side would never accept that.

Unified framework

We encapsulate all non-business logic in the framework, so that teams can migrate existing services or create new ones at a low enough cost. At the same time, to stay compatible with existing services and to pay down some of the business's historical "technical debt", we chose to build our extensions around Thrift IDL (see the figure below).

[Figure: framework extensions centered on Thrift IDL]

Thrift is an extensible, cross-language RPC framework, and IDL is its interface description language. Thrift has its own wire protocol, but our extensions to Thrift IDL let a service support HTTP and gRPC while remaining compatible with Thrift's native protocol. This solves compatibility for existing Thrift services and also lets HTTP and gRPC services migrate smoothly.

On top of the extended IDL we can generate SDKs, documentation, and mock services for different languages, as well as a multi-protocol server. Besides the basic capabilities, this server also wraps middleware: we have encapsulated almost all of Didi's middleware SDKs - MQ, RDS, KV, and so on - so they work out of the box.
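To make the idea concrete, here is a minimal sketch in Go of "write the business handler once, expose it over multiple protocols". This is not Didi's framework: the standard library's net/rpc stands in for Thrift and net/http for the HTTP protocol, and all type and endpoint names are illustrative. In the real framework this plumbing, plus tracing, service discovery, and middleware, would be generated from the extended IDL.

```go
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
	"net/rpc"
)

// EstimateArgs and EstimateReply stand in for structs generated from the IDL.
type EstimateArgs struct {
	CityID int32
}

type EstimateReply struct {
	PriceCents int64
}

// Estimator holds the business logic, independent of any protocol.
type Estimator struct{}

func (e *Estimator) Estimate(args *EstimateArgs, reply *EstimateReply) error {
	reply.PriceCents = 1500 // placeholder pricing logic
	return nil
}

func main() {
	est := &Estimator{}

	// RPC endpoint (stand-in for the Thrift protocol).
	if err := rpc.Register(est); err != nil {
		log.Fatal(err)
	}
	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}
	go rpc.Accept(lis)

	// HTTP endpoint reusing the exact same handler.
	http.HandleFunc("/estimate", func(w http.ResponseWriter, r *http.Request) {
		var args EstimateArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		var reply EstimateReply
		if err := est.Estimate(&args, &reply); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		json.NewEncoder(w).Encode(&reply)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point of the sketch is the shape of the code, not the libraries: the business method is plain Go, and each protocol adapter is thin enough to be generated.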

Unified data

Personally I think this was the most important and the most difficult part for Didi: with more people and more services, the difficulty of unifying data grows exponentially. Unifying data inside the framework not only improves service governance, it also removes a great deal of repetitive development work.

Once these three things were done, the cost of migration dropped considerably. For new services in particular, business engineers only need to focus on business logic; the underlying complexity is shielded from them.

Testing - Traffic Playback and Test Environments

We covered framework migration above: the technology stack is unified and the framework is ready. But for migrating old services, the bigger challenge is how to ensure the migrated service is stable enough for a smooth migration.


At Didi, a single microservice may depend on hundreds of downstream services, and traditional unit testing or automated testing struggles to cover all the scenarios. It is worth asking: whether the microservice architecture is small or large, where does the real testing cost sit? Usually not in testing new features, but in the regression tests that accumulate over time (recall your own experience - most failures break old functionality, while the new feature itself is fine). In practice, for all sorts of messy reasons, it is hard to cover every scenario before each release, whether by manpower or by traditional techniques.

The other challenge in the testing phase is the one just mentioned: how to build a test environment. With thousands of microservices, how do you build one at all? And how do you make sure everyone has a stable test environment without interfering with each other?

As far as I know, the industry has two schools of thought on test environments. One is represented by contract testing: a microservice test environment should not be built at all, and all downstream dependencies should be mocked through contracts. But imagine hundreds of downstream dependencies, with each request fanning out into dozens or hundreds of downstream calls - how do you mock all of that? No amount of people can keep up.

The other school is Test in Production - testing online. The argument is that since an offline test environment is too costly, it is enough to logically isolate the online environment with test accounts. Didi cannot accept this: even with logical isolation, you cannot fully guarantee zero pollution of production, especially of online data.

How did Didi tackle these two challenges? Start with the first. Didi adopted a fairly innovative approach: traffic recording and playback. Unlike most recording-and-playback schemes in the industry, we record the full context of each request. What does that mean? Take ride price estimation: when a user requests an estimate, the request reaches the backend service. We record not only the request and response of the estimate interface, but also all the outbound traffic triggered while serving it - the downstream RPC calls, MySQL, Redis, and so on - and bind all of it together. That is the request context we just mentioned; we call it a session, and the binding guarantees that traffic from concurrent requests never gets mixed up.
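To illustrate the session concept, here is a minimal sketch of what such a record might hold: one inbound request and response plus every outbound call made while serving it, bound under one session ID. The field names are illustrative, not Didi's actual schema.

```go
package record

import "time"

// OutboundCall is one downstream request/response pair captured while the
// inbound request was being served.
type OutboundCall struct {
	Protocol string // e.g. "thrift", "mysql", "redis", "http"
	Target   string // downstream service name or DSN
	Request  []byte
	Response []byte
	SentAt   time.Time
}

// Session binds an inbound request to all of its outbound traffic, so that
// concurrent requests recorded on the same instance never get mixed up.
type Session struct {
	SessionID   string
	Service     string
	RecordedAt  time.Time
	InboundReq  []byte
	InboundResp []byte
	Outbound    []OutboundCall
}
```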

How is it implemented? Those who follow Didi's open source work may know we open-sourced two projects, one for PHP and one for Go. For Go, we originally solved recording by modifying the Go source code. Didi has since upgraded to a new solution: kernel-level recording, with no source changes required.

As shown in the figure below, in the new solution we use the eBPF Go library provided by Cilium for Go and C++ services, hooking all the relevant system calls in the kernel, such as accept, recv, and send.

[Figure: kernel-level traffic recording via eBPF hooks]

On top of that, we also hook user-space calls through uprobes. Because recording is bound to the process rather than done at the network level, we can record and bind all inbound and outbound traffic while ignoring noisy traffic from other processes.

For PHP, we use CGO + LD_PRELOAD, which hooks the calls at a higher level - libc. That approach is actually more convenient, but Go does not depend on libc by default, so we use different solutions for the two languages and record continuously, in real time, in production.
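As a rough illustration of the kernel-hook side, here is a heavily simplified sketch using the cilium/ebpf library to attach kprobes and read events. The object file "recorder.o", the program and map names inside it, and the chosen syscall symbols are all hypothetical; the real recorder also attaches uprobes and ties each event back to a session.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load a pre-compiled eBPF object (hypothetical name and contents).
	coll, err := ebpf.LoadCollection("recorder.o")
	if err != nil {
		log.Fatalf("load collection: %v", err)
	}
	defer coll.Close()

	// Attach kprobes to the syscalls we want to observe.
	for _, sym := range []string{"__x64_sys_accept4", "__x64_sys_recvfrom", "__x64_sys_sendto"} {
		prog, ok := coll.Programs["trace_"+sym]
		if !ok {
			log.Fatalf("program for %s not found in object file", sym)
		}
		kp, err := link.Kprobe(sym, prog, nil)
		if err != nil {
			log.Fatalf("attach kprobe %s: %v", sym, err)
		}
		defer kp.Close()
	}

	// Read captured events from a ring buffer map (hypothetical map name).
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("open ringbuf: %v", err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			log.Fatalf("read event: %v", err)
		}
		// In the real system each event would be decoded and bound to its session.
		log.Printf("captured %d bytes of syscall data", len(rec.RawSample))
	}
}
```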

After recording comes traffic filtering - down-sampling and filtering out abnormal traffic - and the recorded traffic is finally written to Elasticsearch. This process builds test cases with no manual effort. With these cases in hand, users only need to search Elasticsearch for the scenarios they want, or simply pick the latest batch of traffic for bulk playback.

Retrieval is just searching Elasticsearch: which parameters appear in a request or response, which fields a downstream MySQL query carries, even certain abnormal scenarios can be found. To make more scenarios searchable, we convert the different protocols - Thrift, MySQL, and so on - into text, so they can all be queried in ES.
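Here is a minimal sketch of the store-and-retrieve step, assuming the official go-elasticsearch client. The index name, document shape, and query are illustrative; the real pipeline does this at scale with its own schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"strings"

	"github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewDefaultClient()
	if err != nil {
		log.Fatal(err)
	}

	// Index one recorded session (fields are illustrative).
	doc := map[string]any{
		"session_id": "abc123",
		"service":    "estimate",
		"inbound":    `{"city_id": 1, "product": "express"}`,
		"outbound":   []string{"thrift://pricing", "mysql://orders"},
	}
	body, _ := json.Marshal(doc)
	idxRes, err := es.Index("traffic-sessions", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	idxRes.Body.Close()

	// Later, retrieve candidate cases for playback by full-text search,
	// e.g. all estimate traffic mentioning a given city.
	query := `{"query": {"bool": {"must": [
	  {"match": {"service": "estimate"}},
	  {"match": {"inbound": "city_id 1"}}
	]}}}`
	res, err := es.Search(
		es.Search.WithIndex("traffic-sessions"),
		es.Search.WithBody(strings.NewReader(query)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println(res.Status())
}
```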

Once the traffic is retrieved, the next step is playback. Besides mocking all network requests, the most important part is rolling back time, local caches, configuration, and so on - returning to the moment of recording. Why? During playback, a lot of business logic depends on timestamps, cache lifetimes, and the like; without going back in time, the logic may behave completely differently from production.

The last step is diffing online and offline traffic: does the replayed response match the online response, and do the downstream MySQL, Thrift, and HTTP requests match what was seen online? Any mismatch is a diff. There is noise, of course, and we have done a lot of noise reduction. The success of a replay is finally judged from the traffic diff.
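The diff-with-noise-reduction idea can be sketched very simply: decode both payloads as JSON, drop fields that are expected to differ between recording and replay, and compare what remains. The noise-field list below is illustrative; a real diff engine would be configurable per interface.

```go
package diff

import (
	"encoding/json"
	"reflect"
)

// Fields expected to differ between recording and replay (illustrative).
var noiseFields = map[string]bool{
	"timestamp": true,
	"trace_id":  true,
	"host":      true,
}

// stripNoise removes known-noisy fields recursively from decoded JSON.
func stripNoise(v any) any {
	switch val := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(val))
		for k, child := range val {
			if noiseFields[k] {
				continue
			}
			out[k] = stripNoise(child)
		}
		return out
	case []any:
		out := make([]any, len(val))
		for i, child := range val {
			out[i] = stripNoise(child)
		}
		return out
	default:
		return v
	}
}

// Equal reports whether the recorded and replayed responses match once the
// noisy fields are removed.
func Equal(recorded, replayed []byte) (bool, error) {
	var a, b any
	if err := json.Unmarshal(recorded, &a); err != nil {
		return false, err
	}
	if err := json.Unmarshal(replayed, &b); err != nil {
		return false, err
	}
	return reflect.DeepEqual(stripNoise(a), stripNoise(b)), nil
}
```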

We validated this: we replayed about 10,000 recorded sessions, compared code coverage against the online service, and found a gap of only 2% to 3% - meaning almost all scenarios are covered.

On this basis we can automatically build high-coverage test scenarios without manual effort and play them back in large batches, achieving high-coverage regression testing. The framework migration and the move to the Go stack mentioned earlier both relied on this method.

The above is our solution to the regression testing challenge.

This solution already runs in Didi's online pipeline, with all upstream modules connected; every release or code submission triggers a batch run. But we do not run all 10,000 cases. We filter each module's traffic down to a few thousand entries - for some modules only a few hundred - because most traffic is repetitive. So the pipeline only needs to run hundreds or thousands of cases, and playback finishes in about five minutes.

Now for the second challenge in the testing phase: how to build a test environment under a microservice architecture. Didi actually took a detour here, trying to have an individual or a team build the entire environment. At first we stuffed all the services into a single image. With few services it still worked, and everyone found it convenient - after all, each person only had to request their own image.

But as services multiplied, the environment became more and more unstable and harder to build; nobody could manage it. Some people even started testing in the pre-release environment, which is very risky.

On the question of building test environments under a microservice architecture, there are really two difficulties, or challenges. One is how to stand up a test environment at low cost - at the very least, the complete environment has to exist offline. The other is how to give everyone their own environment at low cost, without people affecting each other.

As the figure below shows, Didi's complete offline test environment is called the "offline benchmark environment". This owes a lot to Didi's investment in cloud native, which greatly reduces the cost of adding a new data center: we only need to add an offline node under the online node and bind the benchmark environment to it. Code is synchronized with the online service on every release and rollback, which keeps the test environment faithful to production.

[Figure: the offline benchmark environment and on-demand test environments]

To give each person their own test environment without interference, we built a set of sidecars in Go: a sidecar is deployed in front of every benchmark-environment service, and test environments are assembled on demand through a traffic-coloring scheme, drawing on Service Mesh and the industry's thinking about traffic coloring.

Take the figure above as an example: if I only need to work on S1, I only need to bring up S1 and rely on the benchmark environment for every other dependency. At present, during peak development periods for ride-hailing, we maintain roughly 1,000 sets of test environments - enough for one environment per person, sometimes more than one. We also reuse the online cloud-native capabilities to oversell resources and reclaim them on a schedule, keeping the cost as low as possible.
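The routing core of such a sidecar can be sketched with the standard library alone: requests carrying a "coloring" header go to the developer's on-demand environment, everything else falls back to the shared benchmark environment. The header name and addresses below are illustrative, not Didi's real scheme.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func mustProxy(raw string) *httputil.ReverseProxy {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	benchmark := mustProxy("http://s1.benchmark.internal:8000") // shared baseline
	colored := map[string]*httputil.ReverseProxy{
		// one entry per on-demand environment, keyed by the color value
		"dev-alice": mustProxy("http://s1.dev-alice.internal:8000"),
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if p, ok := colored[r.Header.Get("X-Test-Env")]; ok {
			p.ServeHTTP(w, r) // colored traffic goes to the developer's env
			return
		}
		benchmark.ServeHTTP(w, r) // everything else falls back to benchmark
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```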

O&M - AIOps


On the operations side, as mentioned earlier, there are two main problems: too many alerts that are nonetheless inaccurate and hard to maintain, and slow, difficult localization and loss-stopping. Beyond process improvements, at the technical level I think the best answer is AIOps. Gartner coined the term in 2016; the core idea is to analyze massive amounts of operations data with machine learning to improve operational capability.

Based on Didi's business scenarios, we divide AIOps into three directions: monitoring and alerting, root cause location, and fault recovery.


To expand a bit, three things matter most in monitoring and alerting. The first is time-series anomaly detection: most indicators are time series, with different values at different times and different curve shapes for different indicators, so anomaly detection is critical. The second is dismantling and classifying indicators: an indicator needs to be broken down by dimension. For example, when monitoring "order volume" it is not enough to watch the total; we take the Cartesian product of product lines, cities, and other dimensions to derive N curves, and then classify those curves. The third is automatic construction of alerts: once the indicators are dismantled, different alert strategies must be configured for different curve types, and since there are so many curves this has to be automated; the strategy for each curve can then be adjusted regularly as the business changes.
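The "dismantling" step is essentially a Cartesian product over dimension values, each combination becoming its own series to classify and alert on. A tiny sketch, with illustrative dimension values and a Prometheus-style label syntax chosen only for readability:

```go
package main

import "fmt"

func main() {
	products := []string{"express", "premier", "carpool"}
	cities := []string{"beijing", "shanghai", "chengdu"}

	// Each (product, city) pair gets its own curve.
	var series []string
	for _, p := range products {
		for _, c := range cities {
			series = append(series, fmt.Sprintf("finished_orders{product=%q,city=%q}", p, c))
		}
	}

	for _, s := range series {
		// Each series would next be classified by curve shape and assigned
		// an alert strategy automatically.
		fmt.Println(s)
	}
}
```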

Root cause location essentially simulates how a human locates a problem. Recall what you do: starting from the alert or from user feedback, you search the knowledge graph in your head for the most likely causes and eliminate them one by one until you find the root cause.

Didi's incident handling follows a similar process, but automated. We build a knowledge graph through correlation analysis, causal inference, and some manual annotation, and accumulate it over time - just like the graph in our heads. We then combine tracing, alert information, change events, and other data to determine the most likely root cause.


To take a concrete example, the process is mainly divided into the following steps:

Collect golden indicators

This data is critical. The data unification mentioned in the development stage really serves this purpose. In the whole operations system, the most important thing is standardizing the data - effectively the feature engineering of machine learning for services - and it accounts for nearly 70% of the cost.

Classify the indicators

As mentioned earlier, time-series curves differ. Take finished orders: in large cities the curve may show obvious peaks and troughs around the morning and evening rush hours, while in smaller cities the pattern may be much less pronounced. These indicators need to be classified.

Determine the alert strategy

Taking nationwide data as an example, the curve is relatively smooth and regular (morning and evening peaks, weekdays, holidays, and so on), so an ETS (exponential smoothing) strategy can be used: learn from historical data to predict future values, set a confidence interval, and alert when a value falls outside it. For the irregular curves of small cities, a simpler bottom-line threshold strategy may be needed. There are many anomaly detection algorithms in the industry, broadly supervised or unsupervised; Didi leans toward unsupervised methods, because supervised learning needs a lot of labeled data, and anomalies are so rare online that there simply is not enough.
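To show the ETS-style idea in miniature: smooth the recent history with simple exponential smoothing and flag the latest point if it falls outside a band around the prediction. Real detectors handle seasonality, holidays, and much more; the smoothing factor alpha and the band width k below are illustrative.

```go
package detect

import "math"

// Anomalous reports whether the last value in series deviates from the
// exponentially smoothed prediction by more than k standard deviations
// of the historical one-step-ahead residuals.
func Anomalous(series []float64, alpha, k float64) bool {
	if len(series) < 3 {
		return false // not enough history to judge
	}
	history, latest := series[:len(series)-1], series[len(series)-1]

	// Simple exponential smoothing over the history, collecting residuals.
	level := history[0]
	var residuals []float64
	for _, v := range history[1:] {
		residuals = append(residuals, v-level)
		level = alpha*v + (1-alpha)*level
	}

	// Standard deviation of the residuals defines the confidence band.
	var sum, sumSq float64
	for _, r := range residuals {
		sum += r
		sumSq += r * r
	}
	n := float64(len(residuals))
	mean := sum / n
	variance := sumSq/n - mean*mean
	if variance < 0 {
		variance = 0 // guard against floating-point rounding
	}
	std := math.Sqrt(variance)

	return math.Abs(latest-level) > k*std
}
```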

Alarm masking

When an incident actually occurs, there may be an alert storm. In the short term it is hard to tell cause from effect, and the storm disrupts everyone's investigation. There are roughly two approaches to alarm masking. The first is single-policy suppression, which is relatively simple: for example, if we have configured CPU alerts and 1,000 machines under a node all report a CPU drop, we can merge those 1,000 alerts into one; notification frequency can also be limited to once every 10 or 30 minutes.

The second approach relies on root cause location to decide which alerts can be masked. For example, if finished orders drop because price estimation failed, the estimation alert should fire, but there is no need to also page on the finished-order drop.
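The first, single-policy approach is essentially grouping: merge alerts that share a rule into one notification per window. A minimal sketch with illustrative types; the per-rule rate limiting (once every 10 or 30 minutes) is noted but omitted.

```go
package suppress

import "time"

type Alert struct {
	Rule string // e.g. "cpu_idle_low"
	Host string
	At   time.Time
}

type Notification struct {
	Rule  string
	Count int // how many raw alerts were merged
	Hosts []string
}

// Merge collapses a burst of raw alerts into one notification per rule.
// A real system would also rate-limit per rule, which is omitted here.
func Merge(alerts []Alert) []Notification {
	byRule := map[string]*Notification{}
	var ordered []*Notification
	for _, a := range alerts {
		n, ok := byRule[a.Rule]
		if !ok {
			n = &Notification{Rule: a.Rule}
			byRule[a.Rule] = n
			ordered = append(ordered, n)
		}
		n.Count++
		n.Hosts = append(n.Hosts, a.Host)
	}
	result := make([]Notification, 0, len(ordered))
	for _, n := range ordered {
		result = append(result, *n)
	}
	return result
}
```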

Root cause location

Location logic also falls into two categories. One is horizontal, or business, location. Sticking with finished orders: when finished orders drop, the first question is whether the drop comes from the driver side or the passenger side. If it is the passenger side, is it order creation or estimation? Is the mini-program affected, or the main app? If it is the mini-program, the investigation starts there.

The other category is vertical, or technical, location. Once we have established that, say, nationwide estimation has dropped, we analyze the services involved in estimation - traces, metrics, saturation indicators, change events, and so on - to pin down the root cause of the alert. Most incidents are related to change events, so every online change is recorded for quick lookup.
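One small but representative step of vertical location is correlating the alert with recent change events. A minimal sketch, with illustrative types and an arbitrary lookback window:

```go
package locate

import (
	"sort"
	"time"
)

type ChangeEvent struct {
	Service string
	Kind    string // "deploy", "config", "flag", ...
	At      time.Time
}

// SuspectChanges returns the change events that happened within `window`
// before the alert, most recent first, as a starting point for digging.
func SuspectChanges(events []ChangeEvent, alertAt time.Time, window time.Duration) []ChangeEvent {
	var suspects []ChangeEvent
	for _, e := range events {
		if e.At.Before(alertAt) && alertAt.Sub(e.At) <= window {
			suspects = append(suspects, e)
		}
	}
	sort.Slice(suspects, func(i, j int) bool { return suspects[i].At.After(suspects[j].At) })
	return suspects
}
```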

Recovery

After locating the root cause, or at least a mitigation, the next step is fault recovery. Recovery does not mean improvising a degradation plan each time an alert fires; it means running plenty of fault-injection drills beforehand, building contingency plans for likely risk points in advance, and then simply executing the pre-built plan during the stop-loss stage.

The figure below shows the location bot Didi currently uses - internally we call it the Donghai Dragon King ("Dragon King of the East China Sea"). It has taken over most of our location work. Through it we can pull up all the relevant information within seconds. For example, from interface success rates it can tell which data center has a problem; once the problem is narrowed to a specific data center, we do not even need the root cause first - we can switch traffic away immediately to stop the loss and investigate the root cause after recovery.

[Figure: the Donghai Dragon King location bot]

In addition, we extract the key information from status codes and traces into the Donghai Dragon King - the error codes and error messages shown in the figure above, RPC location and degradation plans, CPU, memory, and disk checks, and more. Most importantly, it notifies the service owner directly, because in the actual stop-loss process many degradations still require a human decision.

Future - Automated Deployment


If we divide DevOps work into manual and automated, we find that in every link - especially under a large-scale microservice architecture - the bottleneck of manual work keeps getting more prominent. In the development stage, for example, manual work mostly means writing specifications and doing code review, but people's abilities and standards differ, and even the same person varies with mood and time, so it is hard to hold a stable bar. With automation - static scanning, compilation checks, framework constraints, and so on - static inspection can be raised to a consistently stable level. The same applies to the AIOps approach just described, which keeps operational capability at a stable level. By continuously improving these automated capabilities, we keep raising the lower bound of what we can guarantee.

Based on this reasoning, Didi is building a fully automated pipeline: making everything from code review to deployment and rollback unattended, relying entirely on automation to detect problems in advance and notify the owner. This improves release efficiency and also raises the floor of stability to a certain level. Of course, none of it would be possible without the foundational capabilities Didi has built over the years: cloud native, observability, and chaos engineering.


END

Author and department introduction 

The author of this article, Wei Jingwu, is from Didi's ride-hailing Travel Technology team. As the R&D team behind the ride-hailing business, Travel Technology has built the end-user experience platform, the C-side user product ecosystem, the B-side capacity supply ecosystem, the travel safety ecosystem, the service governance ecosystem, and the core security system, to create a travel platform that is safe, reliable, efficient, convenient, and trusted by users.

Job Offers

The team is hiring for backend development and test development roles. Interested candidates are welcome to join us - scan the QR code below to submit your resume directly. We look forward to having you on board!

R&D Engineer

Job description:

1. Responsible for backend development of related business systems, including business architecture design, development, complexity control, and improvement of system performance and R&D efficiency;

2. Business-minded; through continuous technical research and innovation, iterate with product and operations to improve the business's core metrics.

[QR code]

Test Development Engineer

Job description: 

1. Build a quality assurance system applicable to the online car-hailing business, formulate and promote the implementation of relevant quality technical solutions, and continue to ensure business quality;

2. Understand the business deeply, build communication with all roles in the business, identify business problems and pain points, create value for the business in all aspects, and work without fixed boundaries;

3. Improve business code quality and delivery efficiency by applying relevant quality infrastructure;

4. Distill efficient testing solutions and generalize them to support adoption in other business lines;

5. Solve difficult problems and complex technical problems in business quality assurance;

6. Forward-looking exploration in the field of quality technology.

[QR code]
