Automated operations and maintenance at Teld: application practice

        After more than three years of rapid development, the Teld (特来电) cloud platform has grown from nothing, from weak to strong, and from strong to refined. Continuously polished in real-world use, it has become a benchmark for charging networks in China and even internationally, supporting more than 3 million kWh of charging per day and more than 800 million kWh in total. Behind this lie not only the technical innovation and foresight built into the cloud platform and the R&D team's relentless pursuit of excellence, but also the quiet work of the operations and maintenance (O&M) system and its staff. In this article the author offers a few modest thoughts, in the hope of prompting better ones, on how we understand monitoring and O&M and on Teld's exploration and practice of automated O&M.

1. Improving our understanding

1. Understanding monitoring

        When it comes to monitoring, the first impression is that it means planting probes, collecting machine data, and displaying that data after statistical processing.

        Through continuous iteration and a steadily improving understanding of monitoring, the Teld cloud platform has built up a fairly complete product family:

 

 

        First, we built a comprehensive monitoring platform on the big data processing engine Flink, covering system monitoring, middleware monitoring, business monitoring, and more. It gave us a broad view of the system, but each chart stood alone and the relationships between charts were hard to see. Next, following mainstream industry practice, we developed single-link monitoring in house: a TraceId ties together all the hops of one request from front to back, and an RPCId puts them in order. Threading the needle in this way restores the full picture of a distributed call and gives strong support for locating performance bottlenecks and troubleshooting. We then recognized that single-link monitoring shows only one path from A to B, while there may be several. Using offline big data analysis, we drew a full-link map that represents all of these paths, which made our understanding of the relationships more complete. Finally, all of these products are static views; a dynamic link graph should be drawn with real-time stream computing so that the running state of the whole system can be understood closer to real time.
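The idea of tying a request together with a TraceId and ordering its hops with an RPCId can be sketched roughly as follows. This is a minimal illustration, not the platform's actual implementation: the span fields, the service names, and the dotted RPCId scheme ("0" for the entry point, "0.1" for its first downstream call) are all assumptions made for the example. The second function aggregates caller-to-callee pairs across many traces, which is one simple way a full-link map could be derived offline.

```python
from collections import defaultdict

# Hypothetical span records. Every hop of one request shares a trace_id, and
# rpc_id is assumed to encode the position in the call tree:
# "0" is the entry point, "0.1" its first downstream call, and so on.
spans = [
    {"trace_id": "t-1", "rpc_id": "0.2", "service": "billing", "ms": 12},
    {"trace_id": "t-1", "rpc_id": "0", "service": "gateway", "ms": 80},
    {"trace_id": "t-1", "rpc_id": "0.1.1", "service": "pile-db", "ms": 30},
    {"trace_id": "t-1", "rpc_id": "0.1", "service": "charging", "ms": 55},
    {"trace_id": "t-2", "rpc_id": "0", "service": "gateway", "ms": 9},
]


def rpc_key(rpc_id):
    """Turn "0.1.2" into (0, 1, 2) so that "0.10" sorts after "0.2"."""
    return tuple(int(part) for part in rpc_id.split("."))


def build_call_chain(spans, trace_id):
    """Return (depth, service, ms) for one request, in call order."""
    chain = [s for s in spans if s["trace_id"] == trace_id]
    chain.sort(key=lambda s: rpc_key(s["rpc_id"]))
    return [(len(rpc_key(s["rpc_id"])) - 1, s["service"], s["ms"]) for s in chain]


def call_edges(spans):
    """Count caller -> callee pairs across all traces (a tiny full-link map)."""
    by_id = {(s["trace_id"], s["rpc_id"]): s["service"] for s in spans}
    edges = defaultdict(int)
    for (trace_id, rpc_id), callee in by_id.items():
        parent_id = rpc_id.rpartition(".")[0]  # "" for the entry point
        caller = by_id.get((trace_id, parent_id))
        if caller is not None:
            edges[(caller, callee)] += 1
    return dict(edges)


chain_t1 = build_call_chain(spans, "t-1")
for depth, service, ms in chain_t1:
    print(f"{'  ' * depth}{service} ({ms} ms)")
edges = call_edges(spans)
```

Sorting on the numeric tuple rather than the raw string matters once a service makes ten or more downstream calls, because "0.10" would otherwise sort before "0.2".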

        In Chinese, "monitoring" (监控) is a single word, but translated into English it is made up of two words: monitor and control. On reflection, we found that our earlier work was mostly about "monitor". A monitor cannot directly control the machine that has the problem, so "control" is where we need to put our effort. Because monitoring cannot drive control directly, the two have to be connected through "analysis": only after the monitoring data has been analyzed can control be applied in a targeted way. "Analysis" is therefore another area we need to strengthen and keep investing in.

        Monitoring, analysis, and control are three points, and each point corresponds to a product. What is the goal of each? Monitoring should be "complete": only by collecting data comprehensively can we know the operating state of the system. Analysis should be "fast": only fast analysis allows efficient decisions. Control must be "accurate": only accurate control solves the problem immediately. Taking this understanding a step further, the three points are not isolated. They complement one another to form a "line", and together they deliver an integrated solution as a "surface", as shown below:

 

 

        As mentioned above, "control" is where the overall monitoring solution needs the most work. So how do we control? We believe control is achieved through operations and maintenance, specifically automated O&M and intelligent O&M.

2. Understanding operations and maintenance

        When it comes to operations and maintenance, the first impression is that it is a very hard job, as in the parody rhyme: "Hoeing under the midday sun is not as hard as O&M; facing a broken computer, one fix takes a whole afternoon." In fact, O&M staff do a great deal. When a failure occurs they restart machines, rush to put out fires, and take the blame afterwards. And however badly their partner, the machine, treats them, they stay devoted: the machine torments me a thousand times, and I still treat it like my first love. In a word, O&M is not easy; cherish it as you go.

The operations and maintenance of the Teld cloud platform is O&M on the cloud. The whole O&M system is divided into two levels (system O&M and application O&M) and four dimensions (quality, security, efficiency, and operations). The overall functional structure is shown in the figure below:

 

 

        Quality management is essentially DevOps. O&M must pursue system stability, but stability is the effect and quality is the cause: only good quality management leads to a stable system. Only with security in place will the system not run "naked". Efficiency shows up in automated and intelligent O&M: only when O&M work is automated and intelligent can staff be freed from repetitive work to do more valuable things. With quality, security, and efficiency in place, high-level, all-round operational analysis is assured. Our view of O&M is therefore: focus on quality, promote stability, improve efficiency, ensure security, automate fully, and operate precisely.

As mentioned above, the operations and maintenance of the Teld cloud platform is divided into two levels (system O&M and application O&M). Let's look at how the first stage of "control" is realized at the application level: automated O&M.

2. Automated operations and maintenance

    Automated O&M is a familiar topic. Different people understand it differently, and there are many open source solutions. After research and selection, we built our automated O&M platform with an "open source" plus "self-developed" approach. The functional architecture of the platform is as follows:

 

 

      We believe that O&M staff mainly do two things: maintain system state and execute commands remotely. We therefore abstracted this work into two models, plug-ins and scripts, which serve as the left and right hands of O&M staff. They are dispatched by command, downloaded to the designated machines, run the corresponding O&M tasks, and report the logs produced along the way. All transmissions are encrypted to keep O&M tasks secure.

      Because O&M work is abstracted into plug-ins and scripts, once the platform framework is in place most of the remaining work is developing plug-ins and scripts for specific pain points. The O&M Agent, which hosts the plug-ins and scripts, is more like a lightweight O&M container. Whenever it receives an O&M task, it starts a separate executor process that downloads and executes the plug-ins and scripts. Meanwhile, the logs produced during the task are continuously reported to log storage, as shown below:
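The agent behaviour described here (one executor process per task, fetch the script, run it, stream the logs back) can be sketched in a few lines. Everything named in this example is hypothetical: the in-memory `SCRIPT_REPO` stands in for the encrypted download step, the `report` callback stands in for the central log storage, and the scripts themselves are trivial placeholders.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical script repository. A real agent would download the script
# over an encrypted channel instead of reading it from a dict.
SCRIPT_REPO = {
    "hello": "print('step 1')\nprint('step 2')\n",
    "fail": "import sys\nprint('about to fail')\nsys.exit(3)\n",
}


def run_task(task_id, script_name, report):
    """Fetch a script, run it in its own process, and report each log line."""
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / f"{script_name}.py"
        script_path.write_text(SCRIPT_REPO[script_name])  # stands in for the download
        proc = subprocess.Popen(
            [sys.executable, str(script_path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge errors into the same log stream
            text=True,
        )
        for line in proc.stdout:  # forward logs while the task is still running
            report(task_id, line.rstrip("\n"))
        exit_code = proc.wait()
    report(task_id, f"exit code {exit_code}")
    return exit_code


log_store = []  # stands in for the central log storage
ok_code = run_task("task-1", "hello", lambda task, line: log_store.append((task, line)))
bad_code = run_task("task-2", "fail", lambda task, line: log_store.append((task, line)))
```

Running each task in a separate process means a misbehaving script can be killed or can crash without taking the agent down with it, which is presumably part of why the agent is described as a lightweight container.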

 

 

   Typical automated O&M platforms are very large systems and are not necessarily suitable for every enterprise. After abstraction and practice, the automated O&M platform of the Teld cloud platform is simple yet far from simplistic, and it addresses pain points we often meet in daily O&M. Below are two application practices of automated O&M.

3. Application Practice

1. Distributed application deployment

        Deploying distributed applications is routine work in Internet companies. But when a system involves hundreds or thousands of machines and the application is very complex, installing, deploying, and updating it involves a series of operations, such as database restoration, application deployment, web site creation and binding, and system service creation and startup. Done by hand, these are time-consuming, labor-intensive, and error-prone. By combining the installation platform with the automated O&M platform, work that used to take many people two weeks now takes one person only 20 minutes, a large gain in productivity. The following is a simple diagram of distributed application deployment:
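As a rough illustration of why automation helps here, the sketch below runs an ordered list of deployment steps on several hosts in parallel and stops a host at its first failed step. The step names follow the operations listed above; the host names, the `run_step` placeholder, and the simulated failure are assumptions made for the example and do not describe the actual installation platform.

```python
from concurrent.futures import ThreadPoolExecutor

# Step names mirror the operations named in the text; the order is assumed.
STEPS = [
    "restore_database",
    "deploy_application",
    "create_and_bind_website",
    "create_and_start_service",
]


def run_step(host, step):
    """Placeholder for sending one command to the agent on `host`."""
    if host == "web-03" and step == "create_and_bind_website":
        raise RuntimeError("port 80 already bound")  # simulated failure
    return f"{host}: {step} ok"


def deploy_host(host):
    """Run the steps in order on one host and stop at the first failure."""
    log = []
    for step in STEPS:
        try:
            log.append(run_step(host, step))
        except RuntimeError as err:
            log.append(f"{host}: {step} FAILED ({err})")
            return host, False, log
    return host, True, log


def deploy_all(hosts, max_workers=8):
    """Deploy to all hosts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(deploy_host, hosts))


results = deploy_all(["web-01", "web-02", "web-03"])
failed = [host for host, ok, _ in results if not ok]
print("failed hosts:", failed)
```

Stopping at the first failed step keeps a half-configured host from starting a service against a site that was never bound, and the per-host log makes it clear where to resume.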

 

 

2. Early warning platform

        By analyzing a large number of failures, we found that routine operations such as restarting a site or process, or killing blocking processes or connections, solve 80% of problems. If these manual operations can be automated and applied before a problem worsens, early control achieves twice the result with half the effort. Within monitoring, analysis, and control, the early warning platform leans toward "analysis". It evaluates the monitoring data against rules; when a threshold is triggered, it obtains the machine, process, site, and service information involved. The automated O&M platform then issues commands and actions, loads the corresponding plug-ins and scripts, and runs the corresponding O&M tasks, achieving "control" of the system. The following is a simple diagram:
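The rule-then-action flow can be sketched as a small rule evaluator. This is only an illustration under stated assumptions: the `Rule` fields, the metric names, the "N consecutive samples above a threshold" condition, and the action names are all hypothetical, and the `dispatch` callback stands in for handing the task to the automated O&M platform.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str        # name of the monitored metric
    threshold: float   # value the samples must exceed
    consecutive: int   # how many samples in a row must exceed the threshold
    action: str        # name of the plug-in or script to run


def evaluate(rule, samples):
    """True when the last `rule.consecutive` samples all exceed the threshold."""
    recent = samples[-rule.consecutive:]
    return len(recent) == rule.consecutive and all(v > rule.threshold for v in recent)


def check(rules, metrics, dispatch):
    """Evaluate every rule for every host and dispatch the action on a match."""
    fired = []
    for host, series in metrics.items():
        for rule in rules:
            if evaluate(rule, series.get(rule.metric, [])):
                dispatch(host, rule.action)
                fired.append((host, rule.action))
    return fired


rules = [
    Rule("cpu_percent", 90, 3, "restart_site"),
    Rule("blocked_sessions", 10, 2, "kill_blocking_connections"),
]
metrics = {
    "web-01": {"cpu_percent": [50, 95, 96, 99]},
    "db-01": {"cpu_percent": [95, 40, 97], "blocked_sessions": [0, 12, 15]},
}

dispatched = []
fired = check(rules, metrics, lambda host, action: dispatched.append((host, action)))
print(fired)
```

Requiring several consecutive samples above the threshold is one simple way to avoid restarting a site on a single spike; a real rule engine would likely also need cooldowns so the same action is not fired repeatedly.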

4. Teld Cloud Computing and Big Data WeChat official account

 1. WeChat official account name: Teld Cloud Computing and Big Data (特来电云计算与大数据)

 2. QR code:

 

 

