Long article | Pacific Insurance's intelligent monitoring system based on Zabbix

This article covers the construction history of the CPIC monitoring platform, the integrated monitoring platform built on Zabbix, the fusion of monitoring data to create an intelligent monitoring platform, and an intelligent operations system that detects a fault as soon as it occurs and handles it as soon as it is detected.

——Du Yingjun, automation operations expert, Taibao Technology

The slides can be obtained from the WeChat official account "Zabbix open source community" by replying "ppt".


01 Construction history of CPIC monitoring platform

The first part introduces the construction process of the CPIC monitoring platform. As the picture shows, CPIC has been building an overall monitoring system since 2008.

From 2008 to 2017 we basically used BMC's commercial products. In 2017 we began exploring alternatives to commercial products, and in 2018 the plan to replace BMC with Zabbix was formally confirmed. After a series of validations in the test environment, we replaced it gradually from 2018 to 2020 and completed the replacement in 2020, moving all monitoring metric collection onto Zabbix, which took over mainly the functions of BMC and Netcool. At the current stage we are focusing on fault early warning and location, and on analysis scenarios for intelligent troubleshooting.


Monitoring is inseparable from the operations tools behind it, so I will also introduce how the tool platform inside CPIC was built. We started building the automated operations and monitoring system in 2014 and have gone through roughly four stages; we are still in the third stage, in which front-end and automated scenarios have basically been realized. The fourth stage, in our view, is an intelligent operations system that is data-driven and self-healing, with scenarios that become richer and richer. We are now gradually starting to work toward it.


The following picture gives an overview of CPIC's existing tool platform. At the front are our functional platforms: private cloud management, a container platform, and an automated operations platform. The bottom layer consists of the devices we connect to, alongside configuration information and data-collection functions, mainly monitoring and logging as well as our CMDB. The middle layer is our service gateway, which we also abstract into a middle platform of operations capabilities. On the upper level we package the various automated operations and monitoring application scenarios, and we also have a series of low-code platforms and user-facing UI systems displayed on large screens.

02 Integrated monitoring platform based on Zabbix

This is how we use Zabbix internally. We implemented a distributed deployment across two sites and three centers, managing three environments (development, testing, and production) with about 60,000 nodes and more than 2,200 monitoring metrics online. Next, let me introduce the Zabbix features that have worked well for us.


The first is threshold definition, for which we mainly use Zabbix's trigger function.
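Purely as an illustration of the idea, here is a minimal Python sketch that creates such a threshold trigger through the Zabbix JSON-RPC API; the endpoint, credentials, host name, item key, and macro are assumptions, not CPIC's actual configuration. The trigger expression references a user macro, which also ties in with the macro usage described next.

```python
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # illustrative endpoint

def api_call(method, params, auth=None):
    """Minimal helper for the Zabbix JSON-RPC API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    if auth:
        payload["auth"] = auth
    resp = requests.post(ZABBIX_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]

# Log in; note that 6.0+ releases use "username" instead of "user" here.
token = api_call("user.login", {"user": "monitor", "password": "secret"})

# Create a CPU-utilisation trigger. The threshold is a user macro
# ({$CPU.UTIL.CRIT}) so it can be overridden per template or per host.
api_call("trigger.create", {
    "description": "High CPU utilisation on {HOST.NAME}",
    # Pre-6.0 expression syntax; 6.0+ would use last(/app-server-01/system.cpu.util).
    "expression": "{app-server-01:system.cpu.util.last()}>{$CPU.UTIL.CRIT}",
    "priority": 4,  # "High" severity
}, auth=token)
```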

The second is configuration templates, which greatly reduce our overall configuration workload. Distributed deployment was mentioned earlier: unified management across the multiple Zabbix instances in our two sites and three centers is relatively convenient. For alarm configuration, macro definitions in Zabbix are used heavily internally; related information can be distributed directly through Zabbix, and on the upper layer we combine this with data provided by some of our other internal systems.

The third is automatic discovery. Monitoring blind spots are a very important topic, and since introducing Zabbix its self-discovery has helped us a great deal, mainly for file systems and ports; this discovery capability has genuinely reduced the workload of our operations engineers.
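To make the file-system discovery concrete, here is a hedged sketch that configures a low-level discovery rule and an item prototype through the same api_call helper as the previous snippet. The host and interface IDs are placeholders, and in practice this would usually live in a template rather than be scripted per host.

```python
# Reusing the api_call() helper and token from the previous sketch.
# Host and interface IDs below are placeholders for illustration only.
HOST_ID = "10105"
INTERFACE_ID = "5"

# Discovery rule: the built-in agent key vfs.fs.discovery enumerates mounted
# file systems and exposes each one through the {#FSNAME} LLD macro.
rule = api_call("discoveryrule.create", {
    "hostid": HOST_ID,
    "name": "Mounted filesystem discovery",
    "key_": "vfs.fs.discovery",
    "type": 0,                  # 0 = Zabbix agent
    "interfaceid": INTERFACE_ID,
    "delay": "1h",
}, auth=token)

# Item prototype: one "free disk space" item is created automatically for
# every file system the rule discovers.
api_call("itemprototype.create", {
    "ruleid": rule["itemids"][0],
    "hostid": HOST_ID,
    "name": "Free disk space on {#FSNAME}",
    "key_": "vfs.fs.size[{#FSNAME},pfree]",
    "type": 0,
    "value_type": 0,            # numeric float
    "interfaceid": INTERFACE_ID,
    "delay": "1m",
}, auth=token)
```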

The last piece is data export, which is an important dependency for our subsequent data analysis. This openness is also Zabbix's biggest advantage over commercial products.


This page mainly introduces some problems we encountered while building out Zabbix.

The first is the capacity of a single Zabbix instance. Each instance now manages roughly 2,000 nodes; a single instance can only handle so much, and managing too many nodes can lead to performance problems.

The second is that network equipment monitoring is easy to miss, and monitoring blind spots often appear. We addressed this by combining our own internal processes, including device commissioning and decommissioning, with information from our CMDB; relying on these resources compensates for some of Zabbix's own weaknesses in metric coverage. The third, related to the macro variables just mentioned, is that the configuration of key metrics is not very readable, which front-line operations engineers may find hard to understand. Because there are so many metrics we have not yet covered all the key ones; we are gradually translating them and establishing rules for doing so.

The fourth is life-cycle management of monitored objects. Before we implemented this, a decommissioned device was often kept running temporarily even though no business system used it any more, and in that state false alarms were easy to generate. We now combine this with our CMDB and the overall life-cycle process of a device, so that once a device reaches that stage its alarms are suppressed automatically.
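As a rough sketch of how such suppression might be wired up, the snippet below (reusing the api_call helper and token from the first example) disables Zabbix hosts that a CMDB query reports as decommissioned; fetch_decommissioned_hosts is a hypothetical stand-in for CPIC's CMDB interface.

```python
# Hypothetical sketch reusing api_call() and token from the first example.
# fetch_decommissioned_hosts() stands in for a query against CPIC's CMDB.

def fetch_decommissioned_hosts():
    # In reality this would call the CMDB API; the names are illustrative.
    return ["app-server-17", "db-server-03"]

for hostname in fetch_decommissioned_hosts():
    found = api_call("host.get", {
        "filter": {"host": [hostname]},
        "output": ["hostid"],
    }, auth=token)
    if not found:
        continue
    # Disabling the host (status 1) stops polling and therefore new alarms;
    # a maintenance window created via maintenance.create would be a softer
    # alternative if the device might come back.
    api_call("host.update", {"hostid": found[0]["hostid"], "status": 1}, auth=token)
```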

The last problem is that Zabbix relies on a relational database. Here we mainly use the data export function mentioned above: data is first exported to files and then collected by Filebeat into our MongoDB, which we call the operations data middle platform. All of our follow-up analysis and aggregation goes through this bus.
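We use Filebeat for the actual collection; purely to show the shape of this pipeline, here is a minimal Python sketch that reads one of Zabbix's real-time export files (newline-delimited JSON) and loads the records into MongoDB. The file path, database, and collection names are assumptions.

```python
import json
from pymongo import MongoClient

# Path and names below are illustrative, not CPIC's actual configuration.
EXPORT_FILE = "/var/lib/zabbix/export/history-history-syncer-1.ndjson"
client = MongoClient("mongodb://ops-data-bus.example.com:27017/")
collection = client["ops_data"]["zabbix_history"]

def load_export(path):
    """Yield one record per line of a Zabbix real-time export file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                # History-export records carry fields such as host, name,
                # clock and value; keep the whole document for later analysis.
                yield json.loads(line)

docs = list(load_export(EXPORT_FILE))
if docs:
    collection.insert_many(docs, ordered=False)
```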


Zabbix's adaptation to Xinchuang (domestic innovation) platforms is also a hot topic now.

On the left are the components we manage internally with Zabbix. At the operating-system level we have tried Tongxin (UOS), Kylin, and Hongqi; for databases we have used Tencent, Alibaba, and Dameng products. All of these have been brought under management within CPIC.

On the right is the adaptation of Zabbix itself. We have deployed Zabbix on the Tongxin and Kylin operating systems without any problems. The main issue is the database: the current Xinchuang databases are mostly based on MySQL 5, while Zabbix itself requires 8.0, so we had a lot of doubts. We are looking at Tencent's TTC, the MySQL-compatible version, but have not yet put it into production; so far it only runs in the test environment, where it works.


This is a schematic diagram of the overall monitoring platform within CPIC, which I would also like to take this opportunity to share with you.

We have three centers. The environment of the remote Chengdu data center is relatively complicated: it hosts Zabbix and the test environments, and a small part of the production systems is also there. The other two centers are in Shanghai; one was originally in Tianlin, and the main center is now in Luojing, with the Tianlin site being wound down.

We later established cooperation with Alibaba Cloud and built a new data center, but the monitoring system remains basically unchanged. The architecture across the three centers is based on several Zabbix instances, on top of which we have encapsulated an overall event management center (I will come back to this later). Data is collected as files into a message queue, and streaming engines then perform data aggregation. Above that sits our internal operations data bus, effectively a small middle platform: all monitoring data, not only from Zabbix but also from link monitoring and additional hardware monitoring, is placed on this bus.

Post-event analysis mainly relies on a data service provided by MongoDB; the other piece, combined with logs, lives in Elasticsearch and is not illustrated in this picture.


This part mainly introduces the effectiveness of Zabbix, from the perspectives of cost reduction, efficiency improvement, and empowerment.

Cost reduction is obvious: as the founder just mentioned, Zabbix is not limited by the number of monitored nodes, which is our biggest advantage.

The second is efficiency improvement: on monitoring blind spots and invalid alarms, Zabbix is definitely better than the previous monitoring platforms, and its timeliness is a considerable improvement over BMC. After we moved to Zabbix, the number of managed nodes grew at double the previous speed over the same period.

The last piece, empowerment, means data openness, which I think is the most important. Monitoring data makes up a large part of operations data; with a commercial product, subsequent analysis tends to be constrained and we would have less room for our own development.

03 Integrating monitoring data to create an intelligent monitoring platform

The third part covers the major development work we have done internally to combine monitoring data within the scope of monitoring.

The first topic is the governance of operations data, which we divide into three layers internally. The first is the raw data layer: data emitted by our existing automation, monitoring, logging, CMDB, and cloud-management systems, including collected monitoring data. Unlike the traditional data-warehouse approach, we request data on demand rather than cloning everything before doing statistical analysis, because operations data differs greatly from business data: 70-80% of business data is valuable, while for operations data that ratio is roughly reversed. The advantage of this approach is that it saves hardware cost and also helps later performance.

The second layer is the public dimension layer. Our team builds abstract public entities here, and the value of this layer lies in the computed results: we extract the analyses the various professional operations teams need and pre-compute them, for example averages and similar aggregates. We also combine the data with our integrity analysis and place it on this layer to establish an overall management system.


Data life cycle is also very important, I think, because we took some detours before: we built a data-lake style model but found the volume kept growing and it became harder and harder to handle. After many version iterations we finally decided to put the data in MongoDB, which solved this problem, including life-cycle management, which is now relatively easy: data that is no longer needed is deleted as soon as possible, keeping the platform slim rather than bulky.

Analysis, decision-making, and prediction are relatively open: we build them together with the professional teams. Part of it they use directly for small operations scenarios; the other part, the larger analysis, decision-making, early warning, and prediction, is implemented by our overall operations tool R&D team.


The second block is an overall alarm and ticket-dispatch platform built on Zabbix. As the founder said earlier, Zabbix is a tool for processing metrics. When we replaced BMC in 2020, BMC's alarm event processing module, which handles alarms and ticket dispatch, stayed in use the whole time because Zabbix itself does not have this piece; we only replaced it completely in the first half of this year.

We replaced it last because ticket dispatching in the Taibao system is quite complicated. The diagram shows less than one tenth of our convergence and dispatching rules; it is just a captured piece of the configuration, a flow chart that is not even complete. Next to it is the overall effect: with our intelligent alarm convergence platform the convergence rate can reach about 40%, and invalid alarms are greatly reduced. The feedback on this has been very good. The entire platform is self-developed, which has the advantage of fitting our internal requirements, which are quite personalized and complicated.
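The convergence rules themselves are internal and far more elaborate; as a purely hypothetical sketch of the basic idea, the snippet below collapses repeated alerts that share the same host and trigger within a time window, which is one common form of convergence.

```python
from dataclasses import dataclass

# Illustrative alert structure; field names are assumptions, not CPIC's schema.
@dataclass
class Alert:
    host: str
    trigger: str
    timestamp: float  # seconds since epoch

WINDOW_SECONDS = 300  # collapse repeats of the same alert within 5 minutes

def converge(alerts):
    """Keep the first alert per (host, trigger) in each time window."""
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.trigger)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous >= WINDOW_SECONDS:
            kept.append(alert)
        last_seen[key] = alert.timestamp
    return kept

raw = [
    Alert("app-server-01", "High CPU utilisation", 0),
    Alert("app-server-01", "High CPU utilisation", 60),   # suppressed
    Alert("app-server-01", "High CPU utilisation", 400),  # new window, kept
]
print(f"convergence rate: {1 - len(converge(raw)) / len(raw):.0%}")
```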


The next piece is the early warning mechanism, which we still implement with rule-based algorithms. We tried introducing AI-based intelligent algorithms back in 2018, but the results were not very good; over the past few years expectations for AI in operations scenarios have also returned to rationality. We finally deepened and refined this work in the first half of this year. The picture above, I think, describes monitoring well.

The picture illustrates Heinrich's Law. Our traditional monitoring platform basically covers the 1 and the 29 at the top of this pyramid: at minimum an alarm is raised. We certainly do not alarm on the 300 and the 1,000 below, because the volume is too large and very few of them actually need to be dealt with. So we combine this with early warning, and later with a diagnostic function, so that we also do some processing on the 300 and the 1,000 and intervene in advance. Our biggest challenge with conventional monitoring is that we can find problems, but the time left for operations to deal with them is not enough, so there is no way to really handle them before the business is affected. Once this is in place, these items can gradually be handed to our operations engineers to deal with.


This is the application topology we rebuilt internally this year. The CMDB went live around 2015-2016, but we only started planning for this piece of data at the end of last year. When reviewing the data governance plan we found that the topology in it was basically unusable, yet for application failure analysis this topology is crucial. In my view, at the current stage of operations tooling, whether a CMDB has been built well, and whether it can deliver its real value, depends largely on the completeness of this picture; beyond the form-style attribute data it registers, I think these relationships are what matter most. In any case, they are an important basis for our later early warning and diagnosis as a whole.


That work actually lays the groundwork for this next piece. At the top are the business golden indicators, the three circles, which are one input of our basic early warning, corresponding to the top of the Heinrich pyramid I mentioned earlier. In the lower two layers of the picture, when we detect something we do not handle the alarm immediately; it first goes through the process engine against a configured troubleshooting flow, although this flow is entirely experience-based and configured manually. The horizontal piece below that is a full-link tracing system that collects the relationships between applications. We combine these pieces, take a look, and only raise the real alarm if there genuinely is a problem; in that sense we have already intervened in advance. The process piece is the native job-process engine of our automation platform, and using it for monitoring will, I personally think, be a little challenging for performance: it works while the scale is small, but it will get harder and harder, so later we will introduce some AI-related or more advanced approaches to do this.
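To make this "verify before alarming" flow concrete, here is a hypothetical sketch in which an incoming warning is run through a couple of manually configured checks (stand-ins for the process-engine and full-link steps) and escalated only if they confirm real impact; none of these names come from CPIC's actual platform.

```python
# Hypothetical sketch: run a warning through configured checks before alarming.

def check_downstream_impact(warning):
    """Placeholder for walking the application topology for affected services."""
    return warning.get("impacted_services", 0) > 0

def check_recent_change(warning):
    """Placeholder for asking the change-management system about recent releases."""
    return warning.get("recent_change", False)

VERIFICATION_STEPS = [check_downstream_impact, check_recent_change]

def should_escalate(warning):
    # Escalate only if at least one verification step confirms real impact.
    return any(step(warning) for step in VERIFICATION_STEPS)

warning = {"host": "app-server-01", "metric": "order_latency", "impacted_services": 2}
if should_escalate(warning):
    print("raise the real alarm and dispatch a ticket")
else:
    print("keep observing; no alarm issued")
```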

04 An intelligent operations system that detects a fault as it occurs and handles it as soon as it is detected

This picture shows the overall intelligent operations monitoring system we plan for the future. It goes from data on the far left to observation and analysis; at the analysis layer, the various operations-related scenarios are encapsulated. There are several things to note here. One is that we provide a BI-like analysis platform to the professional teams to reduce development costs, for example to build intelligent thresholds, so that all of our operations engineers can be brought into the ecosystem built around this tool. This piece sits at the analysis level and is really a model of co-construction and co-creation.


Finally, data-driven: the goal is for monitoring to drive our automation platform so that it can handle relatively complicated fault recovery scenarios. We have already standardized the simplest ones, such as restarts and file cleanup, all of which are linked to monitoring and handled as soon as they are found. More advanced cases, where handling is harder and the judgment factors are more complicated, are not yet triggered automatically; that depends on the accuracy of the earlier troubleshooting. As troubleshooting becomes more and more accurate, we can gradually add automatic recovery actions.
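A minimal, hypothetical sketch of this monitoring-to-automation linkage: only the simplest pre-approved remediations (temp-file cleanup, a service restart) run automatically, and anything more complex is escalated to a person. The alert fields, the directory, and the restart command are illustrative assumptions.

```python
import subprocess
from pathlib import Path

# Only the simplest, pre-approved remediations run automatically; anything
# harder is escalated. Alert fields and commands below are illustrative.
TEMP_DIR = Path("/var/log/app/tmp")

def clean_temp_files(alert):
    """Disk-space alarm: remove temporary files under an agreed directory."""
    for path in TEMP_DIR.glob("*.tmp"):
        path.unlink(missing_ok=True)
    return "temp files cleaned"

def restart_service(alert):
    """Process-down alarm: restart the service named in the alert."""
    subprocess.run(["systemctl", "restart", alert["service"]], check=True)
    return "service restarted"

AUTO_REMEDIATIONS = {
    "disk_space_low": clean_temp_files,
    "process_down": restart_service,
}

def handle_alert(alert):
    action = AUTO_REMEDIATIONS.get(alert["type"])
    if action is None:
        return "escalate to an operations engineer"  # too complex to automate yet
    return action(alert)

print(handle_alert({"type": "disk_space_low", "host": "app-server-01"}))
```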


The team I am in charge of is doing some forward-looking preparatory work. The first is fault analysis under the new Kubernetes container system, which differs in some ways from the point-to-point, IP-level fault analysis we do today.

The second is integrating digital completion data with monitoring data to realize online inspection with a strong sense of presence and a visualized monitoring system.

The third is to keep introducing AI algorithms to support the prediction and troubleshooting just mentioned. The last is chaos engineering: we will analyze certain scenarios and reproduce them in environments that can be simulated, and what this work depends on most is our monitoring data.


Finally, a word about the vision. I personally enjoy photography, and both of these pictures were taken by me. I joined Taibao in 2011. The first picture is from before we adopted Zabbix, when I would be pulled away to work halfway through a meal. The second is from after we adopted this tool platform and kept improving it, and the mood is completely different. I myself have also gone from doing the front-line work to watching others do the work, and much of the credit goes to Zabbix: it has played a crucial role both for me personally and in building our company's systems operations platform.



Origin blog.csdn.net/Zabbix_China/article/details/129294692