SREWorks v1.5 Released | Open-Source Log Clustering on the Real-Time Job Platform

After four iterations from v1.0 to v1.4, the SREWorks core platform has reached a high level of stability and maturity. In v1.5, the SREWorks development team iterated further on the digital-intelligence capabilities built on top of that core. Throughout this work we also kept up a high frequency of communication with SREWorks users, and found that most of them care about the digital-intelligence capabilities built on monitoring data. Digging deeper into these scenarios, we found that users generally run into the following problems:

  1. As the data volume in self-developed monitoring systems grew, their reliability dropped.
  2. Introducing various kinds of unstructured data such as logs sharply increased engineering complexity and made real-time processing much more challenging.
  3. Simple expressions often cannot meet diversified business monitoring needs.

As a result, many users chose to migrate from self-developed monitoring systems to the stream-computing engine Flink, but the learning curve and the operation and maintenance of Flink jobs themselves became a new obstacle. After several rounds of analysis and research, the SREWorks development team decided to tackle these problems in two phases:

  1. Lower the barrier to using Flink jobs, so that SREs can quickly turn operations requirements into computing power and truly gain the ability to work with their data.
  2. Use SREWorks engineering capabilities to build open-source Flink operation and maintenance products, further reducing the difficulty of operating Flink itself.

In v1.5 we open source phase 1 first. On top of the real-time job platform, we also ship the long-requested log clustering capability as a best practice of this kind of digital-intelligence capability: using Flink ML to greatly improve the real-time aggregation efficiency of massive logs. As for phase 2, an introduction to Flink Advisor, an intelligent diagnostic tool for Flink, will be published soon and is not covered in this article. Let's start with the phase-1 open-source product: the real-time job platform.

Real-Time Job Platform

When SREWorks was first open sourced, it bundled the community edition of Ververica Platform to manage Flink jobs, and for a while answering questions about the community edition of VVP took up most of our communication time with users. After distilling and polishing these requirements, we integrated the real-time processing link into the job platform. Jobs on the platform are divided into scheduled jobs and real-time jobs:

  • Scheduled jobs provide minute-level job scheduling, suitable for batch-processing scenarios with small data volumes and loose latency requirements.
  • Real-time jobs provide real-time job management based on Flink and the community edition of Ververica Platform.


After collecting a lot of user feedback, we decided to bring SRE-friendly step-by-step orchestration into real-time jobs to further lower the barrier for SREs. The resulting feature splits a Flink job into three parts for easier management (a minimal PyFlink sketch follows the list):

  • Input sources: correspond to Flink sources; a job can have multiple input sources.
  • Python processing: corresponds to Flink's transformation logic, currently based on PyFlink. You can write Python scripts directly, or split the logic into multiple Python processing steps as the business requires.
  • Outputs: correspond to Flink sinks; a job can have multiple outputs.
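
To make this split concrete, here is a minimal PyFlink sketch of the same source → Python processing → sink structure. The table names, the normalize UDF, and the datagen/print connectors are illustrative assumptions to keep the sketch self-contained, not code generated by the platform:

from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Input source: a source table (datagen stands in for a real connector).
t_env.execute_sql("""
    CREATE TABLE raw_logs (line STRING)
    WITH ('connector' = 'datagen', 'rows-per-second' = '1')
""")

# Python processing: business logic written directly as a Python UDF.
@udf(result_type=DataTypes.STRING())
def normalize(line):
    return line.strip().lower()

t_env.create_temporary_function("normalize", normalize)

# Output: a sink table (print stands in for a real connector).
t_env.execute_sql("""
    CREATE TABLE processed_logs (line STRING)
    WITH ('connector' = 'print')
""")

t_env.execute_sql(
    "INSERT INTO processed_logs SELECT normalize(line) FROM raw_logs"
).wait()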

Input Sources & Outputs

For inputs and outputs, we read the connectors already registered in Ververica Platform and present them for users to choose from, complete with drop-down hints when configuring parameters. This largely prevents the missing fields and misspelled options that creep in when users write CREATE TABLE statements by hand.

[figure: connector selection and parameter drop-downs]
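
To see why the drop-downs help, consider what a hand-written Kafka source definition looks like in Flink SQL: every option below must be present and spelled exactly. The topic, address, and field values here are placeholders, and t_env is a PyFlink TableEnvironment as in the sketch above:

t_env.execute_sql("""
    CREATE TABLE log_input (
        `message` STRING,
        `ts` TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sreworks-logs',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'log-clustering',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")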

Runtime Environment

Anyone who uses Python regularly knows that managing the Python runtime environment is a headache: managing it with Docker images makes the packaging process too lengthy, while managing it with requirements files often runs into packages that fail to build. On the real-time job platform we therefore took a middle road and manage it with Python virtual environments.

[figure: Python virtual environment management]

We also extended the environment concept into a combination: Flink's container image, PyFlink's runtime JAR, and similar artifacts all count as settings of the environment. Because the environment converges all variable resources, the complexity of SRE maintenance drops sharply, and the old problem of version incompatibilities among runtime resources disappears: every job in the same environment uses the same combination.

v1.5 currently ships two environments, flink-ml and default; self-service environment management will arrive in the next version.
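
As a rough illustration of what one such combination converges (field names here are assumptions for illustration, not the platform's actual schema):

# Hypothetical sketch of an environment as one named combination of all
# variable runtime resources; every job bound to it uses exactly this set.
FLINK_ML_ENV = {
    "name": "flink-ml",
    "flink_image": "flink:1.15",                   # Flink container image
    "pyflink_runtime_jar": "pyflink-runtime.jar",  # PyFlink runtime JAR
    "python_venv": "flink-ml-venv.tar.gz",         # packaged Python virtual env
}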

Flink Job Operation and Maintenance

The real-time job platform adds abstractions that simplify job submission, but we are well aware of the complexity of operating Flink jobs, so we did not wrap things any further: the Flink Dashboard itself serves as the runtime observation surface, which lets engineers familiar with Flink get started and troubleshoot quickly. The figure below shows the Flink Dashboard page for a job started from the real-time job platform:

[figure: Flink Dashboard of a job started from the real-time job platform]

Log Clustering

This v1.5 release also open sources the log clustering algorithm on the real-time job platform. For the algorithm's principles, see "Intelligent O&M Algorithm Services and Applications Based on Flink ML"; this article focuses on the open-source engineering practice.

The log clustering algorithm code is located in the directory https://github.com/alibaba/SREWorks/tree/master/saas/aiops/api/log-clustering :

├── db-init.py
├── log-clustering-job
│   ├── pythons
│   │   └── log_cluster_pattern.py
│   ├── settings.json
│   ├── sinks
│   │   └── pattern_output.json
│   ├── sources
│   │   └── log_input.json
│   └── template.py
└── ...

The directory consists of two main parts:

  • db-init.py: database initialization for feature engineering. It uses a small number of typical log samples to initialize the log keyword list and log template features.
  • log-clustering-job/*: the log clustering algorithm job. It is imported into the job platform by default in v1.5; packaging the directory as a zip and importing it manually achieves the same result.

Next, we walk through a complete log clustering exercise based on this open-source project. The input is a Kafka log stream (SREWorks' built-in Kafka), and the output is a feature library in MySQL.

[figure: Kafka log stream → log clustering job → MySQL feature library]

STEP 1: Feature Engineering Initialization

In this walkthrough, we take the logs of SREWorks' application engine (AppManager) as the example:

First, use the label name=sreworks-appmanager-server to find the AppManager Pod's name; the same label will be used later for log collection.

$ kubectl get pods -l name=sreworks-appmanager-server -n sreworks
NAME                                         READY   STATUS    RESTARTS   AGE
sreworks-appmanager-server-c9b6c7776-m98wn   1/1     Running   0          5h39m

Then extract a small amount of the Pod's logs as an initialization sample and save it to a file named example.log:

kubectl logs --tail=100 sreworks-appmanager-server-c9b6c7776-m98wn -n sreworks > example.log

The logs in example.log look roughly like this:

[2023-05-26 21:46:02 525] DEBUG [http-nio-7001-exec-6][o.s.web.servlet.DispatcherServlet:119]- GET "/realtime/app-instances?stageIdList=prod&appId=&clusterId=1id&optionKey=source&optionValue=app", parameters={masked}
[2023-05-26 21:46:02 526] DEBUG [http-nio-7001-exec-6][o.s.w.s.m.m.a.RequestMappingHandlerMapping:522]- Mapped to com.alibaba.tesla.appmanager.server.controller.RtAppInstanceController#list(RtAppInstanceQueryReq, HttpServletRequest, OAuth2Authentication)
[2023-05-26 21:46:02 527] DEBUG [http-nio-7001-exec-6][o.s.w.s.m.m.a.RequestResponseBodyMethodProcessor:268]- Using 'application/json', given [*/*] and supported [application/json, application/*+json]
[2023-05-26 21:46:02 527] DEBUG [http-nio-7001-exec-6][o.s.w.s.m.m.a.RequestResponseBodyMethodProcessor:119]- Writing [TeslaBaseResult(code=200, message=SUCCESS, requestId=null, timestamp=1685137562527, data=Pagination( (truncated)...]
[2023-05-26 21:46:02 527] DEBUG [http-nio-7001-exec-6][o.s.web.servlet.DispatcherServlet:1131]- Completed 200 OK.
..

Use db-init.py to initialize the feature-engineering database. This adds a table to the database and processes the logs in example.log into feature rows stored in that table:

python3 ./db-init.py example.log --host *** --user *** --password *** --database *** --table ***

Keep these database connection parameters handy; they are needed in the next step.
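
For intuition about what this initialization produces, the sketch below illustrates the general idea of template extraction: variable parts of a log line (timestamps, numbers) are masked so that structurally identical lines collapse onto one template. This is a deliberate simplification, not the actual logic of db-init.py:

import re

# Mask variable fragments so similar log lines share one template.
# A deliberately simplified illustration, not the project's algorithm.
def to_template(line):
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \d+", "<TS>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

print(to_template(
    "[2023-05-26 21:46:02 527] DEBUG [http-nio-7001-exec-6] Completed 200 OK."
))
# -> [<TS>] DEBUG [http-nio-<NUM>-exec-<NUM>] Completed <NUM> OK.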

STEP 2: Configure Job Parameters and Start

Open the real-time job platform in SREWorks, click the [Run Parameters] button for the job "Log Clustering Pattern Extraction", and fill the database connection parameters from STEP 1 into the startup parameters:

[figure: run parameters configuration]

Once the parameters are filled in, click to start the job. After it starts, clicking the [Running] status jumps straight to the Flink Dashboard, where we can see that the entire stream-processing pipeline is ready but has no log input yet.

[figure: Flink Dashboard showing the pipeline ready with no input]

STEP 3: Log Collection and Clustering

iLogtail is Alibaba Cloud's open-source observability agent, widely used in Alibaba Cloud's collection scenarios. It also adapts very well to cloud-native environments: deployed as a DaemonSet on every Node, it collects logs from any Pod that carries the configured label.

We can therefore install iLogtail into the cluster easily through the operations marketplace, configuring the collection label name=sreworks-appmanager-server at install time so that it collects the application engine (AppManager) logs.

[figure: installing iLogtail from the operations marketplace]

After log collection starts, the Flink Dashboard shows the previously idle pipeline suddenly becoming busy: like a factory assembly line, each compute unit continuously receives, processes, and emits data.

[figure: busy stream-processing pipeline in the Flink Dashboard]

Looking at the pattern table in MySQL, we can see that the processed log features have landed in the table we configured in STEP 2.

[figure: log feature rows in the MySQL pattern table]

A few key points about the feature table are worth noting:

  • The feature table is the convergence point of log features. Its row count grows rapidly at first, then stabilizes once no new features appear.
  • The pattern field in the feature table summarizes a single log line, while top_pattern summarizes the cluster center after clustering. Using top_pattern, we can easily count how many kinds of logs there are in total, and see which logs fall under each kind.

As shown in the figure below, debug logs with very similar but not identical text are gathered under the same top_pattern.

[figure: similar debug logs clustered under the same top_pattern]
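
For example, counting how many log types exist and how large each cluster is amounts to grouping on top_pattern. A minimal sketch, assuming the pymysql driver and the connection parameters and pattern table name supplied to db-init.py in STEP 1 (the values below are placeholders):

import pymysql

# Group the feature table by cluster center to count log "types".
conn = pymysql.connect(host="***", user="***", password="***", database="***")
with conn.cursor() as cur:
    cur.execute(
        "SELECT top_pattern, COUNT(*) AS cnt "
        "FROM pattern GROUP BY top_pattern ORDER BY cnt DESC"
    )
    for top_pattern, cnt in cur.fetchall():
        print(cnt, top_pattern)
conn.close()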

Practical Application of Log Clustering

Many digital-intelligence practices can be built around the log clustering algorithm. Combining the cases disclosed in "Intelligent O&M Algorithm Services and Applications Based on Flink ML" with the engineering practice above, we can look at the complete link:

[figure: complete log clustering application link]

  • The feature (pattern) table we examined in STEP 3 can evolve further into a log knowledge base, which SREs annotate based on their operations experience.
  • The logs accumulated in the log knowledge base can feed a Q&A bot as its corpus, helping resolve user problems quickly and reducing ticket volume.

We look forward to your feedback after integrating and using the log clustering algorithm. The SREWorks team will continue to polish its digital-intelligence operations algorithms based on internal production results and your feedback.

Enhanced Enterprise Application Development and Deployment

In v1.5 we also enhanced the platform's application development capabilities, including the following:

  • Enterprise applications gain multi-branch development, matching enterprises' multi-version iteration needs.
  • Enterprise application instance deployments gain full OAM visualization.


For enterprise applications, we regularly act on the voices of front-line SREWorks users, polishing capabilities proven in internal use into products for all SREWorks users. We will keep working this way, hoping that the cloud-native application development model and the digital-intelligence operations system help enterprises focus on business value and iterate quickly on functional products.

How to upgrade from current version to v1.5

  • The upgrade includes the platform base; pages may be inaccessible for 5-10 minutes.
  • Cloud-native applications developed by users are not affected (no restarts), but traffic routed from the SREWorks gateway to those applications will be briefly interrupted.

git clone https://github.com/alibaba/sreworks.git -b v1.5 sreworks

cd sreworks
./sbin/upgrade-cluster.sh --kubeconfig="****"

If you run into problems during use, you are welcome to open Issues or Pull Requests on GitHub.

SREWorks open source address: https://github.com/alibaba/sreworks

You are also welcome to join the DingTalk group (group number: 35853026) to share and communicate~
