Alibaba officially open-sources SREWorks, its cloud-native big data operations platform

Introduction: SREWorks, Alibaba's cloud-native big data operations platform, distills the SRE engineering practices that the team has refined on internal business workloads over nearly 10 years. It aims to help more practitioners apply the "digital intelligence" (data-driven, intelligent) approach to run operations efficiently.

Author | Sheng Bai
Source | Alibaba Technology Public Account

As the industry evolves, big data and AI are steadily moving toward cloud native. The complex business scenarios behind them, together with the mix of open-source and in-house technologies across different technical directions, confront product operations with high technical complexity, large scale, and a wide variety of scenarios.

SREWorks, Alibaba's cloud-native big data operations platform, distills the SRE engineering practices that the team has refined on internal business workloads over nearly 10 years. Today it is officially open-sourced. Built on a "data-driven, intelligent" operations philosophy, it aims to help more practitioners in the operations industry apply "digital intelligence" to run operations efficiently.

1. What is SREWorks?

In 2003, Google introduced a role called SRE (Site Reliability Engineer), which combines the work of a software engineer and a system administrator. The role places great weight on the development skills of operations staff: routine operational chores are capped at 50% of an engineer's time, and the other 50% goes to developing automation tools that reduce the need for manual effort.

SREWorks is the Alibaba Cloud big data SRE team's engineering implementation of the SRE concept. It is an application-centric, one-stop, "cloud-native" and "digitally intelligent" operations SaaS management suite. It provides two core capabilities, enterprise application and resource management, and operations development, to help enterprises deliver and operate cloud-native applications and resources.

The Alibaba Cloud big data SRE team works close to big data and AI: it knows these technologies well and has big data and AI computing resources available at any time. The DataOps (data-driven operations) concept used in the industry was first proposed by the team. SREWorks includes a set of end-to-end, closed-loop DataOps engineering practices, covering a standard operations data warehouse, a data operations platform, and an operations center.

With the arrival of the cloud-native era, the Alibaba Cloud big data SRE team has open-sourced SREWorks, hoping to give operations engineers a platform that works out of the box.

2. What are the advantages of SREWorks?

Going back to what the operations field actually needs: however upper-layer products and business models change, operations is essentially about meeting requirements for "quality, cost, efficiency, and safety". SREWorks supports these requirements through an operations SaaS application interface covering delivery, monitoring, management, control, operation, and service, and drives those SaaS capabilities with "digital intelligence" as the core.

1 A systematic, layered architecture for the operations platform

Viewed through the four dimensions of "quality, cost, efficiency, and safety", the essential work of operations goes beyond building platforms, specifications, and standards. It also means using automation to improve efficiency, using data to drive testing, development, and operations, and using intelligent methods to detect or predict risks in advance. All of these can be regarded as methodology. What SREWorks sets out to address is how to move quickly from theory to a systematic, engineered, productized set of capabilities that meets the needs of those four dimensions.

Alibaba Cloud's big data SRE team built the SREWorks product system in layers, borrowing the classic three-layer SPI (SaaS/PaaS/IaaS) division. The system consists of three parts: an operations SaaS scenario layer, an operations PaaS layer, and an operations IaaS access layer.

SREWorks also builds in operations specifications and standardization, and uses a product-oriented methodology to carry its automated, data-driven, and intelligent core. Operations is involved to some degree at every stage from code to an online business service. Therefore, following the application life cycle, the SaaS scenario layer is divided into six areas: "delivery, monitoring, management, control, operation, and service". As shown in the figure below, each area has representative core functions.

In SREWorks, a unified application abstraction describes the business system. Once developers deliver a finished application online, the online application instances are monitored, managed, and controlled throughout their life cycle. The operations data capabilities built into SREWorks add value through operation and service functions, and offer convenient views and management capabilities to those who need them.

The six scenarios of "delivery, monitoring, management, control, operation, and service" have detailed definitions and boundary descriptions in the SREWorks product manual.

2 A complete data-driven operations practice

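A data-driven operations system collects the operations data of every system, genuinely connects it, and mines its value in depth to support operations decisions with data. It also defines a data-driven operations business model, establishes a standardized operations data warehouse based on that model, and builds a data operations platform. The platform standardizes the collection, storage, computation, and analysis of operations data and offers a set of data services for the upper-layer operations scenarios to use.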

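With quantified operations data, operations work can be described and measured from more angles. This makes it possible to establish a working model that can be optimized sustainably over the long term and to deliver real operations value.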
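As a rough illustration of what "standardizing collection and then computing a quantified metric" might look like, the sketch below normalizes incident records from two source systems into one schema and computes mean time to recovery per application. The record fields, source formats, and function names are all hypothetical and are not taken from SREWorks.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class IncidentRecord:
    """Common schema (hypothetical) for incidents from any source system."""
    app: str
    started_at: float   # epoch seconds
    resolved_at: float  # epoch seconds


def from_alert_system(raw: dict) -> IncidentRecord:
    # Hypothetical alerting payload that uses millisecond timestamps.
    return IncidentRecord(raw["service"], raw["fire_ms"] / 1000, raw["clear_ms"] / 1000)


def from_ticket_system(raw: dict) -> IncidentRecord:
    # Hypothetical ticket payload that already uses second timestamps.
    return IncidentRecord(raw["app_name"], raw["opened"], raw["closed"])


def mean_time_to_recovery(records: list[IncidentRecord]) -> dict[str, float]:
    """Average recovery time in seconds, grouped by application."""
    durations = defaultdict(list)
    for record in records:
        durations[record.app].append(record.resolved_at - record.started_at)
    return {app: mean(values) for app, values in durations.items()}


records = [
    from_alert_system({"service": "search", "fire_ms": 1_000_000, "clear_ms": 1_600_000}),
    from_ticket_system({"app_name": "search", "opened": 5_000, "closed": 5_200}),
    from_ticket_system({"app_name": "billing", "opened": 100, "closed": 400}),
]
mttr = mean_time_to_recovery(records)
```

The point of the common schema is that the metric code never needs to know which system a record came from; a real operations data warehouse would apply the same idea at the level of tables and collection jobs.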

3 A service-oriented AIOps intelligent operations platform

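In the view of the Alibaba Cloud big data SRE team, the emergence of AIOps has not changed the form that operations takes: the interface is still "delivery, monitoring, management, control, operation, and service". AIOps simply uses AI capabilities, on top of extensive data-driven operations work, to explore and uncover intelligent operations scenarios. For that reason, from the very start of building its AIOps engineering practice, the team has insisted on a closed loop of "perception, decision, execution", similar to the idea behind autonomous driving.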

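SREWorks combines tailor-made algorithms with operations scenarios so that it can predict problems in advance and perform correlation analysis, strengthening risk prevention and fault demarcation and localization, and delivering operations value that traditional methods cannot. Specifically, each intelligent operations service is packaged as a "detector" for perception, an "analyzer" for decision, or a "strategy executor" for execution. Services such as health management and change management can call these components to enhance existing operations scenarios and solve problems that ordinary methods cannot.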
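A minimal sketch of a "perception, decision, execution" loop is shown below. It is an illustration under stated assumptions, not SREWorks code: the function names, the z-score rule, the metric names, and the action names are all hypothetical, and a simple statistical threshold stands in for whatever tailored algorithms the platform actually uses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class Anomaly:
    metric: str
    value: float
    score: float  # standard deviations above the baseline mean


def detect(metric: str, history: list[float], latest: float,
           threshold: float = 3.0) -> Optional[Anomaly]:
    """Perception: flag the latest sample if it sits far above the baseline."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return None  # simplification: a flat baseline is never flagged
    score = (latest - mu) / sigma
    return Anomaly(metric, latest, score) if score > threshold else None


def analyze(anomaly: Anomaly) -> str:
    """Decision: map an anomaly to a named action (illustrative rules only)."""
    if anomaly.metric == "cpu_util":
        return "scale_out"
    return "notify_oncall"


def execute(action: str, log: list[str]) -> None:
    """Execution: a real platform would call a change system; here we only record."""
    log.append(action)


def run_loop(metric: str, history: list[float], latest: float,
             log: list[str]) -> Optional[Anomaly]:
    """Run one pass of the closed loop for a single metric sample."""
    anomaly = detect(metric, history, latest)
    if anomaly is not None:
        execute(analyze(anomaly), log)
    return anomaly


actions: list[str] = []
run_loop("cpu_util", [40, 42, 41, 39, 38], 95, actions)   # far above baseline
run_loop("cpu_util", [40, 42, 41, 39, 38], 41, actions)   # within normal range
run_loop("disk_latency", [5, 5, 6, 4, 5], 30, actions)    # far above baseline
```

Keeping the three stages as separate callables mirrors the idea of packaging each capability as its own service, so that a caller such as a health-management workflow could swap in a different detector without touching the decision or execution steps.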

4 An operations development experience built on a middle platform, low code, and cloud native

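The SREWorks suite is itself a cloud-native application, built on the operations middle-platform idea. The middle platform hosts a large number of PaaS operations service capabilities, while the front end offers SaaS operations applications around the six scenarios of "delivery, monitoring, management, control, operation, and service".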

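Most pages belong to enterprise back-end console systems that do not need flashy interaction design, so front-end work in operations development has always struggled to keep up with front-end trends. In response, SREWorks designed a novel front-end development model with a serverless-style experience.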

3. Why open source?

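The Alibaba Cloud big data SRE team has highlighted its "DataOps, AIOps" capabilities in many technical talks, but always at a purely theoretical level. In the SRE field specifically, how is this theory actually implemented in engineering practice? How does the team's understanding of the three layers of operations (requirements, interface, and core) translate into a working system?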

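To tell the story of this data-driven, intelligent core clearly, the Alibaba Cloud big data SRE team is open-sourcing SREWorks, a cloud-native operations platform with a low barrier to entry and high efficiency.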

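They firmly believe that operations teams need to embrace cloud native all the more; only then can operations find its place in the cloud-native wave.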

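The team also hopes that open-sourcing SREWorks will let more practitioners use big data and AI capabilities to do operations well and build an operations platform core based on "data + intelligence".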

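According to the team, SREWorks is backed by the "big data & AI" products of the Alibaba Cloud computing platform, such as MaxCompute, Flink, DataWorks, Hologres, and Elasticsearch. The open-source edition likewise uses the open-source counterparts of these products, such as open-source Flink and Elasticsearch.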

4. Future plans

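The SREWorks platform currently goes through one development iteration per month. Going forward, release managers will maintain the project centrally, merging new features and bug fixes, so that the latest cloud-native operations capabilities keep flowing into later releases.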

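SREWorks currently includes an engineering implementation of the OAM (Open Application Model) specification, which can be regarded as its core engine. Around this engine, the SREWorks team has built a series of operations middle-platform services covering automation, data-driven, and intelligent capabilities, and will keep iterating as the community's OAM specification evolves.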

5. Final words

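Open-sourcing SREWorks today is only a small first step, and we very much look forward to feedback from developers. SREWorks is also designed with plugin-based extensibility, and you are welcome to use it to build your own operations platform.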

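Finally, if you are interested in SRE, DataOps, AIOps, or cloud native, you are welcome to join us in building the project; it would be a great honor for us. Let's exchange ideas and build the most distinctive cloud-native SRE operations platform together!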

This article is original content from Alibaba Cloud and may not be reproduced without permission.

Origin juejin.im/post/7078870181941870599