Sharing Record | Alibaba's DevOps Culture


[The following is an abridged record of the talk]
The three stages of DevOps development
First, let's briefly look at what DevOps is and where the word comes from. I divide the history of DevOps into three stages: the birth period, the definition period, and the adoption period.

The founding figure of DevOps is Patrick Debois, an independent IT consultant from Belgium. In 2007 he was responsible for testing and verification on a large project. On one side he worked with the development team on testing the code, and on the other he worked with the operations team on releases. He found that the development and operations roles on the project thought in very different ways: one side wanted to move fast, the other wanted to stay stable, and being caught between the two wore him down.
At the Agile 2008 conference, Patrick met Andrew Shafer. The two hit it off and began to think about how to change the situation in which Dev and Ops could not get along.
In October 2009, Patrick used Twitter to rally development engineers and operations engineers to the first "DevOpsDays" conference in Ghent, Belgium, which opened a broad discussion of collaboration between Dev and Ops. Later, to make the name easier to spread, "DevOpsDays" was shortened to "DevOps".
After 2009, DevOps started to spread around the world. In 2010, The Agile Admin blog published the article "What is DevOps", which set out a definition of DevOps covering a set of values, principles, methods, practices, and corresponding tools.
Also in 2010, Jez Humble, co-author of "Continuous Delivery", attended the second DevOpsDays conference and gave a talk on continuous delivery. This was an important milestone. The book "Continuous Delivery" can be regarded as a set of best practices for DevOps, and almost everyone in China who works on R&D efficiency owns a copy. The book also accelerated the industry's understanding and adoption of DevOps.
But I think the large-scale adoption of DevOps across the industry owes a great deal to container technology. Docker played a decisive role: by writing a Dockerfile, developers could for the first time easily define the software runtime environment and deliver it through a standardized CI/CD process. Operating and maintaining that many containers was still troublesome, however, so Google open-sourced Kubernetes ("k8s") in 2014. The CNCF (Cloud Native Computing Foundation) was established in 2015 with Kubernetes at its core and has since built a huge ecosystem. With Docker and Kubernetes as technical support, the merging of the development and operations roles accelerated, and DevOps was no longer a castle in the air.
How far am I from DevOps?
After reviewing the history, let's use three small questions to see whether our own team is already "DevOps".
1. I can deploy to production whenever I finish writing code, without help from anyone else.
2. I have plenty of monitoring and operations tools and can easily handle all kinds of online problems and failures.
3. I am directly responsible for the experience of online users. Whether the cause is a code defect or an operations failure, it is my responsibility.
These three questions touch on the three most important aspects of DevOps: practices, tools, and culture. None of the three can be missing.
What is a good DevOps team

What does a high-performing R&D team look like? We can refer to the table in the 2018 State of DevOps Report: a high-performing team deploys between once per hour and once per day, has a lead time for changes between one day and one week, restores service in less than one day, and has a change failure rate below 15%. These numbers are actually not that demanding. Taking ourselves as an example, the Alibaba R&D platform team can easily release to production several times a day, with an availability of 99.95% and a change failure rate below 5%.
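The measures quoted above can be computed from a simple log of deployments. The sketch below is only an illustration: the Deployment record layout, the field names, and the sample data are assumptions made for this example and do not come from the report or from Alibaba's platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    """One production deployment (hypothetical record layout)."""
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production
    failed: bool            # whether the change caused a failure


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def deployments_per_day(deployments, days):
    """Average number of deployments per day over a window of `days` days."""
    return len(deployments) / days


def median_lead_time(deployments):
    """Median time from commit to production."""
    if not deployments:
        raise ValueError("no deployments to measure")
    lead_times = sorted(d.deployed_at - d.committed_at for d in deployments)
    mid = len(lead_times) // 2
    if len(lead_times) % 2:
        return lead_times[mid]
    return (lead_times[mid - 1] + lead_times[mid]) / 2


# Made-up sample: four deployments over two days, one of which failed.
start = datetime(2020, 4, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=2), failed=False),
    Deployment(start, start + timedelta(hours=4), failed=False),
    Deployment(start, start + timedelta(hours=6), failed=True),
    Deployment(start, start + timedelta(hours=8), failed=False),
]

print(change_failure_rate(sample))          # share of failed changes
print(deployments_per_day(sample, days=2))  # deployment frequency
print(median_lead_time(sample))             # commit-to-production time
```

Service recovery time can be tracked in the same way by recording when each failure started and when it was resolved.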
Such results are commonplace at Alibaba. How did Alibaba get here step by step, and how can other companies replicate the experience? Let's move on to the next part: how Alibaba put its DevOps culture into practice.
Alibaba's DevOps development stages
The development of DevOps is inseparable from changes in technology. In 2008, Taobao began its service-oriented transformation and created middleware that later became well known in the industry, such as Dubbo, RocketMQ, and TDDL (Taobao Distributed Data Layer). At the same time, the giant Taobao application was split into a series of applications such as orders, membership, and discounts, and hundreds, even thousands, of front-end applications grew up around the individual business scenarios.
You can imagine what development was like at the time: there was a fixed release window every week, and as it approached, hundreds of engineers were committing code, fixing bugs, and submitting builds for testing. On the evening of release day, applications were released one by one in order. If a major bug appeared after a release, it was either hotfixed on the spot or rolled back, and the release was declared a failure. Everyone was exhausted by release day. The first generation of automated release tools returned the ability to release to developers and, at the same time, forced developers to decouple the dependencies between applications so that each could be released independently. The speed of business delivery improved qualitatively. Later, this approach was given a name: "microservices".
Less than two years later, as the number of R&D staff grew, all kinds of complicated development conventions, complicated scripts, and pitfalls dug by one person and stepped in by another emerged and made engineers miserable. "All of this has to be standardized." In 2013, we built a unified build and deployment platform that standardized the whole path from code change to online release across Alibaba Group and put it under strict control.

In 2016, we ran into a new problem. At that time, online operations had to be carried out by our operations colleagues, who naturally did not want to make changes. That is understandable: a service is most stable when nothing changes. But this restricted developers' ability to innovate to some extent, and the sharp division of responsibilities also kept developers from paying attention to how their applications behaved online. The result was an obvious bottleneck in the R&D process, and that is the fundamental reason Alibaba set out to do DevOps. With the arrival of the containerization wave, our R&D platform was upgraded again: all responsibility for defining online containers and for operations and monitoring was handed to developers, and the application operations role ceased to exist.
Today, as cloud-native technologies mature, moving to the cloud has become standard for enterprises, and it is inevitable that the next-generation R&D platform will be defined around cloud native.
In summary, the combination of three factors, namely advances in technology, changes in organization, and the construction of R&D tools, has allowed DevOps at Alibaba to mature step by step.
Tools behind Alibaba's DevOps adoption
The above introduced the development of technologies and platforms at the macro level. More specifically, the following tools have played a major role in Alibaba's DevOps adoption and in improving R&D efficiency.
The first is the DevOps platform Yunxiao ("Cloud Effect"). Alibaba also once chose the widely used tools GitLab, Jenkins, and Jira, but we later found that pure tooling can only automate individual points such as code management or build and packaging. In real development there is still a lot of work that cannot be automated, such as the rules for how requirements flow, the rules for branch management, and the way development, testing, and operations communicate. We can collectively call this work "collaboration".
Doing collaboration well requires a deep understanding of people, processes, and efficiency, and that understanding has to be abstracted into methods and ultimately built into products. Over years of accumulation, Alibaba has produced many distinctive R&D management methods, such as the Aone-flow code management model, a test environment management model, the AGit-Flow code management model, and the layered project management model used for Double 11. We built these methods into the Yunxiao platform, where they act on people and subtly shape the culture of developer collaboration, which can also be called DevOps culture.

The second is traffic replay testing. This innovation has had a great impact on the test team. By copying online traffic to an offline environment, regression testing is solved at low cost, and traditional testing by hand-writing test cases is reduced to arranging data. The second layer is the use of mock technology, which turns a distributed-system problem into a single-machine problem, so that thousands of test cases can run in a few seconds. On top of these two basic technologies, a test platform can be built that uses algorithms to identify effective traffic, process data automatically, and identify the defects behind abnormal traffic. With these three layers of change, Alibaba's testing efficiency has improved qualitatively.
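The first two layers, replaying recorded traffic and mocking the downstream dependency, can be sketched in a few lines. This is a minimal illustration and not Alibaba's tooling: PriceService, order_total, and the recorded data are hypothetical names and values invented for the example.

```python
from unittest import mock


class PriceService:
    """Stand-in for a remote dependency (hypothetical)."""

    def unit_price(self, item_id):
        raise RuntimeError("network call, not available offline")


def order_total(service, item_id, quantity):
    """Code under test: calls the downstream service once."""
    return service.unit_price(item_id) * quantity


# Recorded traffic: the request, the downstream reply observed at the time,
# and the response that production returned.
RECORDED = [
    {"request": {"item_id": "A1", "quantity": 3}, "downstream": 20, "response": 60},
    {"request": {"item_id": "B2", "quantity": 1}, "downstream": 5, "response": 5},
]


def replay(records, handler):
    """Replay each request with the dependency mocked and collect mismatches."""
    mismatches = []
    for record in records:
        service = mock.Mock(spec=PriceService)
        service.unit_price.return_value = record["downstream"]
        actual = handler(service, **record["request"])
        if actual != record["response"]:
            mismatches.append((record["request"], record["response"], actual))
    return mismatches


print(replay(RECORDED, order_total))  # an empty list means no regression
```

Because the downstream reply comes from the recording, each case runs in-process without any network access, which is what makes running thousands of cases in seconds plausible.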
The third is full-link stress testing (the corresponding product on Alibaba Cloud is called PTS). The core reason Double 11 runs more smoothly every year is that this technology helps developers find risks before each big promotion. Once risks are found, teams need to respond quickly and use DevOps tools to fix the online problems. Each stress test is a drill, somewhat like a military exercise: problems are found quickly and solved quickly, and the team's DevOps capabilities are trained continuously. You could say that Alibaba's DevOps capabilities were forged by one Double 11 after another.
Alibaba's DevOps core concept: loose control, strong checkpoints
When developers begin to define operations and then take them over, our managers have some concerns: will developers perform arbitrary operations that cause online failures, or release at will and cause stability problems?
Alibaba's DevOps has a core philosophy for this: loose control, strong checkpoints.

First, where is it "loose"? "Loose" means that developers have a variety of pipelines to choose from. The application owner can fully define the rules for the application, such as how to release, how to test, and how to configure resources and environments. We offer both general-purpose builds and custom builds to give users maximum freedom. Finally, there is "light on release, heavy on recovery": for each application, developers can use the pipeline to deliver code at any time without special restrictions, and only need to think about how to recover quickly if something goes wrong.
With that much freedom, we must also set up some checkpoints: for example, code review and quality red lines; code security checks and coding convention checks; release freeze windows. There are also the so-called "three axes of change": every change must support gradual (grayscale) rollout, be monitorable, and be able to roll back. These checkpoints ensure that all development engineers in Alibaba Group stay in step with each other and deliver qualified products.
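One way to picture such checkpoints is as a gate that a change must pass before the pipeline releases it. The sketch below is an assumption-laden illustration: the ChangeRequest fields, the checkpoint names, and the 95% pass-rate threshold are invented for the example and do not describe Alibaba's actual rules.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeRequest:
    """Facts about one change (hypothetical fields)."""
    review_approved: bool
    test_pass_rate: float      # 0.0 to 1.0
    security_scan_clean: bool
    has_grayscale_plan: bool
    has_monitoring: bool
    has_rollback_plan: bool


def failed_checkpoints(change, now, freeze_windows, min_pass_rate=0.95):
    """Return the checkpoints this change fails; an empty list means it may release."""
    failures = []
    if not change.review_approved:
        failures.append("code review")
    if change.test_pass_rate < min_pass_rate:  # threshold is an assumption
        failures.append("quality red line")
    if not change.security_scan_clean:
        failures.append("security check")
    if not change.has_grayscale_plan:
        failures.append("grayscale rollout")
    if not change.has_monitoring:
        failures.append("monitoring")
    if not change.has_rollback_plan:
        failures.append("rollback")
    if any(start <= now < end for start, end in freeze_windows):
        failures.append("release freeze window")
    return failures


freeze = [(datetime(2020, 11, 10), datetime(2020, 11, 13))]
change = ChangeRequest(True, 0.99, True, True, True, True)
print(failed_checkpoints(change, datetime(2020, 4, 1), freeze))
print(failed_checkpoints(change, datetime(2020, 11, 11), freeze))
```

The gate only says no; everything else about how the application is built, tested, and released stays with the application owner, which matches the "loose" half of the philosophy.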
Summary: The core of DevOps is to deliver value quickly: give developers maximum freedom and make them responsible for the whole process of development and operations. With monitoring, fault prevention and control tools, and feature switches working together, you can find a balance between protecting the user experience and delivering value quickly.
Alibaba's DevOps core concept: application-centric
How did Alibaba put DevOps into practice quickly? Here I want to focus on the application-centric concept. Application information can be seen as one kind of data in the CMDB. It is naturally familiar to R&D staff because it corresponds directly to a service and a code repository. Starting from the code, we can link together the pipeline, environments, tests, and resources. The outermost layer is the toolchain: monitoring, databases, operations, middleware, and so on.

Using the application to connect the entire toolchain lets developers understand and work through the whole DevOps process. It removes the old split in which developers talk about code and services while operations talks about machines and machine rooms.
Once the tools are connected through the application, developers can logically define their application on the platform and also define its operations rules. Planning environments, creating resources, and setting release strategies, for example, can all be done by developers.
Once the application and its operations are defined, the rule is "whoever defines it is responsible for it", so at Alibaba developers are responsible for the entire life cycle of the application. Through concepts like this and the automation of operations tools, "Dev" gradually took over the work of "Ops". At this point you will find that DevOps is not so complicated after all.
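The application-centric idea can be pictured as a single record that ties the code repository, pipeline, environments, and toolchain to one owner. The following is a hypothetical data model written for illustration only; the class names, fields, and sample values are assumptions and are not the schema of Alibaba's CMDB.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    name: str        # for example "staging" or "production"
    instances: int   # number of running instances


@dataclass
class Application:
    """One application record that the rest of the toolchain hangs off."""
    name: str
    owner: str
    repo_url: str
    pipeline_stages: list = field(default_factory=list)
    environments: list = field(default_factory=list)
    release_strategy: str = "grayscale"
    toolchain: dict = field(default_factory=dict)  # monitoring, DB, middleware


def owner_of(cmdb, app_name):
    """Whoever defines the application is responsible for it."""
    return cmdb[app_name].owner


cmdb = {
    "order-service": Application(
        name="order-service",
        owner="alice",
        repo_url="git@example.com:shop/order-service.git",
        pipeline_stages=["build", "test", "grayscale", "production"],
        environments=[Environment("staging", 2), Environment("production", 20)],
        toolchain={"monitoring": "order-service-dashboard", "db": "orders"},
    )
}

print(owner_of(cmdb, "order-service"))
```

With everything keyed by the application name, an alert from monitoring, a failed pipeline stage, or a resource request can all be routed to the same owner.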
Enjoy the DevOps dividend and become an elite delivery team
Through the DevOps tools mentioned above and the two concepts of "loose control, strong checkpoints" and "application-centric", DevOps took root at Alibaba and produced real efficiency gains. It eliminates dependence on individuals, reduces friction between teams, lowers testing costs, improves quality, and reduces the risk of releasing software. Ultimately it speeds up enterprise innovation, so that Alibaba can respond quickly to one opportunity after another.

The picture above shows some of the figures we released in 2018, when the "211" concept was first proposed: more than 85% of requirements can be delivered within two weeks; more than 85% of requirements can be developed within one week; and code can be released within one hour of being submitted. I suggest you use "211" as your own organization's efficiency goal and work on tools, practices, and culture together so that DevOps brings real gains, rather than doing DevOps for its own sake.
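If you want to track "211" for your own team, the three shares can be computed directly from recorded durations. The sketch below uses made-up sample data and hypothetical function names; the source gives the 85% target only for the first two measures, so the function reports the shares and leaves the comparison to the caller.

```python
from datetime import timedelta


def share_within(durations, limit):
    """Fraction of durations that are no longer than `limit`."""
    if not durations:
        return 0.0
    return sum(d <= limit for d in durations) / len(durations)


def report_211(delivery_times, development_times, release_times):
    """Shares for the three '211' measures: two weeks, one week, one hour."""
    return {
        "delivered_within_2_weeks": share_within(delivery_times, timedelta(weeks=2)),
        "developed_within_1_week": share_within(development_times, timedelta(weeks=1)),
        "released_within_1_hour": share_within(release_times, timedelta(hours=1)),
    }


# Made-up sample data: days per requirement and minutes per release.
report = report_211(
    delivery_times=[timedelta(days=d) for d in (3, 7, 10, 12, 13, 14, 9, 6, 20, 5)],
    development_times=[timedelta(days=d) for d in (2, 3, 4, 5, 6, 7, 1, 2, 8, 3)],
    release_times=[timedelta(minutes=m) for m in (20, 35, 50, 60)],
)

for name, share in report.items():
    print(name, share)
```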
New opportunities brought by the cloud era
From the preceding account of how DevOps developed at Alibaba, it is not hard to see a cycle: we keep running into new problems in software development, which gives rise to new technologies (such as microservices and containerization); the new technologies then bring architectural changes (such as service orientation and the technology middle platform); and eventually a new model of software development takes shape. Now that cloud-native technology is here, what opportunities can it bring us?
What is cloud native? There are various interpretations in the industry; one view is that building application systems on the cloud is cloud native. From the perspective of software development, I think the biggest change cloud native brings is that developers only need to pay attention to business logic, which greatly improves efficiency. How does this happen? Let's compare traditional applications with cloud-native applications.

In traditional software development, the developer's code is deeply coupled with middleware, and the developer has to deal with service discovery, database and table sharding, message processing, and more. We also have to care about where the software is deployed, how much capacity it needs, and even the operating system and storage.
The cloud-native era will be very different. The core capabilities of middleware sink into the cloud infrastructure. Common capabilities such as rate limiting, degradation, and authentication no longer need the developer's attention. The database and runtime environment scale dynamically, so common operations problems no longer need attention either. Developers only need to write good code and let the software delivery platform publish it to the cloud automatically.
The complexity of software development does not disappear; it exists in another form. With cloud-native technology, this complexity sinks to the cloud infrastructure layer, and the cloud shields developers from it.
How is this complexity handled? One core approach is to use data. Under cloud native, the industry has unified technical standards, such as middleware standards and container standards. With standardization and a strong infrastructure, the resulting data is standardized and easy to obtain. With this data, we have the opportunity to build smart tools that address the complexity of software development, or that help developers in their work and so reduce it.
Therefore, cloud-native technology gives us unprecedented opportunities for intelligence and for inclusiveness.
The three major technology systems that affect developers in the cloud-native era
In the cloud-native era, I think three technologies will give developers a whole new experience: CloudIDE at development time, Service Mesh at run time, and Serverless at operations time. CloudIDE moves the development environment to the cloud and can be deeply integrated with the R&D platform to give developers an excellent programming experience. Developers no longer need to care where they develop: as long as there is a browser, they can code.

In the cloud era, middleware will gradually be absorbed into Service Mesh, and developers will no longer need to care about service routing, rate limiting, or degradation.
Serverless scales automatically, so capacity evaluation becomes a thing of the past, and developers no longer care where the machines are.
These three technologies will move the whole development chain to the cloud and generate large amounts of R&D data, service data, and runtime data. In recent years, Alibaba has begun investing in mining and researching this data and maintains close cooperation with academia.
Directions in data application that Alibaba is exploring
Here is a brief introduction to the directions we are currently exploring. On the code side, there are code recommendation, intelligent code review, code search, and sharing of high-quality code. On the operations and monitoring side, we have invested in intelligent baselines, which raise alarms automatically based on fluctuations in monitoring data and avoid configuring rules one by one. There is also release risk control, which automatically blocks the release process when it identifies changes in monitoring data before and after a release. In addition, there are automatically configured business panorama monitoring, full-link insight into business stability, and more.
Below I will go into detail with two examples and talk about the results we have achieved in applying data.
Application of code big data: PRECFIX defect detection technology
Earlier this year, the PRECFIX code defect detection technology (Patch Recommendation by Empirically Clustering) went live in Alibaba's internal production systems, where it helps developers find defects during code review.

There are three main difficulties in applying intelligent methods to defect detection: 1) With no accumulated defect data and no public data set, how do we label the data? 2) Code is a formal language with heavy logic; how do we represent its content? 3) How do we give repair suggestions without hand-written rules?
Our approach is as follows. First, we use data mining to mark commits suspected of containing defects, extract relevant statistical features for learning, and produce a risk assessment from the model. Then we cluster the diffs of the defect commits by code similarity to find the mistakes engineers make frequently and the fixes they commonly apply. When a similar mistake occurs again, the corresponding patch can be suggested to the developer.
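The clustering and recommendation steps can be illustrated with a toy example. This is not the PRECFIX algorithm: it swaps in plain text similarity from Python's difflib and a greedy clustering loop, and the thresholds, function names, and the Java-like snippets in HISTORY are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Text similarity between two code snippets, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_fixes(fixes, threshold=0.8):
    """Greedily group (buggy, fixed) pairs whose buggy code looks alike."""
    clusters = []
    for buggy, fixed in fixes:
        for cluster in clusters:
            if similarity(buggy, cluster[0][0]) >= threshold:
                cluster.append((buggy, fixed))
                break
        else:
            clusters.append([(buggy, fixed)])
    return clusters


def recommend_patch(new_code, clusters, threshold=0.8, min_cluster_size=2):
    """Suggest the first fix of the most similar recurring cluster, if any."""
    best_patch, best_score = None, threshold
    for cluster in clusters:
        if len(cluster) < min_cluster_size:
            continue  # a mistake seen only once is not treated as a pattern
        score = similarity(new_code, cluster[0][0])
        if score >= best_score:
            best_patch, best_score = cluster[0][1], score
    return best_patch


# Made-up history of defect fixes: (code before the fix, code after the fix).
HISTORY = [
    ("if (user.getName().equals(x)) return;", "if (x.equals(user.getName())) return;"),
    ("if (item.getName().equals(y)) return;", "if (y.equals(item.getName())) return;"),
    ("total = price * quantity;", "total = price.multiply(quantity);"),
]

clusters = cluster_fixes(HISTORY)
print(recommend_patch("if (order.getName().equals(z)) return;", clusters))
print(recommend_patch("print('hello')", clusters))
```

A real system would compare parsed code rather than raw text and would adapt the suggested patch to the new variable names. The sketch only shows how a mistake that recurs in history can be turned into a suggestion.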
Application of big data at run time: unattended release

The previous example is a tool on the "Dev" side. The following is a tool on the "Ops" side: unattended release.
We once analyzed all online failures and found that 80% of them were caused by changes. In other words, if you make no changes, there is little chance of failure. Because code release is an important form of online change, keeping the system running stably means putting a checkpoint on releases. We therefore built a tool called "unattended release". It collects system data, log data, business data, and so on, checks various indicators, and uses algorithms to compare how the indicators change before and after a release. Once a problem is found, the release process can be blocked, and automated rollback is even possible. With this technology, any development team can release with confidence, and the operations team no longer has to worry about major failures caused by frequent online changes.
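The before-and-after comparison can be sketched with a simple statistical rule. The talk does not describe the real algorithm, so everything here is an assumption: the three-sigma rule, the function names, and the sample metrics are chosen only to show the shape of the check.

```python
from statistics import mean, stdev


def is_anomalous(before, after, max_sigma=3.0):
    """True if the post-release mean drifts too far from the pre-release baseline."""
    baseline = mean(before)
    spread = stdev(before)  # needs at least two pre-release samples
    if spread == 0:
        return mean(after) != baseline
    return abs(mean(after) - baseline) / spread > max_sigma


def check_release(metrics_before, metrics_after, max_sigma=3.0):
    """Return the decision and the names of the metrics that look abnormal."""
    anomalies = [
        name
        for name in metrics_before
        if is_anomalous(metrics_before[name], metrics_after[name], max_sigma)
    ]
    return ("rollback" if anomalies else "continue"), anomalies


# Made-up samples collected before and after a release.
before = {
    "error_rate": [0.010, 0.012, 0.011, 0.009, 0.010],
    "latency_ms": [120, 118, 125, 122, 119],
}
after = {
    "error_rate": [0.050, 0.055, 0.048],
    "latency_ms": [121, 123, 120],
}

decision, anomalies = check_release(before, after)
print(decision, anomalies)
```

In a grayscale rollout the same check would run after each batch, so that a bad release is stopped while it affects only a small share of traffic.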
The future of Alibaba's software R&D platform: a new Yunxiao is coming soon
In summary, "cloud" and "data" are the biggest opportunities for our next-generation software R&D platform. Good as these data intelligence tools are, they should not be used only by Alibaba. What matters more is realizing the value of the cloud, which is the value of inclusive computing that we talk about.

Therefore, this year we will launch the new DevOps tool platform "Alibaba Cloud Yunxiao" on Alibaba Cloud. It will continue to provide enterprise-grade one-stop DevOps capabilities and will also integrate cloud-native and intelligent capabilities. We are actively preparing, so stay tuned! Interested developers can also contact us in the Yunxiao user group (DingTalk group number: 23362009) to apply for a trial. Thank you all.
[Preview of the next live broadcast]
Live broadcast time: April 10, 19:00-20:00
Live broadcast topic: How small and medium-sized enterprises can develop software from home
Brief introduction: Using Alibaba Cloud Yunxiao, a demonstration of how multiple people in multiple roles develop software online, including continuous integration and continuous delivery
Lecturer: Jiao Ba, head of continuous delivery for Alibaba's R&D collaboration platform, with long-term experience building CI/CD and DevOps
How to watch: live broadcast in the DingTalk group (scan the code to join DingTalk group 23362009)
[About Yunxiao]
Yunxiao (Cloud Effect) is an enterprise-grade one-stop DevOps platform that stems from Alibaba's advanced R&D concepts and engineering practices and aims to become the R&D efficiency engine of digital enterprises. Yunxiao provides end-to-end online collaboration services and R&D tools covering "requirements -> development -> testing -> release -> operations and maintenance -> business operations", and uses artificial intelligence and cloud-native technologies to help developers improve R&D efficiency and deliver effective value.
For more on the cloud, see Yunqi: cloud news, cloud cases, best practices, and product introductions at https://yqh.aliyun.com/
This article is original content of Alibaba Cloud and may not be reproduced without permission.


Origin blog.csdn.net/weixin_43970890/article/details/105225842