Let's talk production safety with front-line experts from Bilibili and Vipshop! | TakinTalks Expert Talks

"Production safety" is a newly emerging concept in the Internet industry, borrowed from traditional industries. There, strengthening production safety means preventing and reducing accidents, protecting people's lives and property, and promoting sustainable and healthy economic and social development.

With the development of the Internet, the digital economy now accounts for more than 30% of total economic volume. In emerging Internet enterprises that run mainly on IT systems, system failures do not cause personal injury, but the losses they bring to the enterprise should not be underestimated. Besides direct economic losses, failures can drive away large numbers of users and seriously damage the corporate image.

For this reason, digital businesses also need "production safety". Building a complete production safety system can help companies address these current pain points:

1. The production safety infrastructure of Internet enterprise systems is weak;

2. Technical staff lack systematic, standardized guidance;

3. Production safety supervision is weak and is not fully enforced.

 

Figure: Shulie Technology's overall framework for building production safety (stability governance) in microservice systems

 

In this issue of [Expert Talks], we invited representatives of Internet companies and three lecturers from the TakinTalks community: Lv Fan, head of the B&C-side architecture group for Bilibili Live; Chen Junfeng, middleware technology expert at Vipshop; and Yang Dehua, co-founder of Shulie Technology. They held an in-depth discussion on event assurance efficiency, chaos engineering, managing people through norms, production safety, and more. The highlights of the dialogue follow; we hope they give you some inspiration.

(See the full version of the playback video at the end of the article)

 

Expert introduction

 

Expert opinion clashes

 

1. Event assurance is something many companies have to do. Beyond how well the assurance works, its cost and efficiency have become the biggest concerns. Is there a good way to reduce cost and increase efficiency?

Bilibili Lv Fan

Stress testing took the most manpower in event assurance. At the time, 20 or 30 people worked on it together. The most troublesome part was mapping the call links: for that stress test, link mapping alone took three weeks. However, Bilibili's stress test scenarios do not change very much. Newly released features, such as the virtual anchor feature we built recently, do not carry especially high traffic and do not have a big impact on stability. So once the core links have been mapped, the stress test scenarios and scripts can be reused, which reduces cost and increases efficiency for subsequent events.

On improving efficiency, there is one very important point: when you take preparation to the extreme, you save a lot of trouble later. For example, we have several scenarios that need to be stress tested together. If they are tested separately, the results are invalid. With sufficient preparation, that does not happen.

Shulie Technology Yang Dehua

Shulie Technology has helped many companies, such as SF Express, China Mobile, and China Life, run full-link stress tests in their production environments. They also pay close attention to the efficiency and cost of assurance. Link mapping, for example, consumes both labor and time, so it is a key place to improve efficiency. Because some enterprises change their business and their dependencies frequently, Takin, Shulie Technology's open-source product, has added automatic link mapping in response to user needs. It helps enterprises cut the time spent on link mapping and improves efficiency by more than 50%.
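To make the idea of automatic link mapping concrete, here is a minimal sketch. It is not how Takin implements the feature; it only assumes that a tracing agent reports (caller, callee) pairs, and the service names and the functions `build_call_graph` and `links_from` are hypothetical.

```python
from collections import defaultdict


def build_call_graph(spans):
    """Map each service to the set of services it calls directly."""
    graph = defaultdict(set)
    for caller, callee in spans:
        graph[caller].add(callee)
    return graph


def links_from(entry, graph):
    """List every call path from an entry service down to its leaf dependencies."""
    paths = []

    def walk(node, path):
        # Skip services already on the path so that cyclic calls terminate.
        callees = sorted(c for c in graph.get(node, ()) if c not in path)
        if not callees:
            paths.append(path)
            return
        for callee in callees:
            walk(callee, path + [callee])

    walk(entry, [entry])
    return paths


# Hypothetical spans, as a tracing agent might report them.
spans = [
    ("gateway", "live-room"),
    ("live-room", "user-profile"),
    ("live-room", "gift"),
    ("gift", "payment"),
    ("gateway", "live-room"),  # repeated calls collapse into one edge
]

for path in links_from("gateway", build_call_graph(spans)):
    print(" -> ".join(path))
```

Each printed path is one candidate link to cover in a stress test script. When dependencies change, rerunning the mapping on fresh spans replaces the manual review.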

Vipshop Chen Junfeng

It is true that assurance for every large-scale event takes a lot of time and manpower. Recently we have been working on optimizing this, with actions at both the business level and the technical level.

At the business level, we are changing how events are run and making promotions routine, so that the traffic of a big promotion is not all funneled into one concentrated point in time. Double Eleven, for example, no longer sells only on that one night but starts from November 1st. When traffic is spread more evenly, the pressure on assurance is also reduced.

At the technical level, it comes down to the product maturity of the underlying components. We now have a stress testing platform that integrates several stress testing methods. In the past, each team ran stress tests in its own environment, including traffic recording and replay. Now these are integrated, and the standardized process is built into the product, which also helps a great deal with staff efficiency.

 

2. Nipping problems in the bud is the best protection. Chaos engineering and fault drills are starting to draw wider attention. How do they work in practice?

Bilibili Lv Fan

The core of assurance work is still people. Online system anomalies cannot be avoided entirely. Besides exercising the stability of the system, we should also train the people responsible for assurance, so that they can handle problems calmly and promptly. Fault drills are an important part of chaos engineering, and they are also the main way to train both people and systems. At present, Bilibili does not run drills online but in a dedicated drill environment. Because online drills can easily pollute the production environment with dirty data, we replicate traffic into a replicated environment, and we will keep improving this in the future.

Shulie Technology Yang Dehua

The purpose of chaos engineering is to find and solve problems in advance, ensuring system stability and improving user experience. System stability is the outcome metric we pursue, and chaos engineering is a new and effective way to improve the process metrics behind it. Many Shulie Technology customers have also put this into practice, and most of them use fault drills as the entry point. Ideally, the fault drill process is: run drills routinely, identify the system's risk points, optimize the business systems, and produce feasible and effective fault handling plans.
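A fault drill of this kind can be sketched in a few lines. This is an illustration only, not any company's tooling: `get_recommendations` and `home_page` are hypothetical stand-ins for a dependency and for the business logic being exercised, and the fault is an exception injected at a chosen rate.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a drill."""


def with_fault(func, failure_rate, rng):
    """Wrap a dependency call so that a fraction of calls fail on purpose."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(func.__name__)
        return func(*args, **kwargs)
    return wrapped


def get_recommendations(user_id):
    # Hypothetical downstream dependency.
    return [f"item-{user_id}-1", f"item-{user_id}-2"]


def home_page(user_id, fetch):
    """Business logic under drill: degrade to a default list if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["default-item"]


def run_drill(calls, failure_rate, seed=0):
    """Run the business logic against the faulty dependency and count outcomes."""
    rng = random.Random(seed)  # seeded so the drill is repeatable
    faulty = with_fault(get_recommendations, failure_rate, rng)
    degraded = unhandled = 0
    for user_id in range(calls):
        try:
            if home_page(user_id, faulty) == ["default-item"]:
                degraded += 1
        except Exception:
            unhandled += 1  # a risk point: the fault escaped the handling plan
    return {"calls": calls, "degraded": degraded, "unhandled": unhandled}


print(run_drill(calls=1000, failure_rate=0.2))
```

A non-zero `unhandled` count marks a risk point to fix before the next drill. A zero count suggests the handling plan for this particular fault works as intended.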

Vipshop Chen Junfeng

Chaos engineering is very useful. It turns passive into active: you let failures happen ahead of time, observe how the system behaves, and prepare response plans in advance. At present we have built an isolated environment for chaos engineering and have a related product prototype, but it has not yet landed in the production environment. After all, it takes a certain amount of courage to run chaos engineering in production. Besides having a firm grasp of the system, it has to be driven from the top down before it can truly be implemented.

 

3. The core of assurance work lies in people. All kinds of norms are becoming common, but how to formulate and implement them has become a new difficulty. Are there any good practices?

Shulie Technology Yang Dehua

There is an old Chinese saying that nothing can be accomplished without rules. Ray Dalio also wrote a book called "Principles". To a certain extent, norms are equivalent to principles: the rules by which one speaks and acts. Norms are really a management method, and even norms written for systems are mainly aimed at people. When first drafting a norm, its purpose and its specific scope of application should be clearly defined, so that it is better targeted.

I also have an idea here: could we borrow from fault drills and deliberately inject some violations of the norms, to see whether the person in charge catches them? That would give quick feedback. Perhaps we can try this in the future.

Bilibili Lv Fan

On implementing norms, we pay attention to ROI in everything we do. It is not necessary to force every service to comply. The main requirement is that core business complies with the relevant norms. In addition, we run reviews with specific tools, and we consciously strengthen everyone's awareness of norms through regular communication and training. For core business we also have a penalty mechanism: violating the relevant norms triggers penalties, and the people involved are held responsible.

Vipshop Chen Junfeng

Most norms and processes are drawn from experience, summed up from past bad cases and good cases. That keeps them closer to real application scenarios and reduces resistance to adoption. As for actually implementing the norms, we usually do it through platform tools, which makes inspection, review, and iteration more standardized and convenient.
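As a rough illustration of checking norms with tooling while enforcing them only on core business, consider the sketch below. The rules, field names, and services are hypothetical; they are not taken from Bilibili's or Vipshop's platforms.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    core: bool  # whether the service belongs to core business
    config: dict = field(default_factory=dict)


# Hypothetical norms: each rule name maps to a predicate over the service config.
RULES = {
    "has_owner": lambda cfg: bool(cfg.get("owner")),
    "timeout_set": lambda cfg: 0 < cfg.get("timeout_ms", 0) <= 3000,
    "alerting_enabled": lambda cfg: cfg.get("alerting") is True,
}


def check(service):
    """Return the names of the rules this service violates."""
    return [name for name, ok in RULES.items() if not ok(service.config)]


def review(services):
    """Split violations into blocking (core services) and advisory (the rest)."""
    report = {"blocking": {}, "advisory": {}}
    for svc in services:
        violations = check(svc)
        if violations:
            report["blocking" if svc.core else "advisory"][svc.name] = violations
    return report


services = [
    Service("order", True, {"owner": "team-a", "timeout_ms": 500, "alerting": True}),
    Service("payment", True, {"owner": "team-b", "alerting": True}),
    Service("badge", False, {}),
]
print(review(services))
```

Keeping the rules as data makes them easy to review and update as new bad cases are summarized, and the blocking/advisory split keeps enforcement cost proportional to business importance.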

 

4. "Production safety" is a newly emerging concept in the industry. What is your view of it?

Shulie Technology Yang Dehua

Production safety in digital business differs from that in traditional industries: it is closely tied to user experience and to the normal operation of the business. Some enterprise applications today already touch the national economy, people's livelihood, and infrastructure. Across the whole timeline of a fault, from the cause being introduced, to the fault occurring, to its detection, to technical staff coming online, there are refined indicators for when it can be located and when service can be restored. Set targets in advance, then run drills against those targets or review these time points after real faults, looking first at the surface problem and then finding the root cause. I believe this will gradually form a set of general standards in the future.

The ultimate goal of production safety is zero major failures. Of course, many companies have major failures every month. So when most companies put this into practice, they first set reducing the number of major failures as a phased goal, and then pursue zero. To finally reach zero major failures, the risk across the entire R&D process has to be measured and then resolved in advance.

Vipshop Chen Junfeng

For us front-line engineers, production safety currently means avoiding system failures. Many major failures are caused by small changes: through the negligence of the person making the change, a small problem gradually grows into a big one and finally leads to a major production failure. Constraints on personnel and change monitoring are therefore both important. Vipshop will also invest in areas such as making chaos engineering routine, automating capacity planning, and unitization.

Bilibili Lv Fan

On production safety, we mainly focus on the "1-5-10" concept: detect within 1 minute, locate within 5 minutes, and resolve within 10 minutes. We do not, however, enforce it everywhere with mandatory requirements and measures. During S11, when everyone sat together, handling was faster, but achieving "1-5-10" in day-to-day operation or for non-core business is a rather high bar. Of course, "1-5-10", multi-active architecture, multi-datacenter deployment, and so on are directions we will keep developing.
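The "1-5-10" target can be checked against an incident timeline with very little code. The sketch below assumes all three limits are measured from the moment the fault occurs, which the discussion does not specify; the timestamps and the `evaluate` function are hypothetical.

```python
from datetime import datetime, timedelta

# Limits in minutes, each measured from the moment the fault occurs (an assumption).
TARGETS = {"detected": 1, "located": 5, "resolved": 10}


def evaluate(incident):
    """Compare each stage of an incident against its 1-5-10 limit."""
    start = incident["occurred"]
    result = {}
    for stage, limit in TARGETS.items():
        elapsed = (incident[stage] - start) / timedelta(minutes=1)
        result[stage] = {"minutes": elapsed, "met": elapsed <= limit}
    return result


# Hypothetical incident timeline.
incident = {
    "occurred": datetime(2022, 5, 1, 20, 0, 0),
    "detected": datetime(2022, 5, 1, 20, 0, 40),
    "located": datetime(2022, 5, 1, 20, 4, 30),
    "resolved": datetime(2022, 5, 1, 20, 12, 0),
}

for stage, outcome in evaluate(incident).items():
    status = "met" if outcome["met"] else "missed"
    print(f"{stage}: {outcome['minutes']:.1f} min ({status})")
```

In this example detection and location meet their limits while resolution takes 12 minutes and misses the 10-minute limit. The same check can be run on drill results or on post-incident reviews.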

Full replay video: https://news.shulie.io/?cat=5&cnel=ff530


Origin my.oschina.net/u/5129714/blog/5533161