From technology to management: applying system optimization techniques to business management

Many engineers hold themselves to high professional standards, work diligently, take on growing responsibilities, and eventually earn enough trust to be promoted into management. However, they often lack management training and cannot optimize the work process as a whole. They keep working as "individual contributors", and when problems come up they jump in themselves and fall behind on their own work. So they read a lot of books and articles, learn a lot about "the art of dealing with people" and "corporate development strategy", and finally become head of the R&D department, while their technical skills gradually fall into disuse. What is management? Are technology and management two completely different career paths?

No. Both technology and management aim for quantitative analysis and global optimization, and they share many methods. Here is a system performance optimization scenario to illustrate:

A program at your company runs on a cluster of 10 servers. Business volume has grown, and the cluster can no longer keep up with the requests. The boss asks you to optimize the program. Handed this headache, you call people from development, testing, and operations into a meeting to find a solution. Some say the database should be upgraded; some say the code is poorly written and needs optimizing; some say there are too few machines and 5 more should be added; some say the architecture should move to the cloud, where such problems will disappear. Whom should you listen to?

Don't rush. As the saying goes, "No optimization without measurement." The first step is to "measure" the situation. Start by asking the designer what the program does and how its workflow runs.

Program architecture: the program handles image recognition. It receives pictures from a network port, recognizes the information in each picture, compares it against the picture library, and finally outputs similar pictures. The process looks like this:

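(Figure: receive picture -> recognize -> compare against picture library -> output similar pictures)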

With the program architecture understood, the next thing we need is measurement data. Some data is easy to get, and some is not readily available. So you assign a task to the R&D team: add instrumentation (tracking points) to the program and collect some metrics as soon as possible. The developers build an instrumented version and deploy it. After it runs in production for a day, you have some metrics:

  • Input: 1 million pictures must be processed per day, collected from upstream processes
  • Recognition function: recognizing one picture takes 0.5 seconds on average
  • Comparison function: comparing one picture takes 0.4 seconds on average

Now let's calculate: processing one picture takes 0.9 seconds (0.5 + 0.4), so one machine can process 96,000 pictures per day (86,400 / 0.9) and 10 machines can process 960,000 pictures per day (96,000 * 10), which is less than 1 million. Reaching 1 million per day requires 10.4 servers (1,000,000 / 96,000), which rounds up to 11.
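
The arithmetic above can be checked with a few lines of Python. This is a sketch under the article's own simplifying assumptions (each server works around the clock at the measured average speed); the function names are chosen here for illustration, and times are given in integer milliseconds only to avoid floating-point rounding.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_capacity(ms_per_picture, concurrency=1):
    """Pictures one server can process per day at the given average time."""
    return SECONDS_PER_DAY * 1000 * concurrency // ms_per_picture


def servers_needed(daily_pictures, ms_per_picture, concurrency=1):
    """Whole servers required for the daily load, rounded up."""
    return math.ceil(daily_pictures / daily_capacity(ms_per_picture, concurrency))


serial_ms = 500 + 400  # recognition + comparison, executed one after the other

print(daily_capacity(serial_ms))             # 96000
print(10 * daily_capacity(serial_ms))        # 960000
print(servers_needed(1_000_000, serial_ms))  # 11
```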

So do you go straight to the boss and say, "We need to buy another server with a GPU!"? Not so fast.

Let's analyze how the program runs: the recognition function and the comparison function execute serially. While the recognition function is busy, the comparison function sits idle, waiting for the recognition result. Likewise, while the comparison function is busy, the recognition function has nothing to do. In other words, server resources are not fully used, and much of the GPU and database capacity is wasted.

How can resource utilization be improved? You can restructure the program as follows:

(Figure: Program X (recognition) -> message queue -> Program Y (comparison))

Split the original program into two programs, deploy them on separate servers, and let them exchange data through a message queue. Each program can now make full use of its server's resources. Let's calculate the throughput again:

  • Program X (recognition): one picture takes 0.5 seconds, so one server processes 172,800 pictures per day (86,400 / 0.5), and 1 million pictures require 5.8 servers (1,000,000 / 172,800), which rounds up to 6.
  • Program Y (comparison): one picture takes 0.4 seconds, so one server processes 216,000 pictures per day (86,400 / 0.4), and 1 million pictures require 4.6 servers (1,000,000 / 216,000), which rounds up to 5.

That is still 11 servers, so at first glance nothing has improved. Look again: the original solution needed 11 servers with GPUs, but now only the 6 servers running Program X need them. That saves 5 GPU cards, which is already a considerable expense.

The architect provides another piece of information: in the original solution, the recognition function and the comparison function run serially, so both are tied to the same number of concurrent threads. With the new scheme they are separate programs, so the comparison function can run with more concurrent threads, up to 4 times as many.
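
To make the idea of two decoupled stages with independent concurrency concrete, here is a small in-process sketch. It is not the article's system: a `queue.Queue` stands in for the real message queue, threads stand in for separate servers, and `recognize`, `compare`, and `run_pipeline` are hypothetical names used only for illustration.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to exit


def recognize(picture):
    # Hypothetical stand-in for Program X's work.
    return f"features({picture})"


def compare(features):
    # Hypothetical stand-in for Program Y's work.
    return f"similar-to-{features}"


def run_pipeline(pictures, recognizers=1, comparers=4):
    """Run both stages concurrently; each stage has its own worker count."""
    inbox, handoff = queue.Queue(), queue.Queue()
    results, lock = [], threading.Lock()

    def stage_x():
        while (item := inbox.get()) is not STOP:
            handoff.put(recognize(item))

    def stage_y():
        while (item := handoff.get()) is not STOP:
            out = compare(item)
            with lock:
                results.append(out)

    xs = [threading.Thread(target=stage_x) for _ in range(recognizers)]
    ys = [threading.Thread(target=stage_y) for _ in range(comparers)]
    for t in xs + ys:
        t.start()
    for p in pictures:
        inbox.put(p)
    for _ in xs:              # one sentinel per recognition worker
        inbox.put(STOP)
    for t in xs:
        t.join()
    for _ in ys:              # stage X is done, so stage Y can be told to stop
        handoff.put(STOP)
    for t in ys:
        t.join()
    return results            # order is not guaranteed


if __name__ == "__main__":
    print(sorted(run_pipeline([f"pic-{i}" for i in range(5)])))
```

The point of the sketch is that the worker count of each stage is now a separate knob, which is exactly what the serial design could not offer.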

This is good news: the throughput of Program Y can be quadrupled. One server then handles 864,000 pictures per day (216,000 * 4), so 1 million pictures need only 1.16 servers, which rounds up to 2.

With the improved architecture, only 6 servers with GPUs plus 2 servers without GPUs are needed, 8 servers in total. This not only handles the workload but also frees up some GPU cards for future business growth.
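
The three layouts discussed so far can be compared side by side. The sketch below uses only the article's figures (0.5 s and 0.4 s per picture, 1 million pictures per day, 4x concurrency for the comparison stage); the `Stage` class and `summarize` helper are illustrative names, and treating 4x threads as 4x throughput follows the article's assumption rather than a measured result.

```python
import math
from dataclasses import dataclass

MS_PER_DAY = 24 * 60 * 60 * 1000
DAILY_PICTURES = 1_000_000


@dataclass(frozen=True)
class Stage:
    name: str
    ms_per_picture: int
    needs_gpu: bool
    concurrency: int = 1  # throughput multiplier per server (assumed linear)

    def servers(self, daily_pictures=DAILY_PICTURES):
        per_server = MS_PER_DAY * self.concurrency // self.ms_per_picture
        return math.ceil(daily_pictures / per_server)


def summarize(stages):
    """Return (total servers, servers that need a GPU) for one layout."""
    total = sum(s.servers() for s in stages)
    gpu = sum(s.servers() for s in stages if s.needs_gpu)
    return total, gpu


layouts = {
    "serial": [Stage("recognize+compare", 900, True)],
    "split": [Stage("X: recognize", 500, True),
              Stage("Y: compare", 400, False)],
    "split, Y at 4x": [Stage("X: recognize", 500, True),
                       Stage("Y: compare", 400, False, concurrency=4)],
}

for name, stages in layouts.items():
    total, gpu = summarize(stages)
    print(f"{name}: {total} servers, {gpu} with GPU")
```

The output reproduces the numbers in the text: 11 GPU servers for the serial design, 11 servers (6 with GPU) after the split, and 8 servers (6 with GPU) once Program Y runs at 4x concurrency.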

That ends the example, which showed the process of optimizing the operating efficiency of an IT system. Business management is in fact a similar process, except that the object of optimization is no longer machines and programs but human activities. A software company runs several processes, such as requirements gathering, product development, and project delivery. Sometimes these processes stall and slow down, much like a struggling IT system. A well-known question asks: "In your team, how long does it take for a change that involves only one line of code to go live?" In other words, how long is the journey from request to delivery? We often run into cases like this: the on-site operations team reports a defect that looks minor and is not hard to fix, yet it takes a long time to resolve. Looking back afterwards, people in every department have something to say:

  • Operations: As soon as I found the problem, I filed it in Jira. Development didn't respond at the time, so I went home.
  • Development: I was working on a feature for the new version, writing some very complicated code. By the time I saw the issue, the workday was already over. Operations had described only the symptom, not the version deployed on site. I didn't know which version to fix, so I made the change in the latest release and sent the package to testing. I also replied in Jira, asking operations to report the on-site version number.
  • Testing: I received the package from development and planned to test it. The whole integration environment had since been upgraded, so I had to roll the test environment back to the old version. That took all morning. In the afternoon I ran the tests, found a few defects, and reported them to development.
  • Development: I received the bugs from testing, fixed them, and released another build. It should be fine this time.
  • Operations: The package in the environment carries no version identifier. It took me a long time checking the MD5 checksums of all versions before I found the version number and posted it to Jira. The problem was urgent and I wanted it solved quickly, so I took the latest build that testing had passed to me and tried to install it. I didn't know whether the package was compatible with the on-site environment, so all I could do was try. I spent a day on the pre-release environment and could not get it installed. It doesn't seem workable.
  • Development: I saw the on-site version number. It is a very old version, more than a year old. I have been on this project for only three months, so I @-mentioned several people on WeChat. I didn't know where the code baseline was, and it took a long time to find. By the time the fix was done it was late, and it still had to go to testing.
  • Testing: The integration environment had to be restored again, which took three hours. Testing confirmed there was no problem, and I handed the build over to operations.
  • Operations: I received the installation package and tried it on the pre-release environment without problems. The production environment was more troublesome. At first I updated only one node and found that the problem still appeared intermittently. Later I learned that two more nodes had to be updated. It took me a day this time, but next time I will know how to do it.

From each person's own perspective, everyone was busy and spent a lot of time on the problem. But from the perspective of the defect itself, the work was constantly stuck and waiting. How much of all this labor was truly effective and produced value? This is the value stream problem that DevOps sets out to solve: a system must be established to measure this process and optimize it continuously.
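
One common way to put a number on this is flow efficiency: value-adding time divided by total elapsed time. The sketch below applies it to the defect story. Apart from the three-hour environment restore, the article gives no exact durations, so the hours and the value/waste classification here are hypothetical figures chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    owner: str
    activity: str
    hours: float
    adds_value: bool


# Hypothetical timeline of the defect; hours are illustrative, not measured.
timeline = [
    Step("ops", "file the defect in Jira", 0.5, True),
    Step("dev", "ticket waits overnight", 16, False),
    Step("dev", "fix applied to the wrong (latest) version", 3, False),
    Step("test", "roll environment back and test wrong build", 8, False),
    Step("ops", "identify deployed version via MD5", 4, False),
    Step("dev", "locate the old code baseline", 6, False),
    Step("dev", "fix on the correct baseline", 1, True),
    Step("test", "restore integration environment", 3, False),
    Step("test", "verify the fix", 1, True),
    Step("ops", "deploy to all production nodes", 8, True),
]


def flow_efficiency(steps):
    """Share of total elapsed hours spent on value-adding work."""
    total = sum(s.hours for s in steps)
    value = sum(s.hours for s in steps if s.adds_value)
    return value / total if total else 0.0


def largest_wastes(steps, n=3):
    """The n longest non-value-adding steps, longest first."""
    waste = [s for s in steps if not s.adds_value]
    return sorted(waste, key=lambda s: s.hours, reverse=True)[:n]


if __name__ == "__main__":
    print(f"flow efficiency: {flow_efficiency(timeline):.0%}")
    for step in largest_wastes(timeline):
        print(f"{step.hours:>5.1f} h  {step.owner}: {step.activity}")
```

With real timestamps taken from the ticket system, the same calculation shows where waiting dominates and which steps deserve attention first.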


Judging from the defect resolution process above, the technical department has many problems, some of which are isolated weak points, such as:

  • Code management: the code baseline is unclear, and released versions cannot be traced back to source
  • Release management: release records are not properly kept
  • Version management: packages are not clearly labeled with version numbers, and compatibility between old and new versions cannot be determined
  • Infrastructure management: R&D staff cannot obtain infrastructure quickly, and setting up a test environment takes a long time
  • Deployment management: testers deploy manually, and a single deployment takes a long time
  • Environment management: there is no inventory of which processes are deployed on which on-site servers, so someone has to log in and check

Having seen these problems, can you start making improvements? Again, not so fast. As with optimizing an IT system, we first have to map out the work process, then measure it, and then optimize it as a whole. While the overall picture is unclear, local optimization is of little use. Optimizing one local step may even be counterproductive and cause greater waste.

Mapping out the overall process is of course difficult. One major obstacle is that business processes are not as well defined as IT system processes. IT systems usually come with documentation, and at the very least the source code can be read. Corporate workflows, by contrast, often contain ambiguities, and the responsibilities of departments and positions are not clearly defined. People are also not "obedient" the way a program is: they improvise in order to get their tasks done. Every company therefore needs to review its positions and workflows, untangle these vague processes as far as possible, and draw up process standards that fit its own business. This work is essential. People from technical positions know the actual work process well, which gives them an advantage here when they move into management.

Once the workflow is clear, its nodes can be measured. Visualization tools such as Kanban boards, resource allocation views, and task burndown charts help analyze the data, reveal stalled activities, and identify bottleneck resources. There are systematic methods for this, and the software industry has been borrowing lean production theory from manufacturing. For a large software company, better management can yield very large efficiency gains.
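
As a rough illustration of measuring process nodes, the sketch below computes the average time tickets spend in each stage and reports the slowest stage as the likely bottleneck. The rows are a hypothetical export of stage transitions (ticket, stage, entered, left); real tools such as Jira expose similar data, but the format, the ticket IDs, and the timestamps here are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M"

# Hypothetical stage transitions: (ticket, stage, entered, left).
rows = [
    ("T-1", "development", "2024-03-04 09:00", "2024-03-04 15:00"),
    ("T-1", "testing",     "2024-03-04 15:00", "2024-03-06 11:00"),
    ("T-1", "deployment",  "2024-03-06 11:00", "2024-03-06 17:00"),
    ("T-2", "development", "2024-03-05 10:00", "2024-03-05 14:00"),
    ("T-2", "testing",     "2024-03-05 14:00", "2024-03-07 10:00"),
    ("T-2", "deployment",  "2024-03-07 10:00", "2024-03-07 12:00"),
]


def mean_hours_per_stage(rows):
    """Average elapsed hours a ticket spends in each stage."""
    durations = defaultdict(list)
    for _ticket, stage, entered, left in rows:
        delta = datetime.strptime(left, FMT) - datetime.strptime(entered, FMT)
        durations[stage].append(delta.total_seconds() / 3600)
    return {stage: mean(hours) for stage, hours in durations.items()}


def bottleneck(rows):
    """Stage with the highest average elapsed time."""
    averages = mean_hours_per_stage(rows)
    return max(averages, key=averages.get)


if __name__ == "__main__":
    for stage, hours in mean_hours_per_stage(rows).items():
        print(f"{stage:<12} {hours:5.1f} h")
    print("bottleneck:", bottleneck(rows))
```

Elapsed time per stage includes waiting, which is usually what matters here: a stage can be the bottleneck because work queues up in front of it, not because the work itself is slow.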

Origin blog.csdn.net/weixin_45132238/article/details/108955806