Viruses and Faults: On Faults in Computer Software

Original address: Liang Guizhao's blog

Blog address: http://blog.720ui.com

Follow the official account "Server Thinking": a group of like-minded people who grow together, improve together, and break through the limits of their cognition.

The novel coronavirus outbreak has recently become the focus of public attention. Over the Chinese New Year holiday I read some related news and books, and one of them, Carl Zimmer's "Virus Planet" (English title: "A Planet of Viruses"), left a deep impression on me. This article is not about the novel coronavirus or "Virus Planet" itself. Instead, it compares software faults to viruses and discusses how computer software responds to faults. The popular-science facts and figures about viruses come from "Virus Planet".

1. Faults: viruses lurking in computer software

Human rhinovirus, the culprit behind the common cold and asthma, is a widespread old acquaintance of mankind. Rhinoviruses cleverly use mucus to spread. When a person blows their nose, the virus gets onto their hand and from there onto doorknobs and whatever else the hand touches. The next time someone else touches these surfaces, the virus gets onto their hands and into their bodies, most often through the nose. Rhinoviruses trick cells into opening a "little door" for them and then invade the cells lining the inside of the nose, throat, or lungs. Over the next few hours, they use the host cell to replicate their genetic material and the protein coat that encloses it, and the new viruses then break out of the host cell. In addition, each of us carries nearly 100,000 fragments of endogenous retroviral DNA in our genome, accounting for 8% of all human DNA. While most of this viral DNA is useless, our ancestors did "requisition" some viruses that were good for us, and without them we would not even be born. Even in the most recent stretch of evolutionary history, as humans emerged, viruses contributed to our survival. There is no such thing as "them" and "us": living things are essentially a pool of DNA that keeps mixing and shifting. Rhinoviruses were already giving ancient Egyptians colds thousands of years ago, and endogenous retroviruses invaded the genomes of our primate ancestors tens of millions of years ago. (From "Virus Planet")

Faults are similar: like fragments of viral DNA in a living organism, they are entangled with computer software and cannot be separated from it. With today's frequent development iterations, it is hard to eliminate every fault. The best we can do is find and fix as many problems as possible before they reach the production environment and cause online incidents. When we are infected by a virus, cells release signaling molecules called "cytokines" that summon nearby immune cells. These trigger an inflammatory response while the immune system works to wipe out the virus. Software has a similar mechanism: once developers or testers confirm a bug, they record it immediately, notify the people responsible for fixing it, and keep tracking it until the fault is resolved.

2. We have heard of many cases, yet still cannot eliminate faults

One reason the cold is so hard to cure is that it comes in many forms, thanks to the genetic diversity produced by mutation and rapid replication. Faults are similar: there may be only a handful of underlying triggers, but technical complexity and business complexity together make the software as a whole complex, so the same triggers surface in many different forms.

We all know that an NPE (NullPointerException) can cause serious damage, yet we often forget or ignore it in day-to-day development. In the following example, the NPE occurs because the object is not null-checked before it is used.

public static void npe03() {
    Person person = null;
    // person is null, so reading its field throws NullPointerException
    System.out.println(person.blog);
}
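The missing non-null check could be sketched as follows. This is only an illustration: the original does not show the Person class, so the nested Person below is a hypothetical stand-in, and the "unknown" fallback value is an assumption as well. It shows an explicit null check and the same guard written with java.util.Optional.

```java
import java.util.Optional;

class NullCheckDemo {
    // Hypothetical stand-in for the Person class used above; its real definition is not shown.
    static class Person {
        String blog;

        Person(String blog) {
            this.blog = blog;
        }
    }

    // Explicit guard: return a fallback instead of dereferencing a null person.
    static String blogOf(Person person) {
        if (person == null) {
            return "unknown";
        }
        return person.blog;
    }

    // Same guard with Optional; this variant also falls back when the blog field is null.
    static String blogOfOptional(Person person) {
        return Optional.ofNullable(person)
                .map(p -> p.blog)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(blogOf(null));
        System.out.println(blogOfOptional(new Person("http://blog.720ui.com")));
    }
}
```

Which variant fits better depends on the code base; the point is only that the null case is handled before the field is read.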

The example below also throws an NPE. Can you spot why?

public static void npe01() {
    Integer x = 1;
    Integer y = 2;
    Integer z = null;
    // x * y has type int, so the whole conditional is typed int and z is unboxed: NPE
    Integer val = false ? x * y : z;
}
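The cause is the typing rule for the conditional operator: when one branch is an int (here x * y) and the other an Integer, the expression as a whole has type int, so the Integer branch is unboxed when it is selected. A minimal sketch of one way to avoid this, with hypothetical method names, is to give both branches the type Integer:

```java
class TernaryUnboxingDemo {
    // Mirrors npe01: the int-typed branch (x * y) makes the whole expression int,
    // so z is unboxed when it is selected.
    static Integer unsafe(boolean flag, Integer x, Integer y, Integer z) {
        return flag ? x * y : z; // NullPointerException if flag is false and z is null
    }

    // Both branches have type Integer, so z is never unboxed.
    static Integer safe(boolean flag, Integer x, Integer y, Integer z) {
        return flag ? Integer.valueOf(x * y) : z;
    }

    public static void main(String[] args) {
        System.out.println(safe(false, 1, 2, null)); // prints "null"
        try {
            unsafe(false, 1, 2, null);
        } catch (NullPointerException e) {
            System.out.println("unsafe variant threw NullPointerException");
        }
    }
}
```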

The following example is another very typical NPE caused by Java's automatic boxing and unboxing: a null Long is passed to a method that takes a primitive long.

public static void invoke() {
    Long x = null;
    // x is auto-unboxed to long at the call site, which throws NullPointerException
    npe02(x);
}

public static void npe02(long x) {
    System.out.println(x);
}

Here is an interesting bug. Everyone knows that an infinite loop can drive CPU usage to 100%, but the causes of 100% CPU are diverse. My team once ran into a bug in JDK 8 in which recursive object creation and table expansion in ConcurrentHashMap resulted in an infinite loop. Article link: https://mp.weixin.qq.com/s/O6UmB7YDKIYtNvqCOjNwDQ.

3. Fault emergency mechanism: monitoring, alerting, and contingency plans

An online fault usually has serious consequences, so we need to resolve it as quickly as possible to limit its impact and the resulting financial loss. My team's slogan used to be "1-5-10": one minute to detect, five minutes to respond, and ten minutes to resolve.

So how do we detect online faults quickly? The key is a mature monitoring system. For example, Zabbix or Prometheus can monitor the health of infrastructure (MySQL, Redis, MongoDB, ElasticSearch, and so on) and of business systems, along with fluctuations in CPU, memory, disk I/O, and network I/O, as well as GC activity and binlog synchronization. Once a problem is detected, the alerting system must notify the person responsible through multiple channels (email, DingTalk, phone) according to business rules.

And how do we resolve faults quickly? First, we need a complete log investigation system (log aggregation plus distributed tracing). We also need to be able to dump JVM stack information quickly, with a DevOps platform that supports self-service analysis. Most important of all is a contingency plan system: a complete set of prepared plans that let us respond quickly to different problems. For example, if a service becomes unavailable and I find that its process has hung, I can run the corresponding plan, which executes a shell script to bring the process back to life. Or, if monitoring alerts show abnormal fluctuations in a merchant's funds, I can run the corresponding plan, which quickly trips a circuit breaker through a dynamic switch.
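The idea of a plan system, prepared actions keyed by the kind of alarm, can be sketched roughly as below. This is not the author's platform: the class, the alarm names FUNDS_ANOMALY and SERVICE_HUNG, and the payout switch are all hypothetical, and a real plan for a hung process would launch a restart script instead of printing a message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

final class ContingencyPlans {
    // Dynamic switch (hypothetical): while the breaker is open, payouts are rejected.
    private final AtomicBoolean payoutBreakerOpen = new AtomicBoolean(false);
    // Alarm type -> prepared action. The alarm names are made up for this sketch.
    private final Map<String, Runnable> plans = new ConcurrentHashMap<>();

    ContingencyPlans() {
        // Abnormal fund fluctuation: trip the breaker through the dynamic switch.
        register("FUNDS_ANOMALY", () -> payoutBreakerOpen.set(true));
    }

    void register(String alarmType, Runnable plan) {
        plans.put(alarmType, plan);
    }

    // Runs the plan for this alarm. Returns false when no plan exists,
    // which means a person has to handle the fault manually.
    boolean onAlarm(String alarmType) {
        Runnable plan = plans.get(alarmType);
        if (plan == null) {
            return false;
        }
        plan.run();
        return true;
    }

    boolean payoutAllowed() {
        return !payoutBreakerOpen.get();
    }

    public static void main(String[] args) {
        ContingencyPlans plans = new ContingencyPlans();
        // A real plan for a hung process would launch a restart script here;
        // this sketch only prints what it would do.
        plans.register("SERVICE_HUNG", () -> System.out.println("would run restart script"));

        plans.onAlarm("SERVICE_HUNG");
        plans.onAlarm("FUNDS_ANOMALY");
        System.out.println("payout allowed: " + plans.payoutAllowed());
    }
}
```

Keeping the plans in a registry means that an alarm with no matching plan is detected explicitly (onAlarm returns false) instead of being silently ignored.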

But what if there is no contingency plan for an online fault, and the fault is a hard one? There is nothing for it but to grit our teeth and deal with it. Note that a fault's severity rating is generally determined by the scope of its business impact, and at bottom this comes down to financial loss. There are direct losses: some time ago, for example, free coupons at a well-known e-commerce company caused very serious losses. And there are indirect losses: if a severed cable makes an entire app unusable, the transaction volume lost during the outage also amounts to a considerable loss. Ratings can also change over time: if a fault originally rated P2 has still not been fixed after 2 hours, it may be upgraded to P1.
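The escalation rule in that last example can be written down as a small sketch. The two-hour threshold and the P2-to-P1 upgrade come from the example above; the class and method names are hypothetical, and real rating rules would also weigh the business impact.

```java
import java.time.Duration;

final class FaultEscalation {
    enum Severity { P1, P2 }

    // Threshold taken from the example in the text: two hours without a fix.
    static final Duration ESCALATE_AFTER = Duration.ofHours(2);

    // Returns the rating that applies once the fault has been open for the given time.
    static Severity current(Severity initial, Duration unresolvedFor) {
        if (initial == Severity.P2 && unresolvedFor.compareTo(ESCALATE_AFTER) >= 0) {
            return Severity.P1;
        }
        return initial;
    }

    public static void main(String[] args) {
        System.out.println(current(Severity.P2, Duration.ofMinutes(30)));
        System.out.println(current(Severity.P2, Duration.ofHours(3)));
    }
}
```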

In the worst case, when a fault cannot be resolved quickly, we will probably have to stop the service for maintenance. The novel coronavirus outbreak, for example, has led to city lockdowns and road closures across the country. This is because there is no specific medicine (the equivalent of a contingency plan in software) and no new drug has yet been developed (the equivalent of a fix), so the only option is to close cities and roads (the equivalent of stopping the service).

4. Finding faults early: fault drills

Normally, the occasional cold builds up our immunity. In computing, occasional fault drills likewise help keep the systems running online as free of defects and failures as possible. Here Netflix has brought a new way of thinking about uncertainty: Chaos Engineering. Chaos engineering asks us to accept that a system will inevitably have defects and failures, to run a series of experiments that expose the risk points likely to cause problems, and then to keep hardening the system.


A fault drill can simulate scenarios such as full CPU load, killing a specified process, unreachable domain names, network latency, network packet loss, a full disk, and high disk I/O.
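To make the idea of a drill concrete, here is a minimal in-process sketch that injects two of the simpler faults, added latency and failed calls, around an arbitrary operation. It is not a real chaos engineering tool and cannot simulate CPU, disk, or process-level faults; the class name, parameters, and exception type are assumptions chosen for illustration.

```java
import java.util.Random;
import java.util.function.Supplier;

final class FaultInjector {
    private final Random random;
    private final double failureRate; // probability in [0, 1] that a call fails
    private final long delayMillis;   // latency added before every call

    FaultInjector(long seed, double failureRate, long delayMillis) {
        this.random = new Random(seed);
        this.failureRate = failureRate;
        this.delayMillis = delayMillis;
    }

    // Calls the target after the configured delay, or throws an injected failure instead.
    <T> T call(Supplier<T> target) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during injected delay", e);
        }
        if (random.nextDouble() < failureRate) {
            throw new IllegalStateException("injected fault");
        }
        return target.get();
    }

    public static void main(String[] args) {
        FaultInjector injector = new FaultInjector(42L, 0.3, 10L);
        int failures = 0;
        for (int i = 0; i < 20; i++) {
            try {
                injector.call(() -> "ok");
            } catch (IllegalStateException e) {
                failures++; // a resilient caller would retry or fall back here
            }
        }
        System.out.println("injected failures: " + failures + " of 20");
    }
}
```

Running the calling code against such a wrapper shows whether its retries, timeouts, and fallbacks actually work before a real fault tests them in production.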

To sum up, faults are like viruses lurking in computer software. Because of technical and business complexity, they are hard to troubleshoot and resolve. Monitoring, alerting, contingency plans, and fault drills help us find and fix faults early.

