Computer Organization in Plain Language — Data Integrity (Part 1): What Do We Do When the Hardware Breaks? (Lecture 49)

I. A Primer

Back in 2012, early in my career, I ran into a bug triggered by unreliable hardware. It was because of this bug that I gradually began spending a lot of time revisiting the foundational knowledge underlying the entire computer system.

At the time, I was leading a team of about 20 people at MediaV, responsible for the company's advertising data and machine learning algorithms. Part of that job was running the Hadoop cluster that handled all of our data and reporting workloads. Our business was growing rapidly, so we were frequently buying new machines for the Hadoop cluster. In 2012, domestic cloud computing platforms were not yet mature, so we purchased our own hardware and hosted it in a data center.

Our Hadoop cluster grew from 100 servers to 1,000. We felt that brand-name servers like Dell's were too expensive, and their hardware configurations didn't match our expectations anyway. So our operations colleagues started working with OEM manufacturers to build custom servers and to purchase hard drives and memory in bulk.

We had all heard the story that in its early days, Google bought large amounts of second-hand hardware to cut costs, relying on distributed software to guarantee system reliability. We weren't so stingy as to buy second-hand hardware, but we did choose ordinary mechanical hard disks for the data center instead of enterprise-grade ones, and ordinary memory instead of server memory with ECC error correction, thinking we could save a little money.

II. The Single-Bit Flip: Hardware Errors That Software Cannot Solve

1. Symptoms

Then one day, our largest application — the hourly data-processing report — started finishing much later than usual. At first we didn't pay much attention; after all, the data volume was growing every day, so a little slowdown seemed natural. But then worse things started to happen.

Gradually, we found that the reporting job could no longer finish within the hour, and occasionally an entire reporting run would fail outright. So we had to put down our development work and start troubleshooting.

If you have used Hadoop, you may know that, as a distributed application, it takes hardware failure into account: when a computation errors out on a particular node, Hadoop retries that part of the computation. The reports had been running slowly precisely because some tasks were failing on certain nodes and only succeeding after retries. Digging further, we found that the errors in the program were very strange. Some computed results, such as "34 + 23", which should have been "57", instead came out as a dollar sign "$".

After a week of back-and-forth, we found from the logs that the failing jobs were concentrated on a few fixed hardware nodes.

We also noticed that the problems began after a new batch of our custom-built hardware was racked. After talking with the operations team about recent hardware changes and reading through many threads on the Hadoop community mailing lists, we made a bold guess.

2. Investigation and Analysis

We speculated that the errors came from our custom hardware. The custom machines did not use ECC memory, and under heavy data volume they were hitting the legendary hardware error: the single-bit flip (Single-Bit Flip) in memory.

But where did that dollar sign come from? It came from an integer character in memory that underwent a single bit flip. The ASCII code for "$" is 00100100 in binary, so it could well have come from 00110100 — the ASCII code for the digit "4" — with a single bit flipped (bit 4, counting from the lowest bit as bit 0). Still, this was only a guess; we could not be sure, because a single-bit flip is a random phenomenon and we could not reproduce the problem reliably.
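The "4"-to-"$" hypothesis can be checked in a few lines. This is just an illustrative sketch of the bit arithmetic, not the diagnostic tool we used at the time:

```python
# Flipping bit 4 of the ASCII byte for "4" yields the byte for "$".
original = ord("4")            # 0b00110100 == 52
flipped = original ^ (1 << 4)  # flip bit 4 -> 0b00100100 == 36

print(bin(original))  # 0b110100
print(bin(flipped))   # 0b100100
print(chr(flipped))   # $
```

A single flipped bit is enough to turn a digit into an entirely different printable character, which is exactly the kind of corruption we saw in the reports.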

 

 

ECC memory is short for Error-Correcting Code memory — that is, error-correcting memory. As the name suggests, when an error occurs in memory, it can correct the error itself.

3. Solving the Problem

After talking it over with the operations team, we replaced the memory in all of our custom-built servers with ECC memory, and the problem disappeared. That also convinced us that the root cause was indeed the lack of ECC memory. And since the removed memory had nowhere else to go, it was all installed in the R&D team's development machines — so back in 2012, every engineer's development machine in our group got upgraded to 32 GB of RAM.

III. Parity and the Parity Bit: A Good Way to Catch Errors

In fact, single-bit errors, or bit flips, in memory are not a particularly rare phenomenon. Whether caused by manufacturing defects that lead to charge leakage or by external radiation, there is always some probability of a single-bit error. Data errors at the memory level are invisible to us software engineers, and such errors tend to be random. Random errors that are hard to reproduce are obviously intolerable, so we need a way to guard against the problem.

In fact, even before ECC memory was invented, engineers were already using parity to detect these errors.

1. Parity and the Parity Bit

The idea behind parity is very simple. We treat N bits of memory as one group — most commonly eight bits, one byte. Then we use one extra bit to record whether those eight bits contain an odd or an even number of 1s. If there is an odd number of 1s, the extra bit records 1; if an even number, it records 0. That extra bit is what we call the parity bit.

If, unfortunately, a single-bit flip occurs within this byte, then the checksum computed from the data bits will no longer match the stored parity bit, and the memory will know something has gone wrong.
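The scheme described above can be sketched in a few lines — a toy model of what the hardware does, using the odd/even convention from the text:

```python
def parity_bit(byte: int) -> int:
    """Return 1 if the byte contains an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2

data = 0b00110100                # the ASCII byte for "4": three 1 bits
stored_parity = parity_bit(data) # 1, since three is odd

# Simulate a single-bit flip in memory.
corrupted = data ^ (1 << 4)

# Recomputing the parity over the corrupted byte no longer matches
# the stored parity bit, so the error is detected.
assert parity_bit(corrupted) != stored_parity
```

Real hardware computes this with a tree of XOR gates rather than counting bits, but the logic is the same: any single flip changes the count of 1s by exactly one, which always changes its odd/even status.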

2. The Advantages of Parity

Another great advantage of the parity bit is that it is very fast to compute: a single pass over the data — O(N) time — is enough to produce the check result.

The idea of a check code is used in many places.

For example, when you download software, you will often see, alongside the package itself, a checksum file such as an MD5 hash or a cyclic redundancy check (CRC) value. After downloading, you can compute the checksum of the package yourself and compare it against the officially provided one to see whether they match.

If they don't match, you should not casually install the software. The package may simply be corrupted — but there is a more dangerous possibility: the package you downloaded may have been tampered with and had a backdoor planted in it. Install it, and your computer's security is no longer guaranteed.
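The download check described above is easy to do yourself with Python's standard `hashlib` module. Here is a minimal sketch; the file name `package.bin` and its contents are placeholders for a real downloaded package:

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for a downloaded package.
with open("package.bin", "wb") as f:
    f.write(b"hello")

digest = md5_of_file("package.bin")
# The digest published by the vendor (here, the well-known MD5 of b"hello").
published = "5d41402abc4b2a76b9719d911017c592"
print("ok" if digest == published else "checksum mismatch -- do not install!")
```

Note that MD5 today only protects against accidental corruption; against a deliberate attacker you would want a cryptographic hash such as SHA-256 plus a signature, but the compare-the-checksum workflow is the same.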

3. The Drawbacks of Parity

However, parity has two relatively large drawbacks.

The first drawback is that parity can only handle a single-bit error — or, more precisely, an odd number of bit errors. If two bits are flipped, the parity computed over the byte does not actually change, so the parity bit naturally cannot detect the error.
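The two-bit blind spot is easy to demonstrate with the same toy parity function as before:

```python
def parity_bit(byte: int) -> int:
    """Return 1 if the byte contains an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2

data = 0b00110100
stored = parity_bit(data)

# Flip TWO bits at once (bits 1 and 4).
double_flipped = data ^ 0b00010010

assert double_flipped != data                # the data is corrupted...
assert parity_bit(double_flipped) == stored  # ...yet the parity still matches
```

Each flip toggles the odd/even status of the 1-count, so two flips cancel out and the check passes even though the data is wrong.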

The second drawback is that parity can only detect errors, not correct them. So even when we discover an error in the in-memory data, all we can do is stop the program; we cannot let it continue running normally. If this is just a personal computer running some casual application, that doesn't matter much.

But think about it: suppose you are running a complex computation on a server — one that has been running for a week or even a month and is just two or three days from finishing. At that point, a memory error that forces you to start over from scratch will probably break your heart.

(Figure: the drawbacks of the parity scheme)

So we need a solution better than a simple check code: one that can detect multi-bit errors and also correct them. That is exactly the problem ECC memory was invented to solve.

A strategy that not only catches errors but also corrects them is what we usually call an error-correcting code (Error-Correcting Code). There is also an upgraded version called an erasure code (Erasure Code), which can not only correct errors but also recover data outright when the errors cannot be corrected. ECC memory, network transmission, and even hard disk RAID all benefit from techniques related to error-correcting codes and erasure codes.

If you want to see how, through algorithms and hardware configuration, we can detect not only single-bit errors but also multi-bit errors — and correct them — be sure to keep up with the next lesson.

IV. Summary and Extension

Well, let's summarize today's content together.

I introduced a bug I personally experienced that was caused by a hardware error. Because we had not adopted ECC memory, our data processing suffered a large number of data errors from single-bit flips. Hardware errors like these have no real solution at the software level. Had we not been familiar with the hardware and how it works, a solution would likely still be nowhere in sight. If you understand the principles of computer organization, and can recognize the need for data validation and error correction at the hardware storage layer, you will be able to pinpoint such problems within a limited time.

I also briefly introduced parity — how a single redundant bit can detect bit errors at the hardware level. However, parity, like other simple check codes, can only detect errors; it has no way to correct them. So in the next lesson, we will look at how error-correcting codes solve this problem.

Origin www.cnblogs.com/luoahong/p/11498124.html