System architecture design notes (92): reliability and fault models

1 Sources and manifestations of faults

A few concepts first.
(1) Failure: a physical change in the hardware.
(2) Fault: an erroneous state of hardware or software caused by component failure, physical interference from the environment, operator mistakes, or incorrect design.
(3) Error: the manifestation of a fault in a program or data structure. The error may appear some distance away from the location of the fault.

Failures, faults, and errors can manifest in the following ways.
Permanent: describes a failure, fault, or error that is continuous and stable. In hardware, a permanent failure reflects an irreversible physical change.
Intermittent: describes a fault or error that occurs only occasionally, caused by unstable hardware or by varying hardware or software states.
Transient: describes a fault or error caused by temporary environmental conditions.

A fault may be caused by a physical failure, inadequate system design, environmental influences, or the system's operator. Permanent failures lead to permanent faults. Intermittent faults can be caused by unstable or marginally stable hardware, or by incorrect design. Environmental conditions can cause transient faults. All of these faults can cause errors, and incorrect design and operator mistakes can also cause errors directly. Faults caused by the physical condition of the hardware, by incorrect hardware or software design, or by unstable but recurring environmental conditions may be detectable and can be repaired by replacement or redesign. Faults caused by temporary environmental conditions, however, cannot be repaired, because the hardware itself is not actually damaged. Transient and intermittent faults have become a major source of errors in systems.

2 Several commonly used fault models

The manifestations of faults vary greatly, and fault models are used to abstract over them. Fault models can be built at every level of a system. Generally speaking, the lower the level at which a fault model is built, the lower the cost of handling the faults, but the fewer faults the model covers. If a model at one level cannot cover certain fault manifestations, they can be captured by a higher-level model. Several commonly used fault models are introduced below.

2.1 Logic-level fault model

Stuck-at (fixed) fault: the logic value of an input or output line of a component in the circuit is fixed at 0 or fixed at 1. A stuck-at fault may be caused by a line shorted to ground, a line shorted to the power supply, or a component failure.

Short-circuit fault: the logic value on a component's output line is always equal to the logic value on its input line.

Open-circuit fault: the component's output line is floating; its logic value is determined by the specific circuit.

Bridging fault: two lines that should not be connected are shorted together.
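To make the logic-level models concrete, here is a minimal fault-injection sketch. The circuit y = (a AND b) OR c, the line names (a, b, c, n1, y), and the helper names are all hypothetical and chosen only for illustration. The bridging fault is modeled as wired-AND, which is one common assumption; real bridging behavior depends on the circuit technology. Only the fixed (stuck-at) and bridging faults are modeled here.

```python
from itertools import product

def circuit(a, b, c, stuck=None, bridge_bc=False):
    """Evaluate y = (a AND b) OR c with an optional injected fault.

    stuck:     (line, value) pair, e.g. ("n1", 0), forcing that line to value.
    bridge_bc: if True, lines b and c are shorted together.
    """
    def line(name, value):
        # A stuck-at fault overrides whatever the line would normally carry.
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    if bridge_bc:
        b = c = b & c          # assumption: bridged lines behave as wired-AND
    a, b, c = line("a", a), line("b", b), line("c", c)
    n1 = line("n1", a & b)     # internal line: output of the AND gate
    return line("y", n1 | c)   # output of the OR gate

def detecting_inputs(**fault):
    """Return the input vectors on which the faulty circuit differs from the good one."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, **fault)]

# Which (a, b, c) vectors expose each fault at the output?
print(detecting_inputs(stuck=("n1", 0)))
print(detecting_inputs(stuck=("y", 1)))
print(detecting_inputs(bridge_bc=True))
```

Comparing the faulty circuit with the fault-free one over all inputs shows that each fault is visible only on some input vectors, which is why a fault can exist without immediately producing an error.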

2.2 Data-structure-level fault model

The manifestation of a fault in a data structure is called an error. Common errors are as follows.

Independent error: the fault changes a single binary bit.

Arithmetic error: the fault increases or decreases the value of a data item by 2^i (i = 0, 1, 2, …).

One-way (unidirectional) error: the fault changes several bits of a binary vector in the same direction (all to 0, or all to 1).
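The three error types can be sketched as small bit-level operations on integers. The function names below are hypothetical and the words are treated as plain unsigned integers, which is an assumption made for illustration only.

```python
def independent_error(word, bit):
    """Independent error: flip exactly one bit of word."""
    return word ^ (1 << bit)

def arithmetic_error(value, i, sign=+1):
    """Arithmetic error: shift the numeric value up or down by 2**i."""
    return value + sign * (1 << i)

def unidirectional_error(word, mask, to_one):
    """One-way error: force every bit selected by mask to 1, or every one to 0."""
    return word | mask if to_one else word & ~mask

def changed_bits(before, after):
    """Count how many bit positions differ between two words."""
    return bin(before ^ after).count("1")

# An arithmetic error of +2**0 on 0b0111 yields 0b1000: the value moved
# by only 1, but four bit positions changed because of carry propagation.
corrupted = arithmetic_error(0b0111, 0)
print(bin(corrupted), changed_bits(0b0111, corrupted))
```

The last example shows why arithmetic errors are modeled separately from independent errors: a small change in value can alter many bits at once.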

2.3 Software faults and software errors

A software fault is an inconsistency with the design specification that is introduced during the software design process. The manifestation of a software fault in a data structure or in program output is called a software error. Unlike hardware, software does not fatigue under environmental stress, nor does it age over time. Software faults are therefore related only to design.

Common software errors are as follows.
Illegal transfer: the program executes a control transfer that does not exist in the specification.
Mistransfer: the program executes a control transfer that exists in the specification but should not be taken given the current control data.
Infinite loop: the program's execution time exceeds the specified limit.
Space overflow: the space used by the program exceeds the specified limit.
Data execution: the instruction counter points to a data location.
Unreasonable data: the data output by the program is not plausible.
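As a rough illustration of how the first three error types might be detected, the sketch below checks a recorded execution trace against a specification of allowed transfers. The specification table SPEC, the block names, and the use of a step count as a stand-in for execution time are all assumptions made for this example; they do not come from any particular system.

```python
# Hypothetical specification: each allowed transfer (source block, target block)
# maps to the condition on the control data under which it may be taken.
SPEC = {
    ("read", "process"): lambda d: d["valid"],
    ("read", "reject"): lambda d: not d["valid"],
    ("process", "done"): lambda d: True,
}

def check_transfer(src, dst, data):
    """Classify one control transfer against the specification."""
    guard = SPEC.get((src, dst))
    if guard is None:
        return "illegal transfer"   # the transfer is not in the specification
    if not guard(data):
        return "mistransfer"        # specified, but wrong for this control data
    return "ok"

def check_trace(trace, data, max_steps):
    """Check a sequence of visited blocks; return the first error found, else 'ok'."""
    if len(trace) - 1 > max_steps:
        return "infinite loop"      # assumption: step count stands in for time
    for src, dst in zip(trace, trace[1:]):
        result = check_transfer(src, dst, data)
        if result != "ok":
            return result
    return "ok"

print(check_trace(["read", "process", "done"], {"valid": True}, max_steps=10))
print(check_trace(["read", "process", "done"], {"valid": False}, max_steps=10))
print(check_trace(["read", "done"], {"valid": True}, max_steps=10))
```

The same trace can be correct or erroneous depending on the control data, which is what distinguishes a mistransfer from an illegal transfer.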

2.4 System-level fault model

At the system level, a fault manifests as a functional error: the system's output is inconsistent with the system design specification. If there is no fault protection mechanism on the system output, this system-level manifestation causes the system to fail.
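The role of an output-side protection mechanism can be sketched as a wrapper that checks each output against the specification before releasing it. Everything here is hypothetical: the specification (outputs must lie in the range 0 to 100), the faulty component, and the choice of substituting a safe value are assumptions for illustration, and a real system might instead retry, switch to a spare, or shut down.

```python
def within_spec(output, low=0.0, high=100.0):
    """Hypothetical specification: valid outputs lie in [low, high]."""
    return low <= output <= high

def protected(system, safe_value, check=within_spec):
    """Wrap system so that out-of-spec outputs are replaced by safe_value."""
    def wrapper(*args):
        out = system(*args)
        return out if check(out) else safe_value
    return wrapper

def faulty_scaler(raw):
    # Hypothetical faulty component: scales by 10 where the design says 1.
    return raw * 10.0

guarded = protected(faulty_scaler, safe_value=0.0)

# Unprotected: the functional error reaches the system output.
print(faulty_scaler(50.0), within_spec(faulty_scaler(50.0)))
# Protected: the out-of-spec output is caught and replaced.
print(guarded(50.0))
```

Note that the check only catches outputs that violate the specification; a wrong output that still lies within the allowed range passes through, so the wrapper limits failures rather than removing the fault.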



Origin blog.csdn.net/deniro_li/article/details/108900847