System architecture design notes (93): Fault-tolerant technology

Fault-tolerant technology ensures that a system can continue to work normally even when some of its components fail. Depending on how the system is configured, different fault-tolerant technologies are usually adopted: single-machine fault tolerance, dual-system hot backup, and server clustering.

1 Single-machine fault-tolerance technology

System failures fall into two categories. The first is fatal and cannot be repaired by the system itself, for example, damage to all of the system's main components. The second is partial and potentially repairable, such as the failure of individual components, line faults, or errors caused by accidental interference.

Fault-tolerant technology is used to build systems that automatically eliminate non-fatal faults, that is, fault-tolerant systems. In single-machine fault tolerance, the main methods for improving system reliability are self-checking and redundancy. Fault tolerance takes many forms, such as hardware fault tolerance, software fault tolerance, and whole-machine fault tolerance.

1.1 Self-checking technology

Self-checking means that when a non-fatal fault occurs, the system can automatically detect the fault, determine its nature and location, and take measures to replace or isolate the faulty component. Self-checking relies on diagnostic techniques, which are usually implemented by dedicated programs and therefore fall within the scope of program design. Realizing a fault-tolerant system requires that the system have duplicate or backup components, or more than one channel for completing a given function, so self-checking is usually used together with redundancy. Computer fault-tolerant systems generally require self-checking technology.

1.2 Redundancy technology

Redundancy can be divided into hardware redundancy (adding hardware), software redundancy (adding programs, such as running different algorithms, or programs written by different people, for the same task), time redundancy (such as repeated execution of instructions or programs), and information redundancy (such as adding data bits).

The two most commonly used redundancy methods are duplicate lines and backup lines. A duplicate line connects several components of the same type and specification in parallel so that they act as a single component; as long as any one of them has not failed, the system works normally. When components work in parallel, their failure probabilities are mutually independent. A backup line differs in that the backup components are not initially connected to the system: only when a working component fails are the inputs and outputs switched over to a backup component and the faulty component cut off. Fault-tolerant technology is widely used, especially in systems with high reliability requirements and in critical parts where failure would endanger personal safety. Most such parts adopt double redundancy; some adopt triple, quadruple, or even quintuple redundancy.
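Because parallel components fail independently, the reliability of a duplicate line follows directly from the probability that all of its components fail at once. A minimal sketch of this calculation (the function name and figures are illustrative, not from the original text):

```python
# Reliability of a duplicated (parallel) line: the system works as long as
# at least one component works, so R = 1 - product of each component's
# failure probability (1 - r_i), assuming independent failures.

def parallel_reliability(reliabilities):
    """Combined reliability of components wired in parallel."""
    failure = 1.0
    for r in reliabilities:
        failure *= (1.0 - r)      # the system fails only if ALL components fail
    return 1.0 - failure

# Double redundancy with 99%-reliable components: 1 - 0.01^2
print(round(parallel_reliability([0.99, 0.99]), 6))      # 0.9999
# Triple redundancy with 90%-reliable components: 1 - 0.1^3
print(round(parallel_reliability([0.9, 0.9, 0.9]), 6))   # 0.999
```

This shows why even modest per-component reliability yields a highly reliable system under double or triple redundancy.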

Modern large-scale complex systems are often fault-tolerant systems. Fault-tolerant technology appeared earliest, and is most widely applied, in computers.

2 Dual-system hot backup technology

Dual-system hot backup is a highly fault-tolerant scheme combining software and hardware. The scheme consists of two server systems, an external shared disk array cabinet, and the corresponding dual-system hot backup software. The external shared disk array cabinet can be omitted if RAID (Redundant Array of Independent Disks) cards are used in each server instead.

In this fault-tolerant solution, the operating system and applications are installed on the local system disks of both servers, while the data of the entire network system is centrally managed and backed up through the disk array. Centralized data management means that the data of all sites is read and stored directly on the central storage device through the dual-system hot backup system and is managed by professionals, which greatly improves the security and confidentiality of the data. User data resides on the external shared disk array. When one server fails, the standby machine takes over for the host, ensuring uninterrupted network service.

The dual-system hot backup system uses a "heartbeat" to maintain the connection between the primary and standby systems. A "heartbeat" means that the primary and standby systems send each other communication signals at fixed intervals, indicating their current operating status. Once the heartbeat signal indicates that the primary system has failed, or the standby system stops receiving the primary's heartbeat, the system's high-availability management software concludes that the primary has failed and immediately transfers system resources to the standby, which replaces the primary so that the system keeps operating and network services are not interrupted.
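The heartbeat idea above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual high-availability software; the class name, timeout value, and method names are invented for the example:

```python
import time

# Hypothetical sketch of heartbeat-based failure detection: the primary
# periodically reports a heartbeat; the standby declares the primary failed
# when no heartbeat has arrived within the timeout window.

class HeartbeatMonitor:
    def __init__(self, timeout=3.0):
        self.timeout = timeout                  # seconds without a beat before failover
        self.last_beat = time.monotonic()

    def beat(self):
        """Record a heartbeat received from the primary system."""
        self.last_beat = time.monotonic()

    def host_failed(self):
        """True when the primary's heartbeat has been silent past the timeout."""
        return time.monotonic() - self.last_beat > self.timeout

monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat()
print(monitor.host_failed())   # False immediately after a heartbeat
```

In a real deployment the heartbeat travels over a dedicated link, and the standby must also guard against mistaking a broken heartbeat link for a dead primary ("split brain"), a subtlety this sketch omits.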

In a dual-system hot backup solution, there are three different working modes depending on how the two servers operate: hot standby mode, mutual backup mode, and duplex mode.

Hot standby mode is now usually called active/standby mode: the active server is in the working state while the standby server is in a monitoring-ready state. Server data, including database data, is written to two or more servers at the same time (usually each server uses a RAID disk array card) to keep the data synchronized in real time. When the active server fails, the standby machine is activated by software diagnosis or manually, ensuring that the application is fully restored to normal use within a short time. Typical applications are securities fund servers and market quotation servers. This mode is widely used, but because one server stays in the backup state for long periods, some computing resources are wasted.

Users can decide whether to adopt dual-system hot backup based on the importance of the system and end users' tolerance for service interruption: the longest outage users on the network can tolerate, and the consequences if service cannot be restored quickly, are the basis for the decision. For servers carrying key enterprise applications, which demand extremely high stability and availability and must provide uninterrupted 24 (hours) × 7 (days) service, dual-system hot backup is recommended.

In mutual backup mode, two relatively independent applications run at the same time on two machines, each of which is also configured as the backup for the other. When one server fails, the other takes over its application within a short time, ensuring application continuity; this places relatively high performance demands on each server.

Duplex mode is a form of clustering: both servers are active and run the same application at the same time, which both guarantees the performance of the overall system and provides load balancing and mutual backup. Disk-cabinet storage is usually used. Web servers and FTP servers often adopt this mode.

3 Server cluster technology

Cluster technology combines a group of independent servers on the network into a single system, which is also managed as a single system and provides highly reliable services to client workstations. In most cases, all computers in the cluster share a common name, and a service running on any system in the cluster can be used by all network clients. The cluster must be able to coordinate and manage errors and failures of its separate components, and allow components to be added to the cluster transparently.

A cluster contains multiple (at least two) servers that share data storage space. When any server runs an application, the application data is stored in the shared space, while each server's operating system and application files are kept on its own local storage.

The node servers in a cluster communicate with each other over an internal LAN. When a node server fails, the applications running on it are automatically taken over by another node; when an application service fails, it is restarted or taken over by another server. In either case, clients can reconnect to another application server quickly.
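The takeover behavior described above can be sketched as a small placement table that reassigns applications when a node dies. This is a hypothetical illustration (class, node names, and the "first surviving node" policy are invented for the example):

```python
# Hypothetical sketch of cluster failover: the cluster tracks which node
# runs each application; when a node fails, its applications are moved to
# a surviving node so clients can reconnect.

class Cluster:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.placement = {}                # app name -> node currently running it

    def run(self, app, node):
        self.placement[app] = node

    def node_failed(self, failed):
        """Reassign every application on the failed node to a survivor."""
        self.nodes.discard(failed)
        if not self.nodes:
            raise RuntimeError("cluster has no surviving nodes")
        survivor = sorted(self.nodes)[0]   # simplest possible takeover policy
        for app, node in self.placement.items():
            if node == failed:
                self.placement[app] = survivor

c = Cluster(["node1", "node2"])
c.run("db", "node1")
c.run("web", "node2")
c.node_failed("node1")
print(c.placement)   # {'db': 'node2', 'web': 'node2'}
```

Real cluster software adds much more: fencing the failed node, replaying application state from the shared disk array, and balancing the reassigned load across several survivors rather than piling it onto one.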



Origin: blog.csdn.net/deniro_li/article/details/108900856