Advanced skills in system architecture design · Software reliability analysis and design

Table of Contents of Series Articles

Advanced skills in system architecture design · Software architecture concepts, architectural styles, ABSD, architecture reuse, DSSA (1) [System Architect]
Advanced skills in system architecture design · System quality attributes and architecture evaluation (2) [System Architect]
Advanced skills in system architecture design · Software reliability analysis and design (3) [System Architecture Designer]

Everything we do now is weaving wings for future dreams, so that those dreams can spread their wings and soar in reality.

1. Basic concepts of software reliability★

System reliability is the ability of a system to complete its specified functions within a specified time and under specified environmental conditions, that is, the probability that the system operates without failure. In other words, it is the basic ability of a software system to maintain its functional characteristics in the face of application or system errors and accidental or incorrect use.
System availability is the probability that the system can perform as required at a given point in time, that is, the proportion of time during which the system operates normally.

Software reliability ≠ hardware reliability. The differences are:

  • Complexity: software complexity is higher than that of hardware, and most failures come from software.
  • Physical degradation: hardware failures are mainly caused by physical degradation, whereas software does not degrade physically.
  • Uniqueness: software is unique; every copy is identical, while no two pieces of hardware can be exactly the same.
  • Version update cycle: hardware is updated slowly, software quickly.

Quantitative description of software reliability:
Software reliability can be described quantitatively in terms of variables such as usage conditions, a specified time period, system input/output, and system usage.
(1) Specified time: natural time, running time, or execution time; execution time is the best way to measure software reliability.
(2) Failure probability: the probability that the software, having started running, fails by some time t; it is a random function of t, called the failure probability.
(3) Reliability: the probability that the software does not fail under specified conditions and within a specified time.
(4) Failure intensity: the probability of software failure per unit time.
(5) Failure rate: also called the hazard function or conditional failure intensity; the probability per unit time that the software fails, given that the running system has not failed so far.
(6) Mean time to failure (MTTF): the average time until the next failure after the software starts running. It reflects software reliability more intuitively.
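
These quantities are connected by the standard reliability relationships, written here in the usual notation; the constant-failure-rate case at the end is only an illustrative assumption, not something stated in the source:

```latex
% F(t): failure probability by time t,  f(t) = F'(t): failure density.
R(t) = 1 - F(t), \qquad
\lambda(t) = \frac{f(t)}{R(t)}, \qquad
\mathrm{MTTF} = \int_0^{\infty} R(t)\, dt .
% Illustrative special case: a constant failure rate \lambda gives
R(t) = e^{-\lambda t}, \qquad \mathrm{MTTF} = \frac{1}{\lambda}.
```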

Reliability goals:
A software reliability goal expresses the user's expectation of how satisfactorily the software will perform. It can be described by reliability, mean time to failure, and failure intensity.

The significance of software reliability testing :
(1) Software failure may cause catastrophic consequences.
(2) Software failures account for a relatively high proportion of failures in the entire computer system.
(3) Compared with hardware reliability technology, software reliability technology is immature.
(4) Software reliability problems will cause software costs to increase.
(5) The system is highly dependent on software and has an increasing impact on production activities and social life.

The purpose of software reliability testing:
(1) Discover defects in the software system (in requirements analysis, software design, system coding, and test implementation).
(2) Provide a reliability basis for the use and maintenance of the software.
(3) Confirm whether the software meets its quantitative reliability requirements.

In the broad sense, software reliability testing applies a series of methods such as modeling, statistical testing, analysis, and evaluation to a software system in order to ultimately evaluate its reliability.

In the narrow sense, software reliability testing refers to testing performed on software in its expected usage environment, according to predetermined test cases, in order to obtain reliability data.

2. Software reliability modeling★

Software reliability model refers to the reliability block diagram and mathematical model established to predict or estimate the reliability of software .

A software reliability model usually (but not exclusively) consists of the following parts :

  • Model Assumptions
    A model is a simplification or standardization of the actual situation and always contains a number of assumptions, such as the selection of tests to represent the actual operating profile, and the independent occurrence of different software failures.

  • Performance Measurement
    The output of the software reliability model is performance measurement, such as failure intensity, number of remaining defects, etc. Performance measures are usually given as mathematical expressions in software reliability models.

  • Parameter estimation method
    The actual value of some reliability measures cannot be obtained directly, such as the number of residual defects. In this case, a certain method needs to be used to estimate the parameter value, thereby indirectly determining the value of the reliability measure.

  • Data Requirements
    A software reliability model requires certain input data, namely software reliability data.

Most models contain three common assumptions :

  • Representativeness assumption
    This assumption holds that the reliability data generated by testing can be used to predict the software's reliability behavior during the operational phase.

  • Independence Assumption
    This assumption holds that software failures occur independently at different times, and the occurrence of one software failure does not affect the occurrence of another software failure.

  • Identity assumption
    This assumption holds that the consequences (levels) of all software failures are the same, that is, the modeling process only considers the specific moment of software failure and does not distinguish the severity level of software failure.

Software reliability (model classification) modeling methods include :

  • The seed method model
    uses capture-recapture sampling to estimate the number of errors in a program: a number of known error "seeds" are deliberately planted in the program in advance, and the remaining indigenous errors are then estimated from the number of indigenous errors found during testing and the proportion of seeded errors found (a numeric sketch follows this list).
  • Failure rate model
    is used to study the failure rate of programs
  • Curve fitting model
    uses regression analysis to study software complexity, number of defects in the program, failure rate, and failure interval time.
  • Reliability growth model
    This type of model predicts the reliability improvement of software during the error detection process, and uses a growth function to describe the software improvement process.
  • The program structure analysis model
    forms a reliability analysis network based on programs, subroutines and their mutual calling relationships.
  • The input domain classification model
    selects certain sample "points" in the software input domain to run the program, and infers the reliability of the software based on the success/failure rate of the test run based on the usage probability of these sample points in the "actual" usage environment.
  • Execution path analysis method model
    The analysis method is similar to the above model. It first calculates the execution probability of each logical path of the program and the execution probability of the wrong path in the program, and then synthesizes the reliability of the software.
  • The non-homogeneous Poisson process model
    uses the number of failures per unit time in the software testing process as an independent Poisson random variable to predict the cumulative number of failures at a certain time point in the future use of the software.
  • Markov process model
  • The Bayesian model
    uses the prior distribution of the failure rate together with current test failure information to evaluate software reliability.
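
As a concrete illustration of the seed method mentioned above, here is a minimal sketch of the capture-recapture estimate; the function name and the numbers are made up for illustration only.

```python
def estimate_remaining_errors(seeded_total, seeded_found, indigenous_found):
    """Capture-recapture (seeding) estimate of errors remaining in a program.

    seeded_total:     number of artificial error "seeds" planted before testing
    seeded_found:     how many of those seeds the tests uncovered
    indigenous_found: how many original (non-seeded) errors the tests uncovered
    """
    if seeded_found == 0:
        raise ValueError("no seeds found yet - estimate is undefined")
    # If tests found the same fraction of real errors as of seeded errors,
    # the estimated total number of indigenous errors is:
    estimated_total = indigenous_found * seeded_total / seeded_found
    return estimated_total - indigenous_found  # estimated errors still remaining


# Hypothetical numbers: 20 seeds planted, 15 of them found, 30 original errors found.
print(estimate_remaining_errors(seeded_total=20, seeded_found=15, indigenous_found=30))
# -> 10.0 estimated indigenous errors remaining
```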

3. Software reliability management★

The various stages of software reliability management, as shown in the figure:

Figure 3_1 Software reliability management stage

4. Software Reliability Analysis★★★

4.1 Reliability indicators

  • Mean time to failure, MTTF = 1/λ, where λ is the failure rate
  • Mean time to repair, MTTR = 1/μ, where μ is the repair rate
  • Mean time between failures, MTBF = MTTF + MTTR
  • System availability = MTTF / MTBF = MTTF / (MTTF + MTTR) × 100%
Figure 4_1 Reliability index
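
A minimal numeric sketch of these indicators; the failure and repair rates below are assumed values, not taken from the original figure:

```python
# Assumed example rates: lambda_ = failures per hour, mu = repairs per hour.
lambda_ = 0.001   # one failure every 1000 hours on average
mu = 0.5          # repairs take 2 hours on average

mttf = 1 / lambda_            # mean time to failure
mttr = 1 / mu                 # mean time to repair
mtbf = mttf + mttr            # mean time between failures

availability = mttf / mtbf    # = MTTF / (MTTF + MTTR)
print(f"MTTF={mttf:.1f} h, MTTR={mttr:.1f} h, MTBF={mtbf:.1f} h, "
      f"availability={availability:.4%}")
# -> MTTF=1000.0 h, MTTR=2.0 h, MTBF=1002.0 h, availability=99.8004%
```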

4.2 Series system (reliability)


Figure 4_2 Series system (reliability)

4.3 Parallel systems (reliability)


Figure 4_3 Parallel system (reliability)

4.4 Hybrid systems (reliability)


Figure 4_4 Hybrid system (reliability)
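
Since the figures above are not reproduced here, the standard formulas they illustrate are: for a series system the overall reliability is the product of the component reliabilities, R = R1 × R2 × ... × Rn; for a parallel system it is R = 1 − (1 − R1)(1 − R2)...(1 − Rn); a hybrid system is evaluated by combining the two. A minimal sketch with assumed component reliabilities:

```python
from math import prod

def series(reliabilities):
    """Series system: all components must work, R = R1 * R2 * ... * Rn."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Parallel system: it fails only if every component fails,
    R = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1 - prod(1 - r for r in reliabilities)

# Assumed component reliabilities (not from the original figures).
r1, r2, r3, r4 = 0.9, 0.9, 0.8, 0.8

print(series([r1, r2]))                      # 0.81
print(parallel([r3, r4]))                    # 0.96
# Hybrid example: r1 in series with the parallel pair (r3, r4).
print(series([r1, parallel([r3, r4])]))      # 0.864
```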

5. Software reliability design★★★★

5.1 Main factors affecting software reliability

From a technical perspective, factors that affect software reliability include: operating environment, software scale, software internal structure, software development methods and development environment, and software reliability investment .


Figure 5_1 Main factors affecting software reliability

5.2 Software reliability design technology


Figure 5_2 Software reliability design

5.2.1 Fault-tolerant design technology

5.2.1.1 Redundant design - fault-tolerant design technology

In addition to the complete software system, design a module or a whole system with a different path, a different algorithm, or a different implementation as a backup, so that the redundant part can take over when a failure occurs.

Both N-version programming and the recovery block method are based on the idea of design redundancy.

5.2.1.2 N-version programming - fault-tolerant design technology

By designing multiple modules or multiple versions, the results produced from the same initial conditions and the same inputs are put to a majority vote, preventing the failure of any one module/version from delivering an incorrect service.

N-version programming is a static fault-masking technique that adopts a forward recovery strategy.

Figure 5_3 N version programming

  • Compared with the usual software development process, N-version programming adds three new stages: dissimilar component specification review, dissimilarity confirmation, and back-to-back testing.

  • Issues to consider include synchronization of the N versions, communication between the N versions, voting algorithms (congruent voting, inexact voting, Cosmetie voting), the consistent comparison problem, and data dissimilarity.
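
A minimal sketch of majority voting over N independently developed versions; the version functions here are hypothetical stand-ins, not part of the source:

```python
from collections import Counter

def n_version_execute(versions, inputs):
    """Run N independently developed versions on the same inputs and
    return the majority result; raise if no majority exists."""
    results = [version(*inputs) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes > len(versions) // 2:
        return value
    raise RuntimeError("no majority - cannot mask the failure")

# Hypothetical versions of the same computation (one of them is faulty).
def version_a(x, y): return x + y
def version_b(x, y): return x + y
def version_c(x, y): return x - y          # simulated faulty version

print(n_version_execute([version_a, version_b, version_c], (2, 3)))  # -> 5
```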

5.2.1.3 Recovery block method - fault-tolerant design technology

Select a set of operations as a fault-tolerance design unit, turning an ordinary program block into a recovery block.

The recovery block method is a dynamic fault-masking technique that adopts a backward recovery strategy.


Figure 5_4 Recovery block method

  • The design should ensure independence between the primary block and the alternate (backup) blocks, avoid correlated errors, and minimize errors common to the primary and alternate blocks.

  • The correctness of the acceptance (verification) test must be ensured.
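
A minimal sketch of the recovery block structure, with a checkpoint for backward recovery; the primary/alternate implementations and the acceptance test are hypothetical examples:

```python
import copy

def recovery_block(state, blocks, acceptance_test):
    """Try the primary block first; if its result fails the acceptance test,
    roll back to the checkpointed state and try the next alternate block."""
    checkpoint = copy.deepcopy(state)          # establish a recovery point
    for block in blocks:                       # blocks[0] = primary, rest = alternates
        try:
            result = block(copy.deepcopy(checkpoint))
            if acceptance_test(result):
                return result                  # passed - accept this block's result
        except Exception:
            pass                               # a crashing block counts as a failed test
    raise RuntimeError("all blocks failed the acceptance test")

# Hypothetical example: sort a list; the primary block is deliberately broken.
primary   = lambda data: data                  # faulty: returns the data unsorted
alternate = lambda data: sorted(data)
is_sorted = lambda data: all(a <= b for a, b in zip(data, data[1:]))

print(recovery_block([3, 1, 2], [primary, alternate], is_sorted))  # -> [1, 2, 3]
```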

Comparison of N-version programming and the recovery block method

Comparison item          N-version programming     Recovery block method
Hardware environment     Multiple machines         Single machine
Error detection method   Voting                    Acceptance (verification) test
Recovery strategy        Forward recovery          Backward recovery
Real-time performance    Good                      Poor

  • Forward recovery: continue the current computation and restore the system to a coherent, correct state, compensating for the incoherence of the current state.
  • Backward recovery: roll the system back to the previous correct state and continue execution from there.
5.2.1.4 Defensive programming - fault-tolerant design technology

N-version programming and the recovery block method are both based on the idea of design redundancy, which adds considerable work for both programmers and processors, and their structures bring problems and difficulties of their own, for example the correlated-error problem in multi-version programming and the design of the acceptance test in the recovery block approach.

Defensive programming is a method that achieves software fault tolerance without using any traditional fault-tolerance technique. Its basic idea is to include error-checking code and error-recovery code in the program to deal with errors and inconsistencies, so that once an error occurs the program can undo the erroneous state and return to a known correct state.

The implementation strategy covers three aspects: error detection, damage estimation, and error recovery.
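
A minimal sketch of the defensive-programming idea organized around those three aspects; the account-transfer scenario and all names are hypothetical:

```python
def transfer(accounts, src, dst, amount):
    """Defensive transfer: check inputs, keep enough state to assess damage,
    and restore a known correct state if anything goes wrong."""
    # Error detection: validate inputs before touching any state.
    if src not in accounts or dst not in accounts:
        raise KeyError("unknown account")
    if amount <= 0 or accounts[src] < amount:
        raise ValueError("invalid or uncovered amount")

    snapshot = dict(accounts)                  # state needed for error recovery
    try:
        accounts[src] -= amount
        accounts[dst] += amount
        # Error detection again: check an invariant (total money is unchanged).
        if sum(accounts.values()) != sum(snapshot.values()):
            raise RuntimeError("invariant violated")
    except Exception:
        # Damage estimation is trivial here (only `accounts` was touched),
        # so error recovery is a rollback to the known correct state.
        accounts.clear()
        accounts.update(snapshot)
        raise

accounts = {"a": 100, "b": 50}
transfer(accounts, "a", "b", 30)
print(accounts)   # -> {'a': 70, 'b': 80}
```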

5.2.2 Error detection design technology

  • Error detection costs less than fault-tolerance and redundancy techniques, but it cannot resolve faults automatically and requires manual intervention.

  • Error detection technology focuses on considering four elements: detection object, detection delay, implementation method, and processing method.

5.2.3 Complexity reduction design technology

The design idea of complexity reduction: on the basis of guaranteeing the software's functions, simplify the software structure, shorten the length of the program code, optimize the software data flow, and reduce software complexity, thereby improving software reliability.

5.2.4 System configuration technology

System configuration technology can be divided into dual-machine fault-tolerance technology and server cluster technology .

5.2.4.1 Dual-machine fault-tolerance technology - system configuration technology

Dual-machine fault tolerance is a fault-tolerant application solution that combines software and hardware. The solution consists of two servers, an external shared disk array, and the corresponding dual-machine software. In a dual-machine fault-tolerant system the two servers are generally designated as a primary system and a standby system, and they act as master and backup for each other. Each server has its own system disk (local disk) on which the operating system and applications are installed, and each server is equipped with at least two network cards: one connects to the network to provide external services, while the other connects to the peer server to monitor its working status. At the same time, both servers are connected to the shared disk array, where the user data is stored. When one server fails, the other server takes over its work, keeping network services uninterrupted. Because the data of the whole system is centrally managed in the disk array, data security and confidentiality are well protected.

The dual-machine fault-tolerant system can have three different working modes according to the different working methods of the two servers, namely dual-machine hot standby mode, dual-machine mutual backup mode and dual-machine duplex mode .

A heartbeat mechanism is used to maintain the link between the primary system and the standby system (a minimal heartbeat sketch follows the mode list below).

  • Dual-machine hot standby mode - (one working, one on standby)
    Under normal circumstances one server is in the working state (primary system) and the other is in the monitoring/standby state (standby system). If a shared disk array is not used, user data is written to both servers at the same time to keep the data synchronized. When the primary system fails, the dual-machine software activates the standby system, so that the application is fully restored to normal use in a short time. After the primary system is repaired, it can be reconnected to the system and take its applications back.
    Dual-machine hot standby is the most widely used mode at present; typical applications include securities fund servers and market quotation servers.
    Its main disadvantage is that the standby system stays in the backup state for a long time, which wastes some computing resources.

  • Dual-machine mutual backup mode - (two servers run relatively independent applications and back each other up)
    Both servers are in the working state, provide different application services to front-end clients, and monitor each other's operating status. That is, both servers run at the same time, but each is also configured as the other's standby. When one server fails, the other server can take over the failed server's applications in a short time, ensuring application continuity. The main disadvantage of this mode is the relatively high performance requirement on the servers.

  • Dual-machine duplex mode - (two servers run the same application at the same time and back each other up)
    Dual-machine duplex mode is a form of cluster technology: both servers are working and simultaneously provide the same application service to front-end clients, which guarantees the performance of the overall system and achieves load balancing and mutual backup.
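
To make the heartbeat idea above concrete, here is a minimal in-process sketch of heartbeat monitoring with failover to the standby; the class name, timings, and role names are assumptions for illustration:

```python
import time

class HeartbeatMonitor:
    """Standby-side monitor: if no heartbeat arrives within `timeout` seconds,
    declare the primary dead and take over its role."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_beat = time.monotonic()
        self.role = "standby"

    def on_heartbeat(self):
        # Called whenever a heartbeat message arrives from the primary.
        self.last_beat = time.monotonic()

    def check(self):
        # Called periodically on the standby system.
        if self.role == "standby" and time.monotonic() - self.last_beat > self.timeout:
            self.role = "primary"              # failover: take over the services
            print("primary missed heartbeats - standby taking over")
        return self.role

monitor = HeartbeatMonitor(timeout=0.2)
monitor.on_heartbeat()
time.sleep(0.3)                                # simulate the primary going silent
print(monitor.check())                         # -> primary
```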

5.2.4.2 Server cluster technology - system configuration technology

Cluster technology organizes multiple computers to work together; it is a technique for improving system availability and reliability. In a cluster system each computer undertakes part of the computing tasks and part of the fault-tolerance tasks. When one of the computers fails, the cluster software isolates it from the system, and the load-transfer mechanism among the remaining computers redistributes its load while alerting the system administrator. Through functional integration and failover, the cluster system achieves high availability and reliability.

The node servers in the cluster communicate over an internal LAN. If a node server fails, the applications running on that server are automatically taken over by another node server.

  • High-performance computing cluster
    refers to computer cluster technology with the purpose of improving scientific computing capabilities. It is an implementation method of parallel computing clusters. Parallel computing refers to a method of dividing an application program into multiple parts that can be executed in parallel and specifying them for execution on multiple processors.

  • Load balancing cluster
    A load balancing cluster distributes load among multiple nodes according to a certain strategy (algorithm). Load balancing is built on top of the existing network structure and provides a cheap and effective way to expand server bandwidth, increase throughput, and improve data processing capability. Load balancing is a kind of dynamic balancing: it uses tools to analyze data packets in real time, track the traffic status in the network, and allocate tasks reasonably.
    The more commonly used load balancing implementation techniques mainly include the following (a minimal round-robin sketch follows this list):
    1) Load balancing based on specific software (application layer): many network protocols support redirection, for example HTTP redirection. The principle is that the server returns a redirect response instead of the requested object, relocating the client to another server; the client then resends the request to the new address, thereby spreading the load.
    2) DNS-based load balancing (a transport-layer load balancing technique): multiple IP addresses are configured for the same host name on the DNS server. When answering DNS queries, the DNS server returns the IP addresses recorded for that host name in rotation, so that different clients are directed to different nodes, thereby achieving load balancing.
    3) NAT-based load balancing: an external IP address is mapped to multiple internal IP addresses; each incoming connection request is dynamically translated to the address of one internal node, so that external connection requests are spread across the nodes.
    4) Reverse proxy load balancing: a reverse proxy dynamically forwards connection requests from the Internet to multiple nodes on the internal network for processing, thereby achieving load balancing.
    5) Hybrid load balancing.

  • High Availability Cluster
    In a high availability cluster system, multiple computers work together, each running one or several services, and each defines one or more backup computers for the service. When a computer fails, the backup computer immediately takes over the application of the failed computer and continues to provide services to front-end users.
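
A minimal round-robin dispatcher, in the spirit of the DNS rotation described in the list above; the node addresses are made up for illustration:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out backend nodes in rotation, the way DNS round-robin
    returns the recorded addresses in turn."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def pick(self):
        return next(self._nodes)

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # assumed node IPs
for _ in range(5):
    print(balancer.pick())
# -> 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.1, 10.0.0.2
```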

6. Software reliability testing★

  • Software reliability testing includes:
    determination of reliability goals, development of operating profiles, design of test cases, test implementation, analysis of test results, etc.

  • Testing steps:
    Define software operation profile (model the usage behavior of the software) => Design reliability test cases => Implement reliability testing.
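
A minimal sketch of an operational profile driving statistical test-case selection; the operations and their probabilities are assumed, not taken from the source:

```python
import random

# Hypothetical operational profile: operations and their usage probabilities.
operational_profile = {
    "query_balance":  0.60,
    "transfer_money": 0.25,
    "print_report":   0.15,
}

def draw_test_cases(profile, n, seed=42):
    """Select operations for reliability testing in proportion to how often
    they occur in real use, as the operational profile specifies."""
    rng = random.Random(seed)
    ops = list(profile)
    weights = [profile[op] for op in ops]
    return rng.choices(ops, weights=weights, k=n)

print(draw_test_cases(operational_profile, n=10))
```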

7. Software reliability evaluation★

  • Software reliability evaluation has three processes:
    selecting reliability model, collecting reliability data, and reliability assessment and prediction.

  • Factors to consider when selecting a reliability model:
    the applicability of model assumptions, the ability and quality of predictions, whether the model output value can meet the needs of reliability evaluation, and the ease of use of the model.

  • Collection of reliability data:
    Reliability data mainly means software failure data, which is the basis of software reliability evaluation; it is collected mainly during the software testing and implementation stages. Measures that should be taken: determine the reliability model to be used early, formulate a practical reliability data collection plan, pay attention to organizing and analyzing the software test data, and make full use of a database to store and statistically analyze the reliability data.

  • Reliability assessment and prediction:
    Determine whether the reliability target has been reached; if not, estimate how much additional investment is needed; and after the software system has been in actual operation for a year or some other period, and after maintenance, upgrades, and modifications, judge whether the software can reach the level of reliability required for delivery (or partial delivery) to users. Auxiliary methods: graphical analysis of failure data and heuristic data-analysis techniques.
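
As an illustration of the graphical and statistical analysis of failure data mentioned above, here is a minimal sketch that derives inter-failure times, a recent mean time between failures, and a failure intensity estimate from failure timestamps; the timestamps are invented:

```python
# Hypothetical failure timestamps (in execution hours) collected during testing.
failure_times = [12, 30, 55, 90, 140, 210, 300, 410]

# Inter-failure times; an increasing trend indicates reliability growth.
intervals = [b - a for a, b in zip([0] + failure_times, failure_times)]
print(intervals)                                # [12, 18, 25, 35, 50, 70, 90, 110]

# Current estimate of mean time between failures from the last few intervals.
window = intervals[-3:]
print(sum(window) / len(window))                # -> 90.0 hours

# Overall failure intensity estimate (failures per execution hour).
print(len(failure_times) / failure_times[-1])   # -> about 0.0195
```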


Origin blog.csdn.net/weixin_30197685/article/details/132125234