Key programming points for the reliability design of embedded software

Edited from: https://mp.weixin.qq.com/s/I2AsC3WiBu6sJo7yDDngEw

Equipment reliability depends on many things: stable hardware, a sound software architecture, rigorous testing, and the test of the market and of time. This article focuses on the author's own understanding of reliability design for embedded software, and on techniques and methods that improve software reliability.

1. Error detection and reporting

A craftsman who wants to do good work must first sharpen his tools. The ultimate purpose of error checking is to expose and correct bugs in the design, so the program must give the programmer useful error information. Sometimes fault information also needs to be stored in non-volatile memory so it can be examined later. The example here prints error information over the serial port to a PC, to illustrate what information generally needs to be shown.

Write or port a printf-like function, similar to the one in the C standard library, that can format and print characters, strings, decimal integers, and hexadecimal integers. Here it is called UARTprintf().

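unsigned int WriteData(unsigned int addr)
{
    if((addr>= BASE_ADDR)&&(addr<=END_ADDR)) 
    {
        …/* address is valid: process it */
    } 
    else 
    { /* address error: print error information */
        UARTprintf ("Address error while writing data at line %d of file %s, bad address: 0x%08x\n",__LINE__,__FILE__,addr);
        …/* error handling code */
    }
}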

Assuming the UARTprintf() call is at line 256 of main.c and WriteData() is passed the invalid address 0x00000011 when writing data, the UARTprintf() call is executed and prints a message like the following:

Address error while writing data at line 256 of file main.c, bad address: 0x00000011

Information like this helps programmers locate and analyze the root cause of an error and eliminate bugs faster.
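A minimal sketch of wrapping this pattern in a macro, so that every call site records its file and line automatically. The names report_error(), REPORT_ERROR, write_data_checked() and last_error are hypothetical, the address window is an assumed value, and formatting into a RAM buffer stands in for the real UARTprintf() output.

```c
#include <stdarg.h>
#include <stdio.h>

#define BASE_ADDR 0x20000000u   /* assumed address window, for illustration only */
#define END_ADDR  0x2000FFFFu

/* Last formatted message, kept in RAM so it can be inspected or stored later. */
static char last_error[128];

/* Format "file:line: message" into last_error. On a target, this is where the
 * text would be handed to UARTprintf() or written to non-volatile memory. */
static void report_error(const char *file, int line, const char *fmt, ...)
{
    va_list args;
    int n = snprintf(last_error, sizeof last_error, "%s:%d: ", file, line);

    if (n < 0 || (size_t)n >= sizeof last_error) {
        return;                 /* prefix did not fit; keep what was written */
    }
    va_start(args, fmt);
    vsnprintf(last_error + n, sizeof last_error - (size_t)n, fmt, args);
    va_end(args);
}

/* The macro captures the location of the call site, not of report_error(). */
#define REPORT_ERROR(...) report_error(__FILE__, __LINE__, __VA_ARGS__)

/* Returns 0 on success, 1 if the address is outside the assumed window. */
static unsigned int write_data_checked(unsigned int addr)
{
    if ((addr >= BASE_ADDR) && (addr <= END_ADDR)) {
        return 0u;              /* address valid: the real write would go here */
    }
    REPORT_ERROR("address error while writing data, bad address: 0x%08x", addr);
    return 1u;
}
```

Calling write_data_checked(0x00000011u) returns 1 and leaves a message in last_error that names the source file, the line of the REPORT_ERROR call, and the bad address.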

2. Check that the actual arguments are valid

Programmers may unknowingly pass wrong arguments, and strong external interference may corrupt the arguments being passed or cause a function to be called accidentally with random arguments. So before executing the function body, check whether the actual arguments are valid.

int exam_fun( unsigned char *str ) 
{ 
    if( str != NULL )
    { // check the assumption "the pointer is not NULL"

        ... // normal processing code
    } 
    else 
    {
        UARTprintf(…); // print error information
        …// error handling code
    }
}

3. Carefully check function return values

Error codes returned by functions should be handled thoroughly and carefully, and errors should be logged when necessary.

#include <stdlib.h>

char *DoSomething(…)
{
    char * p;
    p=malloc(1024);
    if(p==NULL) 
    { /* check the function's return value */
        UARTprintf(…); /* print error information */
        return NULL;
    }
    return p;
}

4. Prevent pointers from going out of bounds

If an address is computed dynamically, make sure the computed address is reasonable and points to a meaningful location. In particular, for a pointer into a structure or array, make sure that after the pointer is incremented or otherwise changed it still points into the same structure or array.
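As a rough illustration of this rule, the sketch below validates a computed position before turning it into a pointer. The array samples, its size SAMPLE_COUNT and the helper sample_at() are hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 16u        /* assumed buffer size, for illustration only */

static uint16_t samples[SAMPLE_COUNT];

/* Return a pointer to samples[index + offset] if that element exists, or NULL
 * if the computed position would leave the array. The check is done on
 * indices, so an out-of-range pointer is never formed. */
static uint16_t *sample_at(size_t index, ptrdiff_t offset)
{
    if (index >= SAMPLE_COUNT) {
        return NULL;
    }
    if (offset < 0) {
        /* Magnitude of a negative offset, computed without overflowing. */
        size_t back = (size_t)(-(offset + 1)) + 1u;

        if (back > index) {
            return NULL;        /* would point before the first element */
        }
        return &samples[index - back];
    }
    if ((size_t)offset >= SAMPLE_COUNT - index) {
        return NULL;            /* would point past the last element */
    }
    return &samples[index + (size_t)offset];
}
```

The caller treats NULL as an error to be reported, in the same way as an invalid argument.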

5. Prevent array indices from going out of bounds

The out-of-bounds problem has already been touched on above. Since C does not check array bounds, the application must check them explicitly. The following example applies to receiving communication data in an interrupt handler.

#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];
… // other code
void Uart_IRQHandler(void)
{
    static unsigned int RecCount=0;   // count of bytes received so far
    …       // other code
    if(RecCount < REC_BUF_LEN)
    {
        RecBuf[RecCount]=…;  // fetch data from the hardware
        RecCount++;
        …      // other code
    } 
    else 
    {
        UARTprintf(…);   // print error information
        …      // other error handling code
    }
    …
}

When using some library functions, the bounds also need to be checked:

#include <string.h>
#define REC_BUF_LEN 100
unsigned char RecBuf[REC_BUF_LEN];
 
if(len<= REC_BUF_LEN)   // len bytes must fit in RecBuf
{
    memset(RecBuf,0,len);  // clear the first len bytes of RecBuf
} 
else 
{
    // handle the error
}

6. Mathematical operations

  • Check whether the divisor is zero
  • Detect arithmetic overflow

For signed integer division, is checking for a zero divisor enough?

No. When dividing two integers, besides checking whether the divisor is zero, you must also check whether the division overflows. A 32-bit signed long can represent values from -2147483648 to +2147483647. Dividing -2147483648 by -1 should give +2147483648, but that is outside the range a 32-bit signed long can represent.

#include <limits.h>
signed long sl1,sl2,result;
/* initialize sl1 and sl2 */
if((sl2==0)||((sl1==LONG_MIN) && (sl2==-1)))
{
    // handle the error
} 
else 
{
    result = sl1 / sl2;
}

Addition overflow detection:

a) Unsigned addition

#include <limits.h>
unsigned int a,b,result;
/* initialize a and b */
if(UINT_MAX-a<b)
{
    // handle overflow
} 
else 
{
    result=a+b;
}

b) Signed addition

#include <limits.h>
signed int a,b,result;
/* initialize a and b */
if(((a>0) && (INT_MAX-a<b)) || ((a<0) && (INT_MIN-a>b)))
{
    // handle overflow
} 
else 
{
    result=a+b;
}

Multiplication overflow detection:

a) Unsigned multiplication

#include <limits.h>
unsigned int a,b,result;
/* initialize a and b */
if((a!=0) && (UINT_MAX/a<b)) 
{
    // handle overflow
} 
else 
{
    result=a*b;
}

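b) Signed multiplication

#include <limits.h>
signed int a,b,result;
long long tmp;
/* initialize a and b */
/* Multiply in a wider type: computing a*b in int first would already overflow,
   which is undefined behavior for signed integers.
   This assumes long long is wider than int. */
tmp=(long long)a * (long long)b;
if((tmp>INT_MAX) || (tmp<INT_MIN))
{
    // handle overflow
} 
else 
{
    result=(signed int)tmp;
}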

7. Other places where runtime errors may occur

Run-time error checking is something C programmers need to pay special attention to, because the C language provides almost no run-time checks. Software that demands high reliability needs dynamic checks, so C programmers should carefully add them wherever a run-time error could occur. Most dynamic checks are closely tied to the application and should be chosen according to the system's requirements during program design.

8. Compiler Semantic Check

To make compilers easier to design, almost all current compilers perform relatively weak semantic checks. In addition, to achieve faster execution, C is designed to be very flexible and performs almost no run-time checks, such as whether an array index is out of bounds, whether a pointer is valid, or whether the result of an operation overflows.

C is flexible indeed. For an array a[30], the compiler accepts an expression such as a[-1] to fetch the data just before the address of the array's first element; it allows a constant to be converted to a function pointer, so the code ( * ((void( * )())0))() calls the function at address 0. C gives programmers plenty of freedom, but programmers also bear the consequences of abusing it. The following two examples are infinite loops. If similar code sits in a rarely executed branch, it will cause seemingly inexplicable hangs or resets.

a. unsigned char i;
   for(i=0;i<256;i++)  {… }
b. unsigned char i;
   for(i=10;i>=0;i--) { … }

An unsigned char can represent 0~255, so the unsigned char variable i is always less than 256 (the first for loop runs forever) and always greater than or equal to 0 (the second for loop also runs forever). Note that the assignment i=256 is accepted by C as well, even though the value is outside the range i can represent. Clearly, C does everything it can to give programmers opportunities to make mistakes. If you accidentally add a semicolon after an if statement and so change the program's logic, the compiler will go along with it, possibly without even a warning. The code is as follows:

if(a>b);          // a semicolon was added here by mistake
a=b;               // this statement is always executed

Not only that, the compiler also ignores extra spaces and newlines, so code like the following does not produce enough of a hint either:

if(n<3)
return    // a semicolon is missing here
logrec.data=x[0];
logrec.time=x[1];
logrec.code=x[2];

The intent of this code is to return immediately when n<3. Because of the programmer's mistake, return is missing its terminating semicolon. The compiler treats the next line as part of the statement and returns the value of the expression logrec.data=x[0]; C allows an expression after return. As a result, when n>=3 the assignment logrec.data=x[0]; is never executed, which plants a hidden defect in the program. Put bluntly, weak compiler semantic checks let unreliable code survive unchallenged.

As mentioned above, arrays are a frequent cause of program instability, and programmers often write past the end of an array without noticing. A colleague's code was running on the hardware when, after a while, a number on the LCD was found to be changing abnormally. After some debugging, the problem was traced to the following code:

int SensorData[30];
for(i=30;i>0;i--)
{
    SensorData[i]=…;
    …
}

An array of 30 elements is declared here. Unfortunately, the for loop uses the non-existent element SensorData[30] (and never writes SensorData[0]), but C silently accepts this and happily modifies the memory at SensorData[30] as the code says. That location happened to hold an LCD display variable, which is why the value on the display changed abnormally. Fortunately, this bug was found fairly easily.
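One way to write such a countdown loop so that it stays inside the array is sketched below. The names SENSOR_COUNT and fill_sensor_data() are hypothetical, and the stored value is a placeholder for whatever the original code read from the sensor.

```c
#include <stddef.h>

#define SENSOR_COUNT 30u

static int SensorData[SENSOR_COUNT];

/* Fill the array from the last element down to the first.
 * The counter i runs from SENSOR_COUNT down to 1 and the index used is i - 1,
 * so exactly the valid range 0..SENSOR_COUNT-1 is covered and SensorData[30]
 * is never touched. Testing "i > 0" on an unsigned counter also avoids the
 * "i >= 0 is always true" trap shown earlier. */
static void fill_sensor_data(int value)
{
    size_t i;

    for (i = SENSOR_COUNT; i > 0u; i--) {
        SensorData[i - 1u] = value;
    }
}
```

Deriving both the array size and the loop bound from the same constant means a later change of the array size cannot leave the loop bound behind.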

9. Back up key data in multiple areas and read it using the "voting method"

Data in RAM may be altered by interference, so the system's key data must be protected. Key data includes global variables, static variables, and data areas that need protection. A backup must not sit next to the original data, so the backup locations should not be left to the compiler's default allocation; the programmer should place them in designated areas.

RAM can be divided into three areas: the first stores the original value, the second stores the bitwise inverse, and the third stores the value XORed with a constant. Some "blank" RAM is reserved between the areas for isolation. The compiler's "scatter-loading" mechanism can be used to place variables in these areas. To read a variable, read all three copies and vote, taking the value on which at least two copies agree.

Suppose the device's RAM starts at 0x1000_0000, and the original values are to be stored in 0x1000_0000~0x1000_7FFF, the inverted values in 0x1000_9000~0x1000_9FFF, and the values XORed with 0xAA in 0x1000_B000~0x1000_BFFF. The compiler's scatter-loading file can be set up as:

LR_IROM1 0x00000000 0x00080000 { ; load region size_region
    ER_IROM1 0x00000000 0x00080000 { ; load address = execution address
        *.o (RESET, +First)
        *(InRoot$$Sections)
        .ANY (+RO)
    }

    RW_IRAM1 0x10000000 0x00008000 { ; stores the original values
        .ANY (+RW +ZI )
    }

    RW_IRAM3 0x10009000 0x00001000{ ; stores the inverted values
        .ANY (MY_BK1)
    }

    RW_IRAM2 0x1000B000 0x00001000 { ; stores the XORed values
        .ANY (MY_BK2)
    }
}

If a key variable needs multiple backups, it can be defined as follows: the three variables are assigned to three non-contiguous RAM areas and initialized at definition with the original value, the bitwise inverse, and the value XORed with 0xAA.

uint32 plc_pc=0; // original value
__attribute__((section("MY_BK1"))) uint32 plc_pc_not=~0x0; // bitwise inverse
__attribute__((section("MY_BK2"))) uint32 plc_pc_xor=0x0^0xAAAAAAAA; // value XORed with 0xAA bytes

When the variable is written, all three locations must be updated; when it is read, all three values are read and compared, and the value on which at least two copies agree is used.
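The write and read rules above can be sketched as a pair of helpers. This is only an illustration under stated assumptions: the type tmr_u32 and the functions tmr_write() and tmr_read() are hypothetical names, and a struct is used to keep the example self-contained, whereas in a real project each copy would be a separate variable placed in its own RAM area as shown above.

```c
#include <stdint.h>

#define XOR_MASK 0xAAAAAAAAu

/* Three stored forms of one variable (see the note above about placement). */
typedef struct {
    uint32_t raw;    /* original value */
    uint32_t inv;    /* bitwise inverse */
    uint32_t xored;  /* value XOR XOR_MASK */
} tmr_u32;

/* Update all three copies together. */
static void tmr_write(tmr_u32 *v, uint32_t value)
{
    v->raw   = value;
    v->inv   = ~value;
    v->xored = value ^ XOR_MASK;
}

/* Decode the three copies and take a 2-out-of-3 vote.
 * Returns 0 and stores the voted value in *out (rewriting all copies so the
 * damaged one is repaired), or returns -1 if all three copies disagree. */
static int tmr_read(tmr_u32 *v, uint32_t *out)
{
    uint32_t a = v->raw;
    uint32_t b = ~v->inv;
    uint32_t c = v->xored ^ XOR_MASK;
    uint32_t voted;

    if ((a == b) || (a == c)) {
        voted = a;
    } else if (b == c) {
        voted = b;
    } else {
        return -1;              /* no majority: caller must handle the fault */
    }
    tmr_write(v, voted);
    *out = voted;
    return 0;
}
```

Because the three copies use different encodings, a region that is cleared to zero decodes to three different values (0, 0xFFFFFFFF and 0xAAAAAAAA), so tmr_read() reports a fault instead of accepting 0.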

Why use an XOR code rather than the complement (two's complement) code? Because MDK stores integers in two's complement, and the two's complement of a positive number is identical to its original form. In that case the original copy and the complement copy are the same, which provides no redundancy and is actually harmful to reliability. For example, suppose interference clears to zero the RAM holding a non-zero integer. Because the original and complement copies are stored in the same form, both read as 0, and the 2-out-of-3 "voting method" would accept the corrupted value 0 as correct data.

10. Data storage in non-volatile memory

Non-volatile memory includes, but is not limited to, Flash, EEPROM, and ferroelectric memory (FRAM). Merely reading back and verifying data after writing it to non-volatile memory is not enough. Strong interference may corrupt data in the non-volatile memory, a power failure during a write will lose data, and a program that runs away into the write function because of interference will leave the stored data in disorder. A reliable approach is to divide the non-volatile memory into several areas and write each item into these areas in different forms. When reading, read all the copies and vote, taking the value that the most copies agree on.

For the case where interference makes the program run away into the function that writes non-volatile memory, software locks and strict entry checks should also be used. Relying solely on writing data to multiple areas is neither sufficient nor wise; the problem should be blocked at the source.

11. Software lock

Software locks can be implemented by interlocking, among other ways. For an initialization sequence or a set of function calls that must happen in a fixed order, interlocking can be used to guarantee the call order or to guarantee that every function is called; an interlock is essentially a software lock. In addition, a software lock can be placed on safety-critical statements (statements, not functions), so that only code holding a specific key can reach them.

For example, when writing data to Flash, we check whether the data is valid and whether the target address is valid, and compute the sector to be written. We then call the Flash write subroutine, which checks whether the sector address and the data length are valid and only then writes the data to Flash. Because the statements that write Flash are safety-critical, the program locks them: Flash can only be written with the correct key. This way, even if the program runs away into the Flash write subroutine, the risk of an erroneous write is greatly reduced.

/***************************************************************
* Name: RamToFlash()
* Function: copy data from RAM to FLASH (command code 51).
* Inputs: dst  destination address, i.e. the FLASH start address, on a 512-byte boundary
*         src  source address, i.e. the RAM address; must be word-aligned
*         no   number of bytes to copy: 512/1024/4096/8192
*         ProgStart  software lock flag
* Outputs: IAP return value (paramout buffer): CMD_SUCCESS, SRC_ADDR_ERROR, DST_ADDR_ERROR,
*          SRC_ADDR_NOT_MAPPED, DST_ADDR_NOT_MAPPED, COUNT_ERROR, BUSY, sector not selected
****************************************************************/
void RamToFlash(uint32 dst, uint32 src, uint32 no,uint8 ProgStart)
{
    PLC_ASSERT("Sector number",(dst>=0x00040000)&&(dst<=0x0007FFFF));
    PLC_ASSERT("Copy bytes number is 512",(no==512));
    PLC_ASSERT("ProgStart==0xA5",(ProgStart==0xA5));
    paramin[0] = IAP_RAMTOFLASH; // set the command word
    paramin[1] = dst; // set the parameters
    paramin[2] = src;
    paramin[3] = no;
    paramin[4] = Fcclk/1000;
    if(ProgStart==0xA5) // run the critical code only if the software lock flag is correct
    {
        iap_entry(paramin, paramout); // call the IAP service routine
        ProgStart=0;
    }
    else
    {
        paramout[0]=PROG_UNSTART;
    }

}

This code programs the internal Flash of the LPC1778. The call iap_entry(paramin, paramout), which invokes the IAP routine, is the safety-critical statement, so before executing it the code checks a dedicated lock flag, ProgStart; the Flash is programmed only if the flag has the expected value. If the program ends up in this function by accident, the Flash will not be programmed because the ProgStart flag is incorrect.
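The interlocking idea mentioned at the start of this section, where each step of a fixed sequence only runs if the previous step has completed, might look roughly like the sketch below. The step names, the key values and the init_key variable are all hypothetical, and the actual hardware setup is left as comments.

```c
#include <stdint.h>

/* Hypothetical step tokens: each step only runs if the previous step left
 * its token behind, and then leaves its own token for the next step. */
#define KEY_NONE   0x00u
#define KEY_CLOCK  0x5Au
#define KEY_GPIO   0xA5u
#define KEY_UART   0x3Cu

static uint8_t init_key = KEY_NONE;

static int init_clock(void)
{
    if (init_key != KEY_NONE) {
        return -1;              /* must be the first step */
    }
    /* clock setup would go here */
    init_key = KEY_CLOCK;
    return 0;
}

static int init_gpio(void)
{
    if (init_key != KEY_CLOCK) {
        return -1;              /* clock step has not run, or ran out of order */
    }
    /* GPIO setup would go here */
    init_key = KEY_GPIO;
    return 0;
}

static int init_uart(void)
{
    if (init_key != KEY_GPIO) {
        return -1;              /* GPIO step has not run, or ran out of order */
    }
    /* UART setup would go here */
    init_key = KEY_UART;
    return 0;
}

/* Non-zero only if every step ran, in order. */
static int init_done(void)
{
    return init_key == KEY_UART;
}
```

A step entered out of order, for example because the program ran away into it, returns -1 without touching the hardware, and init_done() stays false until the whole sequence has run.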

12. Error detection of communication data

Data errors on communication lines are a relatively serious problem. The longer the line and the harsher the environment, the more bit errors occur. Setting aside the hardware and the environment, the software should be able to recognize corrupted communication data. Some practical measures:

  • When defining the protocol, limit the number of bytes per frame

The more bytes in a frame, the higher the chance of a bit error and the more data is invalidated by it. For this reason, Ethernet limits the data in each frame to 1500 bytes, and the high-reliability CAN bus limits each frame to 8 data bytes. For RS485, the most widely used protocol on RS485 links, Modbus, limits a frame to 256 bytes. So when defining an internal communication protocol over RS485, it is recommended to limit each frame to 256 bytes.

  • Use multiple checks

Enable parity checking when writing the program. For applications with more than 16 bytes per frame, it is recommended to implement at least a CRC16 check.

  • Add extra checks
  1. Add a buffer overflow check. Data is mostly received in an interrupt handler and the compiler cannot detect whether the buffer overflows, so this must be checked by hand, as explained in the section on array bounds above.
  2. Add a timeout check. If half a frame has been received and the rest does not arrive for a long time, consider the frame invalid and restart reception. This check is optional and depends on the protocol, but the buffer overflow check must be implemented. The reason is that, with a protocol that detects a frame header, the host may lose power right after sending the header. After restarting, the host sends a new frame, but the slave has already received the header of the previous, unfinished frame, so it treats the header of the new frame as ordinary data. The slave may then read a very large value as the data-length field, and a lot of data is needed to fill a buffer of that length (a frame might be 1000 bytes, for example), which hurts response time. Worse, if the program has no buffer overflow check, the buffer is likely to overflow, with disastrous consequences.
  • Retransmission mechanism

If an error is detected in the communication data, a retransmission mechanism is needed to resend the corrupted frame.
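The CRC16 check recommended above can be sketched as follows. The article does not say which CRC16 variant to use; this example assumes the CRC-16/MODBUS variant (reflected polynomial 0xA001, initial value 0xFFFF) because Modbus is the protocol mentioned, and the function names crc16_modbus() and frame_crc_ok() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/MODBUS: reflected polynomial 0xA001, initial value 0xFFFF,
 * no final XOR. A table-driven version is faster but costs ROM for the table. */
static uint16_t crc16_modbus(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 0x0001u) {
                crc = (uint16_t)((crc >> 1) ^ 0xA001u);
            } else {
                crc >>= 1;
            }
        }
    }
    return crc;
}

/* Check a received frame whose last two bytes hold the CRC, low byte first
 * (the byte order used by Modbus RTU). Returns 1 if the frame is intact. */
static int frame_crc_ok(const uint8_t *frame, size_t len)
{
    uint16_t received;

    if (len < 3u) {
        return 0;               /* too short to hold data plus a CRC */
    }
    received = (uint16_t)(frame[len - 2u] | ((uint16_t)frame[len - 1u] << 8));
    return crc16_modbus(frame, len - 2u) == received;
}
```

A frame that fails frame_crc_ok() is discarded, and the retransmission mechanism is what recovers the lost data.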

13. Detection and confirmation of switch input

Switch (digital) inputs are susceptible to spike interference, which can cause false operation if it is not filtered out. In general, a digital input should be sampled several times and evaluated logically until the signal is confirmed to be correct. The samples must be separated by a time interval that depends on the maximum switching frequency of the input; it is generally no less than 1 ms.
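A minimal sketch of this kind of multi-sample confirmation, assuming the function is called once per sampling period (for example from a 1 ms timer tick). The type debounce_t, the function debounce_update() and the sample count of 5 are assumptions made for this example.

```c
#include <stdint.h>

#define DEBOUNCE_SAMPLES 5u     /* assumed: consecutive samples needed to confirm a change */

typedef struct {
    uint8_t stable;  /* last confirmed level (0 or 1) */
    uint8_t count;   /* consecutive samples that differ from stable */
} debounce_t;

/* Feed one raw sample of the pin and get back the confirmed level.
 * The confirmed level changes only after DEBOUNCE_SAMPLES consecutive
 * samples disagree with it; a shorter spike resets the counter. */
static uint8_t debounce_update(debounce_t *d, uint8_t raw)
{
    raw = (raw != 0u) ? 1u : 0u;

    if (raw == d->stable) {
        d->count = 0u;          /* input agrees with the confirmed level */
    } else {
        d->count++;
        if (d->count >= DEBOUNCE_SAMPLES) {
            d->stable = raw;    /* change confirmed */
            d->count = 0u;
        }
    }
    return d->stable;
}
```

With a 1 ms sampling period and 5 samples, a spike shorter than 5 ms never reaches the application, at the cost of a 5 ms delay on genuine changes.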

14. Switch output

Simply outputting a switch signal once is not safe, because interference may flip the state of the output. Refreshing the output repeatedly effectively guards against the level being flipped.

15. Preservation and restoration of initialization information

The microprocessor's register values may also change due to external interference. The initialization values of peripherals have to stay in registers for a long time, so they are the most likely to be corrupted. Because data in Flash is relatively hard to corrupt, the initialization information can be written to Flash in advance; when the program is idle, it compares the initialization-related registers against these values and, if an illegal change is found, restores them from Flash.
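One possible shape for this check is a table of register addresses and expected values that is walked during idle time. The sketch below is generic: reg_init_t and reg_check_and_restore() are hypothetical names, and on a real target the table would be declared const so that the linker places it in Flash, with reg pointing at actual peripheral registers.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per protected register. */
typedef struct {
    volatile uint32_t *reg;   /* address of the register */
    uint32_t expected;        /* value written at initialization */
    uint32_t mask;            /* only these bits are compared and restored */
} reg_init_t;

/* Compare each register with its expected value and restore the masked bits
 * if they differ. Returns the number of registers that had to be repaired. */
static size_t reg_check_and_restore(const reg_init_t *table, size_t count)
{
    size_t i;
    size_t repaired = 0u;

    for (i = 0u; i < count; i++) {
        uint32_t now  = *table[i].reg;
        uint32_t want = table[i].expected & table[i].mask;

        if ((now & table[i].mask) != want) {
            *table[i].reg = (now & ~table[i].mask) | want;
            repaired++;
        }
    }
    return repaired;
}
```

The mask keeps status bits that the hardware changes on its own out of the comparison; a non-zero return value is worth logging, since it indicates that interference reached the registers.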

16. while loop

Programmers sometimes use a statement like while(!flag); to wait for a flag to change, for example waiting for a byte to finish transmitting on the serial port. Such code is risky: if the flag never changes for some reason, the system hangs. A good defensive approach is to set a timeout, so that the program is forced out of the while loop after a certain time.
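A minimal sketch of such a bounded wait is shown below. It counts polls rather than reading a hardware timer, which is an assumption made to keep the example self-contained; on a real target the limit would normally come from a timer tick. The names wait_flag(), WAIT_OK and WAIT_TIMEOUT are hypothetical.

```c
#include <stdint.h>

#define WAIT_OK       0
#define WAIT_TIMEOUT  (-1)

/* Poll *flag until it becomes non-zero or max_polls checks have been made.
 * The flag is volatile because it is changed by hardware or by an interrupt
 * handler, not by this function. */
static int wait_flag(volatile const uint8_t *flag, uint32_t max_polls)
{
    while (max_polls > 0u) {
        if (*flag != 0u) {
            return WAIT_OK;
        }
        max_polls--;
    }
    return WAIT_TIMEOUT;        /* give up instead of hanging the system */
}
```

The caller checks the return value and treats WAIT_TIMEOUT as an error to be reported and handled, for example by resetting the peripheral.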

The W32.Blaster.Worm outbreak of August 11, 2003 caused worldwide economic losses of up to 500 million U.S. dollars. The worm exploited a logic flaw in the remote procedure call interface of the Windows Distributed Component Object Model (DCOM): in the GetMachineName() function, the loop has an insufficient termination condition.

The original code is simplified as follows:

HRESULT GetMachineName ( WCHAR *pwszPath,
WCHAR wszMachineName[MAX_COMPUTTERNAME_LENGTH_FQDN+1])
{
    WCHAR *pwszServerName = wszMachineName;
    WCHAR *pwszTemp = pwszPath + 2;
    while ( *pwszTemp != L'\\' )               /* the termination condition of this loop is insufficient */
        *pwszServerName++= *pwszTemp++;
    /*… */
}

Microsoft's security patch MS03-026 fixed this problem by giving the loop in GetMachineName() a sufficient termination condition. One possible fix, simplified, is shown below (this is not Microsoft's patch code):

HRESULT GetMachineName( WCHAR *pwszPath,
WCHAR wszMachineName[MAX_COMPUTTERNAME_LENGTH_FQDN+1])
{
    WCHAR *pwszServerName = wszMachineName;
    WCHAR *pwszTemp = pwszPath + 2;
    WCHAR *end_addr = pwszServerName +MAX_COMPUTTERNAME_LENGTH_FQDN;
    while ((*pwszTemp != L'\\' ) && (*pwszTemp != L'\0')
            && (pwszServerName<end_addr))  /* sufficient termination condition */
            *pwszServerName++= *pwszTemp++;
    /*… */
}

17. System self-test

Perform self-tests on the CPU, RAM, Flash, external non-volatile memory, and other circuits.
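As one small example of such a self-test, the sketch below checks a RAM region with two complementary bit patterns and restores the original contents afterwards. It is an illustration only: ram_pattern_test() is a hypothetical name, the test catches stuck data bits but not every kind of addressing fault, and on a real system it must not run while other code (including interrupt handlers) is using the region.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-destructive pattern test of a RAM region. Each word is saved, written
 * with 0x55555555 and then 0xAAAAAAAA, read back each time, and restored.
 * Returns 0 if every word passed, or -1 at the first failing word. */
static int ram_pattern_test(volatile uint32_t *start, size_t words)
{
    size_t i;

    for (i = 0u; i < words; i++) {
        uint32_t saved = start[i];
        int ok;

        start[i] = 0x55555555u;
        ok = (start[i] == 0x55555555u);
        start[i] = 0xAAAAAAAAu;
        ok = ok && (start[i] == 0xAAAAAAAAu);
        start[i] = saved;       /* put the original contents back */

        if (!ok) {
            return -1;
        }
    }
    return 0;
}
```

The two patterns together drive every bit of each word to both 0 and 1, which is why they are a common choice for detecting stuck bits.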

18. Some other programming suggestions:

  • Understand embedded C and the compiler in depth
  • Program meticulously and carefully
  • Use good style and sensible design
  • Don't write code in haste; think twice before writing every line: what could go wrong? Have all branches of the logic been considered?
  • Turn on all compiler warnings
  • Analyze code with static analysis tools
  • Read and write data safely (check all array bounds...)
  • Check the validity of pointers
  • Check the validity of function arguments
  • Check all return values
  • Initialize all variables where they are declared
  • Use parentheses wisely
  • Be careful with casts
  • Use good diagnostic logs and tools

This article comes from the Internet, conveying knowledge for free, and the copyright belongs to the original author. If it involves copyright issues, please contact me to delete it.


Origin blog.csdn.net/qq_41854911/article/details/130461768