Hello everyone, my name is Wheat.
During day-to-day project development you will run into all kinds of problems. This article shares some ideas and solutions for common ones.
1. Reproducing the problem
A problem can be correctly located, fixed, and verified only if it can be reproduced reliably. In general, the easier a problem is to reproduce, the easier it is to solve.
1.1 Simulate the reproduction conditions
Some problems occur only under specific conditions and can be reproduced only by simulating those conditions. If the conditions depend on external input and are complex or hard to simulate, consider adding a preset to the program that puts it directly into the corresponding state.
1.2 Increase the execution frequency of the related task
For example, if an exception only appears after a task has been running for a long time, increase how often that task executes.
1.3 Increase the test sample size
If the program only misbehaves after running for a long time and the problem is hard to reproduce, build a test environment with several sets of equipment and test them in parallel.
2. Locating the problem
Narrow the scope of the investigation step by step to identify the task, function, and statement that introduced the problem.
2.1 Print LOG
Based on the symptoms, add LOG output to the suspect code to trace the program's execution flow and the values of key variables, and check whether they match expectations.
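A minimal sketch of such a LOG output, assuming printf() has been retargeted to a debug port on the target. The LOG macro, the LOG_ENABLED switch and the clamp_speed() function are hypothetical names used only for illustration.

```c
#include <stdio.h>

/* Hypothetical switch: define LOG_ENABLED as 0 to compile all LOG calls out. */
#ifndef LOG_ENABLED
#define LOG_ENABLED 1
#endif

#if LOG_ENABLED
/* Prefix every message with the function name and line number. */
#define LOG(...)                                   \
    do {                                           \
        printf("[%s:%d] ", __func__, __LINE__);    \
        printf(__VA_ARGS__);                       \
        printf("\n");                              \
    } while (0)
#else
#define LOG(...) ((void)0)
#endif

/* Example function under investigation: log inputs and the result. */
int clamp_speed(int requested, int limit)
{
    LOG("enter: requested=%d limit=%d", requested, limit);
    int result = (requested > limit) ? limit : requested;
    LOG("exit: result=%d", result);
    return result;
}
```

Logging both the entry values and the result makes it possible to see from the output alone which call produced an unexpected value.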
2.2 Online debugging
Online debugging serves a similar purpose to printing LOG. It is especially suitable for troubleshooting bugs such as program crashes: when the program is trapped in an exception handler (HardFault, watchdog interrupt, etc.), you can halt it directly and inspect the call stack and the core register values to quickly locate the problem.
2.3 Version rollback
When a version control tool is used, you can roll back step by step, testing each version, to find the first version in which the problem appears, and then review the code added or changed in that version.
2.4 Binary-search commenting
Binary-search commenting means commenting out parts of the code in a manner similar to binary search, to determine whether the problem is caused by the commented-out code.
Specifically, comment out half of the candidate code and check whether the problem disappears. If it does not, comment out the other half instead. Keep halving the commented-out range in this way to gradually narrow down where the problem lies.
2.5 Save a core register snapshot
When a Cortex-M core takes an exception, it automatically pushes several core registers onto the stack: R0-R3, R12, LR, PC and xPSR.
In the exception handler, we can copy the stacked register values into a RAM region whose contents are retained across a reset. After the reset, read the information back from RAM and analyze it: PC and LR identify the function that was executing at the time, R0-R3 show whether the variables being processed were abnormal, SP indicates whether a stack overflow may have occurred, and so on.
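The following is a rough sketch of the snapshot idea, not a complete fault handler. It assumes the handler has already obtained a pointer to the stacked frame (the active MSP or PSP at exception entry) and that no floating-point context was stacked. FAULT_MAGIC, fault_snapshot_t and both functions are hypothetical names; on a real target g_fault_snapshot would be placed in a section that the startup code does not initialize.

```c
#include <stdint.h>
#include <string.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* hypothetical marker: snapshot is valid */

/* Field order matches the frame a Cortex-M core stacks on exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

typedef struct {
    uint32_t        magic;
    uint32_t        frame_addr;   /* address of the stacked frame (SP at entry) */
    stacked_frame_t frame;
} fault_snapshot_t;

/* Assumption: on target this lives in RAM that survives a reset uncleared. */
static fault_snapshot_t g_fault_snapshot;

/* Call from the fault handler with a pointer to the stacked frame. */
void fault_snapshot_save(const uint32_t *stacked_sp)
{
    memcpy(&g_fault_snapshot.frame, stacked_sp, sizeof g_fault_snapshot.frame);
    g_fault_snapshot.frame_addr = (uint32_t)(uintptr_t)stacked_sp;
    g_fault_snapshot.magic = FAULT_MAGIC;
}

/* Call after reset: returns 1 and copies the snapshot if one was saved. */
int fault_snapshot_load(fault_snapshot_t *out)
{
    if (g_fault_snapshot.magic != FAULT_MAGIC) {
        return 0;
    }
    *out = g_fault_snapshot;
    g_fault_snapshot.magic = 0u;   /* consume it so it is reported only once */
    return 1;
}
```

The magic value distinguishes a real snapshot from whatever random content the retained RAM holds after a cold power-up.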
3. Problem analysis and handling
Analyze the cause of the problem by combining the symptoms with the location of the faulty code.
3.1 The program continues to run
3.1.1 Numerical exception
3.1.1.1 Software problems
1. Array out of bounds
When an array is written with a subscript beyond its length, the memory that follows the array is modified instead.
Such problems usually need to be analyzed together with the map file. Use the map file to find the arrays located near the address of the corrupted variable, check whether any write to those arrays is unsafe (for example, an index that is never range-checked), and change it to safe code.
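As an illustration of the difference between an unsafe and a safe array write, here is a small sketch. The buffer g_buf, its length BUF_LEN and the helper functions are made up for this example and are not taken from the original code.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8u

static uint8_t g_buf[BUF_LEN];

/* Unsafe pattern: the subscript is trusted, so idx >= BUF_LEN silently
 * overwrites whatever the linker placed after g_buf:
 *
 *     g_buf[idx] = value;
 */

/* Safer variant: reject out-of-range subscripts instead of writing. */
int buf_write(size_t idx, uint8_t value)
{
    if (idx >= BUF_LEN) {
        return -1;              /* caller can log or assert on this */
    }
    g_buf[idx] = value;
    return 0;
}

/* Range-checked read; returns 0 for an invalid subscript. */
uint8_t buf_read(size_t idx)
{
    return (idx < BUF_LEN) ? g_buf[idx] : 0u;
}
```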
2. Stack overflow
Address | Content |
---|---|
0x20001ff8 | g_val |
0x20002000 | bottom of stack |
………… | stack space |
0x20002200 | top of stack |
As shown in the table above, such problems also need to be analyzed together with the map file. Assuming the stack grows from high addresses toward low addresses, a stack overflow will overwrite g_val with values from the stack.
When a stack overflow occurs, analyze the maximum stack usage. Deep function call chains, function calls inside interrupt service functions, and large local variables declared inside functions can all cause a stack overflow.
Such problems can be addressed in the following ways:
At the design stage, allocate memory resources sensibly and give the stack an appropriate size;
Turn large local variables into static variables by adding the "static" keyword (note that this makes the function non-reentrant), or allocate them dynamically on the heap with malloc();
Change how functions are called to reduce the call depth.
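One common way to estimate the maximum stack usage is to fill the stack region with a known pattern at startup and later count how many words have been overwritten. The sketch below assumes a stack that grows downward and operates on an ordinary array so that it can be tried on a host; STACK_FILL and the function names are hypothetical, and on a target the region boundaries would come from the linker script or map file.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* hypothetical fill pattern */

/* Fill the whole region, starting at its lowest address, with the pattern. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_FILL;
    }
}

/* With a downward-growing stack, untouched words remain at the low end.
 * Count them and subtract to get the high-water mark in words. */
size_t stack_max_used_words(const uint32_t *lowest, size_t words)
{
    size_t untouched = 0;
    while (untouched < words && lowest[untouched] == STACK_FILL) {
        untouched++;
    }
    return words - untouched;
}
```

If the reported usage equals the full region size, the stack has very likely overflowed at some point and its size or the call structure should be reviewed.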
3. Wrong condition in a conditional statement
When writing the condition of a conditional statement, it is easy to mistype the equality operator "==" as the assignment operator "=". The variable being tested is then overwritten, and the condition evaluates to the assigned value, so it is true whenever that value is non-zero. This is not a compile error, although many compilers can issue a warning for it.
It is recommended to write the constant on the left side of the operator and the variable being tested on the right, so that mistyping "==" as "=" produces a compile error. Static code analysis tools can also find such problems.
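A short sketch of the constant-on-the-left style; STATE_IDLE and is_idle() are illustrative names only.

```c
#include <stdbool.h>

#define STATE_IDLE 0

bool is_idle(int state)
{
    /* If "==" were mistyped as "=", this would read "STATE_IDLE = state",
     * which does not compile because a constant cannot be assigned to.
     * The reversed form "state = STATE_IDLE" would compile and silently
     * overwrite state. */
    return (STATE_IDLE == state);
}
```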
4. Synchronization problems
For example, if an interrupt (or task switch) occurs while a dequeue operation is in progress and the interrupt handler (or the task switched to) performs an enqueue operation, the queue structure may be corrupted. Such operations need to be protected, for example by disabling interrupts around them or by mutex synchronization.
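A minimal sketch of a queue whose enqueue and dequeue operations are wrapped in a critical section. ENTER_CRITICAL() and EXIT_CRITICAL() are hypothetical hooks: on a target they would map to the platform's interrupt disable/restore mechanism (or to a mutex for task-level access), while here they default to no-ops so the logic can be tried on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks, no-ops unless the platform provides real ones. */
#ifndef ENTER_CRITICAL
#define ENTER_CRITICAL() ((void)0)
#define EXIT_CRITICAL()  ((void)0)
#endif

#define QUEUE_LEN 4u

typedef struct {
    uint8_t  data[QUEUE_LEN];
    uint32_t head;    /* index of the next element to dequeue */
    uint32_t count;   /* number of elements currently stored */
} queue_t;

/* Enqueue: head and count are only touched inside the critical section. */
bool queue_push(queue_t *q, uint8_t value)
{
    bool ok = false;
    ENTER_CRITICAL();
    if (q->count < QUEUE_LEN) {
        q->data[(q->head + q->count) % QUEUE_LEN] = value;
        q->count++;
        ok = true;
    }
    EXIT_CRITICAL();
    return ok;
}

/* Dequeue: protected the same way, so an interrupt cannot see a
 * half-updated head/count pair. */
bool queue_pop(queue_t *q, uint8_t *out)
{
    bool ok = false;
    ENTER_CRITICAL();
    if (q->count > 0u) {
        *out = q->data[q->head];
        q->head = (q->head + 1u) % QUEUE_LEN;
        q->count--;
        ok = true;
    }
    EXIT_CRITICAL();
    return ok;
}
```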
5. Optimization problems
Consider a loop that keeps calling foo() while a flag flg is clear, where flg is set in the irq interrupt handler. The intent is to stop executing foo() once the irq interrupt has occurred. After compiler optimization, however, flg may be loaded into a register once, and the loop then tests the register value every time without reloading flg from RAM, so foo() keeps running even after the irq interrupt occurs. To fix this, add the "volatile" keyword to the declaration of flg to force its value to be read from RAM every time.
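The pattern described above might look roughly like the following sketch. Only flg and foo() come from the text; irq_handler(), wait_for_irq() and the call counter are assumptions, and foo() triggers the handler itself after three calls purely so the example terminates on a host without real interrupts.

```c
#include <stdbool.h>

/* volatile forces the loop below to re-read flg from memory each time. */
static volatile bool flg = false;

static unsigned foo_calls = 0;

/* On a target this would be the real interrupt service function. */
void irq_handler(void)
{
    flg = true;
}

static void foo(void)
{
    foo_calls++;
    if (foo_calls == 3u) {
        irq_handler();   /* host-only stand-in for the interrupt firing */
    }
}

/* Keep calling foo() until the interrupt has set flg. */
unsigned wait_for_irq(void)
{
    while (!flg) {
        foo();
    }
    return foo_calls;
}
```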
3.1.1.2 Hardware Problems
1. Chip bug
The chip itself has a bug and, in certain situations, returns an incorrect value to the microcontroller. The program needs to check the value read back and filter out abnormal values.
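One simple form of such filtering is a plausibility check that discards readings outside the physically possible range and keeps the last good value. The sketch below assumes a cell voltage reading in millivolts; the limits and all names are hypothetical and would have to come from the actual application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical plausible range for a cell voltage, in millivolts. */
#define CELL_MV_MIN 1500u
#define CELL_MV_MAX 4500u

typedef struct {
    uint16_t last_good_mv;   /* most recent accepted reading */
    bool     has_value;      /* false until the first accepted reading */
} cell_filter_t;

/* Accept raw_mv only if it is plausible; otherwise keep the last good value.
 * Returns true when the reading was accepted. */
bool cell_filter_update(cell_filter_t *f, uint16_t raw_mv)
{
    if (raw_mv < CELL_MV_MIN || raw_mv > CELL_MV_MAX) {
        return false;
    }
    f->last_good_mv = raw_mv;
    f->has_value = true;
    return true;
}
```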
2. Communication timing error
Take the power management chip ISL78600 as an example. Suppose two chips are cascaded. When the voltage sampling data of both chips is read at the same time, the high-end chip transmits its data to the low-end chip over the daisy chain at a fixed period, and the low-end chip has only one buffer.
If the microcontroller does not read the data from the low-end chip within the specified time, newly arriving data overwrites the current data and data is lost. Such problems require careful study of the chip's data sheet so that the timing requirements of the chip's communication are strictly met.
3.1.2 Abnormal action
3.1.2.1 Software problems
1. Design issues
There are errors or omissions in the design, and the design documents need to be re-evaluated.
2. The implementation does not match the design
If the code does not match the design document, add unit tests that cover all conditional branches and carry out cross code reviews.
3. The state variable is abnormal
For example, the variable that records the current state of a state machine is corrupted. Such problems are analyzed in the same way as the numerical exceptions in the previous section.
3.1.2.2 Hardware Problems
1. Hardware failure
The target IC has failed and does not act after receiving a control command; the hardware needs to be checked.
2. Communication abnormality
If communication with the target IC is faulty, control commands cannot be executed correctly. Use an oscilloscope or a logic analyzer to observe the communication timing and determine whether the transmitted signal is wrong or is subject to external interference.
3.2 Program crashes
3.2.1 Stop running
3.2.1.1 Software problems
1. HardFault
The following conditions can cause a HardFault:
Operating a peripheral's registers when the peripheral's clock gate is not enabled;
Jumping to a function address that is out of bounds, which usually happens when a function pointer has been corrupted. The troubleshooting method is the same as for numerical exceptions;
Alignment problems when dereferencing pointers:
Taking little endian as an example, suppose we declare a structure variable a with forced 1-byte alignment (packed) that contains a uint8_t val0, a uint16_t val1 and a uint8_t val2. Its memory layout is as follows:
address | 0x00000000 | 0x00000001 | 0x00000002 | 0x00000003 |
---|---|---|---|---|
variable name | Val0 | Val1_low | Val1_high | Val2 |
value | 0x12 | 0x56 | 0x34 | 0x78 |
At this point the address of a.val1 is 0x00000001. Dereferencing this address as a uint16_t is an unaligned access, which triggers a HardFault on cores that do not support unaligned access (for example Cortex-M0/M0+) or when unaligned-access trapping is enabled. If you must access the variable through a pointer, use memcpy().
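A sketch of the memcpy() approach, assuming a GCC or Clang toolchain for the packed attribute. The structure mirrors the layout in the table above (val0, val1, val2 with val1 at offset 1); the type name sample_t and the helper read_u16_unaligned() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packed layout: val1 starts at offset 1, so it is not 2-byte aligned.
 * __attribute__((packed)) is GCC/Clang syntax; other compilers differ. */
typedef struct __attribute__((packed)) {
    uint8_t  val0;
    uint16_t val1;
    uint8_t  val2;
} sample_t;

/* Read a 16-bit value from an address that may be unaligned.
 * memcpy() copies byte by byte as far as the language is concerned,
 * so no unaligned halfword load is required. */
uint16_t read_u16_unaligned(const void *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Accessing a.val1 directly by member name is also safe with a packed structure, because the compiler then knows the member may be unaligned; the problem only arises once its address is handed around as a plain uint16_t pointer.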
2. The interrupt flag is not cleared in the interrupt service function
The interrupt service function does not correctly clear the interrupt flag before exiting. As soon as execution leaves the interrupt service function it immediately re-enters it, so the program appears to hang.
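The effect can be modelled on a host with a plain variable standing in for the peripheral's status register. Everything in this sketch is hypothetical (TIMER_SR, TIMER_SR_UIF, the simulation loop); on real hardware the register is memory-mapped and the correct way to clear the flag (write 1 to clear, read to clear, and so on) must be taken from the data sheet.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register. */
static volatile uint32_t TIMER_SR;
#define TIMER_SR_UIF (1u << 0)   /* hypothetical interrupt flag bit */

static volatile uint32_t g_ticks;

void timer_isr(void)
{
    if (TIMER_SR & TIMER_SR_UIF) {
        TIMER_SR &= ~TIMER_SR_UIF;   /* clear the flag before returning */
        g_ticks++;
    }
}

/* Host-only model of the interrupt controller: the ISR is entered again
 * for as long as the flag is still pending (bounded by max_entries). */
uint32_t simulate_pending_irq(uint32_t max_entries)
{
    uint32_t entries = 0;
    TIMER_SR |= TIMER_SR_UIF;
    while ((TIMER_SR & TIMER_SR_UIF) && entries < max_entries) {
        timer_isr();
        entries++;
    }
    return entries;
}
```

With the clearing line removed, the simulation would run until max_entries is reached, which corresponds to the apparent hang on real hardware.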
3. NMI interrupt
During debugging I encountered a case where the SPI MISO pin was multiplexed with the NMI function. When the peripheral connected over SPI was damaged, MISO was pulled high. After reset, before the pin had been configured for the SPI function, the microcontroller entered the NMI interrupt directly and hung there. In this case, the NMI function can be disabled inside the NMI interrupt service function so that the program can exit the NMI interrupt.
3.2.1.2 Hardware Problems
1. The crystal oscillator does not start oscillating
2. Insufficient supply voltage
3. The reset pin is held low
3.2.2 Reset
3.2.2.1 Software problems
1. Watchdog reset
Besides resets caused by failing to feed the watchdog in time, pay attention to any special requirements of the watchdog configuration. Taking the Freescale KEA microcontroller as an example, its watchdog requires an unlock sequence during configuration (writing two different values). The unlock sequence must be completed within 16 bus clocks; otherwise a timeout causes a watchdog reset. Such problems can only be avoided by being familiar with the microcontroller's data sheet and paying attention to details like these.
3.2.2.2 Hardware Problems
1. The power supply voltage is unstable
2. Insufficient power load capacity
4. Regression testing
After the problem is solved, run a regression test, both to confirm that the problem no longer recurs and to confirm that the fix has not introduced other problems.
5. Experience summary
Summarize the cause of the problem and how it was solved, think about how to prevent similar problems in the future, and consider whether other products on the same platform can learn from it, so that one case informs others and failures become lessons.
Original text: https://www.cnblogs.com/jozochen/p/8541714.html