Overview of C++ Software Exception Analysis

       Much of my work in recent years has been troubleshooting the anomalies that software runs into during operation. From the low-level network, protocol, and component modules up to the UI modules, I have dealt with all of them many times, have seen many kinds of C++ exceptions and crashes, and have accumulated a good deal of practical experience, which I will share here. This article describes the classification of C++ software exceptions on Windows and the commonly used troubleshooting methods, as a reference for you.

1. Classification of software exceptions

       Common software exceptions include memory out-of-bounds access, memory access violations, stack overflow, thread stack overflow, null pointers and wild pointers, infinite loops, deadlocks, memory leaks, GDI object leaks, stack imbalance caused by inconsistent function calling conventions, and so on.

      Some exceptions crash the software immediately. Some cause a crash only after the software has been running for a while, or for a long time, such as memory leaks and GDI object leaks. Others do not cause a crash at all and only make the software block or freeze, such as infinite loops and deadlocks.

       There is another class of problems that makes business code execute abnormally. These problems do not crash the software, but the business code no longer follows its normal logic or branches, so the business logic goes wrong. For example, a function throws an exception and part of the code is skipped, that is, code that should have run does not run, and the subsequent business code then hits a logic error. Normally the skipped code performs some checks and sets the values of some variables; these directly affect the conditions and execution paths of the code that follows, so skipping it produces business logic errors later on. We have run into this problem several times.
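To make the skipped-code scenario concrete, here is a minimal sketch. The `Session` type and both `init_*` functions are hypothetical names invented for this illustration; only `std::stoi` and its exceptions come from the standard library. The first version swallows the exception and leaves the state half-initialized, while the second reports the failure so that later code can branch on it.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <string>

// Hypothetical business state; the names are illustrative only.
struct Session {
    bool config_loaded = false;
    int port = 0;
};

// Fragile version: if std::stoi throws, the assignment after it is skipped
// and the exception is swallowed, so the caller never learns that the
// session was left half-initialized.
inline void init_fragile(Session& s, const std::string& port_text) {
    try {
        s.port = std::stoi(port_text);  // throws std::invalid_argument on "abc"
        s.config_loaded = true;         // skipped when the line above throws
    } catch (const std::exception&) {
        // swallowed silently
    }
}

// Safer version: the failure is logged and reported to the caller, so later
// code can branch on the result instead of trusting stale state.
inline bool init_checked(Session& s, const std::string& port_text) {
    try {
        s.port = std::stoi(port_text);
        s.config_loaded = true;
        return true;
    } catch (const std::exception& e) {
        std::fprintf(stderr, "init failed: %s\n", e.what());
        return false;
    }
}
```

With `init_fragile`, any later code that assumes `config_loaded` was set takes the wrong branch without any visible error, which is the kind of business logic exception described above.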

      Another example is the last-error value (lasterror) of a system API function being overwritten, so that a later conditional check on the lasterror value goes wrong, that is, a misjudgment occurs. We ran into this ourselves when we added a log print to the open-source libjingle library. Because the interfaces in the open-source code are wrapped quite deeply, we did not notice when adding the logging code that it would overwrite the lasterror value. Specifically, the code that prints the log calls system API functions, so once it has run, the lasterror value produced by the preceding open-source statement has been overwritten. The open-source statement right after the log print then checks the lasterror value left by the preceding statement. Because that value has been overwritten, the check is misjudged, and the subsequent business code hits a logic error.
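One way to guard against this is to save the last-error value before logging and restore it afterwards. The sketch below is an assumption about how such a guard could look, not code from libjingle: `LastErrorGuard`, `log_line`, `read_last_error`, and `write_last_error` are hypothetical names. On Windows they wrap the real `GetLastError` and `SetLastError` APIs; on other platforms `errno` is used as a stand-in so the sketch still compiles.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
using ErrorCode = DWORD;
inline ErrorCode read_last_error() { return ::GetLastError(); }
inline void write_last_error(ErrorCode code) { ::SetLastError(code); }
#else
// Stand-in so the sketch also builds off Windows: errno plays the role of
// the per-thread last-error value.
using ErrorCode = int;
inline ErrorCode read_last_error() { return errno; }
inline void write_last_error(ErrorCode code) { errno = code; }
#endif

// Saves the last-error value on construction and restores it on destruction.
class LastErrorGuard {
public:
    LastErrorGuard() : saved_(read_last_error()) {}
    ~LastErrorGuard() { write_last_error(saved_); }
    LastErrorGuard(const LastErrorGuard&) = delete;
    LastErrorGuard& operator=(const LastErrorGuard&) = delete;

private:
    ErrorCode saved_;
};

// Hypothetical logging helper. The explicit write_last_error(0) simulates a
// system call inside the logger that clobbers the value.
inline void log_line(const char* text) {
    LastErrorGuard guard;
    std::fprintf(stderr, "%s\n", text);
    write_last_error(0);
}  // guard restores the caller's value here
```

Because the guard restores the value in its destructor, code that checks the last error right after the log call still sees the value left by the statement before it.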

       A GDI object leak happens when GDI objects are used to draw a window and are not released after the drawing is done. GDI objects include pens (Pen), brushes (Brush), bitmaps (Bitmap), fonts (Font), device contexts (DC), regions (Region), and so on. A GDI object leak does not cause an exception or crash right away. On Windows, a single process can hold at most 10,000 GDI objects by default; when the number of GDI objects in a process approaches 10,000, GDI drawing calls start to fail, and the program then crashes and exits abruptly. GDI object leaks are in fact much easier to investigate than memory leaks: use the GDIView tool to find out which type of GDI object is leaking, then check it against the code, and the leak can be located quickly. GDIView shows the GDI handle counts of each process, broken down by object type.
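On the code side, a common way to avoid leaking GDI objects is to tie each object's release to scope exit. The following is a minimal, platform-neutral sketch of that idea. `ScopedHandle`, `FakePen`, `create_fake_pen`, `delete_fake_pen`, and `draw_once` are all hypothetical names; in real Windows code the handle would be something like an `HPEN` returned by `CreatePen`, and the deleter would call `DeleteObject`.

```cpp
#include <cassert>
#include <utility>

// Owns one handle and releases it exactly once, when the scope ends.
template <typename Handle, typename Deleter>
class ScopedHandle {
public:
    ScopedHandle(Handle handle, Deleter deleter)
        : handle_(handle), deleter_(std::move(deleter)) {}
    ~ScopedHandle() {
        if (handle_) deleter_(handle_);
    }
    ScopedHandle(const ScopedHandle&) = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;
    Handle get() const { return handle_; }

private:
    Handle handle_;
    Deleter deleter_;
};

// Stand-ins for a GDI object and its create/delete calls (hypothetical; on
// Windows these would be e.g. an HPEN, CreatePen and DeleteObject).
struct FakePen { int width; };
int g_live_pens = 0;  // plays the role of the process's GDI object count
inline FakePen* create_fake_pen(int width) { ++g_live_pens; return new FakePen{width}; }
inline void delete_fake_pen(FakePen* pen) { --g_live_pens; delete pen; }

// Draw routine with an early return: the pen is released on every path.
inline bool draw_once(bool window_visible) {
    ScopedHandle<FakePen*, void (*)(FakePen*)> pen(create_fake_pen(2), &delete_fake_pen);
    if (!window_visible) return false;  // early exit, pen still released
    // ... drawing with pen.get() would go here ...
    return pen.get()->width == 2;
}
```

Early returns are a typical place where a manual release call gets forgotten; with the scoped wrapper the live-object count returns to zero after every call, whichever path is taken.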

2. Use windbg to analyze software exceptions

       windbg is one of the most powerful and widely used software debugging and analysis tools on the Windows platform, where it is the main tool for analyzing all kinds of software exceptions.

       Most crashes can be caught by the exception capture module built into the software (the open-source crashreport exception capture library is widely used for this), which saves the context at the time of the exception to a dump file. windbg can then be used to analyze these dump files statically after the fact.

       For exceptions that do not crash the software, such as deadlocks, infinite loops, and memory leaks, we need to attach windbg to the target process for dynamic analysis.

       For the few exceptions that the exception capture module cannot catch, for example when the program exits abruptly while running, windbg has to be attached to the target process so that the two run together; as soon as the target process hits an exception, windbg notices it and breaks in. After attaching windbg, we need to find a way to reproduce the exception. When it occurs, windbg catches it and breaks, and at that point we can either analyze it directly with windbg commands or use the .dump command to export the exception context to a dump file for post-mortem analysis. Analyzing the problem may take a long time, and the machine in question may belong to a colleague or a manager, so we cannot occupy it indefinitely. In that case, export a dump file and analyze it later with windbg.

       For error message boxes or a stuck program, the software is frozen at that point but the target process is still alive, so you can attach windbg directly. The exception has already occurred, but it is not too late to attach windbg, and the full exception context can still be obtained. For this kind of problem, do not click the OK button on the error box and do not rush to kill the target process in Task Manager; keep the process alive. These exceptions may be hard to reproduce afterwards, so seize the opportunity and attach windbg to the process right away. If you miss it, the problem may be hard to reproduce next time, and it stays in the software as a hidden danger.

3. Common exception troubleshooting methods other than windbg

       Besides static analysis and dynamic debugging with windbg, there are other common methods that are just as important to master: debugging directly in VS (Debug or Release builds), attaching to a process to debug, adding log prints, comparing historical versions (to find the point in time at which the problem first appeared), commenting out code block by block, setting data breakpoints (to monitor memory in real time), and so on. Sometimes several methods have to be combined.

       Logic errors in business code generally have to be tracked down by adding log prints. Another typical case is that the software hits an error at run time, decides by itself that the error is fatal, and calls abort or exit to terminate the process. For example, in the open-source jsoncpp library, an error while parsing a malformed JSON node leads directly to a call to abort, which terminates the whole process. Likewise, in the open-source webrtc library, when allocating heap memory with new fails, the library treats it as a fatal error and also calls abort. When a process terminates itself like this, the exception capture module cannot catch it, so no dump file is generated.

       In this case, attach windbg to the target process and let it run. As soon as abort is called, windbg breaks in, and the function call stack shows which code triggered the problem. Why can the exception capture module installed in the software not catch this? Because the software does not raise an exception itself (there is no RaiseException call); it terminates the process on its own initiative. Then why can windbg perceive it? Because the abort function generates a SIGABRT termination signal notification, which the debugger can perceive, so windbg breaks. At that point, use the kn command to view the function call stack and see which interfaces triggered the problem. In Visual Studio you can go to the definition of abort to view its internal implementation, as follows:

/***
*void abort() - abort the current program by raising SIGABRT
*
*Purpose:
*   print out an abort message and raise the SIGABRT signal.  If the user
*   hasn't defined an abort handler routine, terminate the program
*   with exit status of 3 without cleaning up.
*
*   Multi-thread version does not raise SIGABRT -- this isn't supported
*   under multi-thread.
*
*Entry:
*   None.
*
*Exit:
*   Does not return.
*
*Uses:
*
*Exceptions:
*
*******************************************************************************/

void __cdecl abort (
        void
        )
{
    _PHNDLR sigabrt_act = SIG_DFL;

#ifdef _DEBUG
    if (__abort_behavior & _WRITE_ABORT_MSG)
    {
        /* write the abort message */
        _NMSG_WRITE(_RT_ABORT);
    }
#endif  /* _DEBUG */


    /* Check if the user installed a handler for SIGABRT.
     * We need to read the user handler atomically in the case
     * another thread is aborting while we change the signal
     * handler.
     */
    sigabrt_act = __get_sigabrt();
    if (sigabrt_act != SIG_DFL)
    {
        raise(SIGABRT);
    }

    /* If there is no user handler for SIGABRT or if the user
     * handler returns, then exit from the program anyway
     */

    if (__abort_behavior & _CALL_REPORTFAULT)
    {
        _call_reportfault(_CRT_DEBUGGER_ABORT, STATUS_FATAL_APP_EXIT, EXCEPTION_NONCONTINUABLE);
    }


    /* If we don't want to call ReportFault, then we call _exit(3), which is the
     * same as invoking the default handler for SIGABRT
     */


    _exit(3);
}

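As the code shows, abort raises SIGABRT when a SIGABRT handler has been installed, then reports the fault through _call_reportfault if the _CALL_REPORTFAULT flag is set, and finally calls _exit(3).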
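Since the CRT source above raises SIGABRT only when a handler has been installed, a program can register one to get a last chance to notice an abort. The sketch below uses the standard `std::signal` interface. The names `on_sigabrt`, `g_sigabrt_seen`, and `install_sigabrt_hook` are hypothetical, and the handler only sets a flag; what a real product would record at this point is left open.

```cpp
#include <cassert>
#include <csignal>

// Set by the handler; sig_atomic_t is the type that may be written safely
// from a signal handler.
volatile std::sig_atomic_t g_sigabrt_seen = 0;

// Handler name is illustrative. Keep the body minimal: only a flag is set.
extern "C" void on_sigabrt(int /*signal_number*/) {
    g_sigabrt_seen = 1;
}

// Installs the handler; returns false if the runtime rejected it.
inline bool install_sigabrt_hook() {
    return std::signal(SIGABRT, on_sigabrt) != SIG_ERR;
}
```

Note that when abort itself is called, the process still terminates after the handler returns, so the handler is a place to record information, not a way to recover.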

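       When some modules detect an error, they also call the DebugBreak API function, which makes the debugger break in. According to Microsoft's documentation, DebugBreak causes a breakpoint exception to occur in the current process, which allows the calling thread to signal the debugger to handle the exception.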

For example, when a memory allocation failure in the open-source webrtc library caused the program to exit abruptly, the library had treated the allocation failure as a fatal error and called abort to terminate the process. Before calling abort, it first calls DebugBreak. If windbg is attached to the process at that moment, the DebugBreak call makes windbg break in, so the debugger perceives the problem. The function call stack at that point shows which operation triggered it.
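The break-then-abort pattern described here can be sketched as follows. This is an illustration under stated assumptions, not webrtc's actual code: `fatal` and `alloc_or_die` are hypothetical helpers. `IsDebuggerPresent` and `DebugBreak` are real Windows APIs and are compiled in only on Windows; the check for an attached debugger is this sketch's own choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper, not webrtc's actual code: report, give an attached
// debugger the chance to stop, then terminate.
[[noreturn]] inline void fatal(const char* what) {
    std::fprintf(stderr, "fatal error: %s\n", what);
#ifdef _WIN32
    if (::IsDebuggerPresent()) {
        ::DebugBreak();  // an attached debugger such as windbg breaks here
    }
#endif
    std::abort();
}

// Allocates a buffer and treats failure as fatal.
inline char* alloc_or_die(std::size_t bytes) {
    char* buffer = new (std::nothrow) char[bytes];
    if (buffer == nullptr) {
        fatal("memory allocation failed");
    }
    return buffer;
}
```

When the debugger stops at the DebugBreak call, the call stack still contains the allocation site, which is what makes the triggering operation easy to identify.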

4. Problems existing in the open source CrashReport library

      Many vendors use the open-source crashreport exception capture library, but the stock library is flawed. Large vendors most likely use heavily modified versions of it.

    The open-source crashreport library hooks the thread-creation API function CreateThread in the import tables of the libraries that are already loaded, replacing it with a custom MyCreateThread function (whichever thread-creation interface is called, the call eventually reaches CreateThread). MyCreateThread can then call the system API function SetUnhandledExceptionFilter to install an exception handler for every thread that gets created.

     However, this mechanism is flawed: it cannot install exception handlers on the threads of every module in the software. Only libraries loaded before crashreport are hooked; libraries loaded after it are not, so exceptions raised in those unhooked libraries are not caught. When the exe starts, all of its dependent libraries are loaded into the process space, and we cannot guarantee that they are all loaded before the crashreport library. As a result, crashreport fails to capture some crashes.

       Later we improved the crashreport library, using code from Microsoft's open-source Detours project to hook the UnhandledExceptionFilter interface in the Windows system library. Because essentially every unhandled exception ends up in this function, redirecting UnhandledExceptionFilter to our own function lets us see almost all exceptions there. As soon as an exception is detected, a dump file containing the exception context is generated. This solves the old version's inability to hook libraries loaded after crashreport: the new version covers all modules of the current process and captures essentially all of its exceptions.

       Of course, the improved crashreport still cannot catch 100% of exceptions, but it does catch more than 90% of them. For the cases it misses, attach windbg to the target process and let windbg catch them.


Origin blog.csdn.net/chenlycly/article/details/123991269