Using Process Explorer and Clumsy to track down high CPU usage in software

Table of contents

1. Problem description

2. Use Process Explorer to initially find the cause of high CPU usage

3. Use the Clumsy tool to reproduce the problem in the company's intranet environment

4. According to the function call stack in Process Explorer, analyze the source code, and finally find out the problem

5. Summary


       While troubleshooting video flickering in a customer's deployment of our client software, I stumbled on a well-hidden bug that caused high CPU usage. This article walks through the whole process of locating that bug with Process Explorer and clumsy, a network environment simulation tool.

1. Problem description

       A customer project was nearing completion and had entered the customer trial stage. If the trial went smoothly with no major problems, the customer was ready to purchase the product. Then severe video flickering appeared on one of the customer's laptops, and the customer insisted that it be resolved before any purchase could be made.

       Video flickering is a very difficult problem. The video codec library we currently use has a small chance of being incompatible with particular makes and models of USB camera, and this project happened to hit that case, so a long troubleshooting process began.

       One day I used Sunflower to connect remotely to the customer's laptop and found that system CPU usage was very high and the system was noticeably laggy. The system's resource manager showed that Sunflower accounted for about 15% of the CPU, other software the customer had started accounted for about 30%, and our software accounted for about 30% as well! That should not happen: after login our software was not doing any business work, yet it still took 30%, so something had to be wrong!

2. Use Process Explorer to initially find the cause of high CPU usage

       So I downloaded Process Explorer onto the customer's machine and used it to look at the CPU usage of each thread in our process, to see which module was taking about 30% of the CPU while the software was idle.

Process Explorer is a tool that we use frequently when troubleshooting Windows software problems. We mainly use the following features:
1) Viewing the DLLs loaded by the target process, including each library's path, version and other information. This also shows whether a library loaded dynamically through LoadLibrary has actually been loaded.
2) Viewing the threads of the target process, including the CPU usage of each thread and the thread's real-time function call stack.
3) Viewing the GPU usage of the target process (on many machines the GPU is integrated into the CPU). A lot of software uses the GPU. Video encoding and decoding, for example, can be offloaded to the GPU (hardware encoding and decoding, as opposed to software encoding and decoding, which run on the CPU), which effectively reduces CPU usage.

After starting Process Explorer, find our software process in the process list:

Double-click our software process to open its detailed information window, then switch to the Threads tab:

From the figure, we can see that the thread with thread ID 12292 has a problem: it accounts for about 25% of the CPU!

       Double-click the thread to view its current function call stack:

The call stack consists of interface calls into the libwebsockets open source library. We had previously written a thin wrapper around libwebsockets. Could there be something wrong with our wrapper?

       Business module A of our software communicates with server A of the platform through libwebsockets, and server A handles the transactions of business A. But at this point our software had not done anything and had not started any business A operation. Why were the underlying modules calling libwebsockets so frequently?

       When our software logs in, it establishes a link to server A of the platform in order to register with it, and this link is a long-lived connection. If the link between the bottom layer of the software and server A is broken, the bottom layer starts an automatic reconnection. Could it be that the bottom layer could not log in to server A, kept reconnecting at regular intervals, and that the timed reconnection code had a problem that caused the high CPU usage?

       So I checked the log output and found that the address of server A was 172.16.72.235. I pinged this address from the customer's laptop, and the ping failed:

So we could be fairly confident that, because the server could not be reached, the bottom layer kept reconnecting to it, which resulted in high CPU usage.

3. Use the Clumsy tool to reproduce the problem in the company's intranet environment

       The guess so far was that, unable to connect to server A, the bottom layer kept reconnecting at regular intervals and caused the high CPU usage. But when we compared the function call stack shown in Process Explorer with the source code, we did not find the problem.

       We could not keep occupying the customer's computer remotely, since the customer still had a lot of work to do on it, so we tried to reproduce the problem in our company environment (our company has built several platform environments for testing, some on the intranet and some on the public network). How could we reproduce it inside the company? It is actually very simple: use a network tool to drop all the traffic exchanged with server A, so that the client software cannot connect to server A. This triggers the underlying automatic reconnection and should reproduce the problem.

       So I thought of clumsy, a very useful and lightweight network environment simulation tool, and used it to drop all the data packets sent by the client to server A.

Clumsy is a tool for simulating poor network conditions. It can drop network traffic outright, or apply a packet loss rate to a target address to simulate a weak network. It is one of the tools we use most in our daily work.

       After opening clumsy, the default filter condition is: outbound and ip.DstAddr >= 127.0.0.1 and ip.DstAddr <= 127.255.255.255, as follows:

We only need to set the target address to the address of server A in the platform 139.224.XXX.XXX (this platform is a test platform built by our company), that is, ip.DstAddr == 139.224.XXX.XXX:

Then check the Drop option, check both Inbound and Outbound, and set the packet loss rate to 100%, so that every data packet sent by the client software to server A is discarded and server A cannot be reached. Note that you may need to run the tool with administrator privileges (especially on Windows 10).

       With server A unreachable, the bottom layer kept reconnecting at regular intervals, and the problem reappeared: our software's CPU usage on my work machine was high as well. This made it more certain that the reconnection code was causing the high CPU usage.

4. According to the function call stack in Process Explorer, analyze the source code, and finally find out the problem

       This high CPU usage bug is very well hidden. It does not occur while the software can connect to server A. It only appears when the software cannot connect to server A and reconnection is triggered.

       So I went to the colleagues who maintain the underlying modules, and together we looked at why their code caused high CPU usage. Because I am responsible for troubleshooting software exceptions, I often help the colleagues who own the underlying modules, such as the protocol, network, audio and video codec, and component modules, to investigate all kinds of software exceptions.

       Using the function call stack displayed in Process Explorer, I found the corresponding location in the source code, but after reading the code in detail I found no obvious flaw! Why does this code run without problems when the server can be reached, yet misbehave when it cannot? The colleague in charge of the underlying module was very busy and said, half jokingly, that since the problem was an unreachable server, the customer should find out why the server was unreachable, and this problem could be left alone for now. That would not do! This was obviously a serious hidden danger, and it had to be fixed whether or not the customer's environment could connect!

       So I took their code and studied it carefully to see what was going on. The reconnection to server A is handled in a thread, and the relevant code is as follows (the problem lies in the line that calls lws_service!):

static void* WSSocketProc( void* pParam )
{
    s_ptContext = CreateContext();
    if ( NULL == s_ptContext )
    {
        MLOG::MLogErr( ML_WEBSOCKET, "[%s] Create Context Failed!!!", __func__ );
        return NULL;
    }
 
    if ( FALSE == OspSemBCreate( &s_hWsiCloseSem ) )
    {
        MLOG::MLogErr( ML_WEBSOCKET, "[%s] s_hWsiCloseSem Inited Failed!!!", __func__ );
        return NULL;
    }
 
    SemGive( g_hWSInitSem );
    while ( TRUE )
    {
        CheckSvrConnect();
 
        SemTake( s_hWsiCloseSem );
        std::vector<u64>::iterator itWsi = s_vecToBeClosedWsi.begin();
        for ( ; s_vecToBeClosedWsi.end() != itWsi; ++itWsi )
        {
            SemTake( g_hSessionIDSem );
            std::map<u64, std::string>::iterator itSessionID = g_mapSessionID.find( *itWsi );
            if ( g_mapSessionID.end() != itSessionID )
            {
                bClientForceClose = TRUE;
                lws_close_free_wsi( (lws *)( *itWsi ) , LWS_CLOSE_STATUS_NOSTATUS );
            }
 
            SemGive( g_hSessionIDSem );
        }
 
        s_vecToBeClosedWsi.clear();
        SemGive( s_hWsiCloseSem );
 
        // The problem is in this line: when there is no websockets connection, this call does not act as a sleep
        lws_service( s_ptContext, LWS_SERVICE_TIMEOUT );
 
        if ( s_bExitSocketProc )
        {
            MLOG::MLogHint( ML_WEBSOCKET, "[%s] SocketProc Thread Exit!!!", __func__ );
            OspSemDelete( s_hWsiCloseSem );
            s_vecToBeClosedWsi.clear();
            break;
        }
    }
 
    lws_context_destroy( s_ptContext );
 
    return NULL;
}

Ordinarily, a thread that processes work in a loop must sleep or block somewhere in each iteration. It cannot be allowed to run flat out, otherwise it uses up every CPU time slice it is given, which leads to high CPU usage, much like an infinite loop. For programmers, this is common sense!

At first glance the code does contain a sleep. The call to the libwebsockets interface lws_service passes a timeout parameter, and the blocking wait inside that call was presumably expected to serve as the sleep.

       Can we call the Sleep interface directly and drop the call to lws_service? The answer is no. Look at the implementation of lws_service and the comment above it:

/**
 * lws_service() - Service any pending websocket activity
 * @context:    Websocket context
 * @timeout_ms:    Timeout for poll; 0 means return immediately if nothing needed
 *        service otherwise block and service immediately, returning
 *        after the timeout if nothing needed service.
 *
 *    This function deals with any pending websocket traffic, for three
 *    kinds of event.  It handles these events on both server and client
 *    types of connection the same.
 *
 *    1) Accept new connections to our context's server
 *
 *    2) Call the receive callback for incoming frame data received by
 *        server or client connections.
 *
 *    You need to call this service function periodically to all the above
 *    functions to happen; if your application is single-threaded you can
 *    just call it in your main event loop.
 *
 *    Alternatively you can fork a new process that asynchronously handles
 *    calling this service in a loop.  In that case you are happy if this
 *    call blocks your thread until it needs to take care of something and
 *    would call it with a large nonzero timeout.  Your loop then takes no
 *    CPU while there is nothing happening.
 *
 *    If you are calling it in a single-threaded app, you don't want it to
 *    wait around blocking other things in your loop from happening, so you
 *    would call it with a timeout_ms of 0, so it returns immediately if
 *    nothing is pending, or as soon as it services whatever was pending.
 */
 
LWS_VISIBLE int
lws_service(struct lws_context *context, int timeout_ms)
{
    return lws_plat_service(context, timeout_ms);
}

In plain terms, the comment says the following:

This function handles any pending websocket traffic. It handles these events in the same way on server and client connections:
1) Accept new connections to our context's server.
2) Call the receive callback for incoming frame data received by server or client connections.

You need to call this service function periodically for all of the above to happen; if your application is single-threaded, you can simply call it from your main event loop.

As can be seen from the above comments, to ensure that the libwebsockets library can send and receive data normally, the lws_service interface must be called.

       Then why does the code above run without problems when it is connected to the server, but misbehave when it cannot connect? My estimate is that when the server cannot be reached, libwebsockets has no valid websockets connection, so a call to lws_service returns immediately instead of waiting for the timeout. In other words, the interface does not play the role of Sleep! That is most likely the cause.

       The final solution is to add a Sleep to the thread function, directly below the call to lws_service:

lws_service( s_ptContext, LWS_SERVICE_TIMEOUT );
 
Sleep( LWS_SERVICE_TIMEOUT ); // Add an explicit sleep so that the current thread is guaranteed some sleep time

This way the thread sleeps for a fixed period in every iteration, whatever lws_service does. After modifying the code, I rebuilt the library, copied it over the one in the software directory, and restarted the software. The problem no longer occurred.

5. Summary

       Software developers need to master a set of common tools. Using these tools to troubleshoot the problems that appear while our software products are running makes troubleshooting far more efficient.

       In this case, the problematic block of code was found by examining the function call stack in Process Explorer, and the problem was reproduced in the company environment with clumsy. It was by relying on these tools that we were able to locate and solve the problem step by step.

Origin blog.csdn.net/chenlycly/article/details/130038272