Example Analysis of Alternative Memory Leaks in Network Programming

This article is shared from the Huawei Cloud Community post "[Network Programming Development Series] An Alternative Memory Leak in Network Programming", by architect Li Ken.

1 Preface

Recently, while investigating a stress-test failure in network communication, I found that the root cause was a "memory leak", although not quite the kind that the term usually suggests. This article walks through the problem from beginning to end.

It also describes some general analysis methods and a fix for this kind of leak. They are offered for reference only, and corrections are welcome.

2 Problem description

Here is the issue description provided by the test team, in short:

After the device goes through the [disconnect from the network -> reconnect] cycle a number of times, it can no longer get back online. Reconnection keeps failing until the device is restarted, after which everything returns to normal.

3 Reproducing the scenario

3.1 Build a stress testing environment

The test department has a dedicated test environment, but I did not want to replicate their whole setup, which would also have required a dedicated test phone.

Their method is to use a mobile phone hotspot as the AP, connect the device to it, and run a script on the phone that toggles the Wi-Fi hotspot on and off, so that the device repeatedly drops off the network and then recovers.

That gave me an idea: I happened to have a portable Wi-Fi device (a 360Wi-Fi) at hand, which can also act as a wireless hotspot. If I could toggle that hotspot from the PC, the same test could be achieved.

With the hardware in place, I started looking for a script to do the toggling.

Writing such a script under Linux is not difficult, but a BAT script for Windows took some searching.

After a while I found a fairly good BAT script online. My modified version is below; its main job is to disable and enable the network adapter at regular intervals.

@echo off

:: Config your interval time (seconds)
set disable_interval_time=5
set enable_interval_time=15

:: Config your loop times: enable->disable->enable->disable...
set loop_time=10000

:: Config your network adapter list
SET adapter_num=1
SET adapter[0].name=WLAN
::SET adapter[0].name=YOUR_ADAPTER_NAME
::SET adapter[1].name=YOUR_ADAPTER_NAME 2

:::::::::::::::::::::::::::::::::::::::::::::::::::::::

echo Loop to switch network adapter state: disabled for %disable_interval_time% seconds, enabled for %enable_interval_time% seconds

set loop_index=0

:LoopStart

if %loop_index% EQU %loop_time% goto :LoopStop

:: Set enable or disable operation
set /A cnt=%loop_index% + 1
set /A result=cnt%%2
if %result% equ 0 (
set operation=enabled
set interval_time=%enable_interval_time%
) else (
set operation=disabled
set interval_time=%disable_interval_time%
)
echo [%date:~0,10% %time:~0,2%:%time:~3,2%:%time:~6,2%] loop time ... %cnt% ... %operation%

set adapter_index=0
:AdapterStart
if %adapter_index% EQU %adapter_num% goto :AdapterStop
set adapter_cur.name=0

for /F "usebackq delims==. tokens=1-3" %%I in (`set adapter[%adapter_index%]`) do (
	set adapter_cur.%%J=%%K
)

:: switch adapter state
call:adapter_switch "%adapter_cur.name%" %operation%

set /A adapter_index=%adapter_index% + 1

goto AdapterStart

:AdapterStop

set /A loop_index=%loop_index% + 1

echo [%date:~0,10% %time:~0,2%:%time:~3,2%:%time:~6,2%] sleep some time (%interval_time% seconds) ...
ping -n %interval_time% 127.0.0.1 > nul

goto LoopStart

:LoopStop

echo End of loop ...

pause
goto:eof

:: function definition
:adapter_switch
set cmd=netsh interface set interface %1 %2
echo %cmd%
%cmd%
goto:eof

Note: adapter[0].name must be set to the name of the network adapter that provides the AP hotspot. If the adapter has a Chinese name, also pay attention to the encoding of the BAT script file; otherwise the adapter name will not be recognized correctly.

3.2 Description of the stress test problem

To pin down exactly what happens around disconnection and reconnection, I added three counters at the point where the network is dropped and re-established: the total number of reconnection attempts, the number of successful reconnections, and the number of failed ones.

In addition, as the issue description shows, the failure is strongly tied to a certain number of iterations, possibly also to running time, and everything returns to normal after a restart. Taken together, these traits point to a very common class of problem: memory leaks.

So before starting the stress test, I also printed the system memory status (total free memory and the historical minimum free memory) after every reconnection attempt, successful or not, so that the memory state at the moment of failure could be examined.
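The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the actual test: the struct and function names are hypothetical, and the two memory figures are passed in as parameters because the call that reports free heap is platform-specific.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical counters for the stress test; names are illustrative only. */
struct reconnect_stats {
    unsigned total;   /* every reconnection attempt */
    unsigned ok;      /* attempts that succeeded */
    unsigned failed;  /* attempts that failed */
};

/* Record one attempt and format a log line that also carries the memory
 * snapshot: current free memory and the historical minimum (evmin). */
static int reconnect_stats_record(struct reconnect_stats *st, int success,
                                  size_t free_now, size_t free_min,
                                  char *line, size_t line_len)
{
    st->total++;
    if (success) {
        st->ok++;
    } else {
        st->failed++;
    }
    return snprintf(line, line_len,
                    "reconnect total=%u ok=%u failed=%u free=%zu evmin=%zu",
                    st->total, st->ok, st->failed, free_now, free_min);
}
```

Logging one such line after every attempt, whether it succeeded or not, makes it possible to line up the failure point with the memory figures afterwards.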

By adjusting the disable_interval_time and enable_interval_time parameters in the stress test script, I reproduced the problem in a fairly short time. Just as the issue described, after a little more than 30 iterations the reconnection could no longer succeed, and the device recovered after a restart.

4 Problem Analysis

Most problems, as long as they can be reproduced, are relatively easy to track down; it just takes some time and digging.

4.1 Simple Analysis

The first suspect is the most likely one, a memory leak, so I started with the memory figures logged during the test.

During the disconnect/reconnect cycle the Wi-Fi hotspot is switched off at certain moments, so reconnection attempts made then are bound to fail, and they succeed once the hotspot is back. As a result, the free memory fluctuated within a range and showed no steady downward trend.

The evmin value (the historical minimum free memory), on the other hand, settled at a fixed value once the problem appeared and stayed there. That should have made me suspect this memory, but on my first pass through the data I did not draw that conclusion. Looking back, it was a warning sign.

What I wanted to verify at the time was whether, when the problem occurred, a memory leak had left the system with too little free memory to complete memory-hungry operations such as joining the hotspot and opening network connections.

Based on the logged memory figures, I was fairly sure of my conclusion: there was no obvious sign of a memory leak, and the reconnection failure was not caused by insufficient memory.

The analysis could not stop there. However, parts of the vendor SDK, such as the hotspot connection logic, are a black box to us, so all we could do was ask the chip vendor whether they had any useful information.

After asking around, the useful information we got back was essentially zero. Your own problem, it turns out, you have to solve yourself!

4.2 Looking for a breakthrough

With insufficient memory ruled out in the scenario above, attention should turn to three questions:

  • Did the device actually connect to the Wi-Fi hotspot? Was it assigned a subnet IP address normally?
  • Once the device is connected to the Wi-Fi hotspot, is its access to the external network normal?
  • If the external network is reachable, why can't the device connect to the server?

The three questions build on one another, each depending on the answer to the previous one.

Take the first question. When the problem is reproduced, the connected device is visible on the PC's Wi-Fi hotspot, along with the subnet IP address assigned to it, so that link is fine.

The second question is also easy to test, because our command line has a ping command built in. Running it revealed an important piece of information:

# ping www.baidu.com
ping_Command
ping IP address:www.baidu.com
ping: create socket failed

A normal ping log looks like this:

# ping www.baidu.com
ping_Command
ping IP address:www.baidu.com
60 bytes from 14.215.177.39 icmp_seq=0 ttl=53 time=40 ticks
60 bytes from 14.215.177.39 icmp_seq=1 ttl=53 time=118 ticks
60 bytes from 14.215.177.39 icmp_seq=2 ttl=53 time=68 ticks
60 bytes from 14.215.177.39 icmp_seq=3 ttl=53 time=56 ticks

What?! "ping: create socket failed": even creating a socket fails!?

My first suspicion: is something wrong with the lwip component?

My second suspicion: have the socket handles run out? Creating a socket here does little more than request a socket resource; no other complex operation is involved.

Thinking it over, the second possibility looked very likely. Combined with the earlier signs, it was the one to investigate.

4.3 Filling in knowledge points

Before pinning down the problem, let's fill in some background knowledge that the later explanation relies on.

4.3.1 The socket handle of lwip

  • Creating a socket

The call chain of the socket function is:

socket -> lwip_socket -> alloc_socket

Implementation of the alloc_socket function:

/**
 * Allocate a new socket for a given netconn.
 *
 * @param newconn the netconn for which to allocate a socket
 * @param accepted 1 if socket has been created by accept(),
 *                 0 if socket has been created by socket()
 * @return the index of the new socket; -1 on error
 */
static int
alloc_socket(struct netconn *newconn, int accepted)
{
  int i;
  SYS_ARCH_DECL_PROTECT(lev);

  /* allocate a new socket identifier */
  for (i = 0; i < NUM_SOCKETS; ++i) {
    /* Protect socket array */
    SYS_ARCH_PROTECT(lev);
    if (!sockets[i].conn && (sockets[i].select_waiting == 0)) {
      sockets[i].conn       = newconn;
      /* The socket is not yet known to anyone, so no need to protect
         after having marked it as used. */
      SYS_ARCH_UNPROTECT(lev);
      sockets[i].lastdata   = NULL;
      sockets[i].lastoffset = 0;
      sockets[i].rcvevent   = 0;
      /* TCP sendbuf is empty, but the socket is not yet writable until connected
       * (unless it has been created by accept()). */
      sockets[i].sendevent  = (NETCONNTYPE_GROUP(newconn->type) == NETCONN_TCP ? (accepted != 0) : 1);
      sockets[i].errevent   = 0;
      sockets[i].err        = 0;
	  SOC_INIT_SYNC(&sockets[i]);
      return i + LWIP_SOCKET_OFFSET;
    }
    SYS_ARCH_UNPROTECT(lev);
  }
  return -1;
}

Note the macro NUM_SOCKETS in the for loop of the function above. Its value is configurable: each platform picks a suitable value based on its actual usage and available memory.

Here is how NUM_SOCKETS is defined:

/* The macro is defined in terms of another macro */
#define NUM_SOCKETS MEMP_NUM_NETCONN

/* Its final expansion is found in lwipopts.h */
/**
 * MEMP_NUM_NETCONN: the number of struct netconns.
 * (only needed if you use the sequential API, like api_lib.c)
 *
 * This number corresponds to the maximum number of active sockets at any
 * given point in time. This number must be sum of max. TCP sockets, max. TCP
 * sockets used for listening, and max. number of UDP sockets
 */
#define MEMP_NUM_NETCONN	(MAX_SOCKETS_TCP + \
	MAX_LISTENING_SOCKETS_TCP + MAX_SOCKETS_UDP)

The definition alone does not tell us the actual value; we will come back to that later.

  • Destroying a socket handle

As we all know, a socket is destroyed through the close interface, whose call chain is:

close -> lwip_close -> free_socket

The implementation of the lwip_close function is as follows:

int
lwip_close(int s)
{
  struct lwip_sock *sock;
  int is_tcp = 0;
  err_t err;

  LWIP_DEBUGF(SOCKETS_DEBUG, ("lwip_close(%d)\n", s));

  sock = get_socket(s);
  if (!sock) {
    return -1;
  }
  SOCK_DEINIT_SYNC(1, sock);

  if (sock->conn != NULL) {
    is_tcp = NETCONNTYPE_GROUP(netconn_type(sock->conn)) == NETCONN_TCP;
  } else {
    LWIP_ASSERT("sock->lastdata == NULL", sock->lastdata == NULL);
  }

#if LWIP_IGMP
  /* drop all possibly joined IGMP memberships */
  lwip_socket_drop_registered_memberships(s);
#endif /* LWIP_IGMP */

  err = netconn_delete(sock->conn);
  if (err != ERR_OK) {
    sock_set_errno(sock, err_to_errno(err));
    return -1;
  }

  free_socket(sock, is_tcp);
  set_errno(0);
  return 0;
}

free_socket is called at the end of lwip_close:

/** Free a socket. The socket's netconn must have been
 * delete before!
 *
 * @param sock the socket to free
 * @param is_tcp != 0 for TCP sockets, used to free lastdata
 */
static void
free_socket(struct lwip_sock *sock, int is_tcp)
{
  void *lastdata;

  lastdata         = sock->lastdata;
  sock->lastdata   = NULL;
  sock->lastoffset = 0;
  sock->err        = 0;

  /* Protect socket array */
  SYS_ARCH_SET(sock->conn, NULL);
  /* don't use 'sock' after this line, as another task might have allocated it */

  if (lastdata != NULL) {
    if (is_tcp) {
      pbuf_free((struct pbuf *)lastdata);
    } else {
      netbuf_delete((struct netbuf *)lastdata);
    }
  }
}

The line SYS_ARCH_SET(sock->conn, NULL); is what releases the socket handle, so that the slot can be allocated again later.

4.3.2 close and shutdown in TCP network programming

This topic is covered here because it turns out to be the key to solving the whole problem.

The conclusions, stated directly:

  • close decrements the descriptor's reference count by 1 and closes the socket only when the count reaches 0. shutdown triggers TCP's normal connection termination sequence regardless of the reference count.
  • close terminates data transfer in both the read and write directions. TCP is full-duplex, and sometimes we need to tell the peer that we have finished sending while the peer still has data to send to us; shutdown makes this possible by closing only one direction.
  • shutdown does not release the socket descriptor. Even after shutdown(fd, SHUT_RDWR), fd is still open, and close(fd) is required in the end.
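The last point can be checked with a small experiment. The sketch below uses the POSIX socket API on a Unix-like host rather than lwip, so it only illustrates the general descriptor semantics; fd_is_open and demo_shutdown_then_close are helper names introduced here for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd still refers to an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Calls shutdown() and then close() on one end of a socket pair and
 * reports whether the descriptor was still open after each step. */
static int demo_shutdown_then_close(int *open_after_shutdown,
                                    int *open_after_close)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }

    shutdown(sv[0], SHUT_RDWR);              /* stops traffic, keeps the fd */
    *open_after_shutdown = fd_is_open(sv[0]);

    close(sv[0]);                            /* actually releases the fd */
    *open_after_close = fd_is_open(sv[0]);

    close(sv[1]);
    return 0;
}
```

After the call, open_after_shutdown is 1 and open_after_close is 0: shutdown alone never gives the descriptor back, which is why a close that fails to complete leaks the handle.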

4.4 In-depth analysis

With the creation and closing of socket handles in lwip understood, let's return to the reproduced problem.

From the key log line we know that the problem is that no new socket can be allocated. The allocation logic contains this condition:

if (!sockets[i].conn && (sockets[i].select_waiting == 0)) {
      /* allocate a new handle number */
      sockets[i].conn       = newconn;
      ...
}

By adding logs, we confirmed that select_waiting is 0, so the problem must be that conn is not NULL.

conn is set to NULL on the lwip_close path (in free_socket), so is lwip_close not being called? Or does some path through it leave the handle not fully released?

To answer this, we need to look at our software architecture. Our chip platforms each use a different version of the lwip component, while the MQTT protocol running in the upper layer is shared by all of them. If the upper layer handled close incorrectly, the problem would appear on every platform. So why does only this platform fail?

There is only one answer: the problem probably lies in this platform's lwip implementation.

Since this lwip had been adapted by the chip vendor, I fetched the native lwip-2.0.2 release for comparison, mainly to see what optimizations and adjustments the vendor had made.

The comparison revealed the problem.

Take the problematic sockets.c as an example, focusing on how sockets are allocated and released:

To describe the vendor's changes more clearly, I have slightly simplified the added code, which mainly consists of a few macro definitions. Judging by their comments, these macros are meant to handle synchronization when sockets are created and closed under multitasking.

#define SOC_INIT_SYNC(sock) do { something ... } while(0)
#define SOC_DEINIT_SYNC(sock) do { SOCK_CHECK_NOT_CLOSING(sock); something ... } while(0)
#define SOCK_CHECK_NOT_CLOSING(sock) do { \
		if ((sock)->closing) { \
			SOCK_DEBUG(1, "SOCK_CHECK_NOT_CLOSING:[%d]\n", (sock)->closing); \
			return -1; \
		} \
	} while (0)

Following the logic: when the upper layer calls lwip_close, SOC_DEINIT_SYNC is invoked, which in turn invokes SOCK_CHECK_NOT_CLOSING. If the socket is not already marked as closing, the release of the socket then runs to completion.

However, our upper-layer MQTT code tears down the TCP link like this:

/*
 * Gracefully close the connection
 */
void mbedtls_net_free( mbedtls_net_context *ctx )
{
    if( ctx->fd == -1 )
        return;

    shutdown( ctx->fd, 2 );
    close( ctx->fd );

    ctx->fd = -1;
}

This closes the TCP link gracefully; recall the knowledge points in section 4.3.2.

Does this call sequence affect those macros?

The answer is yes.

It turns out that in the vendor's adaptation lwip_shutdown also calls SOC_DEINIT_SYNC. As a result, when the upper layer closes a link by calling both shutdown and close, the logic breaks and the close process is left incomplete.

To simplify, the logic is roughly as follows:

1) When shutdown is called, it enters SOC_DEINIT_SYNC. Inside those macros there is a step (sock)->closing = 1; and the call then returns 0 as normal.

2) When close is called, it enters SOC_DEINIT_SYNC again. Since (sock)->closing is already 1, the check fails and -1 is returned, so close returns abnormally.

3) Now look at the logic of the lwip_close function listed in section 4.3.1: the sync macro sits at the top of the function, so this early return happens before netconn_delete and free_socket are reached, and sock->conn is never reset to NULL.

That explains the earlier symptom. The index of each newly allocated socket handle keeps rising while the old handles stay occupied, until all handles are exhausted.
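The three steps above can be modeled with a small toy socket table. This is only a sketch of the behavior described, not the vendor's code: every toy_ name is hypothetical, and the table is shrunk to 4 slots so that exhaustion shows up quickly.

```c
#define TOY_NUM_SOCKETS 4          /* small table so the leak shows quickly */

struct toy_sock {
    int in_use;                    /* stands in for sockets[i].conn != NULL */
    int closing;                   /* flag set by the sync macro */
};

static struct toy_sock toy_table[TOY_NUM_SOCKETS];

/* Models SOC_DEINIT_SYNC: refuse if a teardown was already started. */
static int toy_deinit_sync(struct toy_sock *s)
{
    if (s->closing) {
        return -1;
    }
    s->closing = 1;
    return 0;
}

/* Models alloc_socket: take the first free slot, or fail with -1. */
static int toy_socket(void)
{
    for (int i = 0; i < TOY_NUM_SOCKETS; ++i) {
        if (!toy_table[i].in_use) {
            toy_table[i].in_use = 1;
            toy_table[i].closing = 0;
            return i;
        }
    }
    return -1;                     /* "create socket failed" */
}

static int toy_shutdown(int fd)
{
    return toy_deinit_sync(&toy_table[fd]);
}

static int toy_close(int fd)
{
    if (toy_deinit_sync(&toy_table[fd]) != 0) {
        return -1;                 /* early return: slot is never released */
    }
    toy_table[fd].in_use = 0;
    return 0;
}

/* Counts how many shutdown+close cycles run before allocation fails. */
static int toy_cycles_until_exhausted(void)
{
    int cycles = 0;

    for (;;) {
        int fd = toy_socket();
        if (fd < 0) {
            return cycles;
        }
        toy_shutdown(fd);
        toy_close(fd);
        ++cycles;
    }
}
```

On a fresh table, toy_cycles_until_exhausted() returns 4: each shutdown-then-close cycle leaves one slot occupied, and the fifth toy_socket() call fails, mirroring the "create socket failed" log.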

What is the maximum number of handles, NUM_SOCKETS? You can refer to my previous article on how to inspect the preprocessed code; from it we can clearly see that the value is 38.
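For readers without that article at hand, here is a hedged sketch of another way to see what a macro chain expands to, alongside reading the preprocessor output (for example with GCC's -E option). The three component values below are hypothetical placeholders that add up to 8, not this platform's real configuration (which yields 38), and STR_RAW and STR are helper names introduced here.

```c
/* Hypothetical component values, chosen only for illustration. */
#define MAX_SOCKETS_TCP            4
#define MAX_LISTENING_SOCKETS_TCP  2
#define MAX_SOCKETS_UDP            2

#define MEMP_NUM_NETCONN (MAX_SOCKETS_TCP + \
    MAX_LISTENING_SOCKETS_TCP + MAX_SOCKETS_UDP)
#define NUM_SOCKETS MEMP_NUM_NETCONN

/* Two-level stringification: STR expands its argument fully before
 * STR_RAW turns the result into a string literal. */
#define STR_RAW(x) #x
#define STR(x)     STR_RAW(x)

/* The fully expanded definition as text, useful in a debug log. */
static const char *num_sockets_expansion(void)
{
    return STR(NUM_SOCKETS);
}

/* The value the compiler actually computes from the expansion. */
static int num_sockets_value(void)
{
    return NUM_SOCKETS;
}
```

Printing both at startup shows the expression the preprocessor produced and the number it evaluates to, without having to chase the definitions through several headers.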

That resolves the remaining puzzle: why the problem always appears after a little more than 30 iterations.

My guess is that when the vendor added this synchronization logic, they did not consider that the upper layer might call shutdown first and then close, and that is what caused the problem.

5 Bug fix

The analysis above has located the problematic code; the next step is to fix it.

The trigger is calling shutdown first and then close. But that is upper-layer code shared with other platforms, where it causes no problem. So the upper layer's graceful close of the TCP link must not be removed; the fix belongs in the lwip component itself. As the saying goes: whoever made the mess cleans it up!

The key is to ensure that after shutdown has been called, close still runs through its complete process, so that the occupied socket handle is released.

Therefore SOC_DEINIT_SYNC needs to take a parameter indicating whether the caller is close. If it is not close, only a simplified path is taken, so that the subsequent close can still complete.

When the upper layer calls only close, the close process is also complete.

However, if the upper layer calls close first and then shutdown, the sequence will not work.

Of course, the upper layer should not do that anyway; see the knowledge points in 4.3.2 for details.
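One possible shape of such a fix is sketched below as a toy model, not as the vendor's actual patch. The is_close parameter follows the description above; everything else (the toy_ names and the exact checks) is an assumption made for illustration.

```c
#define TOY_NUM_SOCKETS 4

struct toy_sock {
    int in_use;    /* stands in for sockets[i].conn != NULL */
    int closing;   /* set only once a real close has started */
};

static struct toy_sock toy_table[TOY_NUM_SOCKETS];

/* Sync step with an is_close flag: shutdown (is_close == 0) only checks
 * the state, while close (is_close != 0) also marks the socket as closing. */
static int toy_deinit_sync(int is_close, struct toy_sock *s)
{
    if (!s->in_use || s->closing) {
        return -1;             /* already closed or being closed */
    }
    if (is_close) {
        s->closing = 1;
    }
    return 0;
}

static int toy_socket(void)
{
    for (int i = 0; i < TOY_NUM_SOCKETS; ++i) {
        if (!toy_table[i].in_use) {
            toy_table[i].in_use = 1;
            toy_table[i].closing = 0;
            return i;
        }
    }
    return -1;
}

static int toy_shutdown(int fd)
{
    return toy_deinit_sync(0, &toy_table[fd]);
}

static int toy_close(int fd)
{
    if (toy_deinit_sync(1, &toy_table[fd]) != 0) {
        return -1;
    }
    toy_table[fd].in_use = 0;  /* slot released and reusable */
    toy_table[fd].closing = 0;
    return 0;
}

/* Runs n shutdown+close cycles and returns the highest handle index seen,
 * or -1 if allocation ever fails. */
static int toy_highest_fd(int n)
{
    int highest = -1;

    for (int i = 0; i < n; ++i) {
        int fd = toy_socket();
        if (fd < 0) {
            return -1;
        }
        if (fd > highest) {
            highest = fd;
        }
        toy_shutdown(fd);
        toy_close(fd);
    }
    return highest;
}
```

In this model toy_highest_fd(100) returns 0, because every cycle hands slot 0 back before the next allocation. A shutdown issued after close is rejected with -1, which matches the remark that close-then-shutdown does not work.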

6 Verifying the fix

After the fix, the same procedure must be rerun to confirm that the problem is really gone.

Verification is simple. Change NUM_SOCKETS in sockets.c to a small value, such as 3 or 5, so that the problem would recur quickly. At the same time, print the handle id obtained in alloc_socket and watch whether it keeps rising. In a normal test with no other network connections open, it should stay at 0.

This was verified quickly, and the problem did not reproduce.

Next, restore NUM_SOCKETS to its original value and rerun the original reproduction scenario, to make sure that this was the only cause and that no other code is involved.

Fortunately, the test also passed after the value was restored, which shows that the problem is fully fixed with no side effects. A successful bug fix.

7 Experience Summary

  • Memory leaks come in many forms, but what matters is their essential characteristic: a resource that is acquired and never released;
  • A socket handle leak is also a kind of memory leak;
  • Every optimization is designed for a specific scenario; outside that scenario, its general applicability needs to be reconsidered;
  • Stay alert to key log lines; they point the way when troubleshooting a sprawling problem;
  • An accurate understanding of close and shutdown in the TCP programming interface helps when solving network drop problems;
  • A stress test before release is essential.


Origin blog.csdn.net/devcloud/article/details/124018572