Troubleshooting and optimizing enterprise network applications through a close look at RST packets

1.  Preface

As the Internet has spread and matured, businesses and applications in every industry have come to depend on it. However, the instability and complexity of the network environment make anomalies more likely. These anomalies can stop a business from running normally, cause trouble for users, and even damage the image and interests of the enterprise. A solution that can quickly diagnose and locate network anomalies is therefore needed.

Network packet capture and analysis is a commonly used debugging technique. It captures network packets and provides detailed analysis and statistics that help users locate network problems quickly. One frequently seen symptom is the TCP RST packet, which usually indicates that a connection has been reset or interrupted, so the service can no longer run normally.

To address this need, the NetInside full traffic backtracking analysis solution (hereinafter referred to as NetInside full traffic) uses a probe device deployed in bypass mode to collect and store all traffic on the specified link, and it runs unattended. It monitors the links that are critical to business systems, applications, and networks around the clock (7×24), and learns baseline values for key performance indicators. When a network anomaly occurs, or a business or network-related metric reaches its threshold, the system raises an alert proactively. The original packets can then be downloaded, decoded, and analyzed, in real time or after the event, to quickly locate the cause of the problem.

NetInside has built a new generation of full-traffic retrospective analysis system tailored to information departments, with a people-oriented design. It makes full use of network packets to establish a comprehensive monitoring view covering important links, key equipment ports, and core services, and it organizes functions and operations according to the workflow of the network department, which makes it applicable to a wide range of scenarios.

This service-oriented, full-traffic approach lets the NetInside system directly reflect how well the network infrastructure supports business applications, and it provides real-time fault diagnosis, alarm triggering, and in-depth post-event analysis of application response. The system simplifies network fault diagnosis and the deployment of new applications and services, reduces operating costs, and offers a fast path to fault location. It provides a reliable data basis for analyzing network attacks, locating anomalies, and evaluating user experience, application performance, and network service quality. Relying on real network traffic, it quickly discovers and defines applications and can verify data correctness and the results of changes, which improves the visibility of network traffic and work efficiency. Its statistical analysis, discovery, and alarm functions simplify operations that used to be tedious and complicated.

2.  Introduction to RST

TCP RST (Reset) is a control message of TCP, which is a transport-layer protocol. It is used to terminate a TCP connection or to reject an invalid connection attempt. An RST is usually sent by one endpoint of a TCP connection; it terminates the connection immediately and notifies the other party that the connection has been reset.

The RST flag is carried in the control bits field of the TCP header. This field occupies one byte, and its sixth bit, counted from the most significant bit, is the RST flag. The following is a diagram of a TCP header:

As the diagram shows, the RST flag is the sixth bit (mask 0x04) of the control bits field, which is the byte at offset 13 of the TCP header when offsets are counted from 0. When this bit is set to 1, the segment is a TCP RST packet.
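To make the bit position concrete, here is a minimal Python sketch that reads the flags byte of a raw TCP header and reports which flags are set. The function names `tcp_flags` and `is_rst` and the sample header (ports 50000 and 80, RST+ACK) are illustrative assumptions, not something taken from the NetInside system.

```python
import struct

# Masks for the flags byte at header offset 13 (offsets counted from 0).
TCP_FLAGS = {
    "CWR": 0x80, "ECE": 0x40, "URG": 0x20, "ACK": 0x10,
    "PSH": 0x08, "RST": 0x04, "SYN": 0x02, "FIN": 0x01,
}


def tcp_flags(header: bytes) -> set:
    """Return the names of all flags set in a raw TCP header."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    flags_byte = header[13]
    return {name for name, mask in TCP_FLAGS.items() if flags_byte & mask}


def is_rst(header: bytes) -> bool:
    """True when the RST bit (mask 0x04) is set."""
    return bool(header[13] & TCP_FLAGS["RST"])


# Hypothetical 20-byte header: source port, destination port, sequence number,
# acknowledgment number, data offset (5 words), flags (RST+ACK = 0x14),
# window, checksum, urgent pointer.
sample = struct.pack("!HHIIBBHHH", 50000, 80, 1, 0, 5 << 4, 0x14, 0, 0, 0)

print(sorted(tcp_flags(sample)))
print(is_rst(sample))
```

In a packet capture tool the same check is usually a display filter on the reset flag; the sketch only shows what that filter tests at the byte level.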

3.  Common causes and confirmation methods of RST

TCP reset packets are commonly generated in the following seven situations. Note that a reset packet tears down an otherwise normal TCP connection, so sending one deliberately should be done with caution. If you are not sure whether you should send a TCP reset packet, check with your network administrator or a security professional first.

3.1.  Intervention of firewalls or network devices on TCP connections

When a firewall or network device interferes with a TCP connection, it may send a TCP reset packet to terminate the connection.

How to confirm: Check the logs of the firewall or network device to see whether it intervened in the TCP connection. If a connection was terminated and a TCP reset packet was received at a time that matches a log entry, the reset was most likely sent by the firewall or network device.

3.2.  The endpoint believes that the connection has failed or is abnormal

When an endpoint of a TCP connection believes that the connection has failed or has abnormal conditions, it can send a TCP reset packet to terminate the connection.

How to confirm: Look for TCP reset packets in the connection logs or in a packet capture. If one endpoint is found to have sent a TCP reset packet, that endpoint most likely considered the connection failed or abnormal.

3.3.  An attacker may send a TCP reset packet

In some attacks, the attacker may send a TCP reset packet to terminate the connection or trick the receiver.

How to confirm: Look for abnormal TCP reset packets in a packet capture. If the source address of a reset packet is unknown or does not match the IP addresses of the normal communication, the connection is likely under attack.

3.4.  Network congestion

When the network is congested, routers or firewalls may discard some data packets, causing TCP connections to time out or fail. At this point, one side of the connection may send a TCP reset packet to terminate the connection.

How to confirm: Look for lost packets or TCP connection timeouts in a packet capture. If TCP reset packets are sent during periods of network congestion, one party to the connection most likely concluded that the connection had failed.

3.5.  Program exception

When an abnormality occurs in the application program or the operating system, the TCP connection may be damaged or invalidated. At this point, one side of the connection may send a TCP reset packet to terminate the connection.

How to confirm: Check the logs of the application or the operating system for abnormal behavior or error messages. If one party sent a TCP reset packet at the time of such an error, the reset was most likely caused by a program exception.

3.6.  Timeout

When a TCP connection has been inactive for a long time, it may be considered dead. At this point, one side of the connection may send a TCP reset packet to terminate the connection.

How to confirm: Use the connection logs or a packet capture to check whether the connection had been inactive for a long time. If one party sent a TCP reset packet after a long period of inactivity, the reset was most likely caused by a timeout.

3.7.  Illegal connection

When some malicious programs or attackers try to establish an illegal TCP connection, one side of the connection may send a TCP reset packet to block the connection.

How to confirm: Look for illegal TCP connection requests or connection behavior in a packet capture. If one party sent a TCP reset packet in response to such a request, the reset was most likely caused by the illegal connection attempt.

4.  RST case analysis

The following two real on-site cases show how analyzing abnormal RST behavior helps users resolve difficult problems quickly.

4.1.  A provincial public security case

4.1.1.  Background

The integrated command platform runs scheduled tasks that regularly transmit data to the headquarters, and the headquarters regularly sends data to the municipal traffic police detachment. The municipal traffic police detachment found that the scheduled tasks kept failing. After the detachment contacted the headquarters, the conclusion was that the detachment needed to check its own network. Two days earlier, it had captured the packets of the scheduled tasks on the application server and found that the connection was being reset (RST) during the connection process, but it could not determine at which point along the path the reset originated.

This analysis uses the NetInside traffic analysis system, which is already deployed in the business environment and provides both real-time and historical raw traffic. The analysis focuses on troubleshooting the scheduled tasks of the integrated command platform, covering evidence collection, performance analysis, network quality monitoring, and in-depth network analysis.

4.1.2.  Environment deployment

The municipal traffic police detachment and the traffic police corps are interconnected over a private network with a dedicated line. At the core switch of the integrated command platform's security domain in the municipal traffic police detachment, traffic is mirrored in bypass mode to the NetInside traffic analysis system, which collects and analyzes the scheduled-task traffic.

4.1.3.  Analysis time

The analysis covers daytime working hours on 2023-05-19.

4.1.4.  Analysis purpose

Analyze the reasons for the execution failure of scheduled tasks in the integrated command platform, find out the root cause of the problem, and take corresponding measures to solve these problems.

The analysis can determine which factors cause the scheduled tasks to fail, such as network connection problems, system failures, or configuration errors. This helps administrators discover and solve problems promptly and keeps the integrated command platform running normally. It also serves to check whether the business system itself has any health issues.

4.1.5.  Detailed analysis

The following is a detailed analysis of this failure.

4.1.5.1.  There are obvious time intervals in traffic transmission

The per-second traffic distribution in the system shows obvious, regular gaps in transmission between 10.61.132.78 (hereinafter referred to as 78) and the server 10.56.81.80 (hereinafter referred to as 80). At this stage the cause of the gaps is unknown.

4.1.5.2.  In-depth analysis of data transmission interval phenomenon

Download the corresponding data packets from the analysis system and find a large number of RST packets, as shown in the figure below.

Picking one session (a conversation identified by its 5-tuple) at random from the figure above reveals an anomaly.

In the figure below, the packets up to Frame 19695 form a normal POST request, but Frames 20756 and 20757 clearly have nothing to do with the preceding exchange.

Continue to analyze.

In normal data transmission, the TTL of data packets flowing from 80 to 78 is 119, as shown in the figure below.

Frame 20756 is also a packet from 80 to 78, but its TTL is 124.

This RST packet also carries additional application-layer information that can be used for reference.

Frame 20757 is an RST sent in reply to the RST in Frame 20756.

In total, 1846 similar RSTs occurred within about 43 minutes, which averages to roughly 43 per minute.
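The TTL comparison described above can be automated over a whole flow. The sketch below is only an illustration under assumptions: packets are represented as plain dictionaries with hypothetical keys `frame`, `ttl`, and `rst`, all from the same sender, and the "normal" TTL is taken to be the most common TTL among non-RST packets. The two low frame numbers are made-up placeholders; only Frame 20756 and the TTL values 119 and 124 come from the case.

```python
from collections import Counter


def suspicious_rsts(packets):
    """Return RST packets whose TTL differs from the flow's usual TTL.

    packets: list of dicts with keys 'frame', 'ttl' and 'rst' (bool),
    all sent by the same source address.
    """
    normal_ttls = [p["ttl"] for p in packets if not p["rst"]]
    if not normal_ttls:
        return []  # no baseline to compare against
    baseline_ttl = Counter(normal_ttls).most_common(1)[0][0]
    return [p for p in packets if p["rst"] and p["ttl"] != baseline_ttl]


flow_80_to_78 = [
    {"frame": 1, "ttl": 119, "rst": False},      # placeholder normal packet
    {"frame": 2, "ttl": 119, "rst": False},      # placeholder normal packet
    {"frame": 20756, "ttl": 124, "rst": True},   # values reported in this case
]

for packet in suspicious_rsts(flow_80_to_78):
    print(packet["frame"], packet["ttl"])
```

A TTL mismatch alone is not proof of injection, since routes can change, but a consistent offset across many RSTs is a strong hint that another device is answering on behalf of the server.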

4.1.5.3.  The impact of abnormal RST on data transmission

The analysis found that after each of these RSTs, 78 stalls for a period of time, then initiates a new TCP handshake with 80 and resumes the POST operation.

A few randomly chosen samples are given below as evidence.

After the abnormal RST, 78 waits for 8.19 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 13.14 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 34.34 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 58.43 seconds before sending a connection establishment request to 80.

The remaining occurrences are not listed one by one.
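Measuring these waits by hand is tedious when there are many resets. As a rough sketch, the delay between each RST and the next connection attempt can be computed from a time-ordered list of events. The event format `(timestamp_seconds, kind)` and the timestamps below are assumptions chosen for illustration; they are arranged so that the results match the first two delays quoted above.

```python
def reconnect_delays(events):
    """Return the delay in seconds from each RST to the next SYN.

    events: list of (timestamp_seconds, kind) tuples sorted by time,
    where kind is 'RST' or 'SYN'. If several RSTs arrive back to back
    (for example an RST answered by another RST), the latest one is used.
    """
    delays = []
    last_rst = None
    for timestamp, kind in events:
        if kind == "RST":
            last_rst = timestamp
        elif kind == "SYN" and last_rst is not None:
            delays.append(round(timestamp - last_rst, 2))
            last_rst = None  # wait for the next reset
    return delays


# Illustrative timeline, not taken from the actual capture.
events = [
    (0.00, "RST"),
    (8.19, "SYN"),
    (100.00, "RST"),
    (113.14, "SYN"),
]

print(reconnect_delays(events))
```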

4.1.6.  Analysis conclusion

While data is being transmitted between 78 and 80, a large number of RST packets arrive from an unknown system or node, and each of them causes an obvious delay in 78's requests.

4.1.7.  Resolution suggestions

Because the abnormal packets contain address and prompt information, the device that sends the RSTs can be located from that information. Its position in the network can also be estimated from the TTL.
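The TTL-based estimate can be sketched as follows. It relies on a heuristic that is an assumption here rather than something stated in the case: senders typically start from an initial TTL of 64, 128, or 255, and each router on the path decrements the value by one. The function name `estimate_hops` is illustrative.

```python
# Commonly seen default initial TTL values (heuristic assumption).
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl):
    """Return (assumed_initial_ttl, hops) for an observed TTL.

    Picks the smallest common initial TTL that is not below the
    observed value and treats the difference as the hop count.
    """
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl


print(estimate_hops(119))  # normal traffic from 80 to 78
print(estimate_hops(124))  # the abnormal RST
```

Under this assumption the normal packets from 80 would have crossed 9 hops, while the abnormal RSTs would have crossed only 4, which suggests a sender that sits closer to 78 than the real server does. The estimate is only a hint and should be cross-checked against the network topology.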

Configure and optimize policies for devices that send out abnormal RSTs.

4.1.8.  Problem verification

Analyzing the abnormal RST, it was determined that it was sent by the terminal control software, and the management personnel made corresponding settings for the software so that it would no longer send RST messages.

After the policy was modified, the packets exchanged between 78 and 80 were downloaded from the NetInside traffic analysis system and inspected; no abnormal RST packets appear in them.

Likewise, no abnormal RST appears over a longer period of time, as shown in the figure below.

This indicates that the new policy settings of the terminal control software are effective.

4.1.9.  Efficiency comparison before and after the fix

Finally, the traffic transmission characteristics before and after the policy adjustment are compared.

4.1.9.1.  Comparison of Traffic Transmission Status

The following is the data transmission between 78 and 80 before the policy adjustment.

The following is the data transmission between 78 and 80 after the policy adjustment.

The comparison shows that after the policy was adjusted, data transmission is significantly faster, with no obvious gaps or idle waiting periods.

4.2.  A case of an electrification bureau

4.2.1.  Background

According to user feedback from the Electrification Bureau, the video system has experienced frequent interruptions during use recently, which affects the user's video experience and work efficiency.

To solve this problem, we deployed the NetInside traffic analysis system in the computer room of the Electrification Bureau and used it to provide real-time and historical raw traffic. The analysis focuses on the video system of the Electrification Bureau, with the goal of finding the specific cause of the interruptions.

4.2.2.  Fault phenomenon

According to user feedback, a video interruption occurred at 7:20 on February 8, 2023, and the video system management terminal showed that the video traffic had been disconnected, as shown in the figure below:

4.2.3.  Environment deployment

At the aggregation switch in the video area of the central computer room of the intranet, the relevant traffic is mirrored in bypass mode to the NetInside traffic analysis system, and all video traffic is collected and analyzed through NetInside.

4.2.4.  Fault analysis

For this problem, let's analyze it in detail.

4.2.4.1.  Traffic collection diagram of the analysis equipment

Communicate with on-site network personnel to understand the real network structure. The following is the traffic collection diagram of the video conference:

The traffic collection diagram mainly includes 1 video system client, 1 video system server, and 1 core switch.

The corresponding MAC addresses are displayed below the video system client and the video system server.

The core switch is labeled with the MAC addresses of its two corresponding ports.

The core switch sends traffic to the traffic analysis system through mirroring.

4.2.4.2.  Analysis ideas

1. Find abnormal data packets in the traffic analysis system and download the data packets.

2. Open the data packet and find the contents of the data packet at the abnormal time point.

3. Analyze the protocol and detailed content of the video data packet.

4.2.4.3.  Detailed analysis

Packet download

Using the traffic analysis system, the abnormal point in time is found quickly and the corresponding packets are downloaded.

Traffic Analysis

The traffic analysis system deduplicates the mirrored traffic, so the analysis system only obtains the traffic sent to the core switch by 10.21.106.11 and the traffic sent to the core switch by 10.21.30.106.

Normal transmission

According to the packet analysis, packets sent by the client 10.21.106.11 to the server 10.21.30.106 carry the MAC addresses src: HuaweiTe_XX:XX:26, dst: HuaweiTe_XX:XX:01.

According to the packet analysis, packets sent by the server 10.21.30.106 to the client 10.21.106.11 carry the MAC addresses src: EdgeCore_XX:XX:a8, dst: HuaweiTe_XX:XX:0f.

The TTL inside the packet is 64.

At the same time, the data packet carries the VLAN protocol, and the VLAN ID is 59.

Abnormal transmission

Packet analysis found an RST at the moment the video was interrupted, and the MAC addresses of that packet were abnormal. For this packet from the server 10.21.30.106 to the client 10.21.106.11, the MAC addresses were src: HuaweiTe_XX:XX:01, dst: HuaweiTe_XX:XX:26. Normally they should be src: EdgeCore_XX:XX:a8, dst: HuaweiTe_XX:XX:0f.

The TTL inside the packet is 127.

No VLAN protocol found.

The comparison shows that the abnormal packet differs from normal traffic in three ways: its MAC addresses are wrong, it carries no VLAN tag, and its TTL is abnormal.
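This kind of field-by-field comparison is easy to express in code. The following sketch compares a packet against a baseline "fingerprint" of normal server-to-client traffic and lists every field that differs. The dictionary keys and the function name `fingerprint_diff` are illustrative assumptions, and the baseline assumes that the TTL of 64 and VLAN ID 59 reported for normal transmission apply to the server-to-client direction. The masked MAC addresses are copied as they appear in this case.

```python
def fingerprint_diff(baseline, packet):
    """Return {field: (expected, observed)} for every field that differs."""
    return {
        field: (expected, packet.get(field))
        for field, expected in baseline.items()
        if packet.get(field) != expected
    }


# Normal traffic from server 10.21.30.106 to client 10.21.106.11.
baseline = {
    "src_mac": "EdgeCore_XX:XX:a8",
    "dst_mac": "HuaweiTe_XX:XX:0f",
    "ttl": 64,
    "vlan": 59,
}

# The RST seen at the moment of the interruption (no VLAN tag present).
rst_packet = {
    "src_mac": "HuaweiTe_XX:XX:01",
    "dst_mac": "HuaweiTe_XX:XX:26",
    "ttl": 127,
    "vlan": None,
}

for field, (expected, observed) in fingerprint_diff(baseline, rst_packet).items():
    print(f"{field}: expected {expected}, observed {observed}")
```

A packet that claims the server's IP address but differs on all of these fields at once was very likely not produced by the server itself.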

4.2.5.  Analysis conclusion

Through the above system analysis, it is found that when the video is interrupted, an abnormal RST data packet appears in the network, resulting in the interruption of the TCP connection between the two communicating parties.

4.2.6.  Recommendations

The analysis of the Electrification Bureau's video traffic shows that abnormal packets are present on the network. It is recommended to investigate further, based on the actual network topology, which device is sending them.

5.  Summary

Besides network packet capture and analysis, other methods can help enterprises diagnose and locate anomalies quickly, such as anomaly detection based on artificial intelligence and machine learning. This approach learns normal business patterns and rules from business data, then detects data that deviates from them so that it can be handled accordingly. Its precision and efficiency can help enterprises find and solve problems quickly.

To sum up, network anomalies can affect any industry, and a solution that diagnoses and locates them quickly is needed. Network packet capture and analysis is a commonly used solution that helps users locate network problems quickly, and it can be complemented by anomaly detection based on artificial intelligence and machine learning.

Origin blog.csdn.net/NetInside_/article/details/131172779