Traffic analysis of a scheduled task failure on the integrated command platform

01 Fault phenomenon

The integrated command platform runs scheduled tasks that regularly transmit data to the headquarters, and the headquarters in turn regularly sends data to the Municipal Traffic Police Detachment. The detachment found that these scheduled tasks had been failing. When it contacted the headquarters, it was told to check its own network. Two days ago, the detachment captured the scheduled task packets on the application server and found that connections were being reset (RST) during the connection process, but it could not confirm at which link the RST was generated.

This analysis uses the NetInside traffic analysis system, which is already deployed in the business environment and provides both real-time and historical raw traffic. The work focuses on troubleshooting the platform's scheduled tasks, drawing on the system's forensics, performance analysis, network quality monitoring, and in-depth network analysis capabilities.

02 Analysis purpose

Analyze why scheduled tasks on the integrated command platform fail to execute, find the root cause, and take corresponding measures to resolve it.

The analysis can determine which factors cause the scheduled tasks to fail, such as network connection problems, system failures, or configuration errors. This helps administrators discover and solve problems promptly and keeps the integrated command platform running normally. It also serves to check and verify that the business system has no other health issues.

03 Deployment Architecture and Traffic Collection

According to the network technicians, the connection is reset (RST) when a street-level computer terminal transmits data to the city bureau by way of the district and county network. The network spans three areas: the street area, the district and county area, and the city bureau area. Based on this, we deployed NetInside in bypass mode in the district and county computer room and mirrored the core switch traffic to it. From this location, all scheduled task traffic can be captured and abnormalities analyzed comprehensively.

04 Analysis process

The following is a detailed analysis of this failure.

Significant time gaps exist in traffic transmission

The per-second traffic distribution trend shows obvious, regular gaps in data transmission between 10.XXX.XXX.78 (hereinafter referred to as 78) and server 10.XXX.XXX.80 (hereinafter referred to as 80). At this stage of the analysis, the cause of these gaps is unknown.
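The gap detection described above can be sketched in a few lines. This is a minimal illustration, not the method NetInside uses: the function name `find_gaps`, the 5-second threshold, and the sample timestamps are all assumptions made for the example.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], min_gap: float = 5.0) -> List[Tuple[float, float, float]]:
    """Return (start, end, duration) for every silence longer than min_gap seconds."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each packet time with the one immediately before it.
    for prev, cur in zip(ordered, ordered[1:]):
        duration = cur - prev
        if duration > min_gap:
            gaps.append((prev, cur, duration))
    return gaps


# Hypothetical packet arrival times in seconds (illustrative, not taken from the capture).
sample = [0.0, 0.2, 0.5, 9.0, 9.1, 9.4, 22.5, 22.6]
gaps = find_gaps(sample, min_gap=5.0)
for start, end, duration in gaps:
    print(f"no traffic from {start:.1f}s to {end:.1f}s ({duration:.1f}s)")
```

In practice the timestamps would come from the packets exchanged between 78 and 80, and the threshold would be tuned to the normal pacing of the scheduled task.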

In-depth analysis of the transmission gaps

Downloading the corresponding packets from the analysis system reveals a large number of RST packets, as shown in the figure below.

Randomly selecting one session (a 5-tuple conversation) from the figure above reveals an anomaly.

In the figure below, all packets before Frame 19695 belong to a normal POST request, but Frames 20756 and 20757 clearly have nothing to do with the preceding connection.

Looking further: in normal data transmission, the TTL of packets flowing from 80 to 78 is 119, as shown in the figure below.

Frame 20756 is also a packet from 80 to 78, but its TTL is 124. A different TTL on packets that claim the same source suggests that they did not travel the same path, and so may not come from the same device.

This RST packet also carries additional application-layer information that can be used for reference.

Frame 20757 is an RST sent in response to that RST packet.
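The TTL comparison above can be automated. The sketch below is an assumption-laden illustration: the `Packet` record and both helper functions are hypothetical, and frames 1 to 3 are invented stand-ins for normal traffic. Only the TTL values 119 and 124 and frame number 20756 come from the analysis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    frame: int
    src: str   # sender, using the article's shorthand ("78" or "80")
    ttl: int
    rst: bool  # True if the TCP RST flag is set


def baseline_ttl(packets: List[Packet]) -> Dict[str, int]:
    """Most common TTL per sender, computed from non-RST packets only."""
    counts: Dict[str, Counter] = {}
    for p in packets:
        if not p.rst:
            counts.setdefault(p.src, Counter())[p.ttl] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def suspicious_rsts(packets: List[Packet]) -> List[Packet]:
    """RST packets whose TTL differs from their sender's usual TTL."""
    base = baseline_ttl(packets)
    return [p for p in packets if p.rst and p.src in base and p.ttl != base[p.src]]


packets = [
    Packet(1, "80", 119, False),     # hypothetical normal packets from 80
    Packet(2, "80", 119, False),
    Packet(3, "80", 119, False),
    Packet(20756, "80", 124, True),  # the abnormal RST described above
]
flagged = suspicious_rsts(packets)
print([p.frame for p in flagged])
```

A TTL mismatch is a heuristic rather than proof, since route changes can also alter the TTL, so flagged packets still need manual inspection.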

The analysis counted a total of 1846 similar RSTs within a window of about 43 minutes.
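To put that count in perspective, the short calculation below converts it to a rate. It assumes the RSTs are spread evenly over the window, which the capture does not guarantee, and the variable names are illustrative.

```python
rst_count = 1846      # abnormal RSTs counted in the capture
window_minutes = 43   # approximate length of the observation window

# Average rate and average spacing, assuming an even spread over the window.
per_minute = rst_count / window_minutes
seconds_between = window_minutes * 60 / rst_count

print(f"about {per_minute:.1f} RSTs per minute, one every {seconds_between:.2f} s on average")
```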

Impact of Abnormal RST on Data Transmission

The analysis found that after each such RST, 78 stalls for a period of time, then initiates a new TCP handshake with 80 and resumes the POST operation.

A few randomly selected samples serve as evidence.

After the abnormal RST, 78 waits for 8.19 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 13.14 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 34.34 seconds before sending a connection establishment request to 80.

After the abnormal RST, 78 waits for 58.43 seconds before sending a connection establishment request to 80.

The remaining cases are not listed one by one.
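Delays like those listed above can be extracted from a capture by pairing each abnormal RST with the next connection request. The following is a minimal sketch: the event list format, the function name `reconnect_delays`, and the timestamps are hypothetical, chosen so that the results match the first two reported delays.

```python
from typing import List, Tuple


def reconnect_delays(events: List[Tuple[float, str]]) -> List[float]:
    """For each RST, return the seconds until the next SYN that follows it."""
    delays = []
    last_rst = None
    for timestamp, kind in sorted(events):
        if kind == "RST":
            last_rst = timestamp
        elif kind == "SYN" and last_rst is not None:
            delays.append(round(timestamp - last_rst, 2))
            last_rst = None  # this RST is now paired with its reconnect
    return delays


# Hypothetical timeline in seconds: (time, event type).
events = [
    (100.00, "RST"), (108.19, "SYN"),
    (150.00, "RST"), (163.14, "SYN"),
]
delays = reconnect_delays(events)
print(delays, "average:", round(sum(delays) / len(delays), 2))
```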

05 Analysis conclusion

When data is transmitted between 78 and 80, a large number of RST packets arrive from an unknown system or node, and these packets cause obvious delays to 78's requests.

06 Solution suggestions

Because the abnormal packets contain address and prompt information, the device sending the RSTs can be located from that information. Its position in the network can also be estimated from the TTL value.
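A rough way to use the TTL is to assume the sender started from one of the common initial values (64, 128, or 255) and subtract the observed value to get a hop count. This is a heuristic sketch, not a definitive method: the initial TTL is an assumption, and the function name `estimate_hops` is illustrative.

```python
from typing import Tuple

# Initial TTL values commonly used by operating systems and network devices.
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl: int) -> Tuple[int, int]:
    """Guess (initial_ttl, hops) by picking the smallest common initial TTL >= observed."""
    if not 0 < observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial, initial - observed_ttl


# TTL values reported in the analysis for packets arriving at the capture point.
print("normal traffic from 80:", estimate_hops(119))
print("abnormal RST:", estimate_hops(124))
```

Under this assumption, normal traffic from 80 crosses about 9 hops before reaching the capture point while the abnormal RST crosses about 4, which would place its sender closer to the capture point than the real server.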

Configure and optimize the policy on the device that sends the abnormal RSTs.

07 Problem verification

Analysis of the abnormal RSTs determined that they were sent by the terminal control software. The management personnel adjusted the software's settings so that it no longer sends RST messages.

After the policy was modified, the packets exchanged between 78 and 80 were downloaded from the NetInside traffic analysis system and inspected; no abnormal RST packets appear.

Likewise, no abnormal RSTs appear over a longer period of time, as shown in the figure below.

This indicates that the terminal control software policy settings are effective.

08 Comparison before and after the abnormality was resolved

Finally, compare the traffic transmission characteristics before and after the abnormality was resolved.

Traffic transmission status comparison

The following is the data transmission between 78 and 80 before the policy adjustment.

The following is the data transmission between 78 and 80 after the policy adjustment.

The comparison shows that after the policy adjustment, data transmission is significantly faster, with no obvious gaps or idle waiting periods in between.

09 Function and value

When users encounter abnormal network problems, on-site technicians often analyze them repeatedly without pinpointing the fault, which costs the user considerable manpower and time. To address this, we used NetInside full-traffic behavior analysis technology, which can quickly uncover the causes of abnormalities and risks. This technology moves users from a passive to an active position and frees them from the time and energy consumed by unnecessary manual fault diagnosis and packet analysis.

Origin blog.csdn.net/NetInside_/article/details/132607753