Network troubleshooting examples in complex network environments

A customer's network environment can be complex, and the network equipment in it is often configured with many security rules. As a result, our software and hardware products may run into problems such as connection failures and packet loss once they are deployed there. This article briefly describes two typical network failures we have encountered, for reference.

1. Overview

Before project acceptance, our hardware and software must be deployed in the customer environment for a period of trial operation and debugging. The first goal is to confirm that the products operate normally; the second is to confirm that their functions and performance meet the customer's requirements. Whether the product can adapt to the customer's complex network environment is a very important criterion: the products must work across the customer's network, and the joint debugging tests for every service must complete successfully.

When a product does not run normally in the customer's network, troubleshooting is difficult. This is especially true for customers with a high security level: their networks are extremely complex, and for security reasons they cannot provide a complete network topology. We can only start from our own products and investigate step by step, using Wireshark to capture packets on the product side and the server side, and asking the customer's network management department to assist.

Troubleshooting complex network problems is laborious. It takes a lot of manpower and time to work through the network module by module and segment by segment, and professional technical support and R&D personnel have to be sent to the customer site. Worse, some problems are intermittent: sometimes they occur easily, and sometimes the system has to run for a long time before they reproduce. This stretches the investigation further, and on some projects it has lasted a month or even several months.

2. Example 1: Multicast packets blocked by a network device, and IP-MAC address binding

A primary server and a backup server are deployed in a customer environment (the real environment has multiple servers running different services; the description is simplified here to make the problem easier to explain). When the primary server has a problem, service is quickly switched over to the backup server. The two servers use the same IP address. When it takes over, the backup server sends a broadcast packet claiming the IP, to notify the other devices on the network. This announcement was sent by multicast. However, after sending it, the backup server often failed to connect to the network, which caused intermittent problems in the product's services.

Analysis showed that when we sent the IP-claiming announcement by multicast, the Huawei switch mishandled the packet and disabled the server's port on the switch for a period of time. We believed this was a bug in the Huawei switch. The customer, however, trusted Huawei's product quality: Huawei is a large vendor that specializes in network equipment, and its equipment is carrier-grade. The customer therefore did not believe the switch was at fault and insisted that the problem was in our servers and had to be solved on the server side. In the end, we changed the announcement from multicast to unicast, which solved the problem of the backup server being unable to claim the IP.
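The source does not say what kind of packet the IP-claiming announcement is. Assuming it is an ARP announcement (a common way to tell the network that an IP now sits behind a different MAC), the sketch below shows the difference between addressing such a frame to everyone and addressing it to one known peer. The function name and all MAC and IP addresses are hypothetical, and this is not the product's actual implementation.

```python
import socket
import struct
from typing import Optional

BROADCAST_MAC = bytes.fromhex("ffffffffffff")
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_announcement(src_mac: str, shared_ip: str,
                           peer_mac: Optional[str] = None,
                           peer_ip: Optional[str] = None) -> bytes:
    """Build an Ethernet + ARP reply frame saying 'shared_ip is at src_mac'.

    Without a peer, the frame is addressed to the broadcast MAC.
    With peer_mac and peer_ip, it is addressed to that single peer (unicast).
    """
    src = mac_to_bytes(src_mac)
    sender_ip = socket.inet_aton(shared_ip)
    if peer_mac is None or peer_ip is None:
        dst, target_ip = BROADCAST_MAC, sender_ip
    else:
        dst, target_ip = mac_to_bytes(peer_mac), socket.inet_aton(peer_ip)

    ethernet = dst + src + struct.pack("!H", ETHERTYPE_ARP)
    # Hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 2 (reply).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += src + sender_ip + dst + target_ip
    return ethernet + arp


# Hypothetical addresses: backup server MAC, shared service IP, gateway.
broadcast_frame = build_arp_announcement("02:00:00:00:00:02", "192.0.2.10")
unicast_frame = build_arp_announcement("02:00:00:00:00:02", "192.0.2.10",
                                       peer_mac="02:00:00:00:00:fe",
                                       peer_ip="192.0.2.1")
```

The sketch only builds the frames. Actually sending them requires a raw socket and elevated privileges, and a unicast announcement has to be repeated for every peer that needs to learn the new mapping.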

The backup server still could not connect after claiming the target IP, however. Further investigation showed that because the customer's network security level is high, many security rules are configured on the network devices, including rules that bind IP addresses to MAC addresses. If the MAC address behind a bound IP changes, the network device blocks the device. In this case, the shared IP was bound to the primary server's MAC address. After the switchover, the MAC address behind that IP changed, so the network device intercepted the traffic and the backup server still could not connect. We therefore asked the customer to exempt the IP shared by the primary and backup servers on the network equipment, so that it is not blocked when its MAC address changes.
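The effect of the binding rule and of the exemption can be illustrated with a toy model. This is only a sketch of the logic described above, not how any particular network device implements it; the function name, the rule that unbound IPs are allowed, and all addresses are assumptions made for illustration.

```python
def is_allowed(bindings: dict, exempt_ips: set, src_ip: str, src_mac: str) -> bool:
    """Toy IP-MAC binding check.

    bindings maps an IP to the only MAC allowed to use it.
    exempt_ips lists IPs that skip the check entirely.
    """
    if src_ip in exempt_ips:
        return True
    bound_mac = bindings.get(src_ip)
    # Assumption: an IP with no binding is not restricted.
    return bound_mac is None or bound_mac.lower() == src_mac.lower()


# Hypothetical values: the shared IP is bound to the primary server's MAC.
bindings = {"192.0.2.10": "02:00:00:00:00:01"}
backup_mac = "02:00:00:00:00:02"

blocked_after_failover = not is_allowed(bindings, set(), "192.0.2.10", backup_mac)
allowed_with_exemption = is_allowed(bindings, {"192.0.2.10"}, "192.0.2.10", backup_mac)
```

With the binding in place the backup server's traffic is rejected after failover; once the shared IP is exempted, the same traffic passes.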

3. Example 2: ICMP redirect handling is disabled in the TCP/IP stack of the device's Linux system, so the device ignores ICMP redirect messages from the gateway

In one customer's environment, multiple sets of our software and hardware products were deployed. The products under one particular network node suffered serious packet loss when communicating with the server, while products under other network nodes were fine. The customer also used equipment from other vendors, and that equipment worked without problems when moved to the problematic node; only our product was affected. The problem was very strange, and at the start of the investigation we had no idea what was causing it.

The device was configured with a gateway and could reach the server through it, but connections going out through the gateway showed obvious packet loss, which caused obvious service failures. A Wireshark capture then showed that when the device sent a connection request through the gateway, the gateway replied with an ICMP redirect message. The message tells the device not to send this traffic through the gateway, but through the IP address carried in the message instead. However, checking the route with tracert showed that the connection still went out through the gateway, not through the IP specified in the ICMP redirect message, which was puzzling.
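To make the redirect message concrete, here is a minimal sketch that decodes the fields of an ICMP redirect as laid out in RFC 792: type 5, a code, and the address of the gateway that should be used instead. The function name and the sample addresses are hypothetical; the input is the ICMP portion of a packet, for example bytes exported from a capture.

```python
import socket
import struct

ICMP_TYPE_REDIRECT = 5
REDIRECT_CODES = {
    0: "redirect for the network",
    1: "redirect for the host",
    2: "redirect for the type of service and network",
    3: "redirect for the type of service and host",
}


def parse_icmp_redirect(icmp: bytes):
    """Return the redirect fields, or None if this is not an ICMP redirect."""
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum = struct.unpack("!BBH", icmp[:4])
    if icmp_type != ICMP_TYPE_REDIRECT:
        return None
    return {
        "code": code,
        "meaning": REDIRECT_CODES.get(code, "unknown"),
        # Bytes 4..7 carry the address the sender should route through.
        "gateway": socket.inet_ntoa(icmp[4:8]),
    }


# Hypothetical message: host redirect pointing at 192.0.2.254.
# The checksum is left as zero because the parser does not verify it.
sample = bytes([5, 1, 0, 0]) + socket.inet_aton("192.0.2.254") + bytes(28)
redirect = parse_icmp_redirect(sample)
```

The rest of the message (after the first 8 bytes) holds the IP header and the leading bytes of the original datagram, which identify the connection the redirect applies to.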

Later, on-site technical support ran the sysctl -a | grep redirect command on the device's Linux system to list all redirect-related options of the TCP/IP protocol stack, and found that they were all disabled.

Because the redirect options of the device's built-in Linux system were all disabled, the device did not process the ICMP redirect message from the gateway and simply discarded it. Connections initiated through the protocol stack therefore kept going through the gateway instead of through the IP the gateway specified.
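The same check can be scripted. The sketch below scans text in the `key = value` form printed by `sysctl -a` and reports the redirect-related keys that are set to 0. The function name is hypothetical, and the sample text is illustrative only; it is not the output captured on the customer's device.

```python
def disabled_redirect_options(sysctl_output: str) -> list:
    """Return redirect-related sysctl keys whose value is 0."""
    disabled = []
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        key, value = key.strip(), value.strip()
        if "redirect" in key and value == "0":
            disabled.append(key)
    return disabled


# Illustrative input, not the real device output.
sample_output = """\
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.eth0.accept_redirects = 0
net.ipv4.ip_forward = 0
"""
disabled = disabled_redirect_options(sample_output)
```

On a live system the input could come from running `sysctl -a` and passing its standard output to the function.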

It was later confirmed that the Linux system installed on the device had been trimmed down and reconfigured, and that the redirect options of the TCP/IP protocol stack had been deliberately turned off for security reasons. This matched the conclusion of the investigation. The redirect options were then turned on with a script, after which the device connected to the server without problems and all services ran normally and stably.
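The article does not show the script or say which options it changed. As a rough sketch, assuming the relevant option is `accept_redirects` and that it is set with `sysctl -w`, the commands such a script would run could be generated like this. The function name and the interface name `eth0` are hypothetical.

```python
def enable_redirect_commands(interfaces: list) -> list:
    """Build sysctl commands that turn on accept_redirects.

    Covers the 'all' and 'default' entries plus each named interface.
    """
    names = ["all", "default", *interfaces]
    return [f"sysctl -w net.ipv4.conf.{name}.accept_redirects=1"
            for name in names]


commands = enable_redirect_commands(["eth0"])
for command in commands:
    print(command)
```

Values set with `sysctl -w` last only until the next reboot; making them permanent normally means adding the same keys to `/etc/sysctl.conf` or a file under `/etc/sysctl.d/`. Since the options were originally disabled for security reasons, enabling them is a trade-off that should be agreed with whoever owns the device's hardening policy.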

Origin blog.csdn.net/chenlycly/article/details/124044427