Difficulties: ping fails within the same network segment, and connections cannot be established across segments. How do we fix it?

 

I previously shared two articles about tricky problems in production, and they were received far better than expected. After so many years of work I have accumulated plenty of material on such problems, and it deserves a proper summary, so today I will continue the series with another hard problem from day-to-day work. Sensitive details are omitted; I only summarize the symptoms and the lessons learned so that you can avoid the same pitfalls.

In recent years, with the continued rise of cloud computing, many enterprise data centers have expanded on a large scale. Many network plans put the whole network area used for backup and monitoring equipment into one large subnet, with no further subdivision. When the monitoring servers were scaled out along with the production service nodes, two problems appeared in practice.

  1. Hosts inside the same subnet cannot ping each other
  2. Hosts in different subnets can ping each other, but cannot establish a connection

These two symptoms actually correspond to two separate problems. Either one alone can be solved quickly, but with two simple problems mixed together it took a lot of effort to sort out. The details are as follows:

Why can't hosts in the same subnet ping each other?

At first we tried to treat the two problems as one, assuming they had the same cause. We made many attempts along that line, and then we found a key clue.

1. The arping clue

Running the arping command against the peer restores connectivity between two nodes in the same subnet. Concretely, run the following on both machines:

  1. ping the peer IP: no reply
  2. arping the peer IP
  3. ping the peer IP again: communication is restored
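The three steps above can be sketched as a small shell helper. This is only an illustration, not the exact commands used in the incident: the function names, the address 192.0.2.10 and the interface eth0 are hypothetical, and the flags assume the iputils versions of ping and arping (arping usually needs root).

```shell
#!/bin/sh
# Sketch of the ping -> arping -> ping check described above.
# Assumes iputils ping/arping; all names here are illustrative.

# Returns 0 if the peer answers one ICMP echo within 2 seconds.
ping_ok() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Sends three ARP requests to peer $1 on interface $2.
arp_probe() {
    arping -c 3 -I "$2" "$1" >/dev/null 2>&1
}

# Prints which of the three situations applies to the peer.
diagnose_peer() {
    peer=$1
    iface=$2
    if ping_ok "$peer"; then
        echo "reachable"
        return 0
    fi
    # ping failed: probe with ARP, then try ping once more
    arp_probe "$peer" "$iface"
    if ping_ok "$peer"; then
        echo "restored-by-arping"
    else
        echo "still-unreachable"
    fi
}

# Example (hypothetical address and interface):
#   diagnose_peer 192.0.2.10 eth0
```

A result of restored-by-arping matches the symptom in this article and points at the ARP table rather than at the network equipment.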

2. Why is there nothing abnormal in the system log?

This let us pin the problem down, more or less, to the ARP table. Traffic between nodes in the same subnet does not pass through the gateway, so the fact that arping, which refreshes the local ARP table, restored connectivity let us largely rule out the network equipment and focus on the ARP configuration of the nodes themselves.

The strange part was this: if an ARP problem was blocking the network, why was there nothing in the system's syslog? I looked into the logging question first. It turned out that Linux rate-limits kernel log output, as follows:

  1. /proc/sys/kernel/printk_ratelimit sets the length of the rate-limit interval in seconds; the default value is 5
  2. /proc/sys/kernel/printk_ratelimit_burst sets the maximum number of messages printed within that interval; the default value is 10

That is, by default at most 10 messages are printed every 5 seconds.

We therefore concluded that the error messages were most likely being rate-limited and never written out.
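To check the limits on a given machine, the two files can simply be read. The helper below is a minimal sketch; the function name is hypothetical, and the optional directory argument exists only so the function can be tried against sample files instead of /proc.

```shell
#!/bin/sh
# Prints the kernel log rate limit in words.
# $1 is the directory holding the two files (default: /proc/sys/kernel).
describe_printk_limit() (
    dir=${1:-/proc/sys/kernel}
    interval=$(cat "$dir/printk_ratelimit")
    burst=$(cat "$dir/printk_ratelimit_burst")
    echo "at most $burst messages every $interval seconds"
)

# Example:
#   describe_printk_limit
```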

3. The default Linux ARP table size is the root cause

At this point the picture was fairly clear: it was a configuration problem on the servers in that subnet. Further checking confirmed that the default Linux ARP table size is 1024 entries, while the common practice of placing the entire backup and monitoring network area in a single subnet pushed the number of neighbors past that limit, so the ARP table overflowed. In addition, the affected monitoring and backup servers had a lot of security software installed that writes to syslog through printk. With kernel log output rate-limited, the message reporting the full ARP table never made it into the system's syslog.
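A rough way to see how close a node is to the limit is to compare the number of ARP entries with gc_thresh3, the hard upper bound. The sketch below assumes the usual /proc/net/arp layout (one header line, then one entry per line); the function names and the 90% warning level are arbitrary choices for illustration.

```shell
#!/bin/sh
# Counts entries in a /proc/net/arp style file (first line is a header).
arp_entry_count() (
    file=${1:-/proc/net/arp}
    awk 'NR > 1 { n++ } END { print n + 0 }' "$file"
)

# Prints "entries/limit", plus a warning once the table is 90% full.
# $1: ARP table file, $2: file holding the gc_thresh3 value.
arp_usage() (
    arp_file=${1:-/proc/net/arp}
    limit_file=${2:-/proc/sys/net/ipv4/neigh/default/gc_thresh3}
    count=$(arp_entry_count "$arp_file")
    limit=$(cat "$limit_file")
    echo "$count/$limit"
    if [ $((count * 10)) -ge $((limit * 9)) ]; then
        echo "warning: ARP table nearly full"
    fi
)

# Example:
#   arp_usage
```

On a node in a very large subnet, a count that sits at or near the limit would be consistent with the overflow described above.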

4. Remediation

  1. Adjust the kernel parameters

vi /etc/sysctl.conf

Set the following parameters:

net.ipv4.neigh.default.gc_thresh1 = 512
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 10240

  2. Apply the configuration

sysctl -p
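Before running sysctl -p it can be worth confirming that the file really contains the intended values and that the three thresholds are in increasing order. The following is a minimal sketch with hypothetical function names; it only parses simple "key = value" lines and ignores lines starting with "#".

```shell
#!/bin/sh
# Prints the last value assigned to key $2 in sysctl.conf style file $1.
conf_value() (
    awk -F= -v key="$2" '
        /^[ \t]*#/ { next }
        {
            k = $1; v = $2
            gsub(/[ \t]/, "", k); gsub(/[ \t]/, "", v)
            if (k == key) val = v
        }
        END { print val }
    ' "$1"
)

# Succeeds only if gc_thresh1 < gc_thresh2 < gc_thresh3 in file $1.
thresholds_ordered() (
    p=net.ipv4.neigh.default
    t1=$(conf_value "$1" "$p.gc_thresh1")
    t2=$(conf_value "$1" "$p.gc_thresh2")
    t3=$(conf_value "$1" "$p.gc_thresh3")
    [ -n "$t1" ] && [ -n "$t2" ] && [ -n "$t3" ] &&
        [ "$t1" -lt "$t2" ] && [ "$t2" -lt "$t3" ]
)

# Example:
#   thresholds_ordered /etc/sysctl.conf && echo "thresholds look sane"
# After sysctl -p, the live value can be read back with:
#   sysctl net.ipv4.neigh.default.gc_thresh3
```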

Once we had established that the ARP table was the reason machines in the same subnet could not reach each other, it was also basically clear that the failure to establish telnet connections across subnets had a different cause, because cross-subnet traffic goes through the gateway.

Why does telnet fail across subnets?

With the same-subnet problem solved, telnet across subnets still failed. Judging by the outcome, this was a fairly typical low-level mistake. Here is a brief account of the troubleshooting process.

1. Symptom: cross-subnet ping works, but newly deployed nodes cannot connect to the target port, while previously deployed nodes still hold long-lived connections that have not been dropped yet cannot actually transmit data

  1. ping the peer IP across subnets: it succeeds
  2. New monitoring node: telnet to the monitoring port on the peer IP fails
  3. Old monitoring node: netstat -an | grep <monitoring port> shows the connection as ESTABLISHED, but no traffic flows over it
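The netstat check on the old node can be made a little more precise than a plain grep. The helper below is a sketch that reads netstat -an output on standard input and assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name and port 10050 are hypothetical.

```shell
#!/bin/sh
# Reads `netstat -an` output on stdin and prints, for each TCP connection
# whose local or remote port equals $1: state, Send-Q, local, remote.
conns_for_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && NF >= 6 {
            n = split($4, l, ":"); m = split($5, r, ":")
            if (l[n] == port || r[m] == port)
                print $6, $3, $4, $5
        }
    '
}

# Example (hypothetical monitoring port):
#   netstat -an | conns_for_port 10050
```

An ESTABLISHED connection whose Send-Q stays above zero can hint that data is queued but never acknowledged, which fits the "link is up but nothing gets through" symptom.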

2. Checking the service list on the monitoring node shows that iptables is running

  1. Run chkconfig --list
  2. The iptables service turns out to be running
  3. After stopping iptables, it is automatically started again after a while
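On a host with many services, the chkconfig output can be filtered for the one service of interest. The sketch below reads chkconfig --list output on standard input and assumes the classic SysV format (service name followed by fields such as 3:on); the function name is hypothetical. Note that chkconfig only shows whether a service is enabled per runlevel, so whether it is currently running would need a separate check such as service iptables status.

```shell
#!/bin/sh
# Reads `chkconfig --list` output on stdin and prints the runlevels in
# which service $1 is switched on, e.g. "2 3 4 5" (empty if none).
enabled_runlevels() {
    awk -v svc="$1" '
        $1 == svc {
            out = ""
            for (i = 2; i <= NF; i++) {
                split($i, kv, ":")
                if (kv[2] == "on") out = out (out == "" ? "" : " ") kv[1]
            }
            print out
        }
    '
}

# Example:
#   chkconfig --list | enabled_runlevels iptables
```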

3. Confirming a misconfigured security software policy

Since our standards do not enable iptables by default, we finally confirmed that the security software had mistakenly included the backup and monitoring subnets in its list of hosts on which iptables is started. The iptables whitelist was also empty, which is why connections between the monitoring ports and the production nodes failed. In fact, this had nothing to do with crossing subnets.
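One way to spot an empty whitelist is to look at a chain's default policy together with the number of ACCEPT rules in it. The sketch below reads iptables -S output on standard input; the function name is hypothetical, and it assumes the whitelist is implemented as a default-deny policy plus ACCEPT rules, which the article does not state explicitly.

```shell
#!/bin/sh
# Reads `iptables -S` output on stdin. Prints "POLICY ACCEPT_RULES" for
# chain $1, e.g. "DROP 0" for a default-deny chain with no ACCEPT rules.
chain_summary() {
    awk -v chain="$1" '
        $1 == "-P" && $2 == chain { policy = $3 }
        $1 == "-A" && $2 == chain {
            for (i = 3; i < NF; i++)
                if ($i == "-j" && $(i + 1) == "ACCEPT") accepts++
        }
        END { print policy, accepts + 0 }
    '
}

# Example (needs root):
#   iptables -S | chain_summary INPUT
```

Under that assumption, a result of DROP 0 on the INPUT chain would mean every incoming packet is discarded, including replies on the monitoring port.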

4. iptables is essentially a kernel module based on the hook mechanism

For hooks, please refer to my previous article "Difficulties: Why does the CPU usage of anti-virus software continue to increase under Linux". iptables is built on netfilter hooks. It does not forcibly tear down an existing connection; it only blocks the corresponding traffic. The old nodes had established their long-lived connections before iptables was started, so although those connections were not in the iptables whitelist they could persist for a while, but no data could be sent over them. The new nodes simply could not establish a connection at all.


Origin blog.csdn.net/BEYONDMA/article/details/114730644