How a senior network engineer troubleshoots a network, from start to finish

Good afternoon, fellow network engineers.

We know that a switch is a very important network device in a local area network, and its working status is closely related to the online status of the client system.

In practice, however, a switch is easily disturbed by outside factors, and the result is all kinds of network failures in the LAN.

In order to ensure the stable operation of the network, we must properly manage and maintain the switches in normal times to avoid switch failures.

Various physical links are used to connect enterprise network devices, and in order to accurately complete the forwarding of data packets, various network protocols are also run between devices.

Network equipment, cables, and network protocols can all cause network failures. Being able to troubleshoot them quickly is a core skill of a senior network engineer.

Have you ever run into a case where a device cannot be pinged because of a bad physical connection? How do you usually deal with it?

Today's article covers network troubleshooting from end to end.


01 What is a network failure?

A network failure is any situation in which the network, for whatever reason, loses a required function and the business is affected.

From the user's point of view, any phenomenon that affects services can be defined as a fault.

Common fault symptoms and classification are as follows:

02 Network Troubleshooting Process

With an unstructured troubleshooting process, you jump back and forth between steps on intuition alone. You may eventually find a solution, but there is no way to guarantee efficiency.

In a complex network environment, unstructured troubleshooting can also introduce new faults, which makes the original problem even harder to solve.

We should therefore follow a structured troubleshooting process to locate the point of failure and correct it.

An enterprise has multiple departments, such as finance, personnel, logistics, marketing, and R&D, and the networks of these departments need to be interconnected and able to reach each other. To keep the network running normally, enterprises typically take one of the following approaches:

1. Large and medium-sized enterprises set up network maintenance departments and build professional network teams.

2. In order to save costs, small enterprises generally do not have a separate network maintenance department, but entrust the network to a professional network maintenance company.

3. Seek help from the equipment manufacturer and call the manufacturer's after-sales service number.

Generally, the first person to notice a network failure is not a network maintenance engineer but someone in a business department.

Network engineers often receive various calls for help, such as "the computer suddenly cannot access the Internet", "the web page cannot be displayed normally", "the game cannot be played"...

Ask the user about the above on the phone and document it in the troubleshooting report.

Why do you need to know the user's position level, job content and other information? Because in an enterprise environment, different levels of users may have different network access rights.

Why do you need to confirm the failure?

The user's description may be ambiguous, and the reported fault may not be the real fault point, so experienced engineers are required to confirm the fault.

Four elements to confirm a fault:

  • Fault subject: which network service is faulty.
  • Fault symptom: what the fault looks like.
  • Fault time: when the user discovered the fault, and when the fault actually began according to the engineer's estimate.
  • Fault location: which network component has failed, with an accurate description of the symptom.
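As a minimal sketch, the four elements can be kept in one small record so that nothing is forgotten when writing the troubleshooting report. The class name `FaultRecord`, its field names, and the sample values below are all hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FaultRecord:
    """One confirmed fault, described by the four elements above."""
    subject: str                      # which network service is affected
    symptom: str                      # what the user actually observes
    reported_at: datetime             # when the user noticed the fault
    location: str                     # which network component is suspected
    estimated_start: Optional[datetime] = None  # engineer's estimate of the real start

    def is_complete(self) -> bool:
        """True when subject, symptom and location are all non-empty."""
        return all(s.strip() for s in (self.subject, self.symptom, self.location))


# Hypothetical example values, not taken from a real report.
record = FaultRecord(
    subject="Internet access",
    symptom="ping to the floor switch times out",
    reported_at=datetime(2023, 1, 1, 9, 0),
    location="fifth-floor switch",
)
print(record.is_complete())
```

An incomplete record is a signal to go back to the user or to the device before moving on to the next stage.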

Finally, you should confirm whether the fault belongs to your own responsibility, that is, whether you have been given the corresponding authority to handle the fault.

What information needs to be collected:

The information-gathering stage is mainly about collecting fault-related information, such as documents and records of network changes.

How this information is collected:

  • Use the device's own operating commands.
  • Use information collection tools, such as packet capture tools and network management software.

Obtain authorization:

  • In a network environment with high information security requirements, collecting information requires authorization, and sometimes a written authorization document must be signed.

Risk assessment during the information-gathering phase:

  • Some information collection operations, such as running "debug" commands on routers or switches, can drive the device's CPU usage too high. In severe cases the device even stops responding to operator commands, which introduces an additional fault.
  • These risks should therefore be evaluated before collecting information. Weigh the risk of introducing a new fault against the urgency of solving the existing one, inform users clearly of the risks, and let them decide whether to go ahead with the higher-risk collection work.

Once the root causes of the fault have been found and eliminated one by one, the troubleshooting process can end.

In a complex network environment, you still need to observe for a period of time after the fault phenomenon disappears. On the one hand, you can confirm that the fault reported by the user has been resolved, and on the other hand, you can confirm that no new fault has been introduced during the troubleshooting process.

Finishing work includes collation of relevant documents, announcement of information, etc. It is necessary to back up all configurations or software that have been changed in the previous network troubleshooting process, and organize and hand over troubleshooting documents.

In order to prevent the same failure from happening again, suggestions for improvement should be made to users at this stage.

03 A classic network troubleshooting case, worth saving

The office building I was in charge of at the time housed several companies. To make sure each company could access the Internet independently, without its connection being affected by the others, I chose a routing switch as the core switch of the building network.

A separate virtual working subnet was also configured on the switch for each unit.

Since each unit is distributed on different floors, the number of companies distributed on each floor is not exactly the same. Some floors have two or three units, and some floors have as many as five or six units.

The unit work subnets on different floors are all connected to the building LAN through the switches on the corresponding floors, and access the Internet through the hardware firewall in the building network.

In order to improve network management efficiency, network administrators usually manage and maintain switches through remote connections.

When I arrived at work that morning and scanned the status of each port on the LAN core switch, I found that one of the ports was down.

I checked the network management records and found that the port was connected to a Layer 2 switch on the fifth floor.

When I tried to log in to that floor switch remotely, the login kept failing. When I used the ping command to test the switch's IP address, the result was "Request timed out".
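The reachability test above can be scripted so it is easy to repeat. The following is a minimal sketch, assuming the standard `ping` utility is on the PATH; the function names are hypothetical, and `192.0.2.1` is a documentation address rather than the real switch.

```python
import platform
import subprocess
from typing import List, Optional


def ping_command(host: str, count: int = 4, system: Optional[str] = None) -> List[str]:
    """Build the ping command line for the current (or given) operating system."""
    system = system or platform.system()
    # Windows ping uses -n for the echo count; Linux and macOS use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def is_reachable(host: str, count: int = 4) -> bool:
    """Return True if ping exits with status 0."""
    result = subprocess.run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


print(ping_command("192.0.2.1", 3, system="Linux"))
```

The exit status is only a rough signal. On some platforms ping can exit with 0 even though the replies are "destination unreachable" messages, so read the actual output when the result matters.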

Just as I was wondering why nobody had reported the fault, the phone rang, and sure enough, users on the fifth floor began reporting network faults one after another.

From these symptoms, I guessed that the floor switch was in an abnormal working state.

So I ran to the site of the faulty switch, cut off the power of the device, and after a while, turned on the power again to restart.

After the boot operation was complete, I used the ping command to test the IP address of the switch.

This time the result was normal, and remote login also worked.

Half an hour later, however, the switch showed the same symptom again, and the ping test failed once more.

Worried, I restarted it several more times, and each time the switch eventually became unreachable by ping again.

01 In-depth investigation


Since repeated restarts did not solve the problem, I assumed the cause was more complicated, even though this kind of symptom is common in day-to-day network management.

So I carried out an in-depth investigation along the following lines:

In the entire building network, only a floor switch on the fifth floor showed this symptom, so my first judgment was that the switch itself might be faulty.

To locate the cause accurately, I replaced the faulty switch with one known to be working normally, to see whether the fault persisted.

At the same time, I connected the suspect switch to an independent network environment.

After half an hour of testing and observation, the suspect switch was working normally in the independent environment, and its IP address could be pinged there.

The replacement switch, on the other hand, could not be pinged once it was connected to the building network.

According to these phenomena, I think that there is almost no possibility of problems with the switches on the fifth floor. After eliminating the status factor of the faulty switch itself, I re-reviewed the network structure and network status of the entire building network.
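The swap test above is a small decision table, which can be sketched as follows. The function name and the returned labels are hypothetical and only make the reasoning explicit.

```python
def localize_fault(suspect_ok_when_isolated: bool, spare_ok_in_network: bool) -> str:
    """Interpret a swap test: suspect device in isolation, known-good spare in its place."""
    if suspect_ok_when_isolated and not spare_ok_in_network:
        # The suspect works elsewhere and a good device fails here: blame the environment.
        return "network"
    if not suspect_ok_when_isolated and spare_ok_in_network:
        # The suspect fails even in isolation and the spare fixes the problem.
        return "device"
    if not suspect_ok_when_isolated and not spare_ok_in_network:
        return "device and network"
    # Both work: the fault may be intermittent, so keep observing.
    return "inconclusive"


# The situation described in this case.
print(localize_fault(suspect_ok_when_isolated=True, spare_ok_in_network=False))
```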

Users on other floors of the building could access the Internet normally. Only some users on the fifth floor had reported that they could not.

Checking the network records for the fifth floor, I saw that five units were located there. The network administrator at the time had installed two floor switches on the fifth floor and cascaded them together.

At the same time, five virtual working subnets are divided in these two switches, which ensures that each unit can work independently in its own virtual working subnet.

Since the corresponding port on the core switch has been down, all the units on the fifth floor cannot access the Internet. Why do only some users report the fault now?

As soon as working hours began, I called the companies that had not reported a failure. They told me that they had only just discovered that their network access was abnormal and were about to ask the building's network administrator for help.

So all units on the fifth floor were unable to access the Internet, which meant the cause of the failure should lie within the virtual working subnets of these units.

Having narrowed the scope to the five units on the fifth floor, I recalled that restarting a fifth-floor switch restored the network temporarily, but the same failure reappeared only half an hour later.

Given this particular symptom, I suspected a broadcast storm, which congested the floor switch after a certain period and eventually blocked the corresponding port on the core switch.

In order to facilitate the analysis of faults, I used the network monitoring tool to analyze the network transmission data packets of the cascading port of the switch on the fifth floor.

I found that both the input and the output packet flows were very large, almost 100 times the normal value, which indicated network congestion on the fifth floor.
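Comparing observed traffic with a known baseline is easy to automate. The sketch below is only an illustration: the port names, the packet rates, and the default threshold factor of 10 are hypothetical, not values from the case.

```python
from typing import Dict, List


def abnormal_ports(
    observed_pps: Dict[str, float],
    baseline_pps: Dict[str, float],
    factor: float = 10.0,
) -> List[str]:
    """Return ports whose packet rate is at least `factor` times their baseline."""
    flagged = []
    for port, rate in observed_pps.items():
        baseline = baseline_pps.get(port)
        if baseline is None or baseline <= 0:
            continue  # no usable baseline for this port, so it cannot be judged
        if rate >= factor * baseline:
            flagged.append(port)
    return flagged


# Hypothetical packets-per-second figures: the uplink is about 100x its baseline.
baseline = {"uplink": 500.0, "port-3": 40.0}
observed = {"uplink": 50_000.0, "port-3": 45.0}
print(abnormal_ports(observed, baseline))
```

Such a check only works if baseline figures were recorded while the network was healthy, which is one more reason to keep network documentation up to date.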

So was the congestion caused by a network virus, or by a network loop?

I decided to observe how the status of the faulty switch's cascade port changed, especially the count of output broadcast packets. If that count keeps increasing every second, then nine times out of ten there is a network loop on the fifth floor.

Following this idea, I connected directly to the faulty switch with a console cable and logged in as a system administrator.

I then used the display command to view the output broadcast packet count on the switch's cascade port, checked it once every second, and compared the results.

After repeated checks, I found that the output broadcast packet count of the faulty switch was indeed increasing continuously.

This showed that there had to be a network loop somewhere in the five units on the fifth floor.
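The comparison of successive readings can be sketched as follows, assuming the counter values have already been copied from the display output at one-second intervals. The function names, the sample readings, and the threshold of 1000 packets per sample are hypothetical.

```python
from typing import List


def counter_deltas(samples: List[int]) -> List[int]:
    """Differences between consecutive readings of a cumulative counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_like_loop(samples: List[int], min_delta: int = 1000) -> bool:
    """True if every interval shows at least `min_delta` new broadcast packets.

    At least four readings (three intervals) are required before deciding.
    """
    deltas = counter_deltas(samples)
    return len(deltas) >= 3 and all(delta >= min_delta for delta in deltas)


# Hypothetical readings of an output broadcast counter, one per second.
readings = [1_000, 51_000, 102_000, 150_000]
print(counter_deltas(readings))
print(looks_like_loop(readings))
```

A steadily climbing counter is strong evidence but not proof. A host flooding broadcasts for another reason would look similar, so the physical topology still has to be checked.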

After carefully checking the two switches on the fifth floor, I found that the physical connection between them was normal.

In addition, each port of the two switches was connected directly to a wall network socket in one of the rooms on the fifth floor.

Logically, as long as no room cascades its own switch, there should be no network loop.

Since a loop had been shown to exist on the fifth floor, someone must have added a switch on their own to extend the network. We only needed to find these added switches and check their physical connections to locate the faulty node quickly.

So I called the network administrators of the various units on the fifth floor and asked them to check each office room and report the rooms that use lower-level switches.

It didn't take long for the results to come back: about 10 rooms were using lower-level switches to extend the network.

I now knew that the loop was most likely in one of these 10 rooms. But which one?

Did I have to visit each room in turn to check its network connections?

After thinking it over, I went back to the network records and looked up the switch port number used by each of these 10 rooms.

I then plugged a network cable directly into each of these switch ports in turn and, from each port, pinged the IP address of the faulty switch.

When I reached the sixth switch port, I found that the ping failed from that port.

To judge whether the switch port itself had a problem, I used the display command in that port's view to check its status information.

The input and output packet counts of that port were clearly abnormal, so I concluded that this port was most likely the cause of the faulty switch's abnormal state.
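Testing the candidate ports one by one is a simple loop. In the sketch below the probe is passed in as a function, so the same loop works with a manual check or a scripted ping. The port names and the fake probe are hypothetical.

```python
from typing import Callable, Iterable, List


def failing_ports(ports: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Run `probe` on every candidate port and return those where it fails."""
    return [port for port in ports if not probe(port)]


# Ten hypothetical candidate ports, one per room that uses an extra switch or hub.
candidates = [f"port-{n}" for n in range(1, 11)]


def fake_probe(port: str) -> bool:
    """Stand-in for 'ping the floor switch from this port'; only port-6 fails."""
    return port != "port-6"


print(failing_ports(candidates, fake_probe))
```

Checking every port rather than stopping at the first failure costs little and also catches the case where more than one room has a wiring problem.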

After consulting the archives, I quickly found the corresponding Internet access room according to the switching port number.

When I arrived, I found that each of the room's only two network sockets was connected to a small hub, and several computers were connected to the two hubs.

Worse still, a network cable connected the two hubs directly to each other, forming a network loop between them.

The broadcast storm caused by this loop eventually blocked the cascade port of the faulty switch, which is why none of the units on the fifth floor could access the Internet.
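If the physical links are written down as pairs of devices, a loop like this one can be found with a union-find pass: any link whose two ends are already connected closes a cycle. This is a minimal sketch with hypothetical device names, modeling the two wall sockets as two links to the floor switch.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def find_loop_links(links: Iterable[Link]) -> List[Link]:
    """Return every link that closes a cycle, using union-find over the devices."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    redundant: List[Link] = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # both ends already connected: this link is a loop
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical model of the room: two sockets, two hubs, and one extra cable.
room_links = [
    ("floor-switch", "hub-1"),
    ("floor-switch", "hub-2"),
    ("hub-1", "hub-2"),
]
print(find_loop_links(room_links))
```

Which link is reported depends on the order of the list. Here the cable between the two hubs is listed last, so it is the one flagged as redundant.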

02 Troubleshooting

After unplugging the extra network cable, I checked the status of the switch port again and found that the input and output packet counts had returned to normal.

When I checked the corresponding port on the core switch again, its status had changed from "down" to "up", and I was able to ping the faulty switch on the fifth floor normally.

This confirmed that the problem was caused by a user in a fifth-floor room extending the network with switches or hubs without authorization. I later asked the users in that room and learned that the room had been cleaned the night before and all the network cables had been unplugged.

When the cleaning was finished, the users, who knew little about network cabling, plugged the cables back in at random, which created the loop.

Most ordinary users only notice whether the network runs smoothly. But a stable network is not achieved overnight. It takes the continuous effort and persistence of front-line network engineers.

Hats off to them!

Compiled by: Lao Yang, a senior network engineer with 10 years of experience. For more practical material for network engineers, follow the official account: Network Engineer Club


Origin blog.csdn.net/SPOTO2021/article/details/132501022