How to quickly locate data center link failures

Accelerating the rollout of "new infrastructure" is becoming a key move for China in pursuing several strategic goals. Data centers (IDCs), as the foundation of digital infrastructure, have entered a "race against time". The talent shortage is growing quickly, and employees of some leading data center companies are even being targeted by name for poaching.

Systems keep getting larger. In the past, a few mainframes and a few minicomputers were enough for a bank, but now everything is distributed. A former manager of mine said yesterday that Amazon has 5 million servers and Alibaba has 2 million. That order of magnitude cannot be compared with what came before.

The technical architecture is also becoming more complex. Applications used to run on a single machine. Now that virtualization and containers have been added, the complexity of the whole application stack has grown dramatically.

At the same time, the real-time requirements for handling failures keep rising. In the past, if a bank ATM could not dispense cash, we filed a complaint, but how many people would be withdrawing money from ATMs at the same moment? Now everyone trades and transfers money on a mobile phone, anytime and anywhere. As soon as something goes wrong there are complaints, and if it is not resolved within a few minutes the complaints get bigger. For example, a customer may say the money was meant for stock trading, that the outage cost them profits, and demand compensation for the loss. These are all real cases.

The impact of failures keeps growing too. A seemingly minor fault in Alibaba Cloud can make the servers in an entire region unusable. In one case an engineer was sure a change would cause no problem and applied a configuration manually; everything went down, with a very large impact. Alibaba Cloud, AWS, Microsoft, Tencent, and Google have all had failures. What matters is how we deal with them. The more complex the system, the more likely a fault is to set off a chain reaction that gets out of control. Failures are becoming less and less predictable, and we cannot know when the next one will occur.

For operation and maintenance personnel the pressure is particularly great, and regulatory requirements are becoming stricter. In a traditional bank, a fault that lasts more than 30 minutes must be reported to the China Banking Regulatory Commission, and the bank president and the head of technology have to give the explanation. That pressure is passed down to the people doing operations and maintenance, who face a life-or-death line: if the fault is not resolved within 30 minutes, this year's performance bonus may be gone. The resources to be maintained keep growing. A traditional bank may run around 50,000 machines, while an Internet company runs a million. The impact of failures is increasing and troubleshooting is getting harder. It used to be easy with just a few servers, but now checking by hand is impossible. Business demands are also arriving faster and faster: a requirement raised in the morning may be expected to be delivered by the evening. This is a very real case.

So operation and maintenance personnel are not only in short supply, they are also under increasing pressure. Operations work used to depend on people doing things by hand, and people inevitably make mistakes. If the work that is done manually can be automated, the negative impact and the risk of manual operations can both be reduced.

Navedi's visualized operation and maintenance management platform for computer room equipment and lines can replace most of this manual work. Take equipment racking as an example. Traditionally, before a device is put on the shelf, operation and maintenance staff have to go on site to check the capacity of the computer room: which U positions in which cabinets can take the device, how many ports remain free on the existing equipment, and so on. With the Navedi platform the on-site survey is no longer needed, because the system mirrors the actual state of the data center computer room. Staff only need to survey and plan in the system. Once the plan is finished, a work order is generated and sent to the people who carry it out. Completing the plan also completes the operation and maintenance record, which keeps the data accurate over the long term, and because the records are archived electronically, the data you need can still be found easily many years later.
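To make the planning step concrete, here is a minimal Python sketch of the kind of capacity check described above: given the U positions already occupied in each cabinet, find the first cabinet with enough contiguous free space for a new device. The `Cabinet` class and both function names are hypothetical and invented for illustration; they are not the Navedi platform's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Cabinet:
    """Hypothetical cabinet record: name, height in rack units, occupied U positions."""
    name: str
    height_u: int                                 # total rack units, e.g. 42
    occupied: set = field(default_factory=set)    # 1-based U positions already in use


def find_free_slot(cabinet, size_u):
    """Return the lowest starting U with size_u contiguous free units, or None."""
    run_start = None
    for u in range(1, cabinet.height_u + 1):
        if u in cabinet.occupied:
            run_start = None          # the free run is broken by installed equipment
            continue
        if run_start is None:
            run_start = u             # a new free run begins here
        if u - run_start + 1 == size_u:
            return run_start
    return None


def plan_placement(cabinets, size_u):
    """Return (cabinet name, starting U) for the first cabinet that fits, or None."""
    for cabinet in cabinets:
        start = find_free_slot(cabinet, size_u)
        if start is not None:
            return cabinet.name, start
    return None
```

In a real system the occupied positions would come from the platform's inventory rather than being typed in by hand; the point is only that once the records match the room, the survey becomes a query.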
We can try to rule out and prevent accidents and uncertainty, but accidents and uncertainty are the norm in life. Accidents will happen, and what we have to do is respond to them directly. What we hope for is to avoid larger losses as far as possible when they occur. Navedi's visualized operation and maintenance management platform for computer room equipment and lines can locate faults quickly. In a data center, the cabling is complicated. For any optical cable or network cable, which devices are its local and opposite ports connected to, which links does it pass through in between, and which equipment room and cabinet is each of those devices in? This information is critical when a fault occurs, and it is also the hardest to find. The Navedi equipment line visualization system can search on any attribute of a cable or a device and quickly locate and retrieve what is being asked for. Compared with the traditional approach of looking through Excel spreadsheets or drawings, this is many times more efficient and greatly shortens troubleshooting time, which improves the overall operation and maintenance efficiency of the data center.
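As a rough illustration of the lookup described above, the following Python sketch traces a link end to end from any starting port: it follows each cable to its opposite port, crosses patch panels through their internal pass-through, and reports the room and cabinet of every device on the way. The record layout (`Port`, the cable tuples, the pass-through and location dictionaries) is an assumption made for this example and does not describe how the Navedi system stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical port identifier: device name plus port name."""
    device: str
    name: str


def trace_link(start, cables, pass_through, locations):
    """Follow cables and patch-panel pass-throughs from `start`.

    cables:       list of (cable_id, Port, Port), one entry per physical cable
    pass_through: dict mapping a patch-panel port to the port on its other side
    locations:    dict mapping a device name to (room, cabinet)
    Returns a list of hops: (cable_id, near_port, far_port, far_location).
    """
    # Index every port by the cable plugged into it and the port at the other end.
    peer = {}
    for cable_id, a, b in cables:
        peer[a] = (cable_id, b)
        peer[b] = (cable_id, a)

    hops = []
    seen = set()          # guards against loops in inconsistent records
    port = start
    while port in peer and port not in seen:
        seen.add(port)
        cable_id, far = peer[port]
        seen.add(far)
        hops.append((cable_id, port, far, locations.get(far.device)))
        # Continue through a patch panel; None means the far end is a terminal device.
        port = pass_through.get(far)
    return hops
```

Because the cable index is built in both directions, the same function works whether the person troubleshooting starts from the switch side or the server side of the link.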
Yao Yanyan


Origin blog.csdn.net/NWVDI/article/details/109645117