The Influence of Peripheral Environment on Linux System Performance

When I joined this company at the beginning of the year, the server room was a small space of about 15 square meters holding two server cabinets and one network cabinet, all packed with servers, with a single underpowered air conditioner struggling to push out cold air. That was where the nightmare began.

At the end of March, Beijing was still fairly cold, but our server room was not. The air coming out of the air conditioner was warmer than the wind outside, and I estimated the room temperature at around 30 degrees Celsius. Temperature became one of my biggest worries.

Over the next two months the server failure rate was very high, with failures every week. By my count, from the day I joined until the temperature problem was solved there were 29 server failures: 2 hardware failures, 5 performance problems, 7 hangs, 4 crashes, and 11 restarts.

During this period my operations work was almost entirely reactive. While the big problem of server room temperature remained unsolved, every time a server failed we still had to pin down the cause and report it to management. Restarts, crashes, and hardware failures were clearly recorded in HP iLO. The unexplained performance problems and hangs were harder: the Linux logs had corresponding entries, but most of them were hard to interpret, and none said outright that temperature was the cause (even though the room was hot at the time).

The two hardware failures were a dead power supply module and a dead network card. Management questioned the quality of the hardware and wanted to know why the parts had failed, but was not concerned that the room temperature had passed 30 degrees. I charted the temperature readings and showed them to management; the answer was that a new server room was being planned and that we should hold out a little longer.

As time went on, the weather in Beijing warmed up and the server room climbed to 34, 35, and then 36 degrees. Temperatures inside the server chassis stayed above 40 degrees for long stretches, and a server would go down at 45 degrees. With the servers clearly unable to take much more, I reported to management again and we bought a second air conditioner. Both units were set to their lowest temperature, and after that the servers almost never had problems again.
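
Since none of the logs named temperature as the cause, it can help to record temperatures directly. The following is a minimal Python sketch, not the tooling used at the time. It assumes the machine exposes thermal zones under `/sys/class/thermal` (many servers report sensors only through their management controller instead, and these zones usually reflect CPU or board sensors rather than room temperature). The 40 and 45 degree thresholds come from the figures above; the function names are made up for illustration.

```python
from pathlib import Path

# Thresholds taken from the account above: chassis temperatures stayed above
# 40 degrees C for long stretches, and servers went down at about 45.
WARN_C = 40.0
CRIT_C = 45.0


def read_zone_temps(root="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for temp_file in sorted(Path(root).glob("thermal_zone*/temp")):
        try:
            # sysfs reports millidegrees Celsius as a plain integer.
            millideg = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # zone not readable or not numeric; skip it
        temps[temp_file.parent.name] = millideg / 1000.0
    return temps


def classify(temp_c, warn=WARN_C, crit=CRIT_C):
    """Map a temperature to 'ok', 'warn', or 'crit'."""
    if temp_c >= crit:
        return "crit"
    if temp_c >= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    for zone, temp_c in read_zone_temps().items():
        print(f"{zone}: {temp_c:.1f} C [{classify(temp_c)}]")
```

Run from cron and appended to a file, output like this gives a temperature history to set beside the failure timestamps.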

Looking back on those two months of reactive work still makes me uneasy, but it also taught me something about troubleshooting Linux servers. Each time a server had an unexplained performance problem or outage, Linux did record related log messages. Searching Google for those messages showed that other people had hit the same symptoms. In the online discussions, most replies guessed at the cause from the literal wording of the log messages. Those guesses are not unreasonable, but the thinking stays inside the Linux system and ignores the larger environment around it: the server hardware, the server room environment, the network environment, and so on.

Many of the problems I have run into were caused by something outside the operating system. So I believe that in Linux operations work, when a server has a problem, unless it is clearly a system problem, troubleshooting should proceed from the outside in: from the surrounding server room environment (temperature, humidity, and so on), to the network environment (which requires some networking fundamentals), to the server hardware (disks, RAID card, network card, memory, and so on), and only then to the operating system. That is how to find the cause accurately and quickly.
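
The outside-in order can be written down as a small Python sketch. This is only an illustration of the ordering, under the assumption that each layer has some health check available; the layer names follow the paragraph above, while `first_failing_layer` and the example checks are hypothetical placeholders.

```python
# Layers in the outside-in order described above.
LAYERS = [
    "server room environment",  # temperature, humidity
    "network environment",
    "server hardware",          # disks, RAID card, network card, memory
    "operating system",
]


def first_failing_layer(checks):
    """Run checks outermost layer first; return the first layer that fails.

    `checks` maps a layer name to a zero-argument callable that returns True
    when that layer looks healthy. Layers without a check are skipped.
    Returns None when every supplied check passes.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical stand-in checks; real ones would read sensors, test
    # network reachability, read the hardware event log, and so on.
    example_checks = {
        "server room environment": lambda: False,  # e.g. the room is too hot
        "network environment": lambda: True,
        "server hardware": lambda: True,
        "operating system": lambda: False,  # e.g. puzzling log entries
    }
    print(first_failing_layer(example_checks))
```

With these stand-in checks the script reports the server room environment, even though the operating system check also fails. That is the point of the ordering: the outer cause is examined before time is spent on the inner symptoms.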
