Notes on troubleshooting cloud server cluster performance

Author: Tian Yi (wx formyz)

My advice

Quite a few people (including some programmers) still believe that once you have cloud hosts, you can search online, follow the documentation to install and configure everything, and have no need for a professional system administrator (commonly known as an "ops dog"). The cloud providers' marketing implies the same thing (buy a cloud host, enjoy worry-free stability, upload your data and be done once and for all). Is that really true? If your application gets almost no traffic and few people visit it, there really is no need to pay for a full-time system administrator. But if the Internet feeds a whole team of people and you hope to attract more users, yet you still hold the view above, all I can say is "heh" ...

 

An environment deployed without a system administrator

The application runs entirely on a public cloud. It consists of a load balancer, four web application servers, a shared data disk (holding the shared program code), a database (master-slave), and other components. Structurally, there is nothing wrong with it. As a result, for a long time nobody asked us for support, and we did not even know the application existed.

 

Autumn has come and the weather in the capital is fine; presumably everyone's mood is as pleasant as the weather. Recently, however, someone kept shouting in the support QQ group that the load on all four of the project's servers had skyrocketed. From 21:00 to 11:00 the load could reach several hundred. They said the people responsible had investigated for several days to no avail (I had a good laugh to myself).

001.jpg

 

 

Survey the scene

The application stack is nginx + php + mysql, so the places where a bottleneck may occur and that can be tuned generally include: system configuration, php configuration, and database configuration (the cloud provider's load balancer offers nothing to adjust). Without further ado, since they were pressing so anxiously, let's take a look at the system's health first.

QQ picture 20190830102044.png

My goodness, running this hot without falling over is impressive in itself. Besides the high CPU load, memory was also largely exhausted. From past experience, the system parameters might be the problem (the default sysctl.conf had nothing set), so I had a junior colleague copy the settings from another server as a reference, apply them, and run sysctl -p to make them take effect. We then waited for the access peak to observe the result. It was poor, so it seemed I would have to step in personally.
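To put a raw load figure in perspective, it helps to divide it by the number of CPU cores. The following is a minimal sketch; the helper name load_per_core is hypothetical, and the core count of the servers in this case is not stated in these notes.

```shell
#!/bin/sh
# load_per_core LOAD CORES: print the load average divided by the core count.
# (Hypothetical helper, for illustration only.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live Linux host, feed it the 1-minute load average and the CPU count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

A value that stays far above 1 per core means many more tasks are runnable, or blocked on I/O, than the machine can serve.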

 

Investigation and processing

I picked my moment: logging in one hour before the access peak.

 

First, see what is eating up the memory. Looking at the processes with ps, we found a large number of php processes. My initial suspicion was that, after serving user requests, the processes were not shutting down properly and releasing their resources, so with consent I restarted the php service. Moments later the processes had exhausted the memory again. Something was not right!

 

Run the following command repeatedly to count the php processes:

ps auxww | grep php | grep -v grep | wc -l

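The process count stayed constant at 601. Each process occupies several megabytes of memory, so 600 processes consume several gigabytes at the very least. It would be strange if the load were not high.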
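The back-of-the-envelope estimate above can be checked against the real numbers. This is a minimal sketch that sums the resident memory of all matching processes; the helper name sum_rss_mb is hypothetical. RSS counts shared pages once per process, so the total is an upper bound rather than an exact figure.

```shell
#!/bin/sh
# sum_rss_mb PATTERN: read "RSS_KB COMMAND" lines on stdin and print the
# number of commands matching PATTERN and their total resident memory in MB.
# (Hypothetical helper, for illustration only.)
sum_rss_mb() {
    awk -v pat="$1" '
        $2 ~ pat { n++; kb += $1 }
        END { printf "%d processes, %.0f MB\n", n, kb / 1024 }
    '
}

# Live usage: "rss=" and "comm=" suppress the header line of ps.
if command -v ps >/dev/null 2>&1; then
    ps -e -o rss= -o comm= | sum_rss_mb php
fi
```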

 

Opening the configuration file php-fpm.conf, the problem is visible at a glance:

002.jpg

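Process management had been wrongly set to static, with the maximum number of child processes at 600. As soon as php starts, it therefore launches one master process plus 600 child processes, whether they are needed or not. In php-fpm.conf, the parameters after the max-children line that relate to dynamic management, such as the number of processes to start and the maximum number of idle processes, all have no effect. By the time this was fixed, the access peak had almost arrived. Through manual tracking plus monitoring alerts, we saw a substantial improvement: the peak load stayed below 50.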
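A quick way to spot this kind of misconfiguration is to print only the two directives that matter. This is a minimal sketch, assuming the standard php-fpm directive names pm and pm.max_children; the helper name fpm_pm_summary and the path in the usage comment are hypothetical. The values chosen for the fix are not recorded in these notes, so none are shown.

```shell
#!/bin/sh
# fpm_pm_summary FILE: print the "pm" and "pm.max_children" values found in
# a php-fpm configuration file, ignoring lines commented out with ';'.
# (Hypothetical helper, for illustration only.)
fpm_pm_summary() {
    awk -F= '
        /^[[:space:]]*;/ { next }            # skip comment lines
        {
            key = $1; val = $2
            gsub(/[[:space:]]/, "", key)     # strip blanks around the name
            gsub(/[[:space:]]/, "", val)     # strip blanks around the value
        }
        key == "pm"              { mode = val }
        key == "pm.max_children" { max = val }
        END { printf "pm=%s max_children=%s\n", mode, max }
    ' "$1"
}

# Usage (the path is an assumption, adjust it to your installation):
#   fpm_pm_summary /etc/php-fpm.conf
```

With pm = static, only pm.max_children is honoured. Directives such as pm.start_servers, pm.min_spare_servers and pm.max_spare_servers apply only when pm = dynamic.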

 

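Further optimization

Although fixing the php settings improved performance, I was still not quite satisfied with the result and wanted to see whether anything else could be tuned. So my attention turned to disk I/O.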

 

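The four servers share one cloud NAS disk, which holds a single copy of the php code written by the programmers. If its I/O performance is poor, that will also seriously affect the performance of the whole application.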

 

Use the mount command to check how NFS is mounted, mainly the mount options. The result is as follows:

003.jpg

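The mount uses TCP, whereas in my past practice I usually mounted over UDP (vers=3). Given the disk performance the cloud provider offers, TCP is not necessarily better than UDP. So I agreed with the others that, without affecting access or performance, we would first change the NFS mount method on one server, then change the other servers once a further performance gain was confirmed, and finally leave one server unchanged so that we could observe and compare the results.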

 

Stop the services, change out of the mount point directory, unmount NFS, and remount it with the following command:

/usr/bin/mount -t nfs -o nolock,vers=3 6e46868719-pgn67.cn-qingdao.nas.aliyuncs.com:/ /data
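After remounting, it is worth confirming which options were actually negotiated. In the Linux NFS client, vers= selects the protocol version while the transport is controlled by the separate proto= option, which the command above does not set. This is a minimal sketch that reads /proc/mounts; the helper name nfs_mount_opts is hypothetical.

```shell
#!/bin/sh
# nfs_mount_opts: read /proc/mounts-style lines on stdin and, for each NFS
# mount, print the mount point with its vers= and proto= options.
# (Hypothetical helper, for illustration only.)
nfs_mount_opts() {
    awk '
        $3 ~ /^nfs4?$/ {
            vers = "?"; proto = "?"
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++) {
                if (opt[i] ~ /^(nfs)?vers=/) {
                    sub(/^.*=/, "", opt[i]); vers = opt[i]
                } else if (opt[i] ~ /^proto=/) {
                    sub(/^proto=/, "", opt[i]); proto = opt[i]
                }
            }
            print $2, "vers=" vers, "proto=" proto
        }
    '
}

# Live usage on a Linux host; prints nothing if no NFS share is mounted.
if [ -r /proc/mounts ]; then
    nfs_mount_opts < /proc/mounts
fi
```

Running it on both the changed server and the one left untouched shows exactly which mount options differ between them.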

After restarting php and the other related services, the effect during the peak period was very obvious: the load dropped to 5 or less.

QQ screenshot 20190830103240.jpg

 

After several days of observation and comparison, it is clear that the way NFS is mounted on the cloud servers has a relatively large impact on performance.



Origin blog.51cto.com/sery/2436946