Server Troubleshooting Ideas

When a server fails, the cause is rarely obvious at first glance. We generally start with the following steps:

1. Get as much context about the problem as possible

Don't rush straight to the server. First find out how much is already known about it and about the specifics of the failure; otherwise you are likely to poke around aimlessly. The questions that must be answered are:
What are the symptoms of the failure? No response? Errors?
When was the failure first noticed?
Is the failure reproducible?
Is there a pattern (for example, does it happen every hour)?
What were the latest changes to the platform (code, servers, etc.)?
Which specific groups of users are affected (logged in, logged out, in a certain region…)?
Is there documentation of the infrastructure (physical and logical)?
Is there a monitoring platform available (e.g. Munin, Zabbix, Nagios, New Relic…anything)?
Are there logs to look at (e.g. Loggly, Airbrake, Graylog…)?
The last two are the most convenient sources of information, but don't get your hopes up: they are often missing. In that case, all you can do is keep exploring.

2. Who is there?

$ w  
$ last 

Use these two commands to see who is currently logged in and who has logged in recently. This is not a critical step, but it is best not to debug a system while other users are working on it. As the saying goes, one mountain cannot hold two tigers. (One cook in the kitchen is enough.)

3. What happened before?

$ history 

Take a look at the commands that have been executed on the server. This is always worth doing, and combined with the information about who logged in earlier, it should be useful. As an admin, be careful not to abuse your privileges to invade other people's privacy.

A reminder for later: you may want to set the HISTTIMEFORMAT environment variable so that history shows when each command was executed. Otherwise it is maddening to stare at a list of commands without knowing when they were run.
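
A minimal sketch of the HISTTIMEFORMAT setting, assuming the login shell is bash. The format string is only one possible choice; it uses strftime codes.

```shell
# Bash only: prefix every entry printed by `history` with a date and time.
# %F is the date (YYYY-MM-DD) and %T is the time (HH:MM:SS).
HISTTIMEFORMAT='%F %T '
export HISTTIMEFORMAT
```

To make this permanent, the two lines can go into ~/.bashrc. Entries recorded before the variable was set have no stored timestamp, so the times shown for them are not meaningful.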

4. What is currently running?

$ pstree -a  
$ ps aux

Both commands show the running processes. The output of ps aux is verbose, while the output of pstree -a is more compact: you can see at a glance what is running and which process started which.

5. Listening network services

$ netstat -ntlp  
$ netstat -nulp  
$ netstat -nxlp

I usually run these three commands separately, because I don't want to see every service listed at once. netstat -nalp works too. Either way, I would never drop the numeric option -n (in my humble opinion, IP addresses are easier to read).
Find all running services and check whether they should be running. Look at each listening port. The PIDs that netstat shows in its service list are the same as those in the ps aux process list.

If there are several Java or Erlang processes running on the server at the same time, it is important to be able to find each process individually by PID.

Generally we recommend running fewer services per server, adding more servers if necessary. If you see 30 or 40 listening ports open on a server, make a record, clean it up when you have time, and reorganize the server.
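
As a rough sketch of matching listening ports to PIDs, the helper below condenses netstat -ntlp output into one line per listener. The function name listeners_by_pid is made up for this example, and it assumes the usual net-tools column layout (state in column 6, PID/program in column 7).

```shell
# Print "PID PROGRAM PORT" for every listening TCP socket.
# Reads `netstat -ntlp` output on standard input.
listeners_by_pid() {
    awk '$6 == "LISTEN" {
        n = split($4, addr, ":")   # the port is the last ":"-separated piece
        split($7, proc, "/")       # column 7 looks like "1234/sshd"
        print proc[1], proc[2], addr[n]
    }'
}
```

Run it as netstat -ntlp | listeners_by_pid (as root, otherwise the PIDs of other users' processes show up as "-"), then inspect a single process with ps -fp followed by its PID.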

6. CPU and memory

$ free -m  
$ uptime  
$ top  
$ htop

Pay attention to the following questions:
Is there any free memory? Is the server swapping between memory and disk?
Is there any CPU left? How many cores does the server have? Are some CPU cores overloaded?
Where is the load coming from? What is the load average?
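
The load average only means something relative to the number of cores. A small sketch of that comparison follows; load_per_core is a hypothetical helper name, and reading /proc/loadavg and calling nproc assumes a Linux box with coreutils.

```shell
# Print the load average divided by the number of cores, to two decimals.
# $1: load average (for example the 1-minute value), $2: number of CPU cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}
```

For example: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". A value that stays well above 1 suggests that more tasks are waiting for CPU or IO than the cores can serve.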

7. Hardware

$ lspci  
$ dmidecode  
$ ethtool

Many servers are still bare metal, so it is worth taking a look at the hardware:
Find the RAID card (does it have a BBU, a battery backup unit?), the CPUs, and the free memory slots. This gives you a rough idea of where hardware problems may come from and of how performance could be improved.
Is the NIC configured properly? Is it running at half duplex? At 10 Mbps? Are there any TX/RX errors?

8. IO performance

$ iostat -kx 2  
$ vmstat 2 10  
$ mpstat 2 10  
$ dstat --top-io --top-bio

These commands are very useful for debugging backend performance.
Check disk usage: is any disk on the server full?
Is the server swapping (si/so)?
Who is using the CPU: system processes? User processes? Virtual machines (steal time)?
dstat is my favorite. Use it to see who is doing the IO: is MySQL eating all the system resources? Or is it your PHP processes?
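
To make the si/so check less of an eyeball exercise, here is a small sketch that counts how many vmstat samples show swap activity. The name swap_active_samples is invented for this example, and it assumes the default vmstat layout (two header lines, si and so in columns 7 and 8).

```shell
# Count the vmstat samples that show swap-in (si) or swap-out (so) activity.
# Reads `vmstat` output on standard input.
swap_active_samples() {
    awk 'NR > 2 && ($7 > 0 || $8 > 0) { n++ } END { print n + 0 }'
}
```

Use it as vmstat 2 10 | swap_active_samples. Keep in mind that the first sample vmstat prints is an average since boot, so a result of 1 may only reflect past swapping.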

9. Mount points and file systems

$ mount  
$ cat /etc/fstab  
$ vgs  
$ pvs  
$ lvs  
$ df -h  
$ lsof +D / /* beware not to kill your box */ 

How many filesystems are mounted in total?
Are there filesystems dedicated to a service (e.g. MySQL)?
What are the filesystem mount options: noatime? defaults? Has any filesystem been remounted read-only?
Is there any disk space left?
Are there large files that have been deleted but whose space has not been released yet (because a process still holds them open)?
If disk space is a problem, is there still room to extend a partition?
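
A filesystem that was silently remounted read-only is easy to miss in the output of mount. The sketch below filters for it; readonly_mounts is a made-up helper name, and the input is assumed to be in the /proc/mounts format.

```shell
# Print the mount point of every filesystem mounted read-only.
# Reads /proc/mounts-style lines on standard input:
#   device mountpoint fstype options dump pass
readonly_mounts() {
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }'
}
```

Run it as readonly_mounts < /proc/mounts. Some pseudo-filesystems are read-only on purpose, so only data filesystems in the result are suspicious. For the deleted-but-still-open files, lsof +L1 lists open files whose link count is below one.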

10. Kernel, Interrupts, and Networking

$ sysctl -a | grep ...  
$ cat /proc/interrupts  
$ cat /proc/net/ip_conntrack /* may take some time on busy servers */  
$ netstat  
$ ss -s

Are your interrupt requests (IRQs) evenly distributed across the CPUs, or is one core overloaded by a large number of network or RAID interrupts?
What is swappiness set to? A swappiness of 60 is fine for workstations but terrible for servers: you want a server to swap as little as possible, because a swapped-out process is blocked while its data is read from or written to disk.
Is conntrack_max set high enough to handle the traffic of your server?
What are the TCP timeouts for the different connection states (TIME_WAIT, …)?
If you want to display all existing connections, netstat can be slow; use ss first to get an overall summary.
You can also take a look at Linux TCP tuning for some essentials on network performance tuning.
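
The raw /proc/interrupts table is hard to read on machines with many IRQs. As a sketch, the helper below adds up the counters per CPU so that an overloaded core stands out; irq_totals_per_cpu is a hypothetical name, and the parsing assumes the usual layout (a header row of CPU names, then one counter column per CPU).

```shell
# Sum the interrupt counters per CPU.
# Reads /proc/interrupts-style text on standard input.
irq_totals_per_cpu() {
    awk 'NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
         {
             # Column 1 is the IRQ label, columns 2..ncpu+1 are the counters.
             for (i = 1; i <= ncpu; i++)
                 if ($(i + 1) ~ /^[0-9]+$/) total[i] += $(i + 1)
         }
         END { for (i = 1; i <= ncpu; i++) printf "%s %d\n", name[i], total[i] }'
}
```

Run it as irq_totals_per_cpu < /proc/interrupts. The totals are cumulative since boot, so comparing two runs taken a few seconds apart gives a better picture of the current distribution. The current swappiness can be read with sysctl vm.swappiness.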

11. System logs and kernel messages

$ dmesg  
$ less /var/log/messages  
$ less /var/log/secure  
$ less /var/log/auth

Look for error and warning messages; for example, are there many complaints about too many connections?
Are there any hardware errors or filesystem errors?
Check whether these events line up in time with the suspicious points found earlier.
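
One way to line up log events with the time of the failure is to count messages per hour. The sketch below does that; messages_per_hour is an invented name, and it assumes the traditional syslog timestamp ("Mar 12 14:03:22 host ..."), so it will not work on logs that use ISO timestamps.

```shell
# Count log lines per hour.
# Reads traditional syslog lines on standard input and prints
# "MONTH DAY HH:00 COUNT" for each hour that has entries.
# The result is sorted as text, not chronologically across months.
messages_per_hour() {
    awk '{ split($3, t, ":"); print $1, $2, t[1] ":00" }' |
        sort | uniq -c |
        awk '{ print $2, $3, $4, $1 }'
}
```

For example: grep -i error /var/log/messages | messages_per_hour. An hour with an unusually high count is a good place to start reading.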

12. Scheduled tasks

$ ls /etc/cron*  
$ cat /etc/crontab  
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done

Is a scheduled task running too frequently?
Have some users set up scheduled tasks that nobody knows about?
Was a backup task running at the time of the failure?

13. Application logs

There is a lot that could be analyzed here, but as an operations person you probably don't have time to study it all in detail. Focus on the obvious problems, for example in a typical LAMP (Linux + Apache + MySQL + PHP) environment:
Apache & Nginx: find the access and error logs, look directly for 5xx errors, and check for limit_zone errors.
MySQL: look for errors in mysql.log, and check for corrupted tables, a running InnoDB repair process, and disk, index, or query problems.
PHP-FPM: if the php-slow log is enabled, dig into it and look for errors (php, mysql, memcache, …); if it is not enabled, enable it right away.
Varnish: check the hit/miss ratio in varnishlog and varnishstat. Are any rules missing from the config that let end users hit your backend directly?
HAProxy: what is the status of the backends? Are the health checks succeeding? Is the queue size maxed out on the frontend or on the backends?
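
For the 5xx check on Apache or Nginx, a quick count is often enough to tell whether errors are a trickle or a flood. The sketch below assumes the default common/combined access log format, where the status code is the 9th whitespace-separated field; count_5xx is a made-up name.

```shell
# Count the requests that ended with a 5xx status code.
# Reads access-log lines in the common/combined format on standard input.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}
```

Feed it the access log of your web server, for example count_5xx < /var/log/nginx/access.log (the path is an assumption and depends on the distribution and configuration). A custom log format needs a different field number.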

Conclusion

After these first 5 minutes, you should have a clear picture of the following:
What is running on the server?
Does the failure appear to be related to IO, hardware, networking, or system configuration (buggy code, kernel tuning, ...)?
Does the failure show a pattern you are familiar with? For example, poor use of database indexes, or too many Apache worker processes.
You may even have found the real source of the failure. If not, you now understand the situation well enough to dig deeper. Keep at it!

Reprinted from: http://blog.jobbole.com/36375/
