Sharing Operations Experience: Troubleshooting Methods

I have worked in operations for a year and a half and have run into all kinds of problems: lost data, websites injected with trojans, accidentally deleted database files, and hacker attacks. Today I would like to sort through them briefly and share them with you.

First, rules for operating on production systems

1. Test before use

When I first learned Linux, everything from basic services to clusters was done on virtual machines. The teacher told us there was no difference from a real machine, but our longing for a real environment kept growing. Meanwhile, virtual machine snapshots let us develop all kinds of careless habits, so that when we finally got access to a real server we could not wait to try things. I remember my first day at work: the boss gave me the root password, and because only putty was in use and I wanted to use xshell, I quietly tried to switch the server login to xshell with key authentication. I did not test the change, and I did not keep an existing ssh connection open. After restarting sshd I was locked out of the server. Fortunately I had backed up the sshd_config file, so I asked the data center staff to cp it back. Fortunately this was a small company; otherwise I would have been dismissed on the spot ... I was lucky that time.

The second example is about synchronizing files. We all know rsync syncs quickly, but it also deletes files much faster than rm -rf. One rsync usage makes a target directory match a reference directory (if the first directory is empty, you can imagine the result): the directory that actually holds the data gets emptied. Through misuse and a lack of testing, I wrote the two directories in the wrong order, and the key point is that nothing was backed up ... Production data was deleted with no backup. You can imagine the consequences; the importance of testing is self-evident.
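A minimal sketch of a guard against this kind of mistake. It assumes the mirroring was done with rsync's `--delete` option (the original incident does not say which options were used); `build_rsync_command` is a hypothetical helper that only builds the command, refuses an empty source, and defaults to `--dry-run` so the effect can be reviewed first.

```python
import os
import tempfile


def build_rsync_command(src, dst, dry_run=True):
    """Build an rsync argv that mirrors src into dst, refusing an empty source."""
    if not os.path.isdir(src):
        raise ValueError(f"source is not a directory: {src}")
    if not os.listdir(src):
        # Mirroring an empty source with --delete would empty the destination.
        raise ValueError(f"source is empty, refusing to mirror: {src}")
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without changing it
    # Trailing slashes: sync the contents of src into dst.
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


# Demo on throwaway directories; the command is only printed, never executed.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as empty_dir:
    with open(os.path.join(data_dir, "orders.db"), "w") as fh:
        fh.write("x")
    print(build_rsync_command(data_dir, empty_dir))
    try:
        build_rsync_command(empty_dir, data_dir)  # the reversed, destructive order
    except ValueError as exc:
        print("refused:", exc)
```

Reading the dry-run output before running the real command is the "test before use" step that was skipped in the story above.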

2. Confirm repeatedly before pressing Enter

About the rm -rf / var mistake (note the stray space after the slash): I believe people who type fast, or who work over a slow connection, are quite likely to make it. By the time you notice the command has already run, your heart has gone at least half cold.
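One way to catch the stray space is to check the arguments before anything is deleted. The sketch below is an assumption-laden illustration, not a replacement for care: `dangerous_targets` and the `PROTECTED` list are hypothetical names, and symlinks are not resolved.

```python
import posixpath
import shlex

# Assumed list of paths that must never be a delete target; adjust per system.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/home", "/lib", "/root", "/usr", "/var"}


def dangerous_targets(args, cwd):
    """Return the arguments that resolve to a protected path."""
    hits = []
    for arg in args:
        if arg.startswith("-"):
            continue  # skip options such as -rf
        resolved = posixpath.normpath(posixpath.join(cwd, arg))
        if resolved in PROTECTED:
            hits.append(arg)
    return hits


typed = "rm -rf / var"  # the stray space turns "/var" into "/" plus "var"
print(dangerous_targets(shlex.split(typed)[1:], cwd="/home/ops"))
```

With the stray space, the first target is the root directory itself, which is exactly what the check reports.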

You might say: I have run commands so many times without a mistake, there is nothing to be afraid of. I only want to say that once it happens to you, you will understand. Do not think operations accidents only happen to other people; if you are not careful, you are next.

3. Never let several people operate at once

At my company, operations management was rather chaotic. To cite a typical example: several operations staff who had already left still had the root password of every server.

Usually, when we receive an operations task, we take a quick look first, and if we cannot solve it we ask others for help. But when a problem becomes urgent, the customer service supervisor (who knows a bit of Linux), the network administrator, and your boss all debug the same server together. After all kinds of Baidu searches and comparisons, you find that the server configuration file is not the same as when you last modified it, so you change it back, search Google again, and excitedly find and fix the problem, only for someone else to tell you that he fixed it too, by changing a different parameter ... In that case I really do not know which was the real cause. Of course, this is still the good outcome: the problem is solved and everyone is happy. But have you ever modified a file, found that the test had no effect, and then discovered that the file had been modified again in the meantime? It is really infuriating. Several people must not operate at the same time.

4. Back up first, then operate

Make it a habit: before you modify data, back it up first, for example a .conf configuration file. In addition, when modifying a configuration file, it is recommended to comment out the original option, then copy it and modify the copy.

Besides, in the first example, if there had been a database backup, the rsync misuse would have been put right quickly. So losing a database is not something that happens out of nowhere; with any backup at all, I would not have had to suffer so much.
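The habit can be made mechanical. Below is a minimal sketch, assuming a timestamped copy next to the original is an acceptable backup; `backup_file` is a hypothetical helper and the `.bak` naming is only one possible convention.

```python
import os
import shutil
import tempfile
import time


def backup_file(path, now=None):
    """Copy path to path.<timestamp>.bak and return the backup's name."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    backup = f"{path}.{stamp}.bak"
    if os.path.exists(backup):
        raise FileExistsError(backup)  # never overwrite an earlier backup
    shutil.copy2(path, backup)  # copy2 also tries to keep timestamps and mode
    return backup


# Demo on a throwaway file standing in for a real configuration file.
with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "sshd_config")
    with open(conf, "w") as fh:
        fh.write("PermitRootLogin no\n")
    print("backup written to", backup_file(conf))
```

Calling the helper before every edit leaves a copy that someone else, such as the data center staff in the sshd story, can put back.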

Second, when data is involved

1. Use rm -rf with caution

There are plenty of examples online: all kinds of rm -rf /, deleted primary databases, and other operations accidents ... A small mistake can cause great losses. If you really have to delete something, be cautious.

2. Backup above all else

The points above were already all about backups, but I want to place it under the data category and emphasize it again: backups are very important. I remember my teacher saying that you can never be too careful with anything that involves data. The companies I have worked for ran a third-party payment website and an online lending platform. The third-party payment site had a full backup every two hours, and the online lending platform was backed up every 20 minutes. I will say no more; judge for yourselves.

3. Stability above all else

In fact, this applies not only to data: across the whole server environment, stability comes first. Do not go for the fastest; go for the most stable and the most available. So do not put new software on a server without testing it, for example nginx + php-fpm. In production, php hung in all sorts of ways; a restart fixed it, or switching to apache did.

4. Confidentiality above all else

These days private photo leaks are everywhere and routers turn up with all kinds of backdoors, so when data is involved, failing to keep it confidential is not acceptable.

Third, when security is involved

1. ssh

Change the default port (of course, if a professional wants to break in, a scan will reveal it)
Prohibit root login
Use a regular user + key authentication + sudo rules + IP address restriction + user restriction
Use brute-force protection software similar to hostdeny (after more than a few failed attempts, blacklist the address directly)
Screen the users in /etc/passwd that are able to log in
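The last item, screening login users, can be sketched as follows. This assumes the usual seven-field passwd format and treats a small, assumed set of shells as "cannot log in"; the sample entries are illustrative, not taken from a real server.

```python
# Shells that conventionally mean "no interactive login" (an assumed list).
NON_LOGIN_SHELLS = {"/sbin/nologin", "/usr/sbin/nologin", "/bin/false"}


def login_users(passwd_text):
    """Return (name, uid, shell) for every account with a usable login shell."""
    users = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7 or not fields[2].isdigit():
            continue  # skip malformed entries
        name, uid, shell = fields[0], int(fields[2]), fields[6]
        if shell not in NON_LOGIN_SHELLS:
            users.append((name, uid, shell))
    return users


# Illustrative sample; on a real host, read the text from /etc/passwd instead.
SAMPLE = """\
root:x:0:0:root:/root:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/false
deploy:x:1000:1000::/home/deploy:/bin/bash
"""
for name, uid, shell in login_users(SAMPLE):
    print(name, uid, shell)
```

Any account in the output that nobody recognizes deserves a closer look.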

2. Firewall

The firewall must be enabled in a production environment, and it should follow the principle of least privilege: drop everything, then open only the service ports that are required.
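As a rough illustration of "drop everything, then open what is needed", the sketch below generates iptables commands for the INPUT chain. It assumes iptables is the firewall in use and that only TCP services are exposed; `firewall_rules` is a hypothetical helper, and the rules are printed for review rather than applied.

```python
def firewall_rules(tcp_ports):
    """Return iptables commands: allow listed TCP ports, then default to DROP."""
    rules = [
        "iptables -A INPUT -i lo -j ACCEPT",  # keep loopback traffic working
        "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    for port in sorted(set(tcp_ports)):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        rules.append(f"iptables -A INPUT -p tcp --dport {port} -j ACCEPT")
    # The DROP policy goes last so the current ssh session is not cut off
    # before its accept rule exists.
    rules.append("iptables -P INPUT DROP")
    return rules


for rule in firewall_rules([22, 80, 443]):
    print(rule)
```

Setting the default policy only after the accept rules are in place is the firewall version of keeping an ssh connection open while changing sshd.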

3. Fine-grained permission control

Any service that can be started as an ordinary user must not be started as root. Keep each service's permissions to the minimum, with fine-grained control.

4. Intrusion detection and log monitoring

Use third-party software to detect changes to critical system files and to the configuration files of the various services at all times, such as /etc/passwd, /etc/my.cnf, /etc/httpd/con/httpd.con, etc.;
use a centralized log monitoring system to monitor /var/log/secure, /etc/log/message, ftp file uploads and downloads, and so on, and raise alarms on error logs;
in addition, for port scans you can also use third-party software, and add a host directly to host.deny once it is found scanning. This information is helpful for troubleshooting after a system has been intruded. Someone once said that what a company invests in security is directly proportional to what it has lost to security attacks. Security is a big topic, but also very basic work; with the basics done well, system security improves considerably. The rest is for security experts to do.
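The change detection described above is normally left to dedicated tools, but the idea fits in a few lines: record a hash of each watched file, then compare later. This is a sketch under the assumption that SHA-256 digests are a sufficient fingerprint; `snapshot` and `changed_files` are hypothetical names.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its digest, or to None when the file is missing."""
    return {p: (file_digest(p) if os.path.isfile(p) else None) for p in paths}


def changed_files(baseline, current):
    """Return the watched paths whose digest differs from the baseline."""
    return sorted(p for p in baseline if baseline[p] != current.get(p))


# Demo on a throwaway file standing in for a watched configuration file.
with tempfile.TemporaryDirectory() as tmp:
    watched = os.path.join(tmp, "my.cnf")
    with open(watched, "w") as fh:
        fh.write("max_connections=100\n")
    baseline = snapshot([watched])
    with open(watched, "a") as fh:
        fh.write("skip-grant-tables\n")  # simulated tampering
    print(changed_files(baseline, snapshot([watched])))
```

In practice the baseline has to be stored somewhere an intruder cannot rewrite it, otherwise the comparison proves nothing.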

Fourth, daily monitoring

1. System operation monitoring

Many people enter operations by starting in monitoring, and large companies generally have dedicated staff monitoring around the clock. System monitoring generally covers hardware utilization, commonly memory, hard disk, cpu, and network card, and on the os side it includes login monitoring and monitoring of system-critical files. Regular monitoring can predict the likelihood of hardware failure and provides very practical input for tuning.
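As a small example of one such check, the sketch below reports mount points whose disk usage is above a threshold. It uses only the standard library; `disk_alerts` and the 90 percent threshold are assumptions for illustration, and a real monitoring system would cover far more than disk space.

```python
import shutil


def usage_percent(total, used):
    """Return used space as a percentage of total (0.0 for an empty total)."""
    return 100.0 * used / total if total else 0.0


def disk_alerts(mount_points, threshold=90.0):
    """Return (mount, percent) for every mount point at or above the threshold."""
    alerts = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)
        percent = usage_percent(usage.total, usage.used)
        if percent >= threshold:
            alerts.append((mount, round(percent, 1)))
    return alerts


print(disk_alerts(["/"], threshold=90.0))
```

Recording the percentage over time, rather than only alerting, is what makes it possible to predict when a disk will fill up.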

2. Service operation monitoring

Service monitoring generally covers the various applications, such as web, db, and lvs. It usually means watching a set of indicators, so that when a performance bottleneck appears in the system it can be found and resolved quickly.

3. Log monitoring

Log monitoring here is similar to the security log monitoring, but it generally covers error and alarm information from the hardware, the os, and the applications. Monitoring seems to be of little use while the system is running stably, but if a problem occurs and you have no monitoring at all, you will be in a very passive position.

Fifth, performance tuning

1. Understand the operating mechanism in depth

In fact, with only a little more than a year of operations experience, my talk of tuning is armchair theory. Still, I want to give a simple summary; if I gain a deeper understanding, I will update it.

Before optimizing a piece of software, you need to understand in depth how it runs. Take nginx and apache: everyone says nginx is fast, so you must know why nginx is fast, what principles it uses, and how it handles requests better than apache, and you should be able to explain this to others in easy-to-understand words. When necessary you should also be able to read the source code; otherwise, any document that treats parameters as the object of tuning is just talking blindly.

2. Tuning framework and order

Once you are familiar with the underlying operating mechanism, you should have a framework and an order for tuning. Take a database bottleneck: many people go straight to changing the database configuration file. My suggestion is to first analyze the bottleneck, check the logs, and write down the tuning direction, and only then start. Tuning the database server should be the last step; hardware and the operating system should come first. Today's database servers are released only after all kinds of testing on all operating systems, so they should not be where you start.

3. Tune only one parameter at a time

Tune only one parameter at a time. Everyone knows this one: change several at once and you will confuse yourself.
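The rule is easy to check mechanically if the old and new settings are available as dictionaries. A minimal sketch, in which `check_single_change` is a hypothetical helper and the parameter values are purely illustrative:

```python
_MISSING = object()  # marks a key that is absent from one side


def changed_keys(before, after):
    """Return the parameter names whose values differ between two configs."""
    keys = set(before) | set(after)
    return sorted(
        k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


def check_single_change(before, after):
    """Return the one changed parameter, or raise if the count is not exactly one."""
    diff = changed_keys(before, after)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed parameter, got {diff}")
    return diff[0]


old = {"max_connections": 200, "innodb_buffer_pool_size": "1G"}
new = {"max_connections": 200, "innodb_buffer_pool_size": "2G"}
print(check_single_change(old, new))
```

Running a check like this before each benchmark run keeps the cause of any measured difference unambiguous.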

4. Benchmarking

To determine whether a tuning change is useful, or to test the stability and performance of a new software version, you must benchmark. Testing involves many factors, and whether a test is close to the real business workload depends on the tester's experience. For reference material, see the third edition of "High Performance MySQL", which is quite good. My teacher once said that there is no one-size-fits-all parameter; any parameter change and any tuning must fit the business scenario. So do not just Google how to tune something; that brings no long-term benefit to your own growth or to the business environment.
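The shape of a before-and-after comparison can be shown with a toy micro-benchmark. This is only a sketch of the method (repeat the measurement, compare medians); the two string-building functions are stand-ins invented for the example and say nothing about database workloads.

```python
import statistics
import timeit


def median_seconds(func, number=1000, repeat=5):
    """Return the median time per call over several repeated runs."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return statistics.median(runs) / number


def join_with_plus(n=200):
    """Baseline: build a string by repeated concatenation."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def join_with_join(n=200):
    """Candidate: build the same string with str.join."""
    return "".join(str(i) for i in range(n))


baseline = median_seconds(join_with_plus)
candidate = median_seconds(join_with_join)
print(f"baseline {baseline:.2e}s per call, candidate {candidate:.2e}s per call")
```

The numbers depend entirely on the machine and the load at the time, which is the same reason a benchmark has to resemble the real business workload to mean anything.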

Sixth, the operations mindset

1. Control your state of mind

Many rm -rf /data incidents happen in the last few minutes before leaving work, at the peak of irritability. So do you still not intend to control your state of mind? Someone will say that you have to go to work even when you are irritable, but you can try to avoid handling critical data environments while you are upset. The more pressure there is, the calmer you need to be, or you will lose even more.

Most people have had the experience of running rm -rf /data/mysql. You can imagine the mood on discovering the deletion, but if there is no backup, what use is panicking? In this situation you need to calm down and think through the worst case. For mysql, after the physical files are deleted, some of the tables still exist in memory, so stop the business traffic but do not shut down the mysql database; this helps recovery. Use dd to copy the hard disk, and then attempt recovery. Of course, most of the time you can only turn to a data recovery company.

Imagine that the data is deleted and you perform all kinds of operations, shut down the database, and then try to repair it: not only might you overwrite the files, you will also no longer find the tables in memory.

2. Be responsible for the data

A production environment is not child's play, and neither is a database; you must be responsible for the data. The consequences of not backing up are very serious.

3. Get to the root cause

Many operations staff are busy and stop caring about a problem once it is solved. I remember that last year a client's website kept failing to open. The php code reported errors, and it turned out that the session and whos_online tables were corrupted. The previous operations person had fixed this with repair, so I repaired it the same way, but a few hours later it appeared again. After this had repeated three or four times, I went to Google for the causes of unexplained table corruption: first, a myisam bug; second, a mysql bug; third, mysql being killed in the middle of a write. The final finding was insufficient memory, which caused the OOM killer to kill the mysqld process. There was no swap partition, and the background monitoring showed memory as sufficient. In the end, upgrading the physical memory solved it.
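A root cause like this one leaves traces in the kernel log. The sketch below scans log lines for OOM kills; the regular expression assumes the common "Killed process PID (name)" wording, which can differ between kernel versions, and the sample lines are illustrative rather than taken from the incident.

```python
import re

# Assumed wording of the kernel's OOM kill message; it varies by kernel version.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_victims(log_lines):
    """Return (pid, process_name) for every OOM kill found in the log lines."""
    victims = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


# Illustrative sample lines, not real log output.
SAMPLE_LOG = [
    "Apr 10 03:12:01 web01 kernel: Out of memory: Kill process 2457 (mysqld) score 912 or sacrifice child",
    "Apr 10 03:12:01 web01 kernel: Killed process 2457 (mysqld) total-vm:2457600kB, anon-rss:1803200kB, file-rss:0kB",
    "Apr 10 03:12:05 web01 sshd[991]: Accepted publickey for deploy from 10.0.0.5 port 51234 ssh2",
]
print(oom_victims(SAMPLE_LOG))
```

Had a check like this been part of log monitoring, the repeated table corruption could have been tied to mysqld being killed after the first occurrence instead of the fourth.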

4. Test and production environments

Before an important operation, be sure to check which machine you are on, and try to avoid having too many windows open.
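One way to force that check is to make a script refuse to continue unless it is on the intended host and the operator has retyped the host's name. A minimal sketch; `confirm_host` and the host name `db-test-01` are hypothetical.

```python
import socket


def confirm_host(expected, actual, typed):
    """Allow the operation only on the expected host, after its name is retyped."""
    return actual == expected and typed.strip() == expected


current = socket.gethostname()
# In a real script, `typed` would come from input("Type the host name to continue: ").
print(confirm_host("db-test-01", current, "db-test-01"))
```

Retyping the name is slower than pressing Enter, and that pause is the point when several terminal windows look alike.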

The points above come from my own work experience. I hope they are of some help to other operations staff; if anything is lacking, advice is welcome.

From: http://os.51cto.com/art/201404/434770.htm

Origin blog.csdn.net/xiaohuangren_123/article/details/105082953