Good operation and maintenance habits

Six good habits for operations and maintenance

 

1. Rules for working on production systems

1. Test before use

When I first learned Linux, from the basics to services to clusters, I did everything in virtual machines. The teacher told us there was no difference from a real machine, yet my desire for a real environment grew by the day. Meanwhile, virtual machine snapshots let us develop all kinds of careless habits, so when I finally got access to a real server I could not wait to try things. I remember my first day at work: the boss gave me the root password. Only PuTTY was available and I wanted to use Xshell, so I quietly logged in to the server and tried to switch to Xshell with key-based login. I had not tested the change and had no spare ssh connection left open, so after restarting the sshd service I was locked out of the server. Fortunately I had backed up the sshd_config file, and I asked the data center staff to cp it back into place. Fortunately this was a small company, otherwise I would have been fired on the spot. I got lucky.

 

The second example is about file synchronization. Everyone knows that rsync synchronizes quickly, but it also deletes files much faster than rm -rf. rsync has an option that makes one directory match another (if that first directory is empty, you can imagine the result): the directory that actually holds the data is wiped. Back then, through misoperation and lack of testing, I wrote the two directories in reverse order. The key point is that there was no backup... Production data was deleted.

With no backup, you can imagine the consequences. The importance of testing and backups is self-evident.
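A minimal sketch of how the rsync accident might be guarded against, assuming the option in question is rsync's --delete. The helper names source_has_data and safe_mirror are hypothetical, not something from the original setup; the idea is to refuse an empty source and to preview with --dry-run before anything is changed.

```shell
# Hypothetical helper: succeed only if the source directory exists and is not empty.
source_has_data() {
    src=$1
    [ -d "$src" ] && [ -n "$(ls -A "$src" 2>/dev/null)" ]
}

# Hypothetical wrapper around rsync; both paths are supplied by the caller.
safe_mirror() {
    src=$1
    dst=$2
    if ! source_has_data "$src"; then
        echo "refusing to sync: source '$src' is empty or missing" >&2
        return 1
    fi
    # Preview only: --dry-run lists what would be copied or deleted, changes nothing.
    rsync -a --delete --dry-run --itemize-changes "$src"/ "$dst"/
    echo "Review the list above, then rerun the same command without --dry-run."
}
```

The guard only catches the empty-directory case. Reading the dry-run output is still what catches a reversed source and destination.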

2. Confirm twice before pressing Enter

As for the rm -rf / var mistake (a stray space after the slash), I believe it is quite likely to happen to people who type fast, or when the connection is slow.

When you realize the command has already finished, your heart sinks.

You may say: I have pressed Enter so many times and nothing has gone wrong, so there is nothing to fear. I just want to say:

You will understand once it happens to you. Do not think operations accidents only happen to other people. If you are not careful, you are next.
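One way to make the second look mechanical is sketched below, assuming the mistake meant above is a stray space that turns / into its own argument. The function names and the list of protected directories are illustrative assumptions. The function only prints what would be deleted and refuses obviously dangerous targets; it never deletes anything itself.

```shell
# Hypothetical guard: is this path "/" or a well-known top-level directory?
is_protected_target() {
    case $1 in
        /|/bin|/boot|/dev|/etc|/home|/lib|/proc|/root|/sbin|/sys|/usr|/var) return 0 ;;
        *) return 1 ;;
    esac
}

# Print each target on its own line and stop if any of them is protected.
confirm_targets() {
    for target in "$@"; do
        # Strip trailing slashes so "/var/" is treated like "/var".
        clean=$target
        while [ "$clean" != "/" ] && [ "${clean%/}" != "$clean" ]; do
            clean=${clean%/}
        done
        if is_protected_target "$clean"; then
            echo "BLOCKED: $target" >&2
            return 1
        fi
        echo "would delete: $target"
    done
}
```

With this sketch, confirm_targets / var is blocked because the stray space makes / a separate argument, while confirm_targets /var/tmp/cache just prints the target for review.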

3. Avoid multi-person operation

At my last company, operations management was quite chaotic. The most typical example: several operations engineers who had already left the company still had the server root password.

Usually when we receive a task, we do a quick check ourselves and ask others for help if we cannot solve it. But when the problem is serious, the customer service supervisor (who knows Linux), the network administrator, and your boss all debug the same server together. After all kinds of Baidu searches and comparisons, you discover that the server configuration file differs from the version you last modified, so you change it back. Then you search Google, happily find the problem and fix it, and someone else tells you he fixed it too, by changing a different parameter... At that point I really do not know which one was the real cause. Of course, this is the good outcome: the problem is solved and everyone is happy. But have you ever modified a file, found in testing that the change had no effect, and gone back to edit it only to find that the file had been changed again? It is truly infuriating. Do not let multiple people operate on the same server at once.

4. Backup before operation

Develop a habit: before you modify any data, back it up first, for example a .conf configuration file.

In addition, when modifying a configuration file, it is recommended to comment out the original option, then copy it and modify the copy.

Furthermore, if there had been a backup in the rsync example above, the mistake could have been repaired quickly.

Losing a database does not come down to a single day; any backup at all would have kept things from being so bad.
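The backup habit above can be sketched as a small shell helper. The name backup_file and the .bak.<timestamp> suffix are assumptions chosen for illustration, not a convention from the original.

```shell
# Hypothetical helper: copy a file to "<name>.bak.<timestamp>" before editing it.
backup_file() {
    file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    stamp=$(date +%Y%m%d-%H%M%S)
    backup="$file.bak.$stamp"
    # -p keeps mode and timestamps so the copy can be put back unchanged.
    cp -p "$file" "$backup" && echo "$backup"
}
```

The function prints the path of the copy, so something like b=$(backup_file /etc/ssh/sshd_config) leaves a known file to cp back if the edit goes wrong.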

 

2. Data

1. Use rm -rf with caution

There are plenty of examples online: rm -rf /, deleted primary databases, all sorts of operations accidents...

A small mistake can cause heavy losses. If you really need to delete something, be careful.

2. Backup matters more than anything

Backups were already covered above, but I want to file them under data as well. Once again: backups are very important.

I remember my teacher saying that you can never be too cautious with data.

The company I work for runs a third-party payment website and an online lending platform.

The payment site gets a full backup every two hours, and the lending platform is backed up every 20 minutes.

I won't say more; draw your own conclusions.

3. Stability matters more than anything

In fact, this goes beyond data: across the entire server environment, stability matters more than anything. Aim for the most stable and most available setup, not the fastest.

So do not put new software on a server without testing it, such as nginx + php-fpm, only to have PHP hang in production in all sorts of ways.

Then all you can do is restart it, or simply switch back to apache.

4. Confidentiality matters more than anything

These days private photos leak everywhere and all kinds of routers ship with backdoors, so when it comes to data, confidentiality is a must.

 

3. Security

1. ssh

Change the default port (of course, if a professional wants to hack you, a scan will reveal it)

Disable root login

Use a regular user + key authentication + sudo rules + IP address restrictions + user restrictions

Use brute-force protection software such as DenyHosts (block an address outright after more than a few failed attempts)

Filter the users in /etc/passwd that are allowed to log in
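A hedged sketch of two items from the list above. sshd_hardening_snippet prints a candidate sshd_config fragment using standard OpenSSH directives; the port and user name are placeholders the caller supplies. login_users lists accounts in a passwd-format file whose shell is not nologin or false. Both function names are made up for this example, and IP restrictions and sudo rules are not covered.

```shell
# Print a candidate sshd_config fragment. Port and user are caller-supplied placeholders.
sshd_hardening_snippet() {
    port=$1
    user=$2
    cat <<EOF
Port $port
PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no
AllowUsers $user
EOF
}

# List accounts in a passwd-format file whose shell field does not end in nologin or false.
login_users() {
    awk -F: '$7 !~ /(nologin|false)$/ { print $1 }' "$1"
}
```

In line with the lockout story earlier, it seems prudent to write the fragment to a scratch file, check it with sshd -t -f against that file, and keep an existing session open while restarting sshd. login_users /etc/passwd then shows which accounts are worth reviewing.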

2. Firewall

The firewall must be enabled in production and follow the principle of least privilege: drop everything, then open only the service ports that are needed.
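The drop-all-then-allow principle could look roughly like this with iptables, assuming iptables is the firewall in use (the original does not say which one). The hypothetical firewall_plan function only prints the commands so they can be reviewed first. The port numbers are up to the caller.

```shell
# Print (not execute) an iptables plan: allow loopback, established traffic,
# and the listed TCP ports, then set the default INPUT policy to DROP.
firewall_plan() {
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # The policy change comes last so the ACCEPT rules already exist when it applies.
    echo "iptables -P INPUT DROP"
}
```

For example, firewall_plan 22 443 prints five commands. Putting the DROP policy last is a deliberate choice: applying it before the ssh rule exists is an easy way to lock yourself out.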

3. Fine permissions and control granularity

Services that can be started by a regular user should never run as root. Keep each service's permissions to the minimum and control them at a fine granularity.

4. Intrusion detection and log monitoring

Use third-party software to continuously detect changes to key system files and service configuration files,

for example /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf, and so on.

Use a centralized log monitoring system to watch /var/log/secure, /var/log/messages, FTP upload and download records, and other alert and error logs.

In addition, for port scans you can use third-party software that adds the scanning host to /etc/hosts.deny as soon as a scan is detected. This information is very helpful for troubleshooting after a system has been compromised. It has been said that what a company invests in security is proportional to what it has lost to security attacks. Security is a big topic.

It is also very basic work. Get the basics right and system security improves considerably.
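The key-file change detection mentioned in this section is normally the job of dedicated software, but the underlying idea can be sketched with checksums. This assumes GNU coreutils sha256sum is available. make_baseline and check_baseline are hypothetical names.

```shell
# Record SHA-256 checksums of the given files into a baseline file.
make_baseline() {
    baseline=$1
    shift
    sha256sum "$@" > "$baseline"
}

# Re-check the files; prints only entries that no longer match and returns non-zero then.
check_baseline() {
    sha256sum --check --quiet "$1" 2>/dev/null
}
```

A typical use would be make_baseline on /etc/passwd and /etc/my.cnf once, then check_baseline from cron. For the check to mean anything, the baseline file has to live somewhere an intruder cannot rewrite.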

 

4. Daily monitoring

1. System operation monitoring

Many people enter operations through monitoring. Large companies generally have dedicated staff monitoring around the clock. System monitoring generally covers hardware utilization,

commonly memory, disk, CPU, and network cards. On the OS side it includes login monitoring and monitoring of key system files.

Regular monitoring can help predict the likelihood of hardware failure and provides very practical input for tuning.
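As a small illustration of hardware utilization monitoring, the sketch below filters df -P output for file systems at or above a usage threshold. The function name over_threshold is an assumption, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print mount points at or above the given percentage.
over_threshold() {
    limit=$1
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

Usage would be df -P | over_threshold 90, with the output fed into whatever alerting is already in place.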

2. Service operation monitoring

Service monitoring generally covers the various applications: web, db, lvs, and so on. It usually tracks a set of metrics,

so that a performance bottleneck can be discovered and resolved quickly.

3. Log monitoring

Log monitoring here is similar to the security log monitoring above, but it generally covers error and alert messages from hardware, the OS, and applications.

Monitoring seems useless while the system runs stably, but once a problem occurs, without monitoring you are stuck reacting.

 

5. Performance tuning

1. Understand the operating mechanism in depth

To be honest, with just over a year of operations experience, my talking about tuning is basically armchair theorizing, but I want to give a brief summary and will update it if I gain deeper understanding. Before optimizing a piece of software, you need to understand deeply how it works, for example nginx and apache. Everyone says nginx is faster, so you must know why nginx is faster, what principles it uses, and why it handles requests better than apache. You must also be able to explain it to others in plain, easy-to-understand words, reading the source code when necessary. Otherwise any document that treats parameters as the object of tuning is nonsense.

2. Tuning framework and sequence

Once you are familiar with the underlying mechanism, you need a tuning framework and order. For example, when a database hits a bottleneck, many people go straight to the database configuration file. My suggestion is to analyze the bottleneck first, check the logs, write down the optimization direction, and only then start. Tuning the database server itself should be the last step; hardware and the operating system come first. Today's database servers are released only after extensive testing

and are meant to work on all operating systems, so they should not be the first thing you touch.

3. Only adjust one parameter at a time

Adjust only one parameter at a time. I believe everyone knows this: change several at once and you will confuse yourself.

4. Benchmark

To judge whether tuning helped, and to test the stability and performance of a new software version, you must run benchmarks. Benchmarking involves many factors,

and whether a test comes close to the real needs of the business depends on the tester's experience. For related material, the third edition of "High Performance MySQL" is quite good.

My teacher once said that there are no universally applicable parameters; any parameter change or adjustment must fit the business scenario.

So stop Googling for tuning settings; it has no lasting effect on your own growth or on your production environment.
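The mechanical part of a benchmark, running the same workload repeatedly before and after a single parameter change, might be sketched like this. It assumes GNU date (for the %N nanosecond format). The name bench_avg_ms is hypothetical. It does not address the harder question raised above of whether the workload resembles the real business.

```shell
# Run a command N times and print the average wall-clock time per run in milliseconds.
bench_avg_ms() {
    runs=$1
    shift
    start=$(date +%s%N)            # nanoseconds since the epoch (GNU date)
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s%N)
    echo $(( (end - start) / runs / 1000000 ))
}
```

Record the number, change exactly one parameter, and run the identical command again. Comparing the two figures is what ties this back to the one-parameter-at-a-time rule.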

 

6. Operations and maintenance mindset

1. Control your mindset

Many an rm -rf /data happens in the first few minutes of work, at the peak of irritability. So are you still not going to control your mindset?

Some say you still have to work even when you are irritable, but you can try to avoid touching critical data environments when you are.

The greater the pressure, the calmer you need to be, otherwise you will lose even more.

Most people have experienced an rm -rf /data/mysql. You can imagine the feeling after the delete, but if there is no backup, what good does panicking do? In general, think calmly and plan for the worst case. For mysql, after the physical files are deleted some tables still exist in memory, so cut off the business traffic but do not shut down the mysql database. This helps recovery a great deal. Then use dd to copy the disk, and restore from the copy.

Of course, most of the time you can only turn to a data recovery company.

Imagine instead that after the data is deleted you perform all sorts of operations, shut down the database, and then try to repair: not only may you overwrite the files, you also lose the tables that were in memory.
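One Linux mechanism that may be behind the advice not to stop mysqld: a deleted file stays readable through /proc/<pid>/fd for as long as a process still holds it open. Whether this applies to a given incident is an assumption, not something the original states. The sketch below (hypothetical function name, Linux only) just lists such descriptors.

```shell
# List file descriptors of a process that point at deleted files (Linux /proc only).
list_deleted_open_files() {
    pid=$1
    ls -l "/proc/$pid/fd" 2>/dev/null | grep '(deleted)'
}
```

If the listing shows a table file, copying /proc/<pid>/fd/<number> to a different disk may recover its contents while the process is still running. Stopping the process closes the descriptors and removes that option.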

2. Be responsible for data

Production is not child's play, and neither is the database. You must take responsibility for the data. The consequences of not backing up are very serious.

3. Get to the bottom of problems

Many operations staff are busy and stop caring once a problem is patched. I remember a customer's website last year that kept failing to open; the PHP code reported an error.

It turned out that the session and whos_online tables were corrupted. The previous operations engineer had fixed them with repair, and I fixed them the same way, but a few hours later the problem came back.

After three or four rounds of this, I Googled the causes of unexplained table corruption: first, a myisam bug; second, a mysql bug; third, mysql being killed

in the middle of a write. In the end I found that memory was insufficient, which caused the OOM killer to kill the mysqld process.

There was no swap partition either, while the monitoring dashboard showed memory as sufficient. Upgrading the physical memory finally solved it.
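When a process keeps dying for no visible reason, the kernel log is a quick place to check for the OOM killer. This is a rough sketch: the patterns cover message wording commonly seen from Linux kernels, which varies between versions, and the function name is made up for illustration.

```shell
# Filter kernel log lines on stdin down to the ones the OOM killer typically writes.
find_oom_kills() {
    grep -Ei 'out of memory|oom-killer|killed process'
}
```

It can be fed from dmesg | find_oom_kills or from find_oom_kills < /var/log/messages. A hit naming mysqld would point at memory pressure rather than a database bug.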

4. Test and production environment

Always check which machine you are on before an important operation, and try to avoid keeping many windows open.
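The check can also be built into the command itself. A minimal sketch, with a hypothetical require_host function that compares the machine's node name against the name you believe you are on:

```shell
# Abort unless the current machine's name matches the one the operator expects.
require_host() {
    expected=$1
    actual=$(uname -n)
    if [ "$actual" != "$expected" ]; then
        echo "on '$actual', expected '$expected': aborting" >&2
        return 1
    fi
}
```

Chaining it in front of the risky step, for example require_host db-test-01 && ./cleanup.sh (both names are placeholders), means a command typed into the wrong window stops before it does anything.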


Origin www.cnblogs.com/panfei-ywg/p/12697651.html