"Must-see" Linux operation and maintenance engineers Daguai upgrade article

Operations work is like fighting monsters and leveling up in a game: after each level, your knowledge and the operations landscape change considerably, and there is a lot of new material to learn. Operations engineers have to keep growing under pressure, as fast as hardware itself evolves. The premise is that you can hold your own in the fight, but you also need a keen sense for where the trend is heading. For example, big data and artificial intelligence are hot this year (and Python is comparatively hot too).

See the topology diagram:

 

Intermediate section

The following draws on my own interview experience and on that of others. Some people think operations is really just deploying software and configuring a few basic functions, and that this counts as knowing operations.

For example: after installing LAMP or LNMP, you may feel you have mastered deployment. In fact, one-click installation scripts for most of this are available online and involve no technical depth; in the interviewer's eyes, they are not your highlights. A company's basic architecture and environment are usually already deployed, and you rarely need to change them. Even if you have installed LNMP, are you familiar with the principles of the architecture? Are you familiar with Nginx optimization and MySQL optimization?

Another example: in one interview I was asked, since you are familiar with the LNMP architecture, what is the role of the Nginx reverse proxy?

Don't just say that you understand the software and its configuration; explain, as far as you can, how to optimize it and how to improve site performance in depth.

1, A reverse proxy can be understood as Layer 7 (application layer) load balancing. With load balancing you can easily scale out the server cluster, raising the cluster's overall concurrency and its ability to withstand load.

2, A reverse proxy server usually also provides a local cache. Caching static resources effectively reduces the load on the back-end servers and thereby improves performance.
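The two points above can be sketched as a toy model in Python. This is not how Nginx is implemented; the class name, the back-end names, and the list of "static" file extensions are all assumptions made for illustration. It only shows round-robin distribution across back ends and a local cache that keeps repeated static requests away from them.

```python
from itertools import cycle


class ReverseProxySketch:
    """Toy model of a Layer 7 reverse proxy: round-robin balancing plus a local cache."""

    def __init__(self, backends):
        self._backends = cycle(backends)  # round-robin over the back-end pool
        self._cache = {}                  # path -> cached response body
        self.backend_hits = 0             # how many requests reached a back end

    def _fetch(self, backend, path):
        # Stand-in for a real upstream HTTP request.
        self.backend_hits += 1
        return f"{backend} served {path}"

    def handle(self, path):
        # Static resources are answered from the cache when possible.
        is_static = path.endswith((".css", ".js", ".png"))
        if is_static and path in self._cache:
            return self._cache[path]
        response = self._fetch(next(self._backends), path)
        if is_static:
            self._cache[path] = response
        return response


proxy = ReverseProxySketch(["app1", "app2"])  # hypothetical back-end names
first = proxy.handle("/index.php")   # goes to app1
second = proxy.handle("/index.php")  # goes to app2 (dynamic, never cached)
proxy.handle("/logo.png")            # fetched once from a back end
proxy.handle("/logo.png")            # served from the local cache
print(first, "|", second, "|", proxy.backend_hits)
```

Four requests arrive, but only three reach a back end, which is the load reduction the cache is meant to provide.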

Here are the core skills you need to master in operations work

Note that these are mastered on the job; they are hard to master through study alone.

 

1, Troubleshooting comes first

● When a program cannot run, or runs with unexpected results for no apparent reason, trace the running program and inspect the system calls of the process.

● Analyze system bottlenecks in greater depth.

Check the remaining memory:

free -m
# -/+ buffers/cache:   6458   1649
# 6458M is the memory really in use; 1649M is the memory really available (free memory + cache + buffer)
# Linux uses all remaining memory as cache, so to keep Linux running fast you need to leave enough memory for the cache
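The same arithmetic can be done from /proc/meminfo. The following is a minimal sketch, assuming the standard MemTotal, MemFree, Buffers, and Cached fields; the function name and the sample values are hypothetical, with the sample chosen so that it reproduces the 6458M and 1649M figures above.

```python
def real_available_mb(meminfo_text):
    """Return (really_used_mb, really_available_mb) from /proc/meminfo content."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])  # values are in kB
    # Memory that is free or can be reclaimed from buffers and cache.
    reclaimable = fields["MemFree"] + fields["Buffers"] + fields["Cached"]
    used = fields["MemTotal"] - reclaimable
    return used // 1024, reclaimable // 1024


# Hypothetical sample, not taken from a real host.
SAMPLE = """MemTotal:        8301568 kB
MemFree:          211968 kB
Buffers:          184320 kB
Cached:          1292288 kB
"""

print(real_available_mb(SAMPLE))
```

On a live Linux host the input would be the content of /proc/meminfo, for example `real_available_mb(open("/proc/meminfo").read())`.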

System information:

uname -a            # view Linux kernel version information
cat /proc/version   # kernel version
cat /etc/issue      # display system version
lsb_release -a      # display system version (requires centos-release to be installed)
locale -a           # list all locales
locale              # encoding of all current environment variables
hwclock             # view the hardware clock time
who                 # users currently online
w                   # users currently online
whoami              # view the current user name
logname             # view the original login user name
uptime              # view how long the server has been up
sar -n DEV 1 10     # view network card traffic rate
dmesg               # show boot messages
lsmod               # view loaded kernel modules

Hardware information:

more /proc/cpuinfo                                       # view CPU information
lscpu                                                    # view CPU information
cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c    # view the CPU model and the number of logical cores
getconf LONG_BIT                                         # word size the CPU is running in (32 or 64)
cat /proc/cpuinfo | grep 'physical id' | sort | uniq -c  # number of physical CPUs
cat /proc/cpuinfo | grep flags | grep ' lm ' | wc -l     # a result greater than 0 means 64-bit is supported
cat /proc/cpuinfo | grep flags                           # check CPU virtualization support: pae supports paravirtualization, Intel VT supports full virtualization
more /proc/meminfo                                       # view memory information
dmidecode                                                # view full hardware information
dmidecode | grep "Product Name"                          # view server model
dmidecode | grep -P -A5 "Memory\s+Device" | grep Size | grep -v Range   # view memory slots
cat /proc/mdstat                                         # view software RAID information
cat /proc/scsi/scsi                                      # view Dell hardware RAID information (IBM and HP require the vendor's own tools)
lspci                                                    # view hardware information
lspci | grep RAID                                        # check whether RAID is supported
lspci -vvv | grep Ethernet                               # view network card model
lspci -vvv | grep Kernel | grep driver                   # view driver modules
modinfo tg2                                              # view driver version (driver module)
ethtool -i em1                                           # view network card driver version
ethtool em1                                              # view network card settings such as link speed
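The /proc/cpuinfo pipelines above can also be expressed as one small Python function. This is only a sketch: the function name and the sample text are hypothetical, and the sample is shortened to the fields the function reads.

```python
def cpu_summary(cpuinfo_text):
    """Count physical CPUs, logical cores, and 64-bit support from /proc/cpuinfo content."""
    physical_ids = set()
    logical = 0
    long_mode = False
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1                      # one "processor" entry per logical core
        elif key == "physical id":
            physical_ids.add(value)           # distinct ids = physical CPUs
        elif key == "flags" and "lm" in value.split():
            long_mode = True                  # "lm" (long mode) means 64-bit capable
    return {"physical": len(physical_ids), "logical": logical, "64bit": long_mode}


# Hypothetical, shortened sample: one physical CPU with two logical cores.
SAMPLE_CPUINFO = """processor   : 0
model name  : Example CPU
physical id : 0
flags       : fpu pae lm vmx

processor   : 1
model name  : Example CPU
physical id : 0
flags       : fpu pae lm vmx
"""

print(cpu_summary(SAMPLE_CPUINFO))
```

Splitting the flags line into words before testing for "lm" plays the same role as the spaces in `grep ' lm '`: it avoids matching flags that merely contain those two letters.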

● Use a web log analysis system (e.g. the Nihuo log analysis software).

● Analyze system performance bottlenecks (IO / memory / CPU; common tools include sar, top with its Shift-key shortcuts, vmstat, iostat, and ipcs).

Commonly used log management commands:

history                                  # command history, 1000 entries by default
HISTTIMEFORMAT="%Y-%m-%d %H:%M:%S "      # make history show the time of each command
history -c                               # clear the command history
cat $HOME/.bash_history                  # the command history file
lastb -a                                 # list failed login attempts; to clear this binary log: echo > /var/log/btmp
last                                     # view user login records; to clear this binary log: echo > /var/log/wtmp (opening it directly shows garbled text)
who /var/log/wtmp                        # view user login records
lastlog                                  # last login time of each user
tail -f /var/log/messages                # system log
tail -f /var/log/secure                  # ssh log
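As one example of going beyond `tail -f`, the ssh log can be summarized with a few lines of Python. The sketch below assumes the usual OpenSSH "Failed password for ... from ... port ..." message format; the function name, host name, and sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches both "for root from ..." and "for invalid user admin from ...".
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def failed_logins_by_ip(log_lines):
    """Count failed ssh password attempts per source address."""
    counts = Counter()
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical sample lines in the style of /var/log/secure.
SAMPLE_LOG = [
    "Jul 31 10:00:01 web1 sshd[1234]: Failed password for root from 192.0.2.10 port 51234 ssh2",
    "Jul 31 10:00:05 web1 sshd[1235]: Failed password for invalid user admin from 192.0.2.10 port 51240 ssh2",
    "Jul 31 10:01:00 web1 sshd[1240]: Accepted password for deploy from 198.51.100.7 port 40000 ssh2",
    "Jul 31 10:02:00 web1 sshd[1250]: Failed password for root from 203.0.113.5 port 42000 ssh2",
]

print(failed_logins_by_ip(SAMPLE_LOG).most_common(1))
```

On a real host the input would be the lines of /var/log/secure, which normally requires root privileges to read.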

2, Optimization

Optimization is arguably the most sought-after operations skill, and operations engineers who can optimize are generally well paid. But optimization carries risk: changing configuration files or parameters by following articles found online is not optimization, and it can easily cause downtime.

Optimization means tuning each parameter against the actual production environment, hardware, and software in order to improve site performance. I only half understand this myself; when tuning MySQL and Tomcat parameters, I also look the parameters up in online articles and the official documentation, test them on a virtual machine, and then check the results.

Cost optimization, performance optimization. Here are the Tomcat JVM parameters I tuned (tested against the production environment). Remember: no monitoring, no tuning.

- standard parameters, supported by all JVMs

-X non-standard parameters; each JVM implementation differs

-XX unstable parameters; may be removed in the next version

Serial collector: single-threaded, serialized collection

Parallel collector: multi-threaded collection

Start jvisualvm.exe to monitor the JVM and to dump the heap on memory overflow

-Xms: Initial heap size

-Xmx: maximum heap size

-Xss: thread stack size

-XX:NewSize=n: sets the size of the young generation

-XX:NewRatio=n: sets the ratio of the old generation to the young generation. For example, 3 means young generation : old generation = 1:3, so the young generation takes 1/4 of the combined young and old generations

-XX:SurvivorRatio=n: ratio of the Eden space to one of the 2 Survivor spaces in the young generation

-XX:MaxPermSize=n: sets the maximum size of the permanent generation
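The two ratio parameters are easier to read with the arithmetic written out. The sketch below applies only the ratio definitions; a real JVM also aligns and adjusts the sizes, so treat the numbers as approximate. The function name is hypothetical, the 4096 MB heap mirrors a -Xmx4g setting, NewRatio=3 is the example given above, and SurvivorRatio=6 is an assumed value picked for round numbers.

```python
def generation_sizes_mb(heap_mb, new_ratio, survivor_ratio):
    """Split a heap into old/young/eden/survivor sizes (MB) from the two ratios."""
    young = heap_mb / (1 + new_ratio)        # young : old = 1 : new_ratio
    old = heap_mb - young
    survivor = young / (survivor_ratio + 2)  # eden : one survivor = survivor_ratio : 1, two survivors
    eden = young - 2 * survivor
    return {"old": old, "young": young, "eden": eden, "survivor_each": survivor}


sizes = generation_sizes_mb(4096, new_ratio=3, survivor_ratio=6)
print(sizes)
```

With these inputs the young generation is 1024 MB (1/4 of the heap), split into a 768 MB Eden space and two 128 MB Survivor spaces.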

Collector settings

-XX:+UseSerialGC: use the serial collector

-XX:+UseParallelGC: use the parallel collector

-XX:+UseConcMarkSweepGC: use the concurrent collector

Collection statistics

-XX:+PrintGC

-XX:+PrintGCDetails

-Xloggc:filename

 

Tomcat optimization: the JVM parameters settled on

set JAVA_OPTS=

-Xms4g

-Xmx4g

-Xss512k

-XX:+AggressiveOpts  aggressive optimization options; enables the optimizations that have been added

-XX:+UseBiasedLocking  lock optimization (biased locking); chosen in almost all cases

-XX:PermSize=64m  initial size of the permanent generation; the maximum below is 300m, set larger when there are many classes

-XX:MaxPermSize=300m

-XX:+DisableExplicitGC  explicit System.gc() calls do not trigger GC

-XX:+UseConcMarkSweepGC  use CMS to shorten response time: concurrent collection, low pauses

-XX:+UseParNewGC  parallel garbage collection for the young generation

-XX:+CMSParallelRemarkEnabled  when UseParNewGC is in use, keeps the remark time as short as possible

-XX:+UseCMSCompactAtFullCollection  with the concurrent collector, enables compaction of the old generation to reduce fragmentation

-XX:LargePageSizeInBytes=128m  large memory page size, to improve performance

-XX:+UseFastAccessorMethods  converts get/set methods to native code

-Djava.awt.headless=true  works around a bug on Linux that can occur when Tomcat handles charts

 

Front tomcat did not participate in any parameter tuning is probably about 605 per second per second, nearly three times the 435 results

 

3, Development skills

Shell and Python are the first choice. When shell no longer meets your needs or is too inefficient, Python is the best choice for automation. Job postings today generally require the ability to write shell, Python, or Perl scripts; my personal vote goes to Python.

Python is quick to pick up and easy to understand.

Python has a very rich set of server management tools: configuration management (SaltStack), batch execution (Fabric, SaltStack), monitoring (Zenoss, Nagios plug-ins), virtualization management (python-libvirt), process management (Supervisor), cloud computing (OpenStack), and more. Most system C libraries also have Python bindings.

Anything with a well-defined process should ultimately be brought into the system management framework, written as a program, and made part of the system, rather than left as assorted one-off scripts that cannot be reused.
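As a small illustration of turning a one-off script into a reusable piece, the sketch below wraps command execution in a single function built on the standard library's `subprocess.run`. The function name and the demo command are assumptions; on a server the argument list could be something like `["uptime"]` or `["df", "-h"]`.

```python
import subprocess
import sys


def run_command(args, timeout=10):
    """Run a command without a shell and return (exit_code, stdout, stderr)."""
    completed = subprocess.run(
        args,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # report failures through the exit code instead of raising
    )
    return completed.returncode, completed.stdout.strip(), completed.stderr.strip()


# Portable demo command that works wherever Python itself runs.
code, out, err = run_command([sys.executable, "-c", "print('ok')"])
print(code, out)
```

Returning the exit code and both output streams lets the caller decide how to react, which is what makes the function reusable inside a larger management system.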

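With the arrival of the cloud computing era, small and medium-sized companies no longer need operations staff. At large companies, operations engineers without engineering and development skills are not competitive.

Most importantly, learning Python well can get you a raise, a raise, a raise. (Important things are said three times.)

I am currently learning Python myself and converting my earlier shell script examples into Python scripts.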

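4, Awareness

1) Security awareness:

Operations staff have very broad privileges, so accounts and private keys must be kept secure.

● Preferably store them with an encryption tool, such as TrueCrypt or 1Password

● Store them locally. Never use cloud storage drives, and tools such as LastPass are not recommended either

● Protect the ssh private key with a passphrase

2) Sharpen-the-axe awareness:

Before any operation or configuration change, first understand the principle behind it, and only then act. As the saying goes, "sharpening the axe does not delay the woodcutting," and what you learn carries over to similar operations.

3) Planning awareness:

For complex changes, such as ones spanning multiple hosts or involving SAN storage, draw up an operation plan first: write a plan document detailed down to each command, then ask an expert to review it. This makes the whole operation as safe as possible. For important customer business systems, the operation should have a rollback plan, so that if the change fails the customer can roll the business back within a short time.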
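The rollback idea can be sketched for the simplest case, a single configuration file. Everything here is hypothetical (the function name, the file name, and the validation rule), and a real change plan would cover far more than one file; the point is only the order of steps: back up, apply, verify, roll back on failure.

```python
import shutil
import tempfile
from pathlib import Path


def apply_with_rollback(config_path, new_text, validate):
    """Write new_text to config_path; restore the backup if validate() rejects it."""
    config_path = Path(config_path)
    backup = config_path.with_name(config_path.name + ".bak")
    shutil.copy2(config_path, backup)          # step 1: keep a rollback copy
    config_path.write_text(new_text)           # step 2: apply the change
    if validate(config_path.read_text()):      # step 3: verify the result
        return True
    shutil.copy2(backup, config_path)          # step 4: roll back on failure
    return False


def workers_is_number(text):
    # Hypothetical validation rule: the value after "=" must be a number.
    return text.strip().split("=")[1].isdigit()


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "app.conf"
    conf.write_text("workers=4\n")
    ok = apply_with_rollback(conf, "workers=abc\n", workers_is_number)
    result = (ok, conf.read_text())

print(result)
```

The invalid change is rejected and the file ends up with its original content, which is the behavior a rollback plan is meant to guarantee.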

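4) Record-and-share awareness:

When you run into a case you consider unusual, remember to document the case and its analysis. This makes it easy to look back on later or to share with colleagues, spreading the knowledge so that everyone can avoid detours in the future.

5) Monitoring awareness:

For operations, monitoring is essential. Monitoring is the eye that spots all kinds of system anomalies, so operations work should be tightly coupled with monitoring.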
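A monitoring check can start very small. The sketch below flags file systems above a usage threshold using the standard library's `shutil.disk_usage`; the function name, the 90% threshold, and the fake usage figures in the demo are assumptions, and a real setup would feed such checks into a monitoring system such as Nagios.

```python
import shutil
from collections import namedtuple


def disk_alerts(paths, threshold_percent=90.0, usage=shutil.disk_usage):
    """Return (path, percent_used) for every path at or above the threshold."""
    alerts = []
    for path in paths:
        stats = usage(path)  # has .total, .used, .free in bytes
        percent = stats.used * 100 / stats.total
        if percent >= threshold_percent:
            alerts.append((path, round(percent, 1)))
    return alerts


# Fake usage data so the demo gives the same result on any machine.
Usage = namedtuple("Usage", "total used free")
fake = {"/": Usage(100, 95, 5), "/data": Usage(100, 40, 60)}
print(disk_alerts(fake, usage=fake.get))
```

On a real host the call would simply be `disk_alerts(["/", "/data"])`, which uses the actual file system figures.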

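6) Business awareness:

Learn as much as you can about the type of business running on each host you maintain and about how the hosts' services relate to one another. All maintenance work exists so that the hosts can serve the business. When a service is interrupted, you can then immediately tell which group of hosts is involved, narrow the scope of troubleshooting, and locate the fault as quickly as possible.

Attached is a topology diagram of the operations approach: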

 

 


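Awareness matters. Being technically strong and knowing many technologies well does not mean you can do without operations awareness. Managers actually place great weight on it: whether backups are in place, how permissions are assigned, how the platform has been tested, how quickly faults are answered, and so on. All of these are matters of awareness. It is not about learning a lot of technology and considering yourself an expert, then shrugging when the platform has a fault, handling what looks like a simple problem only when you feel like it, and not reporting back to other departments. Managers look not at how good your technology is but at how good your operations awareness is. Without it, even the strongest technical skills are useless and will only put you out of step with the people in other departments.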

 

Reposted from: https://mbd.baidu.com/newspage/data/landingshare?pageType=1&isBdboxFrom=1&context=%20%7B%22nid%22%3A%22news_10707237857143559312%22%2C%22sourceFrom%22%3A%22bjh%22%7D


Origin www.cnblogs.com/rui517hua20/p/11280143.html