22 real project experiences (must-knows) for operations and maintenance interviews: learn them and double your salary

Project 1: Bringing new servers online

Responsible for building the system environment and software environment when the company's new servers go online.

1. Deploy using the existing configuration tooling (Ansible + playbooks).

2. Customize deployment templates according to the application system's environment requirements (system environment initialization, system optimization, service software installation, and configuration templates); create one-click playbooks and use roles to split the different tasks into separate templates.

3. Deploy the LNMP environment on the newly provisioned servers, compile Nginx and prepare its configuration files, and deploy the software in batches.

4. Complete the automated deployment and bring the servers online (see the sketch after this list).

5. Verify the deployment result against the customized check templates.
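A minimal sketch of the kind of one-click run described above, assuming a hypothetical inventory path, host group, and tag/role names (`init`, `optimize`, `lnmp`) rather than the project's actual ones:

```bash
# Run the one-click provisioning playbook against the new host group.
ansible-playbook -i inventory/new_servers site.yml \
  --limit new_web --tags "init,optimize,lnmp"

# Spot-check the result with an ad-hoc task, e.g. confirm nginx is listening on port 80.
ansible new_web -i inventory/new_servers -m shell -a "ss -lntp | grep ':80'"
```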

Project 2: Web architecture adjustment

Web server architecture adjustment (from single point to cluster design)

Requirement: eliminate the single point of failure of the web servers hosting the website's multiple projects.

Responsibilities:

1. Research various load balancing schemes, focusing mainly on lvs+keepalived and nginx+keepalived.

2. Write the implementation plan and implementation schedule for the new architecture.

3. Deploy the new system and handle daily maintenance; most of the company's original single-point servers were turned into clusters, improving the website's stability and its handling of high-concurrency scenarios.
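A minimal failover smoke test for the nginx+keepalived option, assuming a hypothetical VIP of 10.0.0.100 on interface eth0; this is a sketch of how a cutover can be verified, not the project's actual test plan:

```bash
# Confirm the site answers on the VIP managed by keepalived.
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.100/

# On the current MASTER: stop keepalived, confirm the VIP is released,
# then check that the BACKUP node has picked it up and traffic still flows.
service keepalived stop
ip addr show dev eth0 | grep -q 10.0.0.100 || echo "VIP released on this node"
curl -s -o /dev/null -w "%{http_code}\n" http://10.0.0.100/
```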

Project 3: EFK log collection and audit

Proposal and implementation of a server log audit project

1. Implement the EFK logging scheme with security and permission control for all users.

2. Build the EFK log collection and management system.

3. Use EFK to implement log auditing and centralized record management for all server systems, users, and services.

4. After implementation, all operations and development staff can view the logs corresponding to their roles in Kibana, while the administrator manages all logs, achieving secure collection, analysis, and auditing of server logs.
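A minimal collection-side sketch, assuming Filebeat as the "F" in EFK and a hypothetical Elasticsearch address; the real project may use Fluentd or different log paths:

```bash
# Ship system and auth logs to the central Elasticsearch node so they can be
# searched and audited in Kibana (host address and paths are hypothetical).
cat >/etc/filebeat/filebeat.yml <<'EOF'
filebeat.inputs:
  - type: log
    paths:
      - /var/log/secure      # user login / sudo audit trail
      - /var/log/messages    # system events
output.elasticsearch:
  hosts: ["10.0.0.20:9200"]
EOF
systemctl enable --now filebeat
```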

Project 4: Batch distribution

Implement batch distribution and batch management of server data across the whole network

Requirement:

The company's servers keep increasing in number and managing them one by one is very troublesome, so a solution for batch distribution and management of server data across the whole network was proposed.

Responsibilities:

1. Research two distribution management approaches, the Ansible configuration automation tool and ssh key + rsync, and finally choose the simple, easy-to-maintain, and capable ssh key + rsync solution.

2. Use an IDC intranet server as the distribution machine and set up ssh key authentication for a fixed ordinary user (note: not root); where root privileges are required, they are controlled through sudo to reduce security risks (see the sketch after this list).

3. Apply security hardening to the distribution machine, for example removing its external IP and enabling the firewall. After the setup was completed, operations management efficiency improved greatly, and the work was rewarded by the company.
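A minimal distribution sketch under the assumptions above: a hypothetical non-root account `dist`, hypothetical host IPs and paths, and sudo limited to rsync on the target side.

```bash
#!/bin/bash
# Push a configuration directory from the distribution machine to every node
# as the ordinary user, escalating only the remote rsync through sudo.
USER=dist                          # fixed non-root distribution account
SRC=/data/conf/                    # directory to distribute (hypothetical)
DEST=/etc/app/                     # target directory on each node (hypothetical)
HOSTS=(10.0.0.11 10.0.0.12 10.0.0.13)

for h in "${HOSTS[@]}"; do
  rsync -avz -e "ssh -o StrictHostKeyChecking=no" \
    --rsync-path="sudo rsync" "$SRC" "${USER}@${h}:${DEST}" \
    && echo "OK   $h" || echo "FAIL $h"
done
```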

Project 5: User security management

Server user permission management transformation plan and implementation

Requirement: solve the problem of the company's root privileges being handed out too freely.

Responsibilities:

1. Propose a permission rectification plan to improve the current over-distribution of root privileges.

2. Convene meetings to discuss and finalize the plan, and push its implementation.

3. After implementation, the company's permission management is much clearer (summarized and maintained below), which fundamentally reduces irregular internal operations and the security risks they bring.

3.1 Manage user permissions through sudo. Whether for operations or development staff, root privileges are generally not granted; only core-level developers or the R&D director and above may be given the corresponding server-level permissions, and root is given only to core operations staff or the operations director.

3.2 When planning servers, ordinary users are planned by project. Product lines differ from company to company; this company has only a dozen or so product lines, and an ordinary user is created for each project, so both nginx and tomcat run under that project's ordinary user.

3.3 Public service permission planning: public services also run under ordinary users. Generally speaking, operations does operations and development does development; operations is responsible for the network and the system, and as long as the system has no faults, the network has no faults, and system resources are sufficient, operations' responsibilities are fulfilled. The company's philosophy is a project responsibility system, meaning the person responsible for each project is the developer; operations bears roughly 30%-40% of the responsibility and development about 60%. When a process goes online, the service runs as an ordinary user, and each site directory is owned by that ordinary user with permissions of 700, which is the safest arrangement. Project start/stop, code releases, log collection, and daily log analysis are all performed by the ordinary user that runs the process. While managing a project, the development users can be added to that project's group, so that the developers responsible for the corresponding project have all of the project's permissions (see the sketch below).
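A minimal sketch of the per-project user and sudo layout described in 3.1-3.3, using hypothetical names (`mall` project, `dev_zhang`, `ops_li`) and an illustrative sudo rule:

```bash
# One ordinary user and group per project; the site directory is 700, owned by it.
groupadd mall
useradd -g mall -s /bin/bash mall
mkdir -p /app/mall && chown -R mall:mall /app/mall && chmod 700 /app/mall

# Developers responsible for the project join its group; paths that should be
# shared with them (e.g. logs) can then be opened to the group as needed.
usermod -aG mall dev_zhang

# Operations staff get the specific commands they need via sudo, not full root.
echo 'ops_li ALL=(root) NOPASSWD: /bin/systemctl restart nginx' > /etc/sudoers.d/ops_li
chmod 440 /etc/sudoers.d/ops_li
```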

Project 6: Data backup

Proposed and responsible for implementing a whole-network server data backup plan

Requirement: build a complete backup system for the company's data

Responsibilities:

1. In view of the chaotic state of the company's important data backups, propose a solution to the leadership for backing up data across the whole network.

2. Package and back up data locally, then use rsync combined with inotify to back up data from the whole network to a fixed storage server; on the storage server, scripts check the backup results and alert the administrator (see the sketch after this list).

3. Regularly back up the cloud hosts' data to the company's internal servers to prevent data loss caused by incidents such as earthquakes or fires.
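A minimal sketch of the rsync + inotify push in item 2, assuming a hypothetical source path and an rsync daemon module named `backup` on the storage server:

```bash
#!/bin/bash
# Watch the data directory and push every change to the central backup server.
SRC=/data/app/                                        # hypothetical source path
DEST=rsync_backup@10.0.0.200::backup/$(hostname)/     # hypothetical rsync module

inotifywait -mrq -e close_write,create,delete,move --format '%w%f' "$SRC" |
while read -r _; do
  rsync -az --delete --password-file=/etc/rsync.pass "$SRC" "$DEST"
done
```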

Project 7: Database master-slave

MySQL database master-slave synchronization and complete backup solution

1. Before I joined the company, the previous operations staff had lost data, so the boss attaches great importance to data security.

2. I therefore proposed and rolled out a MySQL database backup solution and a MySQL master-slave architecture plan.

3. The scheme mainly enables binlog on the slave and performs daily full backups, split by database and table, which are pushed to the backup server (see the sketch after this list).

4. Regularly restore the backup data to the test database for development use.

5. Formulate the process and system for manual database changes.
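A minimal sketch of the daily per-database full backup in item 3, taken on the slave; the credentials, paths, and rsync destination are hypothetical:

```bash
#!/bin/bash
# Daily per-database full backup on the slave, then push to the backup server.
DATE=$(date +%F)
BACKUP_DIR=/backup/mysql/$DATE
mkdir -p "$BACKUP_DIR"

for db in $(mysql -uroot -p"$MYSQL_PWD" -N -e 'SHOW DATABASES' \
            | grep -Ev '^(information_schema|performance_schema|mysql|sys)$'); do
  mysqldump -uroot -p"$MYSQL_PWD" --single-transaction --master-data=2 \
    "$db" | gzip > "$BACKUP_DIR/${db}.sql.gz"
done

rsync -az "$BACKUP_DIR" rsync_backup@10.0.0.200::backup/mysql/
```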

Project 8: LNMP architecture optimization

LNMP architecture optimization solution

1. The company uses an LNMP architecture with little optimization and poor performance.

2. I proposed an optimization plan for the LNMP architecture.

3. The plan mainly covers Linux system optimization, nginx service optimization, php service optimization, and MySQL optimization (see the sketch after this list).

4. After the optimization was completed, the performance of the LNMP architecture improved significantly.
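A minimal sketch of a few common knobs in each of the four areas; the values are illustrative, not the project's actual tuning:

```bash
# Linux: raise the connection backlog and recycle TIME_WAIT sockets faster.
cat >>/etc/sysctl.conf <<'EOF'
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
EOF
sysctl -p

# nginx: worker_processes auto; worker_connections 65535; enable gzip and caching.
# php-fpm: size pm.max_children to fit available memory; enable opcache.
# MySQL: set innodb_buffer_pool_size to roughly 50-70% of RAM on a dedicated DB host.
```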

Project 9: Zabbix whole-network monitoring

Implementation of a whole-network server monitoring solution

Requirement: when I arrived at the company there was no monitoring system, so failures produced no alerts, and every failure had a big impact on the company's websites. I therefore used the monitoring technology I had mastered to write a solution, gather supporting data, and submit it to company leadership, improving the timeliness of server alerts and ensuring that website failures are handled promptly.

Responsibilities:

1. Select the most popular monitoring software, Zabbix, for research according to the requirements.

2. Customize templates for monitoring and real-time alerting according to the different dimensions and specific needs of different servers.

3. Write monitoring scripts to implement custom service checks, and set the corresponding alert thresholds and handling procedures according to business types and peak-traffic patterns (see the sketch after this list).

4. After implementation, most failure alerts reached the operations staff in a timely and effective manner, buying time to keep the website stable.
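A minimal sketch of a custom check as in item 3, exposed through a Zabbix agent UserParameter; the key name, status URL, and server IP are hypothetical:

```bash
# Expose a custom item on the agent so a template can graph and alert on it.
cat >/etc/zabbix/zabbix_agentd.d/userparameter_nginx.conf <<'EOF'
UserParameter=nginx.active,curl -s http://127.0.0.1/nginx_status | awk '/Active/ {print $3}'
EOF
service zabbix-agent restart

# Verify from the Zabbix server before wiring the key into a template/trigger.
zabbix_get -s 10.0.0.31 -k nginx.active
```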

Project 10: Bastion host security

Build a jumpserver bastion host to manage chaotic accounts

Software environment: CentOS 6.5

Software tools: jumpserver

Project description:

In the few months I had been working there, I found that the company's management of server accounts was very chaotic. Some operations staff even had several work accounts and could log in to the root account at any time. As a result, whenever an operations staff member was transferred or resigned, all of the servers' account passwords had to be changed again, which was not only time-consuming and laborious but also hard to keep track of and very troublesome. So, after thinking it over for a while, I suggested to the leadership that we use the open-source bastion host jumpserver to improve the chaotic situation.

Responsibilities:

Deploy a server as the jumpserver bastion host and use xshell to log in through the bastion host for authorization testing.

Project effect:

Security management of all servers across the network was achieved. The bastion host solved the chaos of personnel logging in to servers and realized orderly, secure server management with permissions assigned per server and per person, so that every login and every operation leaves a trace. This avoids unsafe operations by staff and has helped everyone form an awareness of standardized, safe operation. Personnel turnover is handled efficiently through dedicated ssh keys, guaranteeing to the greatest extent that servers are operated only through authorized, standardized, and secure logins.

Project 11: Ansible automation

Project requirements:

With the evolution of DevOps, automated operations are becoming more and more important. As the company's business grows, the number of servers in the cluster keeps increasing, and relying on operations staff to maintain them by hand is costly and inefficient. The company needs a server automation management platform to handle batch operations such as server go-live initialization, application software deployment, configuration management and distribution, and program code deployment. After research, and considering the company's server count and maintenance costs, Ansible was selected as the automation management platform.

Solution:

1. Build the ansible master node and add all of the company's cloud hosts to Ansible host groups for management.

2. Responsible for writing and managing playbooks such as init (initialization), software install, configuration distribution, and code deployment for the different server host groups.

3. Responsible for testing, reviewing, and deploying the daily batch-operation playbooks.

4. Responsible for designing and writing the automated operations specifications and process guidelines.

5. Responsible for correcting problems caused by daily automation, keeping host groups in sync as machines are added or removed, distributing public keys, and so on.

6. Sort out changes to the system environments, installed software, and corresponding configuration files of the different Ansible host groups, and update and revise the existing playbooks, roles, and other stack files accordingly.

7. Responsible for batch deployment and configuration via Ansible of operations support services such as ELK and Zabbix: configuration distribution, log collection templates, monitoring script updates and deployment, etc. (see the sketch after this list).
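A minimal ad-hoc sketch of items 5 and 7, assuming a hypothetical `webservers` group, the `dist` account from the earlier project, and a hypothetical monitoring script path:

```bash
# Distribute the control node's public key to a host group in one batch.
ansible webservers -m authorized_key \
  -a "user=dist state=present key={{ lookup('file', '/home/dist/.ssh/id_rsa.pub') }}"

# Push an updated Zabbix monitoring script and restart the agent everywhere.
ansible webservers -m copy -a "src=files/check_nginx.sh dest=/etc/zabbix/scripts/ mode=0755"
ansible webservers -m service -a "name=zabbix-agent state=restarted"
```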

Project 12: Storage optimization

Improve server storage

Requirement: reduce storage pressure during access peaks

Responsibilities:

1. The web front end uses an NFS primary/backup structure for storage.

2. Data written by users, such as pictures and attachments, is stored on the NFS primary; reads go to the NFS backup.

3. The NFS primary and backup use rsync + inotify for data synchronization.

4. The amount of data stored on NFS is not large, so sersync is used to push the data out to the web front ends, reducing as much as possible the front-end services' requests to the back-end storage and relieving the pressure on NFS.

5. Data backup safety is guaranteed, so there is no need to worry about data loss.

Project 13: Academy comprehensive service cluster

Tertiary Industry Department of the First Academy of Aerospace: academy comprehensive service cluster

Requirement:

The main purpose of this project is to build the internal service platform of the First Academy of Aerospace. The goal is to build a safe, complete, efficient, and stable server cluster architecture that provides a comprehensive service platform for the academy.

Project implementation:

The front end uses load balancing with a Squid cluster, and a hardware firewall isolates the internal network from the external network; it can also monitor the network and record transmitted information to strengthen LAN security. The design realizes high availability of the front-end scheduling servers, load balancing of the middle-tier web servers, and high availability of the back-end database servers, with a monitoring server watching every server in the cluster. The front-end schedulers for both private and public data use Keepalived and Nginx; the middle-tier web servers use Nginx, which handles high concurrency and is relatively stable. The back-end database uses read-write separation: the write library runs MySQL + MHA in a dual-master, mutual master-slave mode, while the read slaves are load balanced with LVS + Keepalived + MySQL and fronted by a Memcached cache cluster. The web servers use Nginx to serve the website and combine Inotify + Rsync to keep the website data synchronized. The monitoring server uses Zabbix to monitor the running status and service status of each server.

Responsibilities:

In this project I was mainly responsible for building the server service platform. To achieve uniformity, I wrote a shell script to make server deployment more standardized (see the sketch below).
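A minimal sketch of such a standardization script, with illustrative package names and steps (it assumes the needed yum repositories are already configured); the project's actual script is not shown in the source:

```bash
#!/bin/bash
# Standardized initialization run on each new cluster node so deployments stay uniform.
set -e
yum -y install nginx keepalived zabbix-agent rsync ntpdate
ntpdate ntp1.aliyun.com                                   # keep cluster clocks in sync
chkconfig nginx on && chkconfig zabbix-agent on
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
echo "standard init finished on $(hostname)"
```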

Project 14: Docker Swarm cluster

Deployment and application of a Docker Swarm cluster for a microservice project

Requirement analysis:

The company's project was being converted to a microservice architecture, with each function point split into a microservice, so the deployment and communication of the many project microservices were initially managed with standalone Docker containers. In the test environment the microservices can be deployed on the same test machine, but for the online services the stability, efficiency, and request volume of the services must be considered, so a workable microservice orchestration and scheduling solution was urgently needed. Considering the operations cost and the small number of projects at the time, docker swarm was chosen as the container orchestration and scheduling solution for the early stage of the project.

Solution:

1. Deploy the docker swarm cluster supporting the initial stage of the project, using 5 servers: one as the swarm manager and the other four as swarm worker nodes.

2. Design and deploy the cluster network, using flannel + etcd as the cross-host network.

3. Write the orchestration and scheduling files required by the project, and use docker stack to deploy the related microservices in batches according to their dependencies (see the sketch after this list).

4. Design the CI/CD architecture (gitlab + docker harbor + jenkins + swarm manager) for deploying and launching the project's many microservices, completing automated and semi-automated integrated delivery and deployment of code to the test cluster and the online cluster.

5. Daily maintenance of the cluster and of the application projects running on it.
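A minimal sketch of items 1 and 3, with hypothetical addresses and stack/file names; the worker token is a placeholder printed by `swarm init`:

```bash
# Initialize the swarm on the manager, then join each worker with the printed token.
docker swarm init --advertise-addr 10.0.0.40
docker swarm join --token <worker-token> 10.0.0.40:2377   # run on each worker

# Deploy the microservices described in a compose-format stack file, then check replicas.
docker stack deploy -c docker-stack.yml shop
docker stack services shop
```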

Project 15: Kubernetes cluster

Requirement analysis:

With the development and maturing of DevOps, the old way of developing and deploying applications was to install them on a host using the operating system's package manager. The disadvantage is that the applications' executables, configuration, libraries, and lifecycles become entangled with each other and with the operating system. To support the concepts of agile and lean development, the company adopted a microservice architecture for its projects, so many traditional projects needed to be converted to microservices, which requires a powerful container scheduling system; the company therefore adopted the most popular orchestration and scheduling system, kubernetes.

Solution:

1. Both the test environment and the production environment use a k8s cluster as the application orchestration, scheduling, and deployment system.

2. The test environment uses 6 cloud servers to build a k8s test cluster (dual master, single etcd, the rest as worker nodes).

3. Design the network environment for the cluster and use flannel as its cross-host network.

4. The production environment uses 20 cloud servers to build a k8s production cluster (dual master, 3 etcd nodes, the rest as worker nodes).

5. Responsible for writing and testing the different resource files used to schedule and deploy each project's microservices (see the sketch after this list).

6. Responsible for the standardized management of each project's configuration in the cluster: configuration management, ports, persistent storage, load balancing, domain names, and the editing and management of resource files.

7. Responsible for the daily addition and removal of cluster nodes and for managing the replica counts of each application's containers.

8. Responsible for maintaining and managing the cluster's ELK log collection and Prometheus monitoring.
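A minimal sketch of the resource-file workflow in items 5-7, with hypothetical namespace, service, and path names:

```bash
# Apply one microservice's manifests, adjust its replica count, and check the rollout.
kubectl apply -f deploy/order-service/                 # Deployment, Service, Ingress, ConfigMap
kubectl -n shop scale deployment order-service --replicas=4
kubectl -n shop rollout status deployment/order-service
kubectl -n shop get pods -l app=order-service -o wide
```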

Project 16: NFS cluster upgrade

NFS cluster upgrade and transformation

Requirement analysis:

1. The original shared-storage approach based on NFS suffered from performance bottlenecks and a single point of failure.

2. When the primary NFS storage went down, an administrator would respond to the alert by hand, pick the most up-to-date NFS storage according to the synchronization logs, and promote it to primary. The approach is simple and feasible but requires manual handling, so operator mistakes or long recovery times are inevitable.

Solution:

1. Use the distributed file storage system MFS to replace NFS.

2. The MFS metadata server is itself a single point, so its disk is kept synchronized in real time and failover is provided through Heartbeat to achieve high availability (see the sketch after this list).

3. Adopt an MFS + DRBD + Heartbeat high-availability solution. It effectively solves the single point of failure of the primary MFS storage: when the primary goes down, service switches automatically from the primary node to the standby node, the new primary automatically connects to all the other MFS storage servers, and its data is almost identical to the primary's at the moment of the crash. The switchover is completely automatic, giving the MFS storage system a hot-standby solution with fast fault recovery and improved business reliability.

Responsibilities:

In this project I was mainly responsible for on-site coordination, for building all the server service platforms, and for writing a shell script to make server deployment more standardized.
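A minimal sketch of the DRBD half of item 2, assuming a hypothetical resource name `r0` that mirrors the MFS metadata directory (/var/lib/mfs is the default MooseFS data path); the Heartbeat configuration is omitted:

```bash
# Bring up the DRBD resource that mirrors the MFS metadata between the two nodes.
drbdadm create-md r0            # initialize metadata (run on both nodes)
drbdadm up r0                   # attach and connect (run on both nodes)
drbdadm primary --force r0      # on the initial primary only
mkfs.ext4 /dev/drbd0 && mount /dev/drbd0 /var/lib/mfs
cat /proc/drbd                  # verify the peers are Connected / UpToDate
```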

Project 17: MySQL high availability

MySQL cluster read-write separation and high availability solution

Requirement analysis:

1. The new solution must guarantee service performance and I/O to meet the rapid-response requirements of the enterprise's many terminals.

2. Ensure long-term, uninterrupted, stable operation of the system while keeping costs reasonable.

3. Meet the high availability and reliability requirements of the database system.

Solution:

1. The bottom layer consists of 5 MySQL databases, one master and four slaves, with semi-synchronous replication enabled to improve data safety (see the sketch after this list).

2. Use the middleware Atlas to implement read-write separation and read load balancing, reducing coupling between the solution and the terminals.

3. Use two servers to build LVS + Keepalived, responsible for load balancing and high availability of the Atlas servers.

4. Build an MHA manager server to manage hot standby of the database master.

5. This solution greatly reduces wasted server resources, achieves failover within 30 seconds, and greatly ensures database consistency.

Responsibility description: mainly responsible for building all server service platforms, scheme design, and script writing.
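A minimal sketch of enabling the semi-synchronous replication in item 1; the plugin names are the stock ones shipped with MySQL 5.5-5.7, so adjust for the actual version used:

```bash
# On the master: load and enable the semi-sync master plugin.
mysql -uroot -p -e "
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;"

# On each slave: load the slave plugin and restart the IO thread to pick it up.
mysql -uroot -p -e "
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"
```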

Project 18: NFS + DRBD high availability

NFS + DRBD + Heartbeat high availability solution

Software environment: CentOS 6.8

Hardware environment: DELL R710

Project description:

Not long after I joined the company, the back-end NFS server occasionally went down during peak request periods, the web servers' mount requests could not be switched automatically to a backup server, and as a result the servers could not be used normally and network services were suspended. To avoid similar situations in the future, company leadership asked me to produce a solution. After observing the NFS server's CPU and memory load, querying historical data on its main hardware load, and analyzing it carefully, I submitted a DRBD + Heartbeat + NFS solution to the existing problems; the leadership approved it and I implemented the plan.

Responsibilities:

1. Responsible for the overall planning and deployment of the project;

2. Responsible for writing the Heartbeat automatic switchover script;

3. Responsible for setting up the main framework of the NFS service;

4. Observe the running data of the metadata server and the data storage servers by simulating faults, compare it with the previous situation, and produce a report;

5. Write the project implementation report.

Later improvements: configure multiple independent physical links to avoid a single point of failure in the Heartbeat communication line itself and minimize the chance of "split brain"; shorten the master-slave switchover time through settings such as keepalive in the ha.cf configuration file; tune the DRBD replication process; and handle bad blocks on the master side.

Project 19: Separate deployment for mobile and PC

Deployment, tuning, and launch of the mobile site

Operating environment: CentOS 6.6, DELL R730

Main function: separate the mobile and PC business

Technologies used: Nginx layer-7 load balancing, tomcat8 + jdk1.8, MHA for MySQL high availability (mysql-5.6.17), php-5.6.30, shell scripts to send data check information

Technical points:

Use Nginx layer-7 scheduling to separate PC and mobile traffic (see the sketch below), plus layer-7 dynamic/static separation scheduling and a Keepalived high-availability solution; Nginx integrates Taobao's health-check module; database read-write separation plus the mature MHA high-availability scheme; scheduled scripts for MySQL data backup and checking that send the results to the administrator's mobile phone; web service optimization, php optimization, and tomcat optimization.
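A minimal sketch of the layer-7 PC/mobile split, with hypothetical upstream addresses and domain name; it simply keys on the User-Agent header:

```bash
# Nginx server block: send mobile User-Agents to the mobile pool, everything else to PC.
cat >/etc/nginx/conf.d/split.conf <<'EOF'
upstream pc_pool     { server 10.0.0.51:8080; server 10.0.0.52:8080; }
upstream mobile_pool { server 10.0.0.61:8080; server 10.0.0.62:8080; }

server {
    listen 80;
    server_name www.example.com;
    location / {
        set $pool pc_pool;
        if ($http_user_agent ~* "(android|iphone|ipad|mobile)") { set $pool mobile_pool; }
        proxy_pass http://$pool;
        proxy_set_header Host $host;
    }
}
EOF
nginx -t && nginx -s reload
```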

Project 20: Squid transparent proxy

Squid transparent proxy

1. System environment: CentOS 6.5

2. Software tools: squid-3.0

3. Project description:

The company previously used plain SNAT for Internet access, so employees used the company network to browse websites and videos unrelated to work, reducing work efficiency, and the proliferation of Xunlei, P2P, and similar applications caused network congestion and strained the enterprise's bandwidth resources.

4. Responsibilities:

a) Use the squid proxy service to control the online behavior of company employees (see the sketch after this list);

b) Formulate the enterprise's online-behavior management and control plan;

c) Implement intranet security prevention and control, filter malicious web pages, and prevent malicious attacks;

d) Restrict network behavior and intelligently control download software such as Xunlei and other P2P tools;

e) Perform fine-grained, intelligent management of online behavior.

5. Project results:

After the project was implemented, employee work efficiency improved significantly and the enterprise's network bandwidth resources were protected.
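A minimal sketch for item a), assuming a hypothetical LAN range and blacklist file; `transparent` is the squid 3.0 keyword (later versions use `intercept`):

```bash
# Transparent squid with a domain blacklist, plus the gateway rule that redirects
# the LAN's outbound HTTP traffic through it.
cat >>/etc/squid/squid.conf <<'EOF'
http_port 3128 transparent
acl lan src 192.168.1.0/24
acl blacklist dstdomain "/etc/squid/blacklist.txt"   # video/P2P domains to block
http_access deny blacklist
http_access allow lan
http_access deny all
EOF
service squid restart

iptables -t nat -A PREROUTING -s 192.168.1.0/24 -p tcp --dport 80 -j REDIRECT --to-ports 3128
```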

Project 21: Website access optimization

Requirement analysis: users complained that the website could not be accessed or was slow. Because the company had no dedicated operations staff for a period early in its development, many problems accumulated that caused slow or even failed access to the website. Dedicated operations staff were then brought in to find and solve the related problems, and I was responsible for the overall analysis of the website access speed optimization project and for proposing solutions.

Solution:

Reasons a website can be slow:

(1) If access is slow only in a certain region and a CDN is in use, the CDN node in that region may have a problem.

(2) The website server's outbound bandwidth is saturated.

(3) The server load is too high.

(4) Website code quality problems: press F12 in a browser to see which elements load slowly, or look for unoptimized SQL statements in the code.

(5) Database server bottleneck.

Optimization methods (a quick first-pass checklist is sketched after this list):

(1) CDN issues: contact the vendor.

(2) Bandwidth problems: spend money to buy more bandwidth.

(3) Server load problems: use commands such as top to find the processes consuming the most resources and analyze each one specifically.

(4) Code problems: negotiate with development.

(5) Database bottlenecks: create indexes, split databases and tables, separate reads and writes.

(6) Overall architecture optimization: scale horizontally, add load balancing servers, use cache servers (such as redis), cluster high availability, and so on.
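A minimal first-pass troubleshooting sketch matching the causes above; the domain is hypothetical:

```bash
# Quick checks: load and top consumers, connection state, NIC throughput,
# where request latency is spent, and long-running SQL on the database host.
top -b -n1 | head -15
ss -s
sar -n DEV 1 3
curl -o /dev/null -s -w "dns:%{time_namelookup} connect:%{time_connect} total:%{time_total}\n" \
  http://www.example.com/
mysql -e "SHOW FULL PROCESSLIST" | grep -v Sleep
```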

Project 22: CI/CD integrated deployment

Requirement analysis: with the growth of the company's business there are now multiple projects covering the front end, server side, and back end. As R&D staff increase and agile, efficient development is promoted, operations keeps optimizing the code integration and deployment solution to give the test and production environments an efficient development --> deployment --> test --> production pipeline, so the company's code integration and release process needs to be sorted out and improved to achieve a standardized CI/CD system.

Solution:

1. Tooling: the currently mainstream DevOps automation tool chain of git + gitlab + jenkins + ansible + shell + python + Enterprise WeChat/DingTalk APIs.

2. Provide the distributed code tools git + gitlab to the company's R&D, test, and operations staff, provide multiple repositories according to project and requirement, assign permissions for the corresponding repositories, and support independent branch development by the people involved in each project.

3. Complete the test-environment jenkins system: multiple projects automatically pull code and deploy it to the test servers, realizing an efficient pipeline of continuous integration, delivery, and deployment and providing a strong guarantee for the workflow of R&D and test staff.

4. Complete the semi-automation of multiple projects in the production-environment jenkins system (standardized go-live, scheduled windows, triggered by operations): pull the tested and integrated project code and deploy it to the corresponding project servers, completing the production environment's small, iterative updates and providing a strong guarantee for flexible, efficient, and safe customer service.

5. Combine APIs such as DingTalk and Enterprise WeChat to write scripts for build-trigger notifications, completion and failure notifications, and so on, so that jenkins builds promptly notify the relevant people to track the build and deployment process; responsibility is traced to individuals in time and the project's CI/CD process becomes more efficient (see the sketch after this list).
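A minimal sketch of item 5 using the DingTalk group-robot webhook; the access token is a placeholder, and the environment variables are Jenkins' standard build variables:

```bash
#!/bin/bash
# Post a Jenkins build notification to a DingTalk group robot.
WEBHOOK="https://oapi.dingtalk.com/robot/send?access_token=<your-token>"   # placeholder
MSG="Jenkins job ${JOB_NAME} #${BUILD_NUMBER} finished, details: ${BUILD_URL}"

curl -s "$WEBHOOK" -H 'Content-Type: application/json' \
  -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"${MSG}\"}}"
```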
