What is network operation and maintenance work?

Internet operation and maintenance work is centered on service and rests on three basics: stability, security, and efficiency. The goal is for the company's Internet business to provide users with high-quality service 24/7.
For stability, operation and maintenance personnel harden the infrastructure, basic services, and online businesses that the company's Internet business depends on. They conduct daily inspections to find hidden dangers in the service, optimize the overall architecture to shield against common operating failures, and use multi-data-center access to improve the disaster recovery capability of the business.

Through monitoring, log analysis, and other technical means, they discover and respond to service failures promptly and reduce service interruption time, so that the company's Internet business meets its expected availability requirements and provides users with continuous, stable service.
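To make "expected availability requirements" concrete, here is a minimal sketch that converts an availability target into a downtime budget and back. The function names and the 30-day period are assumptions chosen for illustration, not something the original text prescribes.

```python
def downtime_budget_minutes(availability: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period for a given availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


def measured_availability(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability actually achieved, given the observed downtime in the period."""
    total_minutes = period_days * 24 * 60
    return 1.0 - downtime_minutes / total_minutes
```

For example, a 99.9% target over 30 days leaves a budget of roughly 43 minutes of interruption, which is why shortening detection and response time matters so much.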

In terms of security, operation and maintenance personnel need to pay attention to all aspects involved in business operation to ensure that users can access online services safely and completely.

This ranges from network boundary division, ACL management, traffic analysis, and DDoS defense, to vulnerability scanning and patching of operating systems and open source software, to XSS and SQL injection protection for application services. It also ranges from security process review, white-box and black-box code scanning, and permission audits, to intrusion detection and business risk control.

Operation and maintenance personnel need to ensure that the company's Internet business runs in a safe and controllable state, protect the security of the company's business data and users' private data, and be able to resist various malicious attacks.

On the premise of business stability and security, the business must also run efficiently and the company must be able to deliver quickly. Operation and maintenance work needs to optimize every aspect of the business; for example, IO optimization improves database performance, and image compression reduces bandwidth usage. The aim is for the Internet services provided to deliver maximum user value and experience with relatively small resource investment.

At the same time, it is also necessary to improve the efficiency of internal product release and delivery through various tool platforms, and improve the work efficiency related to operation and maintenance within the company.

Operation and maintenance job classification
Operation and maintenance has many directions. As business scale grows, the more mature an Internet company is, the finer its division of operation and maintenance positions.

Many of today's large Internet companies had only system operation and maintenance in their initial stage, and gradually subdivided the work as requirements for service scale and service quality grew.

In general, the work classification (see Figure 1-1) and responsibilities of the operation and maintenance team are as follows.
System operation and maintenance
System operation and maintenance is responsible for the construction of IDC, network, CDN and basic services (LVS, NTP, DNS); responsible for asset management, server selection, delivery and maintenance. The detailed job responsibilities are as follows:

1. IDC data center construction
Collect business requirements and estimate the future scale of the data center. Evaluate and select data centers in terms of backbone network distribution, data center buildings, Internet access, network attack defense capability, expansion capability, space reservation, external dedicated-line capability, on-site service support capability, and other aspects. Responsible for the construction and on-site maintenance of the data center.

2. Network construction
Design and plan the production network architecture, which includes: data center network architecture, transmission network architecture, CDN network architecture, etc., as well as daily operation and maintenance work such as network optimization.

3. LVS load balancing and SNAT construction
LVS is the traffic entry point of the entire site architecture. Build load balancing clusters according to network scale and business requirements.

Complete the connection between the network and business servers, provide high-performance, high-availability load scheduling capabilities, and unified network-layer attack defense capabilities.

SNAT centrally provides the data center's outbound public network access, and cluster deployment ensures the high performance and high availability of outbound services.
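The idea of load scheduling can be illustrated with a naive weighted round-robin picker. This is only a sketch of the general technique, not how LVS itself is implemented; the server names `rs1` and `rs2` and their weights are hypothetical.

```python
from itertools import cycle, islice


def weighted_round_robin(weights):
    """Return an endless iterator over server names, repeated by integer weight.

    `weights` maps a server name to a non-negative integer weight. This naive
    version emits each server `weight` times in a row before moving on.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    if not expanded:
        raise ValueError("at least one server needs a positive weight")
    return cycle(expanded)


# Hypothetical real servers: rs1 receives twice as many requests as rs2.
picker = weighted_round_robin({"rs1": 2, "rs2": 1})
first_six = list(islice(picker, 6))
```

A production scheduler would typically interleave the choices more smoothly and take server health into account; the sketch only shows how weights translate into a share of traffic.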

4. CDN planning and construction
CDN work is divided into two parts: third-party and self-built.

Establish third-party CDN selection and scheduling control; plan the construction layout of new CDN nodes according to business development trends; improve CDN business and monitoring to ensure the stable and efficient operation of the CDN system.

Analyze the file characteristics and volume of the channels being accelerated, and formulate the best acceleration strategy and resource matching; be responsible for daily CDN troubleshooting, such as hijacking of user traffic.

5. Server selection, delivery and maintenance
Responsible for server testing and selection, including basic tests and business tests of whole servers and components, with goals such as reducing whole-machine power consumption and increasing rack deployment density.

Combined with the understanding of the company's business, promote new hardware and new solutions to reduce the scale of server investment in the business. Responsible for the diagnosis and location of server hardware faults, the development and maintenance of server hardware monitoring and health check tools.

6. OS, kernel selection and OS-related maintenance work
Responsible for OS selection, customization, and kernel optimization for the overall platform, as well as patch updates and internal version releases; establish a basic YUM package management and distribution center and provide a repository of common package versions; follow up on day-to-day OS-related failures; provide targeted optimization support for different business types.

7. Asset management
Record and manage the basic physical information related to operation and maintenance, including resources such as data centers, networks, cabinets, servers, ACLs, and IPs. Formulate effective processes to ensure the accuracy of this information, and expose API interfaces that provide data support for automated operation and maintenance.
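As a rough sketch of how asset records might back an API for automation, the example below keeps servers in an in-memory inventory and answers a simple query. The class names, field names, and sample values are all hypothetical; a real system would sit on a database and an HTTP layer.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Server:
    """One server record; the fields are illustrative, not a fixed schema."""
    asset_id: str
    data_center: str
    cabinet: str
    ip: str


class AssetInventory:
    """In-memory store that rejects duplicate asset IDs to keep records accurate."""

    def __init__(self):
        self._by_id = {}

    def add(self, server: Server) -> None:
        if server.asset_id in self._by_id:
            raise ValueError(f"duplicate asset_id: {server.asset_id}")
        self._by_id[server.asset_id] = server

    def find_by_data_center(self, data_center: str) -> list:
        """Return plain dicts, the shape an API endpoint could serialize."""
        return [asdict(s) for s in self._by_id.values() if s.data_center == data_center]


inventory = AssetInventory()
inventory.add(Server("srv-001", "dc-a", "cab-12", "10.0.0.11"))
inventory.add(Server("srv-002", "dc-b", "cab-03", "10.1.0.7"))
servers_in_dc_a = inventory.find_by_data_center("dc-a")
```

Rejecting duplicates at write time is one small example of a process that protects the accuracy of the information.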

8. Basic service construction
The business relies heavily on basic services such as DNS, NTP, and SYSLOG. It is necessary to design a high-availability architecture to avoid single points and provide stable basic services.

Application operation and maintenance
Application operation and maintenance is responsible for online service changes, service status monitoring, service disaster recovery and data backup, routine inspection of services, and emergency handling of failures. Detailed job responsibilities are described below.

1. Design review
In the product development stage, participate in product design review, and put forward review opinions from the perspective of operation and maintenance, so that the service can meet the high availability requirements of operation and maintenance access.

2. Service management
Responsible for formulating upgrade, change, and rollback plans for online businesses, and for carrying out the changes. Understand the services you are responsible for, the relationships between them, and the resources they depend on. Discover defects in services, report them promptly, and push for solutions.

Formulate service stability indicators and access standards, and at the same time continuously improve and optimize the functions and efficiency of programs and systems, and improve the quality of operation. Improve monitoring content and improve alarm accuracy.

When an online service fails, respond immediately. Known failures are reported according to the process and handled according to the plan; for unknown failures, organize the relevant personnel to troubleshoot jointly.

3. Resource management
Manage the server assets of each service and keep track of server resource status, data center distribution, and dedicated-line and bandwidth status. Use server resources sensibly, allocating servers with different configurations according to the needs of different services so that server resources are fully used.

4. Routine inspection
Develop routine inspection points for services and continuously improve them. Inspect services regularly against the established inspection points, and follow up promptly on any problems found to eliminate hidden dangers.

5. Contingency plan management
Determine the monitoring required by the service, the threshold or critical point of the system index, and the handling plan after the situation occurs.

Establish and update service plan documents, and continuously supplement and improve according to daily failure conditions to improve the completeness of the plan. Able to formulate and review various contingency plans, and periodically conduct contingency plan drills to ensure the feasibility of the contingency plan.
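The link between a metric, its threshold, and the handling plan can be sketched as a small rule table. The metric names, thresholds, and plan names below are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanRule:
    """Ties one metric threshold to the name of the plan to execute."""
    metric: str
    threshold: float
    plan: str


def triggered_plans(rules, metrics):
    """Return the plans whose metric is at or above its threshold.

    Metrics missing from `metrics` never trigger a plan.
    """
    return [
        rule.plan
        for rule in rules
        if metric_value(metrics, rule.metric) >= rule.threshold
    ]


def metric_value(metrics, name):
    """Look up a metric, treating an absent one as negative infinity."""
    return metrics.get(name, float("-inf"))


# Hypothetical rules and a hypothetical snapshot of current metrics.
rules = [
    PlanRule("disk_used_percent", 90.0, "clean-logs-and-expand-disk"),
    PlanRule("error_rate_percent", 5.0, "switch-traffic-to-standby"),
]
plans = triggered_plans(rules, {"disk_used_percent": 93.5, "error_rate_percent": 0.4})
```

Keeping rules as data rather than scattered conditionals makes it easier to review them and to replay them during drills.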

6. Data backup
Formulate data backup strategy and carry out data backup work according to the specifications. Ensure the availability and integrity of data backups, and conduct data recovery tests regularly.
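One common way to check backup integrity is to compare a checksum recorded at backup time with one computed from the stored copy. The sketch below assumes SHA-256 and uses in-memory streams in place of real backup files; the function names are hypothetical. A matching checksum only shows the copy is intact, so periodic recovery tests are still needed.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def backup_matches(stream, expected_hex):
    """True if the stream's SHA-256 equals the checksum recorded at backup time."""
    return sha256_of_stream(stream) == expected_hex


# Stand-in data: in practice the streams would be opened backup files.
original = b"example backup payload"
recorded_checksum = sha256_of_stream(io.BytesIO(original))
intact = backup_matches(io.BytesIO(original), recorded_checksum)
corrupted = backup_matches(io.BytesIO(original + b"\x00"), recorded_checksum)
```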

Database operation and maintenance
Database operation and maintenance is responsible for data storage scheme design, database table design, index design and SQL optimization, and changes, monitoring, backup, and high-availability design for the database. Detailed job responsibilities are described below.

1. Design review
In the initial stage of product development, participate in the review of design plans, and propose data storage plans, database table designs, SQL development standards, index designs, and so on from the DBA's perspective, so that the service can meet the high availability and high performance requirements of database use.

2. Capacity planning
Know the capacity ceiling of the databases behind the services you are responsible for, understand the current bottleneck clearly, and optimize, split, or expand in good time, before the service reaches that ceiling.
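A very rough way to see how much time remains before a capacity ceiling is a linear projection from current usage and growth. The sketch assumes constant daily growth, which real workloads rarely follow, and the function name and units are illustrative.

```python
def days_until_full(used_gb: float, capacity_gb: float, daily_growth_gb: float):
    """Estimate days until capacity is reached, assuming constant linear growth.

    Returns None when usage is not growing, and 0.0 when already at capacity.
    """
    if daily_growth_gb <= 0:
        return None
    remaining_gb = capacity_gb - used_gb
    if remaining_gb <= 0:
        return 0.0
    return remaining_gb / daily_growth_gb
```

With 600 GB used out of 1000 GB and 4 GB of growth per day, the estimate is 100 days, which gives a deadline for optimizing, splitting, or expanding.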

3. Data backup and disaster recovery
Formulate data backup and disaster recovery strategies, and regularly complete data recovery tests to ensure the availability and integrity of data backup.

4. Database monitoring
Improve database survival and performance monitoring, and keep abreast of database operating status and failures.

5. Database security
Build a database account system and strictly control account permissions and the scope of access granted, to reduce the risk of misoperation and data leakage; strengthen the management of offline backup data to reduce the risk of data leakage.

6. Database high availability and performance optimization
Design switchover plans for single-point risks and failures in the database to reduce the impact of failures on database services. Continuously optimize overall database performance, including introducing new storage solutions, hardware optimization, file system optimization, database optimization, and SQL optimization, so that the database can support more business requests at the same or only slightly higher cost.

7. Automation system construction
Design and develop a database automation operation and maintenance system, including functions such as database deployment, automatic expansion, database and table sharding, permission management, backup and recovery, SQL review and release, and failover.

Operation and maintenance research and development
Operation and maintenance research and development is responsible for the design and development of general operation and maintenance platforms, such as asset management, the monitoring system, the operation and maintenance platform, and the data permission management system. It provides various APIs that operation and maintenance or R&D personnel can wrap into higher-level automated operation and maintenance systems. Detailed job responsibilities are described below.

1. Operation and maintenance platform
Record and manage services and the relationships between them, and help operation and maintenance personnel complete daily operations in an automated, process-driven way, including machine management, restart, rename, initialization, domain name management, traffic switching, and execution of failure plans.

2. Monitoring system
Responsible for the design and development of the monitoring system: collecting, alerting on, storing, analyzing, displaying, and mining the resource metrics of the company's servers and network devices and the operating metrics of online businesses. Continuously improve the timeliness, accuracy, and intelligence of alerts, and promote the rational allocation of the company's server resources.

3. Automated deployment system
Participate in the development of the deployment automation system, be responsible for the basic data and information it requires, and be responsible for permission management, API development, and Web-side development. Combined with cloud computing, develop and provide PaaS-related high-availability platforms, further improve service deployment speed and user experience, and improve resource utilization.

Operation and maintenance security
Operation and maintenance security is responsible for the security hardening of the network, systems, and business, conducts regular security scanning and penetration testing, develops security tools and systems, and handles emergency response to security incidents. Detailed job responsibilities are described below.

1. Establishment of security policies
Formulate practical and effective security policies based on the company's specific internal procedures.

2. Security training
Regularly provide employees with targeted security training and assessment, and establish a system of designated security owners throughout the company.

3. Risk assessment
Through black-box and white-box testing and inspection mechanisms, regularly produce overall risk assessment results for physical networks, servers, business applications, and user data.

4. Security construction
According to the risk assessment results, strengthen the weakest links, including designing security defense lines, deploying security devices, updating patches in a timely manner, preventing viruses, automatic source code scanning, and business product security consulting. In order to reduce the value of data that may be leaked, technical means and processes such as encryption, anonymization, obfuscation, and even regular deletion are used to achieve the goal.
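Two of the techniques mentioned above, anonymization and obfuscation, can be sketched as keyed hashing and masking. This is an illustrative example only: the function names and the sample key are hypothetical, and a real deployment would need proper key management and a review of which fields actually require protection.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed HMAC-SHA256 digest.

    The same input and key always give the same token, so records can still
    be joined, but the original value is not stored.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a value.

    Values no longer than `visible` are returned unchanged, so callers
    should choose `visible` with the field's typical length in mind.
    """
    if visible <= 0:
        return "*" * len(value)
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


# Hypothetical key and sample value, for demonstration only.
token = pseudonymize("user@example.com", b"demo-key-do-not-use")
masked_phone = mask("13812345678")
```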

5. Security compliance
To meet compliance requirements such as payment licenses, the security team serves as the external point of contact for security compliance.

6. Emergency response
Establish a security alerting system, collect security problems discovered by third parties through the security center, organize the relevant departments to fix the problems found, assess the scope of impact, and trace the root cause afterwards.

The development process of operation and maintenance work
The early operation and maintenance team had few people and mainly handled data center construction, basic network construction, server procurement, and server installation and delivery. It was rarely involved in changing, monitoring, or managing online services.

At this time, the operation and maintenance team is more of an infrastructure role, providing a simple and usable network environment and system environment.

As business products mature, service quality requirements rise. At this stage, the operation and maintenance team also takes on some server monitoring work, as well as layer 4 and layer 7 operation and maintenance work, such as LVS and Nginx, that is independent of business logic.

At this stage, service changes are mostly manual operations performed one by one, or some simple batch scripts appear. Monitoring focuses mainly on server status and resource usage, with little monitoring of application status, and mostly relies on open source systems such as Nagios and Cacti.

Due to the continuous increase in business scale and complexity, the operation and maintenance team will gradually be divided into two parts: application operation and maintenance and system operation and maintenance. Application operation and maintenance began to take over the online business, and gradually carried out the work of service monitoring and sorting, data backup and service change.

As their understanding of the services deepens, application operation and maintenance engineers become able to make some simple optimizations to the services. At the same time, to cope with the large number of service changes every day, they begin to write various operation and maintenance tools that make batch changes to specific services easy.

As business scale grows, infrastructure failures caused by insufficient capacity planning or weak resilience become more frequent, forcing operation and maintenance personnel to devote more energy to multi-data-center disaster recovery and contingency plan management.

Once the business reaches a certain scale, open source monitoring systems can no longer meet business needs in performance or functionality. With a large number of service changes and complex service relationships, the earlier approach of manual records and tool-based changes can no longer meet business needs in either efficiency or accuracy. On the security side, incidents large and small also occur, forcing the team to invest more energy in security defense. Gradually, the operation and maintenance team forms the five major work categories described above, each of which requires specialized talent.

At this time, system operation and maintenance is more focused on infrastructure construction and operation and maintenance, providing a stable and efficient network environment, and delivering resources such as servers to application operation and maintenance engineers. Application O&M focuses more on service operation status and efficiency.

Database operation and maintenance belongs to the refinement of application operation and maintenance work, and is more focused on automation, performance optimization and security defense in the database field. Operation and maintenance research and development and operation and maintenance security provide various platforms and tools to further improve the work efficiency of operation and maintenance engineers and make business services run more stably, efficiently and securely.

We divide the operation and maintenance development process into four stages, as shown in Figure 1-2.

Origin blog.csdn.net/Arvin_FH/article/details/131765002