[Record of a production environment K8s from construction to operation and maintenance (4)] Design of Kubernetes Cluster

1. Introduction

The Kubernetes Cluster is undoubtedly the protagonist of our system; after all, our services will ultimately run on it. It is therefore very important to plan well before building a Kubernetes Cluster. This time we will talk about what to consider when designing one.

2. A simple Kubernetes Cluster

First, let's look at the composition of a simple Kubernetes Cluster.
[Figure: composition of a simple Kubernetes Cluster]
This Cluster has 3 Masters and 5 Nodes, both odd numbers. The reason for using odd numbers is not explained here; if you don't know it, you can search online. You can also see that many basic Kubernetes Resources run on the Masters and Nodes. Let's analyze them in detail below.

2-1. The composition of Master and Node

  • Master:
    Three Masters. As the control center of the entire cluster, Masters are essential and must be clustered in a production environment. As for the number, use an odd number of at least 3.

  • Node:
    Five Nodes. As the carriers of the Pods that provide our services, Nodes must not fail either, so they too must be clustered in a production environment. As for the number, it depends on your actual needs; there may be 3 or dozens.
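
The text leaves the odd-number rule for Masters unexplained. One common reason is majority voting: etcd, which usually runs on the Masters, only accepts writes while more than half of its members are reachable. The sketch below shows that arithmetic only; the function names are illustrative and not part of any Kubernetes or etcd API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a group with `members` voting members."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members may fail while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare odd and even member counts side by side.
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} "
              f"tolerated_failures={fault_tolerance(n)}")
```

Going from 3 to 4 members raises the quorum without raising the number of tolerated failures, which is why an even count adds cost but no extra safety. This reasoning applies to the voting Masters; worker Nodes do not take part in the vote.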

2-2. Kubernetes Resource

I divide the resources running on Master and Node into three categories:

  • Application Resource
    Application Resource mainly refers to the Pod Resources that provide our services and the Service Resources that expose them. They are scheduled onto the Nodes by controllers such as Deployment and DaemonSet, and provide our services externally.
    Application Resources should not run on the Master, because the Master's job is not to run our services. It manages the entire Kubernetes Cluster and keeps it operating normally. The Master's hardware configuration is usually modest, so it is not suitable as a carrier for our services. In layman's terms: the Master handles internal affairs, the Nodes handle external ones.

  • Control Resource
    The main function of Control Resource is to improve the reliability of Pod services through resources such as HPA, ResourceQuota and LimitRange, so that our services do not go down under high load, and so that a bug in our program cannot consume so much CPU, memory or other resources that the Node goes down. In addition, permissions can be managed through Resources such as ServiceAccount and Role, which prevents misuse of permissions and operation errors and indirectly protects our services. These Resources should not run on the Master either.

  • Monitor Resource
    Monitor Resource is easy to understand: it monitors and collects information about our Kubernetes Cluster, Nodes and Pods. Monitoring lets us notify the relevant personnel promptly when a system or service failure occurs. Information collection lets us analyze the data to determine the health of the system and predict future problems; analyzing business information can also create commercial value.
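
As an illustration of the Control Resource category, here is a minimal sketch that builds a ResourceQuota and a LimitRange as plain Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The namespace, object names and all numeric limits are hypothetical placeholders, not recommendations.

```python
import json

NAMESPACE = "demo"  # hypothetical namespace name

# Caps the total resources that all Pods in the namespace may claim.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "demo-quota", "namespace": NAMESPACE},
    "spec": {
        "hard": {
            "requests.cpu": "4",
            "requests.memory": "8Gi",
            "limits.cpu": "8",
            "limits.memory": "16Gi",
            "pods": "20",
        }
    },
}

# Gives every container default requests/limits and an upper bound,
# so a buggy program cannot use up a whole Node.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "demo-limits", "namespace": NAMESPACE},
    "spec": {
        "limits": [
            {
                "type": "Container",
                "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                "default": {"cpu": "500m", "memory": "256Mi"},
                "max": {"cpu": "2", "memory": "1Gi"},
            }
        ]
    },
}


def to_manifest_list(*objects: dict) -> dict:
    """Wrap several objects in one List so they can be saved as one file."""
    return {"apiVersion": "v1", "kind": "List", "items": list(objects)}


if __name__ == "__main__":
    print(json.dumps(to_manifest_list(resource_quota, limit_range), indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`; the values would have to be sized for the real workload.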

The above is just a simple structure diagram of a Kubernetes Cluster. So how do we design a system for the production environment? What must we consider when designing a Kubernetes Cluster? Let's talk about that next.

3. Kubernetes Cluster design elements

Start mainly from the non-functional requirements: sort out all of them, find a solution for each, and then reflect those solutions in the design.

Non-functional requirements analysis
Countermeasures for the requirements
Reflection in the system design

Let's talk about how to design for each non-functional requirement when designing the Kubernetes system.

  1. [Reliability] Ease of recovery and fault tolerance
    A robust system must have fault tolerance and resilience. When one part of the system fails, the overall service must not stop, and the failure point can be restored quickly through a recovery plan prepared in advance, without the system's users noticing any of this. The usual countermeasures for meeting reliability requirements are redundancy, backup and restoration, monitoring and alarm, and a 24-hour response system.
    Let us analyze one by one how to implement these countermeasures in a production Kubernetes Cluster.
  • Redundancy
    The composition diagram above shows intuitively that both the Masters and the Nodes are configured as clusters, and the Pods running on the Cluster are also configured redundantly, but these alone are not enough. We need to find every place where a single point of failure may occur and then make it redundant with a suitable technical solution: all network lines, network equipment such as switches, the load distribution devices that are usually overlooked, and so on. For any place or piece of equipment related to our system, we should consider whether it needs redundancy. If you use cloud services, the operator implements the redundancy of the infrastructure for you, which saves a lot of work. In addition, separating and decoupling our business services can also indirectly improve the reliability of the system to some extent.
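| Classification   | Redundant component                             |
|------------------|-------------------------------------------------|
| Server           | Master                                          |
|                  | Node                                            |
| Infrastructure   | Network lines                                   |
|                  | Network equipment (switches, load distribution) |
|                  | Storage devices                                 |
| Business service | Usually Pods in Kubernetes                      |
| Other            | All parts related to the system                 |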
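
To see why a single overlooked component matters, the following sketch applies the textbook availability formulas for redundant and chained components. It assumes failures are independent, which real infrastructure only approximates, and the availability figures are made up for illustration.

```python
def parallel_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one survivor is enough."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def serial_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for part in parts:
        result *= part
    return result


if __name__ == "__main__":
    # Hypothetical figures: each server is up 99% of the time.
    masters = parallel_availability(0.99, 3)
    nodes = parallel_availability(0.99, 5)
    # One non-redundant switch in front of everything, versus a pair.
    print(serial_availability(0.99, masters, nodes))
    print(serial_availability(parallel_availability(0.99, 2), masters, nodes))
```

With a single switch in the chain, the whole system cannot be more available than that switch, however many Masters and Nodes stand behind it.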

  • Backup and restoration
    Many failures are caused by human operations, such as accidental deletion of data or errors in released content. The cause of such failures is usually easy to locate, and restoring the data or rolling back the version is enough to recover. But this depends on foresight and preparation, namely backup and version management, which play a vital role in protecting the system. Usually you need a backup of the whole OS, of the database data, and of important business files. In a production Kubernetes Cluster, you need to back up etcd, which holds important information about the Cluster, as well as the Masters and Nodes. If you use the Kubernetes management tools we talked about earlier in the Control System chapter, you don't need to back up the Masters and Nodes themselves, because the tool records the complete information of your Cluster and can restore the entire Cluster when needed. In addition, there are the Docker images, Helm charts, and so on.
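| Classification   | Backup target                           |
|------------------|-----------------------------------------|
| Cluster          | Master                                  |
|                  | Node                                    |
|                  | etcd data                               |
| Business service | Docker image                            |
|                  | Helm chart                              |
|                  | Business data, business documents, etc. |
|                  | Log                                     |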
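
For the etcd item in the table, a backup is typically taken with `etcdctl snapshot save`. The sketch below only builds that command line with a timestamped file name and does not run it. The endpoint and certificate paths are assumptions (they follow a common kubeadm layout) and must be adapted to the actual cluster; the function name is illustrative.

```python
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def etcd_snapshot_command(
    snapshot_dir: str,
    endpoint: str = "https://127.0.0.1:2379",   # assumed local etcd endpoint
    pki_dir: str = "/etc/kubernetes/pki/etcd",  # assumed certificate directory
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the argv list for one timestamped etcd snapshot."""
    now = now or datetime.now(timezone.utc)
    target = f"{snapshot_dir.rstrip('/')}/etcd-{now:%Y%m%dT%H%M%SZ}.db"
    return [
        "etcdctl", "snapshot", "save", target,
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(etcd_snapshot_command("/var/backups/etcd")))
```

Older etcdctl builds may additionally need `ETCDCTL_API=3` set in the environment. Whatever schedule runs such a command, the snapshots should be copied off the Master so that they survive the loss of that machine.
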
  • Monitoring and alarm
    When the system fails, it is very important to detect the failure and notify the relevant personnel immediately, so that technicians can intervene in time and prevent the failure from spreading. We introduced the monitoring system in detail in Chapter 3, so no further explanation is given here.

  • 24-hour response system
    After a failure occurs, technical personnel must intervene and resolve it as soon as possible. For that we need a 24-hour standby operation and maintenance team that watches over our system at all times.

  2. [Performance requirements] Response time, throughput, resource utilization
    Response time and throughput are the main indicators for measuring a system. As the saying goes, "don't take on porcelain repair work without a diamond drill": when designing the system, we have to figure out the load it will face in the future, such as the required response time of a single request, the maximum access volume per unit time, the maximum processing volume per unit time, and so on. After sorting out these requirements, we use them to determine the number of servers, the server configuration, network bandwidth, storage capacity, the load distribution algorithm, and whether to use caching and other technologies. Only then is the system we build a healthy one that can meet its performance requirements.
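
One rough way to turn such requirements into a server or Pod count is Little's law: the average number of requests in flight equals the arrival rate times the average response time. The sketch below is a back-of-the-envelope estimate under that assumption; the parameter names, the per-replica concurrency and the utilization target are all hypothetical inputs that would come from load testing.

```python
import math


def required_replicas(
    peak_rps: float,
    avg_response_s: float,
    per_replica_concurrency: int,
    target_utilization: float = 0.7,
) -> int:
    """Estimate how many replicas are needed to serve the peak load."""
    if peak_rps < 0 or avg_response_s < 0:
        raise ValueError("load figures must not be negative")
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: requests in flight = arrival rate * time in system.
    in_flight = peak_rps * avg_response_s
    # Keep headroom by only planning to use part of each replica.
    usable_per_replica = per_replica_concurrency * target_utilization
    return max(1, math.ceil(in_flight / usable_per_replica))


if __name__ == "__main__":
    # 1000 requests/s at 250 ms each, 50 concurrent requests per replica,
    # planned at 50% utilization.
    print(required_replicas(1000, 0.25, 50, target_utilization=0.5))
```

The estimate ignores bursts, queueing and uneven load distribution, so it is a starting point for sizing rather than a guarantee.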

  3. [Security] Confidentiality, leak prevention, access control, and attack prevention
    Security requires the system to have an effective defense solution and access control strategy that prevent leakage of sensitive system information and improper access. When designing, I usually divide the considerations into external and internal; the countermeasures for the two are largely similar. The following table lists the usual security countermeasures.

| Classification | Countermeasure       | Description |
|----------------|----------------------|-------------|
| External       | Virus security tool  | Prevent viruses and attacks from outside |
|                | Authentication       | Prevent unauthorized access from outside |
|                | Network isolation    | Hide the sensitive parts |
| Internal       | Virus security tool  | Prevent viruses from inside |
|                | Access control       | Levels of authority, to avoid information leakage or misoperation due to excessive authority |
|                | Unique entry server  | A form of internal isolation: the other servers in the system can only be accessed through the entry server, which facilitates operation monitoring and authority management |
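
Inside a Kubernetes Cluster, two of the countermeasures in the table map onto built-in resources: network isolation onto NetworkPolicy and access control onto RBAC Roles. The sketch below builds one example of each as Python dictionaries. The namespace and object names are hypothetical, and a NetworkPolicy only takes effect when the cluster's network plugin enforces it.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """NetworkPolicy that selects every Pod and allows no inbound traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # An empty podSelector matches all Pods in the namespace; listing
        # "Ingress" without any ingress rules denies all inbound traffic.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def read_only_pod_role(namespace: str) -> dict:
    """Role that can only read Pods, as an example of minimal authority."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


if __name__ == "__main__":
    for manifest in (default_deny_ingress("demo"), read_only_pod_role("demo")):
        print(json.dumps(manifest, indent=2))
```

Starting from deny-all and read-only, and then adding only the access each service needs, follows the same idea as the "levels of authority" row in the table.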

  4. [Scalability] Horizontal expansion and vertical expansion
    These are usually easy to understand: expanding the number of servers, expanding CPU, memory and storage, expanding services, and so on. The difference in a Kubernetes Cluster is that all expansion should be completed automatically, without manual intervention. Only then are the advantages of Kubernetes realized. When designing a Kubernetes Cluster, we must consider at least the expansion requirements in the following table.

| Classification     | Expanded content             | Automatic / non-automatic |
|--------------------|------------------------------|---------------------------|
| Cluster Autoscaler | Horizontal cluster expansion | Automatic (requires third-party support such as cloud services) |
|                    | Vertical cluster expansion   | Automatic (requires third-party support such as cloud services) |
| Pod Autoscaler     | Horizontal expansion (HPA)   | Automatic |
|                    | Vertical expansion (VPA)     | Automatic (but not recommended) |
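
For the HPA row, the Kubernetes documentation describes the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a small tolerance of 1. The sketch below is a simplified version of that rule with min/max clamping. The real controller also handles missing metrics, Pods that are not ready, and stabilization windows, none of which are modeled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale in proportion to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never leave the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 Pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

The `max_replicas` bound matters for design: Pods can only scale out as far as the Nodes have room, which is where the Cluster Autoscaler rows of the table come in.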

  5. [Maintainability] Modularity, reusability, and ease of analysis
    After the system officially launches, it enters the operation and maintenance phase, which lasts for the rest of its life cycle. An easy-to-maintain system reduces both the workload of operation and maintenance personnel and the probability of misoperation, which indirectly improves the reliability of the system. Because the composition of a Kubernetes system is basically fixed, always taking the form of Masters and Nodes, the main considerations for convenient future maintenance are reusability and ease of analysis. The following table lists countermeasures for achieving both.

| Classification  | Design countermeasure | Description |
|-----------------|-----------------------|-------------|
| Reusable        | Set up a test environment | An environment identical to production is equivalent to reusing the production environment, and is an important guarantee for reducing production failures |
|                 | Use third-party management tools | Tools such as Pivotal, mentioned in the previous chapters, can help us complete repetitive Kubernetes Cluster management tasks |
|                 | Rich manual | Write repeated or similar problems and failures into the manual so that they can be used directly in the future |
| Easy to analyze | Detailed and easy-to-understand design information | System framework diagrams and other design materials are important reference materials for future operation and maintenance |
|                 | Collect valid log information | Collect effective and complete log information through a log collection system for analysis |

4. Summary

No matter what kind of system you are designing, you need to sort out and analyze the non-functional requirements before designing, find a solution for each of them, and then reflect these solutions in your design. A system designed this way is sure to be a healthy one. Of course, sorting out and analyzing non-functional requirements and finding the corresponding solutions is not easy. Designers need both technical knowledge and sufficient design experience. Only through long-term project experience and accumulated technical knowledge can one become a real system architect.

Author: rm * Group
Date: 12 September 2020


Origin blog.csdn.net/ashdfoiuasdhfoief/article/details/108550032