An Ops Expert on How to Land K8S in Production

Background

Since our company began its microservices transformation, thousands of Java microservice instances have been running across our online and offline environments. These instances are deployed on hundreds of cloud servers and virtual machines, and apart from a handful of high-traffic, business-critical applications, most of them are co-deployed on shared hosts.

To manage these instances we use a self-developed platform combined with open-source software. Through the platform's web pages we can package, deploy, start, stop, and roll back a specified version of an application at the click of a button, and this has worked well. A few pain points remain, however:

1. Resource isolation between instances is poor. During traffic peaks or failures, competition for CPU and memory between different instances on the same server becomes especially pronounced.

2. When an online application instance misbehaves, manual intervention is required, which prolongs the downtime.

3. After a release that touches a large number of servers, if a critical site function fails, we have to pick the right version and roll back each application one by one; the whole process takes a long time.

4. The DEV/QA environments release frequently, and each release stops the old version before starting the new one, which disrupts day-to-day testing.

With the business developing rapidly and the bar for system stability rising, we needed to solve these problems.

Technology research and selection

What first attracted us to containers were their isolation and horizontal-scaling properties; Docker's solid reputation, together with hands-on Docker experience from a project a few years earlier, made Docker our choice of container technology.

We still needed a container orchestration system to manage the Docker containers automatically. Broadly there were three options: Kubernetes (K8S), Swarm, and Mesos.

We were not deeply familiar with any of the three, and the project schedule was tight, so we could not study all of them thoroughly before choosing. Fortunately, GitHub's statistics gave us a quick read on the basic health of the three open-source projects. (Figure: GitHub statistics for the three projects.)

Based on those statistics, plus the halo of Google behind it, we settled on K8S as our container orchestration system in a very short time. K8S bills itself as an open-source system for automating the deployment, scaling, and management of containerized applications, and it addresses the following core problems:

1. Load balancing - an application runs as multiple identical containers, and a Service provides a single internal access point that load-balances requests across them.

2. Service discovery - a Service combined with kube-dns lets callers reach the right containers through a fixed Service name, so no separate service-discovery component is needed.

3. High availability - K8S checks the health of services and automatically tries to restart any container found to be abnormal, keeping the service running.

4. Rolling upgrades - during an upgrade, K8S replaces containers one by one according to plan, reducing the impact of the upgrade to a minimum.

5. Auto scaling - with an appropriate policy, K8S automatically adds containers to share the load when resource utilization is high, and reclaims containers when utilization drops.

6. Rapid deployment - once the orchestration manifests are written, a full environment can be deployed in a very short time.

7. Resource limits - each program is capped at a maximum amount of resources, so an accident or a burst of load cannot let one application starve the others, and basic services stay unaffected.
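As a minimal sketch of how several of these points look in practice (all names, image paths, and numbers below are illustrative, not taken from our production configuration), a Deployment plus a Service covers replication, resource limits, and a fixed name for discovery:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app            # hypothetical application name
spec:
  replicas: 3               # multiple identical containers (point 1)
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: harbor.example.com/app/demo-app:1.0.0   # hypothetical image
        resources:
          requests:          # resources guaranteed to the container
            cpu: 500m
            memory: 512Mi
          limits:            # hard cap, so one app cannot starve the others (point 7)
            cpu: "1"
            memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app             # fixed name resolvable via kube-dns (point 2)
spec:
  selector:
    app: demo-app
  ports:
  - port: 8080
    targetPort: 8080
```

Pointing callers at the Service name rather than at container IPs is what makes the load balancing and service discovery above work.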

After digging deeper into K8S, we roughly settled on the following components, systems, and related technologies:

1. Application deployment: K8S Deployment, HPA;

2. A small number of basic services: K8S DaemonSet, kube-dns;

3. External service exposure: K8S Ingress, Traefik, Service;

4. Networking: the Flannel plugin;

5. Monitoring and alerting: Heapster, InfluxDB, Grafana, Prometheus;

6. Management interfaces: kubectl, Dashboard, and our self-developed release management system;

7. Image building: Jenkins, Maven, Docker;

8. Image registry: Harbor;

9. Log collection: Filebeat, Kafka, ELK.

Difficulties and guiding principles

Online services must be migrated without interruption: each application's traffic is split over in proportions, and migration to the K8S cluster proceeds only under the premise of guaranteed stability.

The DEV environment goes online in batches; for the QA and production environments, each application's dependencies have to be considered when sequencing the migration.

Initially, only stateless applications are migrated.

The impact on R&D/QA is kept to a minimum (we try not to add workload for our already busy development and QA colleagues).

Landing process analysis

The application release process before and after Docker

(Figure: the release process before and after Docker.) Two significant changes stand out:

1. Before, the deployment artifact was a WAR or JAR package; now it is a Docker image (which contains the WAR or JAR).

2. Before, a release stopped the old version and then started the new one, so service was interrupted during the release; now the new version's containers are started first and the old version's containers are stopped only afterwards, so the application keeps serving traffic throughout the release (see the sketch below).
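In K8S, this start-new-then-stop-old behavior is the Deployment's rolling-update strategy. A minimal sketch (names and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # start at most one extra new-version pod at a time
      maxUnavailable: 0     # never stop an old pod before its replacement is ready
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: harbor.example.com/app/demo-app:1.0.1   # hypothetical new version
```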

System Architecture Migration

Our business applications fall into two kinds: RPC services (built on the Pigeon framework) that are called only internally, and REST APIs that serve external traffic. The REST APIs can be further divided into those that sit behind the API gateway and those that do not. RPC services and applications behind the API gateway have their own registries, so their migration steps are relatively simple: just start the corresponding application in the K8S cluster. Applications not behind the API gateway use the K8S Ingress plugin to provide the external service entry point, which requires some extra configuration. The system architecture is shown in the figure below; the ultimate goal is to move all the applications in the two boxes at the bottom of the figure into the K8S cluster. (Figure: overall system architecture.)

High availability of the cluster masters

Because of the constraints of running on a public cloud, we achieved master high availability by combining the cloud provider's SLB (Server Load Balancer). (Figure: master HA architecture.)

Exposing in-cluster applications to the outside

Since pod IP addresses inside the cluster change dynamically, we use Traefik + Ingress + Nginx + SLB to provide a single entry point for external traffic. Traefik routes HTTP requests to the appropriate application Service by domain name and path, Nginx performs the more complex operations such as rewrites, and the SLB provides high availability. (Figure: external access architecture.)
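A minimal Ingress sketch for the Traefik layer (the API version, host name, and service name are illustrative, not our production configuration):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app
  annotations:
    kubernetes.io/ingress.class: traefik   # handled by the Traefik controller
spec:
  rules:
  - host: demo.example.com                 # hypothetical domain
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: demo-app                 # routes by Service name, not pod IPs
            port:
              number: 8080
```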

Container initialization

To let the same image run unchanged in DEV, QA, production, and any other environment, we bake an initialization script into the image. When the container starts, the script reads the current environment from the Env environment variable, creates symlinks to the configuration files for that environment, sets up the log directory, and performs other initialization steps; it then forks a process that detects whether the application inside the container has finished starting normally (working together with the container's readiness probe), and finally calls the application's startup script.
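A hedged sketch of how this wires into a pod spec (the Env variable name comes from the text; the script path and ready-flag file are our own illustration, not the actual convention):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: demo-app
    image: harbor.example.com/app/demo-app:1.0.0   # hypothetical image
    command: ["/docker-init.sh"]      # hypothetical path of the baked-in init script
    env:
    - name: Env                       # the init script reads the target environment from here
      value: "qa"
    readinessProbe:
      exec:
        # hypothetical convention: the forked checker process creates this
        # file once it detects that the application has started normally
        command: ["cat", "/tmp/app-ready"]
      initialDelaySeconds: 30
      periodSeconds: 10
```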

(Figure: symlinks inside the container pointing to the configuration files for the current environment.)

(Figure: the log directory inside the container, provided through a symlink.)

K8S log collection

Our applications currently write their logs to files, and a single instance produces multiple log files, so we could not adopt the logging scheme officially recommended by K8S. And because the containers are stateless, we had to find another way to preserve the logs. The current approach is to mount a fixed directory on the Node into the container as a storage volume; when the container starts, the initialization script derives the log path from the application name plus the container IP (a volume sketch follows the list below). For viewing logs we provide three ways:

1. An SSH server is enabled inside the container, and the release management system implements WebSSH, so under normal circumstances you can enter the container from a web page and view logs with command-line tools. Because it is so convenient, this is the most popular option.

2. In some failure cases the container will not start at all, so there is no command line to enter; you can then find a download link for the log in the release management system and inspect the file on your own machine.

3. In addition, we run a Filebeat container on every Node that ships the logs collected on that Node to a Kafka cluster in real time; after processing, they are stored in an ES cluster for later retrieval.
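A minimal sketch of the hostPath mount described above (the paths are illustrative, not our actual layout):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: demo-app
    image: harbor.example.com/app/demo-app:1.0.0   # hypothetical image
    volumeMounts:
    - name: app-logs
      mountPath: /data/applogs   # init script writes logs under <app name>+<container IP> here
  volumes:
  - name: app-logs
    hostPath:
      path: /data/applogs        # fixed directory on the Node
      type: DirectoryOrCreate
```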

(Figure: the log directory structure on a Node.)

(Figure: the download path for logs shared from a Node.)

K8S monitoring

Monitoring is composed of Heapster + InfluxDB + Grafana; note that the monitoring data InfluxDB stores needs to be persisted. On Grafana we built dashboards along several dimensions: you can filter by Namespace, Node, or application name, and by CPU, memory, network bandwidth, or disk usage, which makes troubleshooting and routine optimization much easier. (Of course, Prometheus is the better monitoring system, and bringing it online is already on our roadmap.)
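One hedged way to persist the InfluxDB data is a PersistentVolumeClaim mounted at InfluxDB's data directory (the size below is illustrative, not our production value):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: influxdb-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi   # illustrative size
```

The claim is then mounted into the InfluxDB pod (at /var/lib/influxdb for InfluxDB 1.x) so the time series survive pod restarts.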

(Figure: the monitoring dashboard.)

(Figure: the monitoring menu.)

(Figure: a monitoring chart for a single application.)

The Harbor image registry

Our Harbor deployment currently uses a one-master, multiple-replica structure. The master registry and the Jenkins packaging jobs sit on the same network; once an image is uploaded to the master registry, it is automatically synchronized to the replica registries, one on the online network and another on the offline network. (Figure: Harbor replication topology.)

The image tree

Our plan is to build a tree of images: every application image is built from a base image in this tree, and each application picks the base image closest to its needs and adds only its own application-specific pieces. Thanks to this image tree, more than 95% of our applications need no Dockerfile kept in GitLab; a Dockerfile can be generated automatically from variables at packaging time, for example:

(Figure: an application Dockerfile generated automatically by the packaging script.)
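Since the original figure is an image, here is a hedged sketch of what such a generated Dockerfile might look like (the base image name, paths, and artifact names are our own illustration, not the actual generated output):

```dockerfile
# Generated by the packaging job; values below would be filled in from build variables.
FROM harbor.example.com/base/java8-tomcat8:latest   # closest base image in the tree (hypothetical)
COPY demo-app.war /opt/tomcat/webapps/ROOT.war      # artifact produced by the Maven build
COPY docker-init.sh /docker-init.sh                 # the environment-aware init script
ENTRYPOINT ["/docker-init.sh"]                      # init script ends by calling the app's startup script
```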

(Figure: the image tree.)

Current state

Containerization: the DEV/QA environments are fully on Docker, and roughly 98% of production applications have been containerized.

System recovery: when an application OOMs or otherwise crashes, the system automatically brings up a new pod to replace the failed one. Advanced health checks are not yet enabled (they need cooperation from the applications themselves).

Elastic scaling: elastic scaling is enabled for all critical applications, and the effect observed during traffic peaks has been good (see the sketch after this list).

Rolling release: a new application version can be deployed in batches at a specified ratio; after one batch updates successfully, a batch of old containers is destroyed, and the process rolls on in sequence.

Fast rollback: currently only single-application fast rollback is supported; if transaction-level (multi-application) rollback is needed later, K8S's rollout facility (for example, kubectl rollout undo) makes it easy to implement.
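A hedged sketch of the elastic scaling mentioned above, as a HorizontalPodAutoscaler (the CPU target and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70   # add pods above 70% average CPU, reclaim below
```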

Pitfalls we stepped in, and some recommendations

1. Use CentOS 7.x as the underlying operating system; it makes life comparatively easy.

2. Alibaba Cloud classic-network ECS instances cannot reach container IPs; you need to migrate to a VPC environment. Other public clouds are similar: the key is being able to add your own routes.

3. If you do application-level monitoring, be aware that memory, load average, and similar figures collected from inside a container describe the underlying host OS, not the container; getting container-level values for these metrics may require a dedicated container monitoring system.

4. Pay attention to the container's ulimit settings: they are not isolated from the host, and if set too small you will run into some strange problems.

5. Inside a container, the root user may not be able to see the owners of processes created by other users with the netstat command; older scripts may trip over this.

6. If you need to reach a specific container directly on a specific port, a headless service is very handy (see the sketch after this list).

7. ZooKeeper limits connections from a single IP to 60 by default; if this parameter (maxClientCnxns) is not raised, applications may hit the limit after migrating to K8S, since many containers' connections can appear to come from the same Node IP.

8. When migrating a production application to K8S, allocate a generous number of containers first to make sure all the traffic can be absorbed, then let the elastic scaling feature reclaim the surplus containers based on monitoring.

9. If you want to know in advance how an application will perform once deployed on the K8S cluster, a load test is essential.
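The headless service from point 6, as a minimal sketch: setting clusterIP: None makes the Service name resolve directly to the individual pod IPs, so a specific pod can be addressed on its port without going through a virtual IP (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app-headless
spec:
  clusterIP: None          # headless: DNS returns the pod IPs instead of a virtual IP
  selector:
    app: demo-app
  ports:
  - port: 8080
    targetPort: 8080
```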

Source: blog.csdn.net/weixin_39891030/article/details/86510820