NetEase Cloud's deep customization practice based on Kubernetes

In 2017, Kubernetes surpassed Mesos and Docker Swarm to become the most popular container orchestration technology. NetEase Cloud has been contributing code to the Kubernetes community since the second half of 2015, making it one of the earliest Kubernetes practitioners and contributors in China, and it has become an official Cloud Native Meetup organizer authorized by the CNCF (Cloud Native Computing Foundation). This article briefly introduces NetEase Cloud's deep customization practice based on Kubernetes.

1 Architecture of NetEase Cloud Container Service


NetEase Cloud Container Service is built on NetEase Cloud IaaS. To simplify user operations, Kubernetes is not exposed to users directly; instead, container services are provided through an upper business layer. An independent Netease-Controller was added to integrate with the NetEase IaaS and public platforms and to handle resource management and complex business requirements.


2 Kubernetes Public Cloud Practice


The community version of Kubernetes mainly targets the private cloud market. It has no concept of tenants, only the logical isolation of namespaces. Resources such as Nodes and PVs are shared globally across the cluster, service discovery and load balancing are also global, and enough Nodes must be provisioned in advance so that resource scheduling never fails; nor does a private cloud need to worry much about Docker's isolation security. A public cloud, by contrast, has a large number of users with diverse technical backgrounds and requires strong security isolation. NetEase Cloud has therefore done a lot of work to implement a public cloud on Kubernetes:

First, for multi-tenant security isolation, a dedicated IaaS team provides isolation of hosts, disks, and networks;

a custom namespace is created for each tenant;

native Kubernetes authentication is very simple: Nodes are globally shared, and every Node can access all Kubernetes resources. To run a public cloud, NetEase Cloud therefore implemented tenant-level security isolation, covering authentication, authorization, per-API statistics, flow control, and alarms;

computing, storage, and network resources in NetEase Cloud are allocated and reclaimed on demand in real time to keep resource utilization as high as possible. Because resources are allocated in real time, creation is generally slow, so NetEase Cloud made several global optimizations to the creation process, such as speeding up Node registration and selecting hosts based on the images they already hold;

native Kubernetes has no concept of a network IP resource, so NetEase Cloud added a Network resource type to represent it.
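
The Network type itself is not public; purely as an illustration, here is a minimal Go sketch of the shape such a custom resource might take (all group, kind, and field names are hypothetical; in the Kubernetes of that era it would have been registered as a ThirdPartyResource):

```go
package main

import "fmt"

// NetworkSpec is a hypothetical spec for a tenant network resource;
// the field names are illustrative, not NetEase's actual schema.
type NetworkSpec struct {
	TenantID   string // owner tenant of this network
	CIDR       string // private subnet, e.g. "10.0.1.0/24"
	FloatingIP string // optional public IP bound to the Pod
}

// Network mirrors the shape of a Kubernetes custom resource object.
type Network struct {
	APIVersion string
	Kind       string
	Name       string
	Spec       NetworkSpec
}

func main() {
	n := Network{
		APIVersion: "netease.com/v1", // hypothetical group/version
		Kind:       "Network",
		Name:       "tenant-a-net",
		Spec: NetworkSpec{
			TenantID:   "tenant-a",
			CIDR:       "10.0.1.0/24",
			FloatingIP: "203.0.113.10",
		},
	}
	fmt.Printf("%+v\n", n)
}
```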

3 NetEase Cloud Container Pod Network


Container networking mainly has the following schemes:

○ Entry level: Docker's basic network modes, host (shared) and bridge (NAT).
○ Advanced level: a self-built bridge with IP-per-Pod that can communicate across hosts, such as Flannel or Weave.
○ Professional level: multi-tenancy, ACLs, and high-performance scalability, such as Calico or GCE's advanced network mode.

NetEase Cloud Container Network

The network implementation of NetEase Cloud Container Service is similar to GCE's. Built on the underlying IaaS network, it connects to the NetEase Cloud network through Kubernetes. NetEase Cloud containers and hosts are complete network peers, and within a tenant they are fully interoperable.

IP management is not defined in Kubernetes, so when a container or Node restarts its IP may change. NetEase Cloud implements an IP-retention function through its own IP management, and a Pod supports dual networks, private and public.

In addition, NetEase Cloud manages the mapping between a Pod's private and public IPs, and implements a NetEase CNI plugin on the kubelet to manage NIC attach/detach and route configuration.
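
NetEase's CNI plugin is not open source, but a minimal skeleton of a CNI plugin in Go looks roughly like the following, assuming the github.com/containernetworking/cni library of that era (CNI 0.5-style API; the attach/detach and routing bodies are stubs):

```go
package main

import (
	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/types"
	"github.com/containernetworking/cni/pkg/types/current"
	"github.com/containernetworking/cni/pkg/version"
)

// cmdAdd is invoked via the CNI runtime when a Pod sandbox is created:
// here one would attach an IaaS NIC inside args.Netns, assign the
// Pod's retained IP, and install routes (hypothetical behavior).
func cmdAdd(args *skel.CmdArgs) error {
	result := &current.Result{} // would carry the assigned IPs/routes
	return types.PrintResult(result, current.ImplementedSpecVersion)
}

// cmdDel is invoked on sandbox teardown: detach the NIC and remove
// routes, while the IP stays reserved for the Pod.
func cmdDel(args *skel.CmdArgs) error {
	return nil
}

func main() {
	skel.PluginMain(cmdAdd, cmdDel, version.All)
}
```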

4 NetEase Cloud Stateful Containers


When it comes to container state, people often use the cattle-and-pets analogy. Cattle are stateless containers that can be replaced at any time, while a pet has an identity, and its data, state, and configuration may need to be persisted. The community has supported stateful containers via PetSet since version 1.3; in the latest version, 1.6, it is called StatefulSet.

Back before the community's stateful containers existed (in the version 1.0 era), NetEase Cloud had already developed its own implementation, StatefulPod:

unlike StatefulSet, it supports retaining both the container's system volume and its data volumes (PV/PVC). StatefulSet only supports external data volumes, but NetEase Cloud's StatefulPod ensures that as long as the user does not delete the container, the data on its system disk is preserved as well;

StatefulPod also ensures that the container's private and public IPs are retained (via the Network resource);

in Docker, all containers started on a Node share a unified container directory. NetEase Cloud extended Docker to support customizing a container's rootfs directory;

NetEase Cloud's stateful containers also support fault migration: resources such as disks and IPs can drift between Nodes.

5 NetEase Cloud Kubernetes Performance Optimization


Generally, when implementing a public cloud, we try to keep a single Kubernetes cluster per data center. But as the number of users grows, the cluster becomes larger and larger and many performance problems appear. Along with the community's development, NetEase Cloud has also run into problems the community may not have anticipated in its original design, such as:

○ Kube-scheduler schedules all Pods serially;
○ Kube-controller's delta queue has no priorities;
○ The ServiceAccounts controller has no local Secret cache;
○ Every Node repeatedly configures the iptables rules of all Services in the cluster;
○ Kubelet's SyncLoop re-GETs imagePullSecrets on every check;
○ The huge volume of Node heartbeat reports severely affects listing Nodes;
○ Kube-apiserver queries have no indexes.

Master-side scheduler optimization


To address these problems, NetEase Cloud made many performance optimizations. The first area is the master-side scheduler:

the public cloud scenario differs from the private cloud: containers belong to different tenants, and each tenant's resources are independent, so Pods can be partitioned and scheduled concurrently per tenant;

usually the scheduler traverses Nodes one by one during scheduling, but Nodes with no idle resources need not participate in the scheduling checks at all, so NetEase Cloud filters them out before the traversal;

the ordering of the predicate filters in the scheduling algorithm is optimized to improve scheduling efficiency;

in NetEase Cloud, scheduling is event-driven. For example, if scheduling fails because Node resources are insufficient, an event is sent to request more resources, and once the resources are granted another event drives the Pod to be scheduled again, with no time spent polling or waiting.
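
A simplified sketch of this event-driven retry, under an assumed structure rather than NetEase's actual code: a Pod whose scheduling failed is parked until a resource event arrives, instead of being retried on a timer:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	resourceReady := make(chan string) // fired when the IaaS grants capacity
	parked := map[string][]string{     // Pods parked per tenant after a failed schedule
		"tenant-a": {"web-0", "web-1"},
	}

	wg.Add(1)
	// Scheduler goroutine: wakes only on resource events, never polls.
	go func() {
		defer wg.Done()
		tenant := <-resourceReady
		for _, pod := range parked[tenant] {
			fmt.Printf("rescheduling %s for %s\n", pod, tenant)
		}
		delete(parked, tenant)
	}()

	// Simulate the IaaS layer answering the earlier resource request.
	resourceReady <- "tenant-a"
	wg.Wait()
}
```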


Optimization of the master-side controller


Kubernetes has many controllers, such as the Node controller and the Namespace controller; among them the Replication Controller is a core controller that ensures the specified number of Pod replicas is running in the cluster at any time. NetEase Cloud added an event priority mechanism, so that events enter a priority workqueue according to their type.
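
client-go's standard workqueue has no notion of priority, so one way to picture the mechanism (an assumed design, not the actual implementation) is a two-level queue that always drains urgent events, such as deletions, before routine syncs:

```go
package main

import "fmt"

// event is a controller work item; urgent marks event types the
// priority mechanism would front-run.
type event struct {
	key    string
	urgent bool
}

// priorityQueue drains high before low, approximating a
// priority-aware workqueue.
type priorityQueue struct {
	high, low chan event
}

func (q *priorityQueue) add(e event) {
	if e.urgent {
		q.high <- e
	} else {
		q.low <- e
	}
}

func (q *priorityQueue) get() event {
	// Prefer the high-priority channel if anything is waiting there.
	select {
	case e := <-q.high:
		return e
	default:
	}
	select {
	case e := <-q.high:
		return e
	case e := <-q.low:
		return e
	}
}

func main() {
	q := &priorityQueue{high: make(chan event, 16), low: make(chan event, 16)}
	q.add(event{key: "pod-sync", urgent: false})
	q.add(event{key: "pod-delete", urgent: true})
	fmt.Println(q.get().key) // pod-delete is processed first
	fmt.Println(q.get().key) // then pod-sync
}
```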

Node-side optimization


NetEase Cloud has many users, and users are completely isolated from one another. NetEase Cloud's kube-proxy groups Nodes by tenant:

○ Container networks between tenants are completely isolated, so no redundant forwarding rules are configured;
○ Each kube-proxy watches only its own tenant's Services and generates iptables rules just for those.

Kubelet reduces its request load on the master:

○ Delay the GET of imagePullSecrets until an image actually needs to be pulled, or add a local Secret cache;
○ Watch only changes to resources belonging to the tenant (relying on the tenant index newly added to the apiserver);
○ Reduce the number of connections the kubelet's watches hold to the master, including for Nodes.
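
In today's client-go, the watch-scoping idea can be expressed with a namespace-scoped informer factory; since each tenant gets its own namespace (see section 2), this is a reasonable stand-in for NetEase's tenant index, which was never upstreamed. A sketch (the kubeconfig path and tenant namespace are placeholders):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the tenant's namespace instead of the whole cluster,
	// the same load-reduction idea as watching per-tenant resources.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace("tenant-a"))

	factory.Core().V1().Services().Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				svc := obj.(*corev1.Service)
				fmt.Println("service added:", svc.Name)
			},
		})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}
```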

Optimization for Single Cluster Scaling


According to official data, Kubernetes 1.0 supported at most 100 Nodes and 3,000 Pods; in version 1.3 this rose to 2,000 Nodes and 60,000 Pods. The latest version, 1.6, released this year, already supports 5,000 Nodes and 150,000 Pods.


Architecturally, the kube-apiserver is the communication gateway of the entire cluster: it is essentially just a proxy, and Go's goroutines handle the web-serving side well, so the final performance bottleneck shows up in access to etcd.

To solve this problem, the first idea that comes to mind is sharding: splitting the data into multiple etcd clusters by resource type (Node/RS/Pod/Event), since etcd itself cannot scale horizontally in capacity or performance and has no performance-diagnosis tooling. Further etcd work includes:

○ Etcd2 tuning, such as optimizing snapshots and using SSDs;
○ Upgrading to etcd3: previously every change meant one request and one reply, while etcd3 batches, possibly pushing 1,000 changes in a single response, which greatly improves efficiency;
○ Replacing the Kubernetes backend storage with another KV store with better performance.
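
The etcd3 batching just described is visible directly in the clientv3 watch API: a single WatchResponse can carry many events. A minimal sketch against a local etcd (endpoint and key prefix assumed):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // local etcd assumed
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// One WatchResponse may batch many changes, unlike etcd2's
	// one-request-per-change model described above.
	for resp := range cli.Watch(context.Background(), "/registry/", clientv3.WithPrefix()) {
		fmt.Printf("one response, %d events\n", len(resp.Events))
		for _, ev := range resp.Events {
			fmt.Printf("  %s %s\n", ev.Type, ev.Kv.Key)
		}
	}
}
```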

Changes to the Node heartbeat reporting mode:

○ The Node heartbeat interval is lengthened;
○ Node heartbeats are not persisted;
○ The heartbeat is split out of the Node object, so heartbeats only need a List, not a Watch;
○ The NodeController switches from passively receiving heartbeats to actively probing Nodes.
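
What "active detection" might look like, as an assumed approach: rather than waiting for heartbeat writes to arrive through the apiserver, the controller probes each kubelet's healthz endpoint directly. Note that the kubelet binds its healthz port (10248) to localhost by default, so this sketch is purely illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeNode actively checks a kubelet's healthz endpoint instead of
// relying on persisted heartbeat writes (assumed approach).
func probeNode(addr string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + addr + ":10248/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	nodes := []string{"10.0.0.11", "10.0.0.12"} // example node addresses
	for _, n := range nodes {
		fmt.Printf("node %s healthy: %v\n", n, probeNode(n))
	}
}
```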

Other optimizations


Improved GC of images and containers: the existing GC considers only the disk's space usage, not inodes. Many users have large numbers of small files, so NetEase Cloud added a check on disk inode usage.
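
The inode check itself is a small statfs call, which reports both total and free inodes; a minimal Linux-only sketch (the Docker data root path is the upstream default, not necessarily NetEase's):

```go
//go:build linux

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// inodeUsage returns the fraction of inodes in use on the filesystem
// containing path, the figure an image/container GC would consult.
func inodeUsage(path string) (float64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return 0, err
	}
	if st.Files == 0 {
		return 0, nil
	}
	used := st.Files - st.Ffree
	return float64(used) / float64(st.Files), nil
}

func main() {
	u, err := inodeUsage("/var/lib/docker") // Docker's default data root
	if err != nil {
		panic(err)
	}
	fmt.Printf("inode usage: %.1f%%\n", u*100)
}
```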

Container monitoring statistics: cAdvisor was extended with network traffic, TCP connection, and disk-related statistics.

NodeController safety modes, with three user-selectable modes, Protected, Normal, and Advanced:

○ Protected: when the underlying IaaS is under planned, temporary maintenance, a Node going offline only raises alarms and triggers no migration, to avoid large-scale migration churn;
○ Normal: stateful containers are not migrated, while stateless ones are promptly deleted and rebuilt off the failed Node (migration occurs only for underlying cloud-disk and network failures);
○ Advanced: automatic migration and recovery, stateful or not.

There are also some issues that need to be paid attention to:

○ Graceful deletion of a Pod requires the container to handle the SIGTERM signal properly (see the sketch after this list);
○ With StatefulSet, two Pods may end up running at the same time when a kubelet hangs.
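
For the graceful-delete point above, the process inside the container (usually PID 1) must catch SIGTERM and shut down within the grace period. A minimal sketch of such a handler:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Kubernetes sends SIGTERM on Pod deletion and waits for the
	// termination grace period before sending SIGKILL.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	fmt.Println("serving...")
	<-sigs
	fmt.Println("SIGTERM received, draining connections and exiting")
	// ... flush state, close listeners ...
	os.Exit(0)
}
```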

Lou Chao is the technical leader of NetEase Cloud container orchestration. He previously took part in developing Taobao's distributed file system TFS and Alibaba Cloud's cache service. In 2015 he joined NetEase to work on NetEase Cloud Container Service. He led the container-orchestration design and R&D of NetEase Cloud Basic Service (Honeycomb) v1.0 and v2.0, and drives the continuous upgrading of NetEase Cloud's internal Kubernetes version.
