Can you get the hang of K8s in 30 minutes? A must-read for newcomers!

If you understand Pod, you understand half of Kubernetes

Kubernetes can be understood as a standard API service that abstracts cloud resources such as compute, networking, and storage.

Almost every operation on Kubernetes, whether through the kubectl command-line tool, a UI, or a CD pipeline, is equivalent to calling its REST API.

Many people say Kubernetes is complicated. Besides its complex implementation architecture, another reason is that it ships more than 20 kinds of native resource APIs, so the learning curve is fairly steep. But don't worry: as long as we grasp the essence, a platform that provides container computing capabilities, we can sketch its outline and understand it quickly and easily.

In K8s, the most important and most basic resource is the Pod (literally, a "pea pod"). Let's use the most basic Nginx container below as an example and walk through the life of a Pod; once you understand it, you understand half of Kubernetes.

You don't need to study how to build a Kubernetes cluster from scratch. A quick option is OrbStack, which installs a local Docker & K8s environment in one click so you can start experimenting right away. First, write a YAML file like this.

```yaml
# nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
```

Then use kubectl to create the nginx Pod. Append -v8 to the command to enable verbose logging and see what kubectl actually does.

```shell
kubectl create -f nginx.yaml -v8
kubectl get pod -v8
```

In a visualization tool such as OpenLens or K9s, we can see that a Pod named nginx has been "born". From kubectl's verbose log we can also see the POST/GET requests and other details.

```
I1127 14:55:06.886901   83798 round_trippers.go:463] GET http://127.0.0.1:60649/77046cfbc5f80b52d9a1501954ee0672/api/v1/namespaces/default/pods?limit=500
I1127 14:55:06.886916   83798 round_trippers.go:469] Request Headers:
I1127 14:55:06.886921   83798 round_trippers.go:473]     User-Agent: kubectl...
I1127 14:55:07.166333   83798 round_trippers.go:580]     Cache-Control: no-cache, private
```

Notice that after creation, the actual Pod has more fields than we declared in the YAML. These extra fields record the Pod's experience since birth: it was scheduled by the scheduler onto an available node in the cluster, handed over to the kubelet to manage its lifecycle, assigned a network IP, mounted with ephemeral storage, had its image pulled by the container runtime to start the container, and had its running state continuously reconciled by controllers.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
status:
  phase: Running
  hostIP: 10.....
  podIP: 10.....
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2023-11-27T06:59:13Z'
    - type: Ready
      status: 'True'
      lastProbeTime: null
    .....
spec:
  volumes:
    - name: kube-api-access-72rkq
      ......
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
      resources: {}
      volumeMounts:
        - name: kube-api-access-72rkq
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: Always
  restartPolicy: Always
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  serviceAccountName: default
  serviceAccount: default
  securityContext: {}
  schedulerName: default-scheduler
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
```

To expand on this: running a container requires three major ingredients: compute, networking, and storage.

  1. Compute resources, i.e. CPU/Mem/GPU, are declared in the containers section of the spec. In this example no resource amounts are set: resources.requests and resources.limits are both empty, which means the container may consume the entire host. This is called the Best-Effort QoS class, and its scheduling priority is relatively low.

  2. In practice, you generally set reasonable requests/limits to reach the Burstable QoS class, or set requests and limits exactly equal to reach the Guaranteed class. After the scheduler assigns the Pod to a node, it is handed to an interface called CRI (Container Runtime Interface), and a CRI implementation, usually containerd, CRI-O, Podman, Docker, and so on, actually builds the container.

  3. For networking, notice the extra podIP field in the status. It comes from calling another low-level interface, CNI (Container Network Interface), whose implementation assigns the IP. The process is fairly involved and includes a so-called "pause" container; you can ignore those details when getting started.

  4. For storage, notice that a volume/volumeMounts pair is mounted automatically. Volumes are the Pod's extra storage: perhaps a configuration file or a key, or persistent storage provided by a cloud vendor, such as EBS or EFS disks, which involves the third low-level K8s interface, CSI (Container Storage Interface). The managed Kubernetes distributions from cloud vendors generally ship with built-in CSI implementations.
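The QoS classes from points 1 and 2 can be sketched like this. This is a hypothetical snippet; the name and resource values are illustrative:

```yaml
# Setting requests equal to limits yields the Guaranteed QoS class;
# requests < limits yields Burstable; omitting both (as in the nginx
# example above) yields Best-Effort.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-guaranteed
spec:
  containers:
    - name: web
      image: nginx
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
```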

With compute, networking, and storage in place, the Pod is running. To update the Pod, we can use the Update and Patch interfaces; however, Pod is an atomic resource in Kubernetes, so only a few fields, such as the container image and readinessGates, can be updated in place.

To end this Pod, call the Delete interface: the Pod enters the Terminating state, is eventually removed by the controller, its compute resources are reclaimed, and the container image files are eventually cleaned up by GC.

This is only the most basic flow. For the full Pod lifecycle, see https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/, especially the lifecycle activities closely tied to your business: the postStart hook runs in parallel with container startup, the preStop hook runs serially before termination, then SIGTERM is sent to ask the process to exit, and SIGKILL is sent after the grace period expires. These are very useful in practice.
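The lifecycle activities above can be sketched as a Pod spec fragment. This is a hypothetical example; the hook commands are illustrative:

```yaml
# postStart runs alongside container startup; preStop runs before
# SIGTERM is sent; after terminationGracePeriodSeconds expires,
# the kubelet sends SIGKILL.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: web
      image: nginx
      lifecycle:
        postStart:
          exec:
            command: ["sh", "-c", "echo started >> /tmp/lifecycle.log"]
        preStop:
          exec:
            command: ["sh", "-c", "nginx -s quit; sleep 5"]
```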

Compute, networking, and storage from the cluster's perspective

At this point we understand how compute, network, and storage resources are attached to their carrier, the Pod. So what are the underlying pools of compute, networking, and storage themselves called in Kubernetes?

The compute node of a Kubernetes cluster is called a Node. Unlike a traditional cloud platform's definition of a hardware machine, a Node is also an abstract resource: it can join or disappear at any time, and it is not directly bound to the containers it runs. Instead, it is associated with Pods through scheduling mechanisms such as labels/selectors and affinity.
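That indirect association can be sketched in a Pod spec fragment. This is a hypothetical example; the label keys and values are illustrative:

```yaml
# Pods are tied to Nodes only indirectly: labels on the Node,
# selectors and affinity rules on the Pod.
spec:
  nodeSelector:
    disktype: ssd                  # matches Nodes labeled disktype=ssd
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
```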

The block storage resource of a Kubernetes cluster is called a PersistentVolume. In real-world scenarios it is generally backed by a distributed file system. The object that automatically provisions a PersistentVolume according to a workload's disk request is called a StorageClass.
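A workload asks for disk through a PersistentVolumeClaim, and the named StorageClass provisions a matching PersistentVolume automatically. A minimal sketch, assuming a StorageClass named gp3 exists:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: gp3            # assumed StorageClass name
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```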

The Kubernetes network is divided into several layers: a Pod network that turns the entire cluster into one big intranet; a Service network that lets services in the cluster reach each other with built-in L4 load balancing; and an L7 layer of Ingress/Gateway API/Service Mesh for fine-grained traffic management. There is also the NetworkPolicy resource for controlling network access policies.

First, the Pod network. Although different CNIs implement networking in very different ways, the goal is the same: assign each Pod an IP and set up routes to the other Pods. AWS, for example, implements this in a very clever way: it assigns secondary IPs from the current VPC subnet directly to Pods, reusing DHCP and route tables, so communication between Pods works exactly like communication between existing EC2 nodes.

Next, the Service network, the internal L4 load balancer. Each K8s Service resource is assigned a virtual IP (the ClusterIP). Once the ClusterIP is allocated, the kube-proxy component is responsible for creating routes for this virtual IP and correcting them in real time as Pod endpoints change.
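A minimal Service sketch: it selects the nginx Pod from earlier by its label and gives it a stable virtual IP with L4 load balancing:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx                     # matches the Pod label from nginx.yaml
  ports:
    - port: 80
      targetPort: web
      protocol: TCP
```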


Take AWS EKS as an example. EKS uses iptables mode by default: kube-proxy writes each ClusterIP Service's IP into iptables on every node. You can use the iptables command to see the implementation details.

  • Because every change rewrites iptables, large clusters run into performance problems with K8s's built-in Service load balancing; switching to ipvs mode solves this.

  • There is also the Headless Service without a ClusterIP, which uses DNS for automatic endpoint discovery rather than conventional L4 load balancing.

  • To expose a service directly to the public network, there are NodePort/LoadBalancer type Services: kube-proxy listens on the NodePort on every node in the cluster, and iptables gets the DNAT rules corresponding to that NodePort.

  • LoadBalancer/NodePort Services also have a key field, externalTrafficPolicy. Simply put, it chooses between cross-node load balancing and local-node direct delivery. Cross-node balancing can cause the external LB's health checks to fail and loses the client source IP inside the cluster; these issues should be handled by the platform team and need not concern the business team. The business team should remember one principle: never use a NodePort Service.

```shell
# https://zhuanlan.zhihu.com/p/196393839
iptables -L -n
# Chain OUTPUT (policy ACCEPT)
# target     prot opt source               destination
# KUBE-PROXY-FIREWALL  all  --  anywhere             anywhere             ctstate NEW /* kubernetes load balancer firewall */
# KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
# KUBE-FIREWALL  all  --  anywhere             anywhere

iptables -L -t nat
# Chain KUBE-SVC-TCOU7JCQXEZGVUNU (1 references)
# target     prot opt source               destination
# KUBE-SEP-HI2KQBDGYW5OVKWN  all  --  anywhere             anywhere             /* kube-system/kube-dns:dns -> 10.52.xx.xx:53 */ statistic mode random probability 0.50000000000
# KUBE-SEP-XQT5TF2PMBOMEGDC  all  --  anywhere             anywhere             /* kube-system/kube-dns:dns -> 10.52.xx.xx:53 */
```
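The externalTrafficPolicy behavior described in the bullets above can be sketched as follows. This is a hypothetical Service; the name is illustrative:

```yaml
# externalTrafficPolicy: Local delivers traffic only to Pods on the
# receiving node and preserves the client source IP; the default
# Cluster mode balances across nodes but SNATs the client IP.
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
```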

Finally, the L7 service network, generally provided by an application-layer load balancer such as Nginx Ingress or Envoy. For the business, it is essentially the nginx conf split into YAML fragments. Ingress manages north-south traffic, while a Service Mesh manages all east-west traffic and can implement behaviors more complex than an nginx conf, such as traffic encryption, authentication, fault injection, circuit breaking and degradation, retries, and so on.
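The "nginx conf split into YAML" idea looks roughly like this. A hypothetical sketch: the hostname, ingress class, and backend Service name are all illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx
spec:
  ingressClassName: nginx          # assumed ingress controller
  rules:
    - host: example.com            # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx        # assumed backend Service name
                port:
                  number: 80
```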

Other native resources either nest Pods like matryoshka dolls or play auxiliary roles

The API design of Kubernetes follows the Single Responsibility Principle (SRP) closely. A Pod is just a simple pod holding containers; once deleted, it is gone.

But what if you want to run a service? No problem: under the single-responsibility principle, new features are implemented like matryoshka dolls. Kubernetes abstracts a thing called Deployment that wraps a thing called ReplicaSet, and the ReplicaSet wraps the Pods.

This way, the Deployment only handles rolling out and rotating ReplicaSets, and the RS only ensures that n Pods are running: if a Pod dies, a new one is created to replace it. This also reflects Erlang's classic "let it crash" thinking.
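The nesting above, sketched as a minimal Deployment: it manages ReplicaSets for rollouts, and each ReplicaSet keeps three Pod replicas alive:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:                        # the Pod template, same shape as nginx.yaml
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: web
          image: nginx
```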

One day you say you want to train an AI model to take on OpenAI. Kubernetes gives you abstractions called CronJob and Job: the CronJob wraps Jobs, and a Job wraps Pods. A Job is simply a one-off run, with settings for how many times to retry and how long to keep the Pods before cleanup. A CronJob is a natural distributed cron that simply generates Jobs on a schedule. CronJob and Job are widely used in big-data processing pipelines, CI/CD pipelines, and AI training. OpenAI also trains on a huge Kubernetes cluster of more than 8,000 nodes.
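A hypothetical CronJob sketch: every hour it creates a Job, which creates a Pod; the name and command are illustrative:

```yaml
# backoffLimit is the retry count; ttlSecondsAfterFinished controls
# how long finished Pods are kept before cleanup.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-task
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 3
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: busybox
              command: ["sh", "-c", "echo run once"]
```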

One day you want to run a database on Kubernetes. Congratulations, you have reached the most complex native resource: StatefulSet. A Deployment treats Pods as cattle: kill whichever you like. A StatefulSet treats Pods as pets: pets are not easy to care for, each Pod cannot be moved around casually, and updates can only proceed one Pod at a time, in order.
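A hypothetical StatefulSet sketch; the names, image, and headless Service are illustrative assumptions:

```yaml
# Each Pod gets a stable name (db-0, db-1, ...) and its own
# PersistentVolumeClaim; updates proceed one Pod at a time in
# ordinal order.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db                  # assumed headless Service name
  replicas: 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres          # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```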

Besides these Pod-nesting resources, the rest can be understood as auxiliary: the Ingress that brings traffic into the cluster, the Service for internal traffic load balancing, the distributed configuration ConfigMap and the distributed secret store Secret provided to each Pod. There are also some auxiliary resources for policy control and resource quotas, which we will not go through one by one here.
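The configuration pair can be sketched as follows. A hypothetical example; the names and values are illustrative, and a Secret works the same way with base64-encoded values:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_MODE: production
---
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx
      envFrom:
        - configMapRef:
            name: app-config       # injects APP_MODE as an env var
```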

Rethinking what Kubernetes is

At this point we have roughly figured out what Kubernetes means to its users. From the component view of Kubernetes itself, it consists of these pieces:

  • Each machine runs an agent called kubelet, which controls what that machine runs.

  • Each machine runs a kube-proxy to host network firewall rules, plus a CNI implementation to control Pod IP allocation and network routing within the cluster.

  • Optionally, a CSI implementation is installed to take over the creation and mounting of persistent storage disks.

  • All of these connect to the control plane composed of the API Server + Controller Manager + Scheduler. This control plane exposes a set of standard, extensible REST APIs, and all data is stored in the etcd metadata cluster.

With this, operating a distributed cluster no longer requires shell commands: every operation is an API call, and every resource becomes a data record in etcd.


Once you understand this, you understand that Kubernetes is essentially an encapsulation of existing technologies into a cloud-resource operating system. What actually does the work is still just processes on servers, and the real resource isolation is cgroups and namespaces, native to the Linux kernel.

Once you understand this, you will also understand why a Kubernetes failure does not affect already-running services, and why application performance tuning in a Kubernetes cluster still comes down to which EC2 instance type you use, which generation of EBS or EFS backs your PV storage, and how to optimize RTT latency and bandwidth between subnets within the VPC.

Kubernetes is everything a distributed cluster is, but Kubernetes is nothing.

The A/B side of Kubernetes

The biggest benefits brought by Kubernetes are standardization, elasticity, and scalability.

The REST API makes the management interface completely standardized.

Creating or correcting resource state is just writing an etcd record, which brings extreme elasticity: scaling out or in is a snap of the fingers.

Developing a custom resource to implement any function brings rich extensibility and has grown a huge Cloud Native ecosystem.

Standards, elasticity, extensibility: we have them all. It seems perfect, but everything has two sides, and these benefits of Kubernetes are also the source of its drawbacks.

The dark side of standardization is complexity. A standard must account for every situation, so it cannot be simple. A Pod spec alone has dozens of fields, and most people have probably never seen half of them. You can gauge how hard Kubernetes is to learn, and how hard self-hosted operations are, from the number of training and certification institutions. Even large companies had better resist the idea of building their own Kubernetes: every Kubernetes component has hundreds of startup parameters, and the people involved are either young engineers who do not realize how deep the pitfalls are, people who pretend to understand, or the people who truly understand, who work at cloud vendors.

The dark side of elasticity is volatility. Nodes in a cluster can appear and disappear at any time, which intrudes on the application architecture: stateful services running on Kubernetes must support dynamic discovery and configuration in their code. The days of static IPs are over. I have also noticed an effect Kubernetes brings, "log-loss anxiety". Hardly anyone on a VM worried about logs going uncollected, but Pods on Kubernetes drift around, and people keep asking: what about the last line of logs before the service died? What if we can't collect it?

The dark side of extensibility is a mixed bag of quality. Not everything in the CNCF Landscape and the open source community is excellent, and even some deeply flawed projects have become popular. For example, a colleague on a previous team read the ClickHouse operator code carefully; the code quality of that 1.5K-star project is arguably below the passing line. Four years ago, another colleague and I tried using an ES Operator and a Kafka Helm chart to run ES/Kafka, and their maturity at the time was far from production-ready. In addition, Helm, the most popular package manager for Kubernetes, is a typical case of "worse is better". After proposing the OAM idea, Helm author and philosopher Matt Butcher went off to work on the "next-generation cloud computing" WASM ecosystem.

Complexity is the price of growth .

Source: https://code2life.top/2023/11/24/0074-k8s-in-30-min/

Origin blog.csdn.net/LinkSLA/article/details/135379486