How can the K8s platform deal with Pod pre-authorization issues

Preface

TKEx-CSIG is an internal cloud container service platform built on Tencent's public cloud container services TKE and EKS. It provides a cloud-native platform for migrating the company's internal workloads to the container cloud. Its main characteristics are cloud-native compatibility, support for self-developed business, and open-source collaboration.

When businesses move to containers in the cloud, they run into problems. Some require the business to adapt itself to containers, and some require the platform to provide new capabilities. Among the latter is a category of problems that already have solutions in the CVM scenario but whose solutions do not carry over to Kubernetes because operations work differently there. Pod pre-authorization is one example. We want to solve this type of problem in a cloud-native way and offer it as a platform capability, so that every user can easily deploy and manage their business on the platform.

Background

How do you pre-authorize new machines when deploying a new service or scaling out? This problem will be familiar to most readers. For security reasons, important components and storage systems in the company usually restrict where access requests may come from, for example CDB IP access authorization, or module authorization for OIDB and VASKEY command words. These systems either have their own authorization web page where users submit requests, or provide an authorization API that the operations platform can call. In addition, the routing system often needs accurate geographic information about an IP at registration time in order to provide nearest-access routing, which requires the IP to be registered in the CMDB beforehand.

When services were deployed on CVM/TVM, this was easy to handle: the virtual machine was obtained in advance, with its IP already assigned and registered in the CMDB. The business only had to submit authorization requests for that IP, deploy its programs, and add routing to go online once everything was in place. This process could be automated with the pipeline capabilities of the operations platform.

VM deployment is a step-by-step procedure that starts only after the machine is available. Kubernetes, in contrast, manages the whole Pod life cycle: creation, IP allocation, business container startup, and routing maintenance. All of this is automated by the control loops of several system controllers. Image-based deployment guarantees the scalability and consistency of business instances. Pods are destroyed and rebuilt routinely, and their IPs cannot be fixed.

Businesses often have several kinds of pre-authorization needs. A single authorization takes anywhere from seconds to several minutes, and most authorization APIs were not designed to handle high QPS, which complicates matters. We need a way to run authorization after the Pod IP is allocated and before the business container starts, block until it succeeds before continuing, and limit the pressure that Pod rebuilds put on the authorization APIs.

After several rounds of design and optimization, the TKEx-CSIG platform now offers authorization as an easy-to-use product capability, which makes Pod pre-authorization problems straightforward to handle.

Architecture and capability analysis

Architecture

[Figure: architecture of the authorization system]

The figure above shows the architecture of the authorization system. The core idea is to rely on the fact that init containers run before the business containers, and to use them for complex preprocessing before the business Pod starts. The Kubernetes documentation describes init containers as follows:

This page provides an overview of init containers: specialized containers that run before app containers in a Pod. Init containers can contain utilities or setup scripts not present in an app image

For a small-scale or single-business solution this can be very simple: inject an init container into the business Workload YAML and have it call the required authorization APIs. To turn this into a platform product, we need to consider the following points:

  • Easy to use and maintainable

    Business users must be able to work efficiently and manage things easily. Permissions are treated as a resource that the platform records and manages, which reduces how intrusive changes are for the business.

  • Frequency limiting and self-healing

    Permission APIs are often not designed for high QPS, so calls must be limited to protect the downstream services.

  • Permission convergence

    For security reasons, and because Pod destruction and rebuilding can change IPs, expired permissions should be reclaimed proactively.
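The frequency-limiting point can be illustrated with a small sketch. It is only one possible approach, assuming a fixed cap on concurrent calls is acceptable. The names Limiter, NewLimiter and runBurst are hypothetical and do not come from the platform. The sleep stands in for a slow authorization API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Limiter caps how many authorization calls may be in flight at once.
// It is a hypothetical helper, not part of the platform described here.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter returns a Limiter that allows at most n concurrent calls.
func NewLimiter(n int) *Limiter {
	return &Limiter{slots: make(chan struct{}, n)}
}

// Do blocks until a slot is free, runs call, then releases the slot.
func (l *Limiter) Do(call func() error) error {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	return call()
}

// runBurst sends `pods` simulated authorization requests through a Limiter
// of size `limit` and returns the highest concurrency the fake API observed.
func runBurst(pods, limit int) int64 {
	lim := NewLimiter(limit)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < pods; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = lim.Do(func() error {
				cur := atomic.AddInt64(&inFlight, 1)
				// Record the peak with a compare-and-swap loop.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				time.Sleep(2 * time.Millisecond) // stand-in for the slow API
				atomic.AddInt64(&inFlight, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 50 Pods rebuilt at once, but never more than 3 calls reach the API.
	fmt.Println("peak concurrent calls:", runBurst(50, 3))
}
```

A cap on concurrency is simpler than a true QPS limit and is enough to keep a burst of Pod rebuilds from reaching the downstream API all at once.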

Product Capability of Authorization Process

The business only needs to register the permission resources it requires in the platform's web console, configure a permission group, and associate the group with a Workload. The platform then automatically injects the init container configuration, passes the authorization configuration index and related information through ENV, and runs the authorization process when the Pod is created. The components involved in the authorization process are designed as follows:

  • init-action-client

    The init container is only a trigger. It does one thing, which is to send an HTTP request, and it stays unchanged. When features are iterated, the business YAML does not need to be modified, because the main logic lives in the backend.

  • init-action-server

    Deployed as a Deployment that can be scaled horizontally. It runs the preprocessing logic, such as pre-registering the Pod in the CMDB, initiates pipeline calls, starts the permission application process and polls for the result, and exposes the process information associated with each Pod so that businesses can check for themselves and administrators can locate problems. The backoff retry and circuit breaker logic described later is also implemented here.

  • PermissionCenter

    The platform's control component, located outside the cluster, is responsible for storing permission resources and for actually applying for them. It contains a permission resource center, which stores the permission details and parameters registered by the business for easy reuse, provides permission group management, and simplifies parameter passing in the authorization process. It uses a producer/consumer model to implement authorization API calls and result queries based on Pipeline.

Circuit breaker and backoff retry mechanism

Many things can go wrong during authorization: misconfigured permission parameters, degraded or unavailable authorization APIs, or interface errors and timeouts caused by the network. Authorization APIs are also often not designed for high QPS. We use timeouts with retries, a circuit breaker, and exponential backoff to achieve fault tolerance.

  • Retry after timeout

    Interface calls and asynchronous tasks have timeouts and a retry mechanism to deal with transient failures. In addition, if the init-action-client container exits abnormally it is restarted, and each restart is a new round of retries.

  • Circuit breaker

    A dedicated ConfigMap records the number of failed permission applications of Pods in the cluster. After 3 failures, no further applications are issued. A reset capability is exposed in the front end so that users and administrators can easily retry.

  • Exponential backoff

    The circuit breaker can block cases where authorization can never succeed, such as user configuration errors, but it cannot cope with long-lasting transient faults. For example, during a decommissioning period the authorization API backend may refuse service for anywhere from 10 minutes to several hours. A large number of Pod authorizations then trip the circuit breaker and cannot continue, and handling them manually is slow and cumbersome. We therefore added a jittered exponential backoff for each Pod and record the timestamp of the latest failure, allowing one attempt after a period of time. If the attempt succeeds, the backoff for that Pod is reset. If it fails, the timestamp is updated and the timer restarts. The parameters are as follows:

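// Per-Pod backoff state.
bk := &PodBreaker{
    NamespacePod:        namespacePod,
    LastRequestFailTime: time.Now(), // timestamp of the latest failure
    Backoff: wait.Backoff{ // presumably k8s.io/apimachinery/pkg/util/wait
        Duration: 2 * time.Minute, // initial interval
        Factor:   2.0,             // interval doubles after each step
        Jitter:   1.0,             // adds a random extra of up to 100% of the interval
        Steps:    5,               // number of steps in which the interval may grow
        Cap:      1 * time.Hour,   // upper bound on the base interval
    },
}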

Converging permissions with a Finalizer

Permission convergence is often overlooked, but it matters for security. Pods are destroyed and rebuilt routinely, and their IPs change dynamically. Over time this can leave a large number of stale permissions behind, or an authorized IP may be assigned to a Pod of another business, which creates a security risk. We built a Finalizer controller that reclaims permissions before a Pod is destroyed. Reclamation is idempotent and best-effort, because it depends on whether the authorizing party offers a way to revoke permissions. We take this into account when adding new permission types, for example automatic IP authorization for Tencent Cloud MySQL.

To patch Finalizers as rarely as possible and to avoid affecting Pods that do not use authorization, the controller only identifies Pods that carry the authorization init container when a Pod change event occurs and patches the Finalizer onto them. When these Pods are scaled in or destroyed, it reclaims their permissions and removes the Finalizer, after which the GC deletes the Pod.

kind: Pod
metadata:
  annotations:
~
  creationTimestamp: "2020-11-13T09:16:52Z"
  finalizers:
  - stke.io/podpermission-protection
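The Finalizer flow can be sketched as a single reconcile function over a simplified Pod model. This is an illustration under assumptions rather than the actual controller. The pod struct, the reconcile and reclaim names, and the use of the init container name "init-action-client" to identify authorized Pods are hypothetical. Only the finalizer name is taken from the manifest above.

```go
package main

import "fmt"

const (
	finalizerName     = "stke.io/podpermission-protection" // from the manifest above
	authInitContainer = "init-action-client"               // assumed marker for authorized Pods
)

// pod is a simplified stand-in for a Kubernetes Pod object.
type pod struct {
	Name           string
	InitContainers []string
	Finalizers     []string
	Deleting       bool // stands for a non-nil metadata.deletionTimestamp
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func without(list []string, s string) []string {
	var out []string
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

// reconcile handles one Pod change event. reclaim must be idempotent,
// because the same Pod may be reconciled more than once.
func reconcile(p *pod, reclaim func(podName string) error) error {
	if !contains(p.InitContainers, authInitContainer) {
		return nil // Pods that do not use authorization are left untouched
	}
	if !p.Deleting {
		if !contains(p.Finalizers, finalizerName) {
			p.Finalizers = append(p.Finalizers, finalizerName)
		}
		return nil
	}
	if !contains(p.Finalizers, finalizerName) {
		return nil // permissions already reclaimed
	}
	if err := reclaim(p.Name); err != nil {
		return err // keep the Finalizer so a later event retries
	}
	p.Finalizers = without(p.Finalizers, finalizerName)
	return nil
}

func main() {
	p := &pod{Name: "web-0", InitContainers: []string{authInitContainer}}
	reclaim := func(name string) error {
		fmt.Println("reclaiming permissions of", name)
		return nil
	}

	_ = reconcile(p, reclaim) // change event: the Finalizer is added
	fmt.Println("finalizers after create:", p.Finalizers)

	p.Deleting = true
	_ = reconcile(p, reclaim) // deletion: reclaim, then remove the Finalizer
	fmt.Println("finalizers after delete:", p.Finalizers)
}
```

In this sketch a failed reclaim keeps the Finalizer in place so that a later event retries. Since reclamation is best-effort, a real controller would probably bound those retries so that Pod deletion is not blocked indefinitely.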

Summary

This article addresses preprocessing, such as automated authorization, that must happen before the business process starts when a business runs on the container platform. Init containers perform the preprocessing before the business container starts, and the platform packages this as a product capability so that businesses can manage and apply for permission resources more easily. The circuit breaker and backoff retry mechanisms provide fault tolerance, and the Finalizer provides reclamation to prevent the proliferation of permissions.

Origin blog.51cto.com/14120339/2596844