Slime 2022 Outlook: Packing Istio's Complexity into a Smart Black Box

1. Introduction

The NetEase Shufan Qingzhou microservice team adopted Istio as its service mesh very early. Along the way, we developed many peripheral modules to make Istio easier to use for ourselves and for our customers within the NetEase Group. To give back to the community, we systematically organized these modules, selected some of them, and open-sourced the Slime project in early 2021.

Slime aims to solve pain points in using Istio and make its advanced features easier to adopt. It has always adhered to the principle of integrating with Istio seamlessly, without any customization, which greatly lowers the barrier to entry.

Over the past year, Slime has evolved considerably in architecture, functionality, and engineering. In December 2021, Slime was invited to join the Istio ecosystem and officially became part of Istio Ecosystem - Integrations.

This article introduces Slime's main capabilities at this stage, focusing on the lazy loading and intelligent rate limiting modules, and looks ahead to future development. We hope it helps more ServiceMeshers understand Slime, participate in it, and use service meshes more easily.

2. Lazy Loading

2.1 Background

The performance problem caused by Istio's full configuration push is one that every Istio user has to face.

As is well known, early Istio configuration distribution was crude: everything was pushed in full. This means that as the business scale in the mesh keeps growing, the control plane must deliver more and more content and the data plane must receive more and more, which inevitably causes performance problems. A cluster often hosts multiple business systems, and an application in one business system is made aware of the configuration of every other system, so a large amount of redundant configuration is pushed. As shown on the left side of the figure below, A is only related to B, yet it also receives the configuration of C and D. The other problem is push frequency: whenever any service changes, the control plane notifies every SidecarProxy on the data plane.

Istio 1.1 therefore introduced a solution: the Sidecar CRD (called SidecarScope in this article, to distinguish it from the Envoy-based SidecarProxy). Users describe in a SidecarScope which services a SidecarProxy needs to care about, so that configuration for irrelevant services is not delivered to it. The result is shown on the right side of the figure above: once a service is configured with a SidecarScope, it receives a simplified configuration that no longer contains irrelevant entries. Configuration changes of unrelated services are also no longer pushed to it, which reduces the push frequency.

A typical SidecarScope looks like the following. It indicates that the matching SidecarProxy may see all services in the prod1 and istio-system namespaces, plus the configuration of the ratings service in the prod2 namespace.

apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: prod1
spec:
  egress:
  - hosts:
    - "prod1/*"
    - "prod2/ratings.prod2.svc.cluster.local"
    - istio-system/*

The SidecarScope provided by Istio does solve the full-distribution problem, and the issue may seem settled. In practice, however, managing SidecarScope by hand is difficult: the services each workload depends on are hard to collect and organize, and a wrong configuration breaks calls. This is a serious obstacle to rolling out service meshes at scale, so we badly wanted a smarter way to manage SidecarScope.

2.2 Value

The lazy loading module is designed to solve the problems above. It hooks into the service mesh automatically, supports all of Istio's traffic governance capabilities during forwarding, and introduces no performance problems. It lets business teams benefit from SidecarScope without managing it directly.

We divide service dependencies into dynamic dependencies, which keep changing at runtime, and static dependencies, which business owners know in advance. For dynamic dependencies, we designed a mechanism that learns them in real time and updates SidecarScope accordingly; for static dependencies, we focused on simplifying the configuration rules to make them more user-friendly.

2.3 Dynamic configuration update

Lazy loading consists of two components: Global-sidecar and the Lazyload Controller.

  • Global-sidecar: a fallback sidecar component. When a source service cannot resolve its target, the request is forwarded to the Global-sidecar, which forwards it to the real target and generates the corresponding service dependency metric.
  • Lazyload Controller: the control component. It processes the metrics reported by the Global-sidecar, modifies the source service's SidecarScope, and adds the corresponding configuration to it.

The simplified dynamic configuration update process is as follows:

  • Service A's SidecarScope is initially blank and contains no configuration for service B
  • Service A accesses service B for the first time. Since A's SidecarProxy has no configuration for B, the request is routed to the fallback component Global-sidecar
  • The Global-sidecar holds the full service configuration, including service B's, so it forwards the request to service B; the first request succeeds and the metric (A->B) is generated
  • The Lazyload Controller observes the metric (A->B) and modifies SidecarScope A, adding service B's configuration to it (a sketch of the result follows this list)
  • When service A accesses service B again, A's SidecarProxy already has B's configuration, so the request goes directly to B
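
For illustration, after the first access the SidecarScope generated for service A might look roughly like the following (a minimal sketch; the service names and namespaces here are assumptions, modeled on the real example in section 2.4):

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: a
  namespace: default
spec:
  egress:
  - hosts:
    - '*/b.default.svc.cluster.local' # added after the metric (A->B) was observed
    - istio-system/*   # namespace where Istio is deployed
    - mesh-operator/*  # namespace where lazy loading is deployed
  workloadSelector:
    labels:
      app: a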

The detailed flow chart is as follows

ServiceFence is a CRD introduced by lazy loading. It stores the metrics related to a service and drives the updates to its SidecarScope. For a detailed introduction, please refer to Lazy Loading Tutorial - Architecture

2.4 Static configuration enhancements

In the early days of lazy loading, we focused on obtaining dynamic service dependencies, which seemed smart and worry-free. However, in practice, we found that many users, for security reasons, often want to configure some rules directly into SidecarScope, that is, configure static service dependencies. So, we began to think about how to flexibly configure static dependencies.

We therefore designed a set of practical static rules and expressed them in ServiceFence (yes, the same CRD that stores metrics for dynamic configuration updates; it takes on a new role here). The Lazyload Controller then updates the corresponding SidecarScope according to these rules.

Now we provide three types of static configuration rules:

  • dependency on all services in certain namespaces
  • dependency on all services with certain labels
  • dependency on a specific service

Here is an example of label matching. Suppose the application is deployed as shown in the following figure.

Now enable lazy loading for the productpage service, whose known dependency rules are:

  • all services with the label app: details
  • all services with the labels app: reviews and version: v2

Then the corresponding ServiceFence is written as follows

---
apiVersion: microservice.slime.io/v1alpha1
kind: ServiceFence
metadata:
  name: productpage
  namespace: default
spec:
  enable: true
  labelSelector: # Match service label, multiple selectors are 'or' relationship
    - selector:
        app: details
    - selector: # labels in one selector are 'and' relationship
        app: reviews
        version: v2

The Lazyload Controller populates the SidecarScope according to the actual matching result. The resulting SidecarScope is shown below; all the green services in the figure above are selected.

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: productpage
  namespace: default
spec:
  egress:
  - hosts:
    - '*/details.ns1.svc.cluster.local'
    - '*/details.ns2.svc.cluster.local'
    - '*/details.ns3.svc.cluster.local'
    - '*/reviews.ns2.svc.cluster.local'
    - istio-system/* # namespace where Istio is deployed
    - mesh-operator/* # namespace where lazyload is deployed
  workloadSelector:
    labels:
      app: productpage

With this, we no longer have to double-check before every release that all service dependencies have been filled in, let alone modify SidecarScope by hand when dependencies change. Configure two or three ServiceFence rules and you are done.

For a detailed introduction, please refer to Lazy Loading Tutorial - Static Service Dependency Addition

2.5 Metric types

In Section 2.3, we explained that metrics are fundamental to dynamic dependency generation. Currently, there are two Metric types supported by lazy loading: Prometheus and AccessLog.

In Prometheus mode, metrics are generated by the SidecarProxy of each business application, and the Lazyload Controller queries Prometheus for them. This mode requires the service mesh to be integrated with Prometheus.
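
For reference, in Prometheus mode the metric source is configured in the module's installation values, and the configuration takes roughly the following shape (a sketch from the module's installation samples; the field names, the Prometheus address, and the query shown here are assumptions and may not match the current schema):

metric:
  prometheus:
    address: http://prometheus.istio-system:9090  # assumed in-cluster Prometheus address
    handlers:
      destination:
        query: |
          sum(istio_requests_total{source_app="$source_app",reporter="destination"})by(destination_service)
        type: Group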

In AccessLog mode, the metric source is the access log of the Global-sidecar. While forwarding requests, the Global-sidecar generates access logs in a fixed format and sends them to the Lazyload Controller for processing. This mode has no external dependencies and is more portable.

2.6 Usage modes

The lazy loading module supports two usage modes, Namespace mode and Cluster mode. In both modes the Lazyload Controller is globally unique; the difference is that in Namespace mode the Global-sidecar is deployed per namespace, while in Cluster mode it is deployed per cluster, as shown below.

For N namespaces, the number of lazy loading components is O(N) in Namespace mode and O(1) in Cluster mode. We now prefer Cluster mode: as the figure above shows, each cluster only needs two Deployments, which is clean and simple.

For a detailed introduction, please refer to Lazy Loading Tutorial - Installation and Use

3. Intelligent Rate Limiting

3.1 Background

Since Istio removed Mixer, implementing rate limiting in a service mesh has become difficult:

  • Limited scenarios: Envoy's built-in local rate limiting is simple and cannot cover advanced usages such as globally averaged or globally shared rate limiting
  • Complex configuration: local rate limiting relies on Envoy's built-in plugin envoy.local.ratelimit, and users have to face complex EnvoyFilter configuration
  • Fixed conditions: there is no way to adjust the rate limiting configuration automatically based on actual conditions such as resource usage

3.2 Value

To solve these problems, we introduced the intelligent rate limiting module. It has several advantages:

  • Multiple scenarios: supports local rate limiting, globally averaged rate limiting, and globally shared rate limiting
  • Easy configuration: the configuration is simple and readable, and there is no need to write EnvoyFilter
  • Adaptive conditions: the condition that triggers rate limiting can be computed dynamically from Prometheus metrics, enabling adaptive rate limiting

3.3 Implementation

We designed a new CRD, SmartLimiter, whose configuration rules are close to natural language. The module's logic is divided into two parts:

  • the SmartLimiter Controller obtains monitoring data and updates the SmartLimiter CR
  • the SmartLimiter CR is converted into EnvoyFilter

The rate limiting module architecture is as follows

Red represents local rate limiting, green globally averaged rate limiting, and blue globally shared rate limiting. For a detailed introduction, please refer to Smart Rate Limiting Tutorial - Architecture

3.4 Local rate limiting

Local rate limiting is the most basic scenario: SmartLimiter sets a fixed rate limit for each Pod of the service. Under the hood it relies on Envoy's built-in plugin envoy.local.ratelimit. The identifying field is action.strategy: single.

An example follows; it limits port 9080 of each Pod of the reviews service to 100 requests per minute.

apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:   # matches the whole service; the keyword is _base, or a subset you have defined, such as v1
      descriptor:
      - action:    # rate limiting rule
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'single'
        condition: 'true'  # always apply this rate limit
        target:
          port: 9080
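
To give a sense of what the module saves users from writing, the EnvoyFilter generated from the SmartLimiter above would look roughly like the following. This is a sketch based on Istio's EnvoyFilter API and Envoy's local rate limit filter, not necessarily the module's exact output; the metadata name and the runtime keys are assumptions.

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: reviews.default.local-ratelimit  # hypothetical name
  namespace: default
spec:
  workloadSelector:
    labels:
      app: reviews
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 9080
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          stat_prefix: http_local_rate_limiter
          token_bucket:           # 100 requests per 60s, matching quota / fill_interval above
            max_tokens: 100
            tokens_per_fill: 100
            fill_interval: 60s
          filter_enabled:         # enable and enforce for 100% of requests
            runtime_key: local_rate_limit_enabled
            default_value:
              numerator: 100
              denominator: HUNDRED
          filter_enforced:
            runtime_key: local_rate_limit_enforced
            default_value:
              numerator: 100
              denominator: HUNDRED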

3.5 Globally averaged rate limiting

Globally averaged rate limiting takes the total limit set by the user and distributes it evenly across the Pods. Under the hood it also relies on Envoy's built-in plugin envoy.local.ratelimit. The identifying field is action.strategy: average.

An example follows; it limits port 9080 of the reviews service to 100 requests per minute in total (action.quota), shared evenly across its Pods.

apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:
      descriptor:
      - action:
          fill_interval:
            seconds: 60
          quota: '100/{{._base.pod}}' # if reviews has 2 instances, each Pod is limited to 50 requests per minute
          strategy: 'average'  
        condition: 'true'
        target:
          port: 9080

3.6 Globally shared rate limiting

Globally shared rate limiting caps the total number of requests across all Pods of the target service. Unlike globally averaged rate limiting, it does not pin each Pod to an even share, which makes it better suited to unevenly distributed traffic. A global counter is maintained in this scenario; under the hood it relies on the Envoy plugin envoy.filters.http.ratelimit and the global counting capability provided by the RLS service. The identifying field is action.strategy: global.

An example follows; it limits port 9080 of the reviews service to 100 requests per minute in total, without distributing the quota evenly to each Pod.

apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:
      #rls: 'outbound|18081||rate-limit.istio-system.svc.cluster.local' # defaults to this address if not specified
      descriptor:
      - action:
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'global'
        condition: 'true'
        target:
          port: 9080 

3.7 Adaptive rate limiting

In all three scenarios above, the condition field that triggers rate limiting can be not only a fixed value (true) but also the result of a Prometheus query; the latter is adaptive rate limiting. Adaptive rate limiting is orthogonal to the three scenarios above and can be combined with any of them.

Users can customize which monitoring metrics to collect. For example, define a handler cpu.sum whose value equals sum(container_cpu_usage_seconds_total{namespace="$namespace",pod=~"$pod_name",image=""}), and then set the trigger condition to the form {{._base.cpu.sum}}>100 to achieve adaptive rate limiting.

The following example limits port 9080 of each Pod of the reviews service to 100 requests per minute only when the CPU usage value exceeds 100. Compared with the example in 3.4, condition is no longer always true: whether rate limiting is triggered is decided by the SmartLimiter Controller based on the actual state of the application, which is more intelligent.

apiVersion: microservice.slime.io/v1alpha2
kind: SmartLimiter
metadata:
  name: reviews
  namespace: default
spec:
  sets:
    _base:   # matches the whole service; the keyword is _base, or a subset you have defined, such as v1
      descriptor:
      - action:    # rate limiting rule
          fill_interval:
            seconds: 60
          quota: '100'
          strategy: 'single'  
        condition: '{{._base.cpu.sum}}>100'  # apply this rate limit only when the service's total CPU usage exceeds 100
        target:
          port: 9080

4. Project Architecture

This section briefly introduces Slime's project architecture to help you understand the code repository layout and deployment form in Slime's multi-module scenario. The architecture is shown in the figure below.

Slime's project architecture follows the "high cohesion, low coupling" design philosophy and consists of three parts:

  • Modules: independent modules, each providing a specific capability; lazy loading and intelligent rate limiting are Modules

  • Framework: the base on which Modules run, providing the basic capabilities they need, such as log output and metric collection

  • Slime-boot: the startup component responsible for launching the Framework and the specified Modules

The code repositories follow a 1+N layout: Slime-boot and the Framework live in the main repository slime-io/slime, while modules such as lazy loading each live in an independent repository.

The deployment form is also 1+N: a single Slime Deployment contains the common Framework plus the N modules the user wants. The advantage is that no matter how many Slime modules are used, only one Deployment is deployed, which removes the maintenance pain of operating too many microservice components.
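
As an illustration of the 1+N form, a SlimeBoot resource that enables two modules in one Deployment might look roughly like this (a sketch only; the exact fields and image follow the project's installation docs, and the names used here are assumptions):

apiVersion: config.netease.com/v1alpha1
kind: SlimeBoot
metadata:
  name: slime
  namespace: mesh-operator
spec:
  image:
    pullPolicy: Always
    repository: docker.io/slimeio/slime-bundle   # hypothetical bundled image containing both modules
    tag: latest
  module:
  - name: lazyload   # enable the lazy loading module
    kind: lazyload
    enable: true
  - name: limiter    # enable the intelligent rate limiting module
    kind: limiter
    enable: true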

5. Outlook

Slime has been open source for more than a year. Besides adding new module-level features and improving existing ones, it has gone through a major architectural adjustment and a rebuild of the metric system; Slime's development has now entered a new stage. Future plans fall into two categories, improvement of existing modules and introduction of new modules, described in detail below.

5.1 Lazy Loading Planning

  • Disaster recovery capability: modify the Global-sidecar component to improve its fallback capability so that lazy loading can be used in disaster recovery scenarios (definite, planned for 2022 Q2)
  • Multi-service-registry support: lazy loading is currently adapted mainly to Kubernetes scenarios; we plan to support ServiceEntry to cover multi-registry scenarios (definite, planned for 2022 Q2)
  • More flexible static configuration: automate ServiceFence configuration through a higher-level abstraction and support more advanced static rules (definite, planned for 2022 Q3)
  • Multi-protocol lazy loading: lazy loading currently supports HTTP services; we plan to support other protocols such as Dubbo (exploratory, planned for 2022 H2)
  • Cross-cluster lazy loading: lazy loading currently supports services within a single cluster; we plan to support cross-cluster lazy loading in multi-cluster service mesh scenarios (exploratory, planned for 2022 H2)

5.2 Intelligent Current Limiting Planning

  • Multi-service-registry support: intelligent rate limiting is currently adapted mainly to Kubernetes scenarios; we plan to support ServiceEntry to cover multi-registry scenarios (definite, planned for 2022 Q2)
  • Outbound rate limiting: intelligent rate limiting currently supports limiting inbound traffic, which covers most scenarios; for completeness of capability we plan to support outbound rate limiting (definite, planned for 2022 Q3)
  • Multi-protocol intelligent rate limiting: intelligent rate limiting currently supports HTTP services; we plan to support other protocols such as Dubbo (exploratory, planned for 2022 H2)
  • Cross-cluster intelligent rate limiting: intelligent rate limiting currently supports services within a single cluster; we plan to support cross-cluster rate limiting in multi-cluster service mesh scenarios (exploratory, planned for 2022 H2)

5.3 New Module Planning

  • IPerf: a performance testing suite purpose-built for Istio, integrating the Istio testing framework, adding custom test cases, and enabling intuitive comparison of performance across versions (definite, planned for 2022 H2)
  • Tracy: full-link automated operation and maintenance for the service mesh, improving troubleshooting efficiency and providing intelligent diagnosis (definite, planned for 2022 H2)
  • I9s: similar to K9s, a terminal-based, half-command-line, half-graphical operations tool for service mesh scenarios (definite, planned for 2022 H2)

We hope these plans will reach you soon. More information about Slime can be found at Slime - Home, and you are welcome to get in touch with us.

Related Reading

Slime: Making Istio Service Meshes More Efficient and Smart

About the author: Wang Chenyu, senior server-side development engineer at NetEase Shufan, Istio community member, and Slime maintainer. He is familiar with Istio and Kubernetes, is mainly responsible for the design and development of Slime and the NetEase Shufan Qingzhou service mesh, and has years of hands-on experience in cloud native.
