Design summary of a distributed configuration system in the WeChat R&D environment

Author: ypaapyyang, backend development engineer at Tencent WXG; personal WeChat official account: 码农课代表.

This article analyzes the necessity, feasibility, and key constraints of a distributed configuration system, and then introduces a practical attempt at one within the WeChat R&D environment based on that analysis.

Preface

For many business developers, handling operational data is no easy task: it usually requires custom data cleaning, format conversion, and tool development. I once spent two days importing nearly ten types of configuration data in one go, which is not a pleasant memory. If that experience had any value, it was that it pushed me to think about distributed configuration systems and to put them into practice at work, so as not to fall into such a painful process again.

Definition of configuration

We know that software modeling is essentially a mapping of the real world (people, events, things, and rules), and the output of that mapping includes program systems and configuration. Configuration gives us the ability to modify a program's behavior dynamically at runtime, often described as "adjusting the flight attitude while the system is in operation". The root cause is that "we humans cannot control and predict everything, and this carries over to software: we always need to reserve some control lines for certain functions of a system, so that when we need to in the future, we can pull these lines to control the system's behavior."

Therefore, configuration in this article refers specifically to data that is produced by internal operators (system operators in a broad sense, including product, operations, and R&D staff) and that acts as input parameters to program systems (including real-time systems, batch programs, data tasks, etc.).

In summary, the configuration usually includes the following three types:

a. Environment configuration, which defines parameters related to the environment the application runs in, such as IP and port;

b. Application configuration, which defines parameters or security controls of the application itself, such as the initial memory allocation size, database connection pool size, log level, and account passwords (passwords, certificates, and similar secrets must never be placed in the configuration system; they should go through a unified encryption and decryption service);

c. Business configuration, which defines the business behavior data the application acts on, such as the most common feature switches, lists of participating merchants, and so on.

System constraints

Data model

The most basic data unit of configuration is key=value (i.e., a configuration item). A feature switch, for example, is usually the simplest kind: a boolean value that determines which path the program takes (leaving gray release aside). However, key=value alone is not enough. The connection configuration of a DB, for example, includes fields such as ip, port, username, and password. In an ini file these are separate configuration items, yet logically they belong to the same configuration object. Following object-oriented design, key=object is therefore a more general configuration model, and its physical implementation can be JSON, XML, or a protobuf message.

Object data can be flat or multi-level (nested). In real business applications, flat data has a special property: it usually has many entries. The most typical example is a whitelist, which may contain tens of thousands of entries. Offline, internal operators manage this kind of data in Excel. If we simply packed it into a single object, the oversized data could reduce system efficiency (either writing or reading the configuration becomes slower). We therefore express such data as an array of plain objects, that is, as key=table data.
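
The three data models above can be pictured as plain C++ types. This is only a minimal sketch under assumptions: the type names (PlainObject, ConfigItem, ConfigObject, ConfigTable), the helper Field, and the example keys and values are hypothetical and do not come from the system described here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One flat record: field name -> textual value, as an operator would enter it.
typedef std::map<std::string, std::string> PlainObject;

// key=value: a single configuration item, e.g. a feature switch.
struct ConfigItem {
    std::string key;
    std::string value;
};

// key=object: related fields grouped under one key, e.g. a DB connection.
struct ConfigObject {
    std::string key;
    PlainObject fields;
};

// key=table: an array of plain objects, e.g. a whitelist with many rows.
struct ConfigTable {
    std::string key;
    std::vector<PlainObject> rows;
};

// Returns one field of a key=object configuration; the field must exist.
inline const std::string& Field(const ConfigObject& obj, const std::string& name) {
    PlainObject::const_iterator it = obj.fields.find(name);
    assert(it != obj.fields.end());
    return it->second;
}

// Hypothetical example values, for illustration only.
inline ConfigObject MakeExampleDbConfig() {
    ConfigObject db;
    db.key = "db.order";
    db.fields["ip"] = "10.0.0.1";
    db.fields["port"] = "3306";
    return db;
}

inline ConfigTable MakeExampleWhitelist() {
    ConfigTable table;
    table.key = "merchant.whitelist";
    PlainObject row;
    row["merchant_id"] = "1001";
    table.rows.push_back(row);
    row["merchant_id"] = "1002";
    table.rows.push_back(row);
    return table;
}
```

Keeping each table row a flat object is what would let a large whitelist be written and read row by row instead of as one oversized value.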

Access model

Unlike data generated by product users, data in a configuration system flows one way. An offline system is combined with a real-time system, and reads and writes are separated (asynchronous writes, real-time reads). The distributed configuration system we want to build must ultimately be based on this access model.

System constraints

As producers, internal operators need all configuration to be textual (readable). The data volume is small (compared with production data such as user and system data), requires little storage, and is updated infrequently. In the overall architecture, the input side is like a keyboard relative to the CPU: an extremely slow device. What operators demand more of is the system's ease of use, operability, and security.

Consider a user profile system for comparison. It partly matches the access model of a configuration system: data flows one way, an offline system writes the profile data, and the real-time system reads it. But first, its data producers are usually offline tasks rather than operators; second, the data volume involved is huge and usually requires a customized storage engine. It is therefore not comparable to a configuration system.

Consumers of the configuration system, in contrast, perform high-frequency reads and place higher demands on throughput, latency, network traffic, availability, consistency, and request monotonicity. We will examine these one by one below.

The design of a configuration system should fully account for the data model, access model, and system constraints described above. (Oddly, when I surveyed the implementations of related configuration systems, I rarely saw any discussion of consistency or request monotonicity. That is also why I wrote this article.)

Security constraints

Because configuration makes it easy to change system behavior at runtime, configuration security is very important. The necessary condition for security is that the right people release the right configuration, in the right way, at the right time. The configuration system must therefore not only support the basic capability of gray release, but also invest in permission management, permission granularity, change review, auditing, and version history.

System evolution

Stand-alone configuration file

In the era of stand-alone systems, we basically used configuration files (ini files, XML files, etc.) to store configuration data. Configuration files are easy to understand, easy to implement, and highly available, so they are still widely used even in the era of distributed clusters.

However, the configuration file has many disadvantages, including:

  • Poor ease of use, mainly because each format can express only a single kind of data. For example, ini can only manage configuration items, that is, key=value data; and if XML files are used to manage key=table data, initializing the file content is inefficient, error-prone, and hard to maintain;

  • Poor operability. Configuration files can basically only be modified and released by developers, so routine changes to business data by product and operations staff have to go through developers, which seriously hurts the efficiency of business processes;

  • Correctness and security are hard to guarantee. Because configuration files are so easy to implement, many teams neglect to build an operations system around them. Nothing prevents developers from modifying configuration files arbitrarily or even maliciously, and fine-grained permission management, change review, and auditing are out of the question;

  • Low release efficiency. Configuration files are deployed on individual machines. In a large cluster, any change to a configuration file must go through a lengthy gray-release process to reach the whole fleet, and if the file is loaded statically, the binary must also be restarted, which consumes a great deal of R&D and operations effort;

  • File consistency is hard to guarantee. If a machine in the cluster is down while a configuration change is being published, the configuration will differ between machines. There is no automatic correction; fixing it depends on people or on the operations and maintenance system, and in the meantime the business may enter undefined behavior.

If ease of use, operability, correctness, and security can be improved by building an operations system, then low release efficiency and the difficulty of guaranteeing file consistency are the Achilles' heel of stand-alone configuration files. The essential reason is that a stand-alone configuration file system accepts changes from the outside passively and discretely, and has no ability to act on its own.

Centralized configuration file center

As a result, centralized configuration file systems emerged to solve exactly these problems. Developers store configuration files in an independent third-party service (typically ZooKeeper, though some teams implement their own microservice for this). An agent then either periodically pulls the configuration into a local cache (pull), or changes are published to the relevant cluster through event subscription and notification (push).

A centralized configuration file system specifically addresses release efficiency and configuration file consistency. However, in the deployments I know of, the following problems remain:

  • Consistency is coarse-grained. A centralized configuration file system can only ensure that the distributed cluster reaches eventual consistency (how long this takes depends on the frequency and rate of pull and push). It cannot guarantee that all processes, threads, and coroutines see the same data for any given configuration at any given moment, which often leads to unexpected business failures;

  • Request monotonicity cannot be guaranteed. Within a single business request, we want the configuration the request sees to be static. If it changes midway, the request may fail or, worse, leave user data in an inconsistent state. Configuration based on a centralized configuration file system is usually loaded dynamically, so a change may take effect in the real-time system at any moment, and a single business request may see different data states in succession;

  • Security is still not fully guaranteed. Although changes to the centralized configuration files can be permission-controlled, a developer can still manually modify the local configuration file cache on a consumer machine and thereby affect the running behavior of the program;

  • No gray-release capability. Configuration file changes are distributed to all machines at once. If gray release is needed, each business team involved has to implement it itself.

For configuration file systems, whether stand-alone or centralized, the problems ultimately stem from using configuration files as the carrier and from positioning the centralized system as a mere pipeline. Together these make fine-grained management expensive:

  • Visibility and readability of configuration files matter to producers but not to consumers. Using configuration files as the carrier along the entire link can therefore make loading inefficient (for example, when handling blacklists and whitelists with tens of millions of entries, or when the business loads the file dynamically on the request path);

  • It is hard to manage metadata securely and conveniently with configuration files. Achieving consistency, monotonicity, and security requires managing some metadata alongside the configuration (detailed below), but a configuration file system has no such capability unless each business team builds it at high cost;

  • The number of configuration files grows with the number of configurations. Over time the files proliferate, bringing new operational problems;

  • A centralized configuration file system usually positions itself only as a pipeline (as far as the author knows). It neither understands nor maintains the content of the configuration files, the agent has a single function, and business consumers do not interact with the system directly but only see configuration files. Loose coupling can improve availability, but it also leaves the business side to invest substantial development effort in processing configuration files.

The configuration file is only the physical carrier of configuration, and the shortcomings above are not insurmountable. It is just that in a file-based configuration system, achieving these capabilities is costly and requires more usage constraints and more peripheral support.

Database configuration storage

For configuration that is structurally complex and comes in many types, business developers usually do not use configuration files directly, but store the configuration in database tables (relational or non-relational) and write tools to import the data. This approach overcomes some of the problems of configuration files and allows finer-grained management. But it has obvious drawbacks of its own: it is highly customized, not reusable, and involves much repeated development. We therefore need to improve on it by extracting what is common in how configuration is stored, read, written, and managed, and by generalizing it into a platform.

Design considerations

Physical model

Since configuration files are hard to manage at a fine granularity and exist as physical entities (local files) that are easy to tamper with, we need a new data structure to carry configuration. As discussed earlier, configuration has two data models, key=object and key=table. For operators, configuration must be visible, readable, and manageable; to achieve this, we only need to build a well-designed operations system between internal operators and the core of the configuration system. What about the back end? Consumers care most about transmission and computation efficiency. To align with the microservice framework as well, a protobuf message is undoubtedly the best fit.

However, protobuf is not self-describing. Without the message definition, we can neither convert textual configuration into a pb binary stream nor deserialize it. The business message definition would therefore have to be supplied to the operations system, yet protobuf is not well suited to visual editing. A feasible approach is thus to define, visually edit, transmit, and store configuration as JSON, and to convert it to the typed message only when it reaches the business side.

Security management

Building a configuration operations system and making it the only entry point through which operators manage configuration pays off handsomely. On top of it we can add various security enhancements. For example, a configuration change requires the corresponding permission and can be applied to the system only after review; every operation must be auditable; and historical versions of a configuration can be looked up quickly.

Capabilities such as gray release and rollback should likewise be operated through the operations system.

Configuration system SDK

As mentioned above, a centralized configuration file system is positioned as a pipeline: the agent is only responsible for periodically pulling configuration and caching it in the local file system, and the business system is loosely coupled to the configuration system. We believe that working with configuration files still carries a high development cost. For the business side, the ideal interface would be:

int GetConfig(const std::string& key, ::google::protobuf::Message& msg);

with no need to understand the content or format of any file. We therefore need to provide an SDK for the configuration system that hides the details of the configuration system, its data structures, and so on from the business side, so that the business sees only the configured protobuf Message object.

On top of the SDK, with only slight involvement from consumers (business plugins, see below), we can implement protocol conversion, configuration caching, fast eventual consistency across processes, threads, and coroutines, request monotonicity, and gray release.

The configuration system SDK is the foundation of fine-grained management. We can implement the capabilities above by maintaining configuration metadata in addition to the configuration content itself.

Asynchronous

Asynchrony is the key to the configuration SDK. Many local caches are refreshed periodically by requests on the real-time path, which is easy to implement but inefficient, especially since we also want to run business logic on the configuration when it loads. The best approach is therefore to perform configuration loading, initialization, and other processing in an asynchronous process.

The problem asynchrony introduces is concurrency between the asynchronous process and real-time requests: how should read requests on the real-time path be handled while the asynchronous process is changing the configuration? This is an engineering problem that we will discuss in another article; one feasible approach is multi-versioning combined with reference counting.
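
As a rough illustration of the multi-version plus reference counting idea, the sketch below lets std::shared_ptr do the counting: a reader that acquired an old version keeps it alive while the loader publishes a new one. The names ConfigSnapshot and SnapshotHolder are hypothetical, and a mutex guards the pointer swap for simplicity; this is not the SDK's actual implementation, which the article leaves to a later discussion.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>

// Immutable snapshot of one configuration version (fields are illustrative).
struct ConfigSnapshot {
    long version;
    std::string payload;
};

// Holds the current snapshot. Readers receive a shared_ptr, so an old version
// stays alive (reference counted) until the last reader releases it.
class SnapshotHolder {
public:
    SnapshotHolder() {
        std::shared_ptr<ConfigSnapshot> initial = std::make_shared<ConfigSnapshot>();
        initial->version = 0;
        current_ = initial;
    }

    // Called on the real-time request path: copies one pointer under a short lock.
    std::shared_ptr<const ConfigSnapshot> Acquire() const {
        std::lock_guard<std::mutex> lock(mu_);
        return current_;
    }

    // Called by the asynchronous loader once a new version is fully built.
    void Publish(long version, const std::string& payload) {
        std::shared_ptr<ConfigSnapshot> next = std::make_shared<ConfigSnapshot>();
        next->version = version;
        next->payload = payload;
        std::lock_guard<std::mutex> lock(mu_);
        assert(version > current_->version);  // versions only move forward
        current_ = next;
    }

private:
    mutable std::mutex mu_;
    std::shared_ptr<const ConfigSnapshot> current_;
};
```

All expensive work (parsing, validation, index building) happens before the lock is taken, so the request path only pays for a pointer copy.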

Business plugin

Another benefit of asynchrony is that the business can run initialization actions when a configuration takes effect, such as verifying its correctness and building data structures suited to the business. For example, a business whitelist is just an array in pb, so checking whether an entry is present is relatively expensive; the business would much rather store it in a map. Making the configuration SDK asynchronous thus lays the foundation for business plugins.
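
A business plugin of this kind might look roughly like the following. It is a sketch under assumptions: WhitelistConfig stands in for the decoded protobuf message, BuildWhitelistIndex is a hypothetical hook the asynchronous loader would call, and the rule that id 0 is invalid is invented purely to show validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Stand-in for the decoded protobuf message: the whitelist is just an array.
struct WhitelistConfig {
    std::vector<std::uint64_t> merchant_ids;
};

// Business-side view, built once when a new version takes effect.
struct WhitelistIndex {
    std::unordered_set<std::uint64_t> ids;
    bool Contains(std::uint64_t id) const { return ids.count(id) != 0; }
};

// Hypothetical plugin hook: validates the raw configuration and builds the index.
// Returns nullptr to reject the new version, so the old one can stay in effect.
inline std::shared_ptr<const WhitelistIndex> BuildWhitelistIndex(const WhitelistConfig& raw) {
    std::shared_ptr<WhitelistIndex> index = std::make_shared<WhitelistIndex>();
    for (std::size_t i = 0; i < raw.merchant_ids.size(); ++i) {
        if (raw.merchant_ids[i] == 0) return nullptr;  // assumed rule: 0 is not a valid id
        index->ids.insert(raw.merchant_ids[i]);
    }
    return index;
}
```

Because the hook runs off the request path, the cost of building the hash set is paid once per configuration version rather than on every lookup.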

Push and pull

We prefer to have the SDK actively pull configuration updates. The trade-off between push and pull is one of efficiency versus availability. Push is more efficient and wastes no network traffic, but it introduces a new system dependency (an event center). Entities should not be multiplied beyond necessity, so we lean toward periodic active pulls by the SDK. As for efficiency, it can be brought to an acceptable level through various engineering techniques.

Of course, this also depends on the scale of the system. If we were discussing a company-wide configuration system rather than one at the level of a single department or center, we would also seriously consider push or a combined push-pull mode.
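
One polling round of such a pull could be sketched as follows. The version-conditional request (the SDK sends its local version and the service replies with data only when something newer exists) is an assumption about one possible way to keep pull traffic small, not a description of the actual protocol; PullReply, FetchFn, and ConfigPuller are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Reply of the hypothetical config service to one pull request.
struct PullReply {
    bool changed;         // false when the caller is already up to date
    long version;
    std::string payload;  // empty when nothing changed
};

// Hypothetical transport: given the local version, ask the service for anything newer.
typedef std::function<PullReply(long)> FetchFn;

class ConfigPuller {
public:
    explicit ConfigPuller(const FetchFn& fetch) : fetch_(fetch), version_(0) {}

    // One polling round; a background thread would call this at a fixed interval.
    // Returns true if a newer version was stored.
    bool PullOnce() {
        PullReply reply = fetch_(version_);
        if (!reply.changed || reply.version <= version_) return false;
        version_ = reply.version;
        payload_ = reply.payload;
        return true;
    }

    long version() const { return version_; }
    const std::string& payload() const { return payload_; }

private:
    FetchFn fetch_;
    long version_;
    std::string payload_;
};
```

With this shape, an unchanged configuration costs one small request per interval and no payload transfer.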

Fast eventual consistency

Both stand-alone and centralized configuration file systems suffer from serious inconsistency. After a configuration change, it generally takes a long time to reach eventual consistency (that is, for all concurrent readers to see the same data state).

One feasible approach is multi-versioning with timed activation. A configuration becomes externally visible only at a set time in the future, by which point the SDKs have already pulled the latest data. How to ensure that every SDK has pulled the data is an availability question that we will discuss in another article.
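
Timed activation can be sketched by tagging each pulled version with an effective time and letting readers see only the newest version whose time has arrived. The sketch assumes machine clocks are reasonably synchronized; TimedVersion and TimedConfig are hypothetical names, and timestamps are plain integers to keep the example deterministic.

```cpp
#include <cassert>
#include <map>
#include <string>

// One pulled version together with the time at which it becomes visible.
struct TimedVersion {
    long version;
    long effective_at;  // Unix seconds, set by the operations system
    std::string payload;
};

class TimedConfig {
public:
    // Stores a pulled version; it stays invisible until its effective time.
    void Add(const TimedVersion& v) { by_time_[v.effective_at] = v; }

    // Returns the newest version whose effective time is <= now,
    // or nullptr if no version is effective yet.
    const TimedVersion* Visible(long now) const {
        std::map<long, TimedVersion>::const_iterator it = by_time_.upper_bound(now);
        if (it == by_time_.begin()) return nullptr;
        --it;
        return &it->second;
    }

private:
    std::map<long, TimedVersion> by_time_;
};
```

Every SDK instance that has pulled the pending version switches at the same effective time, so the window of inconsistency would shrink from the pull interval to roughly the clock skew between machines.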

Request monotonicity

Timed activation does not solve request monotonicity. Request monotonicity means that while a real-time service processes a request, the configuration content read anywhere in that request's call stack must stay fixed, even if a pending version becomes effective midway. One approach is to cache the configuration version in a thread-local variable (or a coroutine-local variable).
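
A minimal sketch of the thread-local idea: the request entry point pins the currently effective version, and every configuration read on that thread uses the pinned version until the request ends. EffectiveVersion, PinnedVersion, RequestScope, and VersionForRead are hypothetical names; a coroutine-based service would need a coroutine-local slot instead of thread_local.

```cpp
#include <atomic>
#include <cassert>

// Process-wide "currently effective" version, advanced by the asynchronous loader.
inline std::atomic<long>& EffectiveVersion() {
    static std::atomic<long> version(1);
    return version;
}

// Version pinned by the request running on this thread; 0 means "not pinned".
inline long& PinnedVersion() {
    thread_local long pinned = 0;
    return pinned;
}

// RAII guard placed at the request entry point: pins on entry, unpins on exit.
class RequestScope {
public:
    RequestScope() { PinnedVersion() = EffectiveVersion().load(); }
    ~RequestScope() { PinnedVersion() = 0; }
    RequestScope(const RequestScope&) = delete;
    RequestScope& operator=(const RequestScope&) = delete;
};

// Version that configuration reads on the current thread should use.
inline long VersionForRead() {
    long pinned = PinnedVersion();
    return pinned != 0 ? pinned : EffectiveVersion().load();
}
```

This only works together with multi-versioning: the pinned version must still be held in memory while any request refers to it.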

Grayscale release

With the SDK's multi-version capability in place, gray release is also easy to implement. Gray release is simply the ability to choose which configuration version is in effect: if the machine, the role, or the business key of the request (such as user, merchant, or order) falls within the gray range, the new version is used; otherwise the original version is used.
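
Version selection for gray release could be sketched as below. Defining the gray range as a percentage of hashed business keys is an assumption made for illustration (the text also mentions machine and role as criteria); GrayRule and SelectVersion are hypothetical names, and FNV-1a is used only because it gives the same bucket on every machine, unlike std::hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// 32-bit FNV-1a: a stable hash, so a given key lands in the same bucket everywhere.
inline std::uint32_t Fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < s.size(); ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 16777619u;
    }
    return h;
}

// Hypothetical gray rule: `percent` of business keys see the new version.
struct GrayRule {
    long old_version;
    long new_version;
    unsigned percent;  // 0..100
};

// Chooses the configuration version for one request's business key.
inline long SelectVersion(const GrayRule& rule, const std::string& business_key) {
    assert(rule.percent <= 100);
    unsigned bucket = Fnv1a(business_key) % 100u;  // bucket in 0..99
    return bucket < rule.percent ? rule.new_version : rule.old_version;
}
```

Because the bucket depends only on the key, a given user, merchant, or order consistently sees the same version for as long as the rule is unchanged.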

Efficiency improvement

Efficiency improvements include reducing the amount of data transmitted over the network and reducing the load on the configuration storage service. These are specific engineering techniques that we will not cover in this design discussion.

Availability improvement

Improving the availability of distributed systems is a well-worn topic. To keep the focus on capabilities unique to the configuration system, we will not discuss it specifically in this article.

(That said, minimizing single points of failure in the system is an important principle, which the earlier section "Push and pull" also touches on. For business availability, a third-party configuration system's operational support, proactive fault detection, fault notification, and ability to reproduce and locate problems are also very important. This is a major reason for reinventing the wheel: many teams build good software, but their service capabilities (mainly operations) leave something to be desired.)

Join us

The overseas payment team is looking for like-minded colleagues in the continuous pursuit of excellence. Open position:

28605-WeChat Pay Overseas Payment Front-end Development Engineer (Shenzhen)

Origin blog.csdn.net/Tencent_TEG/article/details/109152879