Four years of configuration center practice from a HUAWEI CLOUD IoT senior engineer

Abstract: Over the past four years, we moved from a self-developed configuration center to Apollo, and then to running Apollo alongside a self-developed configuration center. This article summarizes how our configuration center evolved over those years.

This article is shared from the HUAWEI CLOUD community: "Four years of configuration center practice from a HUAWEI CLOUD IoT senior engineer", author: He Zhangjian, senior engineer of HUAWEI CLOUD IoT.

I have been using configuration centers since I joined Huawei in 2017. Over these four years we moved from a self-developed configuration center to Apollo, and then to running Apollo alongside a self-developed configuration center. Here I summarize that evolution and share some of our practices, in the hope that we can learn from each other. Apollo is excellent open source software; if I have misunderstood anything about it, I would welcome corrections.

Apollo is an open source configuration management center that centrally manages application configuration across environments and clusters. After a configuration is modified, it is pushed to the application side in real time. It also provides standardized permissions, process governance, and other features, and is well suited to microservice configuration management.

GitHub: https://github.com/apolloconfig/apollo

1 Types of configuration we use

1.1 Classification by scenario

1.1.1 Operation and maintenance configuration: configuration the program only reads

This configuration is entered manually in the configuration center UI, and the program only reads it. Examples include database configuration, mail server configuration, network card configuration, and subnet address configuration. This data never needs to be written dynamically by code.

1.1.2 Business configuration: configuration the program can write

We are a SaaS service, and each user has some business configuration on it, such as user certificate configuration and per-user server flow control configuration. Business configuration is more complicated than operation and maintenance configuration and may carry uniqueness constraints, such as being unique by user id. It is generally triggered by user operations, written dynamically by code, and then notified to every microservice instance. We usually also want this configuration to be visible in the UI and editable by hand. If each microservice implemented this logic itself, there would be a lot of repetitive code and quality could not be guaranteed, so we want a common component to provide this capability uniformly.

1.2 Classification by whether the configuration contains a list: single-value or multi-value

1.2.1 Single-value configuration

The entire configuration consists only of key-value pairs. The values are not in a complex format; they tend to be integers or strings.

1.2.2 Multi-value configuration

A multi-value configuration is more complicated: it is often a set of single-value configurations that take different values under different keys. For example, user 1 and user 2 may each have a different thread pool size and queue size.
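As a minimal sketch of the difference between the two kinds of configuration, the snippet below models both as plain Python dictionaries. All keys, field names, and values are hypothetical and chosen only for illustration; they are not taken from our actual configuration.

```python
# Hypothetical illustration: every key and value below is made up.

# Single-value configuration: a flat set of key-value pairs.
mail_server_config = {
    "smtp.host": "smtp.example.com",
    "smtp.port": 25,
}

# Multi-value configuration: the same fields repeated once per key
# (here the user id), each row carrying its own values.
thread_pool_config = {
    "user1": {"pool_size": 8, "queue_size": 100},
    "user2": {"pool_size": 16, "queue_size": 500},
}


def pool_settings(user_id: str) -> dict:
    """Look up the row that belongs to one user."""
    return thread_pool_config[user_id]
```

The multi-value form is what makes constraints such as "unique by user id" necessary: the user id acts as the key of each row.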

2 The first stage: a self-developed configuration center

Before we did cloud services, our configuration center had fewer levels. We delivered our product to customers as software, and at runtime it was divided into a management plane and a business plane, both of whose configuration was managed by the configuration center. The most complex scenario was multiple sets of business planes, where we had to ensure that configuration under different clusters and different microservices did not conflict. The configuration levels were cluster, microservice, and configuration.

At this stage the configuration center was entirely self-developed and had no features such as blue-green or grayscale configuration. Its distinctive features were as follows:

2.1 One table per configuration

  • In the storage model, each configuration corresponds to one data table.
  • This is friendly to multi-value configuration, especially complex business configuration, because it can support various primary key constraints. For single-value configuration it is somewhat heavyweight.
  • Every configuration has a strong schema. The restrictions cover type, size, length, sensitivity, and so on. They make for a good experience when modifying configuration in the UI (input boxes in different formats; sensitive fields entered as plaintext in the front end and stored encrypted in the back end), and they also allow thorough validation when configuration is written through the API.
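To make the storage model concrete, here is a minimal sketch of one configuration stored as its own table, with a primary key and schema-style constraints enforced by the database. SQLite stands in for the real database, and the table name, columns, and limits are hypothetical; this is not the code of our configuration center.

```python
import sqlite3

# Hypothetical configuration "flow_control": one table, one row per user.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE flow_control (
        user_id      TEXT PRIMARY KEY,  -- unique by user id
        max_requests INTEGER NOT NULL
                     CHECK (max_requests BETWEEN 1 AND 10000),  -- size limit
        server_ip    TEXT NOT NULL
                     CHECK (length(server_ip) <= 15)            -- length limit
    )
    """
)


def add_row(user_id, max_requests, server_ip):
    """Insert one row; return False if any constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO flow_control VALUES (?, ?, ?)",
                (user_id, max_requests, server_ip),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this layout, a second row for the same user or an out-of-range value is rejected by the table itself, which is the kind of guarantee that is awkward to get when all configurations share one generic key-value table.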

2.2 Ensuring configuration reliability through callbacks

Take adding a configuration as an example: before accepting it, the configuration center calls back an interface on the business microservice, which confirms whether the configuration can be added.

Some readers may ask what reliability this process can ensure. By calling the microservice interface, it checks whether the configuration is usable, for example whether an IP address is legal, whether the peer address is reachable, and whether the number of configurations exceeds the specification, so that the configuration is at least basically available.
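The callback idea can be sketched as follows. This is a simplified, in-process illustration under stated assumptions: in our system the callback is a call to a microservice interface, whereas here it is an ordinary Python function, and the names `ConfigCenter`, `validate_peer`, and `MAX_ITEMS` are hypothetical. The reachability check is left out because it needs a real network.

```python
import ipaddress
from typing import Callable, Dict, List

MAX_ITEMS = 3  # hypothetical specification limit


def validate_peer(existing: List[dict], candidate: dict) -> List[str]:
    """Callback a microservice might expose; returns a list of problems."""
    problems = []
    try:
        ipaddress.ip_address(candidate.get("ip", ""))
    except ValueError:
        problems.append("ip is not a legal address")
    if len(existing) >= MAX_ITEMS:
        problems.append("configuration quantity exceeds the specification")
    return problems


class ConfigCenter:
    """Stores items only after the registered callback accepts them."""

    def __init__(self):
        self.validators: Dict[str, Callable[[List[dict], dict], List[str]]] = {}
        self.items: Dict[str, List[dict]] = {}

    def register(self, name, validator):
        self.validators[name] = validator
        self.items[name] = []

    def add(self, name, candidate):
        # Ask the owning service first; store only if it reports no problems.
        problems = self.validators[name](self.items[name], candidate)
        if not problems:
            self.items[name].append(candidate)
        return problems
```

Because the check runs before the write, a rejected configuration never reaches the microservice instances, so their code does not have to defend against it.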

Overall, the experience with this self-developed configuration center was good at the time, but some problems remained. For example, some low-level interfaces carried all the data under a single configuration in one HTTP request, so when a single configuration had too many items, requests could hit problems such as response timeouts.

3 The second stage: Apollo

The main reason for the second stage was a reorganization: we shifted our business focus to cloud services, and the team went through a DevOps transformation at the same time. The old configuration center had been maintained by another team, and after the reorganization we would have had to maintain it ourselves to keep using it. So we had to choose between continuing to maintain the old configuration center and introducing open source Apollo. Beyond the operation and maintenance configuration and business configuration described above, our requirements had also changed:

  • Configuration levels were becoming richer
  • We needed to build the ability to release microservices in grayscale

On the one hand, the old configuration center was no longer maintained for us after the reorganization. On the other hand, it could not support rich configuration levels and had no grayscale release capability. Several Apollo features that the old configuration center lacked attracted us, such as the following (partly quoted from the Apollo GitHub homepage):

  • Rich levels, from app_id to cluster, namespace, and key-value, which meet our level requirements for region, cluster, and microservice;
  • Grayscale release of configuration. After clicking publish, a configuration takes effect only on some application instances, and is pushed to all instances after a period of observation shows no problems;
  • Every configuration release has a version, which makes configuration rollback easy;
  • Application and configuration management has a complete permission mechanism, and configuration management is split into two steps, editing and publishing, to reduce human error;
  • All operations have audit logs, which makes problems easy to trace.

So we brought in Apollo. My supervisor, one other colleague, and I did the work. We made major changes on top of the Apollo open source code, for the following reasons:

  • To save costs, replace the registry and database with components we were already using, since these two are not core dependencies of Apollo
  • Keep the strong schema that was an advantage of the old configuration center
  • Keep the callback confirmation process, so that wrong configuration is intercepted early and code has less abnormal configuration to handle
  • Stay compatible with old sites that still use the old configuration center, via SPI or environment variables

For these reasons, we ended up doing the following:

  • The database was switched to PostgreSQL, and the registry was switched to ServiceComb
  • Schema is implemented on the namespace, and each namespace can register its own Schema. A Schema requires the data to be in JSON format, and each value in the JSON must meet the specification the Schema defines (such as IP address, decimal, or integer)

Schema example:

[
    {
         "name":"name",
         "type":"string"
    },
    {
         "name":"age",
         "type":"int",
         "max":120
    },
    {
         "name":"ip",
         "type":"ipv4"
    }
]

Then the data should look like this:

{
     "name":"hezhangjian",
     "age":23,
     "ip":"127.0.0.1"
}

  • When a configuration is added or modified, a callback is made to the business service, which confirms whether the configuration can be added or modified;
  • Configuration layering: the cloud service corresponds to Apollo's app_id, the internal environment corresponds to the cluster on Apollo, and the microservice name and the configuration name are concatenated to form the configuration name on Apollo.
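The namespace Schema check described above can be sketched as a small validator. This is an illustration only, not the code we added to Apollo: it handles just the three types that appear in the Schema example (`string`, `int` with an optional `max`, and `ipv4`), and treating a missing field as an error is an assumption.

```python
import ipaddress
import json

SCHEMA = [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int", "max": 120},
    {"name": "ip", "type": "ipv4"},
]


def _is_ipv4(text):
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def validate(schema, raw):
    """Return a list of violations; an empty list means the data conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["data is not valid JSON"]
    if not isinstance(data, dict):
        return ["data must be a JSON object"]
    errors = []
    for field in schema:
        name, kind = field["name"], field["type"]
        if name not in data:
            errors.append(f"{name}: missing")  # assumption: fields are required
            continue
        value = data[name]
        if kind == "string":
            ok = isinstance(value, str)
        elif kind == "int":
            # bool is a subclass of int in Python, so exclude it explicitly
            ok = isinstance(value, int) and not isinstance(value, bool)
            if ok and "max" in field and value > field["max"]:
                errors.append(f"{name}: exceeds max {field['max']}")
                continue
        elif kind == "ipv4":
            ok = isinstance(value, str) and _is_ipv4(value)
        else:
            ok = False  # unknown schema type
        if not ok:
            errors.append(f"{name}: expected {kind}")
    return errors
```

Applied to the data example above, `validate(SCHEMA, ...)` returns an empty list, while an `age` of 130 or a malformed `ip` produces a violation that can be shown to the operator before the configuration is saved.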

The following figure shows the correspondence between business concepts and Apollo concepts. Some configurations are single-value configurations, and some are multi-value configurations, so the configuration item level is optional.
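The layering rule can also be written down as a small mapping function. This is a sketch under assumptions: the text says only that the microservice name and configuration name are concatenated, so storing the result as the Apollo namespace and using "." as the separator are both guesses made for illustration, and all example names are hypothetical.

```python
from typing import NamedTuple


class ApolloCoordinates(NamedTuple):
    app_id: str
    cluster: str
    namespace: str


def to_apollo(cloud_service, environment, microservice, config_name, sep="."):
    """Map business concepts onto Apollo concepts.

    Assumptions: the concatenated name is used as the namespace,
    and `sep` is the separator between the two parts.
    """
    return ApolloCoordinates(
        app_id=cloud_service,   # cloud service -> app_id
        cluster=environment,    # internal environment -> cluster
        namespace=f"{microservice}{sep}{config_name}",
    )
```

One consequence of this mapping is that every configuration of every microservice in a cloud service lands under a single app_id.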

During this period of practice, we also found the following problems.

3.1 Concurrency issues

The most serious problem was concurrency. First, all Apollo configuration is stored in one table. Second, Apollo was originally designed mainly for operations staff working manually in the UI, so the code has no concurrency semantics (or at least exposes none to clients), which made concurrency hard to handle when we wrote configuration through code.
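To show what "no concurrency semantics" costs when code writes configuration, here is a minimal single-threaded simulation of a lost update, together with a version check that would detect it. The `VersionedStore` class is a hypothetical in-memory stand-in, not Apollo code; a real implementation would have to perform the compare and the write atomically, for example inside one database statement.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """In-memory stand-in for a configuration table (hypothetical)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def blind_write(self, key, value):
        # No concurrency semantics: the last writer silently wins.
        version, _ = self.read(key)
        self._data[key] = (version + 1, value)

    def checked_write(self, key, expected_version, value):
        # Reject the write if someone else changed the key since we read it.
        version, _ = self.read(key)
        if version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{version}"
            )
        self._data[key] = (version + 1, value)
```

If two requests both read the list `["u1"]`, one appends `"u2"` and the other appends `"u3"`, then with `blind_write` the stored result contains only one of the two additions. With `checked_write`, the second request fails with `VersionConflict` and can re-read and retry.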

3.2 Performance issues

Opening the namespace list page displays every namespace under the app_id. Because a single app_id of ours stored all the configuration of a single cloud service, the number of namespaces was large, and since the page does not support paging, it loaded slowly.

3.3 Experience problems

Apollo's namespace page does not provide a search function (perhaps Apollo did not intend to support that much in its original design). To locate the namespace we wanted to view or modify, we could only use the browser's search.

4 The third stage: Apollo and a self-developed configuration center coexist

Beyond the problems above, there were other reasons we started the third stage:

  • Under the original top-down layering model, configuration was not isolated between microservices. That made permissions hard to manage, and it did not fit the DevOps idea of each microservice releasing autonomously;
  • In the second stage we had changed too much of Apollo, and after another organizational change we no longer had enough people to maintain it;
  • As the number of clusters grew, the callback function required two-way network connectivity, which made network maintenance inconvenient;
  • We had made many business-specific changes to Apollo's UI and APIs, which made it hard to share our Apollo with sibling departments

At that time there was considerable debate over whether to keep three features: Schema, callback checking, and writing configuration through code. Personally, I wanted to keep Schema and callback checking, because their advantages are significant and the interface could stay compatible and shareable with other departments; the drawback is that the Schema concept and the callback check process increase the learning cost. As for writing configuration through code, solving the concurrency problem would have required a large amount of code changes, so I did not recommend keeping it.

After intense discussion, we finally dropped all three features (Schema, callback checking, and writing configuration through code) and kept only the operation and maintenance configuration in Apollo.

We then moved the business configuration to a self-developed, strong-schema configuration center. It is responsible for the configuration of a single cluster only, and we deploy one instance per cluster to meet our business needs. Its core points are: one table per configuration, callbacks routed through the registry to check whether a configuration is legal, and long-connection push over the MQTT protocol, with no single-point bottleneck.

Our operation and maintenance configuration center, Apollo, went back to the open source version, and we reorganized the configuration structure.

The advantages of this operation and maintenance configuration setup are:

  • The configuration model suits releasing a single microservice;
  • Configuration is organized by microservice, so no page has many namespaces.

The disadvantages are:

  • Without Schema, what operators enter in the UI is not validated, so a configuration is accepted even if its format or content is wrong. Passwords cannot be entered in the UI as plaintext (Apollo cannot tell whether a field is sensitive); another tool must be used to convert the plaintext to ciphertext beforehand;
  • Without the callback check, operators get no immediate feedback on some configurations, such as network card and network segment configuration.

4.1 Best Practices

In our experience, business configuration is indeed not a good fit for open source Apollo. Operation and maintenance configuration uses native Apollo, which does not yet have callback checking or Schema. I hope Apollo can support Schema, or at least a weaker JSON format check, in a later version. Below are our recommended practices for several scenarios.

4.1.1 SRE operation and maintenance configuration in the UI

Implement this with Apollo. As for organizing the configuration, map your organizational structure and technical architecture onto Apollo's concepts; you can organize levels either as microservice -> deployment environment or as deployment environment -> microservice.

4.1.2 Complex parameter verification

We recommend building your own portal layer in front of Apollo. Its back-end service can perform more complex format validation, or even callback checks, and then call the Apollo OpenAPI to write the configuration to Apollo.
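A wrapping layer of this kind can be sketched as "validate first, then forward". The example below only builds the HTTP request and hands it to an injectable `send` function, so it runs without an Apollo server. The URL path, the `Authorization` header, and the body fields follow the Apollo OpenAPI "create item" call as I understand it and should be verified against the documentation of your Apollo version; the function names and example values are hypothetical.

```python
import json
import urllib.request


def build_create_item_request(portal_url, token, env, app_id, cluster,
                              namespace, key, value, operator):
    """Build (but do not send) a request that creates one configuration item."""
    # Assumption: path and fields as described in the Apollo OpenAPI docs.
    url = (f"{portal_url}/openapi/v1/envs/{env}/apps/{app_id}"
           f"/clusters/{cluster}/namespaces/{namespace}/items")
    body = json.dumps(
        {"key": key, "value": value, "dataChangeCreatedBy": operator}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": token,
            "Content-Type": "application/json;charset=UTF-8",
        },
    )


def save_config(validate, send, request_args, value):
    """Run the extra validation; forward to Apollo only if it passes."""
    errors = validate(value)
    if errors:
        return errors
    send(build_create_item_request(value=value, **request_args))
    return []
```

In a real deployment `send` could be `urllib.request.urlopen`, and `validate` could be a schema check or a callback to the business service. As far as I know, an item written this way still has to be published through a separate release call before clients see it.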

4.1.3 Technology selection for business configuration

The biggest challenge is that business configuration is triggered by users, and concurrent requests are not easy to handle. There are two approaches. One is to build on Apollo's native code and solve the concurrency problem with database-based distributed locks. The other is to borrow our approach and develop your own business configuration center around core points such as one table per configuration and MQTT-based notification.
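For the first approach, one common way to build a lock on top of a shared database is a lock table whose primary key is the thing being locked: whoever inserts the row holds the lock. The sketch below uses SQLite as a stand-in for the shared database, and the table and function names are hypothetical. It is not Apollo code, and a production lock would also need an expiry or lease so that a crashed holder does not block everyone else.

```python
import sqlite3


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config_lock ("
        "lock_key TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    return conn


def try_lock(conn, lock_key, owner):
    """Take the lock by inserting its row; fail if the row already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO config_lock VALUES (?, ?)", (lock_key, owner)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def unlock(conn, lock_key, owner):
    """Release the lock; only the current owner's row is deleted."""
    with conn:
        conn.execute(
            "DELETE FROM config_lock WHERE lock_key = ? AND owner = ?",
            (lock_key, owner),
        )
```

A writer would call `try_lock` on the namespace it is about to change, perform the read-modify-write, and then call `unlock`; writers that fail to get the lock retry or report a conflict to the user.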

4.1.4 Deployment of business configuration

Whether to share one business configuration center depends on the amount of business configuration. In a single-cluster scenario, one business configuration center is clearly enough; if it is implemented with Apollo, it can even be co-located with the operation and maintenance configuration center. In a multi-cluster scenario, you need to decide between deploying one business configuration center or several. In our own practice, a cluster often supports tens of thousands of users, and we deploy one business configuration center per business cluster.


Origin my.oschina.net/u/4526289/blog/5515266