Best practices for Apache Pulsar on Tencent Cloud

Introduction

Pulsar Meetup Beijing 2023, hosted by StreamNative, concluded successfully on October 14, 2023. The event gathered industry experts from Tencent, Didi, Huawei, Zhaopin, RisingWave, and StreamNative for an in-depth discussion of best practices for running Pulsar in production environments, and to share the latest developments in the Pulsar community.

At this Meetup, Lin Yuqiang, senior engineer at Tencent Cloud, gave a talk titled "Best Practices of Apache Pulsar on Tencent Cloud". The following sections introduce the best practices of Apache Pulsar on Tencent Cloud in detail, covering system architecture, design ideas, addressing services, cross-cluster migration, and cross-regional disaster recovery.

Pulsar system architecture

The picture above shows a very common Pulsar deployment architecture. For ZooKeeper we use Tencent Cloud TSE ZK, which can connect to the internal configuration center and CI/CD platform to achieve standardized deployment. The rest, Bookie and Broker, is not much different from an open-source self-built deployment.

Here we introduce the three-piece diagnostic suite developed in-house at Tencent Cloud: Metrics, Trace, and log collection.

Metrics: self-developed collection in the Broker → reported to a Topic in the Monitor cluster → Pulsar Sink aggregation and preprocessing → Tencent Cloud monitoring. It is worth mentioning that we added a Pulsar Broker cluster (the Monitor cluster) as a buffer layer between the Brokers and cloud monitoring, and used Pulsar's own Sink to preprocess the Metrics data. The monitoring link is regional — for example, only one set is deployed in Guangzhou — while the number of Broker clusters serving external business is large and the total number of Topics is huge, so an intermediate layer is needed to relieve the pressure on cloud monitoring. This is a case of quantitative change leading to qualitative change.
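As a minimal sketch of the preprocessing idea (the metric schema and field names here are hypothetical illustrations, not Tencent Cloud's actual code), the Sink-side aggregation can be thought of as collapsing per-topic data points into per-namespace totals before forwarding, shrinking the cardinality the cloud-monitoring layer must absorb:

```python
from collections import defaultdict

# Hypothetical data point shape: {"namespace": str, "topic": str, "rate_in": float}
def preaggregate(points):
    """Collapse per-topic rates into per-namespace totals before they
    reach the downstream cloud-monitoring service."""
    totals = defaultdict(float)
    for p in points:
        totals[p["namespace"]] += p["rate_in"]
    return dict(totals)
```

With tens of thousands of Topics per cluster, an aggregation step like this is what turns an unmanageable stream of data points into something a regional monitoring link can handle.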

Trace: self-developed collection in the Broker → Trace log → Filebeat reporting → APM. This, too, is the result of evolution. The original architecture was Skywalking collection → in-memory queue reporting → APM. That architecture looks simple, but once the Broker's traffic reaches a certain scale, traces are generated faster than they are reported, so memory keeps growing until OOM. In the end we went back to basics, gave up Skywalking, and used disks to absorb short-term flood peaks. Of course, some businesses have sustained high traffic, so reporting can never catch up with trace generation. In that case there are only two options: for businesses that do not insist on message traces, turn off Trace reporting; for businesses with high traffic that still require traces, the only way is to increase the number of cores available to Filebeat by upgrading the configuration.

CLS log collection: Metrics and Trace are tightly coupled with the Broker's own logic, so we could only implement them as custom development inside the Broker. Log collection, however, is a generic capability, so we use Tencent Cloud CLS directly to achieve our goals: log collection and aggregation, classification by environment, region, and cluster, keyword monitoring, long-term log archiving, keyword alerting, and so on.

Lookup Service: the Lookup Service is a region-level module; only one set is deployed per region. It handles routing and addressing for all Broker clusters in the same region. This module will be introduced in detail later.

Design background and ideas

After introducing the Pulsar system architecture, let's look at the problems we face in the Tencent Cloud scenario and how we solve them.

● Full containerization: ZK, BK, Broker, and other peripheral facilities are all deployed on the container platform.

● Multiple environments and regions: this is routine business for a cloud provider. We have not only testing, pre-release, and production environments, but the production environment also spans multiple regions, such as Beijing, Shanghai, Guangzhou, Singapore, and Hong Kong, each with multiple clusters.

● Multiple availability zones (multiple data centers in the same city): the cloud natively provides multiple data centers per city, which makes same-city multi-AZ disaster recovery very convenient.

● Many clusters and a large Topic scale: a large number of clusters and Topics forces us to design a standardized deployment process. The monitoring link mentioned earlier is also a product of this background.

● Various product forms: each product form corresponds to a different deployment architecture and a different deployment relationship between tenants, Brokers, and Bookies.

● Virtual networks and diverse access methods: this is the multi-network-plane problem every cloud provider must face.

● Cluster hot migration: migrate a tenant from cluster 1 to cluster 2 while the client URL remains unchanged. This is a product capability cloud providers must offer — for example, upgrading a cluster from the standard edition to the professional edition, which corresponds to migrating the tenant and reallocating the underlying physical resources.

Containerization

Although Pulsar Broker is often called a cloud-native message queue, the Broker is in fact stateful at runtime — for example, in the ownership relationship between Topics and Brokers.

Therefore, when we were doing containerization, the overall idea was to make Pod and VM as equal as possible, so we made the following design:

● Fixed IP: The IP will not change randomly as the Pod is destroyed and rebuilt.

● Pod and Node network flattening: in the Tencent Cloud scenario, the Client is generally not running in the same container cluster as the Broker. Without flattening, the mapping between the container network (Overlay) and the underlay network would have to be considered during Lookup, which would make Lookup very complicated.

● Cloud disk: The data disk attached to the Pod is a cloud disk, which truly separates computing and storage.

● Coexistence with CVM: each Pod has an exclusive CVM with matching CPU and memory parameters, which guarantees physical isolation of the Pulsar runtime to the greatest extent. This is also a capability naturally provided by Tencent Cloud EKS (TKE Serverless). In addition, during the container migration process, Pod nodes and CVM nodes must coexist in the same cluster, so compatibility between the two also needs to be considered.

● Graceful shutdown: when a Pod is destroyed, we must ensure that Pulsar's shutdown logic is triggered; otherwise the destruction becomes highly visible to the Client. This is a point that needs attention due to the difference in CI/CD processes between container and CVM scenarios.

● Liveness probe tuning: there is an interesting point here. Since the various health-check mechanisms of k8s rely on timeouts, if the Broker cannot return a correct result within the agreed timeout during startup, k8s is prone to judging it as failed, causing the Broker to restart indefinitely. For example, if a Broker cluster has tens of thousands of Topics, Broker startup may take more than 90 seconds; if the liveness timeout is lower than this value, special attention is needed. This is of course a consequence of our large cluster scale. Different clusters may serve different business scenarios, so many similar parameters need corresponding tuning.
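As an illustrative sketch of the two points above (the numeric values are hypothetical and would need tuning per cluster; the health endpoint follows the common Pulsar broker convention of `/status.html` on port 8080), a startup probe with a generous failure budget protects slow startups, and a preStop hook triggers Pulsar's own shutdown logic:

```yaml
# Hypothetical Broker container spec fragment.
containers:
  - name: broker
    startupProbe:               # tolerate slow startup with many Topics
      httpGet:
        path: /status.html
        port: 8080
      periodSeconds: 10
      failureThreshold: 30      # up to 300 s before liveness takes over
    livenessProbe:
      httpGet:
        path: /status.html
        port: 8080
      timeoutSeconds: 5
      periodSeconds: 30
    lifecycle:
      preStop:                  # run Pulsar's graceful shutdown on Pod deletion
        exec:
          command: ["/bin/sh", "-c", "bin/pulsar-daemon stop broker"]
terminationGracePeriodSeconds: 120
```

Using a `startupProbe` rather than only a long `initialDelaySeconds` lets the liveness probe stay tight once the Broker is actually up.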

● Helm orchestration: Pulsar's deployment steps are actually quite complicated, including Bookie deployment, Bookie initialization, Broker deployment, Broker initialization, plus the required Secrets, ConfigMaps, network modules, and so on. To organize so many modules in an orderly manner, Helm is naturally the best choice.

Large Topic scale

A large number of Topics will cause the following problems:

● ZK has too much metadata and high load

● Startup becomes slow and K8s readiness is misjudged

● Broker restarts have a large blast radius

● Lookup performance deteriorates

● Monitoring collection and aggregation reach a qualitative tipping point

Various product forms

Tencent Cloud Pulsar provides a variety of product forms, corresponding to a diversity of underlying deployment architectures. These in turn determine the selling price of each instance and Tencent Cloud's own costs.

● Shared edition: both Bookies and Brokers are shared. Essentially, this is the same model as a company building one cluster for all its business teams to share. A shared-edition instance corresponds to one Pulsar tenant.

● Professional Edition: Both Bookie and Broker are exclusive, and the cluster has only one tenant, which exclusively owns all physical resources.

Access method

Tencent Cloud Pulsar provides a variety of network access methods. Network access refers to the network connectivity relationship between Broker and Client.

Intranet access

Intranet access is essentially the same as a conventional self-built deployment inside a company. Broker and Client are on the same intranet and fully interoperable; the IPs the Client connects to are the Brokers' original node IPs, without any network translation.

VPC access

A VPC is a virtual private network. Each user can create multiple VPCs, and each VPC can contain multiple subnets. For more details, see: Private Network Product Overview - Product Introduction - Document Center - Tencent Cloud.

Two VPCs normally cannot communicate with each other — such as 10.0.0.0/8 and 192.168.0.0/16 in the figure — unless VPC interconnection products such as Cloud Connect Network or peering connections are used, but that is beyond the scope of our discussion.

As shown in the figure, the Broker is deployed in 10.0.0.0/8 and the Client is in 192.168.0.0/16; the Client therefore cannot access the Broker.

In the cloud network scenario, the VPC provides a cloud virtual gateway (an internal-only component) to support interoperability between two VPCs, which we call cross-network-plane interoperability.

Of course, its essence is to do network mapping, such as:

● Broker1's IP in 10.0.0.0/8 is 10.2.0.1. Through the cloud virtual gateway we give Broker1 an IP in 192.168.0.0/16 of 192.168.1.1. A Client in 10.0.0.0/8 must then use 10.2.0.1 to access Broker1, while a Client in 192.168.0.0/16 must use 192.168.1.1.

● Since the Pulsar Client connection protocol is Lookup first, then direct connection (see the Lookup addressing sequence diagram below), this raises the requirements on the Broker's Lookup interface. The Broker needs to determine automatically:

1. When the Client comes from the 10.0.0.0/8 network, return 10.2.0.1

2. When the Client comes from the 192.168.0.0/16 network, return 192.168.1.1

This is the problem that the Pulsar service faces in cloud service scenarios.
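The decision logic above can be sketched as follows — a minimal illustration with hypothetical data structures, not the actual Broker or addressing-service code:

```python
import ipaddress

# Hypothetical per-plane advertised addresses for each Broker.
PLANE_ADDRESSES = {
    "broker1": {
        "10.0.0.0/8": "10.2.0.1",         # original intranet plane
        "192.168.0.0/16": "192.168.1.1",  # mapped address in the client's VPC
    },
}

def lookup_address(broker: str, client_ip: str) -> str:
    """Return the advertised address for the network plane the client is on."""
    ip = ipaddress.ip_address(client_ip)
    for cidr, advertised in PLANE_ADDRESSES[broker].items():
        if ip in ipaddress.ip_network(cidr):
            return advertised
    raise LookupError(f"no network plane matches client {client_ip}")
```

The key design point is that the same Broker is advertised under a different address per plane, and the Lookup response is chosen by the client's source network.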

Public network access

VPC access was introduced earlier. In fact, public network access and VPC access are similar.

The only difference is:

● VPC access: Broker and Client are on two different intranet network planes.

● Public network access: the Broker is deployed on the intranet, and the Client comes from the public network. It is easy to understand if the public network is simply regarded as a special VPC.

Addressing service

Next, let's introduce where the idea for Tencent Cloud's addressing service comes from, which is also where its value lies.

RocketMQ Architecture Reference

Let's take a look at RocketMQ's service. Its architecture is divided into NameServer and Broker.

● The NameServer stores the cluster's metadata: the ownership relationship between Topics and Brokers. This makes the affiliation between Topic and Broker dynamically configurable, and allows Topics to be scheduled between different Broker clusters.

● RocketMQ, like Pulsar, has an addressing process similar to Lookup, except that the RocketMQ Client addresses the NameServer while the Pulsar Client addresses the Broker. In this respect, the Pulsar Broker can be understood as the combination of the RocketMQ NameServer and the RocketMQ Broker.

Advantages and Disadvantages:

Pulsar's deployment architecture is simpler and does not require an extra NameServer service, but it also loses the ability to schedule Topics between physical clusters.

Although RocketMQ has one more module, the cluster scheduling capability it provides is a very important operation and maintenance capability.

This scheduling capability is exactly what a cloud service needs. With it, our cloud service gains operability and flexible resource scheduling over clusters.

Multi-network Lookup

As mentioned earlier, in the cloud service scenario we need to provide three network access scenarios for the Pulsar Broker — intranet, VPC, and public network — which the Broker's built-in lookup capability cannot cover. Using an addressing module brings the following benefits:

● Cloud-service concerns such as multi-network access converge into the addressing service core, while the Broker still provides only pure intranet service, better preserving the Broker's original capabilities and its alignment with open source.

● Access points (the conceptual name for the different network access methods) can be added or removed dynamically. For a Pulsar cluster, adding or removing an intranet, VPC, or public network access point requires no change to the Broker.

● Automatic, imperceptible scaling: scaling Brokers also requires corresponding changes to the network mappings in all of the cluster's access points. Thanks to the addressing module, this logic is all converged there, so Broker scaling requires no changes to existing nodes.

Lookup timing

The addressing module and addressing process have been introduced above. Here we focus on the addressing sequence of the Pulsar Client, to supplement the previous introduction and give a primer on the Pulsar Client's mechanics:

● Step 1: get the number of Topic partitions: Client → CLB → Broker. This is one-to-many: whichever Broker behind the CLB the request lands on can equally return the correct result.

● Step 2: single-partition Topic Lookup: perform a lookup for a single partition, asking for the current Owner Broker address of that Topic partition. This is also one-to-many; different Brokers provide equivalent peer service.

● Step 3: using the Owner Broker address returned in step 2, connect directly and then send and receive messages. This is one-to-one: the direct connection must reach exactly the right Broker in the cluster, such as Broker-2; connecting to any other Broker will fail.

These are the steps for Pulsar-Client to initialize a Producer/Consumer.
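The three steps above can be modeled with a toy sketch (the topic name, partition counts, and ownership data are invented purely for illustration):

```python
# Toy metadata standing in for what the Broker cluster would answer.
PARTITION_COUNT = {"persistent://tenant/ns/orders": 3}
OWNER = {
    ("persistent://tenant/ns/orders", 0): "broker-2",
    ("persistent://tenant/ns/orders", 1): "broker-1",
    ("persistent://tenant/ns/orders", 2): "broker-2",
}

def get_partitions(topic):
    # Step 1: any Broker behind the CLB can answer this equally.
    return PARTITION_COUNT[topic]

def lookup_owner(topic, partition):
    # Step 2: ask which Broker currently owns this single partition.
    return OWNER[(topic, partition)]

def bootstrap(topic):
    # Step 3: the client must direct-connect to the exact owner of each partition.
    return {p: lookup_owner(topic, p) for p in range(get_partitions(topic))}
```

Because steps 1 and 2 are answered equivalently by any Broker while step 3 must reach one specific Broker, a proxy inserted at steps 1 and 2 can redirect clients without touching the data path.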

Where is the entry point of our addressing module?

It is between the first and second steps, as shown in the figure. By adding a proxy layer at these two steps, we can rewrite the addressing results to schedule multi-network access and the affiliation between Topics and physical clusters, achieving our goals.

Inter-cluster scheduling

The picture above shows the Pulsar architecture after we added the addressing module. The whole architecture becomes somewhat similar to RocketMQ: there is a central metadata service that manages the relationship between Topic resources and physical computing resources.

Cross-cluster migration

With the groundwork laid — the addressing module and the architectural optimization — we now introduce a productized capability built on top of them: cross-cluster migration.

Cross-cluster migration specifically means migrating a tenant from physical cluster 1 to physical cluster 2, with both clusters serving simultaneously, online hot switching, and minimal client awareness.

As shown in the figure, the process is very simple: the addressing service stores the routing relationship between tenants and physical clusters, and changing the routing relationship effects the cluster migration.

Based on the Pulsar Lookup addressing protocol, the routing relationship switching granularity can be tenant granularity, Namespace granularity, Topic granularity, and partition granularity.

Switching method: route switching + Topic unload
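A sketch of how an addressing service might resolve routes at the granularities listed above — the table format and names are hypothetical, but the principle is that the most specific rule (partition > Topic > Namespace > tenant) wins:

```python
# Hypothetical routing table: keys from coarse (tenant) to fine (partition).
ROUTES = {
    "tenantA": "cluster-1",
    "tenantA/ns1": "cluster-2",
    "tenantA/ns1/orders-partition-3": "cluster-1",
}

def resolve(tenant, namespace, topic, partition=None):
    """Try the most specific key first, falling back to coarser granularities."""
    candidates = []
    if partition is not None:
        candidates.append(f"{tenant}/{namespace}/{topic}-partition-{partition}")
    candidates += [f"{tenant}/{namespace}/{topic}", f"{tenant}/{namespace}", tenant]
    for key in candidates:
        if key in ROUTES:
            return ROUTES[key]
    raise LookupError(f"no route for {tenant}/{namespace}/{topic}")
```

Migration then amounts to editing one routing entry and unloading the affected Topics so clients re-run Lookup.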

Mechanically speaking, this switching process is very simple. The difficulty lies in the preparatory work:

● Metadata migration: Tenants, Namespaces, Policies, Topics, Subscriptions, Tokens, etc.

● Bidirectional message synchronization: during the switch, both clusters serve simultaneously. To provide rollback capability, messages must be synchronized bidirectionally between the two clusters so that both hold the complete message set.

● Progress synchronization: The subscription consumption progress of each Topic is different and needs to be synchronized regularly to minimize repeated consumption during the switching process.

Message synchronization

Message synchronization is relatively simple: we use Pulsar's built-in GEO-Replicator function, so we will not go into detail here.

Progress synchronization

Here is a brief introduction to the mechanism of progress synchronization.

1. Message migration: during the message synchronization phase (GEO-Replicator), the message's ID in the source cluster is carried in the message header.

2. Scheduled synchronization: the consumption progress of the source cluster is periodically synchronized to the target cluster, using a compacted topic for synchronization and persistence.

3. Progress cache: the target cluster reads the consumption progress from the compacted topic and loads it into memory.

4. Delivery filtering: in the Dispatcher stage of Pulsar's delivery flow, entries that are about to be delivered but were already delivered according to the source cluster's progress are filtered out.
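The filtering decision in step 4 can be condensed into a small sketch (message IDs are simplified to `(ledger, entry)` tuples, and the entry shape is hypothetical):

```python
def filter_deliverable(entries, synced_position):
    """Drop entries whose source-cluster message ID is at or below the
    consumption position synchronized over from the source cluster."""
    return [e for e in entries if e["source_id"] > synced_position]

# A subscription that had reached (ledger 5, entry 10) in the source cluster
# should only receive strictly newer entries on the target cluster:
entries = [
    {"source_id": (5, 9), "payload": b"old"},
    {"source_id": (5, 10), "payload": b"old"},
    {"source_id": (5, 11), "payload": b"new"},
]
```

Since progress is only synchronized periodically, entries between the last synced position and the true source position may still be redelivered, which is why the goal is to minimize, not eliminate, duplicate consumption.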

Cross-regional disaster recovery

Cross-regional disaster recovery is also a productized capability developed after our improved architecture based on addressing services.

The essence is similar to cross-cluster migration, but due to cross-regional network latency issues, the focus is different:

As shown in the figure, suppose our Pulsar instance is in the Guangzhou region and the Client is also in Guangzhou, but due to force majeure all Pulsar instances in Guangzhou are down and cannot be restored quickly. We can temporarily switch the Client's traffic to a disaster recovery instance in Shanghai to ensure rapid recovery of the Client's online business.

● Only one cluster provides services at the same time.

● Short-term, temporary switching: because cross-regional latency is uncontrollable (especially for overseas regions), the disaster recovery mechanism must be temporary; once the Guangzhou region recovers, traffic should be switched back as soon as possible.

● Scheduled metadata synchronization: because we cannot predict when the Guangzhou cluster will go down, and this scenario is rarely exercised, this is a trade-off.

● Messages and progress are not synchronized: what we face is the entire Guangzhou cluster being down, so the data on the Guangzhou cluster's disks is temporarily unreadable. Another reason is that cross-regional latency is high, making the cost-effectiveness of real-time message synchronization too low.

● Switching: switching is based on DNS resolution, because the addressing services in Guangzhou and Shanghai are independent and isolated from each other. Any backlog in the Guangzhou cluster at that moment simply continues to accumulate; only when the Guangzhou region recovers can it be consumed again.

● Switchback: since the Shanghai cluster is only for emergency disaster recovery, traffic must be switched back once the Guangzhou cluster recovers. The most important part of the switchback is that messages accumulated in the Shanghai cluster need to be written back to Guangzhou. In this process, some duplicate consumption and out-of-order consumption (within the same partition) are inevitable.

This solution targets region-level disaster scenarios. Because each region is itself deployed across multiple availability zones (multiple data centers in the same city), the probability of an entire region being unavailable is low. The cross-regional disaster recovery product is therefore designed mainly for emergency use, rather than for hot switching with strict bidirectional synchronization as in cross-cluster migration.

Summary

We started from the overall architecture of Tencent Cloud Pulsar, introduced the problems faced in Tencent Cloud scenarios, presented the addressing module (Lookup Service), and described the optimizations it brings to Pulsar's deployment architecture. We then introduced the two core problems the Lookup Service solves — multi-network Lookup and inter-cluster scheduling — while keeping client access imperceptible and architectural intrusion low. Thanks to the addressing module, we could further productize its capabilities into cross-cluster migration and cross-regional disaster recovery, providing repair measures and operational capabilities for disaster scenarios at different levels.

In the future, we will continue to invest in disaster recovery capabilities, integration with the Pulsar ecosystem, storage optimization, and other areas, to offer Pulsar products with lower cost and higher stability.



Origin my.oschina.net/u/4587289/blog/10140390