Alibaba Cloud Function Compute Helps Upgrade the Architecture of AutoNavi's RTA Advertising Delivery System

Zhao Qingjie (Alibaba Cloud Function Compute), Lin Xueqing (Alibaba Cloud Function Compute), Du Lingling (AutoNavi), Wang Bicheng (AutoNavi)

Preface

The 2023 Spring Festival arrived after three years of the pandemic, and enthusiasm for travel in China was higher than ever: going home to see one's parents, finally taking that long-postponed trip. AutoNavi estimated that peak traffic during the 2023 Spring Festival would rise considerably compared with both the same period in 2022 and November 2022. Yet only shortly before, under the impact of the pandemic, system traffic had still been running at a relatively low level.

How to complete Spring Festival travel preparations in a short time, keep the system running smoothly under peak holiday traffic, and keep navigation and other travel-critical information services responsive became an urgent task for the engineering team. And how could the system stay stable while also reducing costs and improving efficiency in the face of such large traffic swings?

Over the past few years, AutoNavi has been firmly and continuously driving serverless adoption. After in-depth research and evaluation, it chose Alibaba Cloud Function Compute (FC) as its serverless computing platform, and significant progress has been made over the past year.

AutoNavi's early bet on serverless let it handle the uncertainty and strong recovery of the Spring Festival travel rush in a more agile and economical way: no need to worry about resource changes driven by traffic fluctuations, no need to provision large amounts of computing resources in advance for peak traffic, and no need to worry about whether resources would be sufficient. Costs dropped significantly, and R&D and operations efficiency improved markedly.

Building on these serverless results, the relevant AutoNavi businesses quickly completed their Spring Festival preparations, and the holiday assurance work was carried out successfully.

Let's look at a typical case: how FC helped AutoNavi upgrade the architecture of its RTA advertising delivery system over the past year.

Business background

What is RTA

RTA is a real-time advertising API that enables real-time ad optimization by leveraging the data and model capabilities of both the media platform and the advertiser. It is an interface technology, and also a strategy-oriented delivery capability.

The media platform asks AutoNavi, via its RTA interface, whether an ad should be placed; the RTA service queries AutoNavi's own audience (crowd) data and returns a placement decision. This allows the media platform to target ads more precisely.

Architecture & Problems of the Original System

(Figure: architecture of the original system.) The original system occupied a large number of servers and had a long dependency chain: every time capacity was expanded, the dependent services also had to be scaled accordingly, resulting in heavy resource consumption.

Technology Selection

Crowd hit function

At its core, the crowd hit function boils down to checking whether an element belongs to a set.

In industry, this kind of problem is typically solved with a bloom filter, which is essentially a set of hash functions combined with a bitmap; it offers high query efficiency with a small memory footprint. The extended (enterprise) edition of Redis provides a bf (bloom filter) capability. Because reads are done in Go while writes are done in Java, implementing the bloom filter on the Redis server side with this bf capability avoids library inconsistencies across languages and the hit discrepancies that different bloom filter implementations could introduce, thereby guaranteeing data consistency for callers in different languages.
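As a concrete illustration, here is a minimal sketch of the read-side membership check in Go. It uses the generic Do command of go-redis so no client-side bloom filter library is involved; the connection address and key name are placeholders, not the production values.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Connect to the Redis instance that hosts the server-side bloom filters.
	// The address and key name below are illustrative placeholders.
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	crowdKey := "rta:crowd:offline" // hypothetical bloom-filter key for one crowd
	deviceID := "example-device-id"

	// BF.EXISTS returns 1 if the element is probably in the set,
	// 0 if it is definitely not in the set.
	hit, err := rdb.Do(ctx, "BF.EXISTS", crowdKey, deviceID).Bool()
	if err != nil {
		log.Fatalf("query bloom filter: %v", err)
	}
	fmt.Printf("device %s hit crowd %s: %v\n", deviceID, crowdKey, hit)
}
```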

With Redis implementing the crowd hit function, the algorithm gateway can be removed, and considerable resources on the data platform side can be saved as well.

Data synchronization

The audience segmentation (crowd selection) platform currently supports four types of data updates: online, real-time, one-off offline, and periodic offline.

The current segmentation strategy is based on offline crowds. Online and real-time updates may be adopted in the future, but the crowds targeted by RTA advertising are generally large, the share that changes in real time is low, and the media platforms cache results, so real-time requirements are not strict; in practice, real-time or online crowds make little difference compared with offline crowds. For now, offline crowds are therefore recommended as the main means of segmentation. If stronger real-time behavior is required, periodic offline updates at an hourly granularity can be considered (real-time behavior ultimately depends on the UDF update frequency and trigger method). Taking all this into account, Redis is updated offline on a periodic schedule.
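The following is a minimal sketch, under assumed key names and sizing parameters, of such a periodic offline update: it rebuilds a crowd's bloom filter in a temporary key with BF.RESERVE and BF.MADD, then swaps it in with RENAME (in a cluster deployment the two keys would need a shared hash tag so they land in the same slot). The data-loading function is a placeholder for the real offline pipeline.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// syncCrowd rebuilds one crowd's bloom filter from an offline device-ID list.
// Key naming, error rate, and capacity are illustrative assumptions.
func syncCrowd(ctx context.Context, rdb *redis.Client, crowdKey string, deviceIDs []string) error {
	// Build in a temporary key, then atomically swap it in with RENAME,
	// so readers never see a half-built filter.
	tmpKey := crowdKey + ":building"
	rdb.Del(ctx, tmpKey)

	// BF.RESERVE key error_rate capacity: size the filter for the expected crowd.
	if err := rdb.Do(ctx, "BF.RESERVE", tmpKey, "0.001", len(deviceIDs)).Err(); err != nil {
		return err
	}

	// Insert device IDs in batches to keep individual commands small.
	const batch = 1000
	for start := 0; start < len(deviceIDs); start += batch {
		end := start + batch
		if end > len(deviceIDs) {
			end = len(deviceIDs)
		}
		args := []interface{}{"BF.MADD", tmpKey}
		for _, id := range deviceIDs[start:end] {
			args = append(args, id)
		}
		if err := rdb.Do(ctx, args...).Err(); err != nil {
			return err
		}
	}
	return rdb.Rename(ctx, tmpKey, crowdKey).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

	// Run on an hourly schedule, matching the periodic offline update mode.
	for range time.Tick(time.Hour) {
		ids := loadOfflineCrowd() // placeholder for reading the offline crowd output
		if err := syncCrowd(ctx, rdb, "rta:crowd:offline", ids); err != nil {
			log.Printf("sync crowd: %v", err)
		}
	}
}

func loadOfflineCrowd() []string { return []string{"device-a", "device-b"} }
```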

Serverless

Why Serverless

(Figure: serverful vs. serverless.) By redrawing the boundary between application and platform, serverless lets businesses focus on their own business logic, making it possible for anyone to quickly build stable, secure, elastic, and scalable distributed applications.

How to achieve serverless

Under the new technology selection, the engine service needs to access Redis, so the question becomes how to make a system with high-frequency storage access serverless.

Serverless is generally understood as FaaS + BaaS. FaaS (Function as a Service) typically hosts the various backend microservices; BaaS (Backend as a Service) refers to backend services that are not suited to FaaS, such as storage services.

A serverless architecture places high demands on cloud storage: in scalability, latency, and IOPS, the storage layer must scale in and out automatically at, or close to, the same pace as the application.

Alibaba Cloud provides the Redis Enterprise Edition service. Its cluster architecture offers multiple instance specifications, with total bandwidth up to 2 GB/s and up to 60 million QPS, and allows the instance architecture and specification to be adjusted to meet different performance and capacity requirements, with transparent, no-downtime scale-out and scale-in. This meets the storage requirements of the serverless engine service.

FaaS is currently the most common choice for serverless backend microservices. Alibaba Cloud Function Compute has been recognized by Forrester as a global leader among function-as-a-service products, and it has accumulated rich serverless experience both on the public cloud and within Alibaba Group, making it a suitable choice.

High performance requirements

As a system serving external media platforms, the RTA advertising delivery system handles high traffic with tight latency requirements, a typical high-performance scenario. Clients in such scenarios usually set very short timeouts, and once the timeout is exceeded the call fails. Under the serverless architecture, request traffic first enters the FC system and is then forwarded to a function instance for processing. Given FC's multi-tenancy and the high traffic involved, the system overhead of request processing (excluding the function's own execution time) must be kept very low, both on average and at P99, to meet the request success rate SLA.

Implementation plan

System architecture

(Figure: new RTA system architecture.) In the new architecture, after the mid-platform generates a crowd, it calls Redis commands such as BF.INSERT to build the bloom filter. When the engine side receives a device ID, it uses Redis's BF.EXISTS command to determine whether the device belongs to the corresponding crowd.
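To make the engine side concrete, here is a minimal sketch of what a placement-decision function could look like when deployed on FC with a custom runtime, i.e. a plain HTTP server listening on FC's default port 9000. The request/response shape, Redis address, and key naming are assumptions for illustration, not AutoNavi's actual interface.

```go
package main

import (
	"log"
	"net/http"

	"github.com/redis/go-redis/v9"
)

// Shared Redis client; each function instance reuses its connections across requests.
var rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address

// decide answers a media platform's RTA query: should an ad be shown to this device?
func decide(w http.ResponseWriter, r *http.Request) {
	deviceID := r.URL.Query().Get("device_id")
	crowdKey := "rta:crowd:offline" // hypothetical crowd bloom-filter key

	hit, err := rdb.Do(r.Context(), "BF.EXISTS", crowdKey, deviceID).Bool()
	if err != nil {
		http.Error(w, "bloom filter query failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	if hit {
		w.Write([]byte(`{"bid": true}`))
	} else {
		w.Write([]byte(`{"bid": false}`))
	}
}

func main() {
	http.HandleFunc("/", decide)
	// FC's custom runtime forwards requests to an HTTP server inside the
	// instance; 9000 is the default listening port.
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```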

Features:

  1. Removes the gateway, shortening the call chain
  2. Adds a cache layer, decoupling the offline and online systems and improving performance
  3. Compresses data to reduce memory usage
  4. Makes the system serverless, delivering real-time elasticity and operations-free maintenance, and accelerating application iteration

Request scheduling

As mentioned earlier, the AutoNavi RTA advertising delivery system handles high traffic with tight latency requirements, a typical high-performance scenario. FC, in turn, is a typical multi-tenant system: a cluster hosts not only the AutoNavi RTA advertising functions but also many functions from other businesses. This places very high demands on FC's request scheduling:
  • A single function's QPS has no upper limit, while the large number of long-tail functions must not consume resources when idle
  • The scheduling service must be highly available, with single-point failures having no impact on the service
  • The system overhead of request processing must stay below 2 ms on average and below 10 ms at P99

Let's look at how FC achieves this. (Figure: FC request scheduling.) To deliver real-time elasticity, when a function request arrives at a Function Compute front-end machine, the front end asks the scheduling node (Partition Worker) for an instance to handle the request and forwards the request to it. When the scheduling node receives the request, it picks an available instance according to the load-balancing strategy and returns it to the front end; if no instance is available, it creates one on the fly, which can take hundreds of milliseconds.

  • To ensure high availability and horizontal scalability, the scheduling nodes use a partitioned architecture
  • Requests from the same user/function are mapped to contiguous shards
  • A single function's requests can span multiple shards and scale out horizontally
  • The scheduling node (Partition Worker) reports shard and node status to the shard manager (Partition Master) via heartbeat
  • The Partition Master balances load by moving, splitting, and merging shards
  • The scheduler handles 1 million functions, with a single function peaking at 200,000 TPS and scheduling latency under 1 ms
  • If any node fails, requests are routed to other Partition Workers with no impact on availability

A request, then, must pass through the front-end machine and the scheduling node before being forwarded to a specific function instance. The system overhead of request processing therefore includes the front end's processing time, the scheduling node's processing time, the communication time between the front end and the scheduling node, and the communication time between the front end and the function instance. The scheduling system has undergone extensive targeted optimization so that, even under extremely heavy traffic, this overhead stays below 2 ms on average and below 10 ms at P99.
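As a purely conceptual illustration of the acquire-or-create decision described above (this is a sketch of the idea, not FC's actual scheduler code), the core of a Partition Worker's job can be pictured like this:

```go
package scheduler

import "context"

// Instance represents a warm function instance. This is a conceptual model
// of the flow described above, not FC's real implementation.
type Instance struct{ ID string }

// PartitionWorker tracks the idle instances of one function shard.
type PartitionWorker struct {
	idle chan *Instance
}

// Acquire returns a warm instance when one is available (the sub-millisecond
// path); otherwise it falls back to creating a new instance, which can take
// hundreds of milliseconds (a cold start).
func (w *PartitionWorker) Acquire(ctx context.Context, create func(context.Context) (*Instance, error)) (*Instance, error) {
	select {
	case inst := <-w.idle:
		return inst, nil
	default:
		return create(ctx)
	}
}

// Release puts an instance back into the idle pool so later requests can reuse it.
func (w *PartitionWorker) Release(inst *Instance) {
	select {
	case w.idle <- inst:
	default: // pool full: the platform would reclaim the surplus instance
	}
}
```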

Resource delivery

In the serverless model, businesses no longer need to manage resources; the platform is responsible for resource management and scheduling. When business traffic rises, the platform must be able to deliver the required computing resources quickly and reliably; when traffic falls, it must automatically release the idle resources.

To guarantee resource delivery for functions such as the AutoNavi RTA advertising function, FC continuously optimizes its resource management implementation.

New serverless base: X-Dragon (Shenlong) bare metal + secure containers

Initially, FC delivered function instances as Docker containers. Because Docker carries security risks such as container escape and weak storage isolation, each host could only run functions belonging to a single tenant. Since FC hosts a large number of long-tail functions whose instance specifications are often small (for example, 128 MB of memory and 0.1 CPU cores), this constraint limited improvements in resource utilization.

To solve this, FC worked with the relevant teams to fully upgrade its resource base to X-Dragon bare metal plus secure containers. Leveraging the virtualization efficiency of X-Dragon's hardware-software co-design and the isolation guarantees of secure containers, functions from different tenants can now be co-located at high density, greatly improving resource utilization.

Independent resource management and control

Because the Pod provisioning throughput of a K8s cluster struggles to meet the serverless demand of creating tens of thousands of instances per minute, FC works with the relevant teams to further subdivide and independently manage the computing resources within each Pod, achieving both high-density deployment and high-frequency instance creation.

Millisecond-level resource delivery speed

Whereas K8s typically delivers resources on a minute-level timescale, the serverless scenario requires delivery at the second or even millisecond level. To reconcile the slow startup of K8s infrastructure with Function Compute's need for extreme elasticity, FC employs Pod resource pooling, image acceleration, image pre-warming, compute instance recycling, and other techniques to ensure extremely fast resource delivery.

High availability

For high availability, FC's resources in each region are distributed across multiple K8s clusters rather than one, so that if any cluster has a problem, traffic automatically fails over to a healthy cluster. Each cluster contains several types of resource pools: dedicated, mixed, and preemptible. FC schedules across them according to the characteristics of the workload, further reducing cost. (Figure: FC resource pools.)

Delivery SLA

In terms of total delivery volume, FC has already delivered tens of thousands of instances for a single function in production, and because FC can dynamically replenish its resource pools, the number of instances a single function can theoretically be given is far higher. In terms of delivery speed, FC can create an instance in around 100 milliseconds. During traffic bursts, FC controls the pace of resource delivery along two dimensions:

  1. Burst instance quota: the number of instances that can be created immediately (default 300);
  2. Instance growth rate: the number of additional instances allowed per minute once the burst quota is exceeded (default 300 per minute).

The above parameters are adjustable.
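As a rough model of how these two knobs bound scale-out (an illustrative approximation, not FC's exact flow-control algorithm):

```go
package main

import (
	"fmt"
	"time"
)

// approxInstanceCeiling estimates how many instances a function may have at
// time t after a burst begins: the burst quota immediately, plus the growth
// rate for each full minute elapsed. Defaults mirror the values above.
func approxInstanceCeiling(burst, growthPerMinute int, elapsed time.Duration) int {
	return burst + growthPerMinute*int(elapsed.Minutes())
}

func main() {
	// With the defaults (300 burst, 300/min), roughly 300 instances can start
	// immediately and about 900 may be running two minutes into the burst.
	fmt.Println(approxInstanceCeiling(300, 300, 0))             // 300
	fmt.Println(approxInstanceCeiling(300, 300, 2*time.Minute)) // 900
}
```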

The figure below shows FC's flow-control behavior when the number of calls surges rapidly. (Figure: FC flow control under a rapid traffic increase.)

Multi-datacenter deployment

The system is deployed across three units (regions) so that external media platforms can be served from the nearest one, reducing network latency.

(Figure: three-unit deployment.)

Business results

After the architecture upgrade, machine resources amounting to thousands of CPU cores were saved, the system went serverless, the call chain was shortened, and the system became more elastic, robust, and easier to maintain, delivering good business results.

Outlook

AutoNavi made great progress with serverless in 2022, but this is not the end; it is just the beginning. Going forward, FC will continue to work with AutoNavi to drive serverless adoption across the board, helping AutoNavi apply serverless in more scenarios and fully realize the technical dividends it brings.

For more content, follow the Serverless WeChat official account (ID: serverlessdevs), which gathers comprehensive serverless technical content and regularly hosts serverless events, live streams, and user best practices.
