Advertising case | 1 billion users, queries under 10 s: the right way to build an advertising system on OLAP

As the traffic dividend fades, more and more advertising companies and practitioners are exploring refined marketing as a replacement for the earlier approach of broad, full-traffic ad bombardment. Refined marketing means picking the most promising target audience out of hundreds of millions of people, which poses a considerable technical challenge for the data warehouse that provides the underlying engine support.

Background

Crowd selection analysis is the core function of a customer profile platform (CDP). Analysts combine various tags to select the most suitable group of people and then push advertisements to that group for precise delivery. Because the size of the result set differs from one tag combination to another, analysts usually have to adjust the logic several times before they arrive at the "best" crowd package for a single ad campaign. Under such high-frequency operations, the profile platform usually runs into two problems:

  • First, these queries are ad hoc and the number of possible tag combinations is huge, so offline precomputation cannot provide the required flexibility.

  • Second, because this is a real-time scenario, query performance is critical. Such queries typically take minutes, which is too slow for analysts.

In this article, we share a solution for crowd selection queries in real-time OLAP analysis scenarios and show how to use ByteHouse to speed them up. In terms of performance, on a test data set of 1 billion users, ByteHouse's crowd query P99 latency is under 10 s.

Scenario model

A data architecture that supports crowd selection looks roughly as follows:

[Figure: data pipeline supporting crowd selection (a50d561e686d9cf25ddcd212b379ae21.png)]

User registration information enters the data lake through the user stream, and user behavior information enters through the event stream. Tag production tasks then assign tags to each user.

Because ad hoc queries must be both real-time and flexible, the transformed data is usually written into an OLAP engine such as ByteHouse, which provides flexible, real-time SQL queries. Analysts typically build their tag logic visually in the profile platform's application interface; the platform then converts that logic into SQL and sends it to ByteHouse for execution.

From the perspective of the data model, most data stored in a data warehouse or data lake is keyed by user ID, with tags as attributes (id-tag), for example:

user_id  sex  age  tags
10001    F    20   []
10002    M    22   [day_1,day_2]
10003    F    23   [tag_1]
10004    M    24   [tag_2]
10005    F    25   [day_1,day_2]

For crowd analysis, a tag-keyed layout like the following is more appropriate:

tags   active_users
tag_1  [10002,10003,10005]
tag_2  [10002,10005]

Data is usually stored with the user as the primary entity. This produces a very large number of rows and many fields that crowd selection does not need. When an analyst filters a crowd by combining tags, almost every row has to be scanned, so the cost grows with both the number of tags and the number of users.

When the data is organized by tag instead, there are two significant changes:

  • First, only the dimensions relevant to crowd selection are retained; other information such as sex and age is removed.

  • Second, active_users stores all user IDs for a tag as an array. An important benefit is that both the number of rows and the data size shrink.

Under this model, selecting users by tag combination becomes a matter of set operations (intersection, union, and complement), and performance improves significantly compared with the first model.
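The move from a user-keyed layout to a tag-keyed one can be sketched in Python. This is a minimal illustration with plain sets, not ByteHouse code; the sample rows and the helper name invert_to_tag_index are hypothetical.

```python
from collections import defaultdict

# Hypothetical user-keyed rows: user_id -> tags (illustrative data only).
user_rows = {
    10001: [],
    10002: ["tag_1", "tag_2"],
    10003: ["tag_1"],
    10004: ["tag_3"],
    10005: ["tag_1", "tag_2"],
}


def invert_to_tag_index(rows):
    """Turn user_id -> tags into tag -> set of user_ids."""
    index = defaultdict(set)
    for user_id, tags in rows.items():
        for tag in tags:
            index[tag].add(user_id)
    return dict(index)


tag_index = invert_to_tag_index(user_rows)

# Crowd selection is now plain set algebra on one entry per tag.
both = tag_index["tag_1"] & tag_index["tag_2"]        # intersection
either = tag_index["tag_1"] | tag_index["tag_2"]      # union
only_tag_1 = tag_index["tag_1"] - tag_index["tag_2"]  # relative complement

print(sorted(both), sorted(either), sorted(only_tag_1))
```

Note that users without any tag (10001 here) simply do not appear in the tag-keyed index, which is part of why it is smaller.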

ByteHouse Bitmap type

The second storage model can use the following ByteHouse SQL to create tables:

CREATE TABLE id_tags (
    tags            String,
    active_users    Array(UInt64)
) Engine = CnchMergeTree() order by tags

A crowd selection query, such as counting the users who have both tag_1 and tag_2, can be written with the following SQL:

-- Each scalar subquery fetches the user array for one tag.
WITH
    (SELECT active_users
        FROM id_tags
        WHERE tags = 'tag_1') AS tag_1_user,
    (SELECT active_users
        FROM id_tags
        WHERE tags = 'tag_2') AS tag_2_user
SELECT length(arrayIntersect(tag_1_user, tag_2_user))

Although this model simplifies some operations, every tag in the selection needs its own subquery (the WITH part). This wastes work on repeated table scans, and the cost grows linearly with the number of tags.

To solve this problem, ByteHouse has a built-in BitMap type, which uses individual bits to indicate whether a given user ID belongs to a tag.

Following the above example, after using BitMap, the table creation statement is changed to:

CREATE TABLE id_tags (
    tags            String,
    active_users    BitMap64
) Engine = CnchMergeTree() order by tags

Note here that we only changed the type of active_users from Array to BitMap64, and the rest remained unchanged.

For the same query of "find the number of people who satisfy both tag_1 and tag_2", use the following query:

SELECT bitmapCount('tag_1&tag_2')
FROM id_tags

Because we use bits instead of the original array, the query can be optimized to complete in a single table scan.

Based on ByteDance's internal online scenarios, we observed that the above query optimization can improve performance by 10 to 50 times in multi-label scenarios.
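To make the idea of "bits instead of an array" concrete, here is a minimal Python sketch that uses an arbitrary-precision integer as a bitset. It only illustrates the principle and says nothing about how ByteHouse implements BitMap64; the helper names and sample IDs are hypothetical.

```python
def to_bitmap(user_ids):
    """Pack small non-negative integer IDs into one int used as a bitset."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid  # bit number uid is set when the user has the tag
    return bits


def cardinality(bits):
    """Count the set bits, i.e. the number of users in the bitmap."""
    return bin(bits).count("1")


def members(bits):
    """List the IDs whose bits are set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical tag membership.
tag_1 = to_bitmap([2, 4, 6, 8])
tag_2 = to_bitmap([1, 3, 4, 8])

both = tag_1 & tag_2    # intersection is a single bitwise AND
either = tag_1 | tag_2  # union is a single bitwise OR

print(cardinality(both), members(both))
```

With this representation, combining tags costs one bitwise operation per tag rather than an element-by-element array comparison.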

Data import

There is no significant difference between writing data into a bitmap table and writing into a normal table. For example, small batches can be inserted as follows:

INSERT INTO TABLE id_tags values ('tag_1', [2,4,6]),('tag_2', [1,3,5])

Because active_users in id_tags is defined as BitMap64, the array values [2,4,6] and [1,3,5] are automatically converted to BitMap64. Subsequent computation and storage use the BitMap64 type.

When importing large batches of files, we can use the import service provided by ByteHouse. At present, both offline (TOS, LASFS) and real-time (Kafka) import modes support BitMap data. Streaming writes (for example, direct writes from Flink) can be issued as INSERT statements through the JDBC interface.

Related functions

In addition to set operations (intersection, union, and complement) on BitMap data, ByteHouse has many built-in column functions. For example, bitmapColumnAnd takes a bitmap column and performs an AND operation across all bitmaps in the column, and bitmapColumnCardinality returns the number of elements in all bitmaps in a column. For details, please refer to the official documentation.

Introduction to the Principle of BitEngine

BitMap structure analysis

Assume a user ID is represented by a 32-bit unsigned integer. A conventional bitmap then needs 2^32 bits, or 512 MB, of space. If every tag needs its own 512 MB, total storage becomes enormous as the number of tags grows. In practice, very few businesses have anywhere near 2^32 (about 4 billion) users, so the distribution of user IDs in real scenarios is very sparse.
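The 512 MB figure follows directly from the arithmetic, as this short Python check shows. The tag count of 1,000 is a hypothetical number chosen only to show how quickly uncompressed bitmaps add up.

```python
BITS_PER_BYTE = 8
MIB = 1024 * 1024


def plain_bitmap_bytes(id_bits):
    """Bytes needed for an uncompressed bitmap covering every possible ID."""
    return (1 << id_bits) // BITS_PER_BYTE


per_tag_mib = plain_bitmap_bytes(32) // MIB  # one bit per possible 32-bit ID

hypothetical_tag_count = 1000                # assumption, for illustration only
total_gib = per_tag_mib * hypothetical_tag_count / 1024

print(per_tag_mib, "MiB per tag;", total_gib, "GiB for", hypothetical_tag_count, "tags")
```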

Based on this property, we can use a Roaring bitmap to compress the space further, as shown in the figure below:

[Figure: Roaring bitmap structure (57ba51ac41ce7effe4ed27178a8a70d2.png)]

In a 32-bit Roaring bitmap, the high 16 bits of each value select a bucket. If no value falls into a bucket's range, that bucket is not created. The low 16 bits are stored in the bucket's container. There are two types of container:

  • Array container: used when a bucket holds few values (generally under 8 KB, that is, up to 4096 16-bit values); it saves space for sparse data

  • Bitmap container: suited to dense data, where a fixed-size bit array takes less space than a list of values

During computation, only the buckets that actually exist need to be processed. To extend this to a 64-bit Roaring bitmap, we can use a map<uint32_t, Roaring>: the high 32 bits serve as the map key, and the low 32 bits are stored in the corresponding 32-bit Roaring bitmap.
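The bucketing scheme can be sketched as a toy Python class. This is a simplified illustration, not the real Roaring library or ByteHouse's implementation: the class name TinyRoaring and the helper split_64 are hypothetical, the array container is a sorted list, the bitmap container is a Python int, and the 4096-value threshold is the one commonly used by Roaring implementations.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # values an array container may hold before converting


class TinyRoaring:
    """Toy 32-bit Roaring-style set: high 16 bits pick a bucket,
    low 16 bits are stored in that bucket's container."""

    def __init__(self):
        # bucket key -> sorted list (array container) or int (bitmap container)
        self.containers = {}

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            self.containers[high] = [low]  # bucket is created on first use
        elif isinstance(container, list):
            pos = bisect_left(container, low)
            if pos == len(container) or container[pos] != low:
                container.insert(pos, low)
            if len(container) > ARRAY_LIMIT:  # dense enough: switch to bits
                bits = 0
                for v in container:
                    bits |= 1 << v
                self.containers[high] = bits
        else:
            self.containers[high] = container | (1 << low)

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, list):
            pos = bisect_left(container, low)
            return pos < len(container) and container[pos] == low
        return bool((container >> low) & 1)

    def container_kinds(self):
        return {key: "array" if isinstance(c, list) else "bitmap"
                for key, c in self.containers.items()}


def split_64(value):
    """Split a 64-bit ID into (map key, 32-bit value), mirroring the
    map<uint32_t, Roaring> layout."""
    return value >> 32, value & 0xFFFFFFFF


rb = TinyRoaring()
for v in range(5000):       # dense run in bucket 0
    rb.add(v)
rb.add((3 << 16) + 7)       # two sparse values in bucket 3
rb.add((3 << 16) + 42)
kinds = rb.container_kinds()
print(kinds)
```

Buckets 1 and 2 are never created here, which is where the space saving for sparse IDs comes from.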

Dictionary optimization

In most scenarios, the Roaring bitmap described above already performs well. In ByteDance's real-world scenarios, however, we found that because user_id values are not generated contiguously, a very high proportion of containers are array containers. Set operations on two sparse crowds then turn into computations over pairs of sorted arrays, which are noticeably slower than plain bitwise operations.

Therefore, in ByteHouse, we encode the data through a dictionary so that the IDs become denser.

The way to enable dictionary optimization is as follows:

CREATE TABLE id_tags (
    tags            String,
    active_users    BitMap64 BitEngineEncode
) Engine = CnchMergeTree() order by tags

In essence, the dictionary service is a bidirectional mapping: you can look up the value by key and the key by value, where the key is the original ID and the value is the encoded ID. Once encoding is enabled, ByteHouse relies on a dictionary file. By default, ByteHouse maintains this dictionary file internally.

When the underlying table is updated, the internal dictionary file is updated asynchronously as well. ByteHouse also lets users maintain external dictionaries, which we will not cover here.
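The effect of dictionary encoding can be illustrated with a small Python sketch. The article does not describe ByteHouse's dictionary format, so the following is only an assumption-laden illustration: the class IdDictionary, the choice to assign codes in arrival order, and the sample IDs are all hypothetical.

```python
class IdDictionary:
    """Bidirectional mapping between sparse original IDs and dense codes."""

    def __init__(self):
        self._encode = {}  # original ID -> code
        self._decode = []  # code -> original ID

    def encode(self, original_id):
        code = self._encode.get(original_id)
        if code is None:
            code = len(self._decode)  # next unused consecutive integer
            self._encode[original_id] = code
            self._decode.append(original_id)
        return code

    def decode(self, code):
        return self._decode[code]


# Hypothetical sparse user IDs, far apart from one another.
sparse_ids = [9_000_000_123, 17, 4_294_967_000, 65_536_001]

dictionary = IdDictionary()
codes = [dictionary.encode(uid) for uid in sparse_ids]

# Number of distinct 16-bit-wide buckets touched before and after encoding.
buckets_before = {uid >> 16 for uid in sparse_ids}
buckets_after = {code >> 16 for code in codes}

print(codes, len(buckets_before), len(buckets_after))
```

Before encoding, each ID lands in its own bucket as a one-element array container; after encoding, the codes are consecutive and share one bucket, which is what allows dense bitmap containers and fast bitwise operations. Results are translated back to original IDs with decode.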

Summary

Crowd analysis is a basic function of a profile platform. This article introduced how to use ByteHouse's built-in BitMap type to support real-time profile queries and analysis. ByteHouse Cloud Data Warehouse and Enterprise Edition are now available on Volcano Engine. Going forward, Volcano Engine will continue to bring best practices from ByteDance and from external customers to users through ByteHouse, helping them build interactive big data analysis platforms that cope with complex, changing business needs and rapidly growing data.

Origin blog.csdn.net/ByteDanceTech/article/details/132242022