Applying StarRocks Bitmap to user segmentation

Background

      The intelligent operation platform provides brands with a set of operational tools, including user profiles, user segmentation, data reports, email services, App Push, App pop-ups, App Banner, and online questionnaires, to help brands improve users' experience with IoT devices, increase App user stickiness, and increase product sales.
      In the two years since its release, the platform has served many KA (key account) customers at home and abroad. As the customer base has grown, its functions, services, and customer demands have become increasingly diverse and complex, and the platform has gradually evolved from a marketing tool into a platform product. The challenges facing the team have grown accordingly. The basic services originally built on Spark + ClickHouse became increasingly overloaded: costs rose while customer experience and platform capabilities kept deteriorating, and the losses far outweighed the gains.
      Given these business and technical pain points, we studied competing products and emerging technologies in the industry, gradually explored the StarRocks-based bitmap solution in depth, and officially put it into use at the end of this year.

Introduction to StarRocks & Bitmap

      StarRocks is a distributed, column-oriented storage and analytics system designed to handle large-scale data sets.
      Bitmap is a compressed data structure that StarRocks uses to store and process large-scale data efficiently.
      In StarRocks, Bitmap is often used for filtering and aggregating multi-dimensional data. A Bitmap can be understood as a binary vector in which each bit position corresponds to one data record (for example, one user ID), and the bit is set when that record has a given attribute. Filtering and combining data then reduces to set operations on bitmaps (union, intersection, XOR), which are carried out with fast bitwise operations. Bitmap compression also reduces storage usage and improves query performance.
      Typical scenarios include fast filtering of large-scale data, data pre-aggregation, and multi-dimensional analysis, where Bitmap can greatly improve the efficiency and performance of data processing.
      To sum up, Bitmap in StarRocks is a compressed data structure for efficient storage and processing of large-scale data. It provides high-performance filtering and aggregation through bitwise operations and is widely used for multi-dimensional data processing and analysis in StarRocks.
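The set operations described above can be tried directly on literal bitmaps. The following is a minimal sketch, assuming a StarRocks session; the two audiences and their user IDs are made up for illustration. `bitmap_from_string`, `bitmap_and`, `bitmap_or`, `bitmap_xor`, `bitmap_count`, and `bitmap_to_string` are built-in StarRocks functions.

```sql
-- Two hypothetical audiences written as comma-separated user IDs
-- (a = users who opened the App, b = users who bought a device).
SELECT
    bitmap_to_string(bitmap_and(a, b)) AS in_both,      -- intersection: users 2 and 3
    bitmap_to_string(bitmap_or(a, b))  AS in_either,    -- union: users 1, 2, 3 and 4
    bitmap_to_string(bitmap_xor(a, b)) AS in_only_one,  -- XOR: users 1 and 4
    bitmap_count(bitmap_and(a, b))     AS both_cnt      -- size of the intersection
FROM (
    SELECT bitmap_from_string('1,2,3') AS a,
           bitmap_from_string('2,3,4') AS b
) t;
```

Each function takes bitmaps and returns a bitmap, so the calls can be nested to express a compound rule before a single `bitmap_count` at the end.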

Advantages of Bitmap for user segmentation

      Bitmap is generally recommended for user segmentation because it solves several data processing problems effectively and has the following advantages:
      1. Saves storage space: Bitmap uses one bit to represent whether an element belongs to a set. In a large-scale data set, marking whether a user meets a certain condition does not require storing the actual data content, which saves storage space.
      2. Fast filtering: Bitmap can test membership in constant time (O(1)), so it can quickly determine whether a user meets a certain condition, such as belonging to a specific age group or geographical area.
      3. Fast intersection and union: Bitmap can efficiently perform intersection and union operations, that is, logical operations on two or more Bitmaps that yield the set of users meeting the overall conditions. This is very useful for scenarios such as inverted indexes and merging of data collections.
      4. Simple and efficient: the Bitmap algorithm is simple to implement and offers good computing performance and query speed.
      However, Bitmap is not free in every case. It works best when user IDs are compact integers, so that compression is effective. If the IDs are scattered over a very large range, Bitmap's storage and calculation overhead may become excessive. Therefore, when applying Bitmap, evaluate the data and choose the appropriate algorithm and data structure for the specific situation.
      In summary, Bitmap is recommended for user segmentation because of its advantages in storage and computing efficiency, provided the user IDs are mapped to compact integers.
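The membership check from point 2 can be sketched as follows, assuming a StarRocks session. The bitmap contents and the alias `age_25_34` are hypothetical; `bitmap_contains` and `bitmap_has_any` are built-in StarRocks functions.

```sql
-- Hypothetical bitmap of users in one age group (IDs are made up).
SELECT
    bitmap_contains(age_25_34, 2)                        AS user_2_in_group,  -- true: 2 is in the bitmap
    bitmap_contains(age_25_34, 9)                        AS user_9_in_group,  -- false: 9 is not
    bitmap_has_any(age_25_34, bitmap_from_string('3,9')) AS any_overlap       -- true: user 3 is shared
FROM (
    SELECT bitmap_from_string('1,2,3') AS age_25_34
) t;
```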

Design

      The architecture and product selection have also gone through a major iteration, and the platform is currently migrating from the ClickHouse solution to StarRocks.
[Figure: architecture diagram; image not available]
      Behavior log: data comes from Web, App, and server-side event-tracking logs, which are reported to Kafka and collected into the offline data warehouse through Pipeline + Transform.
      User profile: mainly basic information collected from each client and basic statistics computed from business data.
Solution comparison:
      1. In the early ClickHouse stage we also investigated a Bitmap solution. Generating bitmaps in Spark and writing them to ClickHouse consumed a lot of CPU and memory, and under tight cost control it was impossible to convert data for billions of users. At that time we could only cover basic functions with large wide tables, relying on ClickHouse's efficient query performance. After trimming the business scope and comparing costs (resource, development, learning, and maintenance costs), we decided to replace ClickHouse with StarRocks and perform the Bitmap conversion on the StarRocks side.
      2. Queries against large wide tables also consume a lot of server resources. ClickHouse in particular drives the CPU to nearly full load to gain speed, so under high concurrency and high QPS the servers were extremely unstable and could not support scenarios that need segments to be queried and used immediately. We therefore had to convert segment (crowd package) rules into HiveSQL, produce offline segments every day through looped batch jobs in the offline data warehouse, and synchronize them to ES for the platform to use. This made the business heavily dependent on the offline data warehouse. Once the number of segments grew into the thousands, thousands of SQL tasks were needed every day, which occupied a large share of the offline cluster's resources, made timeliness harder to guarantee, and added redundant storage and ES costs. On the StarRocks side, the rules are converted into bitmap queries, which consume few resources while QPS and performance meet business requirements.
      3. Comprehensive comparison of ClickHouse & StarRocks.
      Overall, in our assessment, ClickHouse's query concurrency is weak, its multi-table Join performance is very poor, its community activity is relatively low, daily operation and maintenance is difficult and costly, and its underlying design cannot guarantee data consistency. By comparison, StarRocks has the following advantages:
● Supports high-concurrency queries;
● Supports multiple distributed Join methods, such as Shuffle Join and Colocate Join, so multi-table Join performance is better;
● Supports transactional DDL and DML operations and is compatible with the MySQL protocol;
● FE and BE form a simple architecture that does not rely on external components, which makes operation and maintenance simpler;
● Data is balanced automatically, and the cluster scales out easily as the business grows.
      4. ClickHouse VS StarRocks performance test
      The performance test is not described in detail in this article. In short, the two are comparable in single-table queries, but, as mentioned above, StarRocks is clearly ahead in multi-table Join and high-concurrency scenarios.

Data flow

[Figure: data flow diagram; image not available]
      The current version mainly serves the business offline: user profile tags and user behavior logs are produced in the offline data warehouse, and the data is then synchronized to StarRocks, where the Bitmap conversion is completed.

  • Hive’s user profile table (user_profile)
CREATE EXTERNAL TABLE IF NOT EXISTS user_profile
(uid string COMMENT 'user id',
appId bigint COMMENT 'app id',
tag1 string COMMENT 'tag 1',
tag2 bigint COMMENT 'tag 2',
tag3 decimal(38,4) COMMENT 'tag 3',
-- ...... (further tag columns omitted in the original)
tag198 string COMMENT 'tag 198',
tag199 bigint COMMENT 'tag 199',
tag200 decimal(38,4) COMMENT 'tag 200'
)
COMMENT 'Commercial edition consumer-side (C-end) user profile application table'
PARTITIONED BY (
`dt` string COMMENT 'partitioned by day'
);
  • Hive’s user behavior tag table (user_action_tags)
CREATE EXTERNAL TABLE IF NOT EXISTS user_action_tags
(uid string COMMENT 'user id',
appId bigint COMMENT 'app id',
eventCode string COMMENT 'user behavior event',
tagType string COMMENT 'tag type',
tagValue string COMMENT 'tag value'
)
COMMENT 'Commercial edition consumer-side (C-end) user profile application table'
PARTITIONED BY (
`dt` string COMMENT 'partitioned by day'
);
  • StarRocks user profile tag unnest tables
CREATE TABLE `user_profile_unnest_{dataType}` (
  `u_id` bigint(20) NOT NULL COMMENT 'auto-increment user id (primary key)',
  `tag_type` varchar(50) NOT NULL DEFAULT '' COMMENT 'tag name',
  `tag_value` BIGINT NOT NULL DEFAULT '0' COMMENT 'tag value; column type matches dataType'
) ENGINE=OLAP
  • StarRocks user profile tag bitmap tables
CREATE TABLE `bitmap_user_profile_{dataType}` (
  `dt` int(11) NOT NULL COMMENT 'date partition',
  `tag_type` varchar(50) NOT NULL DEFAULT '' COMMENT 'tag name',
  `tag_value` BIGINT NOT NULL DEFAULT '0' COMMENT 'tag value; column type matches dataType',
  `uid_bitmap` bitmap NOT NULL COMMENT 'uid bitmap'
) ENGINE=OLAP
  • StarRocks user behavior tag unnest table
CREATE TABLE `user_action_unnest` (
  `dt` int(11) NOT NULL COMMENT 'date partition',
  `u_id` bigint(20) NOT NULL COMMENT 'auto-increment user id (primary key)',
  `app_id` bigint(20) NOT NULL COMMENT 'appid',
  `event_code` bigint(20) NOT NULL COMMENT 'behavior event code',
  `tag_key` varchar(50) NOT NULL COMMENT '',
  `tag_value` string NULL COMMENT ''
) ENGINE=OLAP
  • StarRocks user behavior tag bitmap table
CREATE TABLE `bitmap_user_action` (
  `dt` int(11) NOT NULL COMMENT 'date partition',
  `app_id` bigint(20) NOT NULL COMMENT 'appid',
  `event_code` bigint(20) NOT NULL COMMENT 'behavior event code',
  `tag_type` varchar(50) NOT NULL DEFAULT '' COMMENT 'tag name',
  `tag_value` BIGINT NOT NULL DEFAULT '0' COMMENT 'tag value',
  `uid_bitmap` bitmap NOT NULL COMMENT 'uid bitmap'
) ENGINE=OLAP

      During the Bitmap conversion, user profile tags are split by the data type of the tag value. Currently only string (String), integer (Bigint), and decimal (Decimal) types are used. Behavior tags are the parameters defined on the event-tracking platform; the logs are stored in JSON format, and currently only string values are used. Because the business platform supports importing third-party users for marketing, the auto-increment ID for each uid is generated on the StarRocks side. Behavior logs only cover internal App users: after the offline data warehouse produces the unnest table, it is synchronized directly to StarRocks and joined with uid_mapping to obtain u_id.
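The conversion from an unnest table to a bitmap table can be sketched roughly as follows. This is an illustration under assumptions, not the platform's actual job: it assumes `{dataType}` resolves to tables named `user_profile_unnest_bigint` and `bitmap_user_profile_bigint`, and the `dt` value is a made-up load date. `to_bitmap` and `bitmap_union` are built-in StarRocks functions; `to_bitmap` expects an integer, which is why the string `uid` has to be mapped to the integer `u_id` first.

```sql
-- Assumption: {dataType} = bigint, so the table names below are hypothetical instances
-- of user_profile_unnest_{dataType} and bitmap_user_profile_{dataType}.
INSERT INTO bitmap_user_profile_bigint (dt, tag_type, tag_value, uid_bitmap)
SELECT
    20240101                      AS dt,          -- hypothetical load date
    tag_type,
    tag_value,
    bitmap_union(to_bitmap(u_id)) AS uid_bitmap   -- one bitmap of user IDs per (tag, value)
FROM user_profile_unnest_bigint
GROUP BY tag_type, tag_value;
```

After this step each (tag, value) pair is a single row holding all matching users, so a segment rule no longer needs to scan one row per user.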

Common Bitmap functions

The following functions are frequently used in our business scenarios. For more functions, see the official documentation.
[Figure: table of common Bitmap functions; image not available]
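Since the figure is not available here, the queries below are only a rough sketch of how a few built-in StarRocks bitmap functions (`bitmap_union_count`, `bitmap_union`, `bitmap_to_array`) might be used; they are not a reproduction of the original list. The table name `bitmap_user_profile_bigint`, the tag name `'tag2'`, and the `dt` value are assumptions made for illustration.

```sql
-- Audience size per tag value: merge the bitmaps in each group and count distinct users.
SELECT tag_value,
       bitmap_union_count(uid_bitmap) AS users
FROM bitmap_user_profile_bigint          -- hypothetical instance of bitmap_user_profile_{dataType}
WHERE dt = 20240101                      -- hypothetical date value
  AND tag_type = 'tag2'                  -- hypothetical tag name
GROUP BY tag_value;

-- Materialize one audience as an array of integer u_id values, for example for export.
SELECT bitmap_to_array(bitmap_union(uid_bitmap)) AS u_ids
FROM bitmap_user_profile_bigint
WHERE dt = 20240101
  AND tag_type = 'tag2'
  AND tag_value = 1;
```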

Wide table VS Bitmap

[Figure: comparison of wide table and Bitmap; image not available]
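The original comparison is not available, so the following only illustrates how the same segment rule might be written against each layout. It assumes the wide table is the `user_profile` table shown earlier, that `{dataType}` resolves to a table named `bitmap_user_profile_bigint`, and that `tag_type` stores the tag column name; the `dt` values and thresholds are made up.

```sql
-- Wide table: scan user rows and count those that match both conditions.
SELECT COUNT(DISTINCT uid) AS users
FROM user_profile
WHERE dt = '2024-01-01'     -- hypothetical partition value
  AND tag2 = 1
  AND tag199 > 10;

-- Bitmap: build one bitmap per condition, then intersect them.
SELECT bitmap_count(bitmap_and(a.bm, b.bm)) AS users
FROM (
    SELECT bitmap_union(uid_bitmap) AS bm
    FROM bitmap_user_profile_bigint
    WHERE dt = 20240101 AND tag_type = 'tag2' AND tag_value = 1
) a
CROSS JOIN (
    SELECT bitmap_union(uid_bitmap) AS bm
    FROM bitmap_user_profile_bigint
    WHERE dt = 20240101 AND tag_type = 'tag199' AND tag_value > 10
) b;
```

The bitmap form reads only the rows for the tags named in the rule rather than one row per user. A rule that uses OR or NOT would swap `bitmap_and` for `bitmap_or` or `bitmap_andnot`.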

Origin blog.csdn.net/weixin_43452483/article/details/135325935