XL-LightHouse: A General-Purpose Streaming Big Data Statistics Platform

Overview

  • XL-LightHouse is a general-purpose streaming big data statistics platform that integrates data writing, data computing, data storage, and data visualization. It was developed for the complex data statistics needs of the Internet field and supports large data volumes and high concurrency.
  • XL-LightHouse covers the common streaming data statistics scenarios, including count, sum, max, min, avg, bitcount, topN/lastN, and other operations. It supports multi-dimensional calculation, minute-level, hour-level, and day-level time granularities, and configurable custom statistical periods.
  • XL-LightHouse has rich built-in conversion functions and supports expression parsing, satisfying a wide range of complex condition filtering and logical judgments.
  • XL-LightHouse is a full-featured data governance solution for the field of streaming big data statistics. It provides a friendly and complete visual query function and an external API query interface, and it also includes data metrics management, permission management, statistical rate limiting, and other functions.
  • XL-LightHouse supports the storage and query of time series data.

Background

Take the Internet industry as an example. Now that mobile Internet development has matured, traffic has peaked, the dividend has disappeared, competition among companies has intensified, and the cost of acquiring new users rises by the day. Many companies have realized that they cannot blindly seize the market through crude methods such as subsidies, price wars, and advertising; such an operating mode is difficult to sustain for long. The idea of reducing costs, improving efficiency, and maximizing single-user value through refined, data-driven operations has gradually been accepted by more and more companies. The premise of data-driven operations is a complete data indicator system, which serves an enterprise in several ways:

  • 1. Troubleshooting: data-driven operations bring the business into a "controllable" state, helping the enterprise quickly locate the problem when the business is not running normally.
  • 2. Business insight: data-driven operations make every aspect of business operation more transparent, helping the enterprise see clearly where its current weaknesses lie and assisting product optimization and iteration.
  • 3. Clear direction: data-driven operations cultivate a keen sense of the market, enabling the enterprise to judge market trends more accurately and capture commercially valuable information.
  • 4. Scientific trial and error: as the cost of trial and error grows ever higher, data-driven operations help companies move away from gut-feel decision making, break with past empiricism, assist decision makers' thinking, verify ideas quickly, and let the enterprise "trial and error" at lower cost and more scientifically.

As enterprises pay more and more attention to data-driven operations, a large number of data statistics requirements inevitably follow. XL-LightHouse takes streaming big data statistics as its entry point to promote the rapid popularization and large-scale application of streaming statistics across industries. It is positioned as a big data platform that supports tens of thousands or even hundreds of thousands of streaming data statistics requirements with a single set of services and modest server resources.

Benefits

XL-LightHouse helps enterprises build a relatively complete, stable, and reliable data-driven operation system more quickly, while reducing their investment in data-driven operations. This is mainly reflected in the following aspects:

  • It reduces the R&D and data maintenance costs of streaming big data statistics.
  • It saves time, assisting the rapid iteration of Internet products.
  • It saves considerable server computing resources.
  • It facilitates the sharing and interconnection of data within the enterprise.

In addition, XL-LightHouse is friendly to small and medium-sized enterprises: it greatly lowers the technical threshold for using streaming big data statistics, and complex streaming statistics requirements can be met through simple page configuration and data access.

Architecture

XL-LightHouse includes the following modules:

  • Client module: the SDK through which the business side reports raw statistical message data (see the sketch after this list);
  • RPC module: receives the statistical message data reported by clients and provides an interface for querying statistical results;
  • Tasks module: encapsulates the various streaming statistical computing scenarios, applies rate-limiting rules, parses the configuration of each statistical item, consumes message data, performs the calculations defined by the statistical configuration, and saves the statistical results;
  • Web module: manages and maintains statistical groups and statistical items, displays statistical results, configures rate-limiting rules, and manages access permissions for statistical metrics.
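To make the access flow concrete, below is a minimal sketch of how a business service might report one raw message through the Client SDK. The class name `LightHouse`, the `init`/`stat` entry points, and the group token are assumptions for illustration, not the actual SDK interface; a local stub is included so the sketch compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class ReportDemo {

    /** Stand-in for the real Client SDK; names and signatures are assumptions. */
    static class LightHouse {
        static void init(String rpcAddress) { /* connect to the RPC module */ }
        static void stat(String groupToken, Map<String, Object> params, long ts) {
            /* cut irrelevant fields, align timestamp, buffer, batch-report */
        }
    }

    public static void main(String[] args) {
        LightHouse.init("10.0.0.1:4061"); // hypothetical RPC service address

        // One raw message; all statistical items under the same statistical
        // group share this single piece of metadata.
        Map<String, Object> params = new HashMap<>();
        params.put("userId", "u_10001");
        params.put("province", "zhejiang");
        params.put("orderAmount", 28.5);

        // Report the message for a statistical group identified by its token.
        LightHouse.stat("order_stat_group_token", params, System.currentTimeMillis());
    }
}
```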

System Design

XL-LightHouse is a general-purpose streaming big data statistics platform. It abstracts streaming data statistics requirements into a number of computing scenarios and provides a high-performance implementation of each, so that every computing scenario can be reused without limit.

XL-LightHouse manages all statistical requirements with a three-tier structure of statistical project, statistical group, and statistical item. Each statistical requirement is called a statistical item, and each statistical item is based on one or more computing scenarios. Users can create statistical projects as needed; each project can contain multiple statistical items, and the statistical items based on the same metadata form a statistical group.

The Web module manages the running state of statistical items: users can start, stop, and delete a statistical item on the web page. Items in the running state perform their statistical operations normally; items in the non-running state do not. To access the system, the user first completes the corresponding configuration on the Web side and then reports raw data through the SDK. The system divides the raw statistical messages into batches according to the statistical period and performs the corresponding calculations based on the statistical configuration.

1. Custom streaming statistics specification (XL-Formula)

The SQL specification is widely used in big data query and statistical analysis, and SQL holds an unshakable position in offline analysis, OLAP, OLTP, and other fields. As components such as FlinkSQL and SparkSQL have matured, SQL has also been used more and more in streaming statistics. However, SQL has several drawbacks in this particular niche:

  • Because SQL processes data based on the concept of a data table, it inevitably keeps more raw and intermediate data in memory, wasting memory.
  • Distributed SQL triggers Shuffle during data processing, causing heavy network transmission and hurting execution efficiency.
  • Some grouping and aggregation operations can cause serious data skew that disrupts normal execution, so many SQL computing tasks must be tuned by hand according to data volume and computing logic, which wastes resources.
  • SQL syntax is bloated rather than concise: combining many filter conditions requires long SQL statements that are hard to understand and error-prone to write.
  • Extending SQL with custom functions is inconvenient.
  • SQL development is relatively complicated: the same function can often be written in several ways, and the parsing and execution efficiency of those ways differ.

These problems mean that implementing the corresponding functions depends on professional data R&D personnel, so streaming statistics tasks carry high R&D costs and long cycles. When an enterprise's data indicators grow exponentially, the bottleneck of the SQL specification becomes prominent, demanding heavy R&D, data maintenance, and server computing costs. In my view, these problems limit SQL's rapid expansion in the niche of streaming statistics, confining its application in this field largely to custom demand development; to some extent, the SQL specification has hindered the development of streaming statistics and restricted its rapid popularization and large-scale application across industries.

As a general-purpose streaming big data statistics platform, XL-LightHouse focuses on helping enterprises solve complex streaming data statistics problems. It does not cling to the current industry standards of the big data field, but seeks to solve these problems with a lighter technical solution: it defines a relatively complete set of configuration specifications (XL-Formula) for describing the various forms of streaming statistics requirements. Through the combination of its attributes, very powerful statistical functions can be expressed, helping enterprises meet complex statistical requirements at a much lower cost.
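As a purely hypothetical illustration of the idea (the attribute names and syntax below are assumptions for readability, not the official XL-Formula specification), a minute-level UV statistic and an hour-level PV statistic over the same metadata, both broken down by province, might be declared as:

```
<stat-item title="uv_per_minute" stat="bitcount(userId)" dimens="province" period="1-minute"/>
<stat-item title="pv_per_hour"   stat="count()"          dimens="province" period="1-hour"/>
```

A declaration like this carries only the operator, the grouping attributes, and the period, which is what allows a single engine-side implementation of each computing scenario to be reused by every statistical item.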

2. Message aggregation processing

The system divides the entire data consumption link into these basic stages: the Client module reporting message data, the RPC module processing message data, the computing module performing expansion and grouping operations, and the storage of statistical results. In each stage the system applies asynchronous processing, batch consumption, and aggregation of repetitive calculations. Each stage receives messages into a message buffer pool, divides them into calculation types according to that stage's predefined aggregation logic, and aggregates messages of the same type within a single node and single process. This design reduces the data transmitted downstream, improves network IO efficiency, and directly reduces the downstream calculation load and the write pressure on the DB. Every stage, from the client sending a message to the final storage of statistical results, aggregates repetitive messages to shrink the message volume as much as possible and discards parameters irrelevant to downstream operations as early as possible, so the data consumption link of XL-LightHouse is a layer-by-layer decreasing structure. The aggregation logic of each stage differs slightly; taking the Client module as an example, message aggregation mainly includes the following:

(1) Message body parameter cutting. To speed up message transmission and improve aggregation efficiency in later steps, the Client module cuts the original message to remove statistically irrelevant fields. These fields are computed by the system from all valid statistical items under each statistical group: any field not referenced by any valid statistical item is filtered out before the Client module reports data, avoiding unnecessary transmission.

(2) Aligning the message body timestamp. When reporting a message, the Client module replaces the message's original timestamp with the minimum batch time before performing the aggregation operation. The goal is to aggregate as many messages as possible while preserving accuracy in later steps, reducing network transmission and downstream calculation. The Client module takes the greatest common divisor of the statistical periods of all valid statistical items under the current statistical group as the time window, computes the minimum batch time of the message from this window and the original timestamp, overwrites the timestamp with the minimum batch time, and puts the message into the buffer pool.

(3) Aggregation operation. Aggregation merges messages of the same type according to the stage's predefined aggregation logic. For the Client module, messages are of the same type when their content is identical, that is, when they belong to the same statistical group and carry the same parameter values. After original messages are sent to the buffer pool, the consumer thread group periodically reads them in batches and aggregates those that meet the aggregation rule.

After the aggregation operation, the data structure of the message body changes from a single message body's content to two attributes: the message body content and its repetition count.
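A minimal sketch of this Client-side aggregation, under simplified types: the timestamp of each message is aligned down to the minimum batch time (using the greatest common divisor of the group's statistical periods as the time window), and identical message bodies are then merged into (content, repetition count) pairs. Names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClientAggregation {

    /** Greatest common divisor of the statistical periods (in seconds). */
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    static long timeWindowSeconds(List<Long> periodsSeconds) {
        return periodsSeconds.stream().reduce(ClientAggregation::gcd).orElseThrow();
    }

    /** Align the original timestamp down to the minimum batch time. */
    static long minBatchTime(long originalTsMillis, long windowSeconds) {
        long windowMillis = windowSeconds * 1000;
        return originalTsMillis - originalTsMillis % windowMillis;
    }

    /**
     * Merge messages whose statistical group, parameter values, and (aligned)
     * batch time are identical into a single entry with a repetition count.
     */
    static Map<String, Integer> aggregate(List<String> alignedMessageBodies) {
        Map<String, Integer> merged = new HashMap<>();
        for (String body : alignedMessageBodies) {
            merged.merge(body, 1, Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // The group's valid statistical items use 1-minute and 5-minute periods,
        // so the aggregation time window is their GCD: 60 seconds.
        long window = timeWindowSeconds(List.of(60L, 300L));
        long aligned = minBatchTime(System.currentTimeMillis(), window);
        System.out.println("window=" + window + "s, batchTime=" + aligned);
    }
}
```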

3. Message expansion and grouping

In XL-LightHouse, all statistical tasks in the cluster share the cluster's computing resources. After receiving data, the computing module performs expansion and grouping operations on the statistical messages.

  • Message expansion

In most business scenarios a single piece of metadata feeds multiple data indicators, and all statistical items under a statistical group share one raw data message. The expansion operation queries all valid statistical items under the statistical group, extracts the fields associated with each item, and copies a separate message for each statistical item that retains only the fields relevant to its operation. Its purpose is to keep the subsequent operation logic of the statistical items from affecting one another.

  • Message grouping

The grouping operation first extracts the statistical period attribute of the statistical item, divides time windows according to the statistical period, and groups the expanded messages by time window. It then checks whether the statistical item contains multiple statistical operation units and, if so, regroups by operation unit; and whether the statistical item contains dimension attributes and, if so, extracts the dimension information and regroups by dimension. The purpose of grouping is to decompose each statistical task's calculation into distinct calculation types, aggregate messages of the same type, and keep the calculations of different types of messages from affecting one another (a sketch of such a grouping key follows below).
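The sketch below shows how a grouping key might be composed after the expansion operation, assuming each expanded message carries its statistical item identifier, time batch, statistical operation unit, and optional dimension values; field names are assumptions. Messages with equal keys are aggregated together and processed independently of all other groups.

```java
import java.util.Objects;

/** Hypothetical grouping key; field names are assumptions for illustration. */
public final class GroupKey {
    final long statItemId;     // which statistical item this message belongs to
    final long batchTime;      // time window start derived from the statistical period
    final String statUnit;     // e.g. "count" or "sum(orderAmount)" when an item has several units
    final String dimensValue;  // e.g. "province=zhejiang"; empty if the item has no dimensions

    GroupKey(long statItemId, long batchTime, String statUnit, String dimensValue) {
        this.statItemId = statItemId;
        this.batchTime = batchTime;
        this.statUnit = statUnit;
        this.dimensValue = dimensValue;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof GroupKey)) return false;
        GroupKey k = (GroupKey) o;
        return statItemId == k.statItemId && batchTime == k.batchTime
                && statUnit.equals(k.statUnit) && dimensValue.equals(k.dimensValue);
    }

    @Override public int hashCode() {
        return Objects.hash(statItemId, batchTime, statUnit, dimensValue);
    }
}
```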

4. Message buffer pool

The message buffer pool on which the system's aggregation processing depends is implemented on top of a bounded priority blocking queue. The pool is divided into several slots, each consisting of a BoundedPriorityBlockingQueue and the slot's last access timestamp. The processing logic of the message buffer pool includes the following steps: (1) the producer generates a Key for the message event according to the aggregation logic of the current stage; the Key distinguishes whether two messages are of the same type; (2) the buffer pool assigns the message to a slot by the hash remainder of its Key; (3) messages are divided into processing cycles according to the predefined time window; (4) the system checks whether the used capacity of the slot exceeds the threshold batchsize * backlog_factor, where batchsize is the specified maximum number of messages for a single consumption and backlog_factor is the specified message backlog factor; (5) if the threshold is exceeded, a batch of messages is consumed; otherwise the slot's last consumption access time is checked, and if it exceeds the time threshold a batch of messages is read and consumed, else the task is skipped. After consuming a slot's messages, the slot's used capacity and last access time are updated. This implementation aggregates as many messages of the same computing type as possible into one pass, reducing the downstream computing load and the write pressure on the DB.
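A condensed sketch of the slot selection and consumption decision described above; `BATCH_SIZE` and `BACKLOG_FACTOR` mirror the thresholds in the text, while the standard `PriorityBlockingQueue` stands in for the bounded queue and the constants are illustrative.

```java
import java.util.concurrent.PriorityBlockingQueue;

/** Simplified message buffer pool; a stand-in for the real slot structure. */
public class BufferPool {
    static final int SLOT_COUNT = 64;
    static final int BATCH_SIZE = 2000;        // max messages for a single consumption
    static final double BACKLOG_FACTOR = 2.0;  // message backlog factor
    static final long IDLE_THRESHOLD_MS = 200; // consume even a small batch after this idle time

    static class Slot {
        // The real implementation uses a *bounded* priority blocking queue.
        final PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<>();
        volatile long lastAccessTime = System.currentTimeMillis();
    }

    final Slot[] slots = new Slot[SLOT_COUNT];
    { for (int i = 0; i < SLOT_COUNT; i++) slots[i] = new Slot(); }

    /** Slot allocation by hash remainder of the message key. */
    Slot slotFor(String messageKey) {
        return slots[Math.floorMod(messageKey.hashCode(), SLOT_COUNT)];
    }

    /** Consume when the backlog threshold is exceeded, or the slot has idled too long. */
    boolean shouldConsume(Slot slot) {
        boolean overThreshold = slot.queue.size() >= BATCH_SIZE * BACKLOG_FACTOR;
        boolean idleTooLong =
                System.currentTimeMillis() - slot.lastAccessTime >= IDLE_THRESHOLD_MS;
        return overThreshold || idleTooLong;
    }
}
```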

5. Cardinality operations

The bitcount cardinality calculation refers to distinct counting (counting non-repeated values). The system uses a cardinality filter to screen out already-seen values: cardinality statistics are realized by determining how many values do not yet exist in the filter and then updating the statistical results in the DB. The cardinality filter has two parts: an in-memory cardinality filter and a distributed cardinality filter. The in-memory filter performs a preliminary check of whether a value already exists; its purpose is to make this judgment efficiently in memory and avoid, as far as possible, the impact of repeated cardinality checks on overall performance. It is implemented with the RoaringBitmap toolkit. The distributed cardinality filter contains multiple shards, each corresponding to one RoaringBitmap data structure; the number of shards can be chosen according to actual needs, and increasing it improves the accuracy of the cardinality calculation. The distributed filter works as follows: (1) the original value is passed through 128-bit MurmurHash to generate a corresponding Long hash value; (2) the number of shards required by the statistical task is set, with each shard corresponding to one RoaringBitmap structure (the system implements the filter with Redis extended by the Redis-Roaring plug-in), and the shard for an original value is obtained by hash remainder; (3) the Long hash value is split into two Int integers from its high 32 bits and low 32 bits, taking the absolute value of any negative one; the combination of the two Int values is the original value's index in the RoaringBitmap structure; (4) the Int combinations of multiple cardinality values are sent to Redis in batches, and a Lua script combines and executes the membership checks: if both Int values exist in the filter, the original value is considered already present, otherwise it is new, and for new values the system updates the corresponding index bits after the check; (5) the number of original values absent from the filter is counted and written to the DB.
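A sketch of the hash-splitting and membership check, using Guava's 128-bit MurmurHash and a local RoaringBitmap in place of the Redis-Roaring shards (the Lua-scripted batching against Redis is omitted); the shard count and names are illustrative assumptions.

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import org.roaringbitmap.RoaringBitmap;

public class CardinalityFilter {
    static final int SHARD_COUNT = 8;
    // One RoaringBitmap per shard; the real system keeps these in Redis
    // via the Redis-Roaring plug-in rather than in local memory.
    final RoaringBitmap[] shards = new RoaringBitmap[SHARD_COUNT];
    { for (int i = 0; i < SHARD_COUNT; i++) shards[i] = new RoaringBitmap(); }

    /** Returns true if the value was not seen before (a new cardinality value). */
    boolean addIfAbsent(String originalValue) {
        long hash = Hashing.murmur3_128()
                .hashString(originalValue, StandardCharsets.UTF_8).asLong();
        // Split the 64-bit hash into its high and low 32 bits; take absolute
        // values so both halves are valid RoaringBitmap index values.
        int high = Math.abs((int) (hash >>> 32));
        int low = Math.abs((int) hash);
        RoaringBitmap shard = shards[Math.floorMod((int) hash, SHARD_COUNT)];
        // The value is considered present only if BOTH index bits are set.
        boolean existed = shard.contains(high) && shard.contains(low);
        if (!existed) {
            shard.add(high);
            shard.add(low);
        }
        return !existed;
    }
}
```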

6. Avoiding shuffle

During big data task execution, Shuffle is a major performance factor: besides heavy network overhead, it can cause problems such as data skew and even OOM. The system avoids Shuffle and similar uncontrollable factors in order to avoid the unpredictable problems they may bring. The computing module is developed on Structured Streaming and adopts a calculation method that completely avoids Shuffle, adjusting the parallelism of task execution by setting the number of computing nodes. Within a single computing node, the system splits statistical messages into different calculation types by statistical item identifier, dimension identifier, time batch, and statistical operation unit. Statistical result data and intermediate state data are kept in external storage: statistical results are stored in HBase, the intermediate state of bitcount cardinality operations in Redis, and the sorted data of limit operations in Redis. During computation each node communicates only with external storage, and computing nodes do not affect one another (see the sketch below).
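Conceptually, each node can do all of its work with a local map and external storage, exchanging no data with other nodes. The sketch below shows the per-partition processing in plain Java with the Spark API details omitted; the message type and sink are stand-ins, not the project's actual classes.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PartitionProcessor {

    /**
     * Process one partition of expanded messages entirely inside the node:
     * split by calculation type, aggregate locally, then write results to
     * external storage (HBase for results, Redis for bitcount/limit state).
     * No data is exchanged with other computing nodes, so no shuffle occurs.
     */
    static void processPartition(Iterator<ExpandedMessage> partition) {
        Map<String, Long> localAgg = new HashMap<>();
        while (partition.hasNext()) {
            ExpandedMessage msg = partition.next();
            localAgg.merge(msg.calculationTypeKey(), msg.repetitions(), Long::sum);
        }
        // Only external storage is touched here; the sink is hypothetical.
        localAgg.forEach(PartitionProcessor::writeToExternalStorage);
    }

    static void writeToExternalStorage(String key, Long value) { /* HBase / Redis */ }

    /** Stand-in message type; accessor names are assumptions. */
    interface ExpandedMessage {
        String calculationTypeKey(); // statItemId + dimension + time batch + stat unit
        long repetitions();          // repetition count carried from upstream aggregation
    }
}
```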

7. Statistical rate limiting

To avoid system instability caused by a sudden influx of statistical requirements or a surge in the traffic of a particular statistical item, the system provides a circuit-breaker style protection mechanism along the dimensions of statistical group message volume, statistical item result volume, and statistical item calculation volume. Its role is to better guarantee the stability of the overall service. It currently includes the following strategies:

(1) Statistical group message volume rate limiting. This strategy limits the number of messages a statistical group may receive per unit time. The system's built-in message counter calculates the number of messages received per unit time; when it exceeds the threshold, the group enters the rate-limited state. The Client module and the Tasks module automatically discard the messages of groups in this abnormal state. Since a statistical group can correspond to one or more statistical items, this policy affects the normal statistics of all statistical items under the group. After a group enters the rate-limited state, its messages are automatically discarded for a specified time (20 minutes by default); once that time threshold is reached, the group automatically returns to the normal state.

(2) Statistical item result volume rate limiting. This strategy limits the number of statistical results a statistical item may generate per unit time. The system's built-in result counter calculates the number of results generated per unit time; when it exceeds the threshold, the statistical item enters the rate-limited state. The result volume of a statistical item is related to two factors. One is the time granularity of the statistical period: the finer the granularity, the larger the volume of indicator data; for example, second-level and minute-level statistics generate more results per unit time than hour-level and day-level statistics. The other is dimensions: items with more dimension values generate more results per unit time; for example, indicators with city as the dimension generate more results than indicators with province as the dimension. Because this is a rate limit on the statistical item itself, it affects only the current item and has no effect on other statistical items under the same group. When an item enters the rate-limited state, its messages are automatically discarded for a specified time (20 minutes by default); once that time threshold is reached, the item automatically returns to the normal state.
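A minimal sketch of the per-unit-time counting and rate-limited state shared by these strategies; the 20-minute recovery time mirrors the description, while the window length, thresholds, and names are illustrative.

```java
/** Fixed-window counter with a temporary rate-limited state; a simplified model. */
public class StatLimiter {
    private final long threshold;           // max messages (or results) per window
    private final long windowMillis;        // the "unit time" used for counting
    private final long limitDurationMillis; // how long the limited state lasts

    private long count = 0;
    private long windowStart = System.currentTimeMillis();
    private long limitedUntil = 0;

    public StatLimiter(long threshold, long windowMillis, long limitDurationMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.limitDurationMillis = limitDurationMillis;
    }

    /** Returns true if the message may be processed, false if it is discarded. */
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now < limitedUntil) {
            return false;                             // still in the rate-limited state
        }
        if (now - windowStart >= windowMillis) {
            windowStart = now;                        // roll over to a new counting window
            count = 0;
        }
        if (++count > threshold) {
            limitedUntil = now + limitDurationMillis; // trip: discard until recovery
            return false;
        }
        return true;
    }
}
```

For instance, `new StatLimiter(1_000_000, 60_000, 20 * 60_000)` would model a limit of one million messages per minute with the default 20-minute recovery period.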

8. Timestamp compression

The system further optimizes the data storage format for streaming statistics scenarios, aiming to improve the data throughput of the DB. The statistical results are stored with timestamp compression: time is divided into periods according to the statistical cycle, and the multiple statistical result values that fall in the same period, under the same dimension of a statistical item, are stored in different columns of a single record.
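A sketch of what such a layout could look like for a minute-level statistic in a wide-column store such as HBase: the row key carries the statistical item, dimension, and the start of a compression period, and each result inside the period becomes a column whose qualifier is the offset of its batch time. The period length and key format here are assumptions, not the system's actual schema.

```java
public class TimestampCompression {
    // Assume one record covers a compression period of 1 hour for a
    // minute-level statistic, i.e. up to 60 result columns per record.
    static final long PERIOD_MILLIS = 60L * 60 * 1000;
    static final long STAT_CYCLE_MILLIS = 60L * 1000;

    /** Row key: statItem + dimension + period start, so one record holds a whole period. */
    static String rowKey(long statItemId, String dimens, long batchTime) {
        long periodStart = batchTime - batchTime % PERIOD_MILLIS;
        return statItemId + ";" + dimens + ";" + periodStart;
    }

    /** Column qualifier: offset of the batch inside its compression period. */
    static long columnOffset(long batchTime) {
        return (batchTime % PERIOD_MILLIS) / STAT_CYCLE_MILLIS; // 0..59
    }

    public static void main(String[] args) {
        long batchTime = 1_700_000_040_000L; // some minute-aligned batch time
        System.out.println(rowKey(42L, "province=zhejiang", batchTime)
                + " column=" + columnOffset(batchTime));
    }
}
```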

9. Exception circuit breaking

The circuit breaker mechanism protects the stability of the business side's own service, preventing instability in the statistics service from spilling over into it. When calls to the client interface fail or time out more than a threshold number of times per unit time, the client enters the circuit-broken state, and the Client module automatically skips the message-sending logic. While in this state, the Client module periodically checks whether the statistics service has returned to normal and automatically reconnects once it has.
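A compact sketch of this client-side circuit breaker; the failure threshold, probe interval, and method names are illustrative assumptions rather than the SDK's real internals.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified client-side circuit breaker; names and thresholds are assumptions. */
public class ClientFuse {
    static final int FAILURE_THRESHOLD = 50;      // failures/timeouts per unit time
    static final long PROBE_INTERVAL_MS = 30_000; // how often to re-check the service

    final AtomicInteger failures = new AtomicInteger();
    volatile boolean open = false;                // open = skip sending statistics
    volatile long lastProbeTime = 0;

    /** Called before each send; when open, message sending is skipped entirely. */
    boolean allowSend() {
        if (!open) return true;
        long now = System.currentTimeMillis();
        if (now - lastProbeTime >= PROBE_INTERVAL_MS) {
            lastProbeTime = now;
            if (pingStatService()) {              // service recovered: close the fuse
                open = false;
                failures.set(0);
                return true;
            }
        }
        return false;
    }

    /** Called whenever a client call fails or times out within the unit time. */
    void onFailure() {
        if (failures.incrementAndGet() >= FAILURE_THRESHOLD) open = true;
    }

    boolean pingStatService() { return true; /* hypothetical health check */ }
}
```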

System Functional Boundary

  • (1) Detailed queries of the original data are not supported;
  • (2) For now, only tumbling-window streaming statistics are provided (sliding-window statistics will be supported in subsequent versions);
  • (3) Second-level granularity statistics are not yet supported (they will be supported in subsequent versions);
  • (4) The system does not handle raw data collection. All calculations are based on the original messages reported by the access party, which must assemble the raw message data and report it through the SDK. Only a Java SDK is currently provided: services in JVM languages can call it directly, while services in other languages can write their data to a message queue such as Kafka and access XL-LightHouse by consuming that data (a minimal bridge sketch follows this list).
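For services not written in a JVM language, a thin JVM bridge can consume the queue and forward each record through the SDK. Below is a minimal sketch using the standard Kafka consumer API; the topic name, broker address, and the `forwardToLightHouse` stand-in for the SDK call are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaBridge {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("group.id", "xl-lighthouse-bridge");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("raw-stat-messages"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    forwardToLightHouse(record.value()); // hand off to the Java SDK
                }
            }
        }
    }

    /** Stand-in for the SDK report call; parse and report the raw message. */
    static void forwardToLightHouse(String rawMessage) { /* ... */ }
}
```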

Project Address

Dependent Components

This project depends on the following components: Hadoop (Apache2.0), HBase (Apache2.0), Spark (Apache2.0), Kafka (Apache2.0), Zookeeper (Apache2.0), Redis (BSD3), Redis-Roaring (MIT), Guava (Apache2.0), Caffeine (Apache2.0), ZeroC Ice (GPLv2), ECharts (Apache2.0), AdminLTE (MIT), Aviator, SpringBoot (Apache2.0), MySQL (GPLv2), LayUI (MIT), zTree (MIT), jQuery (MIT), Jedis (MIT), FreeMarker (Apache2.0), RoaringBitmap (Apache2.0), Redisson (Apache2.0), Jackson (Apache2.0), ACE Editor (BSD), Disruptor (Apache2.0).

Communication and Feedback

Some Words at the End

XL-LightHouse is a general-purpose streaming big data statistics platform dedicated to promoting the rapid popularization and large-scale application of streaming statistics. It is positioned as a big data platform that supports tens of thousands or even hundreds of thousands of streaming data statistics requirements with a single set of services and modest server resources. XL-LightHouse is designed for use by all functional roles in an enterprise, from top to bottom. It advocates taking general-purpose streaming data statistics as the starting point and prefers lighter technical solutions, helping enterprises build a relatively complete, stable, and reliable data-driven operation system that spreads through the organization the way the nervous system spreads through the human body.

Streaming statistics technology is not perfect, and there are indeed scenarios that do not suit it, so it cannot completely replace other technical solutions. Still, I believe that among all the technical solutions in the field of enterprise data operations, the only one able to play the role of a mainstay is general-purpose streaming data statistics. Timeliness is one reason streaming statistics is favored, but I think the most fundamental reason is the extent to which a technology can be popularized: in many cases, the cost of use decides everything.

In the field of software R&D, I think general-purpose streaming statistics will have a huge impact on how software products are developed. It will grow into a role as important as logging, becoming an auxiliary tool system independent of logs and comparable to them, with programmers of every kind adding streaming statistics code wherever necessary, just as they add logs today. In the enterprise service market, I believe general-purpose streaming data statistics will, thanks to its huge application scenarios and business value, become one of the core basic services of enterprises, and that data-driven operation products built around it, with other technical solutions as auxiliary means, will become an indispensable backbone of the enterprise B2B market. Furthermore, with the coordinated development of software and hardware and the coming of the Internet of Things era, I think general-purpose streaming data statistics will permeate every aspect of the physical world, becoming a basic computing capability of society that is widely used across industries.

Author: 雪灵    Contact: [email protected]
