Architecture evolution of Shence's new-generation analysis engine


Shence Data recently released Shence Analysis 2.5, which supports integrated access to analysis models and external data, builds a global data fusion model, and enables full-link, full-scenario analysis from users to operations. The new version gives enterprises more comprehensive and effective market information and business strategies, helping them understand user needs in depth, track market trends, and improve their competitiveness. This upgrade gives enterprises more powerful data analysis tools to support business development and decision-making.

As the technical core of the new version, the Shence Customer Journey Analysis Engine (hereafter the "Shence analysis engine") has also undergone a significant architectural evolution. This article describes in detail the direction of that evolution in Shence Analysis 2.5 and the optimization of its key capabilities.

1. Comprehensive support for an elastic architecture

The Shence analysis engine supports a fully elastic architecture that separates storage, query, and import. Each of the three supports multiple capability tiers and can be scaled out and in elastically. Enterprises can combine them flexibly according to their own business needs and keep hardware costs as low as possible.

Figure: Overall architecture of the Shence analysis engine

1. Elastic storage, with two-way integration into the mainstream data lake ecosystem

The Shence analysis engine has a native architecture that separates storage from compute, so both immutable data storage (HDFS, object storage) and mutable data storage (Kudu) can be scaled flexibly.

The engine chooses a storage system according to how hot or cold the data is and whether it needs to be updated. The goal is to minimize the need for high-performance SSDs and to keep large volumes of data on low-cost HDDs. Through Alluxio, the engine can connect directly and seamlessly to the object storage of the major public clouds for low-cost elastic expansion. Elasticity is not always the best option, however: local storage performs better, and its cost is relatively manageable given the discounts available on one-time upfront purchases. Enterprises can adjust the mix of storage types to suit their business and find the best balance between performance and cost.

Separating storage from compute also has a performance cost. In small clusters, Shence Data therefore still deploys compute and storage on the same machines by default to reduce network overhead and improve scan performance. In large clusters and in elastic mode, the engine uses local caching to reduce the extra network overhead that the separation introduces.

In addition, the Shence analysis engine is fully compatible with the Iceberg standard, which makes two-way integration with a customer's existing data warehouse and data lake straightforward, removes the need to store data redundantly, and keeps data consistent across applications. The Iceberg data lake standard is widely supported by mainstream data warehouse and data lake solutions and has a complete open-source tool ecosystem.
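The tiering idea in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Partition` fields, the `choose_storage` function, the 7-day hot window, and the tier labels are all invented for this example and are not the engine's actual policy or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    mutable: bool            # True if rows may be updated after being written
    days_since_access: int   # rough proxy for how "cold" the data is


def choose_storage(p: Partition, hot_days: int = 7) -> str:
    """Return a storage tier label for a partition (hypothetical policy)."""
    if p.mutable:
        return "kudu"            # mutable data goes to the mutable store
    if p.days_since_access <= hot_days:
        return "hdfs_local"      # hot immutable data stays on local disks
    return "object_storage"      # cold immutable data goes to cheap elastic storage
```

Raising or lowering `hot_days` in such a policy is one way to shift the balance between local and elastic storage.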

2. Elastic query to respond flexibly to business needs

Query resources are usually the most volatile part of the analysis engine's overall resource usage. Demand depends not only on business peaks (such as the traffic spikes caused by promotional campaigns) but also directly on the company's own internal activities (such as weekly and monthly reports or version releases). For this reason, the Shence analysis engine provides a very flexible scheme for configuring query resources.

First, for relatively stable, fixed query workloads, a certain proportion of local query resources should be allocated. Because these resources colocate storage and compute, query performance is usually better and latency lower. Capacity can be expanded later as the business grows.

Second, for nighttime offline computation or temporary large-scale queries, such as during major promotions or new game launches, elastic query resources on a Kubernetes cluster can be used. The best practice here is to deploy on pay-as-you-go nodes from the major public cloud vendors, or on spot instances (such as AWS Spot Instances). In Shence Data's experience serving customers, this approach saves approximately 20% to 30% of the cost compared with using only local query resources.

Finally, the analysis engine supports physically isolated query resource groups as well as priority queues within each resource group. For example, resources can be allocated by product line and query size so that high-priority business needs are better protected.
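To make the combination of isolated resource groups and per-group priority queues concrete, here is a minimal Python sketch. The `ResourceGroup` class, the group names, and the convention that a lower number means higher priority are assumptions for illustration only; the article does not describe the engine's real scheduler.

```python
import heapq
from itertools import count


class ResourceGroup:
    """A query queue in which a lower priority number runs first."""

    def __init__(self, name: str):
        self.name = name
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, query_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), query_id))

    def next_query(self):
        """Pop the next query to run, or return None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


# One isolated group per product line (hypothetical names).
groups = {name: ResourceGroup(name) for name in ("marketing", "finance")}
groups["marketing"].submit("monthly_report", priority=5)
groups["marketing"].submit("dashboard_tile", priority=1)
groups["finance"].submit("audit_export", priority=5)

print(groups["marketing"].next_query())  # dashboard_tile
```

Because each group owns its own queue, a backlog in one product line does not delay queries in another.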

3. Elastic import to maximize hardware resource utilization

For import, the Shence analysis engine provides several modes, including real-time import with second-level latency, micro-batch import with minute-level latency, and offline import with hour-level latency, to balance timeliness against throughput and maximize resource utilization. It also supports switching between modes dynamically, for example switching to micro-batch mode during an import peak and back to real-time mode afterwards.

Compared with queries, the resource consumption of imports is usually fairly stable, so by default imports can generally run on fixed local resources. For large, one-time imports of historical data, however, a better choice is to run on an elastic Kubernetes cluster, which avoids the operational and hardware costs of repeatedly scaling out and in over a short period.
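One plausible way to drive such mode switching is a threshold rule with hysteresis, sketched below in Python. The function name, the mode labels, and both thresholds are hypothetical; the article does not say what signal the engine actually uses.

```python
def choose_import_mode(events_per_second: float, current_mode: str,
                       peak_threshold: float = 50_000,
                       recover_threshold: float = 30_000) -> str:
    """Switch between "realtime" and "micro_batch" with hysteresis.

    Using two thresholds (assumed values) avoids flapping between modes
    when the import rate hovers around a single cut-off.
    """
    if current_mode == "realtime" and events_per_second > peak_threshold:
        return "micro_batch"   # trade latency for throughput during the peak
    if current_mode == "micro_batch" and events_per_second < recover_threshold:
        return "realtime"      # load has dropped, restore second-level latency
    return current_mode
```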

2. Optimization of six core capabilities

1. Comprehensively enhanced user journey analysis

The Shence analysis engine focuses on the specific scenario of user journey analysis. Unlike a general-purpose OLAP engine, it is built around an efficient user sequence analysis framework, and all analysis models, including funnel, path, attribution, and LTV, are developed on top of it. This ensures excellent execution efficiency and allows new functionality to be added quickly as business needs change.

For large data volumes, we provide fast sampling at the granularity of complete users, so that a user's behavior is never fragmented by sampling. This allows fast, low-cost computation while preserving the accuracy of metrics. We have also implemented an efficient enumeration capability that supports scenarios involving a single user's behavior sequence and avoids redundant storage and data inconsistency. Finally, for ID-Mapping and data compliance scenarios, we specifically support deleting and repairing a single user's data.
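As an illustration of what analysis over a per-user event sequence looks like, the following Python sketch computes a simple funnel. It is a simplified stand-in, not the engine's framework: `funnel_depth`, `funnel_counts`, the `(timestamp, event_name)` tuple layout, and the conversion `window` are all assumptions made for this example.

```python
def funnel_depth(events, steps, window):
    """Return how many funnel steps one user completed, in order.

    events: list of (timestamp, event_name), sorted by timestamp.
    steps:  ordered list of event names that define the funnel.
    window: maximum time allowed between the first step and later steps.
    """
    if not steps:
        return 0
    best = 0
    for i, (start_ts, name) in enumerate(events):
        if name != steps[0]:
            continue  # only an occurrence of the first step can open a funnel
        depth = 1
        for ts, ev in events[i + 1:]:
            if ts - start_ts > window:
                break
            if depth < len(steps) and ev == steps[depth]:
                depth += 1
        best = max(best, depth)
    return best


def funnel_counts(users, steps, window):
    """counts[k] is the number of users who reached step k + 1."""
    counts = [0] * len(steps)
    for events in users.values():
        for k in range(funnel_depth(events, steps, window)):
            counts[k] += 1
    return counts
```

Because each user's events are processed as one ordered sequence, path and attribution models can reuse the same iteration pattern with a different per-user function.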
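Sampling by user rather than by event is commonly done by hashing the user ID into a fixed number of buckets, so that every event of a given user is either kept or dropped together. The Python sketch below shows that general technique; the `in_sample` function and the choice of 100 buckets are assumptions, and the article does not state how the engine implements its sampling.

```python
import hashlib


def in_sample(user_id: str, rate_percent: int) -> bool:
    """Decide deterministically whether a user belongs to the sample.

    The decision depends only on the user ID, so all of a user's events
    land on the same side and behavior sequences are never split.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


events = [("u1", "view"), ("u2", "view"), ("u1", "pay"), ("u3", "view")]
sampled = [e for e in events if in_sample(e[0], 10)]
```

Metrics computed on such a sample can then be scaled by the sampling rate to approximate the full-data result.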

2. Accurate query resource estimation

Accurately estimating the resources each query needs is an important prerequisite for the stable operation of the Shence analysis engine. In addition to the traditional estimation method based on statistics, the engine also introduces estimation based on query history. In real business scenarios, enterprise usage of the product tends to follow strong regular patterns, so once the system has been running for some time, history-based estimation plays a key role and greatly improves overall accuracy.

Accurate resource estimates make it possible, on the one hand, to choose a better execution plan and, on the other, to schedule query resources more precisely. For example, small queries can be placed in a high-priority queue for fast execution. Accurate estimates also allow users to be given more precise interactive feedback.

3. Real-time data aggregation that unifies batch and stream processing

In addition to offline analysis and ad hoc queries, the Shence analysis engine can run streaming aggregation queries starting from any point in historical data. This means the same query engine and the same UDFs/UDAFs serve all three scenarios, giving consistent syntax, efficient performance, and reusability. With this capability, we can support high-frequency queries with second-level freshness to better meet real-time monitoring needs.
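The interplay between the two estimation methods and queue selection can be sketched as follows. This is a hypothetical Python illustration: the function names, the use of the median, the `min_history` cut-off, and the 512 MB threshold are assumptions and do not describe the engine's real estimator.

```python
from statistics import median


def estimate_memory_mb(stats_estimate_mb: float, history_mb: list,
                       min_history: int = 3) -> float:
    """Prefer observed usage of similar past queries when enough exists.

    history_mb holds the peak memory of earlier runs of the same query
    pattern; with too few samples, fall back to the statistics-based value.
    """
    if len(history_mb) >= min_history:
        return float(median(history_mb))
    return float(stats_estimate_mb)


def pick_queue(estimated_mb: float, small_query_mb: float = 512) -> str:
    """Route small queries to a high-priority queue (assumed threshold)."""
    return "high_priority" if estimated_mb <= small_query_mb else "normal"


estimate = estimate_memory_mb(4096, [200, 220, 210])
print(pick_queue(estimate))  # high_priority
```

In this example the statistics-based estimate alone would have sent the query to the normal queue, while the history shows that it is in fact small.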

Figure: Application examples of real-time aggregation
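The idea of reusing one aggregation definition for both historical and streaming data can be illustrated with a small Python sketch. `EventCounter`, `replay`, and the event dictionary layout are hypothetical and stand in for the engine's UDAF mechanism, which the article does not detail.

```python
from collections import Counter


class EventCounter:
    """A toy aggregate: counts events by name."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event: dict) -> None:
        self.counts[event["name"]] += 1


def replay(history, start_ts, agg):
    """Batch phase: feed historical events from start_ts onward."""
    for event in history:
        if event["ts"] >= start_ts:
            agg.update(event)
    return agg


history = [
    {"ts": 1, "name": "view"},
    {"ts": 5, "name": "view"},
    {"ts": 6, "name": "pay"},
]
agg = replay(history, start_ts=5, agg=EventCounter())

# Stream phase: the same update() handles each newly arriving event.
agg.update({"ts": 7, "name": "view"})
print(agg.counts["view"], agg.counts["pay"])  # 2 1
```

Since both phases go through the same `update` method, a result computed over history continues seamlessly into the live stream.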

4. Consistent materialized views

Materialized views are a common optimization in OLAP query engines. There are usually two ways to implement them: the view is either kept consistent with the base table data or refreshed periodically. The Shence analysis engine uses consistent materialized views, which lets us achieve a 10-fold improvement in the performance of common queries while maintaining data consistency.
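The difference between the two approaches is easiest to see in code. The Python sketch below shows one simple way to keep a view consistent, by updating it in the same write path as the base table. The `EventTable` class is purely illustrative; the article does not describe how the engine maintains its views.

```python
class EventTable:
    """A base table plus a pre-aggregated view maintained on every insert."""

    def __init__(self):
        self.rows = []           # base table: (day, event_name)
        self.daily_counts = {}   # materialized view: (day, event_name) -> count

    def insert(self, day: str, event_name: str) -> None:
        # The view is updated together with the base table, so a reader
        # never sees a view that lags behind the data.
        self.rows.append((day, event_name))
        key = (day, event_name)
        self.daily_counts[key] = self.daily_counts.get(key, 0) + 1

    def count_by_scan(self, day: str, event_name: str) -> int:
        """Answer from the base table: cost grows with table size."""
        return sum(1 for row in self.rows if row == (day, event_name))

    def count_by_view(self, day: str, event_name: str) -> int:
        """Answer from the view: a single dictionary lookup."""
        return self.daily_counts.get((day, event_name), 0)
```

A periodically refreshed view would instead rebuild `daily_counts` on a schedule, and queries between refreshes could return stale counts.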

5. Complete data security system

To protect enterprise data, the Shence analysis engine applies several security measures. First, the engine provides complete access control at the table, row, and column levels, so that only authorized users can read the corresponding data, protecting its privacy and confidentiality. Second, for scenarios with higher security requirements, the engine supports enabling KMS (Key Management Service) based encryption for all underlying storage services, so that data is always encrypted at rest and guarded against potential security threats.
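Row-level and column-level access control can be thought of as a filter and a projection applied before results reach the user. The following Python sketch shows that idea only; `apply_policy` and its parameters are hypothetical and are not part of the engine's API.

```python
def apply_policy(rows, allowed_columns, row_filter):
    """Return only the rows and columns a user is allowed to see.

    rows:            list of dicts, one per record.
    allowed_columns: column names the user may read (column-level control).
    row_filter:      predicate deciding row visibility (row-level control).
    """
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_filter(row)
    ]


rows = [
    {"user_id": "u1", "region": "north", "phone": "redacted-1"},
    {"user_id": "u2", "region": "south", "phone": "redacted-2"},
]
visible = apply_policy(rows, ["user_id", "region"],
                       lambda r: r["region"] == "north")
print(visible)  # [{'user_id': 'u1', 'region': 'north'}]
```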

6. General performance optimization

As a C++ query engine with end-to-end CodeGen support, the Shence analysis engine has significant advantages in processing complex queries. In addition, through serving 2,000+ customers we have accumulated extensive optimization experience and introduced fine-grained optimizations such as expression pre-computation, pruning of unnecessary JOINs, regular expression caching, and Bucket Join to further improve performance in complex business scenarios.

It is also worth noting that, after extensive instruction-set-level adaptation work, the Shence analysis engine runs well on domestic x86 and ARM chips and delivers good performance.

3. The Shence analysis engine efficiently empowers business operations

With the Shence analysis engine, enterprises can handle key business scenarios such as metric lookup and analytical insight more efficiently. Counting earlier versions, the Shence analysis engine has provided solid support for the digital operations of 2,000+ customers across the finance, brand retail, Internet, and enterprise sectors.

Take a customer that makes Internet tools as an example. It adds tens of billions of new records every day and runs thousands of queries per day on average. Under this load, the Shence analysis engine performs very well: P95 latency is about 3 seconds for metric lookup queries, 30 seconds for analytical queries, and 36 seconds for raw SQL queries. Similarly, an e-commerce customer adds tens of billions of records per day and runs nearly 10,000 queries per day on average, with P95 latency ranging from a few seconds to tens of seconds depending on the usage scenario.

Cases like these demonstrate the capabilities of the Shence analysis engine in large-scale data processing and high-frequency query scenarios. The engine provides strong data support for enterprises in the digital era, helping them understand their business situation in real time, make accurate decisions, and operate efficiently.


Origin blog.csdn.net/sensorsdata/article/details/132074164