Billion-row data, second-level response | Guanyuan Data's "Extreme-Speed Analysis Engine"

From Excel and reporting systems to traditional BI, enterprise data analysis tools have kept evolving, but the volume of data they must support has grown even faster.


[Figure: the data volume each analysis tool is suited to handle]


Take a chain retailer as an example. With 2,000 stores and 5,000 SKUs on sale, per-store item inventory data amounts to 10 million rows a day, about 70 million in a week, and passes 100 million in under two weeks.
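A quick sanity check on those figures, as a minimal Python sketch (the store and SKU counts are simply the illustrative numbers from the example):

```python
# Back-of-the-envelope arithmetic for the chain-retail example above;
# the store and SKU counts are the illustrative figures from the text.
stores = 2_000
skus = 5_000

rows_per_day = stores * skus           # one inventory row per store-SKU pair
rows_per_week = rows_per_day * 7

print(f"{rows_per_day:,} rows/day")    # 10,000,000
print(f"{rows_per_week:,} rows/week")  # 70,000,000
print(f"{100_000_000 / rows_per_day:.0f} days to pass 100M rows")  # 10
```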


To keep performance in step with the growth of enterprise data, users need to run smooth drag-and-drop analysis and dynamic queries over datasets of hundreds of millions or even a billion rows, without putting extra data-management and operations pressure on IT staff. To that end, Guanyuan Data began researching an acceleration component for computing and querying massive data in 2019, and officially launched the "Extreme-Speed Analysis Engine" in March 2020, delivering true second-level responses over billion-row data.


"Extreme Speed ​​Analysis Engine" is a set of computing query acceleration components embedded in Guanyuan's one-stop intelligent data analysis platform, which supports the fastest response speed of more than one billion levels of data in cluster mode. It is suitable for data analysis of large data volume, large width table, and high concurrency in the retail industry, such as aggregation analysis and query of massive inventory data, order analysis, and commodity analysis. It can meet the continuous exploratory self-service analysis, ad hoc query, and dynamic analysis needs of business personnel, maintain a coherent analytical thinking, create an immersive analysis experience, dig deep into the value of data, and efficiently understand the business.


How fast is the "Speed ​​Analysis Engine"? We did a performance test in a laboratory environment. The test machine is a single node with 16 cores and 128G memory, and no independent deployment of acceleration components (actually, acceleration components can be deployed separately, the acceleration effect is more obvious).


Demo 1: high-speed query demo (click to watch the video)

  


In the case above, we simulated a retail customer's aggregate analysis of sales volume, sales amount, and cost over order-item detail data, for an arbitrary time range.


The tables on the left and right run the same aggregate analysis over an order detail table of 100 million rows. The difference: the left table uses a Guan-Index dataset computed by the Spark engine, while the right table uses a "high-performance query table" whose queries are accelerated by the "Extreme-Speed Analysis Engine". When the date range is switched, the right table returns results in roughly 2 to 3 seconds, while the left takes around 10 seconds. That is an overall 3 to 5x improvement: a second-level response over 100 million rows.
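For context, the query being timed is essentially a date-range filter plus a three-measure aggregation. Here is a minimal pandas sketch of the same shape of query; the data is toy-sized and the column names (order_date, qty, amount, cost) are hypothetical, not Guanyuan's actual schema:

```python
import pandas as pd

# A toy stand-in for the 100-million-row order detail table used in the demo.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-02-03"]),
    "qty":    [3, 1, 5],            # sales volume (units)
    "amount": [30.0, 12.5, 60.0],   # sales amount
    "cost":   [18.0, 7.0, 35.0],    # cost of goods sold
})

# Each switch of the date range on the card re-runs the equivalent of this:
start, end = "2020-01-01", "2020-01-31"
in_range = orders[orders["order_date"].between(start, end)]
print(in_range[["qty", "amount", "cost"]].sum())
```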


Demo 2: free drag-and-drop analysis of 100 million rows of data (click to watch the video)

 


Using the same data, we then tested free drag-and-drop analysis. As the demo shows, even free-form drag-and-drop exploration of the 100-million-row order detail data achieves second-level responses and a smooth experience.


How do you use such a powerful feature?



When you import a Guan-Index dataset of ten million rows or more, or generate a dataset of that size through Smart ETL, and want the "Extreme-Speed Analysis Engine" to accelerate queries against it, the setup takes roughly three steps.


1. Configure the dataset


On the dataset detail page, go to the "Advanced Options" section and configure the dataset as a "high-performance query table".


2. Set the partition field


Next, set the partition field. Partitioning shards the data sensibly at storage time, so that queries scan less of it. A date field is generally recommended as the partition key, with the granularity set to "month" or "day"; a date key keeps the number of partitions under control, so partitions end up neither too coarse nor too fine. If there is no date field, pick another field with care, keeping its number of distinct values in check. Never use a unique serial number such as an order ID, or an arbitrary numeric field, as the partition key.
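A small sketch of why this matters, on synthetic data with hypothetical column names: the partition count equals the number of distinct values of the key, so a date truncated to month or day stays manageable, while a unique ID produces one partition per row:

```python
import pandas as pd

# One million rows of synthetic order data, one row per minute.
df = pd.DataFrame({
    "order_id": range(1_000_000),  # unique per row
    "order_date": pd.date_range("2020-01-01", periods=1_000_000, freq="min"),
})

# Partition count = distinct values of the chosen partition key.
print(df["order_date"].dt.to_period("M").nunique())  # 23 partitions by month
print(df["order_date"].dt.floor("D").nunique())      # 695 partitions by day
print(df["order_id"].nunique())                      # 1,000,000 partitions!
```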





3. Confirm execution


After configuring the partition field, click "Confirm" to start the mode switch. For a large dataset the data import takes some time, so please be patient: in internal testing, importing a dataset of 300 million rows by 26 columns took about 12 minutes. Dataset updates also trigger a full re-import, so we generally recommend refreshing a high-performance query table no more than once a day.
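As a rough planning aid, that quoted figure implies an import throughput of roughly 400,000 rows per second (simple arithmetic on the numbers above, not a guaranteed rate):

```python
# Rough import throughput implied by the internal test quoted above.
rows, cols, minutes = 300_000_000, 26, 12
print(f"{rows / (minutes * 60):,.0f} rows/s for a {cols}-column table")
# 416,667 rows/s for a 26-column table
```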


Below is an ETL output dataset configured as a "high-performance query table". It looks identical to an ordinary ETL output dataset, but when we build cards on it, queries run through the "Extreme-Speed Analysis Engine", giving a blazing-fast experience.


[Screenshot: an ETL output dataset configured as a high-performance query table]


What scenarios does the "Extreme-Speed Analysis Engine" apply to?



At present, the "high-performance query table" is intended for datasets of 10 million rows or more, where it can greatly speed up queries on the card side. It is especially suited to OLAP queries over massive data: aggregation and slicing (filtering) along any dimension of a large, wide table, and it can also serve detail-level queries. Compared with running the same queries directly on the Spark engine, it generally delivers a 3 to 5x performance improvement; with ample hardware resources, the acceleration components can be deployed independently for an even better speed experience.
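To make that pattern concrete, here is a toy pandas sketch of the slice-then-aggregate OLAP query described above (synthetic data, hypothetical column names; the real engine executes this server-side, not in pandas):

```python
import pandas as pd

# The target OLAP pattern: slice (filter) on one dimension of a wide table,
# then aggregate along any other.
sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West"],
    "category": ["Drinks", "Snacks", "Drinks", "Snacks"],
    "amount":   [100.0, 80.0, 120.0, 60.0],
})

east = sales[sales["region"] == "East"]          # slice
print(east.groupby("category")["amount"].sum())  # aggregate by any dimension
```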




Source: blog.51cto.com/14689762/2489088