【Data Warehouse】Ad hoc query

Definition

An ad hoc query lets users choose query conditions freely according to their own needs, and the system generates the corresponding statistical report from those choices. The biggest difference from an ordinary application query is that an ordinary application query is custom-developed in advance, while an ad hoc query is defined by conditions the user specifies at query time.

Features

The biggest difference between an ad hoc query and a fixed query is that a fixed query is custom-developed (the query statement is written in advance and does not change at run time), while an ad hoc query is defined by conditions the user chooses and can be changed at any time.
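As a minimal sketch of this difference, the Python snippet below uses the standard-library sqlite3 module. The orders table, its columns, and the ad_hoc_query helper are assumptions made up for illustration. The fixed query is a constant string, while the ad hoc query is assembled from whatever conditions the user picks at run time.

```python
import sqlite3

# Hypothetical order data; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, category TEXT, month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("North China", "electronics", "December", 456),
        ("East China", "food", "November", 489),
        ("North China", "food", "February", 159),
    ],
)

# Fixed query: the statement is written in advance and never changes.
FIXED_QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"

# Only these columns may appear in a user-defined condition.
ALLOWED_COLUMNS = {"region", "category", "month"}

def ad_hoc_query(conditions):
    """Total the amount for whatever filter conditions the user picked at run time."""
    for column in conditions:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")
    where = " AND ".join(f"{column} = ?" for column in conditions)
    sql = "SELECT SUM(amount) FROM orders"
    if where:
        sql += " WHERE " + where
    return conn.execute(sql, list(conditions.values())).fetchone()[0]

fixed_result = conn.execute(FIXED_QUERY).fetchall()
ad_hoc_result = ad_hoc_query({"region": "North China", "category": "food"})
print(fixed_result)
print(ad_hoc_result)
```

Column names are checked against a whitelist and values are passed as bound parameters, so the user-supplied conditions never end up as raw text in the SQL string.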

For the relationship between ad hoc query and OLAP, refer to the following diagram:
[Figure: relationship diagram of ad hoc query and OLAP]

Differences and connections between OLTP, OLAP, and ad hoc query

At present, data processing falls into two main categories: OLTP (On-Line Transaction Processing) and OLAP (On-Line Analytical Processing).

OLTP: This is the main application of traditional relational databases, covering basic day-to-day transaction processing such as bank ATM deposits and withdrawals or real-time updates of financial securities. The operations are relatively simple, mostly DML operations on the data in a database. The operators are generally the users of the product. OLTP is strongly transactional and usually runs as a highly available online system, as in the banking and finance examples above.

OLAP: Sometimes also called a DSS (decision support system), it is closely related to what we call a data warehouse. It lets analysts observe information quickly, consistently, and interactively from many angles in order to understand the data in depth. Conclusions (such as reports) are drawn by analyzing the data in the data warehouse (DW) along different dimensions, that is, by looking at facts from a dimensional perspective, which is why OLAP is also called multidimensional analysis.

The comparison between the two is shown in the table below:

| Attribute | OLTP (e.g. MySQL) | OLAP (e.g. Hive) |
|---|---|---|
| Operation object | Database | Database |
| Read characteristics | Each query returns only a small number of records | Summarizes a large number of records |
| Write characteristics | Random, low-latency writes of user input | Batch import of large amounts of data |
| Data freshness | Latest current data | Aggregated current and historical data |
| Operation granularity | Record level | Multi-table join analysis |
| Typical users | End users, Java EE projects | Data analysts, supporting corporate decision-making |
| Work content | Simple transactions | Complex queries |
| Time requirement | Real-time | Offline data warehouse and real-time data warehouse |
| Data volume | GB | TB to PB |
| Data manipulation | Supports DDL and DML | Updates and deletes are generally not supported |
| Main function | Query or change state | Reports, statistics, forecasting |
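The contrast in read and write characteristics can be sketched with Python's standard-library sqlite3 module. The accounts table and its rows are hypothetical, and a single in-memory SQLite database stands in for both kinds of system here, which real deployments would keep separate.

```python
import sqlite3

# Hypothetical account data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "alice", 500), (2, "bob", 300), (3, "carol", 900)],
)

# OLTP-style: a short transaction that touches a single record (an ATM withdrawal).
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = ?", (1,))
oltp_row = conn.execute("SELECT owner, balance FROM accounts WHERE id = ?", (1,)).fetchone()

# OLAP-style: a read-only summary computed over all records.
olap_summary = conn.execute(
    "SELECT COUNT(*), SUM(balance), AVG(balance) FROM accounts"
).fetchone()

print(oltp_row)
print(olap_summary)
```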

Ad hoc query, however, is generally compared with OLAP. In a data warehouse, we usually process data in batches (typically the previous day's data), and the analysis indicators for that fixed data are clearly defined. For example, the following table shows the data of an order table:

| Order ID | Order region | Order category | Order time | Order amount |
|---|---|---|---|---|
| 1001 | North China | electronics | December | 456 |
| 1002 | East China | food | November | 489 |
| 1003 | Southwest | home | February | 491 |
| 1004 | Northeast | electronics | April | 659 |
| 1005 | Northwest | pet | November | 369 |
| 1006 | North China | food | February | 159 |

What is a clear analysis indicator? As described above, it means looking at facts from the perspective of dimensions (facts are measured values; in this table, the amount). The table has three dimensions (region, category, and time), so the amount can be measured by 7 dimension combinations, listed in the table below.

| Dimensions | Metric | Analysis indicator |
|---|---|---|
| region | amount | |
| category | amount | |
| time | amount | |
| region, category | amount | |
| region, time | amount | How much was sold in which region and at what time |
| category, time | amount | |
| region, category, time | amount | How much was sold in which region, in which category, and at what time |

As the table above shows, this is what it means to analyze a measure from different dimensions.
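Three dimensions give 2^3 - 1 = 7 non-empty combinations. The short Python sketch below enumerates them and builds one GROUP BY statement per combination. The table name orders, the column names, and both helper functions are assumptions chosen for illustration.

```python
from itertools import combinations

# The three dimensions of the order table (column names are assumed).
DIMENSIONS = ["region", "category", "month"]

def dimension_combinations(dimensions):
    """Return every non-empty combination of dimensions, smallest first."""
    result = []
    for size in range(1, len(dimensions) + 1):
        result.extend(combinations(dimensions, size))
    return result

def group_by_sql(dims, table="orders", measure="amount"):
    """Build the GROUP BY statement for one dimension combination."""
    cols = ", ".join(dims)
    return f"SELECT {cols}, SUM({measure}) FROM {table} GROUP BY {cols}"

combos = dimension_combinations(DIMENSIONS)
for dims in combos:
    print(group_by_sql(dims))
```

Each printed statement corresponds to one row of the dimension table: the same measure, grouped by a different set of dimensions.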

Analysis in a data warehouse generally follows fixed routines, for example:

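-- orders and amount are placeholder names for the order table and its amount column
select area, time, sum(amount) as total_amount from orders group by area, time;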

This type of query is also known as a fixed (solidified) query.

It refers to recurring, well-defined needs for fetching and reading data, which are ultimately delivered to users as data products, improving the efficiency of data analysis and operations. The SQL for this kind of need basically follows a fixed pattern.

Unfortunately, things do not always stay that way. Suppose you have the fixed queries above well in hand, and then your boss suddenly arrives with a request that does not fit any of the fixed SQL patterns. We call this kind of request an ad hoc query.

Summary of ad hoc and fixed queries:

In terms of SQL statements, there is no essential difference between an ad hoc query and a fixed query. The difference is that fixed queries are known when the system is designed and implemented, so they can be optimized during implementation with indexes, partitioning, and similar techniques, which makes them very efficient. Ad hoc queries, by contrast, arise from users' needs of the moment and cannot be tuned by hand in advance. They generally depend on the database optimizing them automatically at run time, so ad hoc query performance is also an important indicator for evaluating a data warehouse. The more ad hoc queries a data warehouse system serves, the higher the demands on the data warehouse and on the symmetry of its data model.
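The effect of optimizing in advance can be observed on a small scale with SQLite's EXPLAIN QUERY PLAN, used here through Python's standard-library sqlite3 module. This is only a sketch: the orders table, the index name, and the query_plan helper are assumptions, and SQLite stands in for a real warehouse engine. The fixed query filters on a column that was indexed ahead of time, while the ad hoc query filters on a column nobody anticipated.

```python
import sqlite3

# Hypothetical order data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, category TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("North China", "electronics", 456),
        ("East China", "food", 489),
        ("North China", "food", 159),
    ],
)

# The fixed query filters on region, which is known at design time,
# so an index can be built for it in advance.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")

def query_plan(sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)  # the 4th column holds the plan text

fixed_plan = query_plan("SELECT SUM(amount) FROM orders WHERE region = ?", ("North China",))
ad_hoc_plan = query_plan("SELECT SUM(amount) FROM orders WHERE category = ?", ("food",))

print("fixed :", fixed_plan)
print("ad hoc:", ad_hoc_plan)
```

The exact plan text varies between SQLite versions, but the fixed query should report a search using idx_orders_region, while the ad hoc query falls back to scanning the whole table.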

Finally, why not use Hive for ad hoc queries?
The goal of an ad hoc query is clear: it must be fast. What you ask for is what you get, meaning you see the result right after raising the request. Traditional Hive in a data warehouse is not suited to this, because it runs queries as MapReduce (MR) jobs; by the time the job finishes, it may well be dark outside.

Related frameworks

1. Druid: An OLAP database that processes time-series data in real time. Its index is first partitioned by time, and queries are routed to the index along the timeline.
2. Kylin: The core concept is the Cube, a precomputation technique. The basic idea is to build multidimensional indexes over the data in advance, combining different dimensions into the cubes that queries may need. Meaningless dimension combinations can be pruned. This reduces the data volume, and a query scans only the index without touching the raw data, which speeds it up.
3. Presto: It does not use MapReduce and is an order of magnitude faster than Hive in most scenarios. The key is that all processing is done in memory. It supports multiple data sources and can join across different data sources.
4. Impala: Based on in-memory computing and fast, but it supports fewer data sources than Presto.
5. SparkSQL: The Spark module for processing structured data. It provides the DataFrame and DataSet abstractions and serves as a distributed SQL query engine. It can also integrate with Hive, using the Spark engine to read Hive's metadata and operate on the data in Hive.
6. ClickHouse: It does not rely on any third-party components and uses columnar storage. It supports multiple storage engines, so users can choose a different engine for each table. Its underlying layer also implements a vectorized execution engine.
7. Doris: It does not rely on any third-party components and is also a columnar database. It uses the MySQL protocol and is compatible with MySQL syntax, so a Doris database can be queried with MySQL syntax. Newer versions also implement a vectorized execution engine.
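The precomputation idea described for Kylin can be illustrated with a toy Python sketch. This is not Kylin's API or storage format: the rows, the build_cube and lookup helpers, and the choice of pruned combination are all assumptions. Each dimension combination is aggregated once in advance, one combination is pruned, and a query is then answered by a dictionary lookup without rescanning the rows.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative rows; dimension names are assumptions.
ROWS = [
    {"region": "North China", "category": "electronics", "month": "December", "amount": 456},
    {"region": "East China", "category": "food", "month": "November", "amount": 489},
    {"region": "North China", "category": "food", "month": "February", "amount": 159},
]
DIMENSIONS = ("region", "category", "month")

def build_cube(rows, dimensions, pruned=()):
    """Pre-aggregate the amount for every dimension combination not listed in `pruned`."""
    cube = {}
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            if dims in pruned:
                continue  # skip combinations that nobody is expected to query
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row["amount"]
            cube[dims] = dict(totals)
    return cube

def lookup(cube, **filters):
    """Answer a query from the precomputed aggregates without rescanning the rows."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube[dims].get(tuple(filters[d] for d in dims), 0)

cube = build_cube(ROWS, DIMENSIONS, pruned={("category", "month")})
north_food = lookup(cube, region="North China", category="food")
print(len(cube), north_food)
```

The trade-off is visible even at this scale: lookups are cheap, but a query on the pruned combination (category, month) raises a KeyError because nothing was precomputed for it.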

Kylin

Ad hoc query - Kylin

Sources

Baidu Encyclopedia-ad hoc query
OLTP, OLAP, ad hoc query (ad hoc query) difference and connection

Origin blog.csdn.net/weixin_44231544/article/details/130940859