Definition
An ad hoc query lets users choose query conditions flexibly according to their own needs, and the system generates the corresponding statistical report from those choices. The biggest difference from an ordinary application query is that an ordinary application query is custom-developed in advance, whereas an ad hoc query is defined by conditions the user specifies.
Features
The biggest difference between an ad hoc query and a fixed query is that a fixed query is custom-developed (the query statement is written in advance and does not change on the fly), whereas an ad hoc query is defined by conditions the user specifies and can change at any time.
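The contrast can be sketched in a few lines of Python with the standard-library `sqlite3` module. This is only an illustration under assumptions: the `orders` table, its columns, and the helper `build_ad_hoc_query` are hypothetical names invented for the example. The fixed query is a string written in advance, while the ad hoc query is assembled at run time from whatever conditions the user selects.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions for illustration.
ROWS = [
    (1001, "North China", "electronics", 456),
    (1002, "East China", "food", 489),
    (1006, "North China", "food", 159),
]
ALLOWED_COLUMNS = {"region", "category"}  # columns a user may filter on

# Fixed query: written once by a developer and never changed at run time.
FIXED_QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"


def build_ad_hoc_query(conditions):
    """Build a parameterized query from user-selected {column: value} conditions."""
    for column in conditions:
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported filter column: {column}")
    where = " AND ".join(f"{column} = ?" for column in conditions)
    sql = "SELECT SUM(amount) FROM orders"
    if where:
        sql += " WHERE " + where
    return sql, list(conditions.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, category TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", ROWS)

fixed_result = conn.execute(FIXED_QUERY).fetchall()

# The user picks conditions at query time; nothing about them is known in advance.
sql, params = build_ad_hoc_query({"region": "North China", "category": "food"})
ad_hoc_result = conn.execute(sql, params).fetchone()[0]

print(fixed_result)
print(sql, params, ad_hoc_result)
```

Column names are checked against a whitelist and values are passed as bound parameters, so user input never ends up spliced into the SQL text.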
For ad hoc query and OLAP, you can refer to the relationship diagram in the following figure:
Differences and connections between OLTP, OLAP, and ad hoc query
At present, data processing falls into two main categories: OLTP (On-Line Transaction Processing) and OLAP (On-Line Analytical Processing).
OLTP: This is the main application of traditional relational databases. It handles basic, day-to-day transactions, typically bank ATM deposits and withdrawals, real-time updates of financial securities, and so on. These operations are relatively simple and consist mainly of DML operations on the data in the database. The operator is generally the end user of the product. OLTP is strongly transactional and usually runs as a highly available online system, as in the banking and financial examples above.
OLAP: Sometimes also called a DSS (decision support system), this is close to what we usually call a data warehouse. It lets analysts observe information from many aspects of the data quickly, consistently, and interactively, so as to understand the data in depth. Conclusions (such as reports) are drawn by analyzing the data in the data warehouse (DW) from different dimensions, that is, by looking at facts from the perspective of dimensions, which is why OLAP is also called multidimensional analysis.
The comparison between the two is shown in the table below:
Attribute | OLTP (such as MySQL) | OLAP (such as Hive) |
---|---|---|
Operation object | Database | Data warehouse |
Read characteristics | Each query returns only a small number of records | Summarizes a large number of records |
Write characteristics | Random, low-latency writes of user input | Batch import of large amounts of data |
Data timeliness | The latest current data | Aggregated current and historical data |
Operation granularity | Record level | Multi-table join analysis |
Usage scenario | End users, Java EE projects | Data analysts, supporting corporate decision-making |
Workload | Simple transactions | Complex queries |
Time requirement | Real-time | Split into offline and real-time data warehouses |
Data volume | GB | TB to PB |
Data manipulation | Supports DDL and DML | Updates and deletes are generally not supported |
Main function | Querying or changing state | Reports, statistics, forecasting |
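The two access patterns in the table can be illustrated with a small Python sketch. It assumes a hypothetical `account` table, and it uses SQLite only as a stand-in to show the shape of each operation; SQLite is not an OLAP engine. The OLTP-style operation is a short transaction on a single record, while the OLAP-style operation is a read-only query that summarizes many records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only to contrast the two access patterns.
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 300), (3, 200)])

# OLTP-style operation: a short transaction that touches one record.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "UPDATE account SET balance = balance - 100 WHERE account_id = ?", (1,)
    )
one_row = conn.execute(
    "SELECT balance FROM account WHERE account_id = ?", (1,)
).fetchone()

# OLAP-style operation: a read-only query that summarizes many records.
total, average = conn.execute(
    "SELECT SUM(balance), AVG(balance) FROM account"
).fetchone()

print(one_row, total, average)
```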
Ad hoc queries, however, are usually compared with OLAP. In a data warehouse, data is generally processed in batches (basically the previous day's data), and for a fixed set of data there are clear analysis indicators. For example, the table below holds the data of an order table:
Order ID | Order region | Order category | Order time | Order amount |
---|---|---|---|---|
1001 | North China | Electronics | December | 456 |
1002 | East China | Food | November | 489 |
1003 | Southwest | Home goods | February | 491 |
1004 | Northeast | Electronics | April | 659 |
1005 | Northwest | Pet | November | 369 |
1006 | North China | Food | February | 159 |
What is a clear analysis indicator? As noted above, we look at facts from the perspective of dimensions (a fact is a measure; in this table, the amount). With three dimensions (region, category, and time), the amount can be measured along 7 dimension combinations, listed in the table below.
Dimension | Measure | Analysis indicator |
---|---|---|
region | amount | … |
category | amount | … |
time | amount | … |
region, category | amount | … |
region, time | amount | Which region sold how much, and when |
category, time | amount | … |
region, category, time | amount | In which region, in which category, and when did the sales amount decrease? |
The table above shows what it means to analyze a measure from different dimensions.
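The count of 7 comes from taking every non-empty subset of the three dimensions (3 + 3 + 1). A minimal Python sketch of that enumeration follows, running one GROUP BY per combination over the order data with `sqlite3`. The table name `orders` and the column names `region`, `category`, `month`, and `amount` are assumptions chosen for this example.

```python
import sqlite3
from itertools import combinations

DIMENSIONS = ("region", "category", "month")  # column names are assumptions

# The order table from the text, with English column names chosen for this sketch.
ORDERS = [
    (1001, "North China", "Electronics", "December", 456),
    (1002, "East China", "Food", "November", 489),
    (1003, "Southwest", "Home goods", "February", 491),
    (1004, "Northeast", "Electronics", "April", 659),
    (1005, "Northwest", "Pet", "November", 369),
    (1006, "North China", "Food", "February", 159),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, region TEXT, category TEXT, month TEXT, amount INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", ORDERS)

# Every non-empty subset of the three dimensions: 3 + 3 + 1 = 7 combinations.
dimension_sets = [
    combo
    for size in range(1, len(DIMENSIONS) + 1)
    for combo in combinations(DIMENSIONS, size)
]

results = {}
for combo in dimension_sets:
    columns = ", ".join(combo)  # safe: names come from the fixed DIMENSIONS tuple
    sql = f"SELECT {columns}, SUM(amount) FROM orders GROUP BY {columns}"
    results[combo] = conn.execute(sql).fetchall()

for combo, rows in results.items():
    print(combo, len(rows), "groups")
```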
Analysis in a data warehouse generally follows fixed routines, for example:
select region, time, sum(amount) from orders group by region, time;
This type of query is also known as a fixed (solidified) query.
It refers to recurring, solidified needs for fetching and reading data, which are ultimately delivered to users as data products, improving the efficiency of data analysis and operations. The SQL for this kind of requirement basically follows a fixed pattern.
Unfortunately, things do not always go to plan. Suppose you have the fixed queries above well in hand, and then one day the boss turns up with a request that falls outside those fixed SQL patterns. We call this kind of request an ad hoc query.
Summary of ad hoc and fixed queries:
In terms of SQL statements, there is no essential difference between an ad hoc query and a fixed query. The difference is that fixed queries are known when the system is designed and implemented, so they can be optimized at implementation time with techniques such as indexing and partitioning, which makes them very efficient. Ad hoc queries, by contrast, arise from users' needs of the moment and cannot be optimized by hand in advance. They generally rely on real-time automatic optimization inside the database, which is why ad hoc query performance is also an important indicator for evaluating a data warehouse. The more a data warehouse system is used for ad hoc queries, the higher the demands on the warehouse, and the higher the demands on the symmetry of the data model.
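The point about optimizing in advance can be made concrete with a small Python sketch using `sqlite3` and SQLite's `EXPLAIN QUERY PLAN` statement. The `orders` table, the index name `idx_orders_region`, and the helper `query_plan` are assumptions for illustration. An index is built for the filter that the fixed query is known to use, and an ad hoc filter on a different column has no such index to rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, category TEXT, amount INTEGER)"
)
# The fixed query is known to filter by region, so an index is built in advance.
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")


def query_plan(sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Known at design time: the planner can use the prepared index.
fixed_plan = query_plan("SELECT amount FROM orders WHERE region = ?", ("North China",))
# Arrives unannounced: no index exists for this column, so the table is scanned.
ad_hoc_plan = query_plan("SELECT amount FROM orders WHERE category = ?", ("Food",))

print(fixed_plan)
print(ad_hoc_plan)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should mention the index and the second should report a full scan.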
Finally, why not use Hive for ad hoc queries?
The goal of an ad hoc query is very clear: speed. You get what you ask for, meaning you see the result as soon as you raise the request. Traditional Hive on a data warehouse is definitely not up to this: by the time MapReduce (MR) has finished running over the data, it is probably dark outside.
Related frameworks
1. Druid: An OLAP database that processes time-series data in real time. Its index is first partitioned by time, and queries are likewise routed to the index along the timeline.
2. Kylin: The core concept is the Cube, a pre-computation technique. The basic idea is to build multidimensional indexes over the data in advance, combining different dimensions into the cubes that queries may need. Meaningless dimension combinations can be pruned. This reduces the amount of data, and queries are faster because they scan only the index without touching the original data.
3. Presto: It does not use MapReduce and is an order of magnitude faster than Hive in most scenarios. The key is that all processing is done in memory. It supports multiple data sources and can join across different data sources.
4. Impala: Based on in-memory computing and fast, but it supports fewer data sources than Presto.
5. SparkSQL: The Spark module for processing structured data. It provides the DataFrame and Dataset abstractions and serves as a distributed SQL query engine. It can also work with Hive, using the Spark engine to read Hive's metadata and operate on the data in Hive (this arrangement is usually called Spark on Hive, whereas Hive on Spark means Hive using Spark as its execution engine).
6. ClickHouse: It does not rely on any third-party components and uses columnar storage. It supports multiple storage engines, so users can choose a different engine for each table. Its underlying layer also implements a vectorized execution engine.
7. Doris: It also does not rely on any third-party components and is likewise a column-oriented database. It uses the MySQL protocol and is compatible with MySQL syntax, so Doris can be queried with MySQL syntax. Newer versions also implement a vectorized engine.
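The pre-computation idea described for Kylin can be sketched as a toy in plain Python. This is not Kylin's API or its storage format; the functions `build_cube` and `query_cube`, the field names, and the choice of pruned combination are all hypothetical. The sketch pre-aggregates the amount for every dimension combination that has not been pruned, then answers queries by lookup instead of rescanning the rows.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "category", "month")  # field names are assumptions

ORDERS = [
    {"region": "North China", "category": "Electronics", "month": "December", "amount": 456},
    {"region": "East China", "category": "Food", "month": "November", "amount": 489},
    {"region": "Southwest", "category": "Home goods", "month": "February", "amount": 491},
    {"region": "Northeast", "category": "Electronics", "month": "April", "amount": 659},
    {"region": "Northwest", "category": "Pet", "month": "November", "amount": 369},
    {"region": "North China", "category": "Food", "month": "February", "amount": 159},
]


def build_cube(rows, dimensions, pruned=()):
    """Pre-aggregate the total amount for every dimension combination not in `pruned`."""
    cube = {}
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            if combo in pruned:
                continue  # skip combinations that nobody is expected to query
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in combo)
                totals[key] += row["amount"]
            cube[combo] = dict(totals)
    return cube


def query_cube(cube, **filters):
    """Answer a query from the pre-computed totals without rescanning the rows."""
    combo = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in combo)
    return cube[combo].get(key, 0)


# Assume, for illustration, that nobody queries by (category, month).
cube = build_cube(ORDERS, DIMENSIONS, pruned={("category", "month")})
north_china_total = query_cube(cube, region="North China")
east_china_food = query_cube(cube, region="East China", category="Food")

print(len(cube), north_china_total, east_china_food)
```

In this toy version a query on a pruned combination raises `KeyError`, which reflects the trade-off: pruning saves space and build time, at the cost of not being able to answer those queries from the cube.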
Sources
Baidu Encyclopedia-ad hoc query
OLTP, OLAP, ad hoc query (ad hoc query) difference and connection