Apache Hive Federation Query (Query Federation)


Original article: https://www.iteblog.com/archives/2524.html

Today, many companies run multiple data storage and processing systems internally, each solving a different class of use cases. In addition to traditional RDBMSs (such as Oracle DB, Teradata, or PostgreSQL), we may use Apache Kafka to ingest streaming and event data, Apache Druid to process real-time time-series data, and Apache Phoenix for fast indexed lookups. We may also store batch data in HDFS or a cloud storage service.
The platform team typically deploys all of these systems so that application developers can freely choose whatever capabilities their business analytics require.

Unified access using Apache Hive 3.0 and SQL

But we also know that if we need to join data from different storage sources, we traditionally have to extract the data and land it in a single storage medium, for example HBase, before performing the join. This kind of data fragmentation makes cross-source analysis painful. If a single query engine could read data from the different sources and join it directly, that would be a great efficiency improvement. This is exactly what the JDBC Storage Handler introduced in this article provides; see HIVE-1555 for details.
As the name JdbcStorageHandler suggests, its role is similar to that of HBaseStorageHandler: it lets Hive read data stored in different systems through standard JDBC. For example, we can read data from MySQL and from Phoenix in Hive and then join the two, with efficient, unified SQL access out of the box. The benefits are significant:

  • Single SQL dialect and API
  • Unified security control and audit trail
  • Unified governance
  • Ability to combine data from multiple sources
  • Data independence

It should be noted that JdbcStorageHandler currently only supports reading from JDBC data sources; writing to them is not supported. A cross-source join is sketched below.
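To make the last two points concrete, here is a minimal sketch of a cross-source join, assuming a hypothetical JDBC-backed external table mysql_orders (created via JdbcStorageHandler, as shown later in this article) and an ordinary Hive table hive_customers; all table and column names are illustrative:

    -- Hive reads mysql_orders over JDBC and joins it with a native table,
    -- all in a single SQL dialect.
    SELECT c.customer_name,
           SUM(o.order_amount) AS total_amount
    FROM mysql_orders o        -- external table backed by JdbcStorageHandler
    JOIN hive_customers c      -- ordinary Hive table
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_name;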

JdbcStorageHandler supports CBO

When reading from a JDBC data source with JdbcStorageHandler, the naive approach is to read the full table and load all of it into Hive. That is simple, but it creates obvious performance problems.
For this reason, Hive relies on the storage handler interfaces and Apache Calcite's cost-based optimizer (CBO) to implement intelligent operator push-down. Filters and other operators are pushed down to the JDBC data source, so filtering happens at the source and only the (much smaller) result is returned to Hive, reducing data transfer and improving query efficiency.
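To see whether push-down actually happened, we can inspect the plan with EXPLAIN; a minimal sketch against the hypothetical mysql_orders table from above:

    -- With push-down, the JDBC scan in the plan carries a generated
    -- source-side query that already contains the WHERE clause; without
    -- it, Hive scans the full table and filters on its own side.
    EXPLAIN
    SELECT order_id, order_amount
    FROM mysql_orders
    WHERE order_amount > 100;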
Operator push-down is not limited to SQL systems. For example, operators can be pushed to Apache Druid or Apache Kafka. When querying data in Apache Druid, Hive can push filtering and aggregation down to Druid by generating JSON queries and sending them to the REST API the engine exposes. Likewise, when querying data in Kafka, Hive can filter directly on the relevant partitions and offsets and selectively read data from the topic.
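For comparison, here is a minimal sketch of a Kafka-backed external table; the topic name, broker address, and columns are illustrative placeholders:

    -- External table over a Kafka topic. Hive exposes implicit metadata
    -- columns (__partition, __offset, __timestamp) that it can filter on
    -- to avoid reading the whole topic.
    CREATE EXTERNAL TABLE kafka_events (
      event_id string,
      event_type string,
      payload string
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
      "kafka.topic" = "events",
      "kafka.bootstrap.servers" = "localhost:9092"
    );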
Suppose we have three tables, store_sales, store_returns, and date_dim, in MySQL or PostgreSQL, and we run the following query:
[Figure: example query joining store_sales, store_returns, and date_dim]
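The query itself appears only as an image in the original; a minimal sketch of a query of this shape, assuming standard TPC-DS-style column names (an assumption, since the figure is not recoverable):

    -- Join sales with returns and restrict by date; the filters and the
    -- joins are candidates for push-down to the JDBC source.
    SELECT d.d_year,
           SUM(ss.ss_ext_sales_price) AS total_sales
    FROM store_sales ss
    JOIN store_returns sr
      ON ss.ss_item_sk = sr.sr_item_sk
     AND ss.ss_ticket_number = sr.sr_ticket_number
    JOIN date_dim d
      ON ss.ss_sold_date_sk = d.d_date_sk
    WHERE d.d_year = 2000
    GROUP BY d.d_year;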
The execution plan of the above SQL before optimization is as follows:

[Figure: execution plan before operator push-down]
The gray boxes are executed in MySQL or PostgreSQL, and the orange ones in Hive. As the figure shows, all three table scans ship their data straight back to Hive for processing, which is very inefficient. With operator push-down, the execution plan after optimization by Apache Calcite's CBO looks like this:
[Figure: execution plan after CBO operator push-down]

The SQL that is actually sent to the JDBC data source is as follows:
[Figure: pushed-down SQL executed at the JDBC data source]
These operations are executed directly on the JDBC data source; Hive then runs a JDBC_Scan over the result and writes it to the corresponding sink.

How to use JdbcStorageHandler

Having said all that, how do we actually use JdbcStorageHandler? We need to create an external table in Hive, as follows:
[Figure: CREATE EXTERNAL TABLE statement using JdbcStorageHandler]
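The statement appears only as an image in the original; a minimal sketch of such a CREATE statement, with the connection URL, credentials, and source table name as illustrative placeholders:

    -- Hive external table backed by a MySQL table via JdbcStorageHandler.
    CREATE EXTERNAL TABLE student_jdbc (
      name string,
      age  int,
      gpa  double
    )
    STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
    TBLPROPERTIES (
      "hive.sql.database.type" = "MYSQL",
      "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
      "hive.sql.jdbc.url"      = "jdbc:mysql://localhost/sample",
      "hive.sql.dbcp.username" = "hive",
      "hive.sql.dbcp.password" = "hive",
      "hive.sql.table"         = "STUDENT"
    );

Once the table exists, it can be queried like any other Hive table (e.g. SELECT * FROM student_jdbc WHERE age > 20), with eligible operators pushed down to MySQL.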
As shown above, the CREATE TABLE statement currently has to declare the schema of the JDBC table. HIVE-21060 introduces automatic schema discovery for JDBC-based external tables, so that we no longer have to declare the columns in the CREATE TABLE command.
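With that feature, the same table could be declared without listing its columns; a minimal sketch under the same placeholder properties, assuming the schema is discovered from the source:

    -- Columns are discovered from the JDBC source instead of being declared.
    CREATE EXTERNAL TABLE student_jdbc
    STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
    TBLPROPERTIES (
      "hive.sql.database.type" = "MYSQL",
      "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
      "hive.sql.jdbc.url"      = "jdbc:mysql://localhost/sample",
      "hive.sql.dbcp.username" = "hive",
      "hive.sql.dbcp.password" = "hive",
      "hive.sql.table"         = "STUDENT"
    );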
HIVE-21059 tracks support for external catalogs: a catalog created in the Metastore that points at an external MySQL database, through which all of that database's tables become directly queryable from Hive.
