Connect to HiveMetaStore and embrace open source big data

This article is shared from the Huawei Cloud Community article "Connecting to HiveMetaStore and Embracing Open Source Big Data". Author: Sleeping is a big deal.

1 Introduction

  • Applicable versions: 9.1.0 and above

In the era of big data fusion analytics, performance is the most important consideration when a data processing engine faces massive data volumes and complex queries. The GaussDB (DWS) service has a powerful computing engine whose performance is better than that of engines such as Hive or Spark in the MRS service, and it can meet the business needs for high elasticity and agility at a lower cost. By linking DWS with MRS, data does not need to be relocated. Using the high-performance computing engine of DWS to process and analyze massive data in the data lake, as well as to handle complex query and analysis services, is increasingly becoming a mainstream solution.

We can connect to the HiveMetaStore metadata service by creating an external schema, so that GaussDB (DWS) can directly query Hive/Spark tables or insert data into them. There is no need to create separate foreign tables for reading or writing, and there is no need to worry about GaussDB (DWS) failing to pick up table definition changes in time when the definition of a Hive/Spark table changes.

This article mainly describes how to configure and use the connection between GaussDB (DWS) and HiveMetaStore.

2. Brief analysis of principles

2.1 What is HiveMetaStore

HiveMetaStore is a key component of Apache Hive. It is a metadata repository used to manage the metadata of Hive/Spark tables. HiveMetaStore stores the structural information of Hive tables, including table names, column names, data types, partition information, and so on. It also stores the location information of a table, that is, where the table data is stored. The main function of HiveMetaStore is to provide metadata services so that Hive/Spark can query and analyze data. It also provides APIs that allow developers to access table metadata programmatically. In short, HiveMetaStore is an important component of Hive that provides metadata management and query services.

An external schema is an external mapping schema. GaussDB (DWS) connects to the HiveMetaStore service by creating an external schema and actively obtains the metadata of the Hive/Spark table objects for each query, so the GaussDB (DWS) kernel does not need to obtain the metadata of Hive/Spark tables through CREATE FOREIGN TABLE.

2.2 The difference between external schema and schema

1. An external schema is mainly used to establish a connection with HiveMetaStore and obtain table object metadata. When creating an external schema, you need to specify the attribute values required for the connection.

2. After an ordinary schema is created, its information is recorded in pg_namespace. An external schema is also recorded in pg_namespace after creation, just like an ordinary schema; you can distinguish an external schema from an ordinary schema through the nsptype field in pg_namespace (see the query sketch after this list).

Note: In addition to the information stored in pg_namespace, the connection configuration of an external schema is recorded in pg_external_namespace.

3. Creating table objects under an external schema is not supported. Table objects are created in Hive or Spark, and the external schema is only used to perform DML operations on them.
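For reference, a minimal sketch of the check described in point 2, assuming an external schema named ex1 has already been created (the schema name here is illustrative):

SELECT nspname, nsptype FROM pg_namespace WHERE nspname = 'ex1';  -- nsptype distinguishes an external schema from an ordinary schema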

2.3 Principle description

The process of GaussDB (DWS) docking with HiveMetaStore is as follows.

1. Create the server and the external schema, and issue the SQL query.

Before using this feature, users need to create a server. The process is the same as the existing server creation process.

There are two ways to create an OBS server. One is to create it with a permanent AK and SK. (The prerequisite is that a permanent AK and SK can be obtained, but this method is not secure: the AK/SK are directly exposed in the configuration file and must be entered in plain text when creating the server. Creating servers in this way is not recommended.)

The other is to create the OBS server through the management plane by binding an agency on the cloud that allows DWS to access OBS. For how to create an OBS server through an agency on the management plane, refer to the guide on creating an OBS server for foreign tables: https://support.huaweicloud.com/mgtg-dws/dws_01_1602.html

Create external schema:

The external schema creation syntax is

CREATE External Schema ex 
WITH SOURCE hive
DATABASE 'default'
SERVER hdfs_server
METAADDRESS '10.254.159.121:9010'
CONFIGURATION '/home/fengshuo/conf2';

The SOURCE field specifies the type of external metadata storage engine, DATABASE is the corresponding database name in Hive, SERVER is the server created in step 1, METAADDRESS is the address and port information provided by Hive, and CONFIGURATION is the path to Hive and Kerberos related configuration files.

The goal of the external schema is to connect to external metadata (Foreign Meta) so that DWS can actively perceive changes in the external metadata.

GaussDB (DWS) connects to HiveMetaStore through external schema, maps to the corresponding table metadata, and then accesses Hadoop through the table.

SQL query: The select query format is select * from ex.tbl, where tbl is the name of the foreign source table and ex is the created external schema.

2. Syntax parsing: the syntax parsing layer is mainly responsible for the following:

After recognizing the ex.tbl table reference, it connects to HMS (HiveMetaStore) to query metadata.

3. Metadata query: query metadata information from HMS; this query is initiated during the parsing in step 2.

The data read from HMS mainly includes column information, partition information, partition key information, delimiter information, and so on.

4. Data query (for SELECT): obtain statistics such as the number of files and file sizes from the DFS storage to provide a basis for plan generation.

5. Query rewriting, query optimization, query execution

6. Query delivery: send the metadata to the DNs along with the plan. After a DN receives the plan, it decodes the metadata and inserts it into SysCache.

7. Query execution: the DN accesses the corresponding OBS files and executes the query.

3. Interconnection process with HiveMetaStore

3.1 Prepare the environment

A DWS 3.0 cluster and an MRS analysis cluster have been created. Ensure that the MRS and DWS clusters are in the same region, availability zone, and VPC subnet so that their networks can communicate with each other.

The AK and SK have been obtained.

3.2 Create the table to be connected on the Hive side

1. In the /opt/client directory, import the environment variables:
source bigdata_env

2. Log in to the Hive client.

3. Execute the following SQL statements in sequence to create the demo database and target table product_info.
CREATE DATABASE demo;
use demo;
DROP TABLE product_info;
 
CREATE TABLE product_info 
(    
    product_price                int            ,
    product_id                   char(30)       ,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    int            ,
    product_comment_time         date           ,
    product_comment_num          int        ,
    product_comment_content      varchar(200)                   
) 
row format delimited fields terminated by ',' 
stored as orc;
4. Import data into the hive table through insert
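For example, a single row can be inserted as follows (a hedged sketch; the values are made-up sample data and are not from the original article):

-- Illustrative sample row only
INSERT INTO product_info VALUES
(100, 'P001', date '2023-01-01', 'A', 'Sample product', 'type1', 'sub1', 10, date '2023-02-01', 5, 'good product');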

3.3 Create external server

Use Data Studio to connect to the created DWS cluster.

MRS supports two storage types, HDFS and OBS. The way of creating an external server for Hive docking differs between these two scenarios.

Execute the following statements to create an OBS external server.

CREATE SERVER obs_server FOREIGN DATA WRAPPER DFS_FDW 
OPTIONS 
(
    address 'obs.xxx.com:5443',  -- OBS access address.
    encrypt 'on',
    access_key '{AK value}',
    secret_access_key '{SK value}',
    type 'obs'
);
Execute the following statements to create an HDFS external server.
CREATE SERVER hdfs_server FOREIGN DATA WRAPPER HDFS_FDW OPTIONS (
      TYPE 'hdfs',
      ADDRESS '{primary node},{standby node}',
      HDFSCFGPATH '{hdfs configuration file address}');

Hard-coding the AK and SK used for authentication into code or storing them in plain text poses great security risks. It is recommended to store them in ciphertext in a configuration file or environment variable and decrypt them when used. In addition, DWS encrypts the SK internally, so there is no need to worry about the SK being leaked during transmission.

Check the external server (using OBS as an example).
SELECT * FROM pg_foreign_server WHERE srvname='obs_server';

The return result is as follows, indicating that it has been created successfully:

 srvname    | srvowner | srvfdw | srvtype | srvversion | srvacl |                                      srvoptions
------------+----------+--------+---------+------------+--------+--------------------------------------------------------------------------------------
 obs_server |    16476 |  14337 |         |            |        | {address=obs.xxx.com:5443,type=obs,encrypt=on,access_key=***,secret_access_key=***}
(1 row)

3.4 Create EXTERNAL SCHEMA

Obtain the internal IP and port of Hive's metastore service and the name of the Hive-side database to be accessed.

Log in to the MRS management console.

Select "Cluster List > Existing Cluster", click the name of the cluster you want to view, and enter the cluster basic information page.

Click "Go to manager" in the operation and maintenance management office, and enter your username and password to log in to the FI management page.

Click "Cluster", "Hive", "Configuration", "All Configurations", "MetaStore", and "Port" in order, and record the value corresponding to the parameter hive.metastore.port.

Click "Cluster", "Hive", and "Instance" in sequence, and record the management IP of the host name corresponding to MetaStore that contains master1.

CREATE EXTERNAL SCHEMA

-- Hive docking OBS scenario: for SERVER, fill in the name of the external server created in 3.3; for DATABASE, fill in the database created on the Hive side; for METAADDRESS, fill in the address and port of the Hive MetaStore service recorded above; CONFIGURATION is the default configuration path of the MRS data source and does not need to be changed.
DROP SCHEMA IF EXISTS ex1;
 
CREATE EXTERNAL SCHEMA ex1
    WITH SOURCE hive
         DATABASE 'demo'
         SERVER obs_server
         METAADDRESS '***.***.***.***:***'
         CONFIGURATION '/MRS/gaussdb/mrs_server';
 
-- Hive docking HDFS scenario: for SERVER, fill in the data source name mrs_server created when the MRS data source connection was created; for METAADDRESS, fill in the address and port of the Hive MetaStore service recorded above; CONFIGURATION is the default configuration path of the MRS data source and does not need to be changed.
DROP SCHEMA IF EXISTS ex1;
 
CREATE EXTERNAL SCHEMA ex1
    WITH SOURCE hive
         DATABASE 'demo'
         SERVER mrs_server
         METAADDRESS '***.***.***.***:***'
         CONFIGURATION '/MRS/gaussdb/mrs_server';

View the created EXTERNAL SCHEMA:

SELECT * FROM pg_namespace WHERE nspname='ex1';
SELECT * FROM pg_external_namespace WHERE nspid = (SELECT oid FROM pg_namespace WHERE nspname = 'ex1');
 nspid | srvname    | source | address             | database | confpath | ensoptions | catalog
-------+------------+--------+---------------------+----------+----------+------------+---------
 16393 | obs_server | hive   | ***.***.***.***:*** | demo     | ***      |            |
(1 row)

3.5 Import data into the Hive table

Create a local source table whose structure is consistent with the Hive table.
DROP TABLE IF EXISTS product_info_export;
CREATE TABLE product_info_export
(
    product_price                integer        ,
    product_id                   char(30)       ,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    integer        ,
    product_comment_time         date           ,
    product_comment_num          integer        ,
    product_comment_content      varchar(200)                   
) ;
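For illustration, a few sample rows can be loaded into the local source table before exporting (a hedged sketch; the values are made up and are not from the original article):

-- Illustrative sample rows only
INSERT INTO product_info_export VALUES
(100, 'P001', '2023-01-01', 'A', 'Sample product', 'type1', 'sub1', 10, '2023-02-01', 5, 'good product'),
(200, 'P002', '2023-01-02', 'B', 'Another product', 'type2', 'sub2', 20, '2023-02-02', 8, 'nice');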

Import Data

Write the data from the local source table into the Hive table.

INSERT INTO ex1.product_info SELECT * FROM product_info_export;
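After the import, the write can be verified by querying the Hive table through the external schema (an illustrative check, not part of the original article):

SELECT count(*) FROM ex1.product_info;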

3.6 Import data from the Hive table into a DWS table

Import Data

Import the data from the Hive table into the local DWS table.
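The target table product_info_orc_export must already exist in DWS. A minimal sketch of creating it, assuming it shares the structure of the local table product_info_export (this statement is an assumption and is not part of the original article):

-- Assumption: reuse the structure of product_info_export for the target table
CREATE TABLE IF NOT EXISTS product_info_orc_export (LIKE product_info_export);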

INSERT INTO product_info_orc_export SELECT * FROM ex1.product_info;

4 Summary

This article mainly explains the principles and methods of connecting GaussDB (DWS) with HiveMetaStore.

Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~
