Data Warehouse Tool Hive: Basics (1)

1. Introduction

Hive is a data warehouse tool built on top of Hadoop.

  • It maps structured data files to database tables
  • It provides an SQL-like query language, HQL (Hive Query Language), and converts HQL statements into MapReduce jobs for execution
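As a sketch of what this means in practice (the table and column names below are illustrative, not from any real dataset), an aggregation that would otherwise require a full MapReduce program is a single HQL statement:

```sql
-- Hypothetical table mapping a directory of tab-separated text files
CREATE TABLE page_views (user_id STRING, url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Hive compiles this GROUP BY into a MapReduce job behind the scenes
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;
```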

2. Advantages and disadvantages

2.1 Advantages

  • Easy to get started: HQL syntax is similar to SQL
  • Unified metadata management: metadata can be shared with Impala, Spark, etc.
  • Good flexibility and extensibility: supports user-defined functions (UDFs), custom storage formats, etc.
  • Supports running on different computing frameworks (MR, Tez, Spark)
  • Scalable: designed for very large data sets (MapReduce is the computing engine and HDFS the storage system); in general, a Hive cluster can be scaled out without restarting the service
  • High execution latency: not suitable for real-time data processing, but well suited to offline processing; stable and reliable in real production environments

Compared with hand-written MapReduce, Hive executes more slowly but is much faster to develop with.


2.2 Disadvantages

(1) HQL's expressive power is limited

  • Iterative algorithms cannot be expressed (complex logic is hard to encapsulate)

  • In data mining, the constraints of the MapReduce processing model (its underlying bottlenecks remain, so it stays slow) make more efficient algorithms impossible to implement

(2) Hive's efficiency is relatively low

1) The MapReduce jobs Hive generates automatically are usually not very smart

2) Hive tuning is difficult and coarse-grained (optimization is only possible at the framework level, not inside the generated MR program)

3) Hive offers poor controllability

3. System Architecture

(Figure: Hive system architecture diagram)
Through the interactive interfaces it exposes to users, Hive receives SQL instructions, uses its Driver, together with the metadata in the MetaStore, to translate them into MapReduce jobs, submits the jobs to Hadoop for execution, and finally returns the results to the user interface.

3.1 Client

The client connects to the Hive Server through one of three common tools:

  • Hive Thrift Client
    connects to Hive over the Thrift communication framework
  • Hive JDBC Driver (commonly used)
    connects JDBC clients to Hive
  • Hive ODBC Driver

3.2 Services

The client connects through a set of interfaces to the Driver, which drives the Hive Server:

  • CLI (command-line interface)
    the most commonly used user interface
  • JDBC/ODBC (JDBC access to Hive)
    maintains communication with the Hive Server from the client, interacting over the Thrift RPC protocol
  • WebUI (browser access to Hive)
    commonly used for development and testing

Other components

  • Metastore
    stores Hive metadata, including table names, the database each table belongs to, the table owner, column/partition fields, table type, table data directory, and so on
  • Driver
    comprises the interpreter, compiler, optimizer, and executor. It parses, compiles, and optimizes the HQL statements we write, generates an execution plan, and then calls the underlying MapReduce computing framework
    > SQL parser: transforms the SQL string (strictly speaking, HiveQL) into an abstract syntax tree (AST)
    > Compiler: compiles the AST into a logical execution plan
    > Logical optimizer: optimizes the logical execution plan
    > Physical executor: converts the logical execution plan into an executable physical plan; common engines are MR/Tez/Spark
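To see the plan these components produce, you can run EXPLAIN on any query (the table name here is illustrative):

```sql
-- Prints the stage graph (e.g. a map/reduce stage plus a fetch stage)
-- that Hive's compiler and optimizer generated for this query
EXPLAIN SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url;
```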

3.3 Underlying Layer

Hive's underlying storage is HDFS, and its underlying computation is converted into MapReduce jobs.

  • Hive metastore database
    the metadata database
  • Hadoop cluster
    runs the MapReduce computation

4. Hive Metadata Management

The Metastore has two parts: the service and the backing storage. Hive metadata storage supports three deployment modes: embedded mode, local mode, and remote mode.

◼Embedded mode

  • Also called single-user mode, this is the simplest way to deploy the Hive Metastore: it uses Hive's embedded Derby database to store metadata.
  • However, Derby accepts only one Hive session at a time; attempting to start a second Hive session causes the Metastore connection to fail.
  • Suitable only for testing and demonstration

◼Local mode (commonly used): in actual production, metadata is generally stored in MySQL.
Local mode is also called multi-user mode and is the Metastore's default. In this mode, each Hive session runs the Metastore as a component in the same process as the Driver, while the metadata lives in an external database.

  • Modify the configuration file hive-site.xml (database URL, connection driver, user name, and password):
javax.jdo.option.ConnectionURL="jdbc:mysql://{hostname}/{database name}?createDatabaseIfNotExist=true"
javax.jdo.option.ConnectionDriverName="com.mysql.jdbc.Driver"
javax.jdo.option.ConnectionUserName="{userName}"
javax.jdo.option.ConnectionPassword="{userPassword}"
  • Put the MySQL JDBC driver JAR file in Hive's lib directory
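Putting the four properties above together, a minimal hive-site.xml sketch for local mode looks like this ({hostname}, {database name}, {userName}, and {userPassword} are placeholders to fill in):

```xml
<configuration>
  <!-- JDBC URL of the MySQL database holding the metadata -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://{hostname}/{database name}?createDatabaseIfNotExist=true</value>
  </property>
  <!-- JDBC driver class; its JAR must be on Hive's classpath (lib directory) -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>{userName}</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>{userPassword}</value>
  </property>
</configuration>
```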

◼Remote mode
Used when non-Java clients need to access the metadata database. A MetaStore server is started on the server side, and clients access the metadata through it using the Thrift protocol. This separates the Metastore out into an independent Hive service.


5. Hive Operation

Reference: Hive operation, command-line mode and interactive mode

Hive has two commonly used client tools: Beeline and the Hive command line (CLI).
Hive operation falls into two modes: command-line mode and interactive mode.

5.1 HiveServer and HiveServer2

  • Hive ships with both HiveServer and HiveServer2 services. Both allow clients to connect from multiple programming languages, but HiveServer cannot handle concurrent requests from multiple clients, which is why HiveServer2 was created.
  • HiveServer2 (HS2) lets remote clients submit requests to Hive and retrieve results in various programming languages, and supports concurrent access and authentication for multiple clients. HS2 is a single process composed of several services, including a Thrift-based Hive service (TCP or HTTP) and a Jetty web server for the Web UI.
  • HiveServer2 has its own CLI, Beeline, a JDBC client based on SQLLine. Since HiveServer2 is the focus of Hive development and maintenance (HiveServer1 was removed in Hive 1.0.0), the old Hive CLI is no longer recommended; the official recommendation is Beeline.

5.2 Command-Line Mode

hive

  • Use the -e option to execute an SQL statement directly
# note the semicolon at the end of the SQL statement
[root@single ~]# hive -e "show databases;"
  • Use the -f option to execute SQL statements from a specified file, which can be a local file or an HDFS file
[root@single test]# vi hive2.sql
# write "show databases;" into hive2.sql
# save and quit; note the trailing semicolon

# local file
[root@single test]# hive -f hive2.sql
# HDFS file
[root@single test]# hive -f hdfs://single:8020/test/hive2.sql;
  • Enable local execution mode with set hive.exec.mode.local.auto=true;. Local mode only takes effect when the number of reducers is 1 or fewer; the default is -1. Check it with set mapreduce.job.reduces; and set it with set mapreduce.job.reduces=1;.

beeline

  • HiveServer2 must be started first
[root@single ~]# nohup hive --service hiveserver2>/dev/null 2>&1 &
  • Use the -e parameter to directly execute the sql statement
[root@single ~]# beeline -u "jdbc:hive2://single:10000" -e "show databases";
  • Use the -f parameter to execute SQL statements by specifying a text file
[root@single test]# beeline -f hive2.sql -u "jdbc:hive2://single:10000"
  • Suppress the INFO log output on the Beeline page: set hive.server2.logging.operation.level=NONE;
  • Re-enable the INFO log output on the Beeline page: set hive.server2.logging.operation.level=EXECUTION;

5.3 Interactive Mode

hive

(1) Interactive mode

[root@single ~]# hive

View all databases

hive> show databases;

 # the current database name is shown after the hive prompt; effective only for the current session
hive> set hive.cli.print.current.db=true;
hive (default)> 

Create database

hive> create database hivetest;

(2) Use Beeline (requires hiveserver2 to be started)

  1. Start the metastore (optional)
[root@single ~]# nohup hive --service metastore>/dev/null 2>&1 &
  2. Start hiveserver2
[root@single ~]# nohup hive --service hiveserver2>/dev/null 2>&1 &

Connect Beeline to hiveserver2, method ① (you need to enter a user name and password)

[root@single ~]# beeline
...
beeline> !connect jdbc:hive2://single:10000
scan complete in 1ms
Connecting to jdbc:hive2://single:10000
Enter username for jdbc:hive2://single:10000: root
Enter password for jdbc:hive2://single:10000: ****
Connected to: Apache Hive (version 1.1.0-cdh5.14.2)
Driver: Hive JDBC (version 1.1.0-cdh5.14.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://single:10000>

Connect Beeline directly to hiveserver2, method ② (no user name or password required)

[root@single ~]# beeline -u jdbc:hive2://single:10000
...
scan complete in 2ms
Connecting to jdbc:hive2://single:10000
Connected to: Apache Hive (version 1.1.0-cdh5.14.2)
Driver: Hive JDBC (version 1.1.0-cdh5.14.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.14.2 by Apache Hive
0: jdbc:hive2://single:10000> 

6. Data Types

6.1 Basic Data Types

| Data type | Java | MySQL | Hive |
| --- | --- | --- | --- |
| String | String | char(n)/varchar(n)/text/… | string/varchar(65535)/char(255) |
| String | char | | |
| Integer | byte/short/int/long | smallint/int(n)/bigint(n) | smallint/int/bigint |
| Decimal | float/double/BigDecimal | float/double/money/real | float/double |
| Date | java.util.Date | date/datetime/timestamp | date/timestamp |
| Boolean | boolean | bit | boolean |
| List | HashSet | set('V1','V2','V3',…) | array&lt;data_type&gt; |

6.2 Collection Data Types

| Data type | Description | Syntax example |
| --- | --- | --- |
| STRUCT | Similar to a struct in C; element contents are accessed with dot notation | struct&lt;name:STRING, age:INT&gt; |
| MAP | A set of key-value pairs; values are accessed by key using array notation | map&lt;string, int&gt; |
| ARRAY | A collection of elements of the same type; each element has a zero-based index | array&lt;INT&gt; |
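As a sketch of how the three collection types combine in practice (the table name, field names, and delimiters below are illustrative):

```sql
-- Illustrative table using all three collection types
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,                       -- e.g. ["java","sql"]
  scores  MAP<STRING, INT>,                    -- e.g. {"math": 90}
  address STRUCT<city:STRING, street:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '_'
  MAP KEYS TERMINATED BY ':';

-- Access syntax: array by index, map by key, struct by dot notation
SELECT name, skills[0], scores['math'], address.city FROM employees;
```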

6.3 Implicit Conversion

The basic data types in Hive follow a type hierarchy, which permits implicit conversion from a subtype to any of its ancestor types. For example, INT values convert implicitly to BIGINT. Note also that, under this hierarchy, STRING can be implicitly converted to DOUBLE.
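A few illustrative expressions of these rules:

```sql
SELECT 1 + CAST(2 AS BIGINT);  -- the INT widens implicitly; the result type is BIGINT
SELECT '3.14' + 1;             -- the STRING is implicitly converted to DOUBLE
SELECT CAST(3.99 AS INT);      -- narrowing must be explicit; the value truncates to 3
```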

(Figure: Hive implicit type-conversion hierarchy)


Origin blog.csdn.net/weixin_48482704/article/details/110849660