Hadoop Learning of Big Data Technology (7) - Hive Data Warehouse

Table of contents

Materials

1. Introduction to Data Warehouse

1. Understanding of data warehouse

(1) The data warehouse is subject-oriented.

(2) The data warehouse changes with time.

(3) The data warehouse is relatively stable

(4) OLTP and OLAP

2. The structure of the data warehouse

(1) Data source

(2) Data storage and management

(3) OLAP server

(4) Front-end tools

3. Data model of data warehouse

(1) Star model

(2) Snowflake model

(3) Understanding of fact tables and dimension tables

2. Introduction to Hive

1. Understanding of Hive

2. Hive system architecture

(1) User interface

(2) Cross-language service (Thrift Server)

(3) The underlying driving engine

(4) Metadata storage system (Metastore)

3. Working principle of Hive

4. Hive data model

(1) database

(2) table

(3) partition

(4) bucket table

3. Hive installation

1. Understanding of Hive installation mode

(1) Embedded mode

(2) Local mode

(3) Remote mode

2. Download and install Hive

(1) Download

(2) Unzip

(3) Rename the directory

(4) Download the mysql driver package

3. Install MySql

(1) installation

(2) Configuration

4. Configure Hive

(1) Modify the hive-env.sh file

(2) Modify the hive-site.xml file

(3) Configure environment variables

(4) Initialize the data warehouse hive

(5) Synchronize files to other nodes in the cluster

(6) Set the hadoop proxy user so that the root user can log in remotely to access Hive

4. Start Hive

1. Local CLI mode

2. Remote service mode

5. Hive built-in data types

1. Hive basic data types

2. Complex data types of Hive

6. Hive data model operation

1. Database operation

2. Create an internal table

3. Create an external table

4. Hive partition table operation

(1) Hive common partition

(2) Hive dynamic partition

5. Hive bucket table operation

7. Hive data operation

1. Create a table

2. Data query

Reference books


Materials

Link: https://pan.baidu.com/s/1Aj2hGtXsvqYFySnObpGIvQ?pwd=n64i  Extraction code: n64i

1. Introduction to Data Warehouse

1. Understanding of data warehouse

        A data warehouse is a subject-oriented, integrated, time-variant, and relatively stable (non-volatile) collection of data used to support decision-making and analysis in an enterprise or organization.

(1) The data warehouse is subject-oriented.

        The data in an operational database is organized around transaction-processing tasks, whereas the data in a data warehouse is organized by subject domain. The "subject" here is an abstract concept: it refers to the key areas that users focus on when using the data warehouse to make decisions, and a subject is usually related to multiple operational information systems. For example, a product recommendation system is designed on top of the data warehouse, and product information is one subject of that data warehouse.

(2) The data warehouse changes with time.

        A data warehouse is a collection of data gathered at different points in time. The information it holds not only reflects the current operating state of the enterprise, but also records information from some point in the past up to the present. The data retention period of a data warehouse must meet the needs of decision analysis (for example, the past 5 to 10 years), and the data in the warehouse must be labeled with the historical period it belongs to.

(3) The data warehouse is relatively stable

        Data in a data warehouse is rarely updated in place, because the main purpose of a data warehouse is to provide data for decision analysis, and the operations involved are mainly queries. Once a piece of data enters the warehouse, it is generally retained for a long time: query operations are plentiful while modifications and deletions are rare, and the data is usually refreshed only through periodic loading.

(4) OLTP and OLAP

        Data processing can be roughly divided into two categories, namely online transaction processing (OLTP) and online analytical processing (OLAP).

a. OLTP is the main application of traditional relational databases, mainly for basic daily transaction processing, such as bank transfers.

b. OLAP is the main application of the data warehouse system. It supports complex analysis operations, focuses on decision support, and provides intuitive and easy-to-understand query results, such as product recommendation systems.

Comparison between OLTP and OLAP

Comparison item | OLTP | OLAP
User | Operators, low-level managers | Decision makers, senior managers
Function | Daily operations | Analysis and decision support
DB design | Based on the ER model, application-oriented | Star/snowflake schema, subject-oriented
DB size | GB to TB | >= TB
Data | Up-to-date, detailed, two-dimensional, discrete | Historical, aggregated, multidimensional, integrated
Records accessed | Reads/writes a few (up to hundreds of) records | Reads millions (even hundreds of millions) of records
Operating frequency | Very frequent (on the order of seconds) | Relatively infrequent (hours or even weeks)
Unit of work | Strict transactions | Complex queries
Number of users | Hundreds to tens of millions | Several to hundreds
Metric | Transaction throughput | Query throughput, response time

2. The structure of the data warehouse

        The structure of the data warehouse includes four parts, namely data source, data storage and management, OLAP server, and front-end tools.

(1) Data source

        The data source is the foundation of the data warehouse, that is, where the system's data comes from; it usually contains various internal and external information of the enterprise. Internal information includes the business data stored in operational databases and the document data held in automation systems; external information includes laws and regulations, market information, competitor information, external statistics, and other related documents.

(2) Data storage and management

        Data storage and management form the core of the entire data warehouse. The way a data warehouse organizes and manages its data is what distinguishes it from a traditional database and also determines how data is presented to the outside. Existing data in the system is extracted, cleaned, effectively integrated, and then organized by subject. By data coverage, data warehouses can be divided into enterprise-level data warehouses and department-level data warehouses, the latter being the so-called data marts. A data mart can be understood as a small departmental or workgroup-level data warehouse.

(3) OLAP server

        The OLAP server reorganizes the data to be analyzed according to a multidimensional data model, so that users can perform multi-angle, multi-level analysis at any time and discover patterns and trends in the data.

(4) Front-end tools

        Front-end tools mainly include various data analysis tools, reporting tools, query tools, data mining tools, and various applications developed based on data warehouses or data marts.

3. Data model of data warehouse

(1) Star model

        A star schema is one option in dimensional modeling. It consists of one fact table and a set of dimension tables. With the fact table at the center, every dimension table is joined directly to the fact table: the primary key of each dimension table is placed in the fact table as the foreign key linking the two, so each dimension table is related to the fact table. Dimension tables, however, are not joined to one another, so there are no relationships among the dimension tables themselves.
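        A minimal HiveQL sketch of a star schema might look like the following (the table and column names are hypothetical, chosen only for illustration; they are not part of the examples later in this article):

-- Hypothetical star schema sketch: one fact table referencing two dimension tables
create table dim_product(prod_id int, prod_name string, category string);
create table dim_time(time_key string, year int, month int, day int);
-- The fact table stores the measures plus foreign keys to each dimension table
create table fact_order(order_id int, prod_id int, time_key string, quantity int, amount double);
-- Analysis joins the fact table directly to each dimension; dimensions never join each other
select d.category, t.month, sum(f.amount) as total_amount
from fact_order f
join dim_product d on f.prod_id = d.prod_id
join dim_time t on f.time_key = t.time_key
group by d.category, t.month;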

(2) Snowflake model

        The snowflake model is another option in dimensional modeling and an extension of the star schema: a dimension table in a snowflake schema can itself reference further dimension tables, so the dimension tables are related to each other. The snowflake schema is therefore more normalized than the star schema, but because queries must join through multiple levels of dimension tables, its performance is lower than that of the star schema, so it is not used as often.
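        Continuing the same hypothetical sketch, a snowflake variant would normalize the product dimension into a separate category dimension:

-- Hypothetical snowflake variant: the product dimension references a category dimension
create table dim_category(category_id int, category_name string);
create table dim_product_sf(prod_id int, prod_name string, category_id int);
-- A query now has to join through the extra dimension level, which is why
-- snowflake schemas tend to be slower than star schemas
select c.category_name, sum(f.amount) as total_amount
from fact_order f
join dim_product_sf p on f.prod_id = p.prod_id
join dim_category c on p.category_id = c.category_id
group by c.category_name;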

(3) Understanding of fact tables and dimension tables

a. Fact table

        Every data warehouse contains one or more fact tables. A fact table stores the measures of the analysis subject; it contains foreign keys to each dimension table and is associated with the dimension tables through joins. The measures in a fact table are usually numeric, the number of records grows continuously, and the table can quickly become very large. For example, in an order fact table, the field Prod_id (product id) can be joined to the product dimension table, TimeKey (order time) can be joined to the time dimension table, and so on.

b. Dimension table

        Dimension tables can be regarded as the windows through which users analyze data. A dimension table contains the characteristics of the fact records in the fact table: some characteristics provide descriptive information, while others specify how to aggregate the fact data so as to give analysts useful information. Dimension tables contain the hierarchies that help summarize the data. A dimension is a unique perspective for data analysis, and looking at a problem from different perspectives leads to different results. For example, when analyzing product sales, you can choose to analyze by product category or by sales region, which gives a category dimension and a region dimension. The information in a dimension table is relatively fixed and the data volume is small, and the column fields of a dimension table can divide the information into different hierarchical levels.

2. Introduction to Hive

1. Understanding of Hive

        Hive is a data warehouse built on top of the Hadoop file system. It provides a set of tools for extracting, transforming, and loading (ETL) the large-scale data stored in HDFS. Hive defines a simple SQL-like query language called HQL that can map structured data files to tables, allowing users familiar with SQL to query the data and developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis work. Compared with MapReduce jobs written in Java, Hive has obvious advantages in ease of use. Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database; in fact, apart from having a similar query language, Hive and databases have little in common structurally.
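        As a rough illustration of that advantage, assuming a hypothetical table logs with a single string column line, the classic word count that takes dozens of lines of Java MapReduce code reduces to a single HQL statement:

-- Hypothetical example: word count over a table logs(line string)
select word, count(*) as cnt
from (select explode(split(line, ' ')) as word from logs) t
group by word;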

Comparison between Hive and traditional databases

Comparison item | Hive | MySQL
Query language | HQL | SQL
Data storage location | HDFS | Block device, local file system
Data format | User defined | Determined by the system
Data update | Not supported | Supported
Transactions | Not supported | Supported
Execution latency | High | Low
Scalability | High | Low
Data size | Large | Small
Multi-table insert | Supported | Not supported

2. Hive system architecture

        Hive is a data warehouse processing tool that encapsulates Hadoop at the bottom layer. It runs on the basis of Hadoop. Its system framework mainly includes four parts, namely user interface, cross-language service, underlying driver engine, and metadata storage system.

(1) User interface

        The user interface mainly comes in three forms: CLI, JDBC/ODBC, and WebUI. The CLI is the shell command line and the most commonly used method; JDBC/ODBC is a Java implementation similar to the way JDBC is used with a traditional database; WebUI means accessing Hive through a browser.

(2) Cross-language service (Thrift Server)

        Thrift is a software framework developed by Facebook for scalable, cross-language services. Hive integrates this service so that programs written in different programming languages can call the Hive interface.

(3) The underlying driving engine

        The underlying driver engine mainly includes the compiler (Compiler), optimizer (Optimizer), and executor (Executor), which take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and is subsequently executed by MapReduce calls.

(4) Metadata storage system (Metastore)

        Metadata in Hive usually includes table names, columns, partitions and their attributes, as well as the location of the directories where table data is stored. By default, the Metastore keeps this metadata in the built-in Derby database. Because Derby is not suited to multi-user access and its data storage directory is not fixed, which makes it inconvenient to manage, metadata is usually stored in a MySQL database instead.

3. Working principle of Hive

        Hive is built on Hadoop, and the working process between them is roughly as follows.

(1) The UI sends the executed query operation to the Driver for execution.

(2) The Driver, with the help of the query compiler, parses the query and checks the syntax and the query plan or query requirements.

(3) The compiler sends metadata requests to the Metastore.

(4) The Metastore sends the metadata back to the compiler as a response.

(5) The compiler checks the requirements and resends the plan to the Driver. At this point, query parsing and compilation have been completed.

(6) Driver sends the execution plan to the execution engine to execute the Job task.

(7) The execution engine obtains the result set from the DataNode and sends the result to the UI and Driver.

4. Hive data model

        All data in Hive is stored in HDFS, which includes four data types: Database, Table, Partition, and Bucket.

(1) database

        Equivalent to a namespace in a relational database, its function is to isolate users and database applications into different databases or schemas.

(2) table

        A Hive table logically consists of the stored data and the associated metadata that describes the layout of the data in the table. The data is stored in a distributed file system such as HDFS. Hive has two kinds of tables: internal tables, whose data is stored inside the Hive data warehouse directory, and external tables, whose data can be stored in a distributed file system location outside the Hive data warehouse, or inside it. It is worth mentioning that the Hive data warehouse itself is simply a directory in HDFS; this directory is the default path for storing Hive data, it can be changed in the Hive configuration file, and it is ultimately recorded in the metadata database.

(3) partition

        Partitioning is a mechanism for coarsely dividing a table's data according to the values of a "partition column". In Hive storage, a partition appears as a subdirectory under the table's main directory (a Hive table itself is shown as a folder), and the subdirectory is named after the partition column and its value (for example, country=China). Partitioning is designed to speed up data queries. For example, suppose there is a log file in which every record carries a timestamp. If the table is partitioned by time, the data of the same day is placed in the same partition, so querying the data of one day or of a few days becomes very efficient, because only the files in the corresponding partitions need to be scanned.

Note: the partition column is not a field stored in the table's data files; it is an independent pseudo-column, and the data files of the table are located and filtered based on its value at query time.
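A minimal sketch of what this means in practice (the table and partition names here are illustrative only; the mechanics match the partition operations shown later in this article):

-- The partition column 'dt' is declared outside the regular column list
create table logs_p(ip string, msg string)
    partitioned by (dt string)
    row format delimited fields terminated by ',';
-- Loading into partition dt='2023-01-01' creates the HDFS subdirectory .../logs_p/dt=2023-01-01/
-- A query such as
--   select * from logs_p where dt = '2023-01-01';
-- only scans the files under that subdirectory, and 'dt' still appears as a column
-- in the result even though it is not stored inside the data files.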

(4) bucket table

        Simply put, a bucketed table divides a "big table" into "small tables". Tables or partitions are organized into buckets mainly to obtain higher query efficiency, and in particular to make sampling queries more convenient. The bucket is the smallest unit of the Hive data model. When data is loaded into a bucketed table, the value of the bucketing column is hashed, and the remainder of dividing that hash by the number of buckets determines which bucket the row goes to, so the data is spread across all the buckets. Physically, each bucket is a file under the table or partition directory.
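        To make the hashing rule concrete, here is a small worked sketch. It assumes, as in the bucketed-table example later in this article, an int bucketing column Sno and 4 buckets; for an int column the hash is simply the value itself:

-- bucket index = hash(value of bucketing column) % number of buckets
-- For an int column such as Sno, the hash is the value itself, so with 4 buckets:
--   Sno = 95004  ->  95004 % 4 = 0  ->  bucket file 000000_0
--   Sno = 95005  ->  95005 % 4 = 1  ->  bucket file 000001_0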

3. Hive installation

1. Understanding of Hive installation mode

        There are three Hive installation modes: embedded mode, local mode, and remote mode.

(1) Embedded mode

        Use the embedded Derby database to store metadata, which is the default installation of Hive. The configuration is simple, but only one client can be connected at a time, which is suitable for testing and not suitable for production environments.

(2) Local mode

        An external database is used to store metadata. This mode does not need to enable the Metastore service separately, because the local mode uses the Metastore service in the same process as Hive.

(3) Remote mode

        Like the local mode, the remote mode also uses an external database to store metadata. The difference is that the remote mode needs to open the Metastore service separately, and then each client configures and connects to the Metastore service in the configuration file. In remote mode, the Metastore service and Hive run in different processes.

The installation and configuration steps for local mode and remote mode are roughly the same. Essentially, Hive's default metadata store, the built-in Derby database, is replaced with a MySQL database, so that no matter where or how Hive is started, as long as it connects to the same Hive metadata service, the metadata seen by all nodes is consistent, and the metadata is thus shared.
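For reference, in remote mode each client additionally points at the standalone Metastore service through the standard hive.metastore.uris property in hive-site.xml. A minimal sketch, assuming the Metastore runs on hadoop01.bgd01 with the default port 9083 (this entry is not part of the configuration shown later in this article):

<!-- Sketch of a client-side hive-site.xml entry for remote mode; host and default port 9083 are assumptions -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01.bgd01:9083</value>
</property>

The Metastore service itself would then be started separately (for example with hive --service metastore) before clients connect.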

2. Download and install Hive

(1) Download

        Download apache-hive-1.2.2-bin.tar.gz to the /export/software/ directory.

        All Hive releases can be downloaded from http://archive.apache.org/dist/hive/ (the latest version at the time of writing is 3.1.3).

        apache-hive-1.2.2 download address: http://archive.apache.org/dist/hive/hive-1.2.2/

(2) Unzip

        Enter the directory /export/software/, unzip hive-1.2.2 to /export/servers/.

tar -xzvf apache-hive-1.2.2-bin.tar.gz -C /export/servers/

(3) Rename the directory

        Rename the installation directory apache-hive-1.2.2-bin to hive: enter the directory /export/servers/ and execute the following commands

cd /export/servers/
mv apache-hive-1.2.2-bin hive

(4) Download the mysql driver package

        Download the file mysql-connector-java-8.0.20.jar to hive/lib.

3. Install MySql

(1) installation

        There are many ways to install MySQL. Here we install it with local package-manager commands, executing the commands below. Note that MariaDB is a binary-compatible drop-in replacement for the corresponding MySQL version and serves as an enhanced substitute for MySQL.

Download MySQL (the MariaDB packages)
yum install mariadb-server
yum install mariadb-devel
yum install mariadb -y

Check the installation
rpm -qa | grep mariadb

Make sure the following four packages appear
mariadb-5.5.68-1.el7.x86_64
mariadb-server-5.5.68-1.el7.x86_64
mariadb-libs-5.5.68-1.el7.x86_64
mariadb-devel-5.5.68-1.el7.x86_64

(2) Configuration

Enable start on boot
systemctl enable mariadb

Start the database service
systemctl start mariadb

After entering the database, execute the following statements
mysql>use mysql;
mysql>update user set Password=PASSWORD('123456') where user='root';
mysql>grant all PRIVILEGES on *.* to 'root'@'%' identified by '123456' with grant option;
mysql>FLUSH PRIVILEGES;
mysql>quit

 

        Log in again and execute the following command

mysql -uroot -p123456

4. Configure Hive

(1) Modify the hive-env.sh file

cd /export/servers/hive/conf
Copy the template file
cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh configuration file and add the Hadoop-related environment variables as follows:
export JAVA_HOME=/export/servers/jdk
export HADOOP_HOME=/export/servers/hadoop-2.10.1 

# Since global Hadoop environment variables were already configured when Hadoop was deployed, the two lines above can be omitted.
export HIVE_HOME=/export/servers/hive
export HIVE_CONF_DIR=/export/servers/hive/conf
export HIVE_AUX_JARS_PATH=/export/servers/hive/lib

 

(2) Modify the hive-site.xml file

Create the configuration file
vi hive-site.xml 

Add the following configuration
<configuration>
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01.bgd01:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>hive.execution.engine</name>
    <value>mr</value>
    <description>
      Expects one of [mr, tez, spark].
      Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
      remains the default engine for historical reasons, it is itself a historical engine
      and is deprecated in Hive 2 line. It may be removed without further warning.
    </description>
  </property>
</configuration>

(3) Configure environment variables

vi /etc/profile

Add the following two lines:
export HIVE_HOME=/export/servers/hive
export PATH=$PATH:$HIVE_HOME/bin

(4) Initialize the data warehouse hive

bin/schematool -dbType mysql -initSchema

 Initialization successful

(5) Synchronize files to other nodes in the cluster

Copy the Hive installation on hadoop01 to the hadoop02 and hadoop03 servers
scp -r /export/servers/hive/ hadoop02.bgd01:/export/servers/
scp -r /export/servers/hive/ hadoop03.bgd01:/export/servers/

Synchronize the global environment configuration file
scp /etc/profile hadoop02.bgd01:/etc/
scp /etc/profile hadoop03.bgd01:/etc/

(6) Set the hadoop proxy user so that the root user can log in remotely to access Hive

Edit the Hadoop configuration file core-site.xml
vi /export/servers/hadoop-2.10.1/etc/hadoop/core-site.xml 

Add the following configuration
<!-- Configure the Hadoop proxy user -->
    <property>
        <!-- Groups the proxied users may belong to -->
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>

    <property>
        <!-- Hosts from which the proxy user root may access the HDFS cluster (any node) -->
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>

4. Start Hive

1. Local CLI mode

Start Hive
Execute the following command:
hive
The prompt changes to:
hive>

List the databases in the data warehouse
hive>show databases;

List the tables in the data warehouse
hive>show tables;

List the built-in functions in the data warehouse
hive>show functions;

Clear the screen
hive>!clear

Exit
hive>exit;
hive>quit;

2. Remote service mode

Start the HiveServer2 service on hadoop01
hiveserver2

Note: after executing the command above, nothing is displayed. However, if you open another terminal and run jps, an extra RunJar process will appear.

From the Hive installation directory on the hadoop02 server, run the remote connection command to connect to the Hive data warehouse server
(if you only have one server, you can open another local terminal for this demonstration)
// Enter the remote connection command
bin/beeline

// The following information is displayed
Beeline version 2.3.9 by Apache Hive
beeline> 

// Enter the connection URL as follows
beeline> !connect jdbc:hive2://hadoop01.bgd01:10000
// The connecting message is displayed
Connecting to jdbc:hive2://hadoop01.bgd01:10000

// Enter the username and password of the Hive server hadoop01 when prompted
Enter username for jdbc:hive2://hadoop01.bgd01:10000: root
Enter password for jdbc:hive2://hadoop01.bgd01:10000: ********
// The connection to the Hive server is established
Connected to: Apache Hive (version 2.3.9)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://hadoop01.bgd01:10000> 

Operate the data warehouse
You can now run data warehouse commands just as in CLI mode

List the databases in the data warehouse
0: jdbc:hive2://hadoop01.bgd01:10000> show databases;

List the tables in the data warehouse
0: jdbc:hive2://hadoop01.bgd01:10000> show tables;

List the built-in functions in the data warehouse
0: jdbc:hive2://hadoop01.bgd01:10000> show functions;

Exit
0: jdbc:hive2://hadoop01.bgd01:10000>!exit
0: jdbc:hive2://hadoop01.bgd01:10000>!quit

 

5. Hive built-in data types

1. Hive basic data types

Data type | Description
TINYINT | 1-byte signed integer, -128 to 127
SMALLINT | 2-byte signed integer, -32768 to 32767
INT | 4-byte signed integer, -2^31 to 2^31 - 1
BIGINT | 8-byte signed integer, -2^63 to 2^63 - 1
FLOAT | 4-byte single-precision floating-point number
DOUBLE | 8-byte double-precision floating-point number
DOUBLE PRECISION | Alias for DOUBLE, available since Hive 2.0
DECIMAL | Arbitrary-precision signed decimal
NUMERIC | Alias for DECIMAL, available since Hive 3.0
TIMESTAMP | Timestamp with nanosecond precision
DATE | Date expressed as year/month/day
INTERVAL | Represents a time interval
STRING | String
VARCHAR | Variable-length string, similar to STRING
CHAR | Fixed-length string
BOOLEAN | Stores true (TRUE) and false (FALSE) values
BINARY | Byte array

2. Complex data types of Hive

Data type | Description
ARRAY | An ordered collection of fields; all fields must be of the same type
MAP | An unordered collection of key-value pairs; the key must be an atomic type and the value can be any type; within one map, all keys must have the same type and all values must have the same type
STRUCT | A collection of named fields; the fields can be of different types
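
As a quick illustration, all three complex types can appear together in one table definition. This is a hypothetical table, not one used in the exercises below:

-- Hypothetical table mixing the three complex types
create table t_complex(
    name string,
    scores array<int>,                          -- ordered fields of one type
    contacts map<string, string>,               -- key-value pairs
    address struct<city:string, street:string>  -- named fields of different types
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';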

6. Hive data model operation

        Create a directory /export/data/hivedata to store local data files        

1. Database operation

Create the database itcast
hive>create database if not exists itcast;

View database information
hive> describe database itcast;
The following information is displayed:
OK
+----------+----------+-------------------------------------------+-------------+-------------+-------------+--+
| db_name  | comment  |                 location                  | owner_name  | owner_type  | parameters  |
+----------+----------+-------------------------------------------+-------------+-------------+-------------+--+
| itcast   |          | hdfs://ns1/user/hive/warehouse/itcast.db  | root        | USER        |             |
+----------+----------+-------------------------------------------+-------------+-------------+-------------+--+


Here, hdfs://ns1/user/hive/warehouse/itcast.db is the path on the HDFS file system where the database itcast is stored.

Switch to the database
hive>use itcast;

2. Create an internal table

(1) Create a table with basic data types
hive>create table t_user(id int, name string, age int) 
    row format delimited fields terminated by ',';

Upload the structured data file user.txt
hadoop fs -put user.txt /user/hive/warehouse/itcast.db/t_user

Query the table records
hive>select * from t_user;

(2) Create a table with complex data types
hive>create table t_student(id int, name string, hobby map<string, string>) 
    row format delimited fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':';

Upload the structured data file student_map.txt
hadoop fs -put student_map.txt /user/hive/warehouse/itcast.db/t_student

Query the table records
hive>select * from t_student;

(3) View the table schema
hive>describe t_student;
The following information is displayed:
OK
id                  	int                 	                    
name                	string              	                    
hobby               	map<string,string>  	

3. Create an external table

Create the directory /stu on HDFS
hadoop fs -mkdir /stu

Upload the local file student.txt to the /stu directory in HDFS
hadoop fs -put student.txt /stu

Create the external table student_ext
hive>create external table student_ext(Sno int, Sname string, Sex string, Sage int, Sdept string) 
  row format delimited fields terminated by ',' location '/stu'; 

Query the records in table student_ext
hive>select * from student_ext;

4. Hive partition table operation

(1) Hive common partition

Create a partitioned table
hive>create table t_user_p(id int, name string) 
    partitioned by (country string)
    row format delimited fields terminated by ',';

Load data from the local file system
hive>load data local inpath '/export/data/hivedata/user_p.txt' into table t_user_p
    partition(country='USA'); 

Query the records in the table
hive>select * from t_user_p;

hive>select * from t_user_p where country='USA';

Add a partition
hive>alter table t_user_p ADD PARTITION (country='China') location '/user/hive/warehouse/itcast.db/t_user_p/country=China';

Query the records in the table
hive>select * from t_user_p where country='China';

Rename a partition
hive>alter table t_user_p PARTITION (country='China') RENAME TO PARTITION (country='Japan');

Drop a partition
hive>alter table t_user_p drop IF EXISTS PARTITION (country='Japan');

(2) Hive dynamic partition

Enable dynamic partitioning
hive>set hive.exec.dynamic.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive>set mapreduce.framework.name=local; # needs to be set on Hive 2.x

Create the source table
hive>create table dynamic_partition_table(day string, ip string) 
    row format delimited fields terminated by ',';

Load the data file into the source table
hive>load data local inpath '/export/data/hivedata/dynamic_partition_table.txt' 
    into table dynamic_partition_table;

Create the target table
hive>create table d_p_t(ip string) partitioned by (month string, day string);

Insert with dynamic partitioning
hive>insert overwrite table d_p_t partition (month, day)
    select ip, substr(day,1,7) as month, day
    from dynamic_partition_table;

View the partitions of the target table
hive>show partitions d_p_t;
 

5. Hive bucket table operation

Enable bucketing
hive>set hive.enforce.bucketing=true;
hive>set mapreduce.job.reduces=4;

Create the bucketed table
hive>create table stu_buck(Sno int, Sname string, 
  Sex string, Sage int, Sdept string) 
  clustered by(Sno) into 4 buckets
  row format delimited fields terminated by ','; 

Create the temporary table student_tmp for stu_buck
hive>create table student_tmp(Sno int, Sname string, 
  Sex string, Sage int, Sdept string) 
  row format delimited fields terminated by ','; 

Load data into the temporary table
hive>load data local inpath '/export/data/hivedata/student.txt' 
    into table student_tmp;

Insert the data into the bucketed table
hive>insert overwrite table stu_buck
    select * from student_tmp cluster by(Sno) ;

If it is no longer needed, the temporary table student_tmp can be dropped to free storage space
hive>drop table if exists student_tmp ;

View the contents of a bucket file
At the Linux shell prompt, execute the following command:
hadoop fs -cat /user/hive/warehouse/itcast.db/stu_buck/000000_0
The following content is displayed:
95004,张立,男,19,IS
95008,李娜,女,18,CS
95012,孙花,女,20,CS
95016,钱国,男,21,MA
95020,赵钱,男,21,IS

Change 000000_0 to 000001_0, 000002_0, and 000003_0 in turn to display the other bucket files, and compare the output with the contents of the student.txt file.

7. Hive data operation

1. Create a table

(1) Create the emp table
hive>create table emp(empno int, ename string, job string, mgr int, 
  hiredate string, sal double,  comm double, deptno int) 
  row format delimited fields terminated by '\t'; 
(2) Create the dept table
hive>create table dept(deptno int, dname string, loc int) 
  row format delimited fields terminated by '\t'; 

(3) Load data
hive>load data local inpath '/export/data/hivedata/emp.txt' 
    into table emp;
hive>load data local inpath '/export/data/hivedata/dept.txt' 
    into table dept;

2. Data query

Example 1: Basic queries
(1) Query the whole table
hive>select * from emp;
(2) Select specific columns
hive>select deptno, dname from dept;
(3) Count the total number of employees
hive>select count(*) cnt from emp;
(4) Total salary of all employees
hive>select sum(sal) sum_sal from emp;
(5) Return 5 rows from the employee table
hive>select * from emp limit 5;

Example 2: WHERE clause queries
(1) Employees whose salary equals 5000
hive>select * from emp where sal = 5000;
(2) Employees whose salary is between 500 and 1000
hive>select * from emp where sal between 500 and 1000;
(3) Employees whose comm is NULL
hive>select * from emp where comm is null;
(4) Employees whose salary is 1500 or 5000
hive>select * from emp where sal in (1500, 5000);

Example 3: LIKE and RLIKE
(1) Employees whose salary starts with 2
hive>select * from emp where sal like '2%';
(2) Employees whose salary has 2 as its second digit
hive>select * from emp where sal like '_2%';
(3) Employees whose salary contains the digit 2
hive>select * from emp where sal rlike '[2]'; 
# On Hive 2.x this statement fails with a syntax error: Argument type mismatch 'sal': regexp only takes STRING_GROUP types as 1st argument, got DOUBLE

Example 4: GROUP BY
(1) Average salary of each department in emp
hive>select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
(2) Maximum salary for each job within each department in emp
hive>select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

Example 5: HAVING
HAVING is only used with GROUP BY aggregation statements
(1) Average salary of each department
hive>select deptno, avg(sal) from emp group by deptno;
(2) Departments whose average salary is greater than 2000
hive>select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

Example 6: ORDER BY
(1) Query employee information, ordered by salary in descending order
hive>select * from emp order by sal desc;
(2) Ordered by department number and salary in ascending order
hive>select * from emp order by deptno,sal;

Example 7: SORT BY
(1) Set the number of reducers
hive>set mapreduce.job.reduces=3;
(2) Check the number of reducers
hive>set mapreduce.job.reduces;
(3) View employee information sorted by department number in descending order (within each reducer)
hive>select * from emp sort by deptno desc;
(4) Export the query results to local files (sorted by department number in descending order)
hive>insert overwrite local directory '/root/sortby-result'
    select * from emp sort by deptno desc;

Example 8: DISTRIBUTE BY
(1) Set the number of reducers
hive>set mapreduce.job.reduces=3;
(2) Distribute by department number, then sort by employee number in descending order
hive>insert overwrite local directory '/root/distribute-result'
    select * from emp distribute by deptno sort by empno desc;

Example 9: CLUSTER BY
The following two statements are equivalent
hive>select * from emp cluster by deptno;
hive>select * from emp distribute by deptno sort by deptno;


Example 10: JOIN operations
(1) Join emp and dept on equal department numbers and return the employee number, employee name, department number, and department name
hive>select e.empno, e.ename, d.deptno, d.dname
    from emp e join dept d on e.deptno = d.deptno;

(2) Left outer join: all records from the table on the left of the JOIN operator that satisfy the condition are returned
hive>select e.empno, e.ename, d.deptno, d.dname
    from emp e left join dept d on e.deptno = d.deptno;

(3) Right outer join: all records from the table on the right of the JOIN operator that satisfy the condition are returned
hive>select e.empno, e.ename, d.deptno, d.dname
    from emp e right join dept d on e.deptno = d.deptno;

(4) Full outer join: returns all matching records from both tables; if either table has no matching value for the join field, NULL is used instead.
hive>select e.empno, e.ename, d.deptno, d.dname
    from emp e full join dept d on e.deptno = d.deptno;

(5) Use the JOIN operator to return the records whose department number is 20.
hive>select e.empno, e.ename, d.deptno, d.dname
    from emp e full join dept d on e.deptno = d.deptno where d.deptno=20;


1. Print column headers:
set hive.cli.print.header=true;

2. Display one key/value per row (similar to MySQL's \G output):
set hive.cli.print.header=true;
set hive.cli.print.row.to.vertical=true;
set hive.cli.print.row.to.vertical.num=1;

Reference books

"Hadoop big data technology principle and application"

 

 

 

 

 
