Hive data warehouse technology in the Hadoop ecosystem

1. Basic concepts of Hive data warehouse

Hive is also a top open source project on the Apache website. The official website address is: hive.apache.org

Hive technology uses SQL-like language (HiveQL-HQL) to manage, calculate and store distributed data.

Hive is a data warehouse software based on Hadoop, which uses a form similar to the data table in MySQL to manage and calculate massive data.

  • Hive presents data in the form of tables, but Hive itself does not store the data: table data ultimately lives on HDFS. Through a metadata store (the metastore), Hive maps structured files stored on HDFS onto tables for display and querying.
  • The data in Hive's tables can be processed with SQL-like statements (DQL and DDL). Although Hive appears to compute with SQL, under the hood it converts the SQL into MapReduce programs that run on YARN. (Since Hive 1.x, Hive can also translate HQL into Spark and Tez distributed computing programs.)

Hive is built on Hadoop, so each Hive release targets a matching Hadoop version:
Hive 2.x ----> Hadoop 2.x
Hive 3.x ----> Hadoop 3.x

Hive is essentially a client of Hadoop: a client that lets you operate Hadoop through SQL.

2. Hive’s architecture

Hive is equivalent to a client of Hadoop. It presents data stored on HDFS in the form of tables and lets you compute over those tables with a SQL-like language; the SQL-like statements used in the computation are converted under the hood into MapReduce programs that run on YARN.

The reason why Hive can achieve the above functions is mainly because of Hive’s design architecture. Hive as a whole mainly consists of the following parts:

  • 1. Hive Client: the client used to write SQL-like statements to create databases and tables and to issue queries. There are several kinds of Hive clients: the hive command-line client, the Hive Java API (JDBC) client, the Hive web client, and so on.
  • 2. Hive's Driver: the core of Hive. The Driver is what converts SQL-like statements into MR programs. It consists of the following parts:
    • 1. Parser: Abstract the written SQL-like language into a syntax tree and check whether there are any problems with the syntax.
    • 2. Compiler: Generate a logical execution plan from the abstract syntax tree.
    • 3. Optimizer: optimize the execution plan.
    • 4. Executor: Convert the optimized execution plan into real physical execution code, such as computing programs supported by Hive (MR, TEZ, Spark programs).
  • 3. Hive's metastore (metadata): Hive itself does not store any data. The databases, tables, and table structures created in Hive are not stored inside Hive, and the table data lives on HDFS. Information about all of them is recorded as Hive metadata, and that metadata is stored in a relational database (such as MySQL, Oracle, SQL Server, or Derby).
  • 4. Hive's metadata database: it stores the databases, tables, table fields and field types created in Hive, as well as the mapping between tables and their data on HDFS.
    The metadata database is not kept inside Hive but in a relational database such as Derby, MySQL, SQL Server, or Oracle. By default, with no extra configuration, Hive stores metadata in the embedded Derby database (per the hive-default.xml.template defaults).
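    As a rough illustration that the metadata really is just ordinary relational tables: assuming the metastore has been initialized in a MySQL database named hive_metastore (as configured later in this document), it can be inspected with plain SQL. The table and column names below follow the standard Hive metastore schema; treat this as a sketch, not part of the installation steps.

      -- hypothetical inspection of the metastore in MySQL
      USE hive_metastore;
      -- databases known to Hive and their HDFS locations
      SELECT DB_ID, NAME, DB_LOCATION_URI FROM DBS;
      -- tables known to Hive and their types (MANAGED_TABLE / EXTERNAL_TABLE)
      SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE FROM TBLS;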

3. The difference between Hive and database

Hive uses a SQL-like language to compute over massive data, so operating it looks quite similar to operating a database. But Hive and a database are not the same thing at all: Hive merely borrows the table and SQL ideas from databases to simplify the processing of massive data. Hive's storage mechanism, execution mechanism, execution latency, and the scale of data it stores are fundamentally different from those of a database.

  • Query Language Differences

      Because SQL is widely used in data warehousing, a SQL-like query language, HQL, was designed around Hive's characteristics. Developers familiar with SQL can use Hive for development very easily.
    
  • Differences in data storage locations

      Hive is built on top of Hadoop, and all Hive data is stored in HDFS. A database, by contrast, can keep its data on block devices or in the local file system.
    
  • The difference between data updates

      Because Hive is designed for data warehouse applications, whose content is read often and written rarely, Hive does not support rewriting or appending to individual rows: all data is determined at load time. Data in a database usually needs frequent modification, so you can add rows with INSERT INTO ... VALUES and modify them with UPDATE ... SET.
    
  • The difference between indexes

      Hive does not process, or even scan, the data while loading it, so it builds no indexes on any keys in the data. When Hive needs to access specific values satisfying a condition, it must brute-force scan the whole dataset, so access latency is high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes it still shows an advantage for large data volumes. A database typically builds indexes on one or a few columns, so it can access a small amount of data matching specific conditions very efficiently and with low latency. The high access latency means Hive is not suitable for online data queries.
    
  • Differences in execution methods

      Most Hive queries are executed through the MapReduce framework provided by Hadoop, whereas databases usually have their own execution engines.
    
  • The difference in execution delays

      When querying data, Hive has no indexes and must scan the whole table, so latency is high. Another factor is the MapReduce framework itself: MapReduce has high inherent latency, so queries executed through it are also slow. A database, by comparison, has low execution latency. Of course, this holds only while the data volume is small; once the data scale exceeds what the database can handle, Hive's parallel computation clearly shows its advantage.
    
  • The difference in scalability

      Because Hive is built on Hadoop, its scalability matches Hadoop's (the world's largest Hadoop cluster, at Yahoo!, was around 4,000 nodes in 2009). Databases, constrained by strict ACID semantics, scale far less: even the most advanced parallel database, Oracle, has a theoretical limit of only about 100 nodes.
    
  • The difference in data size

      Because Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large data volumes; correspondingly, the data scale a database can support is smaller.
    

4. Installation and deployment of Hive

Hive is equivalent to a SQL-like client of Hadoop. The underlying storage and computing are all based on Hadoop. Therefore, Hadoop software (pseudo-distributed, fully distributed, HA high availability) must be deployed and installed before Hive is installed.

Hive itself is not distributed software; it relies on Hadoop's distributed storage and distributed computation and is effectively a client program. So no matter which mode Hadoop is installed in, Hive only needs to be installed on any single node of the Hadoop cluster.

The installation of Hive is divided into the following steps :

  • 0. Before installing hive, you must first install and configure JDK and Hadoop successfully.

  • 1. Upload, decompress, and configure environment variables

    • tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/app/
      
    • vim /etc/profile
      
      export HIVE_HOME=/opt/app/hive-3.1.2
      export PATH=$PATH:$HIVE_HOME/bin
      
      source /etc/profile
      
    • For installed big data software, the directory name should contain only the software name and version number (e.g., rename the extracted apache-hive-3.1.2-bin directory to hive-3.1.2).

  • 2. Modify hive configuration file

    • 1. Point Hive at its own configuration file directory
    • 2. Point Hive at the Hadoop installation
    • Both items are configured in hive-env.sh

    First copy hive-env.sh.template in the conf directory and rename it to hive-env.sh

    vim hive-env.sh
    
    export HADOOP_HOME=...        # the Hadoop installation directory
    export HIVE_CONF_DIR=...      # Hive's conf directory, e.g. /opt/app/hive-3.1.2/conf
    # The configuration directory must be specified because the configuration files Hive ships with all
    # carry a template suffix; they are more like configuration templates, and by default Hive cannot
    # locate its configuration directory.

    • 3. Configure hive’s log output file hive-log4j2.properties
    cp hive-log4j2.properties.template hive-log4j2.properties
    
    vim hive-log4j2.properties
    

  • 3. Initialize Hive's metastore; by default it uses the embedded Derby database that ships with Hive:
    schematool -dbType derby -initSchema

    • Initialization may report a NoSuchMethodError: the guava jar that Hive depends on conflicts with the guava jar that Hadoop depends on.

    • Solution: delete or rename the guava.jar package in Hive's lib directory, then copy ${HADOOP_HOME}/share/hadoop/common/lib/guava.xxx.jar into Hive's lib directory (see the sketch below).

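      A minimal sketch of the fix, assuming Hive 3.1.2 is installed under /opt/app; the exact guava version numbers are assumptions and may differ on your machine (check with ls):

      # back up (or delete) Hive's own guava jar -- the version number here is an assumption
      mv /opt/app/hive-3.1.2/lib/guava-19.0.jar /opt/app/hive-3.1.2/lib/guava-19.0.jar.bak
      # copy Hadoop's newer guava jar into Hive's lib directory (adjust the file name to what ls shows)
      cp ${HADOOP_HOME}/share/hadoop/common/lib/guava-27.0-jre.jar /opt/app/hive-3.1.2/lib/
      # re-run the metastore initialization
      schematool -dbType derby -initSchema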

  • 4. There is one more dependency conflict between Hive and Hadoop. It does not affect normal use of Hive, but it produces a warning: the logging (slf4j/log4j) dependency.

    • The log dependency version in hive is lower than that of Hadoop.
    • Delete hive's log dependency log4j-slf4jxxxxx
  • 5. Add one more dependency to Hive: the MySQL JDBC driver, needed when configuring Hive's metadata database in MySQL.

【Precautions】

  • 1. When installing Hive for the first time, be sure to initialize the metadata database before starting the Hive command-line client.
    Initializing with Derby creates a metastore_db folder in the working directory where the initialization command was executed; that folder is where Derby keeps the metadata database files.
    If initialization fails, delete the created directory first and then re-initialize.
  • 2. When configuring the associated path in the env.sh file, do not add a space between the path and =.
  • 3. Hive and Hadoop have two dependency packages that conflict, namely guava and log4j; for guava's dependency, you need to delete hive's, and then copy Hadoop's to hive; for log4j's, you need to delete hive's.
  • 4. Hive's HQL statements run as MapReduce underneath, and it is not known in advance how many map tasks and reduce tasks the converted MR program will start. If the converted MR job needs more resources than the Hadoop cluster has (insufficient CPU or memory), an HQL statement with perfectly valid syntax will still fail at execution time with an "insufficient resources: xxxG of xxxG" style error. Before using Hive, it is best to adjust the resource settings in Hadoop's mapred-site.xml and yarn-site.xml files, as in the sketch below.
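    A hedged sketch of the kind of settings involved (the property names are standard Hadoop ones; the values below are purely illustrative and must be sized to your own machines):

    <!-- yarn-site.xml : memory a NodeManager may hand out, and the per-container maximum (illustrative values) -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>

    <!-- mapred-site.xml : memory requested by each map / reduce task (illustrative values) -->
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>2048</value>
    </property>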

5. Basic use of Hive

Hive is actually a SQL-like client of Hadoop. Hive provides several ways to do SQL-like programming: the Hive command line, Java JDBC, and so on.

If we want to use the hive command line to operate hive, then we do not need to start any hive-related services, we only need to start HDFS and YARN.

If we want to use Java's JDBC method to operate hive, we must configure and start Hive's remote connection service. We also need to start HDFS and YARN to operate.

Hive’s command line operation method :

  • Start the Hive command line with the hive command; it can only be used on the node where Hive is installed.

  • When an HQL query over table data involves aggregate functions, filtering, and so on, it is converted into an MR program to run.

  • Inserting data and deleting all data in a table are supported, but modifying or deleting only part of a table's data is not.

6. Hive metadata database configuration issues

A very important concept in Hive is the metastore (metadata). The metadata records which databases and tables have been created in Hive, the fields and field types of those tables, the HDFS storage directory of each table's data, and other information.

Hive itself is not responsible for storing the metadata; a relational database must be used, because Hive's metadata is really just a set of tables. The Derby database Hive uses by default can store the metadata, but Derby has a serious problem: it makes it impossible for multiple clients to use the Hive command line at the same time.

The derby database only allows one client connection to access the metadata database at the same time. Therefore, if hive metadata is initialized to the derby database, multi-client operations on the hive data warehouse cannot be implemented.

Therefore we, like the Hive project itself, recommend initializing Hive's metadata database in a relational database such as MySQL, Oracle, or SQL Server.

Hive initializes the metadata database into a relational database such as MySQL through Java JDBC under the hood. To initialize the metastore in MySQL, three things must be done first:

  • 1. Delete the table files previously created and added in the derby metadata database on HDFS, and at the same time remove the previous derby metadata database from the Linux directory.

  • 2. Install MySQL on Linux
    • To install MySQL on Linux, you need to use the yum warehouse. The yum warehouse must first configure the yum source of Alibaba Cloud.
  • 3. Because Hive uses Java JDBC to initialize metadata into MySQL, and Hive does not ship with the MySQL JDBC driver, we need to upload a copy of the MySQL JDBC driver jar into Hive's lib directory.

There is one more issue when Hive connects to MySQL: the connection needs a username and password, which Hive does not know by default. You need to modify a Hive configuration file to specify the MySQL username, password, driver class, and so on.

  • You need to create a hive-site.xml file in the conf path of the hive installation directory.
    • This file holds Hive's configuration items. The default values live in the hive-default.xml.template file; any item configured in hive-site.xml overrides the same item in the defaults.
    • The first and second lines of the hive-site.xml file are the core of the XML file (the XML declaration and stylesheet reference). Extra spaces or missing characters there will make the XML file unparseable or cause parsing to fail.
    • First, configure the four items related to the metadata database: URL (in XML, & must be written as &amp;), driver, username, and password.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
      <!-- MySQL address and the database (hive_metastore) in MySQL that stores Hive's metadata -->
      <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://single:3306/hive_metastore?createDatabaseIfNotExist=true&amp;serverTimezone=UTC&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
          <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <!-- MySQL JDBC driver -->
      <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.cj.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
      </property>
      <!-- MySQL username -->
      <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>root</value>
          <description>username to use against metastore database</description>
      </property>
      <!-- MySQL password -->
      <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>Root123456..</value>
          <description>password to use against metastore database</description>
      </property>
  </configuration>

Initialize hive metadata into MySQL: schematool -initSchema -dbType mysql -verbose.
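Putting the three prerequisites together, a minimal sketch (the MySQL connector jar file name depends on the version you downloaded, and the Hive path assumes the installation used above):

# 1. copy the MySQL JDBC driver into Hive's lib directory
cp mysql-connector-java-8.0.xx.jar /opt/app/hive-3.1.2/lib/
# 2. make sure hive-site.xml contains the four connection items shown above
# 3. initialize the metastore schema in MySQL
schematool -initSchema -dbType mysql -verbose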

7. Hive related configuration items

1. Configuration of the storage directory for Hive table data on HDFS

  • Table data stored in Hive is stored on HDFS by default, and is stored in the /user/hive/warehouse directory of HDFS by default.
  • After Hive is installed, a default database (named default) is provided. If we do not specify which database to use, Hive uses default, and its table data is stored directly under the /user/hive/warehouse path.
    For any other database, Hive first creates an xxx.db directory under /user/hive/warehouse and then puts that database's table data inside it.
  • Hive's configuration has an item that changes the storage directory of Hive table data: hive.metastore.warehouse.dir.
    The default configured path is /user/hive/warehouse.

2. When Hive's HQL statements are executed, they can be converted into MR, Spark, or Tez programs; MR is the default.

  • There is a configuration item in hive-default.xml.template that specifies which execution engine HQL is converted to:
    hive.execution.engine   (default value: mr)
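    For instance, the engine can be switched per session with a set command or fixed in hive-site.xml; tez and spark only work if those engines are actually installed alongside Hive (a sketch):

    -- run inside the Hive CLI
    set hive.execution.engine=mr;      -- the default
    -- set hive.execution.engine=tez;    requires Tez to be installed
    -- set hive.execution.engine=spark;  requires Spark to be installed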

3. Configure hive database name and table header display

  • By default, after switching to a database, the HiveCli (hive command-line client) gives no visual indication of which database is currently in use; and when querying table data, only the rows are shown, not the column headers.

  • <property>
         <name>hive.cli.print.header</name>
         <value>true</value>
     </property>
     
     <property>
         <name>hive.cli.print.current.db</name>
         <value>true</value>
     </property>
    

4. There are two commonly used clients for hive

  • Hive's command line client: hive command
    The command line client can only be used on the node where hive is installed.

  • Hive JDBC client operations

    • If we want to operate Hive from another machine or node, the Hive command-line client cannot be used, but we can connect to Hive remotely through JDBC. For JDBC to work, Hive must start the corresponding service, hiveserver2; only when this service is running can we connect to Hive remotely through JDBC.

    • hiveserver2 is Hive's remote connection service. It opens a port through which clients can communicate with and operate Hive remotely. hiveserver2 is not configured by default; we need to configure and start it ourselves.

      • 1. Configure hive-site.xml file
      <!-- HiveServer2 requires the following items to be added to hive-site.xml -->
      <!-- authentication mode; NONE disables authentication -->
      <property>
              <name>hive.server2.authentication</name>
              <value>NONE</value>
      </property>
      <!-- hiveserver2 host address; must be the node where Hive is installed -->
      <property>
              <name>hive.server2.thrift.bind.host</name>
              <value>single</value>
      </property>
      <!-- hiveserver2 bind port -->
      <property>
              <name>hive.server2.thrift.port</name>
              <value>10000</value>
              <description>TCP port number to listen on, default 10000</description>
      </property>
      <!-- hiveserver2 HTTP port (rarely used) -->
      <property>
              <name>hive.server2.thrift.http.port</name>
              <value>10001</value>
      </property>
      <!-- hiveserver2 connection user -->
      <property>
              <name>hive.server2.thrift.client.user</name>
              <value>root</value>
              <description>Username to use against thrift client</description>
      </property>
      <!-- hiveserver2 connection password -->
      <property>
              <name>hive.server2.thrift.client.password</name>
              <value>root</value>
              <description>Password to use against thrift client</description>
      </property>
      
      • 2. Configure Hadoop's core-site.xml file,
        because after hiveserver2 starts, remote connections use a username and password, and by default the user hiveserver2 proxies for is not allowed to access HDFS on behalf of other clients.
      <property>
            <name>hadoop.proxyuser.root.hosts</name>
            <value>*</value>
      </property>
      <property>
            <name>hadoop.proxyuser.root.groups</name>
            <value>*</value>
      </property>
      
    • Start hiveserver2: nohup hiveserver2 1>/opt/app/hive-3.1.2/hive.log 2>&1 &

    • [Note] When hiveserver2 is started in the background, its log must be redirected to a specified file, and that log file is given 777 permissions: chmod 777 hive.log

    • There are three ways to operate Hive via JDBC connection:

      • 1. Use a client called beeline to connect to the hiveserver2 service. Beeline is a command-line tool whose underlying connection to Hive goes through JDBC. Beeline is bundled with Hive, and it can also be installed separately.

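      A minimal sketch of a beeline session, assuming hiveserver2 is listening on port 10000 of the host named single with the user/password configured above (root/root):

      # connect directly from the command line
      beeline -u jdbc:hive2://single:10000 -n root -p root

      # or start beeline first and connect interactively
      beeline
      !connect jdbc:hive2://single:10000
      show databases;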

      • 2. Use DBeaver to connect to hiveserver2 for operation. DBeaver uses JDBC to connect to related databases.

      Delete the original driver and add our own driver jar package in its place: hive-jdbc-3.1.2-standalone.jar

      Click again to edit driver settings

      Finally, click Test Connection. The premise is that the port number 10000 has been activated in the big data environment.

      • 3. Use native Java JDBC code to connect to hiveserver2, following the usual seven JDBC steps (see the sketch below).
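        A minimal sketch of those steps, assuming the hive-jdbc driver is on the classpath and hiveserver2 runs on single:10000 with the user/password configured above (root/root); the queried table is only an example:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcDemo {
            public static void main(String[] args) throws Exception {
                // 1. load the Hive JDBC driver
                Class.forName("org.apache.hive.jdbc.HiveDriver");
                // 2. open a connection to hiveserver2 (URL format: jdbc:hive2://ip:port/database)
                Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://single:10000/default", "root", "root");
                // 3. create a statement
                Statement stmt = conn.createStatement();
                // 4. execute an HQL query (example table)
                ResultSet rs = stmt.executeQuery("select * from student");
                // 5. iterate over the result set
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
                // 6/7. release resources
                rs.close();
                stmt.close();
                conn.close();
            }
        }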

8. Basic usage of Hive

1. Use of Hive command line client

  • Advantages: You only need to start HDFS and YARN, and you can use them directly without starting any hive services.
  • Disadvantages: It can only be used on the hive installation node and cannot be operated remotely.
  • There are three ways to use the hive command line:
    • 1、hive
      • Directly executing hive will enter hive's interactive command line window (REPL). Write a line of HQL statement in the window. Just press Enter to execute a line of HQL statement.
    • 2. hive -e "HQL statement"
      • Quickly executes HQL without entering Hive's interactive command line (REPL). It is mainly intended for a single statement; multiple HQL statements can be given as long as they are separated by semicolons, but executing many statements this way is not recommended.
    • 3、hive -f xxx.sql --hiveconf key=value --hivevar key=value
      • The HQL statements that need to be executed can be encapsulated into a SQL file. Multiple HQL statements can be written in the file. Each HQL statement only needs to be separated by a semicolon.
      • The --hiveconf option is optional. If given, it passes a parameter to the SQL file; inside the SQL file the value is read with ${hiveconf:key}.
      • Parameters passed with the --hivevar option are read with ${hivevar:key}.
      • If you pass multiple parameters, use hiveconf to pass: hive -f xxx.sql --hiveconf key=value --hiveconf key=value
      • If multiple parameters are passed, use hivevar to pass: hive -f xxx.sql --hivevar key=value --hivevar key=value
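      A small sketch of passing a parameter into a SQL file; the file name, parameter name, and table are illustrative:

      # query.sql contains:
      #   select * from student where student_sex = "${hiveconf:sex}";
      hive -f query.sql --hiveconf sex=man
      # with --hivevar, the file would instead reference ${hivevar:sex}
      hive -f query.sql --hivevar sex=man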

2. Use hiveserver2 method to operate Hive

  • Advantages: You can remotely operate Hive on any node through JDBC remote connection.
  • Disadvantages: Because the remote connection requires data transmission through the network, the speed is not as fast as using the hive client directly.

HDFS and Linux file systems can also be directly operated in the Hive client.

  • dfs option HDFS_path : run HDFS commands from inside the Hive client
  • !command : run Linux commands from inside the Hive client

[Note] The command-line client supports executing multiple HQL commands from a SQL file, and comments can be added to the file. A comment in a Hive SQL file is written as -- followed by a space.
The same applies in DBeaver's SQL editor: -- comments.

3. Use of Hive’s JDBC client

You can use Java code to remotely connect to the Hive data warehouse with the help of JDBC tools, and then transfer HQL statements and execution results through the network.

Prerequisite: hiveserver2 must be started. hiveserver2 is Hive's remote connection service, designed specifically so that we can connect remotely through JDBC. Once started, it provides a network port, 10000 by default. (hiveserver2's parameters must be configured in hive-site.xml, and core-site.xml/hdfs-site.xml must grant the hiveserver2 service user permission to operate the Hadoop cluster on behalf of clients.)

Java URL:jdbc:hive2://ip:port

Start hiveserver2: nohup hiveserver2 1>xxxx.log 2>&1 &

Starting and stopping the Hive service involves a fair amount of typing, so we can wrap the start and stop commands in a shell script, hs2.sh, to simplify later operations.
[Note] Every time the virtual machine is started we need to bring up HDFS, YARN, JobHistory, and hiveserver2. Extended exercise: wrap starting HDFS, YARN, JobHistory, and hiveserver2 into one common script file.

#!/bin/sh
if [[ "$1" = "start" ]];then
  echo "starting hiveserver2......"
  nohup hiveserver2 1>/opt/app/hive-3.1.2/hive.log 2>&1 &
  echo "start hiveserver2 complete!"
elif [[ "$1" = "stop" ]];then
   echo "stopping hiveserver2......"
   pid=`netstat -untlp | grep 10000 | awk '{print $7}' | awk -F '/' '{print $1}'`
   if [[ "$pid" = "" ]];then
      echo "hiveserver2 is not running";
   else
      kill -9 $pid
      echo "stop hiveserver2 complete!"
   fi
else
    echo "invalid argument, expected start or stop"
fi

chmod 755 hadoop.sh

#!/bin/sh
# This script takes one argument: start/stop. start brings up all Hadoop-related services, stop shuts them all down.
if [[ "$1" = "start" ]];then
	echo "====================starting HDFS.....====================="
	start-dfs.sh
	echo "====================HDFS started======================"
	echo "====================starting YARN.....====================="
	start-yarn.sh
	echo "====================YARN started======================"
	echo "====================starting the JobHistory server....===="
	mapred --daemon start historyserver
	echo "====================JobHistory server started============"
	echo "====================starting hiveserver2.....=========="
	hs2.sh start
	echo "====================hiveserver2 started======================"
elif [[ "$1" = "stop" ]];then
	echo "stop..."
	stop-dfs.sh
	stop-yarn.sh
	mapred --daemon stop historyserver
	hs2.sh stop
	echo "ok"
else
	echo "invalid argument, expected start or stop"
fi

How to use:

  • 1. Use the original JDBC in Java code to operate Hiveserver2
  • 2. Use some JDBC-based tools
    • beeline – the jdbc client that comes with hive
    • dbeaver – jdbc-based database visualization tool
  • 3. Use the Chat2DB tool developed by Alibaba Cloud to operate

9. HQL syntax in Hive

Hive provides SQL-like syntax for data storage and calculation operations, and the stored data also exists in the form of tables and libraries. Therefore, the HQL language and the SQL language have many similarities, but there are also many different operations.

1. DDL syntax

  • DDL syntax is the language hive uses to manage databases and data tables. Although Hive uses databases and data tables to manage structured data, the underlying implementation of libraries and tables has nothing to do with authentic databases.

  • Management syntax of databases and data tables: syntax for creating, deleting, modifying, and querying databases and data tables.

  • Database management syntax

    • create grammar

      • create database [if not exists] database_name
        [comment   "remark"]            add a description/remark to the database
        [location  "hdfs path"]         where on HDFS the database's data is stored; if omitted, it defaults to /user/hive/warehouse/xxx.db on HDFS
        [with  dbproperties("key=value","key=value")]
        
      • -- 创建数据库的语法
        create database if not exists demo01;-- 默认数据库数据存放到HDFS的/usr/hive/warehouse路径下
        create database if not exists demo02
        comment "这是一个测试使用的专属数据库"
        location "hdfs://192.168.31.104:9000/demo02"-- 一般不会指定路径,默认路径就挺好
        with dbproperties("createtime"="2023.08.02","createuser"="kanglei","name"="demo02");
        
    • Modify grammar

      • The name of the database and the location of the database on HDFS cannot be changed, but the dbproperties attribute value of the database can be modified.

      • Modify the storage location of the database (only supported after hive2.2.1 version):alter database database_name set location "hdfs路径"

      • alter database database_name set dbproperties('createtime'='20180830');
        
      • -- 修改数据库语法
        alter database demo02 set dbproperties("createtime"="2023.08.01","tablenum"="3");
        
    • View syntax

      • show databases;   view which databases exist in Hive

      • show databases like 'name';   query databases whose names match

      • desc database database_name;   view brief information about a database

      • desc database extended database_name;   view detailed information about a database

      • desc formatted table_name;   view detailed information about a data table

      • -- 查看demo02的数据库信息
        desc database demo02;
        -- 查看demo02的数据库详细信息
        desc database extended demo02;
        
    • Use syntax

      • use database name;
    • delete database syntax

      • drop database if exists database_name [cascade];

      • -- 使用数据库
        use demo02;
        create table student(student_name string);
        -- 删除数据库
        drop database demo02;-- 这个命令只能删除空数据库 非空数据库无法使用该命令删除 —— 报错 
        drop database demo02 cascade;-- 删除非空数据库 把库下的数据表以及表数据一并删除了 —— 慎用
        
  • Data table management syntax

    • Syntax for creating data tables

      • CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name   # external: create an external table
        [(col_name data_type [COMMENT col_comment], ...)]    # table columns
        [COMMENT table_comment]                              # table remark
        [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]  # Hive-specific table type: partitioned table
        [CLUSTERED BY (col_name, col_name, ...)              # Hive-specific table type: bucketed table
        [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]  # bucketing information
        [ROW FORMAT row_format]  # delimiters between table fields
        [STORED AS file_format]  # format of the files stored on HDFS; text format by default
        [LOCATION hdfs_path]     # the table's own HDFS directory; if omitted, the table lives under its database's directory
        
      • Delimiters of the files stored under a Hive table

        • row_format   DELIMITED
          [FIELDS TERMINATED BY char [ESCAPED BY char]]    delimiter between columns
          [LINES TERMINATED BY char]                       delimiter between rows (\n)
          [COLLECTION ITEMS TERMINATED BY char]            delimiter between elements of collections, structs, arrays, etc.
          [MAP KEYS TERMINATED BY char]                    delimiter between the key and the value of a map entry
          [NULL DEFINED AS char]                           character used to represent null values
          
      • There are two more special syntaxes for creating data tables in Hive.

        • 1. Create a data table based on query syntax

          • create table table_name  as  select_query
            
          • -- 1、根据查询语句创建数据表:创建的数据表字段会根据查询语句的字段自动确定,类型自动推断
            use demo;
            create table teacher as select teacher_id as td,teacher_name from teacher1;
            select * from teacher;
            
        • 2. Create a new data table based on another data table

          • create table table_name  like  other_table_name;
            
          • The new table created only has the structure of the old table but no data from the old table.

          • Partition information and bucket information will also be copied.

    • There are four types of tables in Hive, and one table may belong to several of these types at once. Every Hive table is either an internal (managed) table or an external table; partitioned tables and bucketed tables are forms derived from internal and external tables.

      • Management table/internal table

        • create  table  table_name(.......)
          
        • A managed table is a table over which Hive has full control; if the table is dropped, its data files on HDFS are deleted along with it.

        • -- 表的创建语法
          use demo;
          -- 1、管理表/内部表
          create table if not exists demo(
          	username string comment "name",
          	password string comment "password"
          )comment "this is manager table"
          row format delimited fields terminated by "," lines terminated by "\n"
          stored as textfile;
          
          INSERT INTO demo values("kl","123456");
          
          dfs -cat /user/hive/warehouse/demo.db/demo/000000_0;
          
          -- 删除表
          drop table demo;
          
      • external table

        • create  external  table  table_name(......)
          
        • Hive only has the operations of querying and adding data to external tables. If the external table is deleted, only the table metadata information is deleted in Hive, and the table data still exists in HDFS.

        • In some cases the data is used not only by Hive but also by Spark or Flink. If Hive no longer needs the data we may drop the table, but the data itself must not be deleted; in such cases the table should be created as an external table.

        • -- 2、外部表
          create external table if not exists demo(
          	username string comment "name",
          	password string comment "password"
          )comment "this is external table"
          row format delimited fields terminated by "-" lines terminated by "\n"
          stored as sequencefile;
          
          insert into demo values("kl","123456");
          
          drop table demo;-- hdfs中仍然存在此数据
          
      • Partition Table

        • create [external] table  table_name(.......)
          comment ""
          partitioned by(partition_field type, second_partition_field type)
          row format .......
          
        • A partitioned table can be either a managed table or an external table. The difference from an ordinary table lies in how data is stored on HDFS: a non-partitioned table stores its data files directly in the table's directory, while a partitioned table first creates sub-folders (named after the partition values) inside the table's directory and puts each partition's data in the corresponding folder.

        • If you create a partitioned table and specify one or more partition fields, and the partition fields have multiple values, multiple different folders will be created in the directory of the database table to store the data of different partitions.

        • The purpose of a partitioned table is to divide table data into different areas according to specified rules, so that when we process data in the future, we can obtain the data we want according to the specified areas.

        • A partitioned table needs to specify a partition field, which must not be a table field.

        • Syntax for adding data to a partitioned table

          • insert into table_name   partition(partition_field=value)  values(values for the table fields)
            
          • -- 3、管理分区表
            create table if not exists student(
            	student_name string,
            	student_age int,
            	student_phone string
            )partitioned by(student_sex string)
            row format delimited fields terminated by "-" ;
            
            insert into student partition(student_sex="man") values("kl",21,"123456789");
            insert into student partition(student_sex="man") values("gb",20,"123789456");
            insert into student partition(student_sex="woman") values("hyf",18,"123159753");
            
            
            dfs -cat /user/hive/warehouse/demo.db/student/student_sex=man/000000_0;
            dfs -cat /user/hive/warehouse/demo.db/student/student_sex=man/000000_0_copy_1;
            
            select * from student;
            select * from student where student_sex="man";
            
            -- 4、多级分区表
            create table if not exists student1(
            	student_name string,
            	student_age int,
            	student_phone string
            )partitioned by(student_sex string,student_birthday string)
            row format delimited fields terminated by ",";
            
            insert into student1 partition(student_sex="man",student_birthday="2023-07") 
            values("kl",21,"123456789");
            
            select * from student1;
            
      • Bucket table

        • create [external] table  table_name(.......)
          comment ""
          partitioned by(partition_field type, second_partition_field type)
          clustered by(bucket_field -- must be a table field)   [sorted by (sort_field  asc|desc)]   into   num  buckets
          row format .......
          
        • A bucketed table can be a partitioned table, an external table, or an internal table. Bucketing means splitting the stored result into a specified number of files; it is equivalent to the MR program starting multiple reduce tasks, each of which outputs one result file.

        • The fields of the bucket table must be table fields

        • -- 5、创建一个普通的分桶教师表,要求按照教室的编号分为4个文件存储教师信息 每个文件需要按照教室的年龄的进行降序排序
          create table if not exists teacher(
          	teacher_num string,
          	teacher_name string,
          	teacher_age int
          )clustered by (teacher_num) sorted by(teacher_age desc) into 4 buckets;
          
          insert into teacher values("t001","zs",30),
          						  ("t002","zs1",28),
          						  ("t003","zs2",27),
          						  ("t004","zs3",45),
          						  ("t005","zs4",50),
          						  ("t006","zs5",55),
          						  ("t007","zs6",46),
          						  ("t008","zs7",39),
          						  ("t009","zs8",52),
          						  ("t0010","zs9",43);
          						  
          dfs -cat /user/hive/warehouse/demo.db/teacher/000003_0;
          
          -- 分桶表的抽样查询  总共设置了4个桶 1 out of 2的意思 从第1个桶开始抽取4/2个桶的数据
          select * from teacher tablesample(bucket 1 out of 2);
          
          -- 按照比例抽取 如果抽取某一个数据块大于小于128M 返回数据块的所有数据
          select * from teacher tablesample(0.9 percent);
          
    • Modify the syntax of the data table

      • 1. Modify the table name

        • alter table  table_name  rename to new_table_name
          
      • 2. Modify/add/replace columns

        • alter table table_name  change  old_column  new_column  type
          
      • 3. Add partition information: this does not add a new partition field; it adds a new directory (partition) for an existing partition field.

        • alter table table_name add partition(partition_field=partition_value)
          
      • 4. Delete partition information

        • alter table  table_name  drop partition(partition_field=partition_value)
          
        • -- 修改表名
          alter table teacher rename to teacher1;
          -- 修改增加列
          alter table teacher1 change teacher_num teacher_id string;
          alter table teacher1 add columns(teacher_phone string);
          -- 增加分区信息
          alter table student1 add partition(student_sex="women",student_birthday="2023-08") 
          						 partition(student_sex="no",student_birthday="2023-09");
          -- 删除分区  分区下数据丢失
          alter table student1 drop partition(student_sex="man",student_birthday="2023-07");
          
    • View information about a table syntax

      • show tables;   list all the tables in a database
        

      • desc table_name    view a table's columns and partition columns
        

      • desc  formatted table_name      view detailed information about a table
        

      • show partitions table_name   view a table's partitions
        

    • Syntax to delete data table

      • drop table  if exists  table_name;
        
  • Data table field type

    • integer type

      • tinyint
      • smallint
      • int/integer
      • bigint
    • Boolean type

      • boolean
    • Decimal type

      • float
      • double
    • string type

      • string
    • Time and date related types

      • timestamp
    • Byte type

      • binary
    • complex data types

      • array - array type
      • map - map collection in Java
      • struct—Java object (can store multiple data, each data type can be different)
    • -- 3、创建一个具有复杂数据类型的数据表 必须指定复杂数据类型的元素的分隔符
      -- array map struct 三个类型都是有多条数据组成的,需要指定数据之间的分隔符
      create table demo(
      	hobby array<string>,
      	menu map<string,double>,
      	students struct <name:string,age:int,sex:string>
      )row format delimited
      fields terminated by ","
      collection items terminated by "_"
      map keys terminated by ":"
      lines terminated by "\n";
      
      -- 向数据表增加特殊数据;此方法不建议使用
      insert into demo values("game"_"study"_"sleep","apple":20.0_"pear":30.0_"orange":40.0,"zs"_20_"man");
      -- insert增加问题比较多,不用insert增加了,而是使用文件增加
      select * from demo;
      select hobby,menu,students.age from demo;
      select hobby[0],menu["apple"],students.age from demo;
      
2. DML syntax

Data in Hive is stored in the form of databases and tables, so we can use DML operations to add, delete, and modify table data. However, because of Hive's nature, it does not provide full support for modifying and deleting data.

Hive's DML operations are divided into two parts:

  • 1. Normal DML operations: adding, deleting, and modifying data

    • Add data syntax

      • Ordinary insert command: the bottom layer will be translated into MR program execution

        • insert into  table_name(table fields) partition(partition_field=value) values(value list),(value list).......
          Rarely used in Hive.
          
        • insert into       table_name(table fields) partition(partition_field=value) select_query
          insert overwrite table  table_name(table fields) partition(partition_field=value) select_query
          Commonly used in Hive: add data based on a query.
          Requirement: the number, types, and order of the fields after table_name must match the result of the query.
          
        • 3. Multi-insert syntax: query different ranges of data from the same table A and insert them into another table B
          from  A
          insert into/overwrite [table]  table_name [partition(partition_field=value)] select fields  where condition
          insert into/overwrite [table]  table_name [partition(partition_field=value)] select fields  where another condition;
          
        • -- 1、insert增加单条或者多条数据
          create table test(
          	name string,
          	age int
          )row format delimited fields terminated by ",";
          insert into test values("zs",20),("ls",30);
          insert into test select name,age from test01;
          
          create table test01(
          	name string,
          	age int
          )partitioned by (timestr string)
          row format delimited fields terminated by ",";
          insert into test01 partition(timestr="2023") values("zs",20),("ls",30);
          insert overwrite table test01 partition(timestr="2023") select name,age from test;
          
          -- 多插入语法,根据多条增加语句增加数据,要求多条增加语句的查询是从同一张表查询过来的
          from test
          	insert overwrite table test01 partition(timestr="2022") select name,age 
          	insert overwrite table test01 partition(timestr="2023") select name,age;
          
      • Besides the insert syntax, we can also add data to a table through a couple of tricks.

        • 1. Following the table's format requirements, upload a data file that matches the format into the table's HDFS directory. This is not recommended.
          [Note]
          For a non-partitioned table, the data is recognized automatically once the upload succeeds.
          For a partitioned table, the upload may succeed while the table does not recognize the data (because we created the partition directory by hand); in that case repair the partition table with msck repair table table_name (see the sketch after this list).
        • 2. Specify a location when creating the table; the location directory (possibly already containing data) may exist in advance.
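          A rough sketch of technique 1 for the partitioned table student1 used earlier; the paths follow the default warehouse layout and the local file name is an assumption:

          # manually create a partition directory and upload a data file matching the table's format
          hdfs dfs -mkdir -p /user/hive/warehouse/demo.db/student1/student_sex=man/student_birthday=2023-08
          hdfs dfs -put /root/student.txt /user/hive/warehouse/demo.db/student1/student_sex=man/student_birthday=2023-08
          # the new partition is not visible to Hive until the metastore is repaired;
          # run inside the Hive CLI:
          #   msck repair table student1;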
      • load loading command

        • load also places the file into the data table (under the hood the file is moved into the table's directory). Unlike a manual upload, data loaded with the load command is never left unrecognized, because load goes through Hive's metadata.

        • By contrast, if a file is manually uploaded into the table directory, the metadata is not updated, so running count(*) on the table gives an inaccurate result, because count(*) reads the row count directly from the metadata statistics. With load, the file is likewise placed in the table's storage directory, but load also updates the metadata.

        • load  data [local]  inpath  "path"  [overwrite]   into table table_name [partition(partition_field=value)]
          
          local: if given, the path is a Linux path;
          if omitted, the path is an HDFS path (when loading a file that is already on HDFS, the file is moved into the table's directory and the original file disappears)
          
        • [Notes] The format of the file loaded by load must be consistent with the delimiter of the data table, and the columns must also correspond. Otherwise, loading failure or data anomalies may occur.

        • -- 装载Linux数据到hive的某个非分区表中
          load data local inpath "/root/test.txt" into table test;
          load data inpath "/t.txt" into table test;
          
          -- 装载Linux数据到hive的某个分区表中
          load data local inpath "/root/test.txt" into table test01 partition(timestr="2023");
          
    • update operation

      • Partition tables, management tables, external tables, and bucket tables created in Hive do not support update operations by default.
      • The update operation requires some special means of hive, hive's transaction operation.
    • Delete operation

      • The tables created in Hive do not support deleting part of the data by default, but they do support clearing a table's data, deleting the data in a particular partition, or deleting all data.
      • To delete all data in a table (or in a partition), use truncate [table] table_name [partition partition_spec]; note that truncate is a DDL command (see the sketch below).
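      For example, a sketch using the test and test01 tables created earlier:

      -- clear all data in a non-partitioned table
      truncate table test;
      -- clear only one partition of a partitioned table
      truncate table test01 partition(timestr="2023");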
  • 2. Import and export operations

    • Export operation

      • Export the data in the hive data table to the specified directory for storage.
      • export table table_name [partition(partition=value)] to “path”
    • Import operation

      • Import the data exported by hive into hive.
      • import [external] table table_name [partition(partition=value)] from "hdfs path - must be data exported through export"
        If you import the specified partition, the partition must be exported and the directory must also exist.
    • -- 把test1上的数据导出到hdfs的/export目录
      export table test01 to "/export";
      
      -- 导入数据
      import table test02 from "/export";
      

3. DQL syntax - the core and most important syntax in Hive

DQL is the data query language. The DQL provided by Hive is the core of Hive's statistical analysis: with basic join queries, conditional queries, group queries and other DQL operations, Hive can express the complex computation logic we previously wrote as MR programs.

Hive’s DQL syntax is very similar to MySQL’s DQL query syntax.

  • SELECT [ALL | DISTINCT] select_list (expressions, constants, functions, table fields)
      FROM table_reference  as alias
      [inner join | left join | right join | full join  other_table  as alias on  join_condition]
      [WHERE where_condition]
      [GROUP BY col_list]
      [ORDER BY col_list]
      [CLUSTER BY col_list  | [DISTRIBUTE BY col_list] [SORT BY col_list] ]
      [LIMIT [offset,] rows]
      [union | union all  other_select]
    
  • Single table query

    • SELECT [ALL | DISTINCT] select_list (expressions, constants, functions, table fields)
        FROM table_reference
        [WHERE where_condition]
        [GROUP BY col_list]
        [ORDER BY col_list]
        [CLUSTER BY col_list
          | [DISTRIBUTE BY col_list] [SORT BY col_list]
        ]
       [LIMIT [offset,] rows]
      
    • Basic query

      • select query list from table_name

      • If you are querying the data of all columns in the table, you can use * instead in the query list
        [Note] * Although it is simple, don't use it if you can.

      • (1) HQL language is not case sensitive.
        (2) HQL can be written on one line or multiple lines
        (3) Keywords cannot be abbreviated or separated into lines
        (4) Each clause should generally be written on separate lines.
        (5) Use indentation to improve the readability of statements.

      • select student_name
        from student;
        
        select * from student;
        
    • Conditional query

      • SELECT query list from table_name where query conditions
      • Query conditions can be conditional expressions, logical expressions, fuzzy queries
        • Conditional expression: > < >= <= = !=
        • Logical expression: is null | is not null | and | or | in | not in | ! | between a and b
        • Fuzzy query: like two special symbols % _
          rlike regular expression
    • Group query

      • select select_list (only constants, expressions, aggregate functions, and the grouping fields) from table_name [where conditions] group by grouping_field, ... [having post-grouping filter]
    • sort query

      • Global sorting: order by sorting field asc | desc
        • Hive's complex HQL queries are converted into MR programs under the hood. If the query result needs to be sorted, we can use order by, which is a global sort: once order by is used, the MR program converted from the HQL statement has only one reduce task, which pulls the data from all map tasks and outputs a single, globally ordered result file.
        • Global sorting has only one reduce task. If too much data is processed, the reduce calculation will be slow or even crash.
      • Local sorting: sort by sort field asc|desc
        • The MR program converted from the HQL statement can be given multiple reduce tasks; the map output is then partitioned by the default hash partitioning mechanism (which sort by cannot control), and within each partition the data is finally sorted according to sort by.
        • [Note] To use sort by, set the number of reduce tasks to a value greater than 1; if it is 1, sort by behaves the same as order by.
        • Partial sorting of specified partition fields: custom partitioning mechanism in MR
          • Sort by locally sorts the data of each partition, but sort by is not responsible for how the partition data is divided. We have no control over the data of each partition.
          • Distribute By partition field sort by sort field asc |desc
            First partition the data according to the partition field of Distribute By (remainder of the hash code of the partition field and the number of reduce tasks), and then sort the data of each partition according to the sort field
          • Set the number of reduce tasks with set mapreduce.job.reduces = 2; the default value is -1 (see the sketch below).
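          A small sketch using the teacher1 table from earlier; the reduce-task count and the chosen fields are illustrative:

            -- run inside the Hive CLI
            set mapreduce.job.reduces=2;
            -- partition rows by teacher_id, then sort each partition's rows by teacher_age descending
            select * from teacher1 distribute by teacher_id sort by teacher_age desc;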
        • CLUSTER BY
          • When the fields of Distribute By and sort by are consistent, you can use the cluster by partition sort field.
          • select * from a cluster by age   is equivalent to
            select * from a distribute by age sort by age
          • cluster by cannot specify whether the sorting rule is ascending or descending order, it can only be ascending order
    • Paging query

      • limit num: query returns num pieces of data
      • limit offset,num: starting from the row at position offset, return num rows; offset starts from 0.
    • [Two special syntaxes]

      • Give an alias
      • Remove duplicates
    • [Note]: Partition tables are specially stored in the form of folders with partition field values. When used, they can be used as table fields.

  • Join query

    • Usage scenario: The queried data comes from multiple data tables, and multiple data tables have "foreign key" relationships.

    • Hive, like MySQL, only supports equal value connections.

    • Connection query classification (all four types of connection queries in Hive are supported)

      • Inner join query inner join
      • left outer join query left join
      • Right outer join query right join
      • Full outer join query full join
    • Multiple table connection problem

      • select * from 
        a xxx join b on a.xx=b.xx  
        xxx join c  on xxx=xxx
        
    • Cartesian product problem (be sure to avoid this problem)

      • If table a has m rows and table b has n rows, then a join b without a proper condition produces m*n rows:
        every row of table a is matched with every row of table b.
      • Cause: The connection condition is not written or the connection condition is written incorrectly.
  • Union query

    • Usage scenario: Connect the results of multiple HQL query statements through union|union all
    • union deduplicates the combined data; union all does not.
    • Restrictions: The query lists (number, type and order of query lists) of multiple query statements must be consistent
  • subquery

    • Exactly the same as MySQL’s subquery
    • There is a query nested in the query, and the subquery can have from sub-statements, where sub-statements...

Commonly used functions in DQL query statements – the core knowledge points of Hive

  • built-in functions

    • -- 如何查看系统自带的内置函数
      show functions;
      desc function abs;
      desc function extended abs;
      
    • Usage of some common built-in functions in Hive

      • Math functions:UDF

        • UDF function: one-to-one function, input one data and output one data

        • abs(x): Returns the absolute value of x

        • ceil(x): round up; returns the smallest integer greater than or equal to x

        • floor(x): round down

        • mod(a,b):a%b

        • pow(a,b):a^b

        • round(x,[n]): Rounding. If n is not passed, it means that the decimal point is not retained. If n>=1, it means that n digits are retained after the decimal point.

        • sqrt(x): the square root of x

        • select abs(-12);
          select ceil(12.4);
          select ceil(11.4);
          select floor(12.4);
          select floor(11.4);
          select mod(11,3);
          select pow(2,5);
          select round(3.1415926,4);
          select round(3.1415926,0);
          select round(3.1415926);
          select round(3.1415926,-1);
          select sqrt(99);
          
      • String functions

        • concat: direct concatenation. It takes multiple arguments and joins them together; if any argument is null, the result is null.

        • concat_ws: concatenation with a delimiter. The first argument is the delimiter; null arguments are skipped rather than making the result null.

        • lpad|rpad(str,x,pad): pad str on the left/right with the pad character up to the specified length x

        • ltrim|rtrim|trim(str) remove spaces

        • length(x): Returns the length of the string

        • replace(str,str1,replacestr): Replace the specified string in the string with another string

        • reverse(str): string reverse

        • split(str,x):array string cutting

        • substr|substring(str,n,[num]), intercept string

        • create table demo(
          	name string
          );
          insert into demo values("zs"),("ls"),("ww");
          select * from demo;
          select concat(name,null) from demo;
          select concat_ws("-","zs","ls",null);
          
          select lpad("zs",10,"-");
          select rpad("zs",10,"-");
          select ltrim("  z   s    ");
          select rtrim("  z   s    ");
          select trim("  z   s    ");
          select length("sdadadadsasd");
          
          select replace("2022-10-11","-","/");
          select reverse("asdfghjkl");
          
          select split("as-df-gh-jkl","-");
          select substring("as df gh jkl",1,3);
          
      • date

        • current_date(): Returns the current date year, month and day

        • current_timestamp(): returns the current time

        • date_format(date,"format") formats time

        • datediff(date,date1): Returns the difference (days) between these two times

        • date_add(date,day)

        • date_sub(date,day)

        • select current_date();
          select current_timestamp();
          select date_format(current_date(),"yy-MM-dd");
          select datediff(current_date(),"2002-08-03");
          
      • conditional judgment function

        • if

        • case when then [when then] … else end

        • select if(1>2,"zs","ls");
          
          select 
          	case 10
          	when 10 then "zs"
          	when 20 then "ls"
          	else "ww"
          	end
          	
          select 
          	case 
          		when 1>2 then "zs"
          		when 1<2 then "ls"
          		else "ww"
          	end
          
      • Special functions

        • Functions related to array and collection operations

          • These functions either take an array as a parameter or return a value of array type

          • split(str,sep): array

          • collect_set(column): array; collects the values of the column across all rows into one array (many rows become one row). Duplicates are removed.

          • collect_list(column): array; collects the values of the column across all rows into one array (many rows become one row). Duplicates are kept.

          • array(ele…): array

          • map(key,value,key,value,...): map; takes an even number of arguments

          • concat_ws(sep,array(string)): string; joins all the strings in the array with the specified delimiter into a new string

          • explode(array or map): the row-generating ("explosion") function; it turns one value into multiple rows.
            If an array is passed, the result is one column with multiple rows;
            if a map is passed, the result is two columns (key and value) with multiple rows.

          • insert into demo values("zs"),("ls"),("ww");
            select collect_set(name) from demo;
            select collect_list(name) from demo;
            select concat_ws("-",collect_set(name)) from demo;
            
            
            select array(1,2,3,4,5);
            select explode(array(1,2,3,4,5));
            select map("name","zs","age","20","sex","man");
            select explode(map("name","zs","age","20","sex","man"));
            
        • Special functions related to strings (the string must be a URL)

          • The concept of URL

            • URL stands for Uniform Resource Locator; it identifies a unique resource on the Internet or on a host.
              A URL mainly consists of the following parts:
              protocol: http/https, ftp, file, ssh
              host: hostname, domain name or IP address
              port: port number
              path: resource path
              queryParam: parameters, ?key=value&key=value....
              
              Examples: http://192.168.35.101:9870/index.html?name=zs&age=20
              https://www.baidu.com/search/a?key=value
              
              If no port is written in a URL, the protocol's default port is used: http:80  https:443  ssh:22
              
              
          • parse_url(urlstr, "COMPONENT"): string
            It can only extract one component of the URL at a time.

          • parse_url_tuple(urlstr, "COMPONENT", ...): each requested component is returned as a separate column, so one value is turned into one row with multiple columns. It supports one extra
            component specifier: QUERY:key
            Usage: parse_url_tuple(urlstr, "COMPONENT", ...) as (column_name, ...)

          • Hive provides these functions specifically for parsing URLs: they extract the components of a URL.
            The component specifier names the part of the URL to extract; the supported specifiers are:
            HOST: the hostname or IP address in the URL
            PATH: the resource path in the URL
            QUERY: all the request parameters in the URL
            PROTOCOL: the request protocol in the URL
            AUTHORITY: the authority part (host:port)
            FILE: the resource path together with the query string
            USERINFO: the user-info part of the URL
            REF: the fragment after #
            
          • select parse_url("http://www.baidu.com:80/search/a?name=zs&age=30","HOST");
            select parse_url("http://www.baidu.com:80/search/a?name=zs&age=30","PATH");
            select parse_url("http://www.baidu.com:80/search/a?name=zs&age=30","QUERY");
            select parse_url("http://www.baidu.com:80/search/a?name=zs&age=30","PROTOCOL");
            
            select parse_url_tuple("http://www.baidu.com:80/search/a?name=zs&age=30","QUERY","HOST","PATH","PROTOCOL","QUERY:name") as (query,host,path,protocol,name);
            
        • Lateral view (side view)

          • Lateral View is used together with a UDTF function. For each row of the original table, the UDTF generates a virtual table; that dynamically generated virtual table is then joined (a Cartesian product) with the row that produced it. This makes it possible to express queries that ordinary SQL cannot.

          • Usage scenario: a column in some rows holds data made up of multiple values. We want to split that multi-value column and combine each value with the current row, producing multiple result rows.

          • -- Using a lateral view
            create table test(
            	name string,
            	age int,
            	hobby array<string>
            );
            
            -- complex-type literals (array/map/struct) are not supported in an INSERT ... VALUES clause, so build the rows with SELECT
            insert into test select "zs",20,array("play","study") union all select "ls",30,array("sleep","study");
            select * from test;
            
            select name,age,hobby,temp.hb from test
            lateral view explode(hobby) temp as hb;
            
        • Window functions: the over clause used inside a select statement

          • A window function splits the table into multiple virtual windows according to specified rules when querying (nothing is really split; it is similar to grouping). Inside each window you can compute information that normally requires grouping, and then combine it with the original rows, so a single query can return both plain columns and aggregated columns.
            When you need ordinary fields together with aggregated information, window functions are the right choice.

          • Syntax: function(arguments) over(partition by column order by column rows between upper boundary and lower boundary) as alias

            • partition by specifies the column used to split the data into windows (groups).
              order by sorts the rows inside each window by the specified column.
              rows between defines the window frame boundaries. By default the frame is all the rows in the group, but the frame can also be limited to part of the rows in the group.
              If no boundaries are written, the default frame (no upper and no lower boundary) is all the data in the group
          • There are three main types of functions that can be used in conjunction with window functions.

            • first_value(col)|last_value(col) over(partition by column name order by field) as alias-column name

            • Aggregation function sum/avg/count/max/min over(partition by column name) as column alias

            • Ranking functions row_number()/rank()/dense_rank() over(partition by column order by column) as alias.
              A ranking function opens a window over the data and, for each row, determines the position of that row within its window, then assigns it a sequence number starting from 1.
              row_number(): numbers increase from 1; rows that rank the same still get consecutive, distinct numbers.
              rank(): numbers increase from 1; rows that rank the same get the same number, and the following rank is skipped (gaps appear).
              dense_rank(): numbers increase from 1; rows that rank the same get the same number and no ranks are skipped.

              Usage scenario: finding the top-N rows within each group

          • Note: a window function is always used together with some other function. For most of those functions the default frame is unbounded (no upper and no lower boundary), but for a few functions the default frame is not unbounded. Therefore, when using window functions it is recommended to declare the window frame explicitly.

          • -- Window functions
            create table student(
            	student_name string,
            	student_age int,
            	student_sex string
            );
            
            insert into student values("zs",20,"man"),("ls",20,"woman"),("ww",20,"man"),("ml",20,"woman"),("zsf",20,"man");
            
            select * from student;
            -- Count the total number of students of each sex
            select student_sex,count(1) from student group by student_sex;
            -- Query every student row, and append the total count of the student's sex group to each row
            select 
            	student_name,
            	student_age,
            	student_sex,
            	count(1) over(partition by student_sex) as sex_count
            from student;
            
            select 
            	student_name,
            	student_age,
            	student_sex,
            	first_value(student_name) over(partition by student_sex) as first_name_in_sex
            from student;
            
            select 
            	student_name,
            	student_age,
            	student_sex,
            	row_number() over(partition by student_sex order by student_name desc) as name_rank
            from student;
            
            -- Usage scenario of ranking functions
            create table employees(
            	employees_id int,
            	employees_name string,
            	employees_dept int,
            	employees_salary double
            );
            insert into employees values(1,"zs",1,2000.0),
            							(2,"ls",1,1800.0),
            							(3,"ww",1,1700.0),
            							(4,"ml",1,2000.0),
            							(5,"zsf",1,1900.0),
            							(6,"zwj",2,3000.0),
            							(7,"qf",2,2500.0),
            							(8,"cl",2,2500.0),
            							(9,"jmsw",2,2000.0);
            
            select * from employees;
            -- Get the top two employees by salary in each department
            -- Group by department, sort salary in descending order, and use a ranking window function to number each row
            select 
            	*,
            	dense_rank() over(partition by employees_dept order by employees_salary desc) as salary_rank
            from employees;
            
            select * from (
            	select 
            		*,
            		dense_rank() over(partition by employees_dept order by employees_salary desc) as salary_rank
            	from employees
            )as b
            where salary_rank <= 2;
            

            Supplementary code for using window functions:

            create table student_score(
              student_id int,
              student_name string,
              student_class int,
              student_score double
            );
            
            insert into student_score values(1,"zs",1,80.0),
            								(2,"ls",1,90.0),
            								(3,"ww",1,100.0),
            								(4,"ml",1,85.0),
            								(5,"zsf",2,80.0),
            								(6,"zwj",2,70.0),
            								(7,"qf",2,60.0);
            select * from student_score;
            -- Query each student's score, together with the difference from the previous student's score in the same class
            select 
               a.*,
               abs(a.student_score-a.front_score) as score_diff 
            from(
            	select *, 
            	 first_value(student_score) over(partition by student_class order by student_id asc rows between 1 preceding and current row) as front_score
            	from student_score
            ) as a;
            
            -- Query each student's score, together with the difference between the student's score and the highest score in the class.
            select a.*,abs(a.max_score - a.student_score) as score_diff from
            (select * ,
            	first_value(student_score) over(partition by student_class order by student_score desc) as max_score
            	from student_score
            ) as a;
            
            select 
            	*,
            	abs(student_score - (max(student_score) over(partition by student_class rows between unbounded preceding and unbounded following))) as score_diff
            from student_score;
            
  • User-defined functions

    • We write user-defined functions when Hive's built-in functions do not meet our needs; a custom function implements exactly the behavior we want.

    • Steps to write a custom Hive function
      (Hive itself is implemented in Java, so custom functions are also written in Java)

      • 1. Create a Java project

      • 2. Introduce programming dependencies

        • 1. Create a lib directory in the project, copy the required jar packages from the lib directory
          of Hive's installation directory into it, and add the lib directory as a library in the IDE.
        • 2. Or use Maven and declare the dependencies by their GAV coordinates.
      • 3. Write corresponding function classes: UDF, UDTF, UDAF.
        Most of the customizations are UDF and UDTF functions.

      • 4. Package the written Java code into a jar

      • 5. Upload the jar package to HDFS

      • 6. Use create function... to create a function from the jar package in the form of a fully qualified class name.

      • Note: a custom function can be created in two ways (a rough check against the metastore is sketched below):
        create [temporary] function function_name as "fully qualified class name" using jar "path of the jar package on HDFS"

        • Temporary function: only valid for the current session.
          It can be viewed through show functions and is not recorded in the metadata database.
        • Permanent function: takes effect permanently.
          It cannot be viewed through show functions, but it can be seen in the FUNCS table of Hive's metadata database.
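          As a rough illustration (not from the original text; the column names FUNC_NAME and CLASS_NAME are assumptions based on a typical Hive 3.x metastore schema), a permanent function can be checked directly in the metastore's relational database rather than in Hive:

          -- run in the metastore database (e.g. MySQL), not in the Hive CLI
          select FUNC_NAME, CLASS_NAME from FUNCS;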
      • Introduce dependencies

        <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
          <modelVersion>4.0.0</modelVersion>
        
          <groupId>com.kang</groupId>
          <artifactId>hive-function</artifactId>
          <version>1.0</version>
          <packaging>jar</packaging>
        
          <name>hive-function</name>
          <url>http://maven.apache.org</url>
        
          <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          </properties>
        
          <dependencies>
            <dependency>
              <groupId>org.apache.hive</groupId>
              <artifactId>hive-exec</artifactId>
              <version>3.1.2</version>
            </dependency>
          </dependencies>
          <build>
            <finalName>hf</finalName>
          </build>
        </project>
        
      • Write Java code

        package com.kang.udf;
        
        import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
        import org.apache.hadoop.hive.ql.metadata.HiveException;
        import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
        import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
        import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
        
        /**
         * A function that counts the uppercase characters in a string.
         * A UDF needs to extend the GenericUDF class and override three methods.
         */
        public class FindUpperCount extends GenericUDF {
        
            /**
             * Initialization method, used to validate the function arguments:
             * it checks the number of arguments and their types.
             * @param objectInspectors an array describing the types and number of the arguments
             * @return the return type of the function after it finishes executing
             * @throws UDFArgumentException
             */
            @Override
            public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
                // 1. Check whether the number and types of the arguments meet the requirements.
                // The length of the array is the number of arguments passed to the function.
                int length = objectInspectors.length;
                if (length != 1) {
                    throw new UDFArgumentException("function only need one param");
                } else {
                    // ObjectInspector is the top-level parent of all Hive data types (the argument type).
                    ObjectInspector objectInspector = objectInspectors[0];
                    // PrimitiveObjectInspectorFactory is the factory class for all primitive Hive types.
                    // Return the output type of the function: an integer.
                    return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
                }
            }
        
            /**
             * The core logic of the function.
             * @param deferredObjects the arguments passed to the function
             * @return the result of the function; its type must match the type returned by initialize
             * @throws HiveException
             */
            @Override
            public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
                // Get the single argument passed to the function.
                DeferredObject deferredObject = deferredObjects[0];
                // get() returns the wrapped argument value.
                Object o = deferredObject.get();
                String str = o.toString();
                int num = 0;
                for (char c : str.toCharArray()) {
                    // Count characters in the ASCII range A (65) to Z (90).
                    if (c >= 65 && c <= 90) {
                        num++;
                    }
                }
                return num;
            }
        
            /**
             * The string shown when HQL explains the SQL -- not used here.
             * @param strings
             * @return
             */
            @Override
            public String getDisplayString(String[] strings) {
                return "";
            }
        }
        
        
      • Package the Java code written on Windows into a jar, upload the jar to the Linux machine first, then put it on HDFS from Linux, and finally create and use the function in the Hive environment

        -- Goal: get the number of uppercase characters in a string through an HQL statement (UDF)
        -- Create the function from the jar package on HDFS
        -- Temporary function: can be seen through show functions;
        create temporary function find_upper as "com.kang.udf.FindUpperCount" using jar "hdfs://192.168.31.104:9000/hf.jar";
        show functions;
        select find_upper("AsdsdSsdsDdsdF");
        -- Permanent function: cannot be seen through show functions; but it can be found in the FUNCS table of Hive's metastore database
        create function find_upper as "com.kang.udf.FindUpperCount" using jar "hdfs://192.168.31.104:9000/hf.jar";
        select find_upper("AsdsdSsdsDdsdF");
        
    • Custom UDTF function

      • package com.kang.udtf;
        
        import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
        import org.apache.hadoop.hive.ql.metadata.HiveException;
        import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
        import org.apache.hadoop.hive.serde2.objectinspector.*;
        import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
        
        import java.util.ArrayList;
        import java.util.List;
        
        /**
         * Input: two arguments
         *      a string
         *      a separator
         * Output: one column with multiple rows, e.g.
         *      word
         *      zs
         *      ls
         */
        public class SplitPlus extends GenericUDTF {
        
            /**
             * Purpose:
             *  1. validate the input arguments
             *  2. declare the number, names and types of the columns the UDTF returns
             * @param argOIs can be treated as an array made up of the arguments
             * @return
             * @throws UDFArgumentException
             */
            @Override
            public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
                List<? extends StructField> allStructFieldRefs = argOIs.getAllStructFieldRefs();
                if (allStructFieldRefs.size() != 2) {
                    throw new UDFArgumentException("function need two params");
                } else {
                    /**
                     * Return one column with multiple rows; a UDTF can return multiple rows and columns.
                     */
                    // Names of the returned columns; the size of this list is the number of columns the UDTF returns.
                    List<String> columnNames = new ArrayList<>();
                    columnNames.add("word");
        
                    // Types of the returned columns; must have the same size as columnNames.
                    List<ObjectInspector> columnTypes = new ArrayList<>();
                    columnTypes.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        
                    // Build the StandardStructObjectInspector from the two lists: List<String> and List<ObjectInspector>.
                    StandardStructObjectInspector standardStructObjectInspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnTypes);
                    return standardStructObjectInspector;
                }
            }
        
            /**
             * The core logic of the UDTF.
             *    Results are emitted through the forward method.
             * @param objects the input arguments of the function
             * @throws HiveException
             */
            @Override
            public void process(Object[] objects) throws HiveException {
                String str = objects[0].toString();
                String split = objects[1].toString();
                String[] array = str.split(split);
                for (String s : array) {
                    // Call forward once for each output row.
                    /**
                     * If a row has several columns, put all the column values of the row
                     * into the Object[] (or a List<Object>) and forward that.
                     */
                    forward(new Object[]{s});
                }
            }
        
            /**
             * close is used to release external resources.
             * @throws HiveException
             */
            @Override
            public void close() throws HiveException {
        
            }
        }
        
        
      • -- Goal: implement a function similar to split, except that it returns multiple rows after splitting instead of an array
        -- split("zs-ls","-"):array("zs","ls")  split_plus("zs-ls","-"):zs ls
        create function split_plus as "com.kang.udtf.SplitPlus" using jar "hdfs://192.168.31.104:9000/hf.jar";
        list jars;
        DELETE jar;
        
        select split("zs-ls-ww","-");
        select split_plus("zs-ls-ww","-");
        

    • Delete a custom function: drop function function_name
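      For example, a small sketch reusing the functions created above (drop temporary function applies to the temporary variant):

      drop function split_plus;
      drop temporary function find_upper;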

    • [Note] There is a particularly important point about user-defined functions: a custom function is bound to the database in which it was created and can only be used in that database. If you want to use it in another database, you need to create the function again in that database.
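      A minimal sketch of what this means in practice (the database name test_db is an assumption used only for illustration):

      -- the function created earlier in the default database is not usable here until it is created again
      use test_db;
      create function split_plus as "com.kang.udtf.SplitPlus" using jar "hdfs://192.168.31.104:9000/hf.jar";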

  • Classification of Hive functions (illustrated with built-in examples after this list):

    • UDF function: one-to-one; one input value produces one output value
    • UDTF function: one-to-many; one input value produces multiple output rows
    • UDAF function: many-to-one; multiple input rows produce one output value
    • Custom functions
    • Lateral view: used together with UDTF functions to join the generated rows back to the originating row (a Cartesian product)
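    A quick illustration of the three categories using built-in functions (the demo table from the string-function examples above is assumed to exist):

    select abs(-5);                  -- UDF: one value in, one value out
    select explode(array(1,2,3));    -- UDTF: one value in, multiple rows out
    select count(1) from demo;       -- UDAF: multiple rows in, one value out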

10. Hive compression mechanism

At the bottom layer, Hive statements are converted into MapReduce jobs, and compression can be applied within the MapReduce stages. Hive therefore also supports configuring the compression mechanism, i.e. whether the generated MR program compresses the map-stage output or the reduce-stage (final) output.
Hive can also be converted into Spark or Tez programs to run; the compression settings of Spark and Tez differ from those of MapReduce.
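A minimal sketch of common settings when Hive runs on MapReduce (the choice of SnappyCodec here is an assumption for illustration; any codec installed in the cluster can be used):

-- compress the intermediate (map-stage) output
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress the final (reduce-stage) job output
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;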
