Apache Hive for big data development
1. Overview
Apache Hive is an open-source data warehouse system built on the Hadoop ecosystem. It maps structured and semi-structured data files stored in Hadoop to database tables, and provides a SQL-like query language over those tables called Hive Query Language (HQL for short).
At its core, Hive converts HQL statements into MapReduce programs and submits them to the Hadoop cluster for execution.
Why use Hive
Writing Hadoop MapReduce programs directly carries a learning cost: it requires a Java foundation, and hand-coding complex query logic such as joins, sorting, and deduplication is cumbersome.
Hive provides a SQL-like interface instead, so developers can get started quickly and write SQL to analyze and process massive data sets.
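For instance, a grouped, deduplicated count that would need a custom mapper and reducer in hand-written MapReduce is a single HQL statement. A sketch for illustration only; the `user_logs` table and its columns are hypothetical:

```sql
-- Count distinct users per day from a hypothetical user_logs table;
-- Hive compiles this statement into MapReduce job(s) automatically.
SELECT log_date,
       COUNT(DISTINCT user_id) AS daily_users
FROM user_logs
GROUP BY log_date
ORDER BY log_date;
```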
Relationship between Hive and Hadoop
As data warehouse software, Hive must at least provide two capabilities: data storage and data analysis.
Hive implements both on top of Hadoop:
- Data storage: Hive stores its data in Hadoop's HDFS
- Data analysis: Hive uses Hadoop's MapReduce to analyze data
In other words, Hive itself really does only one thing: parse HQL into MapReduce jobs. The storage and analysis capabilities come from the underlying Hadoop components.
2. Hive architecture
- UI – The user interface through which users submit queries and other operations to the system. As of 2011 the system had a command-line interface, and a web-based GUI was under development.
- Driver – The component that receives queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
- Compiler – Includes the parser, plan compiler, optimizer, and executor. It parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of table and partition metadata looked up from the metastore.
- Metastore – A component that stores all the structural information for the various tables and partitions in the warehouse, including column and column type information, serializers and deserializers needed to read and write data, and the corresponding HDFS files where the data is stored.
- Execution Engine – The component that executes the plan created by the compiler. The plan is a DAG of stages; the execution engine manages the dependencies between these stages and runs each stage on the appropriate system component. Hive itself does not process data files directly; that work is delegated to the execution engine. Currently Hive supports three execution engines: MapReduce, Tez, and Spark.
3. Hive Metastore
The Hive Metastore is the component (or service) that manages Hive's metadata.
Metadata is data that describes data; for example, the field definitions of a database table are metadata.
Hive stores its data as files in Hadoop, so which files correspond to which HQL tables? That mapping is recorded by the Hive Metastore. Hive metadata covers databases, tables, field attributes, field order, and the HDFS locations where table data is stored.
There are two ways to store this metadata: the Derby database that ships with Hive, or a third-party database such as MySQL.
The metastore runs as a service that manages Hive metadata and exposes a service address; clients connect to the metastore service rather than to the metadata database directly, which protects the metadata to some extent.
The metastore service can be configured in three modes: embedded mode, local mode, and remote mode.
| | embedded mode | local mode | remote mode |
|---|---|---|---|
| Metastore deployed and started separately | no | no | yes |
| Metadata storage | built-in Derby | third-party MySQL | third-party MySQL |
4. Hive deployment
Preparation
Make sure the Hadoop cluster works normally and that cluster time synchronization, firewall settings, host names, passwordless login, the JDK, and environment variables are all set up.
4.1 Integrating Hadoop with Hive
Hive stores its data on HDFS, so relevant configuration properties must be added in Hadoop for Hive to run on top of it.
Modify core-site.xml in Hadoop, synchronize the configuration file across the cluster, and restart Hadoop for it to take effect:
<!-- Hive integration: user proxy (proxyuser) settings -->
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
4.2 MySQL installation
(omitted)
4.4 Installation package
Install Hive on the master node
tar zxvf apache-hive-3.1.2-bin.tar.gz
# Resolve the guava version conflict between Hive and Hadoop
cd /soft/server/apache-hive-3.1.2-bin/
rm -rf lib/guava-19.0.jar
cp /soft/server/hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar ./lib/
Modify hive-env.sh
cd /soft/server/apache-hive-3.1.2-bin/conf
mv hive-env.sh.template hive-env.sh
vim hive-env.sh
export HADOOP_HOME=/soft/server/hadoop-3.3.0
export HIVE_CONF_DIR=/soft/server/apache-hive-3.1.2-bin/conf
export HIVE_AUX_JARS_PATH=/soft/server/apache-hive-3.1.2-bin/lib
Add hive-site.xml
<configuration>
<!-- MySQL configuration for metadata storage -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.141.155:3306/hive3?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
<!-- Host to bind HiveServer2 (HS2) to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node1</value>
</property>
<!-- Metastore address for remote-mode deployment -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1:9083</value>
</property>
<!-- Disable metastore event notification API authorization -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
</configuration>
4.5 Initialization
Upload the MySQL JDBC driver (mysql-connector-java-5.1.32.jar) to the lib directory of the Hive installation, then initialize Hive's metadata:
cd /soft/server/apache-hive-3.1.2-bin/
bin/schematool -initSchema -dbType mysql -verbose
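To confirm the initialization succeeded, schematool can report the schema version it finds in MySQL. A minimal verification sketch, assuming the same working directory and a reachable MySQL instance (this command only reads metadata):

```shell
# Print the metastore connection URL and schema version recorded in MySQL;
# after a successful -initSchema run, a version should be reported here.
bin/schematool -info -dbType mysql
```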
4.6 Start the service
# start in the foreground
/soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore
# start in the foreground with debug logging
/soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console
# start in the background
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore &
4.7 Hive clients
Hive ships two command-line clients, the newer beeline and the original hive client, corresponding to the beeline and hive scripts under the bin directory:
[root@node1 bin]# pwd
/soft/server/apache-hive-3.1.2-bin/bin
[root@node1 bin]# ll
total 44
-rwxr-xr-x. 1 root root   881 Aug 23  2019 beeline
drwxr-xr-x. 3 root root  4096 Jul 12 21:45 ext
-rwxr-xr-x. 1 root root 10158 Aug 23  2019 hive
....
Start the metastore service first, then the hiveserver2 service:
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service metastore &
nohup /soft/server/apache-hive-3.1.2-bin/bin/hive --service hiveserver2 &
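Once both services are up, you can connect with the beeline client. A usage sketch, assuming HiveServer2 is bound to node1 as configured in hive-site.xml and listening on its default port 10000 (adjust the host and user for your cluster):

```shell
# Connect to HiveServer2 over JDBC using the beeline client
/soft/server/apache-hive-3.1.2-bin/bin/beeline -u jdbc:hive2://node1:10000 -n root
```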
5. Using Hive
For HQL (DDL) syntax, see the Hive language manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL