This article assumes all files are downloaded to /home/YourUserName/downloads.
1. Introduction to Hive
Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism to store, query, and analyze large-scale data kept in Hadoop. Hive maps structured data files to database tables and provides SQL query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: SQL-like statements give you fast MapReduce-backed statistics without having to develop dedicated MapReduce applications.
2. Hive installation and configuration
- Download Hive
Go to the Apache Hive download mirror page.
Note: Hive 3.x works with Hadoop 3.y, and Hive 2.x works with Hadoop 2.y.
My Hadoop is version 3.1.3, so I download Hive 3.1.2 (remember to download the compiled binary package).
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
In addition, the database I configured for Hive is MySQL, so I need to download the MySQL driver jar package
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar
The download speed above is too slow, you can try the following
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.47.tar.gz
# The file downloaded from here needs to be extracted:
tar -zxf mysql-connector-java-5.1.47.tar.gz
- Start installation
Unzip
tar -zxf apache-hive-3.1.2-bin.tar.gz
mv apache-hive-3.1.2-bin /usr/local/hive
Configure environment variables
vi /etc/profile
Add the following lines:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
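After editing the file, the variables should be applied and verified in the current shell. A minimal check, assuming the /usr/local/hive install path used above:

```shell
# Apply the variables in the current shell (normally done via: source /etc/profile)
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

# Verify both settings took effect
echo "$HIVE_HOME"
case ":$PATH:" in
  *":$HIVE_HOME/bin:"*) echo "PATH ok" ;;
  *)                    echo "PATH missing $HIVE_HOME/bin" ;;
esac
```

If `echo $HIVE_HOME` prints nothing in a new terminal, the edit to /etc/profile did not take effect.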
Create the HDFS directories and grant permissions
hdfs dfs -mkdir -p /usr/local/hive/warehouse
hdfs dfs -mkdir -p /usr/local/hive/tmp
hdfs dfs -mkdir -p /usr/local/hive/log
hdfs dfs -chmod g+w /usr/local/hive/warehouse
hdfs dfs -chmod g+w /usr/local/hive/tmp
hdfs dfs -chmod g+w /usr/local/hive/log
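The six commands above can equally be written as a loop. The sketch below prefixes each command with `echo` so it runs without a cluster; drop the `echo` on a real node:

```shell
# Create each Hive working directory in HDFS and make it group-writable.
# 'echo' prints the real command instead of running it, so this runs anywhere.
for d in warehouse tmp log; do
  echo hdfs dfs -mkdir -p /usr/local/hive/$d
  echo hdfs dfs -chmod g+w /usr/local/hive/$d
done
```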
Configure
hive-env.sh
cd /usr/local/hive/conf
cp hive-env.sh.template hive-env.sh
vim hive-env.sh
Add the following content (the copied template already contains corresponding commented lines):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0_241
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib/*
Configure
hive-site.xml
cp hive-default.xml.template hive-site.xml
vim hive-site.xml
Inside the
<configuration>
tag, add the following properties:
<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
And modify the following
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://Master:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>toor</value>
</property>
Note: why does the ConnectionURL value use &amp; instead of a bare &? Because & must be written as a predefined entity reference in XML.
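To see the escaping at work, the stand-alone demo below decodes the escaped URL with `sed`, producing the bare-`&` string that the XML parser actually hands to the JDBC driver:

```shell
# The value as written in hive-site.xml, with XML-escaped ampersands:
escaped='jdbc:mysql://Master:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false'

# What the JDBC driver receives after the XML parser decodes the entities
# (in the sed replacement, \& is a literal ampersand):
printf '%s\n' "$escaped" | sed 's/&amp;/\&/g'
```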
The predefined XML entity references are:
- &lt; represents < (less than)
- &gt; represents > (greater than)
- &amp; represents & (ampersand)
- &apos; represents ' (apostrophe)
- &quot; represents " (double quote)
Put the downloaded driver file under
/usr/local/hive/lib
cp /home/yourusername/downloads/mysql-connector-java-5.1.47.jar /usr/local/hive/lib
# If you downloaded it with the second command, use instead:
# cp /home/yourusername/downloads/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /usr/local/hive/lib
Use
schematool
to initialize the MySQL metastore database:
schematool -dbType mysql -initSchema
Then we log in to MySQL to check
mysql -u root -p  # enter your password at the prompt
Check whether a database named hive exists:
show databases;
Should be as follows
- Start Hive (Hadoop must be started first)
hive
The output is as follows
3. Problems encountered in the process
- Error: multiple log4j/SLF4J bindings on the classpath prevent initialization
Delete the redundant binding jars from the classpath, leaving only one.
For example, here we delete the log4j binding jar under $HIVE_HOME/lib:
rm -rf /usr/local/hive/lib/log4j-slf4j-impl-2.6.2.jar
Note: the jar name may differ; adjust the command according to the log output in the terminal.
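One way to spot the duplicate binding is to list jars matching the binding name. The demo below uses a stand-in directory with example jar names; on a real node, point LIBDIR at $HIVE_HOME/lib:

```shell
# Stand-in lib directory with an SLF4J binding jar and an unrelated jar
# (hypothetical file names for illustration only).
LIBDIR=$(mktemp -d)
touch "$LIBDIR/log4j-slf4j-impl-2.6.2.jar" "$LIBDIR/hive-exec-3.1.2.jar"

# List binding candidates; these are the jars to consider deleting.
ls "$LIBDIR" | grep 'log4j-slf4j'
```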
- Error: Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
This happens because the guava version bundled with Hive is too old. Copy the guava jar from Hadoop into Hive's lib directory, and remember to delete the one that ships with Hive:
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /usr/local/hive/lib
cd /usr/local/hive/lib
rm -rf guava-19.0.jar
- hive-site.xml problem
cd /usr/local/hive/conf vim hive-site.xml
Use the vim command to jump to the line number reported in the error, for example
:3224
and press Enter to see where the error occurred.
In this case the problem was with closing the <configuration> tag.
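A quick sanity check before re-opening vim is to count opening and closing tags. This is sketched on a stand-in file; on a real node, point FILE at /usr/local/hive/conf/hive-site.xml:

```shell
# Stand-in hive-site.xml known to be balanced.
FILE=$(mktemp)
cat > "$FILE" <<'EOF'
<configuration>
  <property><name>x</name><value>y</value></property>
</configuration>
EOF

# Count opening vs closing <configuration> tags; the counts must match.
# (grep -c counts matching lines, so this assumes one tag per line.)
open=$(grep -c '<configuration>' "$FILE")
close=$(grep -c '</configuration>' "$FILE")
[ "$open" -eq "$close" ] && echo balanced || echo unbalanced
```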
- Hive fails to start: Cannot create directory /tmp/hive. Name node is in safe mode.
Command to exit safe mode
hdfs dfsadmin -safemode leave
Possible Causes:
HDFS safe mode is governed by the following property in the configuration file hdfs-site.xml:
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.99</value>
</property>
This is a fraction: once the NameNode has received reports for 99% of the file system's data blocks, it automatically exits safe mode. You can set this value yourself: a value less than or equal to 0 means safe mode is never entered, while a value greater than 1 keeps the NameNode in safe mode indefinitely.
If the NameNode stays in safe mode indefinitely, there are the following possibilities:
1) The value of dfs.namenode.safemode.threshold-pct is greater than 1.
2) There are too few DataNodes for the replication requirement. For example, with only one node but a minimum replication factor (dfs.replication.min) of 2, the second replica can never be placed, so the NameNode is guaranteed to stay in safe mode forever.
Safe mode explained:
The NameNode enters safe mode when it starts. If the DataNodes are missing more than the allowed fraction of blocks (dfs.safemode.threshold.pct), the system stays in safe mode, i.e., read-only.
dfs.safemode.threshold.pct (default 0.999f) means that at HDFS startup, safe mode can be left once the number of blocks reported by DataNodes reaches 0.999 times the number of blocks recorded in the metadata; otherwise HDFS stays in read-only mode. If set to 1, HDFS remains in safe mode.
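The threshold check can be worked through with illustrative numbers: say the metadata records 1000 blocks and the DataNodes have so far reported 992 of them; 992/1000 = 0.992 is below the 0.999 default, so the NameNode stays in safe mode:

```shell
# Illustrative numbers, not from a real cluster.
total=1000      # blocks recorded in NameNode metadata
reported=992    # blocks reported by DataNodes so far
threshold=0.999 # dfs.safemode.threshold.pct default

awk -v r="$reported" -v t="$total" -v th="$threshold" 'BEGIN {
  pct = r / t
  printf "reported fraction: %.3f\n", pct
  if (pct >= th) print "leave safe mode"
  else           print "stay in safe mode (read-only)"
}'
```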
There are two ways to leave safe mode:
(1) Lower dfs.safemode.threshold.pct (default 0.999) to a smaller value. (Note: setting it to 0 in hdfs-site.xml turns safe mode off.)
(2) Run the hadoop dfsadmin -safemode leave command to leave.
Note: the user can operate safe mode with
dfsadmin -safemode
where the option values are as follows:
- enter: enter safe mode
- leave: force the NameNode to leave safe mode
- get: return whether safe mode is on
- wait: wait until safe mode ends