Recommendation system from scratch (3)-Hive

This article assumes that all downloaded files are saved to /home/YourUserName/downloads

1. Introduction to Hive

Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive maps structured data files to database tables and provides SQL query capability, converting SQL statements into MapReduce jobs for execution. Its advantage is a low learning curve: you can run fast MapReduce statistics through SQL-like statements, which makes MapReduce far simpler to use without writing dedicated MapReduce applications.
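To make that concrete, here is a minimal sketch of what using Hive looks like once it is installed (the ratings table and its columns are hypothetical examples, not part of this tutorial's data); Hive compiles the aggregation below into MapReduce jobs behind the scenes:

# Hypothetical example: compute each movie's average rating, run from the shell.
# Hive translates the SQL into one or more MapReduce jobs automatically.
hive -e "
CREATE TABLE IF NOT EXISTS ratings (user_id INT, movie_id INT, rating DOUBLE);
SELECT movie_id, AVG(rating) AS avg_rating FROM ratings GROUP BY movie_id;
"

This is exactly why Hive is useful in a recommendation pipeline: aggregations over large rating logs take a few lines of SQL instead of a custom MapReduce program.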

2. Hive installation and configuration

  1. Download Hive

Go to the Apache Hive download page.

Note: Hive 3.x works with Hadoop 3.y, and Hive 2.x works with Hadoop 2.y.


My Hadoop is version 3.1.3, so I download Hive 3.1.2 (remember to download the compiled binary package, apache-hive-3.1.2-bin.tar.gz).


wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

In addition, the database I configured for Hive is MySQL, so I also need to download the MySQL JDBC driver jar.

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar

If the download above is too slow, you can try the following instead:

wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.47.tar.gz
# The file downloaded from here needs to be extracted
tar -zxf mysql-connector-java-5.1.47.tar.gz
  2. Start installation

Unzip and copy into place

tar -zxf apache-hive-3.1.2-bin.tar.gz
mv apache-hive-3.1.2-bin hive
cp -r hive /usr/local/hive

Configure environment variables

vi /etc/profile

Add the following lines

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
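After saving, reload the file so the variables take effect in the current shell:

source /etc/profile
echo $HIVE_HOME   # should print /usr/local/hive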

Create the Hive directories in HDFS and grant group write permissions

hdfs dfs -mkdir -p /usr/local/hive/warehouse
hdfs dfs -mkdir -p /usr/local/hive/tmp
hdfs dfs -mkdir -p /usr/local/hive/log
hdfs dfs -chmod g+w /usr/local/hive/warehouse
hdfs dfs -chmod g+w /usr/local/hive/tmp
hdfs dfs -chmod g+w /usr/local/hive/log
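A quick way to confirm the directories were created with the expected permissions:

hdfs dfs -ls /usr/local/hive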

Configure hive-env.sh

cd /usr/local/hive/conf
cp hive-env.sh.template hive-env.sh
vim hive-env.sh

Add the following content (the copied template already contains commented-out lines for each of these entries)

export JAVA_HOME=/usr/lib/jvm/java-1.8.0_241
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib/*

Configure hive-site.xml

cp -r hive-default.xml.template hive-site.xml
vim hive-site.xml

Add the following properties inside the <configuration> tag

 <property>
   <name>system:java.io.tmpdir</name>
   <value>/tmp/hive/java</value>
 </property>
 <property>
   <name>system:user.name</name>
   <value>${user.name}</value>
 </property>

And modify the following

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://Master:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=false</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>root</value>
 </property>
 <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>toor</value>
 </property>
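Before going further, it is worth checking that MySQL is actually reachable with the exact credentials configured above (host Master, user root, password toor; substitute your own values):

mysql -h Master -u root -ptoor -e "SELECT VERSION();"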

Note: why is &amp; used in the value of ConnectionURL? Because a bare & is not allowed in XML; it must be written as the predefined entity reference &amp;. XML's five predefined entities are:

Entity reference    Character    Name
&lt;                <            less than
&gt;                >            greater than
&amp;               &            ampersand
&apos;              '            apostrophe
&quot;              "            double quote

Put the downloaded driver file under /usr/local/hive/lib

cp /home/yourusername/downloads/mysql-connector-java-5.1.47.jar /usr/local/hive/lib
# If you downloaded with the second command (the tar.gz), use:
# cp /home/yourusername/downloads/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /usr/local/hive/lib
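A quick check that the driver jar landed where Hive will look for it:

ls /usr/local/hive/lib | grep mysql-connector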

Use schematool to initialize the MySQL database

schematool -dbType mysql -initSchema

(screenshot: schematool finishes successfully)
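If initialization succeeded, you can also ask schematool to report the schema version it created:

schematool -dbType mysql -info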

Then we log in to MySQL to check

mysql -u root -p
# enter your password at the prompt

See if there is a database called hive

show databases;

Should be as follows

(screenshot: the database list after initialization)
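Typical output looks roughly like this (the exact list depends on your MySQL installation; the point is that hive now appears):

+--------------------+
| Database           |
+--------------------+
| hive               |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+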

  3. Start Hive (you have to start Hadoop first)

hive

(screenshot: Hive startup output)
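Once the hive> prompt appears, a quick smoke test from the shell confirms the metastore connection works (this assumes the Hadoop services are up):

hive -e "SHOW DATABASES;"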

3. Problems encountered in the process

  • SLF4J reports multiple log4j bindings and Hive fails to initialize

Delete the duplicate binding jar from the classpath, leaving only one.


For example, here we delete the log4j package under $HIVE_HOME/lib

rm -rf /usr/local/hive/lib/log4j-slf4j-impl-2.6.2.jar

Note: the name of the jar may differ; adjust the command according to the log output in your terminal.
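To see which bindings are present on each side before deleting anything, you can list them (paths assume the layout used in this article):

ls $HIVE_HOME/lib | grep slf4j
ls $HADOOP_HOME/share/hadoop/common/lib | grep slf4j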

  • Error: Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

This happens because the guava version bundled with Hive is too old. You can copy the guava jar from Hadoop directly; remember to delete the one that ships with Hive.

cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /usr/local/hive/lib
cd /usr/local/hive/lib
rm -rf guava-19.0.jar
  • hive-site.xml parse error

The error message reports a specific line number, for example line 3224. Open the file:

cd /usr/local/hive/conf
vim hive-site.xml

Then use the vim command

:3224

Press enter to see where the error occurred


Here, the problem turned out to be an improperly closed <configuration> tag.

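If xmllint happens to be installed on your machine, it can pinpoint XML syntax errors directly instead of hunting through vim:

xmllint --noout /usr/local/hive/conf/hive-site.xml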

  • Hive fails to start: Cannot create directory /tmp/hive. Name node is in safe mode.

Command to exit safe mode

hdfs dfsadmin -safemode leave

Possible Causes:

HDFS safe mode is governed by the following property in the configuration file hdfs-site.xml:

<property>
	<name>dfs.namenode.safemode.threshold-pct</name>
	<value>0.99</value>
</property>

This is a percentage: the NameNode automatically exits safe mode once it has accounted for 99% of the file system's data blocks. You can set the percentage yourself. A value less than or equal to 0 means safe mode is never entered, while a value greater than 1 keeps the NameNode in safe mode indefinitely.
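You can check the current state at any time:

hdfs dfsadmin -safemode get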

If HDFS stays in safe mode indefinitely, there are the following possibilities:

1) The value of dfs.namenode.safemode.threshold-pct is greater than 1.

2) There are too few nodes to satisfy the replication requirement. For example, if there is only one node but the minimum replication (dfs.replication.min) is 2, the cluster will stay in safe mode forever: two replicas are required, but with only one node the second replica can never be placed.


Safe mode explained:

The NameNode enters safe mode when it starts. If the DataNodes fail to report a certain percentage of blocks (dfs.safemode.threshold.pct), the system stays in safe mode, i.e. read-only.
dfs.safemode.threshold.pct (default 0.999f) means that at HDFS startup, safe mode is left once the number of blocks reported by the DataNodes reaches 0.999 times the number of blocks in the metadata record; otherwise HDFS stays in this read-only mode. If it is set to 1, HDFS remains in safe mode permanently.
There are two ways to leave safe mode:
(1) Lower dfs.safemode.threshold.pct to a smaller value (the default is 0.999). Note: setting it to 0 in hdfs-site.xml turns safe mode off.
(2) Run hadoop dfsadmin -safemode leave.

Note: you can operate safe mode with dfsadmin -safemode; the parameter values are described as follows:

  • enter - enter safe mode
  • leave - force the NameNode to leave safe mode
  • get - return whether safe mode is on
  • wait - wait until safe mode ends
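The wait option is handy in startup scripts. A minimal sketch: block until the NameNode leaves safe mode, then launch Hive:

# Blocks until HDFS exits safe mode, then starts the Hive CLI
hdfs dfsadmin -safemode wait && hive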


Origin blog.csdn.net/JikeStardy/article/details/105210672