Note: this article's materials are maintained on Yuque. Reply "Yuque" to the official account to get all the continuously updated resources of this blog.
This article takes Hive as its topic, going from SQL down to Hive's underlying execution, to study Hive in more depth. For most readers, understanding of Hive's execution internals stops at the theoretical level. This article walks you through tracing the Hive source code on your own machine in about half an hour. From it you will learn:
1. How to install and deploy each component
2. How to troubleshoot problems along the way
3. How to debug the Hive source code locally, i.e. trace the underlying execution of SQL submitted from your own machine
There are many similar write-ups online, but most of them are fragmentary. This article takes you from zero to a complete working setup. Note: all software packages and executable files involved have been uploaded to the Yuque documentation, and everything here is done on Windows.
Environment preparation
1. Maven installation
1.1. Download the software package and decompress it
The software package has been uploaded to the Yuque documentation; readers who need it can get it by replying "Yuque" to the official account. If the link expires, contact the author.
1.2. Configure environment variables
1.3. Configure the Maven mirror
Open the $MAVEN_HOME/conf/settings.xml file and adjust the local repository path and mirror address:
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
https://maven.apache.org/xsd/settings-1.0.0.xsd">
<localRepository>path/to/your/local/repository</localRepository>
<mirrors>
<mirror>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<mirrorOf>central</mirrorOf>
</mirror>
</mirrors>
</settings>
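To confirm that Maven picks up the new settings, a quick sanity check from cmd (assuming mvn is already on the Path) might look like:

```cmd
:: Print the Maven version and the settings file in use
mvn -v
:: Show the merged settings actually in effect (local repository, mirrors)
mvn help:effective-settings
```

If the Aliyun mirror appears in the effective settings, downloads in the later Hive compilation step will go through it.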
2. Cygwin installation
This software provides the basic environment packages needed to compile the source code on Windows.
2.1. Download the software package
The package is likewise available from the Yuque documentation (reply "Yuque" to the official account).
2.2. Install related software packages online
You need to install Cygwin along with the gcc-related compilation packages:
binutils
gcc
gcc-mingw
gdb
3. JDK installation
3.1. Download the software package and decompress it
The package is likewise available from the Yuque documentation (reply "Yuque" to the official account).
3.2. Configure environment variables
3.2.1. Create the JAVA_HOME system variable
3.2.2. Create the CLASSPATH variable
3.2.3. Add the Path variable
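As a sketch, the variables above can be created from an elevated cmd window; the JDK path used below is an assumption, so substitute your own install directory:

```cmd
:: Create JAVA_HOME pointing at the JDK root (example path, substitute your own)
setx JAVA_HOME "D:\Java\jdk1.8.0_201"
:: CLASSPATH entries traditionally required for JDK 8 (literal paths for simplicity)
setx CLASSPATH ".;D:\Java\jdk1.8.0_201\lib\dt.jar;D:\Java\jdk1.8.0_201\lib\tools.jar"
:: After also appending D:\Java\jdk1.8.0_201\bin to Path, verify in a NEW window:
java -version
```

Note that setx does not affect the current window, so the verification must run in a freshly opened cmd session.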
4. Hadoop installation
4.1. Download the software package and decompress it
The package is likewise available from the Yuque documentation (reply "Yuque" to the official account).
4.2. Configure environment variables
Create the HADOOP_HOME system variable and append it to the Path variable.
4.3. Edit configuration file
Note: the downloaded package may not contain the files below, but template files are provided; just rename them. Since this setup is for personal use, only the core settings need to be configured. The following files are located in the $HADOOP_HOME/etc/hadoop directory.
4.3.1. Modify the core-site.xml file
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
4.3.2. Modify the yarn-site.xml file
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
4.3.3. Modify the mapred-site.xml file
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
4.4. Dependent files
Running Hadoop on Windows requires extra native binaries and libraries, namely winutils.exe, hadoop.dll, and related files. They need to be placed in the $HADOOP_HOME/bin directory, and hadoop.dll can also be copied to C:\Windows\System32 to avoid dependency errors.
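Placing the native files can be sketched as follows (run from the directory holding the downloaded binaries; the paths are assumptions):

```cmd
:: Copy the Windows native binaries into Hadoop's bin directory
copy winutils.exe %HADOOP_HOME%\bin\
copy hadoop.dll %HADOOP_HOME%\bin\
:: Optionally place hadoop.dll system-wide to avoid native-library load errors
copy hadoop.dll C:\Windows\System32\
```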
4.5. Initialization start
The following commands are executed on the cmd command line, from the $HADOOP_HOME/bin and $HADOOP_HOME/sbin directories respectively:
# First format the NameNode: open cmd and run
$HADOOP_HOME/bin> hadoop namenode -format
# Once formatting completes, start Hadoop
$HADOOP_HOME/sbin> start-all.cmd
When the NameNode, DataNode, ResourceManager, and NodeManager processes have all started successfully, the basic Hadoop environment is ready.
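You can confirm which daemons are up with the JDK's jps tool:

```cmd
:: List running JVM processes; a healthy pseudo-distributed setup shows
:: NameNode, DataNode, ResourceManager and NodeManager (plus Jps itself)
jps
```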
5. MySQL installation
5.1. Download the software package and decompress it
The package is likewise available from the Yuque documentation (reply "Yuque" to the official account).
5.2. Configure environment variables
5.2.1. Create the MYSQL_HOME variable
5.2.2. Add MYSQL_HOME to the Path variable
5.3. Generate the data directory
Open a CMD window as Administrator.
-- Run the following command
mysqld --initialize-insecure --user=mysql
5.4. Install the MySQL service and start it
-- Install MySQL as a Windows service
mysqld --install
-- Start the service
net start MySQL
5.5. Password modification
-- No password is required on first login; just press Enter
mysql -u root -p
-- Change the default root password
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY '123456';
-- Flush privileges to apply the change
flush privileges;
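If you prefer to create the metastore database yourself rather than relying on createDatabaseIfNotExist in the JDBC URL configured later, a sketch (the database name hive and the password 123456 match the hive-site.xml settings below):

```cmd
:: Create the metastore database; latin1 avoids index-length problems
:: seen with some older Hive metastore schemas on MySQL
mysql -u root -p123456 -e "CREATE DATABASE IF NOT EXISTS hive DEFAULT CHARACTER SET latin1;"
```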
6. Hive compilation and installation
If you want to study Hive's internals in depth, compiling the source is essential, so this article installs Hive by compiling the source package.
6.1. Download the software package and decompress it
The package is likewise available from the Yuque documentation (reply "Yuque" to the official account).
6.2. Compile
Execute the following command in the IDEA terminal:
mvn clean package -Phadoop-2 -DskipTests -Pdist
When the build finishes with BUILD SUCCESS, the compilation succeeded and you can move on to the next step.
Note: you will very likely hit problems during compilation, most of them Maven-related, caused either by network issues or by your mirror configuration. For "Could not transfer artifact XXXX" errors in particular modules, you can temporarily comment out the scope setting in the pom file.
6.3. Installation
After the compilation is complete, you can find the corresponding executable compressed package in the $HIVE_SRC_HOME/packaging/target directory.
6.3.1. Configure environment variables
Create the HIVE_HOME system variable, then append it to the Path variable.
6.3.2. Edit the hive-env.sh file
# Environment configuration
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=D:\GitCode\hadoop-2.7.2
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=D:\GitCode\apache-hive-2.3.9-bin\conf
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export HIVE_AUX_JARS_PATH=D:\GitCode\apache-hive-2.3.9-bin\lib
6.3.3. Edit the hive-site.xml file
This file has quite a few parameters, but we only need to change the basic ones:
<property>
<name>hive.repl.rootdir</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local</value>
</property>
<property>
<name>hive.repl.cmrootdir</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local</value>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local\${hive.session.id}_resources</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>D:\GitCode\apache-hive-2.3.9-bin\tmp_local</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
6.3.4. Rename the log files
Remove the .template suffix from the several log-related files in the conf directory.
6.3.5. Load the driver package
This article uses MySQL as the metastore, so the JDBC driver jar needs to be placed in the $HIVE_HOME/lib directory.
6.4. Metadata initialization
In the Windows environment, the $HIVE_HOME/bin directory of this build does not contain the executables ending in .cmd, so to get a working environment you can copy them from a lower version. The scripts and packages involved in this article have been packaged and uploaded to the cloud drive.
# Open the cmd command line
$HIVE_HOME/bin> hive schematool -dbType mysql -initSchema --verbose
Note: during initialization, table creation may fail; in that case you can create the tables manually, or simply source the SQL file directly. This article uses hive-schema-2.3.0.mysql.sql, located under $HIVE_HOME/scripts/metastore/upgrade/mysql:
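Once schematool (or the manual source) finishes, one way to sanity-check the metastore schema is to query the VERSION table it creates in MySQL:

```cmd
:: The VERSION table records the installed metastore schema version
mysql -u root -p123456 -e "USE hive; SELECT SCHEMA_VERSION FROM VERSION;"
```

For a 2.3.x build this should report a 2.3.0 schema version; an empty result means initialization did not complete.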
Service startup and debugging
The Hadoop services were already started in the Hadoop section; here we start the Hive Metastore service:
hive --service metastore
2.1. Start the server in Debug mode
For ease of study, you can open the Terminal in IDEA and start the service in debug mode:
hive --debug
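As background (this is based on the Unix hive launcher scripts, so treat the exact mechanics as an assumption), --debug injects a JDWP agent into the JVM options, by default listening on port 8000 with the JVM suspended until a debugger attaches; it amounts to roughly:

```cmd
:: What --debug amounts to: default port 8000, JVM waits for the debugger
set HADOOP_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000
hive
```

The port here is what the Remote Debug configuration in IDEA must point at.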
2.2. Client breakpoint mode
2.2.1. Configure remote debug
Configure the Remote Debug information in IDEA.
2.2.2. Set breakpoints
This article traces the source from the client side, so execution generally enters the CliDriver class; set a breakpoint in its main method, for example, and then start debugging. At this point the whole debugging setup is in place: you can write HQL locally, follow the breakpoints to learn exactly how Hive executes it under the hood, and even modify the source yourself!
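With the breakpoint set in CliDriver.main and the Remote Debug session attached, any statement submitted through the Hive CLI will drive the flow you are tracing; for example:

```cmd
:: A trivial statement is enough to step from CliDriver into the compiler
hive -e "show databases;"
```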