Hive environment setup (step-by-step tutorial)

1. Introduction to Hive

Hive is a data analysis tool open sourced by Facebook for processing massive volumes of structured logs.
Hive is a Hadoop-based data warehouse tool that maps structured data files to database tables and provides SQL-like queries, converting SQL statements into MapReduce jobs for execution.

The SQL statements can also be executed on Tez or Spark instead of MapReduce.

Why do you need Hive?

  • It lowers the cost of learning MapReduce: DBAs and operations staff can do the same work with SQL
  • It is open source, which saves costs
  • It can handle big data

Hive Architecture System

Hive Architecture Diagram

  • Driver component

    • Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR. The AST is then analyzed, for example to check whether the tables and fields exist and whether the SQL contains semantic errors.

    • Compiler (Physical Plan): Compile the AST to generate a logical execution plan.

    • Optimizer (Query Optimizer): Optimize the logical execution plan.

    • Execution: Converts the logical execution plan into a runnable physical plan. For Hive, that is MR/Spark/Tez.

  • MetaStore component

    • The MetaStore component stores Hive's metadata in a relational database. The supported databases mainly include MySQL and Derby. The metastore can also run as an independent service on a remote cluster, which makes Hive more robust.
    • Metadata mainly includes the table name, the database the table belongs to (default is default), the table owner, the columns, the partitions and their attributes, the table attributes (such as whether it is an external table), and the directory where the table's data is stored.
  • User interface

    • CLI (Command Line Interface): the interactive command-line tool, which lets us operate Hive with commands.
    • HiveServer2: once this service is started, programs can operate Hive. For example, a Java program can call the Hive JDBC driver to add, delete, modify, and query data, which is as convenient as working with a relational database.
    • Hive Web Interface: a web version of the operation interface. It is deprecated in newer versions; HUE can be used instead.

2. Hive environment installation

Hive has three deployment modes (embedded, local, and remote), which are distinguished mainly by how the Metastore runs.

Hive deployment mode

This article demonstrates Local mode, using MariaDB, a fork of MySQL, as the relational database.

1. Preparations

1) Install Hadoop

This article uses Hadoop 2.7.7 as an example; for installation, refer to the tutorial: https://blog.csdn.net/tangyi2008/article/details/121908766

After installing Hadoop, you need to add the user's proxy configuration:

vi $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration items

<property>
    <name>hadoop.proxyuser.xiaobai.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.xiaobai.groups</name>
    <value>*</value>
</property>
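
As a rough sketch, the two properties above can be generated for any user name instead of being typed by hand. The helper name proxyuser_properties is made up for this illustration, and the assumption is that the user in the property name is the operating-system account that will start HiveServer2 (xiaobai in this article).

```shell
# Hypothetical helper: print the two proxy-user properties for one account.
# $1 = account that will start HiveServer2 (an assumption of this sketch).
proxyuser_properties() {
  for key in hosts groups; do
    printf '<property>\n'
    printf '    <name>hadoop.proxyuser.%s.%s</name>\n' "$1" "$key"
    printf '    <value>*</value>\n'
    printf '</property>\n'
  done
}

# Print the fragment for the account used in this article.
proxyuser_properties xiaobai
```

The printed properties go inside the <configuration> element of core-site.xml; restarting HDFS and YARN afterwards makes sure the change is picked up.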

2) Install MariaDB

The MariaDB database is a fork of MySQL developed by MySQL founder Michael Widenius. One reason for creating the fork was that, after Oracle acquired MySQL, there was a potential risk of MySQL becoming closed source, so the community forked it to avoid this risk.

The MariaDB name comes from the name of Michael Widenius' daughter Maria.

(1) Check the MariaDB installation

Use the following command to check the installation of MariaDB

sudo dpkg -l | grep maria

If nothing is printed, MariaDB is not installed. Otherwise, decide whether you need to uninstall and reinstall it.

(2) Uninstall MariaDB/MySQL

If you want to uninstall the previously installed mariadb or mysql, you can use the following command

sudo apt autoremove mysql-\*

If you are sure that MariaDB is installed, you can use the following command to uninstall it

sudo apt autoremove mariadb-\*

(3) Install MariaDB

For a convenient installation, use the following commands, which install the version provided by your default package sources

sudo apt update
sudo apt install mariadb-server

Note: At https://downloads.mariadb.org/mariadb/repositories/ you can select your operating system and release, the MariaDB version you want to install, and a mirror close to your region, so that the required version can be installed more quickly and precisely.

Once the operating system (18.04 LTS), MariaDB version (10.6), and mirror (Alibaba Cloud) are selected, the web page shows the commands for adding the repository and for installing.

MariaDB version selection

Commands to add Repository

sudo apt-get install software-properties-common dirmngr apt-transport-https
sudo apt-key adv --fetch-keys 'https://mariadb.org/mariadb_release_signing_key.asc'
sudo add-apt-repository 'deb [arch=amd64,arm64,ppc64el] https://mirrors.aliyun.com/mariadb/repo/10.6/ubuntu bionic main'

Commands to update sources and install MariaDB

sudo apt update
sudo apt install mariadb-server

After the installation is complete, check the MariaDB service status using the command below

sudo systemctl status mariadb

If the output shows the service as active (running), the installation succeeded.

MariaDB status view

(4) Simple configuration of MariaDB

Use the following commands for basic configuration:

sudo mysql_secure_installation

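After you run the command, the terminal asks the following questions:

1.Enter current password for root (enter for none): 
(A fresh MariaDB installation has no root password, so just press Enter)

2.Switch to unix_socket authentication [Y/n] n
(Whether to switch to unix_socket authentication; n is entered here, so it is not switched)

3.Change the root password? [Y/n] y
(Whether to change the root account's password; y is entered here, so the root password is changed
Note:
- The root password should preferably be a complex one; otherwise you may need sudo every time you connect to MariaDB
- The cursor does not move while you type the password
- This article sets the password to abc@@123
)

4.Remove anonymous users? [Y/n] y
(Whether to remove anonymous users; y is entered here, so they are removed
By default, a MariaDB installation has an anonymous user that lets anyone log in to MariaDB without having an account created for them; always remove it in a production environment
)

5.Disallow root login remotely? [Y/n] y
(Whether to disallow remote login for the root account, so that root can only log in from localhost; y is entered here, so remote root login is not allowed)

6.Remove test database and access to it? [Y/n] y
(Whether to remove the test database; y is entered here, so it is removed
By default, MariaDB has a test database that any user can access
)

7.Reload privilege tables now? [Y/n] y
(Whether to reload the privilege tables; y is entered here, so they are reloaded immediately)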

(5) Common problems and solutions

  • Connecting to MariaDB requires sudo (MySQL has a similar problem)

Problem description: After installing MariaDB on Ubuntu 18.04, you need sudo every time you access MariaDB.

Reason: The root user's password in MariaDB is not strong enough.

Solution: Change the root account's password to a strong one, for example at least 8 characters containing uppercase and lowercase letters, numbers, and symbols. The following command changes the password to abc@@123

SET PASSWORD = PASSWORD('abc@@123');

  • Remote access

Problem description: The database cannot be reached through the server's IP address.

Reason: By default, MariaDB binds to 127.0.0.1, so it accepts connections only through 127.0.0.1 or localhost.

Solution: Modify the configuration file /etc/mysql/mariadb.conf.d/50-server.cnf

sudo vi /etc/mysql/mariadb.conf.d/50-server.cnf

In the [mysqld] section, change bind-address = 127.0.0.1 to bind-address = 0.0.0.0

bind-address modification

Then restart MariaDB service

sudo service mariadb restart
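
The same edit can be scripted with sed instead of vi. This is a minimal sketch: set_bind_address is a hypothetical helper, and the demonstration runs against a throwaway sample file rather than the real 50-server.cnf.

```shell
# Hypothetical helper: rewrite the value of an existing bind-address line.
# $1 = config file, $2 = new address. Commented-out lines are left alone.
set_bind_address() {
  tmp="$(mktemp)" || return 1
  sed -E "s/^([[:space:]]*bind-address[[:space:]]*=[[:space:]]*).*/\1$2/" "$1" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$1"   # overwrite in place, keeping the file's permissions
  rm -f "$tmp"
}

# Demonstration on a sample file that mimics the [mysqld] section.
sample="$(mktemp)"
printf '[mysqld]\n#skip-name-resolve\nbind-address            = 127.0.0.1\n' > "$sample"
set_bind_address "$sample" 0.0.0.0
grep bind-address "$sample"
rm -f "$sample"
```

The real configuration file is owned by root, so a script like this would have to run with sudo; backing up the file first is a sensible precaution.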

The MariaDB/MySQL tools read configuration files in the following order:

  1. “/etc/mysql/my.cnf”, a symlink to /etc/mysql/mariadb.cnf, which is the reason all the rest is read.

  2. “/etc/mysql/mariadb.cnf” to set global defaults.

  3. “/etc/mysql/conf.d/*.cnf” to set global options.

  4. “/etc/mysql/mariadb.conf.d/*.cnf” to set MariaDB-only options.

  5. “~/.my.cnf” to set user-specific options.

2. Hive installation

1) Download the Hive installation package

Download address: https://dlcdn.apache.org/hive/, the download version of this article is: apache-hive-2.3.9

The package can also be downloaded on Windows with a download accelerator such as Thunder and then uploaded to the corresponding directory with an sftp tool

cd ~/soft
wget https://dlcdn.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz

2) Install Hive

tar -xvf apache-hive-2.3.9-bin.tar.gz -C ~/opt
cd ~/opt
ln -s apache-hive-2.3.9-bin/ hive

Configure environment variables

vi ~/.bashrc

Add the following at the end:

export HIVE_HOME=/home/xiaobai/opt/hive
export PATH=$HIVE_HOME/bin:$PATH

Make the configuration take effect

source ~/.bashrc
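
After reloading ~/.bashrc, a quick check can confirm that Hive's bin directory is actually on PATH. The sketch below uses a hypothetical helper, path_has_dir, and falls back to the /home/xiaobai/opt/hive location from this article when HIVE_HOME is unset.

```shell
# Hypothetical helper: succeed if directory $1 is an entry of the
# PATH-style string $2 (defaults to the current PATH).
path_has_dir() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

hive_bin="${HIVE_HOME:-/home/xiaobai/opt/hive}/bin"
if path_has_dir "$hive_bin"; then
  echo "$hive_bin is on PATH"
else
  echo "$hive_bin is NOT on PATH; check ~/.bashrc"
fi
```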

3) Upload the MySQL driver

MySQL connection driver download page: https://downloads.mysql.com/archives/cj/

Here, wget is used to download version 5.1.49 of the connector:

cd ~/soft
wget https://cdn.mysql.com/archives/mysql-connector-java-5.1/mysql-connector-java-5.1.49.tar.gz
tar -xvf mysql-connector-java-5.1.49.tar.gz
cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar   ~/opt/hive/lib

Of course, you can also download the connector on Windows and then upload it to the $HIVE_HOME/lib directory.
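
Before moving on, it can help to confirm that the connector jar really landed in Hive's lib directory, since the metadata initialization later needs the MySQL driver on Hive's classpath. A small sketch; has_mysql_connector is a hypothetical helper, and the fallback path assumes the ~/opt/hive layout used in this article.

```shell
# Hypothetical helper: succeed if a MySQL connector jar exists in directory $1.
has_mysql_connector() {
  for jar in "$1"/mysql-connector-java-*.jar; do
    # If nothing matches, the pattern stays unexpanded and -f is false.
    if [ -f "$jar" ]; then
      return 0
    fi
  done
  return 1
}

lib_dir="${HIVE_HOME:-$HOME/opt/hive}/lib"
if has_mysql_connector "$lib_dir"; then
  echo "MySQL connector found in $lib_dir"
else
  echo "MySQL connector missing from $lib_dir"
fi
```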

4) Modify the configuration file

Configuration file: hive-site.xml

Introduction to common configuration parameters:

Property name | Type | Default | Description
hive.metastore.warehouse.dir | URI | /user/hive/warehouse | Directory, relative to fs.defaultFS, where managed tables are stored
hive.metastore.uris | Comma-separated URIs | Not set | If not set (the default), an in-process metastore is used; otherwise Hive connects to the remote metastore servers given by the list of URIs. With multiple remote servers, clients connect in round-robin fashion
javax.jdo.option.ConnectionURL | URI | jdbc:derby:;databaseName=metastore_db;create=true | JDBC URL. MySQL example: jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName | String | org.apache.derby.jdbc.EmbeddedDriver | Class name of the JDBC driver. MySQL example: com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName | String | APP | JDBC username
javax.jdo.option.ConnectionPassword | String | mine | JDBC password

Open hive-site.xml

vi ~/opt/hive/conf/hive-site.xml

Add the following

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://node1:3306/hive?useSSL=false&amp;createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive_pwd</value>
    </property>
    <property> 
        <name>hive.server2.thrift.port</name> 
        <value>10000</value> 
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>node1</value>
    </property>
</configuration>

Special note: The database host is set to node1 here, so the remote access problem described earlier (the database being unreachable through the server's address) must be solved first.
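
One detail in the file above is easy to miss: the & in the JDBC URL is written as &amp; because hive-site.xml is XML, where a bare & is not allowed. As an illustration, the sketch below escapes a value and prints it as a property element; xml_escape and hive_property are hypothetical helpers, not part of Hive.

```shell
# Hypothetical helper: escape the characters that are special in XML text.
# & is handled first so the entities added afterwards are not escaped again.
xml_escape() {
  printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: print one <property> element.
# $1 = property name, $2 = raw (unescaped) value.
hive_property() {
  printf '    <property>\n'
  printf '        <name>%s</name>\n' "$1"
  printf '        <value>%s</value>\n' "$(xml_escape "$2")"
  printf '    </property>\n'
}

hive_property javax.jdo.option.ConnectionURL \
  'jdbc:mysql://node1:3306/hive?useSSL=false&createDatabaseIfNotExist=true'
```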

5) Metadata initialization

(1) Grant privileges to the Hive account in the database

mysql -uroot -pabc@@123
grant all privileges on hive.* to 'hive'@'%' identified by 'hive_pwd';

(2) Initialize metadata

schematool -dbType mysql -initSchema

3. Test environment

1) Start hdfs and yarn

# Make sure hdfs and yarn are started
start-dfs.sh
start-yarn.sh

2) CLI command window

# Enter Hive's CLI. With no arguments, the hive command starts the CLI by default, so the command below is equivalent to plain hive
hive --service cli 

3) beeline client

(1) Start the HS2 service

To use the beeline client, you need to start the hiveserver2 service (HS2)

nohup hive --service hiveserver2 &

When starting background daemons, nohup is often combined with &, in the form nohup <program name> &, so that the program survives both Ctrl+C and the end of the session:

  • Running a program with nohup alone

    • Pressing Ctrl+C sends SIGINT, and the program exits
    • Closing the session sends SIGHUP, and the program is immune
  • Running a program in the background with & alone

    • Pressing Ctrl+C sends SIGINT, and the program is immune
    • Closing the session sends SIGHUP, and the program exits
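
Putting the two together, a small wrapper can start any long-running service this way while keeping its output in a log file of your choice. A sketch under assumptions: start_daemon and DAEMON_PID are names invented here, and sleep stands in for the real service so the example runs anywhere.

```shell
# Hypothetical helper: run a command under nohup in the background.
# $1 = log file, remaining arguments = the command and its arguments.
# The PID of the background process is left in DAEMON_PID.
start_daemon() {
  log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 &
  DAEMON_PID=$!
}

# Demonstration with a stand-in command.
demo_log="$(mktemp)"
start_daemon "$demo_log" sleep 5
echo "started PID $DAEMON_PID, output goes to $demo_log"
kill "$DAEMON_PID"   # stop the stand-in again
rm -f "$demo_log"

# For HiveServer2 the call would look like:
# start_daemon "$HOME/hiveserver2.log" hive --service hiveserver2
```

Redirecting the output explicitly avoids the nohup.out file that nohup otherwise creates in the current directory, and the saved PID makes it easy to stop the service later with kill.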

(2) Start beeline

# Start beeline
beeline

# Enter the connection information at the interactive prompt:
!connect  jdbc:hive2://node1:10000

# Enter the username and password

# List all databases
show databases;

You can also specify the corresponding connection parameters when beeline starts.

beeline -u jdbc:hive2://node1:10000 -n xiaobai -p 123456
  • -u: JDBC URL to connect to
  • -n: username for the connection
  • -p: password for connection
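
HiveServer2 connection strings follow the pattern jdbc:hive2://host:port/database. The sketch below assembles one from its parts; hs2_jdbc_url is a hypothetical helper, and its defaults (port 10000, database default) mirror the values used in this article.

```shell
# Hypothetical helper: build a HiveServer2 JDBC URL.
# $1 = host, $2 = port (default 10000), $3 = database (default "default").
hs2_jdbc_url() {
  printf 'jdbc:hive2://%s:%s/%s\n' "$1" "${2:-10000}" "${3:-default}"
}

hs2_jdbc_url node1

# With beeline installed, the URL can be passed straight to -u:
# beeline -u "$(hs2_jdbc_url node1)" -n xiaobai -p 123456
```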

That's it, you're done!

Origin blog.csdn.net/tangyi2008/article/details/123368215