Introduction to HIVE

    This article introduces some of the technical points of Hive.

1. Overview

1. The problems with MapReduce

    1. MapReduce jobs can only be developed in Java. This is a barrier for people who do not know Java, or who do not program at all, such as data warehouse engineers.

    2. Developing complex and efficient code requires understanding Hadoop's underlying layers and APIs, such as the shuffle process.

    3. Development and debugging are troublesome.

    Is there a way to solve these problems? Ideally, big data processing programs could be developed in a general way that hides the underlying details and simplifies development and testing.

2. The essence of HIVE

1. Database

    Provides real-time data for online systems, with full support for insert, delete, update, and query operations and complete transaction support. It tries to avoid data redundancy in order to save storage space and improve processing efficiency.

2. Data warehouse

    Stores and processes offline historical data, and supports offline data analysis. Data is written once and queried many times; row-level inserts, updates, and deletes are not supported, transaction semantics are not emphasized, and redundancy is deliberately introduced to improve query efficiency.

3. HIVE

    HIVE builds a SQL-like operation layer on top of Hadoop, so that we can operate on data through HQL, a SQL-like language. HIVE translates these HQL statements into MapReduce jobs to process massive amounts of data.

    Since HIVE operates on massive data through HQL, can it be considered a Hadoop-based database? No!

    HIVE is not a Hadoop-based database tool, but a Hadoop-based data warehouse tool.

    HIVE maps structured data files to database tables and provides complete SQL query capability. However, data is written once and queried many times, and row-level inserts, updates, and deletes are not supported (appending data became possible after Hadoop 2.0); this is a limitation of the underlying HDFS. In essence, HIVE just adds a SQL shell on top of Hadoop and remains an offline data analysis tool. Transaction features are not supported, and query performance is often improved by deliberately creating redundancy.

    HIVE is a data warehouse tool based on Hadoop.

3. Advantages and disadvantages

1. Advantages

    1. The learning cost is low: simple MapReduce statistics can be implemented quickly through SQL-like statements, without writing dedicated MapReduce applications, which makes Hive well suited to statistical analysis in data warehouses.

    2. Hive provides a series of tools for data extraction, transformation, and loading (ETL), giving a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop.

    3. Hive defines a simple SQL-like query language called HiveQL, which lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis tasks that the built-in operators cannot handle, as the sketch below shows.
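
    For example, HiveQL's TRANSFORM clause can stream table rows through an external script. The following is a minimal sketch, in which the script path, the script itself, and the output column names are all hypothetical (the student table is the one created in the introductory case later in this article):

add file /home/mydata/to_upper.py;
select transform (id, name)
using 'python to_upper.py'
as (id, upper_name)
from student;

    Hive passes each row to the script as a tab-separated line on standard input and reads tab-separated lines back from standard output, so any language that can read and write text streams can be used.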

2. Disadvantages

    1. Hive does not support online transaction processing.

    2. Row-level inserts, updates, and deletes are not supported.

    3. Queries are relatively slow.

2. Installation and configuration of HIVE

1. Prerequisites

1. JDK

    Install the JDK and configure the JAVA_HOME environment variable.

2. Hadoop

    Hive requires Hadoop; install Hadoop and configure the HADOOP_HOME environment variable.
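
    For reference, typical /etc/profile entries look like the following; the installation paths and the JDK version are assumptions, so adjust them to your own environment:

export JAVA_HOME=/home/software/jdk1.7.0_79
export HADOOP_HOME=/home/software/hadoop-2.7.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin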

    For the installation of Hadoop, refer to: Hadoop pseudo-distributed mode construction, and Hadoop fully distributed cluster construction.

2. Download

    Download a recent version of Hive from the Apache official website, paying attention to compatibility with your Hadoop version.

    The versions used here are Hadoop 2.7.1 and Hive 1.2.0.

3. Installation

    Upload the downloaded Hive installation package to Linux.

    Unzip:

tar -zxvf apache-hive-1.2.0-bin.tar.gz

4. Start

    Enter the hive/bin directory and run the hive command directly to enter the hive prompt.

./hive

    Hive can run without any configuration because it obtains Hadoop's configuration information through the HADOOP_HOME environment variable.

5. Installation conflict

1. Problem Description

    In a hadoop 2.5.x environment, starting hive fails with the following error:

    java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

2. Problem Analysis

    This error occurs because an incompatible version of the jline.Terminal class is loaded.

    After inspection, jline-0.9.x.jar exists in the hadoop/share/hadoop/yarn/lib directory, while jline-2.12.jar exists in the hive/lib/ directory. These duplicate, incompatible jars cause the problem.

3. Solution

    1. Copy hive/lib/jline-2.12.jar into hadoop/share/hadoop/yarn/lib to replace jline-0.9.x.jar, then restart hadoop and hive (see the commands after this list).

    2. Upgrade hadoop to a later version; for example, this problem is fixed in 2.7.x.
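
    For solution 1, the commands look roughly as follows; the exact file name of the old jar (0.9.94 here) is an assumption, so check what is actually in the directory first:

cp $HIVE_HOME/lib/jline-2.12.jar $HADOOP_HOME/share/hadoop/yarn/lib/
rm $HADOOP_HOME/share/hadoop/yarn/lib/jline-0.9.94.jar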

6. Introductory case

    View the databases:

show databases;

    After execution, one built-in library named default is found.

show tables;

    There are no tables, which shows that the default library is used when no other library has been selected.

create database school;

    A new /user/hive/warehouse/school.db directory appears in hdfs.

    Conclusion 1: A database in hive corresponds to a directory ending with .db under /user/hive/warehouse in hdfs.
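
    You can verify this from the Hadoop side, assuming the default warehouse location:

hadoop fs -ls /user/hive/warehouse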

use school;
create table student (id int,name string);
show tables;
desc student;
show create table student;

    The table is created correctly.

    A new /user/hive/warehouse/school.db/student directory appears in hdfs.

    Conclusion 2: A table in hive corresponds to a directory under its database directory in hdfs, /user/hive/warehouse/[db directory].

load data local inpath '../mydata/student.txt' into table student;

    New files appear under /user/hive/warehouse/school.db/student.

select * from student;

    The queried data comes back wrong because no field separator was specified when the table was created; hive's default field delimiter is the '\001' (Ctrl+A) character, which does not match the delimiter used in the file.
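
    For reference, student.txt is assumed here to be a tab-separated file along these lines (the contents are hypothetical):

1	tom
2	jim
3	lucy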

create table student2 (id int,name string) row format delimited fields terminated by '\t';
load data local inpath '../mydata/student.txt' into table student2;
select * from student2;

    The data is now queried correctly.

    Conclusion 3: The data in a hive table corresponds to the data in the files under that table's hdfs directory.

select count(*) from student;

    A mapreduce job is executed, and the result is returned.

    Conclusion 4: hive converts HQL statements into MapReduce jobs for execution.
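
    You can inspect the plan hive generates for a query with the explain command (the plan output is omitted here):

explain select count(*) from student;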

use default;
create table teacher(id int,name string);

    A teacher directory appears in hdfs directly under /user/hive/warehouse, not inside any .db directory.

    Conclusion 5: hive's default database corresponds directly to the /user/hive/warehouse directory, and tables created in the default database have their directories created directly under it.

3. MySQL metastore

    The metastore is HIVE's metadata database.

    In addition to saving the real data, HIVE also saves additional data that describes the libraries, tables, and columns; this is called HIVE metadata.

    Where is this metadata stored?

    HIVE stores this metadata in a separate relational database. If the configuration is not modified, HIVE uses the built-in derby database by default. Derby is a Java-based file database developed by Apache.

    If you check the directory from which the hive command was run in the introductory case above, you will find a metastore_db directory generated there; this is where derby saves the metadata. The derby database is only suitable for testing; it has many restrictions in real use, the most obvious being that it cannot support concurrent access.

    Testing shows that when HIVE is used from the same directory, multiple HIVE sessions cannot be opened at the same time. HIVE can be opened from different directories simultaneously, but each generates its own metastore_db directory, so the sessions cannot see each other's data. Therefore, in a real production environment, the default derby database is not used to save HIVE metadata.

    HIVE currently supports both derby and mysql for storing metadata.

1. Change the metadata database

1. Install MySQL

    If the MySQL database is already installed and works normally, the installation steps can be skipped. If problems arise, refer to the MySQL related issues section below.

1> Upload the installation package

    Upload the rpm package to your own management directory (or another directory).

    There are many editions of MySQL; the client/server (C/S) packages below are used here.

    MySQL-server-5.6.29-1.linux_glibc2.5.x86_64.rpm

    MySQL-client-5.6.29-1.linux_glibc2.5.x86_64.rpm

2> Check the installation

    Check whether mysql has already been installed, including any version that comes with the system. The command is as follows:

rpm -qa | grep -i mysql

    If installed, execute the following code to delete the previously installed mysql:

rpm -ev --nodeps mysql-libs-5.1.71-1.el6.x86_64

3> Add users and user groups

    Add user group mysql:

groupadd mysql

    Add the user mysql and join the mysql user group:

useradd -r -g mysql mysql

4> Install MySQL

    Install the server:

rpm -ivh MySQL-server-5.6.29-1.linux_glibc2.5.x86_64.rpm

    Install client:

rpm -ivh MySQL-client-5.6.29-1.linux_glibc2.5.x86_64.rpm

    After installation, mysql 5.6 places its files in the following directories:

    Directory              Contents
    /usr/bin               Client programs and scripts
    /usr/sbin              The mysqld server
    /var/lib/mysql         Log files, databases
    /usr/share/info        MySQL manual in Info format
    /usr/share/man         Unix manual pages
    /usr/include/mysql     Include (header) files
    /usr/lib/mysql         Libraries
    /usr/share/mysql       Miscellaneous support files, including error messages, character set files, sample configuration files, and SQL for database installation
    /usr/share/sql-bench   Benchmarks

5> Modify the configuration file

    Modify my.cnf; by default it is located at /usr/my.cnf.

vim /usr/my.cnf

    Replace the existing [mysql] section of the configuration file with the following content, which mainly configures MySQL's character encoding:

[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
character_set_server=utf8

6> Start at boot

    Register mysqld as a system service so that it starts automatically at boot:

cp /usr/share/mysql/mysql.server /etc/init.d/mysqld
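
    On the SysV-init distribution implied by these rpm packages, enabling start at boot typically also requires chkconfig; whether your system uses it is an assumption to verify:

chkconfig mysqld on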

7> Start MySQL

    The command to start mysqld is as follows:

service mysqld start

8> Change password

    First obtain the random password generated for the root user during mysql installation:

vim /root/.mysql_secret

    You can also use the cat command to view:

cat /root/.mysql_secret

    This password can only be used to change the password.

    You must change the password of the root user before you can use mysql; otherwise you can only connect, and cannot perform any operations:

mysqladmin -u root -p password root

9> Test

    To connect to mysql, the command is as follows:

mysql -u root -p

    To view the installation and running path of mysql, the command is as follows:

ps -ef|grep mysql

10> MySQL related issues

    If a permissions problem occurs, grant privileges in mysql (execute this on the machine where mysql is installed):

mysql -u root -p

    Execute the following statement to authorize:

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;

    *.*: all tables under all libraries.

    %: any IP address or host may connect. If the % wildcard does not take effect, configure a specific host name instead, as shown below.
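
    For example, to grant to a specific host instead, where the host name hadoop matches the one used in the JDBC URL configured later:

GRANT ALL PRIVILEGES ON *.* TO 'root'@'hadoop' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;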

2. Configure HIVE

1> Delete old metadata

    To configure hive to use mysql for metadata, first delete the metadata created by the default installation.

Directory structure in hdfs

    Delete /user/hive in hdfs:

hadoop fs -rmr /user/hive

    It can also be deleted through the HDFS view in Eclipse.

derby's files

    In any directory from which HIVE was previously run, delete the metastore_db directory where derby stored the metadata.
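
    Assuming hive was previously launched from the hive/bin directory, as in the startup step above, the cleanup looks like this:

cd $HIVE_HOME/bin
rm -rf metastore_db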

2> Configuration file

    Copy hive/conf/hive-default.xml.template to hive-site.xml.

cp hive-default.xml.template hive-site.xml

    The file must be renamed exactly as shown above, otherwise the configuration will not take effect.

    Edit the hive-site.xml file:

vim hive-site.xml

    Inside the <configuration> tag, delete all of the original configuration information and fill in the following, which is essentially JDBC connection information:

<!--database to use-->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<!--jdbc driver-->
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<!--database username-->
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>
<!--database password-->
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
  <description>password to use against metastore database</description>
</property>

3. Create the metadata database

    Because HIVE only supports the latin1 character encoding in MySQL, the metadata database needs to be created manually.

    Note that this library must use latin1, otherwise strange problems will occur, so manual creation is recommended. Also, do not perform any hive operations before the library is created; otherwise the automatically created metadata tables will use mysql's default character set, and errors will still occur.

    Another approach is to change mysql's configuration so that its default encoding is latin1; the metadata database hive creates automatically will then be latin1 as well. However, this change affects the entire mysql installation, which is undesirable if mysql hosts other libraries.

    Enter the MySQL database and execute the following statement to create the database:

create database hive character set latin1;

4. Copy the jar package

    Copy the mysql JDBC connector jar package into the $HIVE_HOME/lib directory.
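
    For example, assuming a 5.x connector jar (the exact file name and version are assumptions):

cp mysql-connector-java-5.1.38-bin.jar $HIVE_HOME/lib/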

5. Start and test

    Enter the hive command line again and try creating libraries and tables; everything now works.

    Testing also shows that opening multiple connections is no longer a problem.

    Connect to mysql and you will find a new hive library; it stores hive's metadata.

    HIVE creates 29 built-in tables to record its own metadata.

    The important tables are as follows:

    DBS: The metadata information of the database.

    TBLS: Table information.

    COLUMNS_V2: Field information in the table.

    SDS: The table corresponds to the hdfs directory.
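
    A quick way to peek at this metadata from the mysql side; the table and column names follow the Hive 1.x metastore schema:

use hive;
show tables;
select * from DBS;
select TBL_NAME from TBLS;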
