Installation of Hive

1. Hive
1.1 Hive is a data warehouse in the Hadoop ecosystem. It can manage the data in Hadoop and can query that data.
  Essentially, Hive is a SQL parsing engine: it converts SQL queries into MapReduce jobs to run.
     Hive has a set of mapping tools that convert SQL into MapReduce jobs and map tables and fields in SQL to files (folders) in HDFS and columns within those files.
     This mapping information is kept in the metastore, which is generally stored in Derby or MySQL.
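For example, creating a table creates a matching directory under the warehouse in HDFS (a minimal sketch; the table name t_demo is hypothetical):
  CREATE TABLE t_demo(id int);
  -- the table's data lives under the warehouse directory in HDFS,
  -- e.g. /user/hive/warehouse/t_demo/, and each field in the files
  -- there is mapped back to a column at query time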
1.2 The default location of Hive data in HDFS is /user/hive/warehouse, which is determined by the property hive.metastore.warehouse.dir in the configuration file hive-site.xml (for brevity it can be changed to /hive, which is what the paths below assume).
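For example, shortening it to /hive means hive-site.xml carries a property like this (a sketch of the setting just described):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive</value>
</property>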

1.3 System architecture of Hive

There are three main user interfaces: CLI, JDBC/ODBC, and WebGUI

  CLI is the shell command line

  JDBC/ODBC is Hive's Java interface, used in much the same way as a traditional database's JDBC driver

  WebGUI accesses Hive through a browser

Hive stores metadata in a database (the metastore); currently only MySQL and Derby are supported. The metadata in Hive includes the name of the table, the columns and partitions of the table and their attributes, the attributes of the table itself (whether it is an external table, etc.), the directory where the table's data is located, and so on.
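You can view this metadata from the Hive command line (a sketch; t1 is the table created later in this article):
  DESCRIBE FORMATTED t1;
  -- prints the columns, the owner, the HDFS location of the data,
  -- and the table type (MANAGED_TABLE or EXTERNAL_TABLE)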

The interpreter, compiler, and optimizer take an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The resulting query plan is stored in HDFS and executed by subsequent MapReduce calls.
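The generated plan can be inspected with EXPLAIN (a sketch; t1 is created later in this article):
  EXPLAIN SELECT count(*) FROM t1;
  -- prints the plan as a series of stages, including the
  -- MapReduce stage that runs the aggregation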

Hive data is stored in HDFS, and most queries are completed by MapReduce (simple fetches of a whole table, such as select * from table, do not generate MapReduce tasks).
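A quick way to see this difference from the Hive command line (a sketch, using the t1 table created below):
  select * from t1;        -- a plain fetch; no MapReduce job is launched
  select count(*) from t1; -- launches a MapReduce job for the aggregation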

 

2. Hive installation (on hadoop0)
(1) Unzip, rename, and set the environment variables: vi /etc/profile
  export HIVE_HOME=/usr/local/hive
  export PATH=...:$HIVE_HOME/bin:...
  After saving, execute source /etc/profile
(2) In the directory $HIVE_HOME/conf/, execute the command mv hive-default.xml.template hive-site.xml to rename it;
  in the same directory, execute the command mv hive-env.sh.template hive-env.sh to rename it
(3) Modify the hadoop configuration file hadoop-env.sh; open it and change the line to:
  export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin
(4) Under the directory $HIVE_HOME/bin, modify the file hive-config.sh and add the following content:
  export JAVA_HOME=/usr/local/jdk
  export HIVE_HOME=/usr/local/hive
  export HADOOP_HOME=/usr/local/hadoop

Execute the hive command on hadoop0 to enter the Hive command-line mode. In this mode, basic database operations are the same as at the MySQL command line (cmd --> mysql -uroot -proot)!
For example: show databases;
  use default;
  show tables;
  create table t1(id string);
  show tables;
  select * from t1;

 

By entering hadoop0:50070 in the browser, you can view the Hive-related files in HDFS.

 

3. Install MySQL
(1) Delete the MySQL-related libraries already installed on Linux: rpm -e mysql-libs-xxx --nodeps
  Execute the command rpm -qa | grep mysql to check whether the removal is complete
(2) Execute the command rpm -i mysql-server-******** to install the MySQL server
(3) Start the MySQL server: execute the command mysqld_safe &
(4) Execute the command rpm -i mysql-client-******** to install the MySQL client
(5) Execute the command mysql_secure_installation to set the root user password (answer Y, enter admin as the password, answer n to the remaining questions, then Y to reload), then log in: mysql -uroot -padmin

 

4. Use MySQL as the metastore of Hive

The metastore is the centralized storage place for Hive metadata. By default it uses the embedded Derby database as the storage engine. The drawback of the Derby engine is that only one session can be open at a time. Using MySQL as an external storage engine instead allows multiple users to access the metastore at the same time.


(1) Copy the MySQL JDBC driver into Hive's lib directory: cp mysql-connector-java-5.1.10.jar /usr/local/hive/lib/
(2) Modify the hive-site.xml file as follows:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop0:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>admin</value>
</property>

To connect to the MySQL database remotely with a tool, you must first grant access at the MySQL command line: grant all on hive.* to 'root'@'%' identified by 'admin';

Then refresh: flush privileges;

 

5. Internal table
CREATE TABLE t1(id int);
First create an id file: cd /root/, then vi id and enter some data:

  1
  2
  3
  4
  5
LOAD DATA LOCAL INPATH '/root/id' INTO TABLE t1;
Note: if LOCAL is omitted above, the corresponding file is loaded from HDFS instead!
select * from t1; ---- view the inserted data; you can also check hadoop0:50070 in the browser

 

Note: To load data you can use LOAD DATA ..., or you can directly run hadoop fs -put /root/id /hive/t1/id2; the effect is the same as using LOAD DATA!


CREATE TABLE t2(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; ----- columns are separated by tabs

cp id stu
vi stu and enter the data (the separator between the two fields must be a real tab character):

  1 zhangsan
  2 lisi
  3 wangwu
  4 zhaoliu
  5 qian
Upload the data to the table: hadoop fs -put stu /hive/t2
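To verify the load (a sketch; this assumes the stu file really uses tab separators, otherwise the columns come back NULL):
  select * from t2;
  -- each line of stu is parsed into the id and name columns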

 

6. Partition table
CREATE TABLE t3(id int) PARTITIONED BY (day int); 
LOAD DATA LOCAL INPATH '/root/id' INTO TABLE t3 PARTITION (day=22); 
LOAD DATA LOCAL INPATH '/root/id' INTO TABLE t3 PARTITION (day=23);
LOAD DATA LOCAL INPATH '/root/id' INTO TABLE t3 PARTITION (day=24);
Viewing in the browser, the /hive/t3/ directory will contain three partition subdirectories: day=22, day=23, day=24


Query by partition: select * from t3 where day=22;
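You can also list a table's partitions directly (a sketch):
  SHOW PARTITIONS t3;
  -- prints day=22, day=23, day=24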

 

7. Bucket table
create table t4(id int) clustered by(id) into 4 buckets;
set hive.enforce.bucketing = true; ---- bucketing is not enforced by default; enable it before inserting
insert into table t4 select id from t3;
In the browser you can observe four bucket files under the /hive/t4 directory, all of which contain data. (When data is loaded into a bucket table, the bucketing column's value is hashed and then taken modulo the number of buckets, and the row is written to the corresponding file.)
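As an illustration of the hash-and-modulo placement (a sketch; for an int column the hash is the value itself, so id % 4 picks the bucket file):
  -- id=4 -> 4 % 4 = 0 -> first bucket file
  -- id=5 -> 5 % 4 = 1 -> second bucket file
  select * from t4 tablesample(bucket 1 out of 4 on id);
  -- reads only the first of the four bucket files, one use of bucketing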

 

8. External table (unlike the tables above, which are all managed tables, MANAGED_TABLE, an external table's data is not owned by Hive: dropping the table removes only the metadata, not the files)
create external table t5(id int) location '/external'; 
hadoop fs -put /root/id /external/id
select * from t5;
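
A quick way to see how external tables differ from managed ones (a sketch):
  drop table t5;
  -- t5 disappears from Hive, but hadoop fs -ls /external shows the
  -- id file is still there; dropping a managed table such as t1
  -- would delete its HDFS data directory as well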

 

9. Use the Java client to view the data of a table in Hive

Start the Hive remote service: # hive --service hiveserver >/dev/null 2>/dev/null &

Java client code (first add the jar packages under Hive's lib directory to the project):


package hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class App {

    public static void main(String[] args) throws Exception {
        // Register the HiveServer1 JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // Connect to the Hive remote service started above (port 10000)
        Connection con = DriverManager.getConnection("jdbc:hive://hadoop0:10000/default", "", "");
        Statement stmt = con.createStatement();
        String querySQL = "SELECT * FROM default.t1";

        ResultSet res = stmt.executeQuery(querySQL);

        // Print the id column of every row in t1
        while (res.next()) {
            System.out.println(res.getInt(1));
        }

        res.close();
        stmt.close();
        con.close();
    }
}


 

 
