Hive (data warehouse tool)

Hive is a Hadoop-based data warehousing tool. It can map structured data files to database tables and provides simple SQL query functionality, translating SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it very well suited to statistical analysis over a data warehouse.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users familiar with SQL query the data. The language also allows developers familiar with MapReduce to plug in custom mappers and reducers to handle complex analysis that the built-in mappers and reducers cannot complete.
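As a minimal sketch of that idea (the table and columns here are hypothetical), a single SQL-like statement stands in for a hand-written MapReduce job:

	-- Hypothetical table mapped onto files in HDFS; Hive compiles this
	-- GROUP BY into a MapReduce job behind the scenes
	SELECT dt, COUNT(*) AS pv
	FROM page_views
	GROUP BY dt;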

Installing Hive

1. Upload the tar package

2. Extract it
	tar -zxvf hive-1.2.1.tar.gz
3. Install the MySQL database
   Online installation via yum is recommended (run the install script)

4. Configure Hive
	(a) Configure the HIVE_HOME environment variable
		vi conf/hive-env.sh
		Set HADOOP_HOME in it (e.g. HADOOP_HOME=/opt/software/hadoop-2.6.4, the Hadoop path used in step 6 below)

	(b) Configure the metastore database information
		vi hive-site.xml
		Add the following content:
		<configuration>
		<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
		<description>JDBC connect string for a JDBC metastore</description>
		</property>

		<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
		<description>Driver class name for a JDBC metastore</description>
		</property>

		<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
		<description>username to use against metastore database</description>
		</property>

		<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>root</value>
		<description>password to use against metastore database</description>
		</property>
		</configuration>
	
5. After installing Hive and MySQL, copy the MySQL JDBC driver jar into the $HIVE_HOME/lib directory
	If a permission problem occurs, grant privileges in MySQL (run this on the machine where MySQL is installed):
	mysql -uroot -p
	
	Set the password:
	set password=password('root');
	
	# (Run the statement below.  *.* : all tables in all databases;  % : any IP address or host may connect)
	GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
	
	FLUSH PRIVILEGES;
	
	Then verify the login with mysql -uroot -proot
	
6. JLine jar version mismatch: copy jline-2.12.jar from Hive's lib directory to replace the older one shipped with Hadoop
	cp hive/lib/jline-2.12.jar /opt/software/hadoop-2.6.4/share/hadoop/yarn/lib/


Start Hive:
	hive
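Once the shell is up, a quick sanity check might look like this (a sketch; the database name is hypothetical):

	show databases;                          -- a fresh install lists only 'default'
	create database if not exists test_db;   -- hypothetical database name
	use test_db;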


Using Hive

Ways to use Hive:
	1. The Hive interactive shell:      bin/hive
	
	2. The Hive JDBC service (analogous to connecting to MySQL via Java JDBC)
	
	3. Start Hive as a server that provides service to external clients
		bin/hiveserver2
		nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &
		
		After it starts successfully, connect from another node with Beeline:
		bin/beeline -u jdbc:hive2://mini1:10000 -n root
		
		or
		bin/beeline
		!connect jdbc:hive2://mini1:10000
	
	4. The hive command
		hive -e 'sql'
		bin/hive -e 'select * from t_test'


Creating tables:

Hive internal (managed) tables
	CREATE TABLE [IF NOT EXISTS] table_name
	When the table is dropped, both the metadata and the data are deleted
Hive external tables
	CREATE EXTERNAL TABLE [IF NOT EXISTS] table_name LOCATION hdfs_path
	Dropping an external table deletes only the metadata in the metastore; the table data in HDFS is kept
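A small sketch of the difference (the table names, columns, and HDFS path are hypothetical):

	-- Managed (internal) table: DROP TABLE removes both metadata and data
	CREATE TABLE IF NOT EXISTS t_log_managed (id INT, msg STRING);

	-- External table: DROP TABLE removes only the metastore entry;
	-- the files under LOCATION remain in HDFS
	CREATE EXTERNAL TABLE IF NOT EXISTS t_log_external (id INT, msg STRING)
	LOCATION '/data/logs';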

Viewing a table description

DESCRIBE [EXTENDED|FORMATTED] table_name
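For example, against the hypothetical external table above:

	DESCRIBE t_log_external;            -- columns and types only
	DESCRIBE FORMATTED t_log_external;  -- also shows table type, location, owner, etc.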

Hive table creation

Create Table Like (copies only the table schema, no data):
CREATE TABLE empty_key_value_store LIKE key_value_store;

Create Table As Select (CTAS) (creates the table and populates it with the query results):
CREATE TABLE new_key_value_store
      AS
    SELECT columnA, columnB FROM key_value_store;

Hive partitions
	The partition columns must be specified when the table is defined.
	a. Single-partition table statement:
	create table day_table (id int, content string) partitioned by (dt string);
	A single-partition table, partitioned by day; the table structure contains three columns: id, content, and dt.
	Data is separated into directories by dt.
b. Double-partition table statement:
	create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
	A two-level partition table, partitioned by day and hour; the dt and hour columns are added to the table structure.
	Data is organized first into a dt directory, then into an hour subdirectory within it.
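To make the directory layout concrete, a sketch using day_table (the local file path and date are made up):

	-- Load one file into one partition; Hive stores it under
	-- .../day_table/dt=2019-06-18/ in HDFS
	LOAD DATA LOCAL INPATH '/root/data/2019-06-18.txt'
	INTO TABLE day_table PARTITION (dt='2019-06-18');

	-- List the partitions that now exist
	SHOW PARTITIONS day_table;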

Advantages

1. Scalability: Hive clusters can be scaled freely without restarting the service. Horizontal scaling: grow the cluster by adding machines that share the load. Vertical scaling: upgrade a single server, e.g. from an i7-6700K CPU with 4 cores / 8 threads to 8 cores / 16 threads, or from 64 GB to 128 GB of memory.

2. Extensibility: Hive supports user-defined functions, so users can implement functions of their own according to their needs.

3. Good fault tolerance: even if a node fails, the SQL statement can still complete its execution.

Disadvantages

1. Hive does not support record-level add/update/delete operations; however, users can create new tables from query results or import query results into files (the current hive-2.3.2 release does support record-level inserts).

2. Hive query latency is severe, because starting a MapReduce job consumes a long time; Hive therefore cannot be used in interactive query systems.

3. Hive does not support transactions (since there are no record-level update/delete operations, it is mainly used for OLAP (online analytical processing) rather than OLTP (online transaction processing), the two broad classes of data processing).

[Figure: Hive internal architecture]

As the figure shows, Hive's internal structure consists of four parts:

1. User interfaces: shell/CLI, JDBC/ODBC, and a Web UI.
   CLI (Command Line Interface): a shell terminal used to interact with Hive by entering Hive commands interactively; this is the most commonly used interface (for learning, debugging, and production).

JDBC/ODBC: a JDBC-based client provided by Hive; users (developers, operations staff) connect to the HiveServer service through it.

Web UI: access Hive through a browser.

2. Cross-language service: the Thrift server gives users the ability to operate Hive from many different languages.
   Thrift is a scalable software framework developed by Facebook for cross-language service development; Hive integrates this service so that different programming languages can call Hive's interfaces.

3. Underlying driver: the Driver, the Compiler, the Optimizer, and the Executor.
   The Driver component takes an HQL query through lexical analysis, parsing, compilation, and optimization, and generates a logical execution plan. The generated logical execution plan is stored in HDFS and then executed by invoking MapReduce.

The Driver is Hive's core engine, and it consists of four parts:

(1) Interpreter: converts the HiveSQL statement into an abstract syntax tree (AST)

(2) Compiler: compiles the syntax tree into a logical execution plan

(3) Optimizer: optimizes the logical execution plan

(4) Executor: invokes the underlying execution framework to run the logical execution plan
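You can inspect the plan this pipeline produces with EXPLAIN (shown here against the day_table example from earlier):

	-- Prints the compiled stage plan instead of running the query
	EXPLAIN SELECT dt, COUNT(*) FROM day_table GROUP BY dt;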

4. Metadata storage system: an RDBMS, typically MySQL.
   Metadata, put simply, is the descriptive information about the data stored in Hive.

Hive metadata typically includes: table names, column and partition attributes, table type (internal or external), and the directory where the table's data lives.

By default the Metastore uses the embedded Derby database that ships with Hive. Its drawbacks are that it is unsuitable for multi-user operation and that the data storage directory is not fixed: the database follows Hive around, which makes it extremely inconvenient to manage.

Solution: store the metadata in a MySQL database that we create ourselves (local or remote).

Hive and MySQL interact through the MetaStore service.
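As an illustration, once the metastore lives in MySQL you can peek at it directly (the database name hive matches the JDBC URL in hive-site.xml above; the TBLS table belongs to the metastore schema, which varies slightly across Hive versions):

	-- Run in the MySQL client, not in Hive
	USE hive;
	SELECT TBL_NAME, TBL_TYPE FROM TBLS;  -- one row per Hive table (MANAGED_TABLE / EXTERNAL_TABLE)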
