Common Hive Operations with HiveQL

I. Introduction to Hive

Hive is a data warehouse application built by Facebook on top of a Hadoop cluster. It maps structured data files to database tables, provides complete SQL query functionality, and translates SQL statements into MapReduce jobs for execution.

Hive provides an effective, reasonable, and intuitive way to organize and use data. Even for experienced Java engineers, mapping these common data computations onto the underlying MapReduce Java API can be daunting. Hive does this work for its users, so they can focus on the query itself. Hive converts most queries into MapReduce jobs. It is best suited for data warehouse applications: analyzing the relatively static data kept by such applications, where fast responses are not required and the data itself does not change frequently.

Hive is not a complete database: the design constraints and limitations of Hadoop and HDFS carry over to it. The biggest limitation is that Hive does not support record-level updates, inserts, or deletes; users can only generate new tables through queries, or write query results to files. Also, because Hadoop is a batch-oriented system and starting a MapReduce job carries significant overhead, Hive queries have relatively high latency. Hive does not support transactions. Therefore, Hive does not support online transaction processing (OLTP); it is closer to an online analytical processing (OLAP) tool, although it does not yet fully satisfy the "online" part.

Hive provides a set of tools for extract-transform-load (ETL) work, ETL here being a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive is therefore best suited for data warehouse applications: it can maintain huge amounts of data, support data mining over them, and then form insights and reports.

Because most real-world data warehouse applications are built on SQL-based relational databases, Hive lowers the barrier to porting those applications to Hadoop. If you know SQL, learning Hive is easy, since Hive defines a simple SQL-like query language, HiveQL. It is worth noting that, compared with the SQL dialects of SQL Server and Oracle, HiveQL is closer to the SQL provided by MySQL. Moreover, compared with other languages and tools for Hadoop, Hive makes it easier for developers to port SQL-based applications to Hadoop.

II. Common HiveQL Operations

(i) Basic data types in Hive

First, let us briefly look at HiveQL's basic data types.

Hive supports both basic and complex data types. The basic data types include the numeric types (INT, FLOAT, DOUBLE), BOOLEAN, and STRING. There are three complex types: ARRAY, MAP, and STRUCT.
1. Basic data types
TINYINT: 1 byte
SMALLINT: 2 bytes
INT: 4 bytes
BIGINT: 8 bytes
BOOLEAN: TRUE / FALSE
FLOAT: 4-byte single-precision floating-point number
DOUBLE: 8-byte double-precision floating-point number
STRING: character string


2. Complex data types
ARRAY: an ordered collection of fields
MAP: an unordered collection of key-value pairs
STRUCT: a set of named fields
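
As a quick sketch of how these complex types are declared and accessed (the table and column names here are illustrative, not taken from the examples below):

$ create table example( # hypothetical table using all three complex types
      subordinates array<string>,
      deductions map<string,float>,
      address struct<street:string, city:string>);

$ select subordinates[0], deductions['tax'], address.city from example; # arrays are indexed by position, maps by key, struct fields by dot notation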

(ii) Common HiveQL commands

Common HiveQL commands fall into two groups: data definition and data manipulation. Next, we describe how these commands are used (to learn more, please refer to the book "Hive Programming Guide").

1. Data Definition

These commands are mainly used to create, modify, and delete databases, tables, views, functions, and indexes.

(1) Creating, modifying, and deleting databases

$ create database if not exists hive; # create a database

$ show databases; # list the databases in Hive

$ show databases like 'h.*'; # list the databases whose names begin with h

$ describe database hive; # show location information for the hive database

$ alter database hive set dbproperties('creator'='me'); # set a key-value property on the hive database

$ use hive; # switch to the hive database

$ drop database if exists hive; # delete the database (only if it contains no tables)

$ drop database if exists hive cascade; # delete the database together with its tables

Note: apart from dbproperties, the metadata of a database cannot be changed after creation, including the database name and the directory where the database resides; nor is there any way to delete or reset a database property.

(2) Creating, modifying, and deleting tables

$ create table if not exists hive.usr( # create an internal (managed) table
      name string comment 'username',
      pwd string comment 'password',
      address struct<street:string, city:string, state:string, zip:int> comment 'home address',
      identify map<int,tinyint> comment 'number, sex')
      comment 'description of the table'
      tblproperties('creator'='me', 'time'='2016.1.1');

$ create external table if not exists usr2( # create an external table
      name string,
      pwd string,
      address struct<street:string, city:string, state:string, zip:int>,
      identify map<int,tinyint>)
      row format delimited fields terminated by ','
      location '/usr/local/hive/warehouse/hive.db/usr';

$ create table if not exists usr3( # create a partitioned table
      name string,
      pwd string,
      address struct<street:string, city:string, state:string, zip:int>,
      identify map<int,tinyint>)
      partitioned by(city string, state string);

$ create table if not exists hive.usr1 like hive.usr; # copy the schema of the usr table

$ show tables in hive; # list all tables in the hive database

$ show tables 'u.*'; # list the tables in hive whose names begin with u

$ describe hive.usr; # show information about the usr table

$ alter table usr rename to custom; # rename a table

$ alter table usr2 add if not exists # add a partition to a table
      partition(city="beijing", state="China")
      location '/usr/local/hive/warehouse/usr2/China/beijing';

$ alter table usr2 partition(city="beijing", state="China") # change the path of a partition
      set location '/usr/local/hive/warehouse/usr2/CH/beijing';

$ alter table usr2 drop if exists partition(city="beijing", state="China"); # delete a partition

$ alter table usr change column pwd password string after address; # modify column information (rename pwd to password and move it after the address column)

$ alter table usr add columns(hobby string); # add a column

$ alter table usr replace columns(uname string); # replace the columns (deleting all columns not listed)

$ alter table usr set tblproperties('creator'='liming'); # modify table properties

$ alter table usr2 partition(city="beijing", state="China") # modify storage properties
      set fileformat sequencefile;

$ use hive; # switch to the hive database

$ drop table if exists usr1; # delete a table

$ drop database if exists hive cascade; # delete the database and its tables

(3) Creating, modifying, and deleting views and indexes

The syntax is shown below; readers can try it out for themselves.

$ create view view_name as ...; # create a view

$ alter view view_name set tblproperties(...); # modify a view

Because views are read-only, only the tblproperties metadata of a view can be changed.

$ drop view if exists view_name; # delete a view

$ create index index_name on table table_name(partition_name/column_name) # create an index
      as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild ...;

Here, 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' is an index handler, that is, a Java class implementing the index interface; other index implementations for Hive can be plugged in the same way.

$ alter index index_name on table table_name partition(...) rebuild; # rebuild an index. If the index was created with deferred rebuild, it starts out empty; this statement builds it for the first time or rebuilds it at any later point.

$ show formatted index on table_name; # show the indexes on a table

$ drop index if exists index_name on table table_name; # delete an index

(4) User-defined functions

Before writing a new user-defined function (UDF), first have a look at the functions Hive already ships with. The command "show functions;" lists all function names in Hive:

To see how a specific function is used, run: describe function function_name;

To write your own UDF, the class must extend the UDF class and implement the evaluate() function, or extend the GenericUDF class and implement the initialize(), evaluate(), and getDisplayString() functions. There are further methods that can be implemented; interested readers can study them on their own.

In addition, to use a UDF in Hive, you must compile the Java code you have written, package the compiled class files (.class) into a JAR file, add that JAR file to the class path of the Hive session, and finally register the Java class as a function with a create function statement.

$ add jar <absolute path to the jar file>; # add the JAR to the class path

$ create temporary function function_name as 'full.class.name'; # create the function

$ drop temporary function if exists function_name; # delete the function
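
Putting these steps together, a hypothetical session might look like the following (the JAR path, class name, and function name are all illustrative):

$ add jar /usr/local/hive/lib/myudfs.jar; # add the compiled JAR to the class path

$ create temporary function my_lower as 'com.example.hive.udf.Lower'; # register the Java class as a function

$ select my_lower(name) from usr; # call the UDF like a built-in function

$ drop temporary function if exists my_lower;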

2. Data Manipulation

These commands mainly load data into tables (or export data from tables) and run the corresponding queries; users familiar with SQL should find them unsurprising.

(1) Loading data into a table

Here we use two simple tables, each with only two attributes, as an example. First, create the tables stu and course: stu has the two attributes id and name, and course has the two attributes cid and sid.

$ create table if not exists hive.stu(id int, name string)
      row format delimited fields terminated by '\t';

$ create table if not exists hive.course(cid int, sid int)
      row format delimited fields terminated by '\t';

There are two ways to load data into a table: importing from a file, and inserting through a query.

a. Importing from a file

Suppose the records of the stu table are stored in the file stu.txt under the path /usr/local/hadoop/examples/stu.txt, with the following contents:

stu.txt:

1 xiapi

2 xiaoxue

3 qingqing

Load the data from the file into the stu table as follows:

$ load data local inpath '/usr/local/hadoop/examples/stu.txt' overwrite into table stu;

If stu.txt is stored on HDFS, the local keyword is not needed.
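
For example, a sketch of loading the same data from HDFS (the HDFS path is illustrative):

$ load data inpath '/user/hadoop/examples/stu.txt' overwrite into table stu;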

b. Inserting through a query

Use the following command to create the table stu1 with the same attributes as stu and insert the data queried from the stu table into stu1:

$ create table stu1 as select id,name from stu;

The command above creates a new table and inserts the data into it directly. If the table already exists, run the following command to insert the data:

$ insert overwrite table stu1 select id,name from stu where (condition);

Here the overwrite keyword replaces the existing data in the table (or partition); replacing it with the into keyword appends the data to the existing contents instead.
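
A small sketch of the difference, using the stu1 table above:

$ insert overwrite table stu1 select id,name from stu; # replaces the existing contents of stu1

$ insert into table stu1 select id,name from stu; # appends to the existing contents of stu1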

(2) Exporting data from a table

a. Copy the file or folder directly

Command is as follows:

$ hadoop fs -cp source_path target_path;

b. Write to a temporary file

Command is as follows:

$ insert overwrite local directory '/usr/local/hadoop/tmp/stu' select id,name from stu;
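
Similarly, dropping the local keyword writes the query result to an HDFS directory instead (the path here is illustrative):

$ insert overwrite directory '/user/hadoop/tmp/stu' select id,name from stu;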

(3) Query operations

Hive queries are essentially the same as SQL queries, so we do not repeat the basics here. The main constructs are select ... from ... where ... statements, combined with keywords such as group by, having, like, and rlike. Below we briefly introduce the case ... when ... then ... construct, join operations, and subqueries, whose behavior in Hive differs slightly from standard SQL.
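
As a small sketch of these keywords over the stu table defined earlier (the conditions are illustrative):

$ select name, count(*) from stu where name like 'x%' group by name having count(*)>=1; # filter rows with like, group them, then filter the groups with having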

The case ... when ... then ... construct is similar to an if conditional statement; it is used to transform a single column of the query result, as in the following statement:

$ select id, name,
      case
      when id=1 then 'first'
      when id=2 then 'second'
      else 'third'
      end from stu;

The results are as follows:

1 xiapi first
2 xiaoxue second
3 qingqing third

A join combines rows from two tables that match on a common data item. HiveQL supports five kinds of join: inner joins, left outer joins, right outer joins, full outer joins, and semi joins.

a. Inner join (equijoin)
An inner join matches rows of the two tables by comparing the values of a column they have in common.

First, insert the following data into the course table (do this yourself):

1 3

2 1

3 1

Next, query all pairs of rows from stu and course whose student numbers match, using the following command:

$ select stu.*, course.* from stu join course on(stu.id=course.sid);

Execution results are as follows:

1 xiapi 2 1
1 xiapi 3 1
3 qingqing 1 3

b. Left outer join
A left outer join returns all rows of the left table named in the left outer join clause, not only the rows matched on the join column. If a row of the left table has no matching row in the right table, the right table's columns in the corresponding result row are null. The command is as follows:

$ select stu.*, course.* from stu left outer join course on(stu.id=course.sid);

Execution results are as follows:

1 xiapi 2 1
1 xiapi 3 1
2 xiaoxue NULL NULL
3 qingqing 1 3

c. Right outer join
A right outer join is the reverse of a left outer join: it returns all rows of the right table. If a row of the right table has no matching row in the left table, null values are returned for the left table's columns. The command is as follows:

$ select stu.*, course.* from stu right outer join course on(stu.id=course.sid);

Execution results are as follows:

1 xiapi 2 1
1 xiapi 3 1
3 qingqing 1 3

d. Full outer join
A full outer join returns all rows of both the left table and the right table. Whenever a row has no matching row in the other table, the columns selected from the other table contain null values; where there is a match, the result row contains the data values from both base tables. The command is as follows:

$ select stu.*, course.* from stu full outer join course on(stu.id=course.sid);

Execution results are as follows:

1 xiapi 2 1
1 xiapi 3 1
2 xiaoxue NULL NULL
3 qingqing 1 3

e. Semi join
The semi join is specific to Hive. Hive does not support the IN operation, but it offers an alternative: left semi join, known as a semi join. Note that the columns of the right-hand table may not appear in the select list; they may only appear in the on clause. The command is as follows:

$ select stu.* from stu left semi join course on(stu.id=course.sid);

Execution results are as follows:

1 xiapi
3 qingqing
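
For comparison, in standard SQL the same query would be written with IN, which Hive does not support:

$ select stu.* from stu where stu.id in (select sid from course); # standard SQL only, not HiveQL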


(4) Subqueries

Standard SQL supports nesting a subquery (a nested select statement) in many places; HiveQL's support for subqueries is limited: a subquery may only appear in the from clause of a query.
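
A minimal sketch of a from-clause subquery over the stu table (the condition is illustrative):

$ select t.name from (select id,name from stu where id>1) t; # the subquery must be given an alias (here t)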

Note: when defining or operating on tables, do not forget to specify the database they belong to.

III. A simple Hive programming exercise

Here we use the word-count algorithm as an example to show how Hive is used in a concrete application. Word count is one of the algorithms that best embodies the MapReduce way of thinking; comparing its Hive implementation with its MapReduce implementation illustrates the advantages of using Hive.

The MapReduce implementation of word count ships with the Hadoop distribution: it can be found as the wordcount class in the package hadoop-mapreduce-examples-2.7.1.jar under $HADOOP_HOME/share/hadoop/mapreduce, and consists of 63 lines of Java code. Let us first briefly explain how to use this MapReduce wordcount class to count word occurrences, following these steps:

(1) Create the input directory (the output directory will be generated automatically); here input is the input directory and output is the output directory. The commands are as follows:

$ cd /usr/local/hadoop

$ mkdir input

(2) Then create two test files, file1.txt and file2.txt, in the input folder with the following commands:

$ cd /usr/local/hadoop/input

$ echo "hello world" > file1.txt

$ echo "hello hadoop" > file2.txt

(3) Run the following hadoop command:

$ cd ..

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount input output

(4) The results can be viewed in the output folder; they are as follows:

hadoop 1
hello 2
world 1

Now let us implement the same word-count functionality in HiveQL. This time only the following 7 lines of code are needed, and there is no need to compile and package a JAR before running. The HiveQL implementation is as follows:

$ create table docs(line string);

$ load data inpath 'input' overwrite into table docs;

$ create table word_count as
      select word, count(1) as count from
      (select explode(split(line,' ')) as word from docs) w
      group by word
      order by word;

After running these commands, view the result with a select statement; it is as follows:

hadoop 1
hello 2
world 1

As this comparison shows, the biggest advantage of the Hive implementation is that non-programmers do not have to learn to write Java MapReduce code; they only need to learn HiveQL, which is very easy for anyone with a SQL background.

Origin blog.csdn.net/weixin_44961794/article/details/91355350