Hadoop Big Data Development Fundamentals series, part 8: Understanding Hive

One. What is Hive?

In essence, Hive translates HQL/SQL into MapReduce jobs that run on Hadoop; it can be viewed as a SQL parsing engine.

Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like queries.

A Hive table is a directory of files on HDFS: each table corresponds to a directory named after the table, and if the table is partitioned, each partition value corresponds to a subdirectory.

Hive tutorial: see the Hive wiki.

Two. Hive architecture

1. User interfaces:

(1) CLI: starting the CLI also starts a copy of Hive.

(2) JDBC client: wraps Thrift; a Java application can connect to a Hive server running in another process at a specified host and port.

(3) ODBC client: the ODBC driver lets applications that support the ODBC protocol connect to Hive.

2. Thrift server: socket-based communication with cross-language support.

3. Parser (processes the statement to be executed):

(1) Compiler: parses and compiles the statement and produces a query plan.

(2) Optimizer: applies rule-based transformations such as column pruning and predicate pushdown.

(3) Executor: runs all the jobs in order.

4. Metastore (metadata database):

Hive data consists of two parts: the data files and the metadata. The metadata holds basic information about the Hive warehouse and is stored in a relational database such as MySQL or Derby. It includes database information, table names, table columns and partitions and their properties, table properties, and the directory where each table's data is located.

Three. How Hive works

1. The user connects to Hive through a user interface and submits Hive SQL;

2. Hive parses the query and produces a query plan;

3. Hive turns the query into MapReduce jobs;

4. Hive runs the MapReduce jobs on Hadoop.
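
As a rough illustration of step 3, the following Python sketch mimics how an aggregate query such as SELECT dept, COUNT(*) FROM emp GROUP BY dept could be evaluated as a map phase, a shuffle, and a reduce phase. The table emp, the column dept, the sample rows, and all function names are invented for this example; Hive's real planner and execution engine are far more involved.

```python
from collections import defaultdict

# Hypothetical rows of a table emp(name, dept); the data is invented for illustration.
ROWS = [("ann", "sales"), ("bob", "ops"), ("cat", "sales"), ("dan", "ops"), ("eve", "hr")]


def map_phase(rows):
    """Map: emit one (group key, 1) pair per input row."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Chain the three phases, as a single MapReduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

The point of the sketch is only that the GROUP BY key becomes the MapReduce key and the aggregate becomes the reduce function.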

Four. Advantages and disadvantages of Hive

Hive is positioned as a data warehouse and leans toward data analysis and computation.

1. Advantages:

(1) Suitable for processing large volumes of data.

(2) Makes full use of the cluster's CPU and storage resources for parallel computation.

(3) SQL-like language; MapReduce jobs are generated automatically.

(4) Scalability.

2. Disadvantages:

(1) HQL has limited expressive power.

(2) Hive is inefficient: the MapReduce jobs it generates automatically are usually not very smart; HQL tuning is difficult and coarse-grained; controllability is poor.

To address the inefficiency, Spark SQL emerged and effectively improved the efficiency of SQL analysis on Hadoop.

Five. When to use Hive

1. Storing and analyzing massive data sets

2. Data mining

3. Not suitable for complex algorithms and computations, and not suitable for real-time queries

Six. Using Hive

(A) Connecting to Hive

Connect using HiveServer2, Beeline, or the CLI.

(B) Hive data types

Category	Type	Description	Example
Primitive types	BOOLEAN	true/false	TRUE
	TINYINT	1-byte signed integer, -128 to 127	1Y
	SMALLINT	2-byte signed integer, -32768 to 32767	1S
	INT	4-byte signed integer	1
	BIGINT	8-byte signed integer	1L
	FLOAT	4-byte single-precision floating-point number	1.0
	DOUBLE	8-byte double-precision floating-point number	1.0
	DECIMAL	Arbitrary-precision signed decimal	1.0
	STRING	Variable-length string	"a", 'b'
	VARCHAR	Variable-length string	"a", 'b'
	CHAR	Fixed-length string	"a", 'b'
	BINARY	Byte array	Cannot be written as a literal
	TIMESTAMP	Timestamp, nanosecond precision	122327493795
	DATE	Date	'2016-03-29'
Complex types	ARRAY	An ordered collection of elements of the same type	array(1,2)
	MAP	Key-value pairs; the key must be a primitive type, the value can be any type	map('a',1,'b',2)
	STRUCT	A set of fields, possibly of different types	struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
	UNION	A value that is one of a limited set of types	create_union(1,'a',63)

(C) Hive tables and table operations

A table consists of the stored data plus its metadata.

Hive also has databases, which are created with CREATE DATABASE. The default database is named default.

1. Tables come in two kinds: managed tables and external tables

(1) Where the data is stored

Managed table: the data is stored in the warehouse directory.

External table: the data can be stored in any HDFS directory.

(2) Deleting data

Managed table: dropping the table removes both the metadata and the data.

External table: dropping the table removes only the metadata.

(3) Creating tables

Managed table: CREATE TABLE table_name (attr1 STRING);

External table: CREATE EXTERNAL TABLE table_name (attr1 STRING) LOCATION 'path';
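
The difference in drop behavior can be pictured with a toy model in Python. Nothing here talks to a real Hive installation: the class ToyHive and its two dictionaries are hypothetical stand-ins for the metastore and for HDFS, and the model assumes that a managed table always lives under the default warehouse directory /user/hive/warehouse.

```python
WAREHOUSE = "/user/hive/warehouse"  # Hive's default warehouse directory


class ToyHive:
    """Toy model only: dicts stand in for the metastore and for HDFS."""

    def __init__(self):
        self.metastore = {}  # table name -> {"location": ..., "external": ...}
        self.hdfs = {}       # directory path -> list of file names

    def create_table(self, name, external=False, location=None):
        if external and location is None:
            raise ValueError("this toy model requires a location for external tables")
        path = location if external else f"{WAREHOUSE}/{name}"
        self.hdfs.setdefault(path, [])  # files already in the directory are kept
        self.metastore[name] = {"location": path, "external": external}

    def drop_table(self, name):
        info = self.metastore.pop(name)  # the metadata is always removed
        if not info["external"]:
            # Managed table: the data directory is deleted as well.
            self.hdfs.pop(info["location"], None)


if __name__ == "__main__":
    hive = ToyHive()
    hive.hdfs["/data/logs"] = ["part-0.txt"]
    hive.create_table("logs", external=True, location="/data/logs")
    hive.create_table("users")
    hive.drop_table("logs")
    hive.drop_table("users")
    print(hive.hdfs)  # only the external table's directory is left
```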

Dividing the data sensibly can speed up queries. Table -> Partition -> Bucket.

2. Partitions (classification at the directory level)

(1) Partition column values are not actually stored in the data files; they are encoded as nested subdirectories under the table directory.

Example: [partition directory listing and sample data not preserved]

(2) A table can be partitioned along multiple dimensions (the example above partitions by ID).

(3) Partitions narrow the range a query has to scan, which improves efficiency.

(4) Partitions are defined with the PARTITIONED BY clause when the table is created.

(5) Use the LOAD statement to load data into a partition; the partition value must be specified explicitly.

(6) Use the SHOW PARTITIONS statement to list the partitions of a Hive table.

(7) When a SELECT statement queries a specified partition, Hive scans only that partition's data.

(8) Hive partitions come in two kinds: static and dynamic. The difference lies in how the partition is determined when data is imported: with static partitioning the partition name is entered manually, while with dynamic partitioning Hive derives the partition from the data (typically from the values of the partition columns) and creates the partitions automatically. For large bulk imports, dynamic partitioning is clearly simpler and more convenient.
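
The directory layout, dynamic partitioning, and partition pruning described in this list can be sketched in Python. This is a model of the idea, not of Hive's code: the table name logs, the column dt, and the helper names are made up, and the warehouse path is the default /user/hive/warehouse.

```python
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse"  # Hive's default warehouse directory


def partition_dir(table, partition):
    """Build the directory for one partition, e.g. .../logs/dt=2016-03-29/country=US."""
    parts = "/".join(f"{col}={val}" for col, val in partition)
    return f"{WAREHOUSE}/{table}/{parts}"


def dynamic_partition(table, rows, partition_cols):
    """Route each row to a directory derived from its own partition column values.

    The partition columns are dropped from the stored row, because their
    values are already encoded in the directory name.
    """
    layout = defaultdict(list)
    for row in rows:
        partition = [(col, row[col]) for col in partition_cols]
        stored = {k: v for k, v in row.items() if k not in partition_cols}
        layout[partition_dir(table, partition)].append(stored)
    return dict(layout)


def prune(layout, col, val):
    """Partition pruning: keep only directories whose path has a col=val component."""
    return {d: rows for d, rows in layout.items() if f"{col}={val}" in d.split("/")}


if __name__ == "__main__":
    rows = [
        {"id": 1, "msg": "a", "dt": "2016-03-29"},
        {"id": 2, "msg": "b", "dt": "2016-03-30"},
    ]
    for directory, stored in dynamic_partition("logs", rows, ["dt"]).items():
        print(directory, stored)
```

With static partitioning the caller would pass the partition value explicitly instead of reading it from each row; everything else in the layout stays the same.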

3. Buckets (splitting the data within files)

(1) Bucketing is an additional structure imposed on a table. It can improve query performance and is helpful for map-side joins.

(2) Buckets make sampling more convenient and efficient.

(3) Use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets.

create table bucketed_user(id int,name string) clustered by (id) into 4 buckets;

(4) The data within a bucket can be sorted with the SORTED BY clause.

create table bucketed_user(id int, name string) clustered by(id) sorted by (id asc) into 4 buckets;

(5) It is not recommended to bucket the data ourselves; let Hive do the bucketing.

Create the bucketed table first, then fill the buckets with data.

(6) Before filling the buckets with data, set hive.enforce.bucketing to true. (The buckets are created dynamically when data queried from another table is inserted into the bucketed table.)

insert overwrite table bucketed_user select * from users;

(7) Physically, each bucket corresponds to one MapReduce output file partition: a job produces the same number of buckets (output files) as reduce tasks.

Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record should be stored in.

This is the same principle as MapReduce's HashPartitioner.
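
The hash-then-modulo rule can be written down in a few lines of Python. The sketch assumes that an integer column hashes to its own value and that the sign bit is masked off before taking the remainder, which mirrors how HashPartitioner computes a partition; newer Hive releases may use a different hash function, so treat the exact bucket numbers as illustrative. The function names are made up.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE; masks off the sign bit


def bucket_for(hash_code, num_buckets):
    """Return the bucket index for a record whose bucketing column hashes to hash_code."""
    return (hash_code & INT_MAX) % num_buckets


def assign_buckets(ids, num_buckets):
    """Group integer ids by bucket, assuming an int column hashes to its own value."""
    buckets = {b: [] for b in range(num_buckets)}
    for i in ids:
        buckets[bucket_for(i, num_buckets)].append(i)
    return buckets


if __name__ == "__main__":
    # Four buckets, as in the bucketed_user table defined earlier.
    print(assign_buckets([0, 1, 2, 3, 4, 5, 8], 4))
```

Because equal column values always land in the same bucket, two tables bucketed on the same column with compatible bucket counts can be joined bucket by bucket, which is what makes map-side joins practical.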

For more details on bucketing, see:

https://www.jianshu.com/p/4c39e944887b

4. Storage formats

Hive manages table storage along two dimensions: the row format and the file format.

(1) Row format:

The row format determines how a row of data is stored. In Hive's terminology, the row format is defined by a SerDe, that is, a serializer-deserializer. When data is queried, the SerDe deserializes the bytes of a row in the file into the object form that Hive uses internally to operate on rows. When data is inserted into a table, the serializer converts Hive's internal row representation into bytes and writes them to the output file.
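
To make the two directions concrete, here is a minimal Python sketch of a SerDe for delimited text. It assumes Hive's default text conventions (Ctrl-A, that is \x01, as the field delimiter and \N as the NULL marker); the three-column schema and the function names are invented, and the real LazySimpleSerDe handles many more types and options.

```python
FIELD_DELIM = "\x01"  # Ctrl-A, the default field delimiter for Hive delimited text
NULL_TOKEN = "\\N"    # default text representation of NULL (backslash followed by N)

# Converters for the few column types this sketch supports.
CASTS = {"int": int, "double": float, "string": str}


def deserialize(line, schema):
    """Bytes on disk -> in-memory row: split the line and cast each field."""
    fields = line.rstrip("\n").split(FIELD_DELIM)
    row = []
    for (_name, type_name), raw in zip(schema, fields):
        row.append(None if raw == NULL_TOKEN else CASTS[type_name](raw))
    return row


def serialize(row):
    """In-memory row -> line of text: render each value and join with the delimiter."""
    return FIELD_DELIM.join(NULL_TOKEN if v is None else str(v) for v in row)


if __name__ == "__main__":
    schema = [("id", "int"), ("name", "string"), ("score", "double")]
    row = deserialize("7\x01ann\x01\\N", schema)
    print(row)
    print(repr(serialize(row)))
```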

(2) File format:

The simplest file format is plain text, but row-oriented and column-oriented binary formats are also available. Binary formats include SequenceFile, Avro, RCFile, ORC, and Parquet.

The default storage format is delimited text, which is processed by LazySimpleSerDe by default.

5. Importing data

(1) INSERT:

Supports multi-table inserts and dynamic-partition inserts.

(2) LOAD

(3) CTAS (CREATE TABLE … AS SELECT):

The basic idea is to create a table from the data returned by a query.

6. Altering and dropping tables

Hive uses schema on read, so it supports modifying a table's definition very flexibly after the table is created. But be careful: in many cases it is up to the user to ensure that the data conforms to the new structure.
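
Schema on read means the stored bytes are interpreted with whatever table definition is current at query time. The Python sketch below shows why the user has to keep data and definition in line: the same stored line is read under an old schema and under one with an added column, and the new column comes back as NULL (None here). The column names and the helper read_row are hypothetical, and the Ctrl-A delimiter assumes the default text format.

```python
FIELD_DELIM = "\x01"  # default field delimiter for Hive delimited text


def read_row(line, columns):
    """Interpret one stored line using the schema that is current at query time."""
    fields = line.split(FIELD_DELIM)
    # Columns with no stored value read as None (NULL); extra stored fields are ignored.
    padded = fields + [None] * (len(columns) - len(fields))
    return dict(zip(columns, padded))


if __name__ == "__main__":
    stored = "1\x01ann"  # written when the table had only (id, name)
    print(read_row(stored, ["id", "name"]))
    print(read_row(stored, ["id", "name", "email"]))  # after adding a column
```

Nothing on disk changes when the definition changes, which is why altering a table is cheap and also why mismatches only show up when the data is queried.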

(1) Rename table

ALTER TABLE tablename RENAME TO new_tablename;

(2) Modify column definitions (only adding columns is shown here; see the official documentation for more examples)

ALTER TABLE tablename ADD COLUMNS (colname STRING);

(3) Drop a table

DROP TABLE tablename; (for a managed table this removes the metadata and the table data; for an external table it removes only the metadata)

(4) Truncate a table (the table structure is kept and the data is emptied)

Both delete and truncate reclaim the space occupied by the data and its associated indexes. Only the table owner can truncate a table.

TRUNCATE TABLE tablename [PARTITION partition_spec ];

7. Further reading: learn the Avro, RCFile, ORC, and Parquet storage structures and compare their similarities and differences.

The following blog post describes the file formats in Hive and summarizes their characteristics:

https://blog.csdn.net/tyhawk/article/details/88801312#4_Hive_103

The following blog post compares the common Hive storage formats (mainly testing storage space and query efficiency):

https://blog.csdn.net/henrrywan/article/details/90719015
