Hadoop Big Data Development Foundations Series, Part 7: Hive Basics

Hive Basics

 

I. What is Hive?

The essence of Hive: it translates HQL (a SQL-like language) into MapReduce jobs that run on Hadoop; it can be viewed as a SQL parsing engine.

Hive is a data warehouse tool built on Hadoop. It can map a structured data file to a table and provides SQL-like query capabilities.

A Hive table is a directory in HDFS: each table name corresponds to a directory, and if the table is partitioned, each partition value corresponds to a subdirectory.
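As a minimal sketch of this mapping (the table name, partition column, and warehouse path below are illustrative assumptions, not values from the article):

```sql
-- Illustrative only: names and paths are assumptions.
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- Table directory in HDFS:
--   /user/hive/warehouse/logs
-- Partition dt='2016-03-29' maps to the subdirectory:
--   /user/hive/warehouse/logs/dt=2016-03-29
```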

 

 

Hive Tutorial: hive wiki 

 

II. Hive architecture

1. User interfaces:

(1) CLI: starting the CLI also starts a copy of Hive in the same process.

(2) JDBC client: wraps Thrift; a Java application can connect to a Hive server running in a separate process at a specified host and port.

(3) ODBC client: the ODBC driver allows applications that support the ODBC protocol to connect to Hive.

2. Thrift server: socket-based communication with cross-language support.

3. Driver (handles the statements to be executed):

(1) Compiler: parses the statement, performs semantic analysis, compiles it, and produces a query execution plan.

(2) Optimizer: applies optimization rules such as column pruning and predicate pushdown.

(3) Executor: runs all the jobs in order.

4. Metastore:

Hive data consists of two parts: data files and metadata. The metadata stores basic information about the Hive warehouse and resides in a relational database such as MySQL or Derby. Metadata includes database and table names, columns and their attributes, table partitions and their attributes, and the directory where each table's data is located.

 

III. How Hive works

1. A user connects to Hive through one of the user interfaces and submits Hive SQL;

2. Hive parses the query and formulates a query plan;

3. Hive converts the query into MapReduce jobs;

4. Hive runs the MapReduce jobs on Hadoop.
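The query plan produced in step 2 can be inspected with EXPLAIN (the table and column names here are illustrative assumptions):

```sql
-- Print the stages (including MapReduce stages) Hive plans for a query.
EXPLAIN
SELECT dt, count(*) FROM logs GROUP BY dt;
```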

 

IV. Hive advantages and disadvantages

Hive is positioned as a data warehouse, oriented toward data analysis and computation.

1. Advantages:

(1) Well suited to processing large batches of data

(2) Takes advantage of the cluster's CPU and storage resources for parallel computation

(3) SQL-like language that automatically generates MapReduce jobs

(4) Scalability

2. Disadvantages:

(1) HQL has limited expressive power.

(2) Hive is inefficient: the MapReduce jobs Hive generates automatically are usually not intelligent; HQL tuning is difficult and coarse-grained; controllability is poor.

 

Regarding the inefficiency: the emergence of SparkSQL has effectively improved the efficiency of SQL analysis on Hadoop.

 

V. Suitable scenarios for Hive

1. Massive data storage and data analysis

2. Data Mining

3. Not suitable for complex algorithms and computations, and not suitable for real-time queries

 

VI. Using Hive

(A) Connecting to Hive

Connect using HiveServer2 with Beeline, or the Hive CLI.

(B) Hive data types

| Category | Type | Description | Example |
|---|---|---|---|
| Primitive types | BOOLEAN | true/false | TRUE |
| | TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
| | SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
| | INT | 4-byte signed integer | 1 |
| | BIGINT | 8-byte signed integer | 1L |
| | FLOAT | 4-byte single-precision floating point | 1.0 |
| | DOUBLE | 8-byte double-precision floating point | 1.0 |
| | DECIMAL | Arbitrary-precision signed decimal | 1.0 |
| | STRING | Variable-length string | 'a', "b" |
| | VARCHAR | Variable-length string | 'a', "b" |
| | CHAR | Fixed-length string | 'a', "b" |
| | BINARY | Byte array | (no literal form) |
| | TIMESTAMP | Timestamp with nanosecond precision | 122327493795 |
| | DATE | Date | '2016-03-29' |
| Complex types | ARRAY | Ordered collection of elements of the same type | array(1,2) |
| | MAP | Key-value pairs; keys must be primitive, values may be any type | map('a',1,'b',2) |
| | STRUCT | A set of named fields, possibly of different types | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0) |
| | UNION | A value from a limited set of types | create_union(1,'a',63) |
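A brief sketch of the complex types in DDL and queries (the table and column names are illustrative assumptions):

```sql
-- Hypothetical table exercising the complex types listed above.
CREATE TABLE employees (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:STRING>
);

-- Accessing complex-typed columns:
SELECT skills[0], scores['math'], address.city FROM employees;
```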

(C) Hive table operations

A table consists of stored data plus metadata.

Hive also has databases; a database can be created with CREATE DATABASE. The default database is named default.

1. Tables come in two kinds: managed tables and external tables

(1) Data storage

Managed table: data is stored under the warehouse directory.

External table: data can be stored in any HDFS directory.

(2) Data deletion

Managed table: dropping the table deletes both the metadata and the data.

External table: dropping the table deletes only the metadata.

(3) Creating a table

Managed table: CREATE TABLE table_name(attr1 STRING);

External table: CREATE EXTERNAL TABLE table_name(attr1 STRING) LOCATION 'path';
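A side-by-side sketch of the two kinds of table and their drop semantics (table names and the HDFS path are illustrative assumptions):

```sql
-- Managed table: data lives under the warehouse directory;
-- DROP TABLE removes both metadata and data.
CREATE TABLE managed_t (attr1 STRING);

-- External table: data stays at the given HDFS path;
-- DROP TABLE removes only the metadata.
CREATE EXTERNAL TABLE external_t (attr1 STRING)
LOCATION '/data/external_t';
```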

 

Partitioning data is one way to speed up queries. Table -> Partition -> Bucket.

2. Partitions (classification at the folder level)

(1) The partition column's data is not actually stored in the data files; instead, each partition is a nested subdirectory under the table directory.


(2) A table can be partitioned along multiple dimensions (for example, by an ID column).

(3) Partitions narrow the scope of a query scan and improve efficiency.

(4) Partitions are defined with the PARTITIONED BY clause when the table is created.

(5) When loading data into a partitioned table with a LOAD statement, the partition value must be specified explicitly.

(6) Use the SHOW PARTITIONS statement to list the partitions of a table.

(7) When a SELECT statement queries a specified partition, Hive scans only that partition's data.

 

(8) Hive has two kinds of partitioning: static partitions and dynamic partitions. The difference lies in how the partition is determined when data is imported: with static partitioning the partition name is entered manually, while with dynamic partitioning Hive determines the partition from the data itself (typically following naming conventions) and creates the partitions automatically. For large bulk imports, dynamic partitioning is clearly simpler and more convenient.
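A sketch of both import styles, assuming a hypothetical table `users` partitioned by a `country` column (the table names, file path, and column are illustrative; `hive.exec.dynamic.partition` and `hive.exec.dynamic.partition.mode` are real Hive settings):

```sql
-- Static partition: the partition value is given explicitly.
LOAD DATA LOCAL INPATH '/tmp/users_cn.txt'
INTO TABLE users PARTITION (country='CN');

-- Dynamic partition: Hive derives the partition from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE users PARTITION (country)
SELECT id, name, country FROM staged_users;

-- List the partitions that now exist.
SHOW PARTITIONS users;
```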

 

3. Buckets (classification within files)

(1) Bucketing is an additional structure imposed on a table that can improve query performance; it is helpful for map-side joins.

(2) Bucketing makes sampling more convenient and efficient.

(3) Use the CLUSTERED BY clause to specify the columns to bucket by and the number of buckets to divide into.

create table bucketed_user(id int,name string) clustered by (id) into 4 buckets;

(4) Data within a bucket can be sorted using the SORTED BY clause.

create table bucketed_user(id int, name string) clustered by(id) sorted by (id asc) into 4 buckets;

(5) It is not recommended to bucket the data yourself; let Hive do the bucketing.

(6) Before filling a bucketed table with data, hive.enforce.bucketing must be set to true. (Buckets are created when data is inserted by querying another table; it is a dynamic process.)

insert overwrite table bucketed_user select * from users;

(7) In practice a bucket corresponds to a partition of the MapReduce output files: the number of buckets equals the number of reduce tasks of the job that produced them.
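Putting the bucketing steps together, with a sketch of the sampling benefit from point (2) (the source table `users` is an illustrative assumption; TABLESAMPLE is standard HiveQL):

```sql
-- Required before populating a bucketed table.
SET hive.enforce.bucketing=true;

-- Populate the bucketed table by querying an ordinary table.
INSERT OVERWRITE TABLE bucketed_user SELECT * FROM users;

-- Efficient sampling: read only 1 of the 4 buckets.
SELECT * FROM bucketed_user TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
```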

 

4. Storage formats

Hive manages table storage along two dimensions: the "row format" and the "file format".

(1) Row format:

    The row format governs how rows, and the fields within them, are stored. In Hive terminology, the row format is defined by a SerDe (serializer-deserializer). When querying data, the SerDe deserializes the bytes of a row in the file into objects in Hive's internal row representation; when Hive inserts data into a table, the serializer converts Hive's internal row representation into a byte sequence and writes it to the output file.

(2) File Formats

    The simplest file format is a plain text file, but row-oriented and column-oriented binary file formats are also available: SequenceFile, Avro, RCFile, ORC, and Parquet.
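A sketch of how the two dimensions appear in DDL (the table names and delimiter are illustrative assumptions):

```sql
-- Text storage with an explicit row format (field delimiter).
CREATE TABLE logs_txt (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Column-oriented binary storage.
CREATE TABLE logs_orc (id INT, msg STRING)
STORED AS ORC;
```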

5. Data Import

(1) INSERT: includes multi-table insert and dynamic-partition insert.

(2) LOAD: import files directly.

(3) CTAS (CREATE TABLE AS SELECT): create a table from the results of a query.
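The three import styles can be sketched as follows (table names, file path, and columns are illustrative assumptions):

```sql
-- (1) INSERT: multi-table insert from a single scan of the source.
FROM staged_users su
INSERT OVERWRITE TABLE users_cn SELECT * WHERE su.country = 'CN'
INSERT OVERWRITE TABLE users_us SELECT * WHERE su.country = 'US';

-- (2) LOAD: move a file into the table's storage directory (no parsing).
LOAD DATA LOCAL INPATH '/tmp/users.txt' INTO TABLE users;

-- (3) CTAS: create a table from query results.
CREATE TABLE active_users AS
SELECT id, name FROM users WHERE active = true;
```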

 

6. Modifying and deleting tables

Hive uses schema-on-read, so after a table is created it supports very flexible modifications to the table definition. But in general you need to be careful: in many cases, you must make sure the existing data conforms to the new structure.

(1) Rename a table

(2) Modify column definitions

(3) Drop a table

(4) Truncate a table (the table structure is kept; the data is emptied)
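The four operations above can be sketched as (table and column names are illustrative assumptions):

```sql
ALTER TABLE users RENAME TO members;           -- (1) rename a table
ALTER TABLE members CHANGE id member_id INT;   -- (2) modify a column definition
DROP TABLE members;                            -- (3) drop (metadata + data for managed tables)
TRUNCATE TABLE logs;                           -- (4) empty the data, keep the structure
```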




Origin blog.csdn.net/weixin_45678149/article/details/104943444