One, What is Hive?
In essence, Hive is a SQL parsing engine: it translates HQL (a SQL dialect) into MapReduce jobs that run on Hadoop.
Hive is a data warehouse tool built on Hadoop; it maps structured data files onto tables and provides SQL-like queries over them.
A Hive table is an HDFS directory: each table name corresponds to a directory, and if the table is partitioned, each partition value corresponds to a subdirectory.
Hive tutorial: see the Hive wiki.
Two, Hive architecture:
1. User interfaces:
(1) CLI: starting the CLI also starts a local copy of Hive.
(2) JDBC client: wraps Thrift; a Java application can connect to a Hive server running in another process at a specified host and port.
(3) ODBC client: the ODBC driver allows applications that support the ODBC protocol to connect to Hive.
2. Thrift server: socket-based communication with cross-language support.
3. Driver (processes the statements to be executed):
(1) Compiler: parses the statement, performs semantic analysis and compilation, and produces a query plan.
(2) Optimizer: applies optimization rules such as column pruning and predicate pushdown.
(3) Executor: runs all the jobs in order.
4. Metastore:
Hive data consists of two parts: data files and metadata. The metadata stores the basic information of the Hive warehouse and lives in a relational database such as MySQL or Derby. It includes database and table names, column and partition attributes, table properties, and the directories where the table data is located.
Three, Hive's operating mechanism
1. A user connects to Hive through one of the user interfaces and issues Hive SQL;
2. Hive parses the query and produces a query plan;
3. Hive converts the query into MapReduce jobs;
4. Hive runs the MapReduce jobs on Hadoop.
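The plan Hive produces in step 2 can be inspected with EXPLAIN. A minimal sketch, assuming a hypothetical `users(id INT, name STRING)` table already exists:

```sql
-- Show the stage plan, including the MapReduce stages Hive generates,
-- without actually running the query. `users` is a hypothetical table.
EXPLAIN
SELECT name, count(*) AS cnt
FROM users
GROUP BY name;
```

The output lists the stage dependencies and, for each stage, the map and reduce operator trees Hive compiled from the HQL.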
Four, Hive advantages and disadvantages
Positioned as a data warehouse, Hive leans toward data analysis and computation.
1. Advantages:
(1) Well suited to processing data in large batches
(2) Exploits the cluster's CPU and storage resources for parallel computation
(3) SQL-like language with automatic MapReduce generation
(4) Scalability
2. Disadvantages:
(1) HQL has limited expressiveness.
(2) Hive is inefficient: the MapReduce jobs Hive generates automatically are usually not very intelligent; HQL tuning is difficult and coarse-grained, with poor controllability.
Regarding the inefficiency: the emergence of SparkSQL has effectively improved the efficiency of running SQL analysis on Hadoop.
Five, Scenarios where Hive fits
1. Mass data storage and data analysis
2. Data mining
3. Not suitable for complex algorithms and computations, nor for real-time queries
Six, Using Hive
(1) Connecting to Hive
Connect using HiveServer2, Beeline, or the CLI.
(2) Hive data types
| Category | Type | Description | Examples |
| --- | --- | --- | --- |
| Primitive types | BOOLEAN | true/false | TRUE |
| | TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
| | SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
| | INT | 4-byte signed integer | 1 |
| | BIGINT | 8-byte signed integer | 1L |
| | FLOAT | 4-byte single-precision floating point | 1.0 |
| | DOUBLE | 8-byte double-precision floating point | 1.0 |
| | DECIMAL | arbitrary-precision signed decimal | 1.0 |
| | STRING | variable-length string | "a", 'b' |
| | VARCHAR | variable-length string | "a", 'b' |
| | CHAR | fixed-length string | "a", 'b' |
| | BINARY | byte array | (no literal form) |
| | TIMESTAMP | timestamp, nanosecond precision | 122327493795 |
| | DATE | date | '2016-03-29' |
| Complex types | ARRAY | ordered collection of elements of the same type | array(1,2) |
| | MAP | key-value pairs; keys must be a primitive type, values may be any type | map('a',1,'b',2) |
| | STRUCT | a set of named fields, possibly of different types | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0) |
| | UNION | a value drawn from a limited set of types | create_union(1,'a',63) |
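As an illustration of the complex types above, here is a hedged sketch of a table definition; the table and column names (`employees`, `skills`, `scores`, `address`) are hypothetical:

```sql
-- Hypothetical table mixing primitive and complex column types.
CREATE TABLE employees (
  name    STRING,
  salary  DOUBLE,
  skills  ARRAY<STRING>,                     -- ordered, same-typed elements
  scores  MAP<STRING, INT>,                  -- primitive keys, any-typed values
  address STRUCT<city:STRING, zip:STRING>    -- named fields of different types
);

-- Array elements and map values are accessed with [], struct fields with dot syntax:
SELECT skills[0], scores['math'], address.city FROM employees;
```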
(3) Hive tables and operations
A table consists of the stored data plus its metadata.
Hive also has databases, which can be created with CREATE DATABASE. The default database is named default.
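A quick sketch of working with databases; the database name `analytics` is an example:

```sql
-- Create and switch to a database; tables created afterwards belong to it.
CREATE DATABASE IF NOT EXISTS analytics;
USE analytics;
SHOW DATABASES;   -- lists `default`, `analytics`, and any others
```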
1. Tables come in two kinds: managed (internal) tables and external tables.
(1) Data storage
Managed table: data is stored under the warehouse directory.
External table: data may live in any HDFS directory.
(2) Data deletion
Managed table: dropping the table removes both the metadata and the data.
External table: dropping the table removes only the metadata.
(3) Creating tables
Managed table: CREATE TABLE table_name (attr1 STRING);
External table: CREATE EXTERNAL TABLE table_name (attr1 STRING) LOCATION 'path';
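The difference in drop behavior described above can be sketched as follows; the table names and the HDFS path `/data/logs` are examples:

```sql
-- Managed table: data lives under the warehouse directory
-- (e.g. /user/hive/warehouse/logs_managed).
CREATE TABLE logs_managed (line STRING);

-- External table: data stays at the given HDFS path.
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/data/logs';

DROP TABLE logs_managed;   -- removes metadata AND the data files
DROP TABLE logs_external;  -- removes metadata only; files under /data/logs remain
```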
Partitioning the data appropriately can speed up queries. Table -> Partition -> Bucket.
2. Partitions (classification at the folder level)
(1) Partition-column values are not stored in the data files themselves; instead, each partition is a nested subdirectory under the table's directory.
(2) A table can be partitioned along multiple dimensions.
(3) Partitions narrow the range of data a query must scan, improving efficiency.
(4) Partitions are defined with the PARTITIONED BY clause when the table is created.
(5) When loading data into a partitioned table with a LOAD statement, the partition values must be given explicitly.
(6) Use the SHOW PARTITIONS statement to list a table's partitions.
(7) When a SELECT statement restricts a partition column, Hive scans only the data of the specified partitions.
(8) Hive partitions come in two kinds, static and dynamic. They differ in how the partition is determined at import time: for a static partition the partition name is supplied manually, while for a dynamic partition Hive derives the partition from the data itself (typically via a naming convention) and creates partitions automatically. For importing data in large batches, dynamic partitioning is clearly simpler and more convenient.
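Points (4)-(8) above can be sketched as follows; the table `logs`, the source table `staging_logs`, and the HDFS path are hypothetical:

```sql
-- (4) Define a partition column at creation time.
CREATE TABLE logs (line STRING)
PARTITIONED BY (dt STRING);

-- (5) Static partition: the partition value is given explicitly on load.
LOAD DATA INPATH '/tmp/logs-2016-03-29'
INTO TABLE logs PARTITION (dt = '2016-03-29');

-- (6) List the table's partitions.
SHOW PARTITIONS logs;

-- (7) Only the dt='2016-03-29' subdirectory is scanned.
SELECT * FROM logs WHERE dt = '2016-03-29';

-- (8) Dynamic partition: Hive derives dt from the query result.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT line, dt FROM staging_logs;   -- hypothetical source table
```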
3. Buckets (classification within the files)
(1) Buckets impose extra structure on a table and can improve query performance, for example by enabling map-side joins.
(2) They make sampling more convenient and efficient.
(3) Use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets.
create table bucketed_user(id int,name string) clustered by (id) into 4 buckets;
(4) The data within a bucket can be sorted, using the SORTED BY clause.
create table bucketed_user(id int, name string) clustered by(id) sorted by (id asc) into 4 buckets;
(5) Bucketing the data ourselves is not recommended; it is better to let Hive do the bucketing.
(6) Before populating a bucketed table, hive.enforce.bucketing must be set to true. (A bucketed table is populated by inserting data queried from another table; the bucketing happens dynamically during the insert.)
insert overwrite table bucketed_user select * from users;
(7) In practice each bucket corresponds to one partition file of the MapReduce output: the number of buckets equals the number of reduce tasks in the job that produces them.
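The sampling advantage mentioned in (2) can be sketched with TABLESAMPLE against the `bucketed_user` table created above:

```sql
-- Read only bucket 1 of the 4 buckets (roughly a quarter of the rows),
-- without scanning the whole table.
SELECT * FROM bucketed_user
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```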
4. Storage formats
Hive manages table storage along two dimensions: the row format and the file format.
(1) Row format:
The row format governs how rows, and the fields within a row, are stored. In Hive terms, the row format is defined by a SerDe (serializer-deserializer). When querying, the SerDe deserializes the byte data of a file's rows into objects in Hive's internal row representation; when inserting into a table, Hive's serializer converts the internal row representation into a byte stream written to the output file.
(2) File formats
The simplest file format is plain text, but column-oriented and row-oriented binary formats can also be used: SequenceFile, Avro, RCFile, ORC, or Parquet.
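Both dimensions appear in the DDL; a minimal sketch with hypothetical table names:

```sql
-- Plain-text file format with an explicit row format (delimited fields).
CREATE TABLE users_text (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Column-oriented binary file format; the SerDe is implied by the format.
CREATE TABLE users_orc (id INT, name STRING)
STORED AS ORC;
```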
5. Data import
(1) INSERT: includes multi-table inserts and dynamic-partition inserts.
(2) LOAD.
(3) CTAS (CREATE TABLE ... AS SELECT): create a table from the result of a query.
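The three import methods can be sketched as follows; all table names and the file path are hypothetical:

```sql
-- (1) INSERT, in its multi-table form: one scan of the source feeds two tables.
FROM staging_users
INSERT OVERWRITE TABLE users      SELECT id, name
INSERT OVERWRITE TABLE user_names SELECT name;

-- (2) LOAD: moves the file into the table's directory; the file is not parsed.
LOAD DATA INPATH '/tmp/users.tsv' INTO TABLE users;

-- (3) CTAS: create a new table from a query result.
CREATE TABLE active_users AS
SELECT id, name FROM users WHERE id > 0;
```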
6. Modifying and deleting tables
Hive uses schema-on-read, so it supports modifying a table's definition very flexibly after the table is created. But be careful: in many cases it is up to you to make sure the existing data conforms to the new structure.
(1) Rename a table
(2) Modify a column definition
(3) Drop a table
(4) Truncate a table (the table structure is kept; the data is emptied)
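The four operations above, sketched with hypothetical table and column names:

```sql
ALTER TABLE users RENAME TO members;                       -- (1) rename a table
ALTER TABLE members CHANGE COLUMN name full_name STRING;   -- (2) modify a column definition
TRUNCATE TABLE members;                                    -- (4) empty the data, keep the schema
DROP TABLE members;                                        -- (3) drop the table
```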