I. Overview
1. Hive is a Hadoop-based data-warehouse management tool provided by Apache.
2. Hive provides a SQL-like language (HiveQL) for operating on Hadoop; the SQL is translated into MapReduce jobs under the hood, so execution is relatively slow.
3. Hive is suited to offline (batch) processing.
4. Hadoop must be installed on the node first; after Hive is unpacked, it automatically locates Hadoop through the HADOOP_HOME environment variable when it starts.
II. Comparison of Databases and Data Warehouses
Aspect        | Database                              | Data Warehouse
data volume   | <= GB                                 | >= TB
data types    | single: structured                    | diverse: structured, semi-structured, unstructured
data sources  | relatively single                     | databases, logs, crawlers, tracking (buried-point) data, ...
transactions  | complete transactions (ACID)          | weak / no transactions
redundancy    | redundancy is minimized               | redundancy is added artificially (copies)
scenario      | captures online real-time data        | generally stores historical data
system        | OLTP - online transaction processing  | OLAP - online analytical processing
target users  | programmers, DBAs                     | marketing, leadership, customers, and other end users
III. Hive Features
1. In Hive, each database/table corresponds to a directory in HDFS.
2. Hive has no primary keys.
3. If a delimiter between fields is required, it must be specified when the table is created; once the table has been created, the delimiter cannot be changed.
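As an illustration of point 3, a minimal sketch (the table and field names are hypothetical) of fixing the field delimiter at creation time:

```sql
-- the delimiter is part of the table definition and cannot be changed afterwards
create table person(id int, name string)
    row format delimited fields terminated by ' ';
```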
4. The three ways to write out query results:
Query data from table t1 and insert it into the specified tables t2 and t3:
from t1 insert into table t2 select * where id <= 4 insert into table t3 select * where gender = 'male';
Query data from table t1 and write it to a specified local directory:
insert overwrite local directory '/opt/hivedata' select * from t1 where id <= 2;
Query data from table t1 and write it to a specified HDFS directory:
insert overwrite directory '/table' select * from t1;
5. When writing files to a local directory or an HDFS directory, only overwrite is supported.
IV. Table Structure
1. Internal and external tables
a. Internal table: built to manage data that does not already exist on HDFS.
b. External table: built to manage data that already exists on HDFS.
c. When an internal table is dropped, its directory is deleted along with it; when an external table is dropped, the original files are left unchanged.
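A sketch of the difference (the paths and table names are hypothetical): the external table points at data that is already on HDFS, so dropping it leaves the files in place:

```sql
-- internal (managed) table: Hive owns the directory and deletes it on drop
create table logs_managed(line string);

-- external table: built over existing data; drop leaves /data/logs untouched
create external table logs_ext(line string)
    row format delimited fields terminated by '\t'
    location '/data/logs';
```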
2. Partitioned tables
a. The partition field does not exist in the original file; it must be specified manually when data is added.
b. The role of partitioning is to classify the data.
create table cities(id int, name string) partitioned by(country string) row format delimited fields terminated by ' ';
load data local inpath '/opt/hivedata/cn.txt' into table cities partition(country='china');
c. Each partition corresponds to a directory.
d. If a partition condition is added to the query, efficiency improves greatly; if the query spans partitions, efficiency instead decreases.
e. A directory created by hand is not recognized as a partition; the partition must be added manually:
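For example, continuing the cities table above, a query that restricts on the partition column only has to scan one partition directory:

```sql
-- scans only the country=china directory (partition pruning)
select * from cities where country = 'china';

-- no partition condition: scans every partition directory
select * from cities where name like 'A%';
```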
alter table cities add partition(country='japan') location '/user/hive/warehouse/hivedemo.db/cities/country=japan';
Alternatively (this may fail):
msck repair table cities;
f. If data is inserted into a partitioned table by querying a non-partitioned table, dynamic partitioning must be enabled:
# enable dynamic partitioning
set hive.exec.dynamic.partition = true;
# turn off strict mode
set hive.exec.dynamic.partition.mode = nonstrict;
# dynamic partition insert
insert into table t1 partition(class) select sid, sname, sclass from t1_tmp distribute by sclass;
g. Multiple partition fields may be specified; each later field's directory is nested inside the previous field's directory.
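A sketch of multi-field partitioning (the table, field names, and path are hypothetical); the resulting HDFS layout nests the second partition field under the first:

```sql
create table shops(id int, name string)
    partitioned by(country string, province string)
    row format delimited fields terminated by ' ';
-- resulting directory: .../shops/country=china/province=guangdong/
load data local inpath '/opt/hivedata/gd.txt'
    into table shops partition(country='china', province='guangdong');
```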
3. Bucketed tables
a. The role of bucketing is to sample the table's data.
b. The bucketing mechanism is not enabled by default.
set hive.enforce.bucketing = true;
create table t1_sam(id int, name string) clustered by(name) into 6 buckets row format delimited fields terminated by ' ';
insert into table t1_sam select * from t1;
select * from t1_sam tablesample(bucket 2 out of 3 on name);
c. A bucketed table cannot load data from a local file and cannot be an external table; a bucketed table can only be filled by inserting the results of a query on another table.
d. A table is allowed to be both partitioned and bucketed.
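A sketch combining both mechanisms (all names hypothetical): the table is partitioned by class, and the rows inside each partition are bucketed by id:

```sql
create table stu(id int, name string)
    partitioned by(class string)
    clustered by(id) into 4 buckets
    row format delimited fields terminated by ' ';
```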
V. Other
1. SerDe
a. SerDe uses regular expressions to handle irregular data.
b. Capture groups are set in the regular expression to deal with the irregular data; in use, each field in the table corresponds to one capture group, meaning the number of fields and the number of capture groups must match; the delimiters between fields are determined by the regex text between the capture groups.
create table t4(ip string, datetime string, timezone string, request string, resource string, protocol string, stateid int)
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties("input.regex" = "(.*) \-\- \\[(.*) (.*)\\] \"(.*) (.*) (.*)\" ([0-9]*) \-")
    stored as textfile;
2. Indexes
a. The role of an index is to increase query speed.
b. In databases, an index is automatically created on the primary key; Hive has no primary key, so by default Hive does not build indexes automatically.
c. In Hive, data can be indexed, but the field to index must be specified.
# create the index
create index s_index on table t1(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild in table t1_index;
# build the index for table t1
alter index s_index on t1 rebuild;
# drop the index
drop index s_index on t1;
3. Views
a. Views are divided into materialized views (maintained on disk) and virtual views (maintained in memory).
b. Hive supports only virtual views.
create view score_view as select name, math from score;
c. A view is backed by a query, but in Hive that query is not triggered when the view is created; this means that after the view is created, as long as the view is not operated on, it contains no data.
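A short usage sketch: the view's underlying query runs only when the view itself is queried:

```sql
-- triggers the underlying select on score
select * from score_view;
-- remove the view when it is no longer needed
drop view score_view;
```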
4. Metadata
a. Hive's metadata refers to database names, table names, field names, indexes, views, partition fields, bucketing fields, and so on.
b. Hive metadata is stored in a relational database; currently only two are supported: Derby and MySQL. By default, Hive stores its metadata in Derby.
5. beeline
# remote connection:
sh beeline -u jdbc:hive2://hadoop01:10000/hivedemo -n root
# -u specifies the connection URL, -n specifies the username
VI. Data Types
1. Data types are divided into basic types and complex types.
2. Complex Type:
a. array: array type, corresponding to arrays in Java
# example
create table nums(num1 array<int>, num2 array<int>) row format delimited fields terminated by ' ' collection items terminated by ',';
# query an element by index (returns NULL if the index is out of range)
select num1[0] from nums;
b. map: map type, corresponding to the Map type in Java
# example
create table infos(id int, info map<string,int>) row format delimited fields terminated by ' ' map keys terminated by ':';
# non-null query
select info['alex'] from infos where info['alex'] is not null;
c. struct: structure type, corresponding to objects in Java
# Examples
create external table score(info struct<name:string, chinese:int, math:int, english:int>) row format delimited collection items terminated by ' ' location '/score';
# get the specified field of the struct
select info.chinese from score;
VII. Common Functions
1. concat_ws: joins multiple strings with a specified separator
Example:
# contents of the txt file
mail 163 com
news baidu com
hive apache org
# create the table
create table web(app string, name string, type string) row format delimited fields terminated by ' ';
# join the three fields with '.'
select concat_ws('.', app, name, type) from web;
2. explode: expands the elements of an array into separate rows
Example:
# create an external table over the raw data
create external table words(word string) row format delimited fields terminated by ',' location '/words';
# split each line into an array of words on spaces
select split(word, ' ') from words;
# explode the array so each word is on its own row
select explode(split(word, ' ')) from words;
# count the occurrences of each word
select w, count(w) from (select explode(split(word, ' ')) w from words) ws group by w;
3. Custom functions - UDF (User Defined Function)
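A sketch of how a UDF is typically used (the jar path, class name, and function name here are hypothetical): the UDF itself is written in Java as a class with an evaluate method, compiled into a jar, and then registered and called from HiveQL:

```sql
-- load the jar that contains the compiled UDF class (path is hypothetical)
add jar /opt/udf/myudf.jar;
-- register the class under a temporary function name
create temporary function my_lower as 'com.example.hive.MyLowerUDF';
-- call it like a built-in function
select my_lower(app) from web;
```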