Hive Study Notes (Short Version)

I. Overview

1. Hive is a data warehouse tool built on Hadoop, provided by Apache.
2. Hive provides a SQL-like language (HiveQL) for operating on Hadoop; under the hood, the SQL is translated into MapReduce jobs, so execution is relatively slow.
3. Hive is suited to offline (batch) processing.
4. Hadoop must be installed on the node before Hive; after Hive is extracted, it automatically locates Hadoop through the HADOOP_HOME environment variable at startup.

II. Comparison of Data Warehouses and Databases

| Aspect       | Database                               | Data Warehouse                                        |
| Data volume  | Usually <= GB                          | Usually >= TB                                         |
| Data types   | Single - structured                    | Diverse - structured, semi-structured, unstructured   |
| Data sources | Relatively single                      | Databases, logs, crawlers, event tracking, ...        |
| Transactions | Complete transactions (ACID)           | Weak or no transactions                               |
| Redundancy   | Redundancy minimized                   | Deliberately redundant - copies kept                  |
| Scenario     | Captures real-time online data         | Generally stores historical data                      |
| System type  | OLTP - online transaction processing   | OLAP - online analytical processing                   |
| Users        | Programmers, DBAs                      | Marketing, leadership, customers, and other end users |

III. Hive Features

1. In Hive, each database/table corresponds to a directory in HDFS.
2. Hive has no primary keys.
3. If a delimiter between fields is needed, it must be specified when the table is created; once the table is created, the delimiter cannot be changed.
4. Three ways to query data out of a table:
Query data from table t1 and insert it into the specified tables t2 and t3:

from t1 insert into t2 select * where id <= 4 insert into t3 select * where gender = 'male';

Query data from table t1 and write it to a specified local directory:

insert overwrite local directory '/opt/hivedata' select * from t1 where id <= 2;

Query data from table t1 and write it to a specified HDFS directory:

insert overwrite directory '/table' select * from t1;

5. When writing query results to a local directory or an HDFS directory, only overwrite is supported.

IV. Table Structure

1. Internal and external tables
a. Internal table: a table you create to manage data that does not yet exist on HDFS.
b. External table: a table created to manage data that already exists on HDFS.
c. Deleting an internal table also deletes its corresponding directory; deleting an external table leaves the original files unchanged.
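A minimal sketch of the difference (table names and the HDFS path are hypothetical):

```sql
-- Internal (managed) table: Hive owns the data directory;
-- dropping the table also deletes the directory in HDFS.
create table emp_internal(id int, name string)
row format delimited fields terminated by ' ';

-- External table: points at data that already exists on HDFS;
-- dropping the table removes only the metadata, the files remain.
create external table emp_external(id int, name string)
row format delimited fields terminated by ' '
location '/data/emp';
```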
2. Partition tables
a. The partition field does not exist in the original file; it must be specified manually when adding data.
b. The role of partitioning is to classify data.

create table cities(id int, name string) partitioned by(country string) row format delimited fields terminated by ' ';
load data local inpath '/opt/hivedata/cn.txt' into table cities partition(country='china');

c. Each partition corresponds to a directory.
d. Adding a partition condition to a query greatly improves efficiency; querying across partitions, however, reduces efficiency.
e. A directory created by hand is not automatically recognized as a partition; the partition must be added manually:

alter table cities add partition(country='japan') location '/user/hive/warehouse/hivedemo.db/cities/country=japan';

Or (this may fail):

msck repair table cities;

f. To insert data queried from a non-partitioned table into a partitioned table, dynamic partitioning must be enabled:
# Enable dynamic partitioning scheme

set hive.exec.dynamic.partition = true;

# Turn off strict mode

set hive.exec.dynamic.partition.mode = nonstrict;

# Dynamic partitioning

insert into table t1 partition(class) select sid, sname, sclass from t1_tmp distribute by sclass;

g. A table may be partitioned by multiple fields; each later partition field is nested under the preceding one.
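For example, a two-level partition (table name, field names, and the local file path below are hypothetical); the second field's directories are nested inside the first's:

```sql
-- Creates directories like .../province=.../city=... under the table directory
create table people(id int, name string)
partitioned by (province string, city string)
row format delimited fields terminated by ' ';

load data local inpath '/opt/hivedata/sjz.txt'
into table people partition(province='hebei', city='shijiazhuang');
```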
3. Bucketed tables
a. The role of bucketing is to sample table data.
b. The bucketing mechanism is disabled by default:

set hive.enforce.bucketing = true;
create table t1_sam(id int, name string) clustered by(name) into 6 buckets row format delimited fields terminated by ' ';
insert into table t1_sam select * from t1;
select * from t1_sam tablesample(bucket 2 out of 3 on name);

c. A bucketed table cannot load data from local files and cannot be an external table; data can only be inserted into a bucketed table by querying another table.
d. A table may be both partitioned and bucketed.
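A sketch of a table that is both partitioned and bucketed (table and field names are hypothetical):

```sql
-- Partitioned by country; within each partition, rows are
-- hashed on id into 4 bucket files.
create table user_pb(id int, name string)
partitioned by (country string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ' ';
```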

V. Other

1. SerDe
a. SerDe uses regular expressions to handle irregularly formatted data.
b. Capture groups are defined in the regular expression; each field in the table corresponds to one capture group, so the number of fields must match the number of capture groups. The text between capture groups serves as the delimiter between fields.

create table t4(ip string, datetime string, timezone string, request string,resource string, protocol string, stateid int)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties("input.regex" = "(.*) \-\- \\[(.*) (.*)\\] \"(.*) (.*) (.*)\" ([0-9]*) \-") stored as textfile;
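As an illustration, a hypothetical log line that the regex above would split into its seven capture groups:

```sql
-- Hypothetical input line in the text file:
--   192.168.0.1 -- [10/Nov/2019:12:00:00 +0800] "GET /index.html HTTP/1.1" 200 -
-- The capture groups map to the table fields in order:
--   ip=192.168.0.1, datetime=10/Nov/2019:12:00:00, timezone=+0800,
--   request=GET, resource=/index.html, protocol=HTTP/1.1, stateid=200
select ip, resource, stateid from t4;
```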

2. Indexes
a. The role of an index is to speed up queries.
b. Relational databases automatically index the primary key; Hive has no primary keys, so it does not build indexes automatically.
c. In Hive, data can be indexed, but the field to index must be specified explicitly.
# Create the index table

create index s_index on table t1(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild in table t1_index;

# Generate an index for the table t1

alter index s_index on t1 rebuild;

# Delete the index

drop index s_index on t1;

3. Views
a. Views are divided into materialized views (maintained on disk) and virtual views (maintained in memory).
b. Hive supports only virtual views:

create view score_view as select name, math from score;

c. A view is backed by a query, but in Hive that query is not executed when the view is created; as long as the view has not been used, it contains no data.
4. Metadata
a. Hive's metadata covers database names, table names, field names, indexes, views, partition fields, bucketing fields, and so on.
b. Hive stores metadata in a relational database; only two are currently supported, Derby and MySQL. By default, Hive stores its metadata in Derby.
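For reference, switching the metastore to MySQL is configured in hive-site.xml via the standard JDBC connection properties (the host, database name, and credentials below are placeholders):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop01:3306/hive_meta</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
</property>
```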
5. beeline
# Remote connection:

sh beeline -u jdbc:hive2://hadoop01:10000/hivedemo -n root

# -u specifies the JDBC connection URL, -n the user name

VI. Data Types

1. Data types are divided into basic types and complex types.
2. Complex types:
a. array - array type, corresponding to arrays and collections in Java
# Example

create table nums(num1 array<int>, num2 array<int>) row format delimited fields terminated by ' ' collection items terminated by ',';

# Non-null query (array indexing starts at 0)

select num1[1] from nums where num1[1] is not null;

b. map - map type, corresponding to Map in Java
# Statement

create table infos(id int, info map<string,int>) row format delimited fields terminated by ' ' map keys terminated by ':';

# Non-empty query

select info['alex'] from infos where info['alex'] is not null;

c. struct - structure type, corresponding to objects in Java
# Example

create external table score(info struct<name:string, chinese:int, math:int, english:int>) row format delimited collection items terminated by ' ' location '/score';

# Gets the specified property value

select info.chinese from score;

VII. Main Functions

1. concat_ws: joins multiple strings with a specified separator
Example:
# txt file content

mail 163 com
news baidu com
hive apache org

# Build the table

create table web(app string, name string, type string) row format delimited fields terminated by ' ';

# Concatenate

select concat_ws('.', app, name, type) from web;

2. explode: expands the elements of an array into separate rows
Example:
# Create an external table over the raw data

create external table words(word string) row format delimited fields terminated by ',' location '/words';

# Split each line into an array of words on spaces

select split(word, ' ') from words;

# Expand each word of the array into its own row

select explode(split(word, ' ')) from words;

# Count the occurrences of each word

select w, count(w) from (select explode(split(word, ' ')) w from words) ws group by w;

3. Custom functions - UDF (User-Defined Function)
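A UDF is written in Java, packaged as a jar, and registered in Hive before use; a sketch of the registration side (the jar path, class name, and function name are hypothetical):

```sql
-- Make the jar visible to the session
add jar /opt/hivedata/myudf.jar;

-- Bind a function name to the UDF class inside the jar
create temporary function my_upper as 'com.example.hive.MyUpperUDF';

-- Use it like a built-in function
select my_upper(name) from web;
```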

Origin www.cnblogs.com/fanqinglu/p/11837548.html