[Repost] Hadoop Ecosystem -- Hive -- Internal Tables, External Tables, and Partition Tables

I. Internal and external tables

When creating a table, a table defined without the EXTERNAL keyword is an internal table (managed table); a table defined with the EXTERNAL keyword is an external table.

Internal table (MANAGED_TABLE): the table directory is placed by Hive, according to its own conventions, under the Hive warehouse directory /user/hive/warehouse.

External table (EXTERNAL_TABLE): the table directory is specified by the user when creating the table, for example:

create external table t_access(ip string,url string,access_time string)
row format delimited
fields terminated by ','
location '/access/log';
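If in doubt about which kind a given table is, Hive can report it directly; the "Table Type:" row of the output shows MANAGED_TABLE or EXTERNAL_TABLE:

```sql
-- Check whether t_access is managed or external
-- (look for the "Table Type:" row in the output).
describe formatted t_access;
```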

Differences between external tables and internal tables:

1. Directory: an internal table's directory is placed by Hive in the Hive warehouse directory, while an external table's directory is specified by the user.

2. Dropping an internal table: Hive deletes both the table's metadata and the table's data directory.

3. Dropping an external table: Hive deletes only the metadata; the data on HDFS is not deleted.

4. Modifications to an internal table are synchronized to the metadata directly, while after the structure or partitions of an external table are modified, a repair is needed (MSCK REPAIR TABLE table_name;).
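For instance, if a new partition directory is copied onto HDFS directly, bypassing Hive, the metastore knows nothing about it until the table is repaired. A sketch, assuming a hypothetical external table t_access_p partitioned by day:

```sql
-- Suppose the directory .../day=20170806 was created on HDFS
-- outside of Hive; sync the partition metadata in one shot:
MSCK REPAIR TABLE t_access_p;

-- Alternatively, a single partition can be registered explicitly:
ALTER TABLE t_access_p ADD PARTITION (day='20170806');
```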

In a Hive data warehouse, the bottom-layer tables necessarily come from external systems. To avoid interfering with the logic of those external systems, external tables can be created in Hive to map the data directories produced by them.

For the subsequent ETL operations, it is recommended that the various generated tables be managed tables (managed_table).
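A minimal sketch of this pattern, assuming the external table t_access from above as the source (the ETL table name is hypothetical):

```sql
-- A downstream ETL table simply omits EXTERNAL and LOCATION;
-- Hive then manages its directory under /user/hive/warehouse.
create table t_access_etl(ip string,url string,access_time string)
row format delimited
fields terminated by ',';

-- Populate it from the external source table:
insert into table t_access_etl
select ip,url,access_time from t_access;
```

Dropping t_access_etl later removes both its metadata and its data, while dropping the external t_access leaves the raw logs on HDFS untouched.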

 

II. Partition tables

create table t_pv_log(ip string,url string,access_time string)
partitioned by(day string)
row format delimited
fields terminated by ',';

The essence of a partition table: partitions are created as subdirectories of the table directory for the data files, so that at query time the MR program can process only the data in the relevant partition subdirectories, reducing the amount of data read. For example, to compute daily PV statistics, only that day's data needs to be read; so the table is built as a partition table, and each day's log data is stored in its own partition.

Note: partition fields must not conflict with the fields defined in the table.

For example, a website produces browsing history every day. All of it could be stored in one table, but sometimes we may only need to analyze one day's history.

In that case, the table can be built as a partition table, with each day's data loaded into its own partition.

Of course, each day's partition directory needs a directory name, and that name is the partition field.
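Concretely, after loading data, each partition shows up on HDFS as a subdirectory of the table directory named field=value; the registered partitions can be listed from Hive (a sketch, assuming the t_pv_log table above and the default warehouse path):

```sql
-- On HDFS the layout looks roughly like:
--   /user/hive/warehouse/t_pv_log/day=20170804
--   /user/hive/warehouse/t_pv_log/day=20170805
-- List the partitions the metastore knows about:
show partitions t_pv_log;
```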

Example 2-1: a single partition field

The example is as follows:

1. Create a table with a partition field

create table t_access(ip string,url string,access_time string)
partitioned by(dt string)
row format delimited
fields terminated by ',';

Note: the partition field cannot be one of the table's existing fields.

2. Load data into the partitions

load data local inpath '/root/access.log.2017-08-04.log' into table t_access partition(dt='20170804');

load data local inpath '/root/access.log.2017-08-05.log' into table t_access partition(dt='20170805');

Note: "local" here refers to the local filesystem of the machine where the Hive service is running.

3. Query the partitioned data

a. Count the total PV for August 4:

select count(*) from t_access where dt='20170804';

Essence: the partition field can be used like an ordinary table field, so a WHERE clause can specify the partition.

b. Count the total PV across all data in the table:

select count(*) from t_access;

Essence: simply omit the partition condition.
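Because the partition field behaves like any other column, per-day counts can also be computed in one query by grouping on it (a sketch over the t_access table above):

```sql
-- Daily PV for every loaded partition in one pass:
select dt, count(*) as pv
from t_access
group by dt;
```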

Example 2-2: multiple partition fields

1. Create the table:

create table t_partition(id int,name string,age int)
partitioned by(department string,sex string,howold int)
row format delimited fields terminated by ',';

2. Load the data:

load data local inpath '/root/p1.dat' into table t_partition partition(department='xiangsheng',sex='male',howold=20);
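With multiple partition fields, any subset of them can be used to filter, and partition pruning reads only the matching subdirectories. A sketch of querying the t_partition table above:

```sql
-- Filter on some of the partition fields; only the subdirectories
-- under department=xiangsheng/sex=male are scanned.
select * from t_partition
where department='xiangsheng' and sex='male';

-- List all registered partition combinations:
show partitions t_partition;
```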

 


Source: www.cnblogs.com/Jing-Wang/p/10904924.html