1. Hive data storage
Hive stores its data on Hadoop HDFS.
Hive has no data storage format of its own.
Its storage structures include databases, files, tables, and views.
By default Hive can load plain text files (TextFile) directly, and it also supports SequenceFile.
When creating a table, you declare the column delimiter and the row delimiter of the data, and Hive uses them to parse the file.
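The role of the declared delimiters can be illustrated with a minimal sketch (this is not Hive's actual parser, just the splitting rule it applies to a TextFile):

```python
# Sketch: the row delimiter splits the file into records, the column
# delimiter splits each record into fields. Hive's default column
# delimiter is Ctrl-A (\001).

def parse_textfile(raw: str, field_delim: str = "\x01", line_delim: str = "\n"):
    """Split raw file content into rows of fields using the declared delimiters."""
    rows = [line for line in raw.split(line_delim) if line]
    return [row.split(field_delim) for row in rows]

# Two records, each with two columns, separated by \001.
raw = "1\x01alice\n2\x01bob\n"
print(parse_textfile(raw))  # [['1', 'alice'], ['2', 'bob']]
```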
2. Hive data model: Database
Similar to a Database in a traditional DBMS.
The default database is "default".
After entering the CLI with the #hive command, if you do not run hive> use <database name>, the default database is used. It can also be selected explicitly: hive> use default;
Create a new database:
hive> create database test_dw;
3. Hive data model: Table
- Internal tables
Conceptually similar to a Table in a database.
Each Table has a corresponding directory in Hive where its data is stored. For example, the table test has the HDFS path $HIVE_HOME/warehouse/test, where warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.
All Table data (not including External Table data) is stored in this directory.
For an internal table, dropping the table deletes both its metadata and its data.
The concrete steps are as follows:
Create a data file inner_table.dat
Create a table
hive>create table inner_table (key string);
Load data
hive>load data local inpath '/usr/local/inner_table.dat' into table inner_table;
View data
select * from inner_table
select count(*) from inner_table
Delete table
drop table inner_table
- External tables (External Table)
- Point to data that already exists in HDFS; partitions can be created on them
- Identical to internal tables in how metadata is organized, but very different in where the actual data is stored
- An internal table involves a table creation process and a data loading process (the two can be done in the same statement); during loading, the actual data is moved into the data warehouse directory, and later access reads it directly from there. When the table is dropped, both the table's data and its metadata are deleted
- An external table involves only one process: creating the table and loading the data are completed at the same time. The actual data is not moved into the data warehouse directory; only a link to the external data is established. When an external table is dropped, only that link is deleted
A concrete example:
CREATE EXTERNAL TABLE page_view
( viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination'
)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION 'hdfs://hadoop:9000/user/data/staging/page_view';
Create a data file external_table.dat
Create a table
hive>create external table external_table1 (key string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
location '/home/external';
Create the directory /home/external in HDFS and upload the data file:
#hadoop fs -put /home/external_table.dat /home/external
Load data
LOAD DATA INPATH '/home/external_table1.dat' INTO TABLE external_table1;
View data
select * from external_table1
select count(*) from external_table1
Delete table
drop table external_table1
Difference between internal and external tables:
Dropping an external table deletes only the table information (metadata), not the data; dropping an internal table deletes both the table information and the data.
- Partition (partitioned tables)
Commands related to partitioned tables:
SHOW TABLES; # list all tables
SHOW TABLES '*TMP*'; # fuzzy matching is supported
SHOW PARTITIONS TMP_TABLE; # list the partitions of the table
DESCRIBE TMP_TABLE; # show the table structure
A Partition corresponds to a dense index on the partition column in a database.
In Hive, a Partition of a table corresponds to a subdirectory of the table's directory, and all of a Partition's data is stored in that corresponding directory.
For example, if the table test contains the two Partition columns date and city, then:
the HDFS subdirectory corresponding to date=20130201, city=bj is:
/warehouse/test/date=20130201/city=bj
the HDFS subdirectory corresponding to date=20130202, city=sh is:
/warehouse/test/date=20130202/city=sh
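The mapping from partition column values to directories can be sketched as follows (a small illustration, not Hive code; one key=value path segment per partition column):

```python
# Sketch: build the HDFS directory for one partition of a table.

def partition_path(warehouse: str, table: str, **partition_cols) -> str:
    """One key=value segment per partition column, in declaration order."""
    segments = [f"{col}={val}" for col, val in partition_cols.items()]
    return "/".join([warehouse, table] + segments)

print(partition_path("/warehouse", "test", date="20130201", city="bj"))
# /warehouse/test/date=20130201/city=bj
```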
To create one:
CREATE TABLE tmp_table # table name
(
title string, # field name, field type
minimum_bid double,
quantity bigint,
have_invoice bigint
) COMMENT 'Notes: XXX' # table comment
PARTITIONED BY (pt STRING) # partition column (if your files are very large, partitioning lets you quickly filter the data by the partition column)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001' # what the fields are separated by
STORED AS SEQUENCEFILE; # how the data is stored; SEQUENCEFILE is Hadoop's native compressed file format
Create a data file partition_table.dat
Create a table
create table partition_table(rectime string,msisdn string)
partitioned by(daytime string,city string)
row format delimited
fields terminated by '\t' stored as TEXTFILE;
Loading data into partitions
load data local inpath '/home/partition_table.dat' into table partition_table partition (daytime='2013-02-01',city='bj');
View data
select * from partition_table
select count(*) from partition_table
Delete table
drop table partition_table
alter table partition_table add partition (daytime='2013-02-04',city='bj');
Then load data into the new partition with load data.
alter table partition_table drop partition (daytime='2013-02-04',city='bj');
Dropping the partition deletes its metadata and data files, but the directory daytime=2013-02-04 itself remains.
- Bucket Table (bucketed tables)
A bucketed table hashes the value of a column and stores rows in different files according to the hash.
Create a table
create table bucket_table(id string) clustered by(id) into 4 buckets;
Load data
set hive.enforce.bucketing = true;
insert into table bucket_table select name from stu;
insert overwrite table bucket_table select name from stu;
When data is loaded into the buckets, the hash of the bucketing column is taken and reduced modulo the number of buckets; each row is then written to the corresponding file.
Sampling inquiry
select * from bucket_table tablesample(bucket 1 out of 4 on id);
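The bucketing and sampling rules above can be sketched in a few lines (an illustration only: Hive hashes strings Java-style internally, so a stable stand-in hash is used here):

```python
# Sketch: hash(value) % num_buckets picks the file a row lands in.
# TABLESAMPLE(BUCKET x OUT OF y) reads bucket x-1 (0-based) and every
# y-th bucket after it.

NUM_BUCKETS = 4

def bucket_of(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for Hive's Java-style string hash (assumption, not Hive's code).
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_buckets

def sampled_buckets(x: int, y: int, num_buckets: int = NUM_BUCKETS):
    """Buckets read by TABLESAMPLE(BUCKET x OUT OF y)."""
    return [b for b in range(num_buckets) if b % y == x - 1]

rows = ["a", "b", "c", "d"]
print({r: bucket_of(r) for r in rows})   # each row assigned to one of 4 files
print(sampled_buckets(1, 4))  # [0] -> reads roughly one quarter of the data
```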
- Create a view
create view v1 AS select * from t1;
- Table operations
Modify a table (add a column):
alter table target_tab add columns (cols string);
Delete a table:
drop table target_tab;
- Importing data
When data is loaded into a table, no data transformation is performed; a LOAD operation simply copies or moves the data file to the location corresponding to the Hive table.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Inserting the result of a query on one Hive table into another, already created Hive table:
INSERT OVERWRITE TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement FROM from_statement
CTAS (CREATE TABLE AS SELECT)
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
(col_name data_type, ...) …
AS SELECT …
Example: create table new_external_test as select * from external_table1;
- Querying tables
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] | [ORDER BY col_list] ]
[LIMIT number]
Partition-based queries
In general, a SELECT query scans the entire table. If the table was created with PARTITIONED BY, however, a query can take advantage of partition pruning (input pruning), a feature similar to a "partition index", and scan only the part of the table the query cares about. In the current implementation, Hive enables partition pruning only if the partition predicate appears in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned by day on the column date, the following query reads only the data of the partition '2013-03-01'.
SELECT page_views.* FROM page_views WHERE page_views.date >= '2013-03-01' AND page_views.date <= '2013-03-01'
LIMIT clause
LIMIT restricts the number of records a query returns. The rows returned are chosen at random. The following query retrieves 5 rows from table t1 at random:
SELECT * FROM t1 LIMIT 5
Top-N queries
The following query returns the 5 sales representatives with the largest sales:
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
- Joining tables
Import the AC information table:
hive> create table acinfo (name string,acip string) row format delimited fields terminated by '\t' stored as TEXTFILE;
hive> load data local inpath '/home/acinfo/ac.dat' into table acinfo;
Inner join:
select b.name,a.* from dim_ac a join acinfo b on (a.ac=b.acip) limit 10;
Left outer join
select b.name,a.* from dim_ac a left outer join acinfo b on a.ac=b.acip limit 10;