Hive table operations (Hive data storage, databases, tables, partitions, buckets)

1. Hive data storage

Hive's data storage is built on Hadoop HDFS.

Hive has no special data storage format of its own.

Its storage structures include databases, files, tables, and views.

By default, Hive can load text files (TextFile) directly; it also supports SequenceFile.

When creating a table, specify Hive's column delimiter and row delimiter so that Hive can parse the data.
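For example, a minimal sketch of such a declaration (the table name and columns are hypothetical; the delimiter syntax matches the examples later in this post):

create table demo_text (id string, name string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;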

 

2. Hive data model - Database

Similar to a Database in a traditional DBMS.

The default database is "default".

After entering the CLI with the #hive command, if you do not run hive> use <database name>, the system uses the default database. To select it explicitly: hive> use default;

 

Create a new database

hive> create database test_dw;
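To switch to the new database:

hive> use test_dw;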

 

3. Hive data model - Table

  • Internal table (Table)

Conceptually similar to a table in a database.

Each Table has a corresponding directory in HDFS where Hive stores its data. For example, a table test is stored under the HDFS path $HIVE_HOME/warehouse/test, where warehouse is the data-warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.

All Table data (not including External Tables) is stored in this directory.

For an internal table, dropping the table deletes both the metadata and the data.

 

The specific operations are as follows:

Create a data file inner_table.dat

 

Create a table

hive>create table inner_table (key string);

Load data

hive>load data local inpath '/usr/local/inner_table.dat' into table inner_table;

View data

select * from inner_table

select count(*) from inner_table

Delete table

drop table inner_table

 

  • External table (External Table)
  1. Points to data that already exists in HDFS; partitions can be created for it.
  2. Its metadata is organized in the same way as an internal table's, but the actual data is stored quite differently.
  3. An internal table has a table-creation process and a data-loading process (the two can be done in a single statement). During loading, the actual data is moved into the data-warehouse directory; afterwards, access goes directly through the data-warehouse directory. When the table is dropped, both its data and metadata are deleted.
  4. An external table has only one process: creating the table and loading the data are completed at the same time. The actual data is not moved into the data-warehouse directory; only a link to the external data is established. When an external table is dropped, only that link is deleted.

Specific examples are as follows:

CREATE EXTERNAL TABLE page_view
( viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination'
)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION 'hdfs://hadoop:9000/user/data/staging/page_view';

 

Create a data file external_table.dat

 

Create a table

hive>create external table external_table1 (key string)
row format delimited
fields terminated by '\t'
location '/home/external';

 

Upload the data file into the HDFS directory /home/external:

#hadoop fs -put /home/external_table.dat /home/external

 

Load data

LOAD DATA INPATH '/home/external_table1.dat' INTO TABLE external_table1;

 

View data

select * from external_table1

select count(*) from external_table1

 

Delete table

drop table external_table1

 

The difference between internal and external tables:

Dropping an external table deletes only the table information (metadata) and leaves the data intact; dropping an internal table deletes both the table information and the data.
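To check which kind a table is, DESCRIBE FORMATTED reports a Table Type field (MANAGED_TABLE for internal tables, EXTERNAL_TABLE for external ones):

hive> describe formatted external_table1;
-- the output includes a line such as: Table Type: EXTERNAL_TABLE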

 

  • Partition table (Partition)

 

Partition Table related commands:

SHOW TABLES;                  -- list all tables

SHOW TABLES '*TMP*';          -- fuzzy matching is supported

SHOW PARTITIONS TMP_TABLE;    -- list the table's partitions

DESCRIBE TMP_TABLE;           -- show the table structure

 

A Partition corresponds to a dense index on the partition column in a database.

In Hive, a Partition of a table corresponds to a subdirectory under the table's directory, and all of a Partition's data is stored in that directory.

For example, the table test contains two Partitions, date and city.

The partition date=20130201, city=bj corresponds to the HDFS subdirectory:

/warehouse/test/date=20130201/city=bj

The partition date=20130202, city=sh corresponds to the HDFS subdirectory:

/warehouse/test/date=20130202/city=sh

 

The operations are as follows:

CREATE TABLE tmp_table            -- table name
(
  title           string,         -- field name and field type
  minimum_bid     double,
  quantity        bigint,
  have_invoice    bigint
) COMMENT 'Notes: XXX'            -- table comment
PARTITIONED BY (pt STRING)        -- partition field (if the files are very large, partitioning makes it fast to filter out data by the partition field)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'     -- field delimiter
STORED AS SEQUENCEFILE;           -- storage format; SEQUENCEFILE is Hadoop's native (compressible) file format

 

Create a data file partition_table.dat

Create a table

create table partition_table(rectime string,msisdn string)
partitioned by(daytime string,city string)
row format delimited
fields terminated by '\t' stored as TEXTFILE;

Loading data into partitions

load data local inpath '/home/partition_table.dat' into table partition_table partition (daytime='2013-02-01',city='bj');

View data

select * from partition_table

select count(*) from partition_table

Delete table

drop table partition_table

 

alter table partition_table add partition (daytime='2013-02-04',city='bj');

Then load data into the new partition with LOAD DATA.
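For example, a sketch reusing the data file from the earlier load:

load data local inpath '/home/partition_table.dat' into table partition_table partition (daytime='2013-02-04',city='bj');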

 

alter table partition_table drop partition (daytime='2013-02-04',city='bj')

The metadata and data files are deleted, but the directory daytime=2013-02-04 still exists.
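The remaining partitions can be confirmed with the SHOW PARTITIONS command introduced above:

show partitions partition_table;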

 

  • Bucket table (Bucket Table)

A bucket table hashes the data values and stores them in different files.

Create a table

create table bucket_table(id string) clustered by(id) into 4 buckets;

Load data

set hive.enforce.bucketing = true;

insert into table bucket_table select name from stu;

insert overwrite table bucket_table select name from stu;

When data is loaded into the bucket table, the hash of the field is computed and taken modulo the number of buckets; each row is then placed into the corresponding file.
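As an illustration, the bucket index can be approximated by hand with Hive's built-in hash() and pmod() functions (a sketch; the actual bucketing code computes the same non-negative hash modulo the bucket count):

-- approximate bucket index for each id, for a table with 4 buckets
select id, pmod(hash(id), 4) as bucket_idx from bucket_table;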

 

Sampling query

select * from bucket_table tablesample(bucket 1 out of 4 on id);
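TABLESAMPLE(BUCKET x OUT OF y ON col) reads every y-th bucket starting from bucket x, so the query above reads 1 of the table's 4 buckets. A sketch sampling half of the buckets instead:

select * from bucket_table tablesample(bucket 1 out of 2 on id);
-- with 4 buckets, this reads buckets 1 and 3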

 

  • Create a view

create view v1 AS select * from t1;
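The view can then be queried like an ordinary table and removed with DROP VIEW:

select * from v1;
drop view v1;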

 

  • Table operations

Modify table

alter table target_tab add columns (cols string);

 

Delete table

drop table target_tab;
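Other ALTER operations follow the same pattern; for example, renaming a table (standard HiveQL, added here as an extra illustration):

alter table target_tab rename to new_tab;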

 

  • Import Data

When data is loaded into a table, no transformation is applied; a Load operation simply copies/moves the data file to the location corresponding to the Hive table.

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
    INTO TABLE tablename
    [PARTITION (partcol1=val1, partcol2=val2 ...)]
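For example, a sketch combining the optional clauses with the partition table defined earlier (the local file path /home/demo.dat is hypothetical):

load data local inpath '/home/demo.dat' overwrite into table partition_table partition (daytime='2013-02-01',city='bj');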

To import data from one Hive table into another existing Hive table:

INSERT OVERWRITE TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement FROM from_statement
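For example, a sketch using the partition table defined earlier (the partition values are illustrative):

insert overwrite table partition_table partition (daytime='2013-02-02',city='sh')
select rectime, msisdn from partition_table where daytime='2013-02-01' and city='bj';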

CTAS

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

(col_name data_type, ...) …

AS SELECT …

Example: create table new_external_test as select * from external_table1;

 

  • Querying tables

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list]

[ CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] | [ORDER BY col_list] ]

[LIMIT number]
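For example, a sketch that counts rows per city in the partition table defined earlier:

select city, count(*) from partition_table
where daytime='2013-02-01'
group by city
limit 10;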

 

Partition-based queries

In general, a SELECT query scans the entire table. For a partitioned table, however, a query can take advantage of partition pruning (input pruning), similar to "partition indexes", and scan only the relevant part of the table. In the current implementation, Hive applies partition pruning only when the partition predicates (on the columns declared in PARTITIONED BY) appear in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned by day on the date column, the following statement reads only the data of partition '2013-03-01'.

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2013-03-01' AND page_views.date <= '2013-03-01'

LIMIT Clause

LIMIT restricts the number of records a query returns. The returned rows are chosen arbitrarily. The following query selects 5 arbitrary rows from table t1:

SELECT * FROM t1 LIMIT 5

Top N queries

 

The following query retrieves the 5 sales records with the largest amounts:

SET mapred.reduce.tasks = 1
  SELECT * FROM sales SORT BY amount DESC LIMIT 5

 

  • Joining tables

Import the ac information table:

hive> create table acinfo (name string,acip string)  row format delimited fields terminated by '\t' stored as TEXTFILE;

hive> load data local inpath '/home/acinfo/ac.dat' into table acinfo;

Inner join

select b.name,a.* from dim_ac a join acinfo b on (a.ac=b.acip) limit 10;

Left outer join

select b.name,a.* from dim_ac a left outer join acinfo b on a.ac=b.acip limit 10;
