Big Data Road week 07, day 07 (Hive grammar and table design)

Hive architecture flow (very important; study it alongside the diagrams until you can recall it): when a client submits a request, it goes to the Driver. The Driver first checks the table and field names in the query against the metadata store, i.e. the Metastore; if they exist, the metadata is returned. The Driver then hands the HQL to the Compiler, which parses it and converts it into MapReduce (MR) tasks. The generated MR tasks are returned to the Driver, which submits them to the Hadoop cluster; YARN receives and processes the request, produces a result, and returns it to the Driver, and the Driver sends the result back to the client for display.

When a complex SQL statement is compiled, the compiler converts it into N operators, and these operators are then stitched together into MR tasks:
1. The compiler converts the SQL into Hive operators.
2. The operator is Hive's minimal processing unit.
3. Each operator represents an HDFS operation or a MapReduce job.
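To see the operators and stages a statement compiles into, you can ask Hive for the plan with EXPLAIN (a quick illustrative sketch; the person table is defined later in these notes, and the exact plan output varies by Hive version):

```sql
-- Print the stage plan / operator tree Hive generates for this query;
-- each stage corresponds to an MR job or an HDFS operation.
explain
select name, count(*)
from person
group by name;
```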

 

 

Hive clients include: CLI, Client (JDBC/ODBC), and WUI.

========================================================================================================================================

Hive basic data types

Basic data types

  TINYINT - tiny integer, 1 byte; stores -128 to 127.

  SMALLINT - small integer, 2 bytes; range -32768 to 32767.

  INT - integer, 4 bytes; range -2147483648 to 2147483647.

  BIGINT - long integer, 8 bytes; range -2^63 to 2^63-1.

  BOOLEAN - TRUE / FALSE.

  FLOAT - single-precision floating point.

  DOUBLE - double-precision floating point.

  STRING - string, with no fixed length.

Date Type:

  1. TIMESTAMP - format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal digits of precision).

  2. DATE - describes a specific year/month/day, format YYYY-MM-DD.
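A few small examples of working with the date types (a sketch; current_date and current_timestamp assume Hive 1.2 or later):

```sql
select current_date;                              -- e.g. 2019-12-20
select current_timestamp;                         -- e.g. 2019-12-20 10:30:00.123
select cast('2019-12-20' as date);                -- STRING -> DATE
select cast('2019-12-20 10:30:00' as timestamp);  -- STRING -> TIMESTAMP
```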

Complex data types: Structs, Maps, Arrays

 

A & B    bitwise AND: a bit is 1 only when both corresponding bits are 1
A ^ B    bitwise XOR: a bit is 1 only when exactly one of the two bits is 1
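A quick sketch of the two operators (6 is binary 110, 4 is binary 100):

```sql
select 6 & 4;  -- AND: 110 & 100 = 100 -> 4 (1 only where both bits are 1)
select 6 ^ 4;  -- XOR: 110 ^ 100 = 010 -> 2 (1 only where exactly one bit is 1)
```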

 

How to use Hive's complex data types: maps, arrays, structs

1. Using Array

Create a table with array as a column type:
create table person(name string, work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';
 
Sample data (person.txt):
biansutao	beijing,shanghai,tianjin,hangzhou
linan	changchu,chengdu,wuhan

Load the data:
LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;
 
Query:
hive> select * from person;
biansutao	["beijing","shanghai","tianjin","hangzhou"]
linan	["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds

hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
 
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
 
hive> select work_locations from person;
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
 
hive> select work_locations[3] from person;
NULL
hangzhou
Time taken: 12.722 seconds
 
hive> select work_locations[4] from person;
NULL
NULL
Time taken: 15.958 seconds
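Beyond indexing by position, Hive has built-in functions for arrays; a small sketch against the person table above (size, array_contains and explode are standard Hive UDFs):

```sql
select name, size(work_locations) from person;                       -- element count
select name, array_contains(work_locations, 'beijing') from person;  -- membership test

-- Flatten the array into one row per element:
select name, loc
from person lateral view explode(work_locations) t as loc;
```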


2. Using Map

Create the table:
create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
 
Sample data (score.txt) - the keys here are the Chinese words for mathematics (数学), Chinese (语文) and English (英语), matching the queries below:
biansutao	'数学':80,'语文':89,'英语':95
jobs	'语文':60,'数学':80,'英语':99

Load the data:
LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;
 
Query:
hive> select * from score;
biansutao	{"数学":80,"语文":89,"英语":95}
jobs	{"语文":60,"数学":80,"英语":99}
Time taken: 0.665 seconds

hive> select name from score;
jobs
biansutao
Time taken: 19.778 seconds

hive> select t.score from score t;
{"语文":60,"数学":80,"英语":99}
{"数学":80,"语文":89,"英语":95}
Time taken: 19.353 seconds

hive> select t.score['语文'] from score t;
60
89
Time taken: 13.054 seconds
 
hive> select t.score['英语'] from score t;
99
95
Time taken: 13.769 seconds
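Maps have analogous built-in functions; a sketch against the score table above (map_keys and map_values are standard Hive UDFs):

```sql
select name, map_keys(score) from score;    -- array of the map's keys
select name, map_values(score) from score;  -- array of the map's values
select name, size(score) from score;        -- number of key/value pairs
```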
 
3. Using Struct

Create the table:
CREATE TABLE test(id int,course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Sample data (test.txt):
1	english,80
2	math,89
3	chinese,95

Load the data:
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;

Query:
hive> select * from test;
OK
1       {"course":"english","score":80}
2       {"course":"math","score":89}
3       {"course":"chinese","score":95}
Time taken: 0.275 seconds
 
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
 
hive> select t.course.course from test t;
english
math
chinese
Time taken: 15.827 seconds
 
hive> select t.course.score from test t;
80
89
95
Time taken: 13.235 seconds
 
4. Combining data types (arbitrarily nested combinations of complex types are not supported)

create table test1(id int,a MAP<STRING,ARRAY<STRING>>)
row format delimited fields terminated by '\t' 
collection items terminated by ','
MAP KEYS TERMINATED BY ':';
 
Sample data (test1.txt):
1	english:80,90,70
2	math:89,78,86
3	chinese:99,100,82
 
LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;

=============================================================================================================

DDL Programming:

Create a database: create database xxxxx;

List databases: show databases;

Delete a database: drop database tmp;

Force-delete a non-empty database: drop database tmp cascade;

List tables: SHOW TABLES;

View a table's metadata:

  desc test_table;

  describe extended test_table;

  describe formatted test_table; (used most often, since it gives the most detailed view)

View the CREATE TABLE statement: show create table table_XXX;

Rename a table: alter table test_table rename to new_table;

Change a column's data type: alter table lv_test change column colxx colxx string; (CHANGE COLUMN takes the old name, the new name, then the new type)

Add or drop partitions:

        alter table test_table add partition (pt=xxxx);

        alter table test_table drop if exists partition(...);
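Putting the commands above together, a minimal end-to-end DDL session might look like this (demo_db and the column names are made up for illustration):

```sql
create database demo_db;
use demo_db;

create table test_table (id int, name string)
partitioned by (pt string);

show tables;
describe formatted test_table;
show create table test_table;

alter table test_table add partition (pt='20191220');
alter table test_table drop if exists partition (pt='20191220');
alter table test_table rename to new_table;

drop database demo_db cascade;
```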

===========================================================================================================================

Hive does not check the data format at load time; it only checks at query time (schema on read).
Hive has no transactions by default (effectively none at all), so data can be written and read at the same time.

 

Loading data from HDFS into Hive (note: this is a move; after loading, the file no longer exists at its original HDFS path):
load data inpath '/usr/test/dianxin_data' into table dianxin_1 partition (province='zhejiang');

Loading from the local filesystem (note: this is a copy from local):
load data local inpath '/usr/local/soft/data/shujia006_hive/dianxin_data' into table dianxin_1 partition (province='nanjing');

Loading from another table (note: this copies the query result into the new table):
Method 1: create table dianxin_test1 as select * from dianxin_1 limit 10;

Method 2: insert [overwrite] into table dianxin_test2 select * from dianxin_test;

===========================================================================================================================
The difference between internal and external tables:
after dropping an internal table, both the data files and the table metadata are deleted.

Dropping an external table deletes only the table metadata.
CREATE EXTERNAL TABLE IF NOT EXISTS dianxin_like LIKE

Adding EXTERNAL creates an external table; without it, the table is internal.

create [EXTERNAL] table vv_stat_fact
(
userid string,
stat_date string,
tryvv int,
sucvv int,
ptime float
)
PARTITIONED BY (dt string)                      -- optional; creates a partitioned table
clustered by (userid) into 3000 buckets         -- optional; bucketing
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  -- required; specifies the separator between columns
STORED AS rcfile                                -- optional; file format for reading, textfile by default
location '/testdata/';                          -- optional; HDFS storage path; if files already exist there, they are loaded automatically; default is under hive's warehouse directory

====================================================================
Table creation 1:
create table user_bh
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

Note: by default, data is stored under /user/hive/warehouse/

====================================================================
Table creation 2:
create table user_bh_rc
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS rcfile

The file storage format is rcfile, so plain text data cannot be loaded directly; data is usually loaded in from other tables.

====================================================================
Table creation 3:
create table user_bh_loc
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/testdata/';

A hive table loads all the files found in its storage directory. The file format, number of fields, and field separator of the files in the corresponding HDFS directory must exactly match what the table requires.
Hive is schema-on-read: it checks the file format when a query reads the data, not when the data is stored. If the format does not match (wrong number of fields, wrong separator, and so on), queries may throw exceptions or return NULLs.
====================================================================
Table creation 4:
create table t1 as select * from user_bh;
create table t2 like user_bh; (copies only the table structure, not the data)

The difference between external and internal tables:
1. When a table is dropped, an internal (ordinary) table's metadata and data directory are both deleted; dropping an external table deletes only the metadata, not the underlying data.
2. External tables are generally used to avoid accidental data deletion.
3. An external table can serve as a temporary table over data that already exists on HDFS and is important.
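A sketch illustrating point 1 (ext_demo and the path are made up for illustration): after dropping an external table, the HDFS files survive, and re-creating the table over the same location makes the data queryable again.

```sql
create external table ext_demo (id int)
row format delimited fields terminated by '\t'
location '/testdata/ext_demo/';

drop table ext_demo;
-- Only the metadata is gone; /testdata/ext_demo/ still exists on HDFS.
-- Re-creating the table with the same location brings the data back.
```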

====================================================================
(Create an external table with EXTERNAL)
create EXTERNAL table user_bh_ext
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

====================================================================
Partition table: on top of an ordinary table, adds a partition field used to distinguish data; each partition is stored in a different subdirectory.
create table dianxin_1
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
) partitioned by (province string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

Adding a partition: the table must have been defined as a partitioned table at creation time, with the partition field specified.
alter table user_bh_part add partition (provience="shandong");
====================================================================
The partitioning companies use most often is partitioning by date: partitioned by (dt string).
Multi-level partitions can be created, but generally at most two levels; too many partitions hurt query efficiency.
create table dianxin_3
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
) partitioned by (province string, dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

alter table user_bh_part2 add partition (provience="anhui", dt="20191220");
====================================================================
Partition characteristics: defined when the table is built; avoids full table scans and improves query performance.
Usage: filter through the partition field, i.e. cut down the data that is scanned.
In SQL, the partition field is used in the same way as an ordinary field.
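For example, filtering on the partition fields lets Hive read only the matching subdirectories instead of scanning the whole table (a sketch against the dianxin_3 table above):

```sql
-- Only the province=zhejiang/dt=20191220 subdirectory is scanned:
select phone, city_id
from dianxin_3
where province = 'zhejiang' and dt = '20191220';
```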

Dynamic partitioning: data from an originally non-partitioned table is automatically routed into the specified partitions of a partitioned table.

create table user_bh_city
(
phone string,
jw string,
city_id string,
area_id string,
stay_time string,
start_time string,
end_time string,
date_time string
)partitioned by (city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

insert into user_bh_city partition(city) select phone,jw,city_id,area_id,stay_time,start_time,end_time,date_time,city_id from user_bh_loc;
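Note that dynamic partition inserts usually need to be enabled first; a sketch (hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode are standard Hive settings, and the last column of the select list feeds the partition column):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partitions to be dynamic

-- city_id (the last column) determines the target partition for each row:
insert into table user_bh_city partition(city)
select phone, jw, city_id, area_id, stay_time, start_time,
       end_time, date_time, city_id
from user_bh_loc;
```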


====================================================================

Deduplication:
select distinct ename from emp limit 10;

String concatenation:
concat_ws()
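A quick sketch of both (concat_ws takes a separator, then either individual strings or an array of strings):

```sql
select distinct ename from emp limit 10;      -- drop duplicate names
select concat_ws('-', 'a', 'b', 'c');         -- a-b-c
select concat_ws(',', array('x', 'y', 'z'));  -- x,y,z
```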
====================================================================

Create a table using all three complex types:
create table t(id struct<id1:int,id2:int,id3:int>,name array<string>,xx map<int,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

Sample text data:
1,2,3	1,2,3,4,5	05:Lizhi En,06:Wang Youhu

Note:

ROW FORMAT DELIMITED must come before all the other delimiter settings; it is the first delimiter clause in the statement.

LINES TERMINATED BY must come after all the other delimiter settings; it is the last delimiter clause in the statement, otherwise an error is raised.

(The following form reports an error:)

hive> create table t (id struct<id1:int,id2:int,id3:int>,name array<string>,xx map<int,string>) 
    > row format delimited
    > fields terminated by '\t'
    > lines terminated by '\n'
    > collection items terminated by ','
    > map keys terminated by ':';
FAILED: ParseException line 5:0 missing EOF at 'collection' near ''\n''

Origin: www.cnblogs.com/wyh-study/p/12080728.html