Hadoop Series 5: Hive Operations

Creating a database

Hive has a default database:
database name: default
database directory: hdfs://hdp20-01:9000/user/hive/warehouse

Create a new database:
create database db_order;
The new database gets its own directory in HDFS:
hdfs://hdp20-01:9000/user/hive/warehouse/db_order.db

Show all database names:
show databases;

Deleting a database

Drop-database operations:

drop database db_order;
drop database if exists db_order;    -- drop only if it exists

By default, Hive will not drop a database that still contains tables. There are two solutions:

1. Manually drop all tables in the database first, then drop the database.

2. Use the cascade keyword:

drop database if exists db_order cascade;

The default behavior is restrict, i.e. drop database if exists myhive is equivalent to drop database if exists myhive restrict.

Creating a table

The basic create-table statement:

use db_order;    -- select the database
create table t_order(id string, create_time string, amount float, uid string);
Once the table is created, a table directory is generated under the database directory:
/user/hive/warehouse/db_order.db/t_order
However, with a create-table statement like this, Hive assumes the field delimiter in the table's data files is ^A (Ctrl-A).

The correct create-table statement:
create table t_order(id string, create_time string, amount float, uid string)
row format delimited
fields terminated by ',';    -- specify how fields are separated in the data files

This declares that the field separator in our table's data files is ",".

Dropping a table

drop table t_order;
Dropping a table has two effects:
Hive removes the table's information from the metadata database;
Hive deletes the table's directory from HDFS.

Internal tables and external tables

Internal table (MANAGED_TABLE): the table directory is placed, per Hive's convention, under the Hive warehouse directory /user/hive/warehouse

External table (EXTERNAL_TABLE): the table directory is specified by the user:
create external table t_access(ip string, url string, access_time string)
row format delimited
fields terminated by ','
location '/access/log';

Differences between external and internal tables:

1. An internal table's directory lives in the Hive warehouse directory, whereas an external table's directory is specified by the user.
2. Dropping an internal table: Hive removes the associated metadata and deletes the table's data directory.
3. Dropping an external table: Hive only removes the associated metadata.

In a Hive data warehouse, the bottom-layer tables are usually fed by an external system. To avoid interfering with the external system's logic, you can create external tables in Hive that map to the data directories those external systems produce;
for the various tables generated by subsequent ETL operations, managed tables are recommended.

Partition tables

The essence of a partition table: partition subdirectories are created inside the table's data directory, so that at query time the MR program can process only the data in one partition subdirectory, reducing the amount of data read.

For example, the browsing history a website produces each day should all be stored in one table; but sometimes we may only need to analyze one day's history. In that case the table can be built as a partition table, with each day's data going into its own partition;
naturally, each daily partition directory needs a directory name (the partition field).
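The directory-per-partition idea can be sketched in plain Python. This is an illustration, not Hive itself; the `dt=20170804` directory naming follows Hive's partition-directory convention:

```python
import os
import tempfile

# A partitioned table is just a directory tree with one subdirectory
# per partition value; a query with "where dt=..." only needs to read
# that one subdirectory (partition pruning).
table_dir = tempfile.mkdtemp()
for dt in ("dt=20170804", "dt=20170805"):
    os.makedirs(os.path.join(table_dir, dt))
    with open(os.path.join(table_dir, dt, "part-00000"), "w") as f:
        f.write("1.1.1.1,/index,12:00:00\n")

# Prune: only the matching partition subdirectory gets scanned.
to_scan = [d for d in os.listdir(table_dir) if d == "dt=20170804"]
print(to_scan)   # ['dt=20170804']
```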

Example: one partition field

Example:
1. Create a table with a partition:

create table t_access(ip string, url string, access_time string)
partitioned by (dt string)    -- specify the partition field here
row format delimited
fields terminated by ',';
Note: the partition field must not be a field that already exists in the table definition.

2. Load data into the partitions:
load data local inpath '/root/access.log.2017-08-04.log' into table t_access partition(dt='20170804');
load data local inpath '/root/access.log.2017-08-05.log' into table t_access partition(dt='20170805');

3. Query the partitioned data:
a. Count the total PV for August 4:
select count(*) from t_access where dt='20170804';
Essence: the partition field can be used like a table field, so the where clause can specify the partition.

b. Count the total PV over all data:
select count(*) from t_access;
Essence: the condition may omit the partition.

CTAS create-table syntax

1. Create a table modeled on an existing table:
create table t_user_2 like t_user;
The new table t_user_2 has the same structure definition as the source table t_user, but no data.

2. Create a table and insert data at the same time:
create table t_access_user
as
select ip, url from t_access;
t_access_user will be created with the fields of the select query, and the query results are inserted into the new table at the same time.

Importing data files into a Hive table

Method 1: put the data files into the table directory manually with hdfs commands.

Method 2: in the Hive interactive shell, import a local data file into the table with a Hive command:
hive> load data local inpath '/root/order.data.2' into table t_order;

Method 3: import a data file already on HDFS into the table with a Hive command:
hive> load data inpath '/access.log.2017-08-06.log' into table t_access partition(dt='20170806');

Note the difference between loading a local file and loading an HDFS file:
loading a local file into the table: copy;
loading an HDFS file into the table: move.
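The copy-vs-move distinction can be illustrated with a small Python sketch that uses the local filesystem as a stand-in for HDFS; this is an analogy for the observable behavior, not Hive's actual implementation:

```python
import os
import shutil
import tempfile

# Simulate Hive's LOAD DATA semantics on the local filesystem:
# a "local" load copies the source file; an "HDFS" load moves it.
warehouse = tempfile.mkdtemp()                     # stand-in for the table directory
local_src = os.path.join(tempfile.mkdtemp(), "order.data")
hdfs_src = os.path.join(tempfile.mkdtemp(), "access.log")
for p in (local_src, hdfs_src):
    with open(p, "w") as f:
        f.write("sample\n")

shutil.copy(local_src, warehouse)   # load data LOCAL inpath ... : copy
shutil.move(hdfs_src, warehouse)    # load data inpath ...       : move

print(os.path.exists(local_src))    # True  - the local source file is kept
print(os.path.exists(hdfs_src))     # False - the HDFS source path is gone
```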

Exporting Hive table data to a specified path

1. Export Hive table data to files on HDFS:
insert overwrite directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access;

2. Export Hive table data to files on the local disk:
insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;

Hive file formats

(I didn't fully understand this at the time; revisit later.)

Hive supports many file formats: SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE
create table t_pq(movie string,rate int) stored as textfile;
create table t_pq(movie string,rate int) stored as sequencefile;
create table t_pq(movie string,rate int) stored as parquetfile;

Demo:
1. First create a table stored as a text file:
create table t_access_text(ip string, url string, access_time string)
row format delimited fields terminated by ','
stored as textfile;

Load text data into the table:
load data local inpath '/root/access-data/000000_0' into table t_access_text;

2. Create a table stored as a sequence file:
create table t_access_seq(ip string, url string, access_time string)
stored as sequencefile;

Insert data queried from the text table into the sequencefile table; the generated data files will be in sequencefile format:
insert into t_access_seq
select * from t_access_text;

3. Create a table stored as a parquet file:
create table t_access_parq(ip string, url string, access_time string)
stored as parquetfile;

Data types

Basic data types

(table of basic data types image omitted)
As in SQL and other languages, these are reserved words. Note that all of these data types are implemented with Java interfaces, so the detailed behavior of these types exactly matches that of the corresponding Java types. For example, the string type is implemented with Java's String, float with Java's float, and so on.

Composite type

(table of composite types image omitted)

array type

All elements in an array must be of the same type.

The array type is similar to a list in Python.
arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)

Example: applying the array type.
Suppose the following data needs to be mapped into a Hive table:
Wolf Warrior 2,Wu Jing:Wu Gang:Long Mu,2017-08-16
Ten Miles of Peach Blossoms,Liu Yifei:Yang Yang,2017-08-20
Idea: mapping the cast information with an array is more convenient.

Create the table:
create table t_movie(moive_name string, actors array<string>, first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';    -- how elements within the collection are split, i.e. the separator rule for the array data

Load the data:
load data local inpath '/root/movie.dat' into table t_movie;

Query:
select * from t_movie;
select moive_name, actors[0] from t_movie;    -- fetch the first element of the array
select moive_name, actors from t_movie where array_contains(actors, 'Wu Gang');    -- rows whose actors array contains 'Wu Gang'
select moive_name, size(actors) from t_movie;    -- length of the array
select moive_name, sort_array(actors) from t_movie;    -- sort the array in natural order and return it

map type

Similar to a dict in Python: key-value pairs.
maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)

  1. Suppose the following data:
    1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
    2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
    3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
    4,mayun,father:mayongzhen#mother:angelababy,26

The family members in this data can be described with a field of map type.

  1. Create-table statement:
    create table t_person(id int, name string, family_members map<string,string>, age int)
    row format delimited fields terminated by ','
    collection items terminated by '#'    -- how elements within the collection are split, i.e. map entries are split by '#'
    map keys terminated by ':';    -- how key and value are split within an entry

  2. Query
    select * from t_person;

Get the value of a specified key from the map field:
select id, name, family_members['father'] as father from t_person;

Get all the keys of the map field:
select id, name, map_keys(family_members) as relations from t_person;

Get all the values of the map field:
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;

Combined: query the users who have a brother:
select id, name, brother
from
(select id, name, family_members['brother'] as brother from t_person) tmp
where brother is not null;

struct type

A set of fields, which may be of different types.
1) Suppose the following data:
1,zhangsan,18:MALE:beijing
2,lisi,28:FEMALE:shanghai

The user info comprises: age: int, sex: string, addr: string.
To describe the whole user info with a single field, a struct can be used.

2) Create the table:
create table t_person_struct(id int, name string, info struct<age:int, sex:string, addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';    -- how the data within the collection (i.e. the struct) is split

3) Query:
select * from t_person_struct;
select id,name,info.age from t_person_struct;

Hive query syntax

Tip: when testing queries on small amounts of data, Hive can be allowed to run its MR jobs locally instead of submitting them to the cluster; set the following parameter in the Hive session:
hive> set hive.exec.mode.local.auto=true;

Basic query examples

select * from t_access;
select count(*) from t_access;
select max(ip) from t_access;

Conditional queries

select * from t_access where access_time < '2017-08-06 15:30:20';
select * from t_access where access_time < '2017-08-06 16:30:20' and ip > '192.168.33.3';
Note: aggregate functions cannot be used in the where clause.


join query examples

Suppose there is a file a.txt:
a,1
b,2
c,3
d,4

And a file b.txt:
a,xx
b,yy
d,zz
e,pp

Run the various join queries:
1. inner join (join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name

Result:
+--------+--------+--------+--------+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| d      | 4      | d      | zz     |
+--------+--------+--------+--------+

2. left outer join (left join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name

Result (reconstructed from the sample data above; the original post showed it as an image):
+--------+--------+--------+--------+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| c      | 3      | NULL   | NULL   |
| d      | 4      | d      | zz     |
+--------+--------+--------+--------+

3. right outer join (right join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name

Result (reconstructed from the sample data above; the original post showed it as an image):
+--------+--------+--------+--------+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| d      | 4      | d      | zz     |
| NULL   | NULL   | e      | pp     |
+--------+--------+--------+--------+

4. full outer join (full join)
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

Result (reconstructed from the sample data above; the original post showed it as an image):
+--------+--------+--------+--------+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+
| a      | 1      | a      | xx     |
| b      | 2      | b      | yy     |
| c      | 3      | NULL   | NULL   |
| d      | 4      | d      | zz     |
| NULL   | NULL   | e      | pp     |
+--------+--------+--------+--------+

Grouped aggregation: group by

select dt,count(*),max(ip) as cnt from t_access group by dt;

select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170804';

select
dt,count(*),max(ip) as cnt
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'20170804';

Note: once fields appear in the group by clause, the select clause may contain nothing other than the grouping fields and aggregate functions.

Why must where be written before group by, while having can only follow group by with its condition?

Because where filters the data before the query logic actually executes, while having re-filters the aggregated results produced by group by.

Execution logic of the statement above:
1. where filters out the data that does not satisfy the condition;
2. group by groups the data and the aggregate functions compute over each group, producing the aggregated results;
3. having filters out the aggregated results that do not satisfy the having condition.
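The same execution order can be sketched in plain Python; the sample rows are illustrative, not from the source:

```python
from collections import defaultdict

# Illustrative sketch (not Hive itself) of:
#   where -> group by + aggregation -> having
rows = [
    {"dt": "20170803", "url": "/job", "ip": "192.168.33.3"},
    {"dt": "20170805", "url": "/job", "ip": "192.168.33.4"},
    {"dt": "20170805", "url": "/job", "ip": "192.168.33.5"},
    {"dt": "20170805", "url": "/pay", "ip": "192.168.33.6"},
]

# 1. where: filter raw rows before any grouping
filtered = [r for r in rows if r["url"] == "/job"]

# 2. group by dt + aggregate: count(*) and max(ip) per group
groups = defaultdict(list)
for r in filtered:
    groups[r["dt"]].append(r)
aggregated = {dt: (len(rs), max(r["ip"] for r in rs)) for dt, rs in groups.items()}

# 3. having: filter the aggregated results, not the raw rows
result = {dt: v for dt, v in aggregated.items() if dt > "20170804"}
print(result)   # only dt=20170805 survives
```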

Subqueries

select id,name,brother
from
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;

Common built-in functions

(This list is incomplete; consult the official documentation for the rest.)

Type conversion functions

select cast("5" as int) from dual;    -- string to int
select cast("2017-08-03" as date);    -- string to date
select cast(current_timestamp as date);    -- timestamp to date

Example data:
1,1995-05-05 13:30:59,1200.3
2,1994-04-05 13:30:59,2200
3,1996-06-01 12:20:30,80000.5

create table t_fun(id string,birthday string,salary string)
row format delimited fields terminated by ',';
select id,cast(birthday as date) as bir,cast(salary as float) from t_fun;

Math functions

select round(5.4) from dual;    ## 5, rounds to the nearest integer
select round(5.1345,3) from dual;    ## 5.135, rounds keeping 3 decimal places
select ceil(5.4) from dual; // select ceiling(5.4) from dual;    ## 6, rounds up
select floor(5.4) from dual;    ## 5, rounds down
select abs(-5.4) from dual;    ## 5.4, returns the absolute value
select greatest(3,5,6) from dual;    ## 6, takes at least two parameters and returns the maximum
select least(3,5,6) from dual;    ## 3, takes at least two parameters and returns the minimum
rand()    ## returns a random DOUBLE for each row; a seed can be supplied as the randomness factor

Example:
given a table t_fun2 with string columns s1, s2, s3:
select greatest(cast(s1 as double), cast(s2 as double), cast(s3 as double)) from t_fun2;
Results:
+---------+
|   _c0   |
+---------+
| 2000.0  |
| 9800.0  |
+---------+
Note: max and min, as in
select max(age) from t_person;
select min(age) from t_person;
are aggregate functions over rows, which is different from greatest/least, which compare values within a single row.

String functions

substr(string A, int start)    ## returns the substring of string A starting at position start
substring(string A, int start)
Example: select substr("abcdefg",2) from dual;

substr(string A, int start, int len)    ## returns the substring of binary/string A of length len starting at position start
substring(string A, int start, int len)
Example: select substr("abcdefg",2,3) from dual;

concat(string A, string B...)    ## string concatenation
concat_ws(string SEP, string A, string B...)    ## concatenation, additionally inserting the separator string SEP between the parts
Example: select concat("ab","xy") from dual;
select concat_ws(".","192","168","33","44") from dual;

length(string A)
Example: select length("192.168.33.44") from dual;

split(string str, string pat)
Example: select split("192.168.33.44",".") from dual;    -- wrong, because "." is a special character in regex syntax
select split("192.168.33.44","\\.") from dual;

upper(string str)    ## converts to upper case

length(string A)    ## returns the length of the string

Aggregate functions

count(*)    ## counts the number of rows, including rows containing NULL values

count(expr)    ## counts the number of rows where the expression expr is non-NULL

count(DISTINCT expr[, expr...])    ## counts the number of distinct non-NULL values of expr
==================================================

sum(col)    ## returns the sum of the specified column
sum(DISTINCT col)    ## returns the sum of the distinct values of the column
==================================================

avg(col)    ## returns the average of the specified column
avg(DISTINCT col)    ## returns the average of the distinct values of the column
==================================================

min(col)    ## returns the minimum of the specified column

max(col)    ## returns the maximum of the specified column

==================================================

Row-to-column function: explode()

explode()
Suppose the following data:
1,zhangsan,chemistry:physics:math:chinese
2,lisi,chemistry:math:biology:physiology:hygiene
3,wangwu,chemistry:chinese:english:pe:biology
Map it to a table:
create table t_stu_subject(id int, name string, subjects array<string>)
row format delimited fields terminated by ','
collection items terminated by ':';

Use explode() to "burst" the array field apart.

We can then use this explode result to get the deduplicated set of courses:
select distinct tmp.sub
from
(select explode(subjects) as sub from t_stu_subject) tmp;

Table-generating function: lateral view (didn't remember this well; review again)

select id,name,tmp.sub
from t_stu_subject lateral view explode(subjects) tmp as sub;

Understanding: lateral view is equivalent to a join of two tables:
the left table: the original table;
the right table: the table generated by exploding the collection field;
but: only the data coming from the same original row is joined together.

So more queries become easy. For example, query the students enrolled in the biology course:
select a.id, a.name, a.sub from
(select id, name, tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a
where sub = 'biology';
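The lateral-view-as-join intuition can be sketched in plain Python (illustrative, not Hive):

```python
# lateral view explode behaves like joining each original row with the
# rows produced by exploding that same row's collection field.
rows = [
    (1, "zhangsan", ["chemistry", "physics", "math", "chinese"]),
    (2, "lisi", ["chemistry", "math", "biology"]),
]

# left table: the original row; right table: the exploded elements of
# that row; only same-row pairs are combined.
lateral = [(id_, name, sub) for id_, name, subjects in rows for sub in subjects]

print(lateral[0])    # (1, 'zhangsan', 'chemistry')
print(len(lateral))  # 7 rows: 4 from zhangsan + 3 from lisi
```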

json_tuple function

Parses multiple keys out of a JSON string and returns them as a tuple; the difference from get_json_object is that this function can extract multiple keys at once.
Example:
select json_tuple(json,'movie','rate','timeStamp','uid') as (movie,rate,ts,uid) from t_rating_json;
Produces a result with one column per extracted key: movie, rate, ts, uid.

Conditional control functions

case when

Syntax:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2

WHEN conditionn THEN resultn
ELSE result
END

Example:
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

IF

select id,if(age>25,'working','worked') from t_user;

select moive_name, if(array_contains(actors,'Wu Gang'),'good movie','bad movie') from t_movie;

Analysis function: row_number() over() - grouped top-N

Data:
1,18,A,MALE
2,19,B,MALE
3,22,C,FEMALE
4,16,D,FEMALE
5,30,E,MALE
6,26,F,FEMALE
Requirement: for each gender, return the two oldest records.

Implementation:
use the row_number function to partition the table's data by sex and number each partition's rows by age in descending order.

HQL:
select id, age, name, sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;

Then, for the final requirement, query the rows with rank <= 2 from the result above:
select id, age, name, sex
from
(select id, age, name, sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank <= 2;
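The same grouped top-N logic, sketched in plain Python for illustration:

```python
from itertools import groupby

# row_number() over (partition by sex order by age desc), keep rank <= 2.
rows = [
    (1, 18, "A", "MALE"), (2, 19, "B", "MALE"), (3, 22, "C", "FEMALE"),
    (4, 16, "D", "FEMALE"), (5, 30, "E", "MALE"), (6, 26, "F", "FEMALE"),
]

rows.sort(key=lambda r: (r[3], -r[1]))        # partition by sex, order by age desc
top2 = []
for sex, grp in groupby(rows, key=lambda r: r[3]):
    for rank, r in enumerate(grp, start=1):   # row_number within the partition
        if rank <= 2:
            top2.append(r)

print(top2)   # the two oldest per gender: F(26), C(22); E(30), B(19)
```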

sum over function

Requirement: from the following data (uid, month, amount - one row per month), produce a cumulative report in which each row accumulates all earlier months.
(data table image omitted)
select uid, month, amount,
sum(amount) over(partition by uid order by month rows between unbounded preceding and current row) as accumulate
from t_access_amount;
-- partition by uid, sort by month, and add up all rows from the first row through the current row
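The running-total semantics of this window clause can be sketched in plain Python; the uid/month/amount values below are assumed for illustration (the source's data table was an image):

```python
from collections import defaultdict

# sum(amount) over (partition by uid order by month rows between
# unbounded preceding and current row) is a running total per uid.
rows = [
    ("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 20),
    ("B", "2015-01", 30), ("B", "2015-02", 15),
]

rows.sort(key=lambda r: (r[0], r[1]))   # partition by uid, order by month
running = defaultdict(int)
report = []
for uid, month, amount in rows:
    running[uid] += amount              # accumulate within the partition
    report.append((uid, month, amount, running[uid]))

print(report[2])   # ('A', '2015-03', 20, 63): 33 + 10 + 20
```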

Implementing WordCount in Hive

select word, count(1)
from (select explode(split(sentence, " ")) as word
      from t_wc
     ) tmp
group by word;
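The same explode + group-by logic, sketched in plain Python:

```python
from collections import Counter

# explode(split(sentence, " ")) flattens each sentence into words;
# group by word + count(1) tallies them.
sentences = ["hello world", "hello hive"]
words = [w for s in sentences for w in s.split(" ")]  # explode(split(...))
counts = Counter(words)                               # group by word, count(1)
print(counts["hello"])   # 2
```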



Origin blog.csdn.net/heartless_killer/article/details/102533965