Explaining big data hive knowledge points in plain language, Lao Liu is really attentive (2)


Lao Liu doesn't dare to claim his writing is great, but he promises to explain the content of his review in plain language and in as much detail as possible, and he refuses to mechanically copy the materials without forming his own understanding!

1. Hive knowledge points (2)

Point 12: hive bucket table

Hive knowledge points are mainly about practice. Many people think that basic commands don't need to be memorized, but basic commands are the foundation everything else is built on; no matter how basic they are, you must practice them again and again.

In hive, bucketing is a finer-grained division than partitioning. A partition applies to the storage path of the data, while a bucket applies to the data files themselves. Lao Liu compares two related pictures to make this difference clear.

The first picture shows what changes after a table is partitioned:

(picture: directory layout of a partitioned table)

The second shows what changes after a table is bucketed:

(picture: file layout of a bucketed table)

From these two pictures you can get a rough sense of the difference between partitioning and bucketing.

Having looked at these two pictures, it should now be roughly clear what bucketing is!

What is bucketing?

Bucketing divides the entire data set according to the hash value of a chosen column; records whose hash values map to the same bucket go into the same file.

To illustrate: suppose we bucket by the name attribute into 3 buckets. We take the hash value of name modulo 3 and route each record to a bucket according to the result.

Records whose modulo result is 0 are stored in one file;

records whose modulo result is 1 are stored in another file;

records whose modulo result is 2 are stored in a third file.

(With 3 buckets, the modulo result can only be 0, 1, or 2.) There are too many bucket-table scenarios to cover here, so find one and practice it yourself.
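Since the materials don't include a concrete bucketed-table command, here is a minimal sketch in hive SQL. The table and column names (stu_buck, s_id, s_name) are made up for illustration, and the insert assumes a source table stu already exists:

```sql
-- Hypothetical example: bucket a student table by s_id into 3 buckets
create table stu_buck(
    s_id string,
    s_name string
)
clustered by (s_id) into 3 buckets
row format delimited fields terminated by '\t';

-- On older hive versions, enforced bucketing may need to be switched on first
set hive.enforce.bucketing=true;

-- Bucketed data should be written with insert ... select, not a plain load
insert overwrite table stu_buck select s_id, s_name from stu;
```

Each of the 3 buckets then ends up as its own file under the table's directory, which is exactly the hash-then-modulo routing described above.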

Point 13: modifying the table structure in hive

There is actually not much to say about this point. It is covered in the materials, and Lao Liu has talked about it too; just remember a few commands.

Rename a table
alter table stu3 rename to stu4;

Show a table's structure information
desc formatted stu4;

Point 14: hive data import

This part is very important: after a table is created, the next thing to do is import data into it. If you can't even use the basic data-import commands, you are definitely not up to standard. This is a very important foundation!

1. Load data with the load command (must be memorized)

Load data via the load command
load data local inpath '/kkb/install/hivedatas/score.csv' overwrite into table score3 partition(month='201806');

2. Load data by query (must be memorized)

Load data via a query
create table score5 like score;
insert overwrite table score5 partition(month = '201806') select s_id,c_id,s_score from score;

Point 15: hive data export

1. Insert export

Export query results to the local filesystem
insert overwrite local directory '/kkb/install/hivedatas/stu' select * from stu;

Export query results to the local filesystem with a specified format
insert overwrite local directory '/kkb/install/hivedatas/stu2' row format delimited fields terminated by  ',' select * from stu;

Export query results to HDFS (without the local keyword)
insert overwrite directory '/kkb/hivedatas/stu' row format delimited fields terminated by ',' select * from stu;

Point 16: Static Partition and Dynamic Partition

Hive has two kinds of partitioning: static partitioning, which is ordinary partitioning, and dynamic partitioning.

Static partitioning: when loading data into a partitioned table, whether with load or by query, you must specify the partition field's value yourself.

Here is a small example to demonstrate the difference between the two.

1. Create a partitioned table
use myhive;
create table order_partition(
order_number string,
order_price  double,
order_time string
)
partitioned BY(month string)
row format delimited fields terminated by '\t';

2. Prepare the data
cd /kkb/install/hivedatas
vim order.txt 
10001    100 2019-03-02
10002    200 2019-03-02
10003    300 2019-03-02
10004    400 2019-03-03
10005    500 2019-03-03
10006    600 2019-03-03
10007    700 2019-03-04
10008    800 2019-03-04
10009    900 2019-03-04

3. Load the data into the partitioned table
load data local inpath '/kkb/install/hivedatas/order.txt' overwrite into table order_partition partition(month='2019-03');

4. Query the result data
select * from order_partition where month='2019-03';
The result is:
10001   100.0   2019-03-02      2019-03
10002   200.0   2019-03-02      2019-03
10003   300.0   2019-03-02      2019-03
10004   400.0   2019-03-03      2019-03
10005   500.0   2019-03-03      2019-03
10006   600.0   2019-03-03      2019-03
10007   700.0   2019-03-04      2019-03
10008   800.0   2019-03-04      2019-03
10009   900.0   2019-03-04      2019-03

Dynamic partitioning: data is automatically routed into the table's different partitions as needed, without manually specifying the partition value.

If you need to insert data into multiple partitions at once, you can use dynamic partitioning: instead of specifying the partition value, the system determines it automatically from the query.

The number of dynamic partitions is limited, and the data must come from an already existing table.

First of all, it must be said that a dynamic partition table is always loaded from an already created table
1. Create an ordinary table
create table t_order(
    order_number string,
    order_price  double, 
    order_time   string
)row format delimited fields terminated by '\t';

2. Create the target partitioned table
create table order_dynamic_partition(
    order_number string,
    order_price  double    
)partitioned BY(order_time string)
row format delimited fields terminated by '\t';

3. Prepare the data
cd /kkb/install/hivedatas
vim order_partition.txt
10001    100 2019-03-02 
10002    200 2019-03-02
10003    300 2019-03-02
10004    400 2019-03-03
10005    500 2019-03-03
10006    600 2019-03-03
10007    700 2019-03-04
10008    800 2019-03-04
10009    900 2019-03-04

4. Dynamically load data into the partitioned table
To perform dynamic partitioning, a few parameters must be set
Enable the dynamic partition feature
set hive.exec.dynamic.partition=true;
Set hive to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table order_dynamic_partition partition(order_time) select order_number,order_price,order_time from t_order;

5. View the partitions
show partitions order_dynamic_partition;

(picture: the partition list produced by show partitions)
The examples for static and dynamic partitioning are largely parallel, so compare them and make sure you understand the difference.

Point 17: Basic query syntax of hive

Lao Liu has said before that hive's basic query syntax is very important. Many people think you don't need to memorize it at all and can just check your notes when needed, but in Lao Liu's view that is a very mistaken idea.

As the saying goes, if the foundation is not solid, the ground will shake. At the very least, we must master the commonly used query syntax.

1. Basic syntax query:

(picture: hive basic query syntax)
Because the limit and where statements are used so often, they are pulled out separately here, so please remember them!

limit statement
select  * from score limit 5;

Next comes the where statement. It is pulled out separately to show how important it is: we use the where statement to filter out rows that do not meet the conditions.

select  * from score where s_score > 60;

2. Grouping statements

group by statement

The group by statement is usually used together with aggregate functions: it groups the result by one or more columns and then performs an aggregation on each group. One important point to note: every field in the select list must also appear after group by, unless it is wrapped in an aggregate function such as max, min, or avg.
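To see that rule in action, here is a small sketch based on the score table used throughout this article; the first query is the kind that hive rejects:

```sql
-- Rejected: c_id is neither listed after group by nor inside an aggregate
select s_id, c_id, avg(s_score) from score group by s_id;

-- Accepted: every selected field is either grouped on or aggregated
select s_id, avg(s_score) from score group by s_id;
```

Hive raises a semantic error for the first query because c_id is not part of the grouping, so its value within each group would be ambiguous.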

Give two small examples:

(1) Compute each student's average score
select s_id,avg(s_score) from score group by s_id;

(2) Compute each student's highest score
select s_id,max(s_score) from score group by s_id;

having statement

Let's talk about the difference between the having statement and where:

① where filters rows using the columns of the table itself; having filters rows using the columns of the query result.

② Aggregate functions cannot be written after where, but they can be used after having.

③ having can only be used in group by grouping-statistics statements.

Give two small examples:

Compute each student's average score
select s_id,avg(s_score) from score group by s_id;

Find students whose average score is greater than 60
select s_id,avg(s_score) as avgScore from score group by s_id having avgScore > 60;

3. The join statement
(picture: hive join types)
Equi join

Hive supports ordinary SQL join statements, but only equi joins; non-equi joins are not supported.

When using join you may alias the tables or not; the advantage of aliasing is that it simplifies the query and is convenient.

Using the student and score tables, query the scores corresponding to each student's name
select * from stu left join score on stu.id = score.s_id;

Join the teacher and course tables
select * from teacher t join course c on t.t_id = c.t_id;

Inner join

In an inner join, a row is kept only when data matching the join condition exists in both tables. join defaults to an inner join.

select * from teacher t inner join course c on t.t_id = c.t_id;

Left outer join

When performing a left outer join, all records from the table on the left of the join that satisfy the where clause are returned.

Query each teacher's corresponding courses
select * from teacher t left outer join course c on t.t_id = c.t_id;

Right outer join

When performing a right outer join, all records from the table on the right of the join that satisfy the where clause are returned.

Query each teacher's corresponding courses
select * from teacher t right outer join course c on t.t_id = c.t_id;

4. Sort

order by: global ordering

When using order by to sort, asc means ascending order and is the default; desc means descending order.

Query students' scores, sorted by score in descending order
select * from score s order by s_score desc;

2. Hive summary

That is about everything for Hive knowledge points (2). This part leans toward practice and needs to be practiced well.
In Lao Liu's view, the concepts of bucketed tables and of static and dynamic partitioning need to be remembered. The rest is hive's basic query operations; because there are too many commands, Lao Liu only shared some commonly used ones. limit statements, where statements, grouping statements, join statements, and so on should all be memorized.
Finally, if you think anything is missing or wrong, you can contact the official account "Lao Liu who works hard" to discuss it. Lao Liu hopes this helps students interested in big data development, and hopes to receive their guidance.

If you think the writing is good, give Lao Liu a thumbs up!


Origin blog.csdn.net/qq_36780184/article/details/111041263