Lao Liu doesn't dare to claim his writing is great, but he promises to explain the content he has reviewed in plain language as much as possible, refusing to copy the material mechanically and insisting on his own understanding!
1. Hive knowledge points (2)
Point 12: hive bucket table
Hive knowledge points are mainly about practice. Many people think basic commands don't need to be memorized, but the basics are the foundation everything is built on: no matter how basic a command is, you must practice it again and again.
In hive, bucketing is a finer-grained division than partitioning: a partition corresponds to a storage path (a directory), while a bucket corresponds to a data file. Lao Liu compares two pictures to show the difference just mentioned.
The first shows the changes after the table is partitioned:
The second shows the changes after the table is bucketed:
From these two pictures, you can roughly see the difference between partitioning and bucketing.
Having looked at them, it should be roughly clear what bucketing is!
What is bucketing?
Bucketing distributes the entire data set according to the hash value of some column's attribute value; records whose hash values map to the same bucket go into the same file.
To illustrate: suppose we bucket by the name attribute into 3 buckets. We take the hash of the name value modulo 3 and assign each record to a bucket according to the result.
Records whose modulo result is 0 are stored in one file;
records whose modulo result is 1 are stored in another file;
records whose modulo result is 2 are stored in a third file.
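The hash-modulo assignment described above can be sketched outside of Hive. Here is a minimal Python simulation; note that the hash function below is a simple stand-in, not Hive's internal hash, and the rows are made-up sample data. Only the mod-based routing is the point.

```python
# Sketch of hash-based bucketing on the "name" column, assuming 3 buckets.
# The hash here is a toy stand-in (sum of character codes), NOT Hive's hash.
def bucket_id(name: str, num_buckets: int = 3) -> int:
    h = sum(ord(c) for c in name)
    return h % num_buckets  # result is always 0, 1, or 2

# Made-up (name, score) rows for illustration.
rows = [("zhangsan", 90), ("lisi", 80), ("wangwu", 70), ("zhangsan", 60)]

buckets = {0: [], 1: [], 2: []}
for name, score in rows:
    buckets[bucket_id(name)].append((name, score))

# Rows with the same name always land in the same bucket (file).
for b in sorted(buckets):
    print(b, buckets[b])
```

Because the bucket is a pure function of the column value, all records with the same name are guaranteed to end up in the same file, which is what makes bucketed joins and sampling efficient.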
Since bucket tables come up in many scenarios, you can look up an exercise and practice on your own.
Point 13: hive modify the table structure
There is actually not much to say about this point. It is covered in the material, and Lao Liu has talked about it before. Just remember a few commands.
Rename a table:
alter table stu3 rename to stu4;
View the table's structure:
desc formatted stu4;
Point 14: hive data import
This part is very important, because after a table is created, the next thing to do is import data into it. If you can't even use the basic commands for importing data, that is definitely not good enough. This is an essential foundation!
1. Load data through load (must be written down)
Load data via load:
load data local inpath '/kkb/install/hivedatas/score.csv' overwrite into table score3 partition(month='201806');
2. Load data by query (must be written down)
Load data via a query:
create table score5 like score;
insert overwrite table score5 partition(month = '201806') select s_id,c_id,s_score from score;
Point 15: hive data export
1. Insert export
Export query results to the local filesystem:
insert overwrite local directory '/kkb/install/hivedatas/stu' select * from stu;
Export query results to the local filesystem with a specified format:
insert overwrite local directory '/kkb/install/hivedatas/stu2' row format delimited fields terminated by ',' select * from stu;
Export query results to HDFS (no local keyword):
insert overwrite directory '/kkb/hivedatas/stu' row format delimited fields terminated by ',' select * from stu;
Point 16: Static Partition and Dynamic Partition
Hive has two kinds of partitions: static partitions, i.e. ordinary partitions, and dynamic partitions.
Static partition: when loading data into a partitioned table, whether via load or via a query, you must explicitly specify the partition field value.
Here is a small example to demonstrate the difference between the two.
1. Create a partitioned table
use myhive;
create table order_partition(
order_number string,
order_price double,
order_time string
)
partitioned BY(month string)
row format delimited fields terminated by '\t';
2. Prepare the data
cd /kkb/install/hivedatas
vim order.txt
10001 100 2019-03-02
10002 200 2019-03-02
10003 300 2019-03-02
10004 400 2019-03-03
10005 500 2019-03-03
10006 600 2019-03-03
10007 700 2019-03-04
10008 800 2019-03-04
10009 900 2019-03-04
3. Load the data into the partitioned table
load data local inpath '/kkb/install/hivedatas/order.txt' overwrite into table order_partition partition(month='2019-03');
4. Query the result data
select * from order_partition where month='2019-03';
The result is:
10001 100.0 2019-03-02 2019-03
10002 200.0 2019-03-02 2019-03
10003 300.0 2019-03-02 2019-03
10004 400.0 2019-03-03 2019-03
10005 500.0 2019-03-03 2019-03
10006 600.0 2019-03-03 2019-03
10007 700.0 2019-03-04 2019-03
10008 800.0 2019-03-04 2019-03
10009 900.0 2019-03-04 2019-03
Dynamic partition: data is automatically routed into the appropriate partitions of the table as needed, without manually specifying each one.
If you need to insert data into multiple partitions at once, use dynamic partitioning: instead of specifying the partition field value, the system determines it from the query.
The number of dynamic partitions is limited, and a dynamic partition table must be populated from an already-created table.
1. Create an ordinary table
create table t_order(
order_number string,
order_price double,
order_time string
)row format delimited fields terminated by '\t';
2. Create the target partitioned table
create table order_dynamic_partition(
order_number string,
order_price double
)partitioned BY(order_time string)
row format delimited fields terminated by '\t';
3. Prepare the data
cd /kkb/install/hivedatas
vim order_partition.txt
10001 100 2019-03-02
10002 200 2019-03-02
10003 300 2019-03-02
10004 400 2019-03-03
10005 500 2019-03-03
10006 600 2019-03-03
10007 700 2019-03-04
10008 800 2019-03-04
10009 900 2019-03-04
4. Dynamically load the data into the partitioned table
To use dynamic partitioning, you need to set some parameters.
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
Set hive to non-strict mode:
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table order_dynamic_partition partition(order_time) select order_number,order_price,order_time from t_order;
5. View the partitions
show partitions order_dynamic_partition;
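Conceptually, dynamic partitioning just routes each row to a partition keyed by the value of the partition column. Here is a minimal Python sketch of that routing; the column names mirror the example above, and the rows are a made-up subset of the sample data:

```python
# Sketch of what dynamic partitioning does logically: group rows by the
# value of the partition column (order_time), so each distinct value
# becomes its own partition (in Hive, a directory order_time=<value>).
rows = [
    ("10001", 100.0, "2019-03-02"),
    ("10004", 400.0, "2019-03-03"),
    ("10007", 700.0, "2019-03-04"),
    ("10002", 200.0, "2019-03-02"),
]

partitions = {}
for order_number, order_price, order_time in rows:
    partitions.setdefault(order_time, []).append((order_number, order_price))

for part in sorted(partitions):
    print(f"order_time={part}", partitions[part])
```

This is why no partition value appears in the insert statement: Hive reads it from the last column of the select and creates each partition as the values are encountered.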
The static and dynamic partitioning examples are quite similar; compare them to understand the difference.
Point 17: Basic query syntax of hive
Lao Liu said before that the basic query syntax of hive is very important. Many people think that you don't need to memorize it at all. Just look at the notes when you need it. But in Lao Liu's view, this is a very wrong idea.
There is a saying: if the foundation is not solid, the ground will shake. At the very least, we must master the commonly used query syntax.
1. Basic syntax query:
Because the limit and where statements are used so often, they are pulled out separately, so please remember them!
limit statement
select * from score limit 5;
Next comes the where statement, pulled out separately because it is so important: we use where to filter out rows that do not meet the conditions.
select * from score where s_score > 60;
2. Grouping statements
group by statement
The group by statement is usually used with aggregate functions to group the results by one or more columns and then perform an aggregation on each group. One important point: the fields in the select clause must also appear in the group by clause, except for aggregate functions such as max, min, and avg.
Give two small examples:
(1) Compute each student's average score
select s_id,avg(s_score) from score group by s_id;
(2) Compute each student's highest score
select s_id,max(s_score) from score group by s_id;
having statement
Let me talk about the difference between having and where:
① where filters on the columns of the table itself; having filters on the columns of the query result.
② Aggregate functions cannot be used after where, but can be used after having.
③ having can only be used together with a group by grouping statement.
Give two small examples:
Compute each student's average score
select s_id,avg(s_score) from score group by s_id;
Find the students whose average score is greater than 60
select s_id,avg(s_score) as avgScore from score group by s_id having avgScore > 60;
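The "where filters rows before grouping, having filters groups after aggregation" distinction can be checked with a small Python sketch. The (s_id, s_score) rows below are made-up sample data, not the actual score table from the text:

```python
# where filters rows BEFORE grouping; having filters groups AFTER
# aggregation. Made-up (s_id, s_score) rows for illustration.
scores = [("01", 80), ("01", 50), ("02", 90), ("02", 70), ("03", 40)]

# group by s_id, then avg(s_score) per student
groups = {}
for s_id, s_score in scores:
    groups.setdefault(s_id, []).append(s_score)
avg_score = {s: sum(v) / len(v) for s, v in groups.items()}

# having avgScore > 60: keep only the groups whose aggregate passes
passing = {s: a for s, a in avg_score.items() if a > 60}
print(passing)
```

A where clause could never express this filter, because the average only exists after the groups are formed, which is exactly rule ② above.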
3. The join statement
Equi-join
Hive supports normal SQL join statements, but only equi-joins; non-equi-joins are not supported.
When using join, you may alias the tables or not; the advantage of aliasing is that it simplifies the query and is convenient.
Based on the student and score tables, query each student's name with the corresponding scores
select * from stu left join score on stu.id = score.s_id;
Join the teacher and course tables
select * from teacher t join course c on t.t_id = c.t_id;
Inner join
When two tables are inner joined, a row is kept only if data matching the join condition exists in both tables; join defaults to inner join.
select * from teacher t inner join course c on t.t_id = c.t_id;
Left outer join
When performing a left outer join, all records in the table on the left of the join that meet the where clause will be returned.
Query the courses corresponding to each teacher
select * from teacher t left outer join course c on t.t_id = c.t_id;
Right outer join
When performing a right outer join, all records in the table on the right of the join that meet the where clause will be returned.
Query the courses corresponding to each teacher
select * from teacher t right outer join course c on t.t_id = c.t_id;
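The three join variants above can be simulated in Python to see exactly which rows survive. The teacher and course rows below are made up to match the column names in the queries (t_id "03" has no course, and a hypothetical course with t_id "04" has no teacher):

```python
# Made-up rows keyed on t_id; "03" and "04" are deliberately unmatched.
teachers = [("01", "zhang"), ("02", "li"), ("03", "wang")]              # (t_id, t_name)
courses  = [("01", "math"), ("02", "english"), ("04", "chinese")]       # (t_id, c_name)

course_by_tid  = {t: c for t, c in courses}
teacher_by_tid = {t: n for t, n in teachers}

# inner join: a row survives only when t_id exists on BOTH sides
inner = [(t, n, course_by_tid[t]) for t, n in teachers if t in course_by_tid]

# left outer join: every teacher kept; a missing course becomes NULL (None)
left = [(t, n, course_by_tid.get(t)) for t, n in teachers]

# right outer join: every course kept; a missing teacher becomes NULL (None)
right = [(t, teacher_by_tid.get(t), c) for t, c in courses]

print(inner)
print(left)
print(right)
```

Comparing the three results makes the rules concrete: inner drops both unmatched rows, left keeps the course-less teacher, and right keeps the teacher-less course.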
4. Sort
order by (global ordering)
When using order by to sort, asc means ascending order, which is the default; desc means descending order.
Query the students' scores, sorted in descending order
select * from score s order by s_score desc;
2. Hive summary
That is about it for Hive knowledge points (2). This part leans toward practice and requires plenty of hands-on work.
In Lao Liu's view, the concepts of bucket tables and of static vs. dynamic partitioning need to be remembered. The rest is basic Hive query operation; since there are too many commands, Lao Liu only shared the commonly used ones. The limit statement, where statement, grouping statements, join statements, and so on should all be memorized.
Finally, if you feel anything is inadequate or wrong, you can contact the official account, Hard-working Lao Liu, to discuss it. I hope this is helpful to students interested in big data development, and I hope to receive their guidance.
If you think the writing is good, give Lao Liu a thumbs up!