Foreword: Lao Liu does not dare to claim his writing is great, but he does promise to explain what he has reviewed in plain language as much as possible, and he refuses to copy material mechanically without forming his own understanding!
1. Hive knowledge points (3)
Starting from this article, I have decided to make some changes. On his blog, Lao Liu mainly shares the key knowledge points of each big data module and explains them in detail. The complete knowledge points of each module are shared on the public account: Hardworking Lao Liu. When there is a chance, he will analyze and summarize the shared knowledge points in video form, and then post an article with a detailed explanation.
Now let's start the main text. It's the same old line: although these are commonly used Hive functions, many people pay them no attention, yet in daily development we run into plenty of business requirements that need them. We must at least be familiar with the common ones.
The key points of this article are explode, row-to-column, and column-to-row, and you need to master their examples. Since Hive leans heavily on hands-on practice, Lao Liu explains only these key points in detail.
2. Lateral view and explode in hive
Why use explode?
In actual development you will run into many complex array or map structures. Some business requirements ask us to split such a column into multiple rows, and that is exactly what explode is for. If this still sounds abstract, let's illustrate explode with an example. Be sure to practice along with Lao Liu; if you don't practice, it's as if you learned nothing!
Requirement: we have data in the following format
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Fields are separated by \t. The requirement is to split all the child values into a single column:
+----------+--+
| mychild |
+----------+--+
| child1 |
| child2 |
| child3 |
| child4 |
| child5 |
| child6 |
| child7 |
| child8 |
+----------+--+
Also split the map's keys and values apart, producing the following result:
+-----------+-------------+--+
| mymapkey | mymapvalue |
+-----------+-------------+--+
| k1 | v1 |
| k2 | v2 |
| k3 | v3 |
| k4 | v4 |
+-----------+-------------+--+
Step 1: first create a database, then switch to the database just created
create database hive_explode;
use hive_explode;
Step 2: after creating the database, we can start creating the hive table
create table hive_explode.t3(name string,children array<string>,address Map<string,string>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':' stored as textFile;
Pay attention, everyone must look carefully. According to the requirement, name is a string, children is an array, and address is a map. Although Lao Liu did not dwell on it before, this part is really important. Based on the separators inside these compound types, the delimiter clauses are written like this:
The field separator (a tab here) is declared as
row format delimited fields terminated by '\t'
The separator between array items is
collection items terminated by ','
The separator between a map key and its value is
map keys terminated by ':'
Please remember the difference here!
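To see how the three delimiters cooperate, here is a small Python sketch (a hypothetical helper, not Hive itself) that parses one line of the sample data the same way the t3 table definition does:

```python
def parse_line(line):
    # Split one raw text row the way the t3 table definition does:
    # fields by \t, array items by ',', and a map key from its value by ':'
    name, children_raw, address_raw = line.split("\t")
    children = children_raw.split(",")  # -> array<string>
    address = dict(kv.split(":") for kv in address_raw.split(","))  # -> map<string,string>
    return name, children, address

row = parse_line("zhangsan\tchild1,child2,child3,child4\tk1:v1,k2:v2")
print(row)
```

The three split calls correspond exactly to the three `terminated by` clauses above.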
Step 3: Load data
cd /kkb/install/hivedatas/
vim maparray
The data content format is as follows
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4
Then use hive to load the data
load data local inpath '/kkb/install/hivedatas/maparray' into table hive_explode.t3;
After we import the data, we can look at the situation in the table.
Step 4: after importing the data into the table, the next step is to explode the data
Split all children into one column
SELECT explode(children) AS myChild FROM hive_explode.t3;
Then split the key and value of the map
SELECT explode(address) AS (myMapKey, myMapValue) FROM hive_explode.t3;
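Setting Hive aside for a moment, the effect of the two explode queries can be mimicked in plain Python (a sketch of the semantics, not of Hive's execution):

```python
# In-memory stand-in for table hive_explode.t3: (name, children, address)
rows = [
    ("zhangsan", ["child1", "child2", "child3", "child4"], {"k1": "v1", "k2": "v2"}),
    ("lisi", ["child5", "child6", "child7", "child8"], {"k3": "v3", "k4": "v4"}),
]

# explode(children) AS myChild: one output row per array element
my_child = [c for _, children, _ in rows for c in children]

# explode(address) AS (myMapKey, myMapValue): one output row per map entry
my_map = [(k, v) for _, _, address in rows for k, v in address.items()]

print(my_child)
print(my_map)
```

Note that explode discards the other columns; to keep the name next to each child you need lateral view, which is covered below.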
Since lateral view is almost always used together with row-to-column and column-to-row, we will not cover lateral view on its own.
3. Row to column
The first thing to say about row-to-column and column-to-row is that they are very, very important; requirements involving them come up all the time.
However, row-to-column does not literally mean converting one row into one column, nor does column-to-row mean the reverse. Many materials have their own opinions on which is which, and they often contradict each other.
Lao Liu learned this from Shang Silicon Valley. Let's set the naming debate aside and just figure out how they are used.
Row to column: It means to change the data in multiple columns into one column.
Let's demonstrate row-to-column with an example. The requirement is to group people with the same constellation and blood type together, so the result should look like this:
射手座,A 老王|凤姐
白羊座,A 孙悟空|猪八戒
白羊座,B 宋宋
This involves the concat family, so let's first introduce the string-joining functions:
concat(): returns the concatenation of all its input strings; it accepts any number of arguments;
concat_ws(): like concat, but inserts the given separator between the joined strings;
collect_set(): an aggregate function that de-duplicates a field's values and returns them as an array.
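If the three functions still feel abstract, the following Python sketch mimics what each one returns (these helpers are hypothetical stand-ins for the Hive built-ins, written only to show the behavior):

```python
def concat(*args):
    # Hive concat: join all inputs with no separator
    return "".join(args)

def concat_ws(sep, parts):
    # Hive concat_ws: join with the given separator between the parts
    return sep.join(parts)

def collect_set(values):
    # Hive collect_set: aggregate into a de-duplicated array
    # (dict.fromkeys keeps first-seen order for a stable result;
    # Hive itself does not guarantee any order)
    return list(dict.fromkeys(values))

print(concat("白羊座", ",", "A"))            # 白羊座,A
print(concat_ws("|", ["孙悟空", "猪八戒"]))  # 孙悟空|猪八戒
print(collect_set(["宋宋", "宋宋"]))         # ['宋宋']
```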
Next, all we have to do is to create a table to import data.
1. Create a file; note that the fields are separated by \t
cd /kkb/install/hivedatas
vim constellation.txt
孙悟空 白羊座 A
老王 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A
2. Create hive table and load data
create table person_info(name string,constellation string,blood_type string)
row format delimited fields terminated by "\t";
3. Load data
load data local inpath '/kkb/install/hivedatas/constellation.txt' into table person_info;
After importing, you can check the table with select * from person_info.
4. Query data
Note that, according to our needs, the query result requires concat_ws operation.
select t1.base, concat_ws('|', collect_set(t1.name)) name
from (select name, concat(constellation, "," , blood_type) base from person_info) t1
group by t1.base;
Lao Liu's explanation: since the constellation and blood type are joined with a comma, we write the code like this
concat(constellation, "," , blood_type)
The next step is to gather all the people who share the same constellation-and-blood-type combination.
select name, concat(constellation, "," , blood_type) base from person_info
Name this subquery t1, then merge the matching rows of names into one value as the requirement asks. Since the names are joined with |, we write the code like this:
concat_ws('|', collect_set(t1.name))
The final query then looks like this (t1 stands for the subquery above):
select t1.base, concat_ws('|', collect_set(t1.name)) name from t1 group by t1.base;
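The whole query can be traced step by step in Python to check the expected result (a sketch of the logic only, using the five sample rows; note that Hive's collect_set does not guarantee element order, while this sketch keeps first-seen order):

```python
# The five sample rows: (name, constellation, blood_type)
people = [
    ("孙悟空", "白羊座", "A"),
    ("老王", "射手座", "A"),
    ("宋宋", "白羊座", "B"),
    ("猪八戒", "白羊座", "A"),
    ("凤姐", "射手座", "A"),
]

# Inner query: base = concat(constellation, ",", blood_type)
t1 = [(f"{cons},{blood}", name) for name, cons, blood in people]

# Outer query: group by base, then concat_ws('|', collect_set(name))
groups = {}
for base, name in t1:
    groups.setdefault(base, []).append(name)
result = {base: "|".join(dict.fromkeys(names)) for base, names in groups.items()}
for base, names in result.items():
    print(base, names)
```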
4. Column to Row
In column to row, two very important functions are involved: explode and lateral view.
explode: splits a complex array or map value in a hive column into multiple rows.
lateral view: generally used together with explode to split one row of data into multiple rows; on that basis the split data can then be aggregated.
for example:
The data content is as follows, and the fields are divided by \t
cd /kkb/install/hivedatas
vim movie.txt
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难
Expand the array of movie categories so that the result looks like this:
《疑犯追踪》 悬疑
《疑犯追踪》 动作
《疑犯追踪》 科幻
《疑犯追踪》 剧情
《Lie to me》 悬疑
《Lie to me》 警匪
《Lie to me》 动作
《Lie to me》 心理
《Lie to me》 剧情
《战狼2》 战争
《战狼2》 动作
《战狼2》 灾难
This is a typical one-row-to-many-rows conversion, done with lateral view combined with explode.
The first step is to create a table that matches the shape of the data: the category column must be an array type.
create table movie_info(movie string, category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";
Next is to load the data
load data local inpath "/kkb/install/hivedatas/movie.txt" into table movie_info;
Finally, query the table according to the requirement. Since category is an array type, you need to explode it with a lateral view before you can query the individual values.
select movie, category_name from movie_info
lateral view explode(category) table_tmp as category_name;
Here, table_tmp is the alias of the virtual table generated by the lateral view, and category_name is the column alias for each exploded element.
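The lateral view explode combination can likewise be traced in Python (a semantic sketch, not Hive): each movie row is crossed with its own exploded category array, which is exactly why the movie name survives next to every category.

```python
# In-memory stand-in for table movie_info: (movie, category)
movies = [
    ("《疑犯追踪》", ["悬疑", "动作", "科幻", "剧情"]),
    ("《Lie to me》", ["悬疑", "警匪", "动作", "心理", "剧情"]),
    ("《战狼2》", ["战争", "动作", "灾难"]),
]

# lateral view explode(category) table_tmp as category_name:
# cross each row with the elements of its own array
pairs = [(movie, category_name)
         for movie, category in movies
         for category_name in category]
for movie, category_name in pairs:
    print(movie, category_name)
```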
5. Summary
Lao Liu mainly covered row-to-column and column-to-row, along with the two constructs explode and lateral view, demonstrating each with a case. Everyone must practice the cases; reading without practicing is learning in vain.
Finally, the complete Hive knowledge points (3) are on the public account: Hardworking Lao Liu. If you think something is badly written or wrong, contact Lao Liu to discuss it. I hope this helps students interested in big data development, and I hope to get their guidance.
If you think the writing is good, give Lao Liu a thumbs up!