78. Hive Data Warehouse in Practice (Operation Test)

Operation of Hive data warehouse:

  • Database creation and deletion
  • Table creation, modification, and deletion
  • Importing and exporting table data
  • Creating, modifying, and deleting table partitions and buckets

Contents

Hive environment construction

Operation of Hive Data Warehouse

Operations on Hive Data Tables

Import and export of data in Hive


Hive environment construction

Installing Hive 3.1.2 on CentOS (detailed guide): https://blog.csdn.net/m0_54925305/article/details/120554242?spm=1001.2014.3001.5502

Operation of Hive Data Warehouse

1. Create a database

hive> show databases;
OK
default
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> create database if not exists DB;
OK
Time taken: 0.064 seconds
hive> show databases;
OK
db
default
Time taken: 0.018 seconds, Fetched: 2 row(s)

2. View the information and path of the data warehouse DB

hive> describe database DB;
OK
db		hdfs://master:9000/user/hive/warehouse/db.db	root	USER	
Time taken: 0.065 seconds, Fetched: 1 row(s)
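
The opening list also mentions database deletion; a minimal sketch against the DB database created above:

-- only succeeds while the database contains no tables:
drop database if exists DB;
-- cascade drops the database together with any tables inside:
drop database if exists DB cascade;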

Operations on Hive Data Tables

Hive data tables are divided into two types: internal tables and external tables.


When Hive creates an internal (managed) table, it moves the data into the path pointed to by the data warehouse; when an external table is created, Hive only records the path where the data resides and does not move the data. When a table is dropped, an internal table's metadata and data are deleted together, while dropping an external table removes only the metadata, not the data. This makes external tables relatively safer and the data organization more flexible, and it is convenient for sharing source data, so external tables are commonly used in production.
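
A minimal sketch of the difference in drop behavior (the table names and the /user/root/demo path are hypothetical):

-- managed: Hive stores the data under the warehouse directory
create table managed_demo(id string);
-- external: Hive records only the location and leaves the files in place
create external table external_demo(id string) location '/user/root/demo';

drop table managed_demo;   -- removes the metadata AND the warehouse files
drop table external_demo;  -- removes the metadata only; /user/root/demo is untouched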


The following describes the table-operation commands and their usage in detail. Note that the name of a table to be created cannot be the same as that of an existing table, or an error will be reported; use show tables to check which tables already exist.
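
The error can be avoided with an if not exists clause; a minimal sketch using the cat table created below:

-- no error is raised even if cat already exists:
create table if not exists cat(cat_id string, cat_name string);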

1. Create an internal table named cat with two fields, cat_id and cat_name, both of type string

hive> create table cat(cat_id string,cat_name string);
OK
Time taken: 0.72 seconds
hive> show tables;
OK
cat
Time taken: 0.057 seconds, Fetched: 1 row(s)

2. Create an external table named goods with two fields, group_id and group_name, both of type string

hive> create external table if not exists goods(group_id string,group_name string) row format delimited fields terminated by '\t' location '/user/root/goods';
OK
Time taken: 0.155 seconds
hive> show tables;
OK
cat
goods
Time taken: 0.026 seconds, Fetched: 2 row(s)

3. Modify the table structure of the cat table: add two fields, group_id and cat_code, both of type string

hive> alter table cat add columns(group_id string,cat_code string);
OK
Time taken: 0.372 seconds
hive> desc cat;
OK
cat_id              	string              	                    
cat_name            	string              	                    
group_id            	string              	                    
cat_code            	string              	                    
Time taken: 0.087 seconds, Fetched: 4 row(s)

4. Rename the table cat to cat2

hive> alter table cat rename to cat2;
OK
Time taken: 0.275 seconds

This command lets the user rename a table without changing the data's location or partition names.

5. Create a table with the same structure as an existing table: create a table named cat3 with the same structure as the cat2 table, using the like keyword (like copies only the table structure, not the data)

hive> create table cat3 like cat2;
OK
Time taken: 1.391 seconds
hive> show tables;
OK
cat2
cat3
goods
Time taken: 0.047 seconds, Fetched: 3 row(s)
hive> desc cat3;
OK
cat_id              	string              	                    
cat_name            	string              	                    
group_id            	string              	                    
cat_code            	string              	                    
Time taken: 0.118 seconds, Fetched: 4 row(s)

Import and export of data in Hive

1. Import data from the local file system into a Hive table

First, create a cat_group table in Hive with two fields, group_id and group_name, both of type string, using "\t" as the field delimiter, then check the result.

hive> create table cat_group(group_id string,group_name string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.218 seconds
hive> show tables;
OK
cat2
cat3
cat_group
goods
Time taken: 0.048 seconds, Fetched: 4 row(s)

The row format delimited clause sets the column delimiter the table expects when loading data.
The stored as textfile clause sets the storage format of the table's data; TEXTFILE is the default. If the file is plain text, use stored as textfile: the file can then be copied directly from the local system to HDFS and Hive can read the data as-is.
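
For data that is not plain text, a different storage format can be named instead; a minimal sketch (the table name is hypothetical, with ORC as an example of a binary columnar format):

create table cat_group_orc(group_id string, group_name string)
stored as orc;  -- binary columnar storage; populate with insert ... select rather than load data on a text file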

2. Import the myhive file from the local Linux directory /input/hive/ into the cat_group table in Hive

hive> load data local inpath '/input/hive/myhive' into table cat_group;
Loading data to table db.cat_group
OK
Time taken: 1.081 seconds

Use a select statement to check whether the data was imported into the cat_group table successfully, using the limit keyword to restrict the output to 5 records.

hive> select * from cat_group limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 2.088 seconds, Fetched: 5 row(s)

3. Import data from HDFS into Hive

        1. First, open another operation window and create the /output/hive directory on HDFS

[root@master hive]# hadoop fs -mkdir /output/hive

        2. Upload the myhive file from the local /input/hive/ directory to /output/hive on HDFS, and check whether the upload succeeded

[root@master hive]# hadoop fs -put /input/hive/myhive /output/hive/
[root@master hive]# hadoop fs -ls /output/hive 
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-05 22:19 /output/hive/myhive

        3. Create a table named cat_group1 in Hive. The table creation statement is as follows

hive> create table cat_group1(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.243 seconds

        4. Load the myhive file from /output/hive on HDFS into the cat_group1 table in Hive, and view the results

hive> load data inpath '/output/hive/myhive'  into table cat_group1;
Loading data to table db.cat_group1
OK
Time taken: 0.539 seconds
hive> select * from cat_group1 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.262 seconds, Fetched: 5 row(s)

        Note: the data import succeeded.

        The difference between importing from HDFS and importing local data is that the local keyword is omitted after load data. Note also that loading from HDFS moves the file into the table's directory, whereas loading with local copies it.
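
A side-by-side sketch of the two forms, using the paths from above:

-- local: copies the file from the local file system into the table's directory
load data local inpath '/input/hive/myhive' into table cat_group;
-- without local: moves the file from its HDFS location into the table's directory
load data inpath '/output/hive/myhive' into table cat_group1;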

4. Query data from another table and import it into Hive

        1. First create a table named cat_group2 in Hive.

hive> create table cat_group2(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.111 seconds

        2. Import the data in the cat_group1 table into the cat_group2 table in the following two ways.

hive> insert into table cat_group2 select * from cat_group1;
Query ID = root_20220306040659_42572420-db7d-4412-bbc3-495abd9ce479
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0003, Tracking URL = http://master:8088/proxy/application_1646528951444_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:07:31,799 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:07:51,642 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec
2022-03-06 04:08:00,165 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.47 sec
MapReduce Total cumulative CPU time: 3 seconds 470 msec
Ended Job = job_1646528951444_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-06-59_043_3456913091663343579-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.47 sec   HDFS Read: 13409 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 470 msec
OK
Time taken: 63.711 seconds
hive> insert overwrite  table cat_group2 select * from cat_group1;
Query ID = root_20220306041024_bf920fd1-b42d-4ed7-ad7b-66955905fa19
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0004, Tracking URL = http://master:8088/proxy/application_1646528951444_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 04:10:47,981 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:11:12,568 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.33 sec
2022-03-06 04:11:22,231 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1646528951444_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group2/.hive-staging_hive_2022-03-06_04-10-24_167_6531779411761470258-1/-ext-10000
Loading data to table db.cat_group2
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 13494 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 100 msec
OK
Time taken: 60.895 seconds

        Note: insert overwrite replaces the table's existing data, while insert into appends to it.

        3. Query table cat_group2

hive> select * from cat_group2 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.33 seconds, Fetched: 5 row(s)

        4. Query data from another table and insert it into a new table as the table is created (create table ... as select)

Create the table cat_group3 in Hive, taking its data directly from cat_group2. Unlike the like keyword used earlier, this copies the data as well as the structure.

hive> create table  cat_group3 as  select * from cat_group2;
Query ID = root_20220306041630_3200b863-b9b3-4c2e-ac0d-c7caff9b6611
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0005, Tracking URL = http://master:8088/proxy/application_1646528951444_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 04:16:54,438 Stage-1 map = 0%,  reduce = 0%
2022-03-06 04:17:02,430 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.58 sec
MapReduce Total cumulative CPU time: 1 seconds 580 msec
Ended Job = job_1646528951444_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/.hive-staging_hive_2022-03-06_04-16-30_327_7813330832683742274-1/-ext-10002
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group3
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.58 sec   HDFS Read: 4969 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 580 msec
OK
Time taken: 34.65 seconds

        5. Query table cat_group3

hive> select * from cat_group3 limit 5;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.229 seconds, Fetched: 5 row(s)

5. Three common ways to export data

        1. Export to the local file system

Create the directory /output/hive locally and export the cat_group table in Hive to the local file system under /output/hive/. Note that insert overwrite local directory overwrites whatever the target directory already contains.

[root@master hive]# mkdir -p /output/hive/
hive> insert overwrite local directory '/output/hive/'
    > row format delimited fields terminated by '\t' select * from cat_group;
Query ID = root_20220306062829_b059a3f5-e4ad-4dd7-a000-e294c4ccbee2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0006, Tracking URL = http://master:8088/proxy/application_1646528951444_0006/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:28:51,743 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:29:00,515 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.59 sec
MapReduce Total cumulative CPU time: 1 seconds 590 msec
Ended Job = job_1646528951444_0006
Moving data to local directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.59 sec   HDFS Read: 4738 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
Time taken: 32.116 seconds
[root@master out]# cd /output/hive/
[root@master hive]# ll
total 4
-rw-r--r--. 1 root root 64 Mar  6 06:29 000000_0
[root@master hive]# cat 000000_0 
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯

Note: the method differs from importing data into Hive; insert into cannot be used to export data, only insert overwrite [local] directory works.
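
Besides insert overwrite [local] directory, Hive also provides a dedicated export command that writes both the data and the table metadata to an HDFS path; a minimal sketch (the target path is hypothetical and must be empty):

export table cat_group to '/output/hive_export';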

        2. Export data from Hive to HDFS

Export the data in the Hive table cat_group into the /output/hive directory on HDFS.

hive> insert overwrite directory '/output/hive' 
    > row format delimited fields terminated by '\t' select group_id,
    > group_name from cat_group;
Query ID = root_20220306063621_b359d338-77ee-4571-a425-5415f9c6fb03
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1646528951444_0007, Tracking URL = http://master:8088/proxy/application_1646528951444_0007/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-03-06 06:36:41,866 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:36:55,679 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1646528951444_0007
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/output/hive/.hive-staging_hive_2022-03-06_06-36-21_452_7432529204143275493-1/-ext-10000
Moving data to directory /output/hive
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.75 sec   HDFS Read: 4772 HDFS Write: 64 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
Time taken: 36.494 seconds

View results on HDFS

[root@master hive]# hadoop fs -ls /output/hive
Found 1 items
-rw-r--r--   2 root supergroup         64 2022-03-06 06:36 /output/hive/000000_0

        3. Export to another table in Hive

Import the data in the Hive table cat_group into cat_group4 (the fields and types of the two tables are identical).
First create the table cat_group4 in Hive, with two fields group_id and group_name, both of type string, using '\t' as the delimiter.

hive> create table cat_group4(group_id string,group_name string)
    > row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.195 seconds

Then import the data in cat_group into cat_group4.

hive> insert into table cat_group4 select * from cat_group;
Query ID = root_20220306064421_722364dd-7475-4ae5-ba44-553f3df856e2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646528951444_0008, Tracking URL = http://master:8088/proxy/application_1646528951444_0008/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646528951444_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-06 06:44:47,514 Stage-1 map = 0%,  reduce = 0%
2022-03-06 06:44:58,359 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.74 sec
2022-03-06 06:45:11,880 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
MapReduce Total cumulative CPU time: 3 seconds 400 msec
Ended Job = job_1646528951444_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/cat_group4/.hive-staging_hive_2022-03-06_06-44-21_318_6696628966307745769-1/-ext-10000
Loading data to table db.cat_group4
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 13474 HDFS Write: 348 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 400 msec
OK
Time taken: 52.617 seconds

After the import is complete, view the data in the cat_group4 table.

hive> select * from cat_group4 limit 10;
OK
101	孙悟空
102	唐僧
103	猪八戒
104	沙僧
105	托马斯
Time taken: 0.249 seconds, Fetched: 5 row(s)

6. Hive partitioned table operations

Create a partitioned table: create a table goods in Hive containing two fields, goods_id and goods_status, both of type string, partitioned by cat_id (type string), with "\t" as the delimiter. (This assumes the external goods table created earlier has since been dropped, as table names must be unique.)

hive> create table goods(goods_id string,goods_status string) partitioned by (cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.107 seconds

View the structure of the goods table

hive> desc goods;
OK
goods_id            	string              	                    
goods_status        	string              	                    
cat_id              	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
cat_id              	string              	                    
Time taken: 0.108 seconds, Fetched: 7 row(s)
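
Partitions can also be added and listed explicitly; a minimal sketch (the cat_id value is hypothetical):

alter table goods add partition(cat_id='52060');  -- creates the partition directory under the table's path
show partitions goods;                            -- lists the table's current partitions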

Insert data into the partitioned table: the data comes from the goods file in the local /input/hive/ directory, shown below.

[root@master hive]# cat goods 
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052
1020405 6       52052

Create a non-partitioned table goods_1 in Hive to hold the data from the goods file under the local /input/hive/ directory.

hive> create table goods_1(goods_id string,goods_status string,cat_id string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.179 seconds

Import the data from the goods file under the local /input/hive/ directory into the goods_1 table in Hive.

hive> load data local inpath '/input/hive/goods' into table goods_1;
Loading data to table db.goods_1
OK
Time taken: 0.511 seconds

Then import the data from the goods_1 table into the partitioned table goods

hive> insert into table db.goods partition(cat_id = '52052') select goods_id, goods_status from db.goods_1 where cat_id = '52052';
Query ID = root_20220307041832_30f47fc3-629d-4eda-821a-5f0c3a9edb0d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0002, Tracking URL = http://master:8088/proxy/application_1646636256603_0002/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 04:19:05,274 Stage-1 map = 0%,  reduce = 0%
2022-03-07 04:19:18,487 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec
2022-03-07 04:19:27,292 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.59 sec
MapReduce Total cumulative CPU time: 4 seconds 590 msec
Ended Job = job_1646636256603_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods/cat_id=52052/.hive-staging_hive_2022-03-07_04-18-32_060_6446641423854979060-1/-ext-10000
Loading data to table db.goods partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.59 sec   HDFS Read: 14777 HDFS Write: 320 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 590 msec
OK
Time taken: 59.931 seconds

View the data in the table goods 

hive> select goods_id, goods_status from goods;
OK
1624123	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
1020405	6
Time taken: 0.252 seconds, Fetched: 8 row(s)
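
Because cat_id is a partition column, filtering on it prunes the scan to that partition's directory; a minimal sketch:

-- reads only the cat_id=52052 partition rather than the whole table:
select goods_id, goods_status from goods where cat_id = '52052';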

Modify a table partition: change the partition cat_id = 52052 in the partitioned table goods to cat_id = 52051, and view the modified partition name.

hive> alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
OK
Time taken: 0.678 seconds
hive> show partitions goods;
OK
cat_id=52051
Time taken: 0.139 seconds, Fetched: 1 row(s)

Delete a table partition

Before deleting the partition from the goods table, back up the data into a goods_2 table.

hive> create table goods_2(goods_id string,goods_status string) partitioned by (cat_id string) row format delimited fields terminated by '\t';
OK
Time taken: 0.178 seconds
hive> insert into table goods_2 partition(cat_id='52052') select goods_id,goods_status from goods_1 where cat_id = '52052';
Query ID = root_20220307054238_db58a379-17f6-4ecb-86e0-402e0d7bbf54
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0003, Tracking URL = http://master:8088/proxy/application_1646636256603_0003/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-03-07 05:43:04,534 Stage-1 map = 0%,  reduce = 0%
2022-03-07 05:43:17,542 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.76 sec
2022-03-07 05:43:26,197 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.55 sec
MapReduce Total cumulative CPU time: 4 seconds 550 msec
Ended Job = job_1646636256603_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://master:9000/user/hive/warehouse/db.db/goods_2/cat_id=52052/.hive-staging_hive_2022-03-07_05-42-38_498_2225361888387483704-1/-ext-10000
Loading data to table db.goods_2 partition (cat_id=52052)
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 14813 HDFS Write: 322 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
Time taken: 49.84 seconds

Delete the cat_id = 52051 partition from the goods table. For an internal table, dropping a partition also deletes that partition's data files, which is why the data was backed up above.

hive> alter table goods drop if exists partition(cat_id = '52051');
Dropped the partition cat_id=52051
OK
Time taken: 0.405 seconds
hive> show partitions goods;
OK
Time taken: 0.137 seconds

7. Hive bucket operations

Before using buckets, the hive.enforce.bucketing property needs to be set to true so that Hive enforces the bucket definition when data is written.

1. Create a bucketed table

Create a table named goods_t with two fields, goods_id and goods_status, both of type string, partitioned by cat_id (string), clustered by the goods_status column, sorted by the goods_id column, and divided into 2 buckets.

hive> create table goods_t(goods_id string, goods_status string) partitioned by (cat_id string) clustered by(goods_status) sorted by(goods_id) into 2 buckets;
OK
Time taken: 0.148 seconds
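
To confirm the bucket definition after creating the table, desc formatted can be used; among other things it reports the bucket count and the clustered/sorted columns:

desc formatted goods_t;  -- look for Num Buckets, Bucket Columns and Sort Columns in the output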

2. Set the property hive.enforce.bucketing=true. With this enabled, Hive sets the number of reducers to match the bucket count (note the 2 reducers in the job below).

hive> set hive.enforce.bucketing=true;

3. Insert the data in the goods_2 table into the goods_t table

hive> insert overwrite table goods_t partition(cat_id='52063') select goods_id,goods_status from goods_2;
Query ID = root_20220307060336_c76fa90c-ea59-4fa4-9dd5-654c843421fd
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0004, Tracking URL = http://master:8088/proxy/application_1646636256603_0004/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2022-03-07 06:04:01,531 Stage-1 map = 0%,  reduce = 0%
2022-03-07 06:04:12,389 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.73 sec
2022-03-07 06:04:29,170 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 4.23 sec
2022-03-07 06:04:30,371 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.99 sec
MapReduce Total cumulative CPU time: 7 seconds 410 msec
Ended Job = job_1646636256603_0004
Loading data to table db.goods_t partition (cat_id=52063)
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1646636256603_0005, Tracking URL = http://master:8088/proxy/application_1646636256603_0005/
Kill Command = /home/hadoop//bin/mapred job  -kill job_1646636256603_0005
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2022-03-07 06:04:54,726 Stage-3 map = 0%,  reduce = 0%
2022-03-07 06:05:07,008 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.75 sec
2022-03-07 06:05:16,566 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.93 sec
MapReduce Total cumulative CPU time: 3 seconds 930 msec
Ended Job = job_1646636256603_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.41 sec   HDFS Read: 19414 HDFS Write: 469 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.93 sec   HDFS Read: 11591 HDFS Write: 173 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 340 msec
OK
Time taken: 102.151 seconds

4. Sample the bucketed table

hive> select * from goods_t tablesample(bucket 1 out of 2 on goods_status);
OK
Time taken: 0.281 seconds
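
tablesample(bucket x out of y on col) hashes col into y buckets and returns the x-th of them; the empty result above suggests that, with every row sharing goods_status = '6', all rows hashed into the other bucket. A sketch of sampling that bucket instead (the output depends on the data):

select * from goods_t tablesample(bucket 2 out of 2 on goods_status);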


 
