Big Data -- Hive bucketing, sampling queries && compression

I. Bucketed tables and sampling queries

1. Creating a bucketed table

---------------------------------------

hive (db_test)> create table stu_buck(id int,name string)
> clustered by(id)
> into 4 buckets
> row format delimited fields terminated by '\t';
OK
Time taken: 0.369 seconds

------------------------------------------------------------------------

hive (db_test)> desc formatted stu_buck;
OK
col_name data_type comment
# col_name data_type comment

id int
name string

# Detailed Table Information
Database: db_test
Owner: root
CreateTime: Thu Oct 03 12:14:15 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/db_test.db/stu_buck
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1570076055

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: 4
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.121 seconds, Fetched: 28 row(s)
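The Sort Columns: [] line above shows that this table only buckets the data and does not keep each bucket sorted. Hive's DDL also allows each bucket file to be kept sorted on a column via a SORTED BY clause; a minimal sketch of that variant (the stu_buck_sorted table name is only an illustration, not part of the original example):

-- Bucketed AND sorted variant (sketch): each of the 4 bucket files is kept ordered by id
create table stu_buck_sorted(id int, name string)
clustered by(id)
sorted by(id asc)
into 4 buckets
row format delimited fields terminated by '\t';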

------------------------------------------------------------------

2. Loading data into the bucketed table

2.1 Create an ordinary (non-bucketed) table

------------------------------------------------------------------

hive (db_test)> create table stu_comm(id int,name string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.181 seconds

---------------------------------------------------------------------

2.2 Load the local data into the ordinary table

-------------------------------------------------------------------------

hive (db_test)> load data local inpath '/root/hivetest/stu_buck' into table stu_comm;
Loading data to table db_test.stu_comm
Table db_test.stu_comm stats: [numFiles=1, totalSize=501]
OK
Time taken: 0.654 seconds

hive (db_test)> select * from stu_comm;
OK
stu_comm.id stu_comm.name
1001 Zhang
1002 Doe
1003 Wangwu
1004 Zhao six
1005 Li Qi
1006 Zhao
1007 Huang Yueying
1008 Liang
1009 Sima Yi
1010 Zhang
1011 Guan
1012 Bei
1013 Cao
1014 Cao
1015 Pi
1016 YingZheng
1017 Han
1018 Quan
1019 Shangxiang
1020 Sun Bin
1021 bridge
1022 Joe
1023 Luban
1024 Gan
1025 white from
1026 Bai
1027 Li Xin
1028 Smurfit
1029 Yi
1030 Arthur
1031 Ahn'Qiraj
1032 daji
1033 Bu
1034 Zhang Bao
1035 Su
1036 Dong
1037 Ma Su
1038 Jaap
1039 XiaHouYuan
1040 Huang
Time taken: 0.081 seconds, Fetched: 40 row(s)
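For the load above to line up with the table definition, the local file /root/hivetest/stu_buck is expected to hold one record per line with the id and the name separated by a tab character (matching row format delimited fields terminated by '\t'). A minimal sketch of the first few lines, taken from the query result above, with <TAB> standing in for the real tab character:

1001<TAB>Zhang
1002<TAB>Doe
1003<TAB>Wangwu
1004<TAB>Zhao six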

-----------------------------------------------------------------------------

2.3 Set the Hive properties that enable bucketing, so that the data is divided into buckets by multiple reduce tasks of the MapReduce job

-------------------------------------------------------------------

hive (db_test)> set hive.enforce.bucketing=true;
hive (db_test)> set mapreduce.job.reduces=4;
hive (db_test)> set hive.enforce.bucketing;
hive.enforce.bucketing=true
hive (db_test)> set mapreduce.job.reduces;
mapreduce.job.reduces=4
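These set commands only last for the current CLI session. One way to avoid re-typing them (short of changing hive-site.xml) is to put them in the CLI's startup file; a minimal sketch, assuming the Hive CLI reads $HOME/.hiverc on startup. Note that newer Hive releases (2.x and later) enforce bucketing by default, so hive.enforce.bucketing is mainly needed on Hive 1.x:

-- $HOME/.hiverc (sketch): statements here are executed every time the Hive CLI starts
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=4;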

---------------------------------------------------------

2.4 A bucketed table cannot be populated with load data, because the file would not be split into buckets; the data has to be inserted with an insert ... select statement so that a MapReduce job distributes the rows into the bucket files

----------------------------------------------------------

hive (db_test)> insert into table stu_buck select id,name from stu_comm;
Query ID = root_20191003122918_48ead4a6-8f19-4f0a-8298-6a57b467bf47
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1570075776894_0001, Tracking URL = http://bigdata112:8088/proxy/application_1570075776894_0001/
Kill Command = /opt/module/hadoop-2.8.4/bin/hadoop job -kill job_1570075776894_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2019-10-03 12:29:29,888 Stage-1 map = 0%, reduce = 0%
2019-10-03 12:29:38,338 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
2019-10-03 12:29:45,839 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 3.17 sec
2019-10-03 12:29:47,932 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 4.74 sec
2019-10-03 12:29:52,077 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.54 sec
MapReduce Total cumulative CPU time: 6 seconds 540 msec
Ended Job = job_1570075776894_0001
Loading data to table db_test.stu_buck
Table db_test.stu_buck stats: [numFiles=4, numRows=40, totalSize=501, rawDataSize=461]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 6.54 sec HDFS Read: 15129 HDFS Write: 793 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 540 msec
OK
id name
Time taken: 34.6 seconds
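Each of the four reducers wrote one bucket file under the table's HDFS location, and rows were routed by hashing the clustered-by column modulo the bucket count; for an int column the hash is the value itself, so for example id 1004 (1004 % 4 = 0) lands in 000000_0. A quick way to confirm that four files were produced, as a sketch from inside the CLI (output omitted; the path comes from the table's Location shown earlier):

-- List the bucket files written by the 4 reducers
dfs -ls /db_test.db/stu_buck;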

 

 =================================================

3. Sampling queries on the bucketed table data

3.1 Query each of the four bucket files

-------------------------------------------------------------------------------------

// Query the data in the first bucket file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000000_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1040 Huang
1036 Dong
1032 daji
1028 Smurfit
1024 Gan
1020 Sun Bin
1016 YingZheng
1012 Bei
1008 Liang
1004 Zhao six
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
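The cat: No FileSystem for scheme: http message (and the resulting non-zero exit code) is caused by the second argument: the dfs command only understands filesystem URIs such as hdfs:// or file://, not the NameNode's http web address. Dropping that argument reads the bucket file cleanly; a minimal sketch (output omitted):

-- Read the first bucket file only, without the http URL
dfs -cat /db_test.db/stu_buck/000000_0;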

=======================================

// Query the contents of the second bucket file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000001_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1005 Li Qi
1029 Yi
1037 Ma Su
1017 Han
1001 Zhang
1033 Bu
1009 Sima Yi
1013 Cao
1025 white from
1021 bridge
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

===========================================

// Query the contents of the third bucket file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000002_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1010 Zhang
1038 Jaap
1022 Joe
1034 Zhang Bao
1002 Doe
1026 Bai
1018 Quan
1030 Arthur
1014 Cao
1006 Zhao
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

==============================================

// Query the contents of the fourth bucket file

hive (db_test)> dfs -cat /db_test.db/stu_buck/000003_0 http://192.168.1.121:50070/;
cat: No FileSystem for scheme: http
1015 Pi
1007 Huang Yueying
1027 Li Xin
1023 Luban
1019 Shangxiang
1003 Wangwu
1011 Guan
1039 XiaHouYuan
1035 Su
1031 Ahn'Qiraj
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null

====================================================

3.2 Sampling two buckets from the bucketed table

---------------------------------------------------------------------

// Query the contents of bucket files 1 and 3

hive (db_test)> select * from stu_buck tablesample(bucket 1 out of 2 on id);
OK
stu_buck.id stu_buck.name
1040 Huang
1036 Dong
1032 daji
1028 Smurfit
1024 Gan
1020 Sun Bin
1016 YingZheng
1012 Bei
1008 Liang
1004 Zhao six
1010 Zhang
1038 Jaap
1022 Joe
1034 Zhang Bao
1002 Doe
1026 Bai
1018 Quan
1030 Arthur
1014 Cao
1006 Zhao
Time taken: 0.077 seconds, Fetched: 20 row(s)

----------------------------------------------------------------------

// Query the contents of bucket files 2 and 4

hive (db_test)> select * from stu_buck tablesample(bucket 2 out of 2 on id);
OK
stu_buck.id stu_buck.name
1005 Li Qi
1029 Yi
1037 Ma Su
1017 Han
1001 Zhang
1033 Bu
1009 Sima Yi
1013 Cao
1025 white from
1021 bridge
1015 Pi
1007 Huang Yueying
1027 Li Xin
1023 Luban
1019 Shangxiang
1003 Wangwu
1011 Guan
1039 XiaHouYuan
1035 Su
1031 Ahn'Qiraj
Time taken: 0.097 seconds, Fetched: 20 row(s)
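tablesample(bucket x out of y on col) splits the table's buckets into groups of y and returns bucket x from each group, i.e. buckets x, x+y, x+2y, and so on. With 4 buckets, bucket 1 out of 2 therefore returns buckets 1 and 3, and bucket 2 out of 2 returns buckets 2 and 4, which is exactly what the two result sets above show. To pull a single bucket, make y equal to the bucket count; a minimal sketch (output omitted):

-- Only the rows of the third bucket file (000002_0), i.e. ids where id % 4 = 2
select * from stu_buck tablesample(bucket 3 out of 4 on id);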

---------------------------------------------------------------------

// In tablesample(bucket x out of y on id), the number before "out of" (x) must not be greater than the number after it (y)

hive (db_test)> select * from stu_buck tablesample(bucket 4 out of 2 on id);
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck

----------------------------------------------------------------------

II. Hive compression formats and whether they support splitting

Compression format    Tool     Algorithm    File extension    Splittable
DEFLATE               none     DEFLATE      .deflate          No
Gzip                  gzip     DEFLATE      .gz               No
bzip2                 bzip2    bzip2        .bz2              Yes
LZO                   lzop     LZO          .lzo              Yes
Snappy                none     Snappy       .snappy           No

To support multiple compression/decompression algorithms, Hadoop provides the following codecs (encoders/decoders):

Compression format    Corresponding codec (encoder/decoder)
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
Snappy                org.apache.hadoop.io.compress.SnappyCodec
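Which of these codecs are actually registered on a given cluster is governed by the Hadoop property io.compression.codecs (LZO, for instance, usually has to be installed and added there separately). A quick way to look at it from the Hive CLI, as a sketch; if the property is not explicitly configured the value may come back empty or undefined, since Hadoop can also discover codecs automatically:

-- Show the codec list configured for the cluster (sketch; output omitted)
set io.compression.codecs;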

 

 

1. Enable compression of map-stage output (these settings take effect only for the current session; to make them permanent, put them in the configuration file)

------------------------------------------------------------------

1) Enable compression of Hive intermediate transfer data; the default is false

hive (default)>set hive.exec.compress.intermediate=true;

2) Open the map mapreduce output compression , the default is false

hive (default)>set mapreduce.map.output.compress=true;

3) Set the codec used to compress MapReduce map output

hive (default)>set mapreduce.map.output.compress.codec= org.apache.hadoop.io.compress.SnappyCodec;
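With these three properties set, any query that shuffles data from map to reduce will compress its intermediate output with Snappy. A minimal sketch of a query on the bucketed table from part I that triggers such a shuffle (output omitted):

-- Forces a map -> reduce stage, so the map output travels Snappy-compressed
select name, count(*) from stu_buck group by name;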

----------------------------------------------------------------

2. Enable compression of reduce-stage (final) output

---------------------------------------------------------

1) Enable compression of Hive final output data; the default is false

hive (default)>set hive.exec.compress.output=true;

2) Enable compression of MapReduce final output data; the default is false

hive (default)>set mapreduce.output.fileoutputformat.compress=true;

3) Set the codec used to compress MapReduce final output

hive (default)> set mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

4) Set the MapReduce final output compression type to block compression

hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
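With the four properties above in effect, the files written by a query's final stage are compressed. A minimal sketch that writes the bucketed table out to a local directory so the result files can be inspected (the /root/hivetest/compress_out path is only an example; with SnappyCodec the result files should carry a .snappy extension):

-- Write query results while output compression is enabled (example path)
insert overwrite local directory '/root/hivetest/compress_out'
row format delimited fields terminated by '\t'
select * from stu_buck;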

===================================================================


Origin: www.cnblogs.com/jeff190812/p/11619581.html