Hive basic command analysis

1. Partitioning of Hive

Command: create a partitioned table

create table t_sz_part(id int, name string)
partitioned by (country string)
row format delimited
fields terminated by ',';

Insert data into a partition:

load data local inpath '/home/hadoop/sz.dat' into table t_sz_part partition(country = 'China');
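To confirm that the partition was created, you can list the table's partitions (a standard Hive command, added here as a quick check):

show partitions t_sz_part;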

Note: First, when creating a partitioned table, you must declare it as such with the keyword partitioned by (country string), which also names the field the table is partitioned on. Second, when importing data into a partitioned table, you must use partition(country='China') to state which partition of the table the data should be loaded into.

A partition simply groups records that satisfy some condition under a label so that queries can skip irrelevant data. It is much like sorting files into folders, where the folder name corresponds to the partition field. The partition field formally belongs to the table and is returned to the client in query results, but it is not actually stored in the table's data files; it is a so-called pseudo-column. So do not assume that partitioning splits an existing column of the table by its values. In the example, the column country on which the partition is based does not really exist in the data files; it is a pseudo-column we added for ease of management, and its value is specified by us at load time rather than read from the data. We cannot partition by a column that actually exists in the data file, such as id.
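As a minimal illustration of the pseudo-column behavior described above (reusing the t_sz_part table from this section), the partition field can be selected and filtered like an ordinary column even though it is stored as a directory name rather than inside the data files:

select id, name, country from t_sz_part where country = 'China';

On HDFS, this partition corresponds to a subdirectory named country=China under the table's directory, which is what makes such queries efficient: only the matching folder needs to be read.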

2. Hive's bucketing function

Command: create a bucketed table

create table t_buck(id string, name string)
clustered by (id) sorted by (id) into 4 buckets;

Analysis: clustered by (id) means the table is divided into 4 buckets according to id, and the data within each bucket is sorted by id.

After data is inserted into this table with bucketing enforced (see below), four bucket files will appear under the table's corresponding HDFS directory. For example:
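(a hedged illustration, assuming the default warehouse location)

/user/hive/warehouse/t_buck/000000_0
/user/hive/warehouse/t_buck/000001_0
/user/hive/warehouse/t_buck/000002_0
/user/hive/warehouse/t_buck/000003_0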

Possible problem: after running "insert into t_buck select * from other", the t_buck directory contains only one file instead of four bucket files. The following steps are required:

① Set the following variables:

-- Enforce bucketing and set the number of reducers to the number of buckets
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

② Use the "insert ... select ..." command to insert data into t_buck; this finally produces the four bucket files.

Additional note: insert into t_buck select id,name from t_p distribute by (id) sort by (id);

distribute by (id) specifies the field used to distribute rows among reducers; sort by (id) specifies the field each reducer sorts on.

When the distribution field and the sort field are the same, distribute by (sno) sort by (sno asc) can be replaced by cluster by (sno); cluster by is equivalent to distribute by plus sort by.
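A minimal sketch of this equivalence, reusing the t_buck and t_p tables from above (note that cluster by only supports ascending order):

-- The following two statements distribute and sort the data identically:
insert into t_buck select id,name from t_p distribute by (id) sort by (id asc);
insert into t_buck select id,name from t_p cluster by (id);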

The difference between partitioning and bucketing: partitioning is based on pseudo-columns, while bucketing is a finer-grained division relative to partitioning. Bucketing distributes the entire data set according to the hash value of some column. For example, to divide the data into 3 buckets by the name attribute, take the hash of each name value modulo 3 and assign records to buckets by the result: records whose hash modulo 3 is 0 go into one file, those with remainder 1 into a second file, and those with remainder 2 into a third.

Unlike partitioning, which is based not on columns in the real data files but on pseudo-columns that we specify, bucketing is based on real columns that exist in the table. This is why, when specifying the column a partition is based on, you must also specify its type: that column does not exist in the data files, so declaring it is effectively creating a new column. A bucketing column, by contrast, already exists in the table with a known data type, so there is no need to specify its type.
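A minimal sketch of the three-bucket example above (the table name t_name_buck is illustrative, not from the original text):

-- Rows are assigned to bucket files by hash(name) % 3:
-- remainder 0 goes to the first file, 1 to the second, 2 to the third.
create table t_name_buck(id int, name string)
clustered by (name) into 3 buckets;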

3. Analysis of Hive's join operation

To be added later.
