Hive random sampling - sampling queries

Background

Analyzing massive amounts of data and running data modeling tasks is very time-consuming, and mining the full dataset ties up cluster resources. Under normal circumstances it is enough to extract a small portion of the data for analysis and modeling.
Hive provides data sampling (SAMPLING) functionality, which samples data according to certain rules. It currently supports block sampling, bucket sampling, and random sampling, as shown below:

1, Random sampling (rand() function)
  • Use the rand() function for random sampling; the limit keyword restricts the number of sampled rows returned. The distribute and sort keywords placed in front of rand() ensure that the data is randomly distributed in both the mapper and reducer phases.
    Example: randomly select 100 rows from table app.table_name for a given day (datekey = '2018-11-14')
 select * from app.table_name where datekey='2018-11-14' distribute by rand() sort by rand() limit 100;  
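  • On tables with tens of millions of rows, random sampling with order by takes much longer and is not recommended, because order by sorts all rows globally in a single reducer.
    Example: randomly select 100 rows from a table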
 select * from app.table_name order by rand() limit 100;
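The `distribute by rand() sort by rand() limit 100` pattern can be pictured with a small Python simulation. This is only a sketch of the idea, not Hive's actual execution plan: the function name `distribute_sort_limit`, the reducer count, and the final merge step are assumptions made for illustration.

```python
import random

def distribute_sort_limit(rows, num_reducers, limit, seed=0):
    """Toy model of: distribute by rand() sort by rand() limit N."""
    rng = random.Random(seed)

    # distribute by rand(): every row goes to a randomly chosen reducer
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[rng.randrange(num_reducers)].append(row)

    # sort by rand(): each reducer orders its own rows by a random key
    # and only needs to keep its first `limit` rows
    candidates = []
    for part in reducers:
        keyed = [(rng.random(), row) for row in part]
        keyed.sort(key=lambda pair: pair[0])
        candidates.extend(keyed[:limit])

    # limit: merge the small per-reducer results and keep `limit` rows
    # (assumed merge step, chosen so the result is a uniform sample)
    candidates.sort(key=lambda pair: pair[0])
    return [row for _, row in candidates[:limit]]

sample = distribute_sort_limit(list(range(10_000)), num_reducers=4, limit=100)
print(len(sample))  # 100
```

Each reducer sorts only its own share of the rows, which is why this pattern spreads the work, whereas a global order by pushes every row through one sort.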

2, Block sampling (TABLESAMPLE clause)

1) tablesample(n percent) extracts data in proportion to the size of the Hive table's data, and the result can be stored in a new Hive table, for example to extract 10% of the original table. Block sampling lets Hive sample N rows, a percentage of the total data size (N percent), or N bytes of data.
Syntax:

SELECT * FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

(Note: testing showed that such a select statement cannot carry a where condition and that subqueries are not supported. This can be worked around by creating a new intermediate table or by using random sampling instead.)

create table xxx_new as select * from xxx tablesample(10 percent) 

2) tablesample(nM) samples data of the specified size, in megabytes (M).
3) tablesample(n rows) samples the specified number of rows, where n is the number of rows taken from each map task. The number of map tasks can be confirmed by running a simple query on the Hive table and looking for the keyword "number of mappers: x" in the output.
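Because n applies to each map task, the total number of rows returned by tablesample(n rows) depends on how many map tasks the query uses. A minimal Python sketch of that behaviour, assuming each inner list stands for the rows read by one map task (the helper name `sample_rows_per_split` and the split sizes are made up for illustration):

```python
def sample_rows_per_split(splits, n):
    """Take up to n rows from every split (one split is roughly one map task)."""
    sample = []
    for split in splits:
        sample.extend(split[:n])
    return sample

# Three hypothetical splits holding 50, 70 and 5 rows
splits = [list(range(0, 50)), list(range(50, 120)), list(range(120, 125))]
rows = sample_rows_per_split(splits, 10)
print(len(rows))  # 25: 10 + 10 + 5, not 10
```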

Block sampling examples:

Example: sample by percentage of data size

SELECT name FROM employees TABLESAMPLE(10 PERCENT) a;
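Percent sampling works on data size rather than row count, and, as far as the Hive documentation describes it, the granularity is a whole HDFS block, so the sample can be larger than the requested percentage. The sketch below models that under stated assumptions: the function name `block_sample` is hypothetical, and picking blocks in order from the first one is a simplification, not Hive's real block choice.

```python
def block_sample(block_sizes_mb, percent):
    """Pick whole blocks until at least `percent` of the total size is covered."""
    total = sum(block_sizes_mb)
    target = total * percent / 100
    picked, picked_size = [], 0
    for index, size in enumerate(block_sizes_mb):
        if picked_size >= target:
            break
        picked.append(index)      # a block is taken whole or not at all
        picked_size += size
    return picked, picked_size

blocks = [256] * 4                # a 1024 MB table stored as four 256 MB blocks
picked, size = block_sample(blocks, 10)
print(picked, size)               # [0] 256
```

Here 10 percent of the table is about 102 MB, but the smallest unit that can be read is one 256 MB block.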

Example: sample by data size

SELECT name FROM employees TABLESAMPLE(1M) a;

Example: sample by number of rows

SELECT * FROM source TABLESAMPLE(10 ROWS);

3, Bucket table sampling

Bucketing in Hive takes the hash of a chosen column modulo the number of buckets and places each row in the corresponding bucket. For example, if table table_1 is divided into 100 buckets by id, the algorithm is hash(id) % 100: rows with hash(id) % 100 = 0 go into the first bucket, and rows with hash(id) % 100 = 1 go into the second bucket. The key clause for creating a bucketed table is CLUSTERED BY.
Bucket sampling syntax:
TABLESAMPLE (BUCKET x OUT OF y [ON colname])
where x is the number of the bucket to sample (buckets are numbered from 1), colname is the column to sample on, and y is the number of buckets.
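The bucket arithmetic described above can be sketched in Python. This is an illustration only: the helper names are hypothetical, and the hash is replaced by a stand-in where an integer id hashes to itself, which is an assumption made to keep the output predictable.

```python
def bucket_number(row_id, num_buckets):
    """1-based bucket number for an integer id (stand-in for hash(id) % y)."""
    return row_id % num_buckets + 1

def tablesample_bucket(rows, x, y, key):
    """Keep the rows that fall into bucket x out of y buckets."""
    return [row for row in rows if bucket_number(key(row), y) == x]

rows = [{"id": i} for i in range(1, 21)]
first = tablesample_bucket(rows, x=1, y=10, key=lambda r: r["id"])
print([r["id"] for r in first])   # [10, 20]: ids with id % 10 == 0

print(bucket_number(101, 100))    # 2: 101 % 100 = 1 goes into the second bucket
```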
For example: randomly divide the table into 10 groups and extract the data in the first bucket

select * from table_01 tablesample(bucket 1 out of 10 on rand())

This form of sampling is best suited to bucketed tables. The RAND() function can be used to sample on the entire row instead of a single column. If the sampled column is also the CLUSTERED BY column, the TABLESAMPLE statement is more efficient.
Syntax:

SELECT * FROM <Table_Name> TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF <total number of buckets> ON [colname|RAND()]) table_alias;

Example:

SELECT * FROM employees TABLESAMPLE(BUCKET 2 OUT OF 4 ON RAND()) table_alias;
SELECT * FROM xxxxxx_uid_online_buck TABLESAMPLE(bucket 1 out of 2 on uid); 
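Sampling ON RAND() can be thought of as giving every row a random bucket number and keeping the rows that land in bucket x, so roughly 1/y of the table comes back. A minimal Python model of that idea, with a hypothetical helper name and a fixed seed used only to make the run repeatable:

```python
import random

def sample_bucket_on_rand(rows, x, y, seed=None):
    """Keep rows whose randomly drawn bucket (1..y) equals x."""
    rng = random.Random(seed)
    return [row for row in rows if rng.randrange(y) + 1 == x]

rows = list(range(10_000))
sample = sample_bucket_on_rand(rows, x=2, y=4, seed=42)
print(len(sample) / len(rows))   # close to 0.25, not exactly 0.25
```

Sampling on a column such as uid instead is deterministic: the same rows come back on every run, since the bucket depends only on the column value.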

4, Summary

Sampling and aggregation, aggregate functions in particular, are the primary means of processing data in big data work. Combining conditions with aggregate functions can accomplish almost any required data processing or grouping. Random sampling, block sampling, and bucket sampling are three of the more commonly used data sampling methods.

Origin blog.csdn.net/qq_29232943/article/details/104636172