Review: Talking about Hive random sampling


When the data volume is very large, a common approach is to sample the data first and then run model analysis on the sample. Hive is a staple of the data warehouse, so how do we sample data in Hive?
Of course, another purpose of writing this article is to review Hive's four "by" clauses. Do you still remember them?
Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY
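
As a quick refresher, here is a minimal sketch of the four clauses (the column name id is just an example, not from any particular table):

-- order by: global total ordering; forces everything through a single reducer
select * from my_table order by id;

-- sort by: sorts rows only within each reducer; no global order
select * from my_table sort by id;

-- distribute by: controls which reducer each row is sent to (by hash of the expression)
select * from my_table distribute by id;

-- cluster by: shorthand for distribute by + sort by on the same expression
select * from my_table cluster by id;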

Suppose you have a Hive table with 10 billion rows, and you want to take an efficient random sample of a fixed size, say 10,000 rows. The most obvious (and obviously wrong) approach is:


select * from my_table
limit 10000;

If the table is not sorted, Hive makes no guarantee about the order of the rows, but in practice they come back in the order they sit in the files, so this is far from truly random. Next you might try:


select * from my_table
order by rand()
limit 10000;

This does produce a truly random sample, but the performance is poor. To achieve a total ordering, Hive must push all of the data through a single reducer, and that reducer then sorts the entire data set. That is bad. Fortunately, Hive has a non-standard "sort by" clause, which sorts only within each reducer and makes no guarantee about ordering across reducers:


select * from my_table
sort by rand()
limit 10000;

This is much better, but I am not convinced it is truly random. The problem is that how Hive splits the data across reducers is undefined. It might be truly random, it might be based on file order, or it might be based on some value in the data. How Hive implements the limit clause across reducers is also undefined. Perhaps it pulls from the reducers in order: all of the data in reducer 0, then all of the data in reducer 1, and so on. Perhaps it round-robins through them and mixes everything together.

In the worst case, suppose the reduce key is based on a column of the data and the limit clause takes the reducers in order. The sample would then be heavily skewed.
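
To make that worst case concrete, here is an illustrative query (the country column is made up) that deliberately reproduces the situation: rows are partitioned by a data column, so if "limit" drains the reducers in order, the sample comes almost entirely from a few values of that column:

-- hypothetical worst case: partition by an actual data column instead of rand()
select * from my_table
distribute by country   -- "country" is an example column; the sample skews toward values in the first reducers
sort by rand()
limit 10000;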

The solution is another non-standard Hive feature: "distribute by". For queries where the reduce key is not determined by the query structure (no "group by", no join), it lets you specify exactly what the reduce key should be. If we distribute randomly and also sort randomly within each reducer, then it no longer matters how "limit" behaves:


-- distribute by rand() sends each row to a random reducer;
-- sort by rand() then shuffles the rows within each reducer
select * from my_table
distribute by rand()
sort by rand()
limit 10000;

Finally, as a last optimization, you can filter on the map side. If the total size of the table is known, it is easy to set a random threshold as a filter condition, as shown below:


-- the where clause drops most rows on the map side, so far less data is
-- shuffled to the reducers; 0.0001 is generous relative to the fraction needed
select * from my_table
where rand() <= 0.0001
distribute by rand()
sort by rand()
limit 10000;

In this case, since the table holds 10 billion rows and the sample size is 10,000, the sample is exactly 0.000001 of the data. However, with a where clause of "rand() < 0.000001", the final output might well contain fewer than 10,000 rows. "rand() < 0.000002" would probably work, but that depends on a very good implementation of rand(). In the end it hardly matters, because the bottleneck is the full table scan, not the data transferred to the reducers.
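
For reference, here is the arithmetic behind the threshold, written out as a sketch (the 10x safety margin is an assumption, not something from the original post):

-- desired sample size:  10,000 rows
-- total table size:     10,000,000,000 rows
-- exact fraction:       10,000 / 10,000,000,000 = 0.000001
-- filter on 10x that fraction and let "limit" trim the surplus
select * from my_table
where rand() <= 0.00001
distribute by rand()
sort by rand()
limit 10000;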


Origin blog.51cto.com/15127544/2664907