Detailed explanation of HiveSQL quantile function percentile() + example code

Table of contents

Foreword

1. percentile()

2. percentile_approx()

Follow to keep up with new posts. If you spot any mistakes, please leave a comment; thank you very much.



Foreword

As a data analyst, you need a solid command of the functions and techniques of each SQL dialect, especially the statistical functions. There are several ways to compute the median, mode, and quantiles of a data set. In practice, most business analysis is done by writing SQL queries, because analyzing the data with Python Pandas means exporting it, reading it in, and then loading the results back, which is cumbersome. If a simple problem can be handled directly in SQL, that is much more efficient than exporting the data for Pandas. This article introduces the percentile (quantile) functions. The next few articles will explain the statistical functions of each SQL dialect in detail. If you find this interesting or helpful, feel free to follow; this blog will be maintained for the long term. If there are any mistakes, please point them out in the comment area.


In HiveSQL, quantiles are computed mainly with two functions: percentile() and percentile_approx().

1. percentile()

Function usage syntax:

percentile(col, p)

Parameter description:
col: the name of the column to compute over. The column's values must be of an integer type.

p: the quantile to compute, in the range [0,1]. 0.5 gives the median, 0.75 gives the third quartile, and so on.

Example use:

-- 20th percentile of num
SELECT percentile(num, 0.2) AS two_parts
FROM dbbasename.table
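To make the meaning of p concrete, here is a minimal Python sketch of an exact percentile. It assumes linear interpolation between the two closest sorted values at position p * (n - 1), which is my understanding of how Hive's percentile() behaves; the function name exact_percentile and the sample data are made up for illustration.

```python
import math

def exact_percentile(values, p):
    """Exact percentile, interpolating linearly at position p * (n - 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(values)
    if not ordered:
        return None  # assumption: no rows means no result (NULL in SQL terms)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # The position falls exactly on one of the sorted values.
        return float(ordered[lower])
    # Otherwise blend the two neighbours by their distance to the position.
    return (upper - position) * ordered[lower] + (position - lower) * ordered[upper]

nums = [10, 20, 30, 40, 50]  # hypothetical contents of the num column
print(exact_percentile(nums, 0.5))  # 30.0, the median
print(exact_percentile(nums, 0.2))
```

For p = 0.2 the position is 0.8, so the result lies 80% of the way from 10 to 20, that is, about 18.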

 

In addition, p can be passed as an array. The function then returns an array containing the percentile for each value in the input array:

-- 20th, 40th, and 60th percentiles of num
SELECT percentile(num, array(0.2, 0.4, 0.6)) AS parts
FROM dbbasename.table
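The array form can be pictured as sorting the data once and reading off several positions. The Python sketch below illustrates that idea under the same assumption of linear interpolation at position p * (n - 1); the name percentiles and the sample data are hypothetical, and this is not Hive's actual code.

```python
import math

def percentiles(values, ps):
    """Return one interpolated percentile per entry of ps."""
    ordered = sorted(values)  # sort once, reuse for every p
    if not ordered:
        raise ValueError("values must not be empty")
    last = len(ordered) - 1
    results = []
    for p in ps:
        position = p * last
        lower, upper = math.floor(position), math.ceil(position)
        weight = position - lower  # how far the position is past `lower`
        results.append((1 - weight) * ordered[lower] + weight * ordered[upper])
    return results

nums = [10, 20, 30, 40, 50]  # hypothetical contents of the num column
print(percentiles(nums, [0.25, 0.5, 0.75]))  # [20.0, 30.0, 40.0]
```

The result has the same length and order as the input array of p values, which mirrors what the query above returns.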

2. percentile_approx()

Function usage syntax:

percentile_approx(DOUBLE col, p [, B])

Returns the approximate p-th percentile of col. p must be between 0 and 1, and the return type is double. Unlike percentile(), col may be a floating-point column. The optional parameter B controls the approximation accuracy at the cost of memory: the larger B is, the more accurate the result. B defaults to 10000. When the number of distinct values in col is less than B, the result is the exact percentile.

-- approximate 20th percentile of num, with B = 9999
SELECT percentile_approx(num, 0.2, 9999) AS two_parts
FROM dbbasename.table
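To give an intuition for how a parameter like B can trade accuracy for memory, here is a simplified Python sketch of a histogram that keeps at most max_bins bins and merges the two closest bins when it overflows. This is an illustration only, not Hive's actual implementation; build_histogram, approx_quantile, and max_bins are hypothetical names, and the quantile lookup deliberately skips interpolation.

```python
def build_histogram(values, max_bins):
    """Summarise values with at most max_bins [center, count] bins."""
    bins = []  # kept sorted by center
    for v in values:
        centers = [b[0] for b in bins]
        if v in centers:
            bins[centers.index(v)][1] += 1  # value already has its own bin
        else:
            bins.append([float(v), 1])
            bins.sort(key=lambda b: b[0])
        if len(bins) > max_bins:
            # Merge the two adjacent bins whose centers are closest together.
            j = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
            (c1, n1), (c2, n2) = bins[j], bins[j + 1]
            bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins

def approx_quantile(bins, p):
    """Center of the first bin whose cumulative count reaches p * total."""
    total = sum(n for _, n in bins)
    running = 0
    for center, n in bins:
        running += n
        if running >= p * total:
            return center
    return bins[-1][0]

data = [1, 2, 2, 3]
print(build_histogram(data, 10))  # [[1.0, 1], [2.0, 2], [3.0, 1]], nothing merged
print(approx_quantile(build_histogram(data, 10), 0.5))  # 2.0
print(len(build_histogram(data, 2)))  # 2, bins were merged to fit
```

With fewer distinct values than max_bins, no merging happens and the histogram loses no information, which matches the intuition behind the "exact when distinct values are fewer than B" statement above. With a small max_bins, nearby values are averaged together and the answer becomes approximate.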

 

As with percentile(), passing an array for p returns an array of results:

-- approximate quartiles of num
SELECT percentile_approx(num, array(0.25, 0.5, 0.75)) AS parts
FROM dbbasename.table

 


That's all for this issue. I'm fanstuck. If you have any questions, feel free to leave a message to discuss. See you in the next issue.



Origin blog.csdn.net/master_hunter/article/details/126642158