Impala row_number()使用

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/jeffiny/article/details/80453276

1、需求:找到每个分组中pv最大的第一条数据,取10万条;

row_number 格式:ROW_NUMBER() OVER (partition BY COLUMN_A ORDER BY COLUMN_B ASC/DESC) rn

其中:

    rn 是排序的别名执行时每组的编码从1开始 
    partition by:类似hive的建表,分区的意思;COLUMN_A 是分组字段 

    order by :排序,默认是升序,加desc降序;COLUMN_B 是排序字段

注解:其实这个排序具体理解如下:

1、按照sql中的group 操作之后会得到很多的分类;

2、这个分类不变,按照 partition by 后的字段COLUMN_A 作为ROW_NUMBER要分组字段,如果字段相同,则会产生1到N的行号;

3 、orderby 的作用就是按照那个字段值来排序,以此产生是从到大到小还是从小到大的排序;

 本文查询用例:

select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt  where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 100000;

查询结果:(正确取到数据)

扫描二维码关注公众号,回复: 3172223 查看本文章
***********************************************************************************
[evercloud17:21000] >  select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt  where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 10;
Query: select md5(stt.msisdn),stt.os_code from (select ro.msisdn,ro.imei,ro.pv,ro.os_code,ROW_NUMBER() OVER(PARTITION BY ro.msisdn ORDER BY ro.pv desc) AS rn from rong_getCustomers_data ro join (select msisdn,max(pv) pv from rong_getCustomers_data group by msisdn)st1 on ro.msisdn=st1.msisdn and ro.pv=st1.pv )stt  where rn <=1 and length(stt.msisdn)=11 order by stt.pv desc limit 10
+----------------------------------+---------+
| default.md5(stt.msisdn)          | os_code |
+----------------------------------+---------+
| f2f15925d59d38675612bcf70dc114e9 | 1       |
| 38b63fee0f74c0ffdf3572fd4a9d5e64 | 1       |
| 94b03cf77619a0e9e0309ad1c0a1a46b | 1       |
| a5ff2002635605e8554ee324f7051932 | 3       |
| d7a83461d3c12ebb7b1ed17eb5f64dd2 | 1       |
| f037d0028f8da6d83695ec6e54229db2 | 1       |
| d2c955d29caba4bba41f280f0aee88b7 | 1       |
| 448f4cc33a538a2c6c9fc003f34ca569 | 1       |
| 337c229108d579fea706b57b3212d690 | 1       |
| 39c2352f194d5b2cf077f8708d4a6665 | 1       |
+----------------------------------+---------+
Fetched 10 row(s) in 3.54s

2、测试其有聚合函数时候的写法和正确性;

2.1 聚合函数直接写:

select label,ecid,eid,findedname,ppv  from (select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.label ORDER BY sum(pv) desc) AS rn  from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid  where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)sts where rn <=50 ; 

其中rownum 的order by 字段用聚合函数:就是用的分组后聚合---按照原来的分组聚合,不是按照COLUMN_A  分组聚合;

2.2 为了测试正确性--分开写,先统计后再用row_number;

select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid  where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)stt)sttt where rn <=50; 

2.3、测试结果:

(1)、整体写的结果:

Query: select label,ecid,eid,findedname,ppv  from (select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv,ROW_NUMBER() OVER(PARTITION BY ap.label ORDER BY sum(pv) desc) AS rn  from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid  where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)sts where rn <=50 ; 
+----------+------+------+-----------------------------------------------+-------------+----+
| label    | ecid | eid  | findedname                                    | ppv         | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| 金融理财 | 9    | 3    | 支付宝                                        | 1274317627  | 1  |
| 金融理财 | 9    | 6    | 财付通                                        | 258357291   | 2  |
| 金融理财 | 21   | 8108 | 同花顺炒股票                                  | 242589336   | 3  |
| 金融理财 | 9    | 5    | 招商银行                                      | 203848183   | 4  |
| 金融理财 | 16   | 7128 | 平安普惠                                      | 178164508   | 5  |
| 金融理财 | 16   | 3    | 同花顺股票开户                                | 142253306   | 6  |
| 金融理财 | 16   | 51   | 财富                                          | 131791277   | 7  |
| 金融理财 | 21   | 7512 | 百度钱包                                      | 104820340   | 8  |
| 金融理财 | 9    | 9    | 中国建设银行                                  | 77952913    | 9  |
| 金融理财 | 9    | 12   | 农行掌上银行                                  | 74952353    | 10 |

(2)、分开写结果:

Query: select * from (select label,ecid,eid,findedname,ppv,ROW_NUMBER() OVER(PARTITION BY label ORDER BY ppv desc) AS rn from( select ap.label,st.ecid,st.eid,ap.findedname,sum(pv) ppv from statistic_day_app_user st join app_label_class ap on st.ecid=ap.ecid and st.eid=ap.eid  where find_in_set(ap.label,'旅游出行,网上购物,金融理财')>0 and day>=20180806 and day<=20180906 group by ap.label,st.ecid,st.eid,ap.findedname)stt)sttt where rn <=50
+----------+------+------+-----------------------------------------------+-------------+----+
| label    | ecid | eid  | findedname                                    | ppv         | rn |
+----------+------+------+-----------------------------------------------+-------------+----+
| 金融理财 | 9    | 3    | 支付宝                                        | 1274317627  | 1  |
| 金融理财 | 9    | 6    | 财付通                                        | 258357291   | 2  |
| 金融理财 | 21   | 8108 | 同花顺炒股票                                  | 242589336   | 3  |
| 金融理财 | 9    | 5    | 招商银行                                      | 203848183   | 4  |
| 金融理财 | 16   | 7128 | 平安普惠                                      | 178164508   | 5  |
| 金融理财 | 16   | 3    | 同花顺股票开户                                | 142253306   | 6  |
| 金融理财 | 16   | 51   | 财富                                          | 131791277   | 7  |
| 金融理财 | 21   | 7512 | 百度钱包                                      | 104820340   | 8  |
| 金融理财 | 9    | 9    | 中国建设银行                                  | 77952913    | 9  |
| 金融理财 | 9    | 12   | 农行掌上银行                                  | 74952353    | 10 |

结论:两种写法均正确;

猜你喜欢

转载自blog.csdn.net/jeffiny/article/details/80453276