Spark count(distinct) over() data processing

Business description

There is a business requirement to filter out rows where one device is linked to several accounts, as well as rows where one account is linked to several devices, keeping only the rows where device and account map one-to-one.
In relational-database terms: A and B are in a many-to-many relationship, and we need to find and keep only the pairs in which A maps to exactly one B and B to exactly one A.

Data preparation

Device ID   Account
1 a
2 b
2 c
3 b
3 d
4 d
5 e
5 e
/*
Looking at the data, only (1, a) and (5, e) satisfy the one-to-one requirement:
device 2 has two accounts (b, c), device 3 has two accounts (b, d),
and account d is linked to two devices (3, 4).
*/
-- First, run the query in Hive. hive --version  =>  Hive 2.1.1-cdh6.2.0
WITH da AS(
SELECT 1 dev_id, 'a' acc UNION ALL
SELECT 2 dev_id, 'b' acc UNION ALL
SELECT 2 dev_id, 'c' acc UNION ALL
SELECT 3 dev_id, 'b' acc UNION ALL
SELECT 3 dev_id, 'd' acc UNION ALL
SELECT 4 dev_id, 'd' acc UNION ALL
SELECT 5 dev_id, 'e' acc UNION ALL
SELECT 5 dev_id, 'e' acc)
SELECT dev_id, acc FROM
(SELECT dev_id , -- device
acc , -- account
COUNT(DISTINCT dev_id) OVER(PARTITION BY acc) sadd_cnt, -- distinct devices per account
COUNT(DISTINCT acc) OVER(PARTITION BY dev_id) sdda_cnt  -- distinct accounts per device
from da) t where sadd_cnt = 1 and  sdda_cnt = 1;
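As a sanity check on the expected result, the same one-to-one filter can be sketched with plain Scala collections, independent of Hive or Spark (the variable names here are illustrative, not from the original):

```scala
// Plain-Scala sketch of the one-to-one device/account filter (no Spark needed).
val pairs = Seq((1, "a"), (2, "b"), (2, "c"), (3, "b"),
                (3, "d"), (4, "d"), (5, "e"), (5, "e"))

// Distinct devices per account and distinct accounts per device,
// mirroring the two COUNT(DISTINCT ...) OVER(...) window columns.
val devsPerAcc = pairs.groupBy(_._2).map { case (acc, ps) => acc -> ps.map(_._1).distinct.size }
val accsPerDev = pairs.groupBy(_._1).map { case (dev, ps) => dev -> ps.map(_._2).distinct.size }

// Keep only rows where both counts are 1; duplicate (5, e) rows survive,
// just as they do in the window-function query.
val oneToOne = pairs.filter { case (dev, acc) => devsPerAcc(acc) == 1 && accsPerDev(dev) == 1 }
println(oneToOne)  // List((1,a), (5,e), (5,e))
```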

(Screenshot: Hive returns the rows (1, a), (5, e), (5, e).)

The problem

Porting the HiveQL to Spark:

spark.sql(
  s"""
     |SELECT dev_id, acc FROM
     |(SELECT dev_id , -- device
     |acc , -- account
     |COUNT(DISTINCT dev_id) OVER(PARTITION BY acc) sadd_cnt, -- distinct devices per account
     |COUNT(DISTINCT acc) OVER(PARTITION BY dev_id) sdda_cnt  -- distinct accounts per device
     |from (SELECT 1 dev_id, 'a' acc UNION ALL
     |SELECT 2 dev_id, 'b' acc UNION ALL
     |SELECT 2 dev_id, 'c' acc UNION ALL
     |SELECT 3 dev_id, 'b' acc UNION ALL
     |SELECT 3 dev_id, 'd' acc UNION ALL
     |SELECT 4 dev_id, 'd' acc UNION ALL
     |SELECT 5 dev_id, 'e' acc UNION ALL
     |SELECT 5 dev_id, 'e' acc) )
     |t where sadd_cnt = 1 and  sdda_cnt = 1
   """.stripMargin).show()

(Screenshot: Spark fails with an AnalysisException, because distinct window functions are not supported.)

Solving the problem

Following the Hive workaround for count(distinct) over(), rewrite it with size(collect_set() over()):

WITH da AS (
SELECT 1 dev_id, 'a' acc UNION ALL
SELECT 2 dev_id, 'b' acc UNION ALL
SELECT 2 dev_id, 'c' acc UNION ALL
SELECT 3 dev_id, 'b' acc UNION ALL
SELECT 3 dev_id, 'd' acc UNION ALL
SELECT 4 dev_id, 'd' acc UNION ALL
SELECT 5 dev_id, 'e' acc UNION ALL
SELECT 5 dev_id, 'e' acc)
SELECT dev_id, acc FROM
(SELECT  dev_id , -- device
acc , -- account
SIZE(COLLECT_SET( dev_id) OVER(PARTITION BY acc)) sadd_cnt, -- distinct devices per account
SIZE(COLLECT_SET( acc) OVER(PARTITION BY dev_id)) sdda_cnt  -- distinct accounts per device
FROM da) t WHERE sadd_cnt = 1 AND  sdda_cnt = 1;
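The rewrite works because collect_set gathers the distinct values within each partition, so its size equals the distinct count. A minimal plain-Scala sketch of that identity (the sample values are illustrative):

```scala
// collect_set over a partition is just the set of that partition's values,
// so size(collect_set(x)) == count(distinct x).
val accsForDev3 = Seq("b", "d", "b")     // values of acc within one partition (with a duplicate)
val collectSet  = accsForDev3.toSet      // what COLLECT_SET(acc) OVER(...) would gather
println(collectSet.size)                 // 2, same as COUNT(DISTINCT acc)
assert(collectSet.size == accsForDev3.distinct.size)
```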

(Screenshot: the rewritten Hive query returns the same rows: (1, a), (5, e), (5, e).)

The same rewrite in Spark Scala code:

spark.sql(
  s"""
     |WITH da AS (
     |SELECT 1 dev_id, 'a' acc UNION ALL
     |SELECT 2 dev_id, 'b' acc UNION ALL
     |SELECT 2 dev_id, 'c' acc UNION ALL
     |SELECT 3 dev_id, 'b' acc UNION ALL
     |SELECT 3 dev_id, 'd' acc UNION ALL
     |SELECT 4 dev_id, 'd' acc UNION ALL
     |SELECT 5 dev_id, 'e' acc UNION ALL
     |SELECT 5 dev_id, 'e' acc)
     |SELECT dev_id, acc FROM
     |(SELECT  dev_id , -- device
     |acc , -- account
     |SIZE(COLLECT_SET( dev_id) OVER(PARTITION BY acc)) sadd_cnt, -- distinct devices per account
     |SIZE(COLLECT_SET( acc) OVER(PARTITION BY dev_id)) sdda_cnt  -- distinct accounts per device
     |FROM da) t WHERE sadd_cnt = 1 AND  sdda_cnt = 1
   """.stripMargin).show()

(Screenshot: Spark now also returns (1, a), (5, e), (5, e).)

Reference

Workaround for count(distinct) over() not being usable in Hive: https://www.cnblogs.com/luckyfruit/p/13093203.html

Origin blog.csdn.net/dbc_zt/article/details/110499440