Spark: handling count(distinct) over() in data processing
Business description
The business requirement is to filter out rows where the same device has multiple accounts, and rows where the same account appears on multiple devices, keeping only the rows where device and account map one-to-one.
In relational terms: A and B are many-to-many, and we need to find and keep only the pairs where A maps to exactly one B and B maps to exactly one A.
Data preparation
Device id | Account
---|---
1 | a
2 | b
2 | c
3 | b
3 | d
4 | d
5 | e
5 | e
/*
From the data, only (1, a) and (5, e) satisfy the one-to-one requirement:
device 2 has two accounts (b, c), device 3 has two accounts (b, d),
and account d appears on two devices (3, 4).
*/
-- First, reproduce the processing in Hive; hive --version => Hive 2.1.1-cdh6.2.0
WITH da AS(
SELECT 1 dev_id, 'a' acc UNION ALL
SELECT 2 dev_id, 'b' acc UNION ALL
SELECT 2 dev_id, 'c' acc UNION ALL
SELECT 3 dev_id, 'b' acc UNION ALL
SELECT 3 dev_id, 'd' acc UNION ALL
SELECT 4 dev_id, 'd' acc UNION ALL
SELECT 5 dev_id, 'e' acc UNION ALL
SELECT 5 dev_id, 'e' acc)
SELECT dev_id, acc FROM
(SELECT dev_id, -- device
        acc,    -- account
        COUNT(DISTINCT dev_id) OVER(PARTITION BY acc) sadd_cnt, -- distinct devices per account
        COUNT(DISTINCT acc) OVER(PARTITION BY dev_id) sdda_cnt  -- distinct accounts per device
 FROM da) t WHERE sadd_cnt = 1 AND sdda_cnt = 1;
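Before moving on, the one-to-one filter itself is easy to sanity-check outside SQL. Below is a minimal plain-Scala sketch of the same logic over the article's sample data; the object and value names are illustrative, not from the original. Note that the SQL query returns the duplicate (5, e) row twice, while the `.distinct` here collapses it:

```scala
object OneToOne {
  // Sample device/account pairs from the article's test data.
  val pairs: List[(Int, String)] = List(
    (1, "a"), (2, "b"), (2, "c"), (3, "b"),
    (3, "d"), (4, "d"), (5, "e"), (5, "e"))

  // Distinct devices per account — what sadd_cnt computes in SQL.
  val devsPerAcc: Map[String, Int] =
    pairs.groupBy(_._2).map { case (acc, ps) => acc -> ps.map(_._1).distinct.size }

  // Distinct accounts per device — what sdda_cnt computes in SQL.
  val accsPerDev: Map[Int, Int] =
    pairs.groupBy(_._1).map { case (dev, ps) => dev -> ps.map(_._2).distinct.size }

  // Keep only one-to-one pairs; .distinct collapses the duplicate (5, "e") row.
  val oneToOne: List[(Int, String)] = pairs.filter { case (dev, acc) =>
    devsPerAcc(acc) == 1 && accsPerDev(dev) == 1
  }.distinct

  def main(args: Array[String]): Unit =
    println(oneToOne) // List((1,a), (5,e))
}
```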
The problem appears
Porting the HiveQL to Spark fails: Spark SQL does not support DISTINCT inside a window function, so the query below throws an AnalysisException ("Distinct window functions are not supported"):
spark.sql(
s"""
|SELECT dev_id, acc FROM
|(SELECT dev_id, -- device
|acc, -- account
|COUNT(DISTINCT dev_id) OVER(PARTITION BY acc) sadd_cnt, -- distinct devices per account
|COUNT(DISTINCT acc) OVER(PARTITION BY dev_id) sdda_cnt  -- distinct accounts per device
|from (SELECT 1 dev_id, 'a' acc UNION ALL
|SELECT 2 dev_id, 'b' acc UNION ALL
|SELECT 2 dev_id, 'c' acc UNION ALL
|SELECT 3 dev_id, 'b' acc UNION ALL
|SELECT 3 dev_id, 'd' acc UNION ALL
|SELECT 4 dev_id, 'd' acc UNION ALL
|SELECT 5 dev_id, 'e' acc UNION ALL
|SELECT 5 dev_id, 'e' acc) )
|t where sadd_cnt = 1 and sdda_cnt = 1
""".stripMargin).show()
Solve the problem
Borrow the Hive workaround for count(distinct) over(): replace COUNT(DISTINCT x) OVER(...) with SIZE(COLLECT_SET(x) OVER(...)), which gathers the distinct values in each partition and counts them.
WITH da AS (
SELECT 1 dev_id, 'a' acc UNION ALL
SELECT 2 dev_id, 'b' acc UNION ALL
SELECT 2 dev_id, 'c' acc UNION ALL
SELECT 3 dev_id, 'b' acc UNION ALL
SELECT 3 dev_id, 'd' acc UNION ALL
SELECT 4 dev_id, 'd' acc UNION ALL
SELECT 5 dev_id, 'e' acc UNION ALL
SELECT 5 dev_id, 'e' acc)
SELECT dev_id, acc FROM
(SELECT dev_id, -- device
        acc,    -- account
        SIZE(COLLECT_SET(dev_id) OVER(PARTITION BY acc)) sadd_cnt, -- distinct devices per account
        SIZE(COLLECT_SET(acc) OVER(PARTITION BY dev_id)) sdda_cnt  -- distinct accounts per device
FROM da) t WHERE sadd_cnt = 1 AND sdda_cnt = 1;
The same rewrite runs successfully in Spark:
spark.sql(
s"""
|WITH da AS (
|SELECT 1 dev_id, 'a' acc UNION ALL
|SELECT 2 dev_id, 'b' acc UNION ALL
|SELECT 2 dev_id, 'c' acc UNION ALL
|SELECT 3 dev_id, 'b' acc UNION ALL
|SELECT 3 dev_id, 'd' acc UNION ALL
|SELECT 4 dev_id, 'd' acc UNION ALL
|SELECT 5 dev_id, 'e' acc UNION ALL
|SELECT 5 dev_id, 'e' acc)
|SELECT dev_id, acc FROM
|(SELECT dev_id, -- device
|acc, -- account
|SIZE(COLLECT_SET(dev_id) OVER(PARTITION BY acc)) sadd_cnt, -- distinct devices per account
|SIZE(COLLECT_SET(acc) OVER(PARTITION BY dev_id)) sdda_cnt  -- distinct accounts per device
|FROM da) t WHERE sadd_cnt = 1 AND sdda_cnt = 1
""".stripMargin).show()
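The same workaround can also be expressed with the DataFrame API instead of a SQL string. The sketch below assumes a local SparkSession (the `da` and `result` names are illustrative); `size(collect_set(...))` over a window partition counts distinct values, sidestepping the unsupported COUNT(DISTINCT ...) OVER(...):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, size}

val spark = SparkSession.builder().appName("one-to-one").master("local[*]").getOrCreate()
import spark.implicits._

// Same sample data as the SQL UNION ALL block.
val da = Seq((1, "a"), (2, "b"), (2, "c"), (3, "b"),
             (3, "d"), (4, "d"), (5, "e"), (5, "e")).toDF("dev_id", "acc")

val result = da
  // collect_set gathers distinct values per partition; size counts them.
  .withColumn("sadd_cnt", size(collect_set("dev_id").over(Window.partitionBy("acc"))))
  .withColumn("sdda_cnt", size(collect_set("acc").over(Window.partitionBy("dev_id"))))
  .where("sadd_cnt = 1 AND sdda_cnt = 1")
  .select("dev_id", "acc")

result.show()
```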
Reference
Hive: workaround when count(distinct) over() cannot be used — https://www.cnblogs.com/luckyfruit/p/13093203.html