SQL | 解决一个实际问题

1 背景

最近实习工作遇到了一个数据提取中的问题,特此记录一下,其实当时完全可以用Python处理,不过想锻炼一下自己SQL能力,所以直接提数过程中一步到位!

2 需求

原始部分数据长这样:

app_id device_uuid first_start_bg
0 2S5Wcx 000101E6-463B-4912-8FC0-575C430A514D app
1 2S5Wcx 00018582-CA28-43AC-A241-53461789673E app
2 2S5Wcx 0001EC8F-FBF3-4E68-B6A1-27BE87984077 app
3 2S5Wcx 0001FEF0-B860-4A29-96B8-7CF20F508BBB push
4 2S5Wcx 000277A9-2276-4D6C-B03F-1C9A815A07AE app
5 2S5Wcx 00028895-C316-4A8A-AE15-FC3F48E410B3 push
6 2S5Wcx 0002A870-AB1D-4FE0-BB34-BF158F901465 app
7 2S5Wcx 0005542B-EEC7-462A-912E-253B74643687 app
8 2S5Wcx D00FD1E1-B885-4EA3-9049-0748841EE17D app
9 2S5Wcx D02CDB1C-748D-4973-B269-740D27713B64 app
10 2S5Wcx 00008F69-3419-4677-8AEC-B9AAC66C16FE app
  • 其中first_start_bg这个字段表示:用户首次启动app是从哪一个来源来的,有app,push等等
  • mentor想要实现的是:**统计一个用户首次启动不同来源的次数!**作为新的变量!

3 思路

  • 试想,如果在已有的表格加上一列count计数,即一个用户首次启动的来源,以及对应一个时间段内对应的次数!这时候问题是不是就能解决了?
  • 一旦有了count列了,这个问题就变成了行转列的问题!参考之前博客文章:SQL | 行转列和列转行

那如何统计次数呢?

  • 可以考虑根据对分析的目标字段first_start_bg进行groupby! 【不同场景这个变量会有差异】
  • 分组之后统计每一个组内day的不同个数,即加个distinct,因为我们统计的单位是以天为单位!【不同场景这个变量会有差异】
  • 不同first_start_bg对应的次数就等于第二步统计的day的个数!于是大功告成!

4 SQL实现

4.1 第一步

select
    device_uuid,
    app_id,
    first_start_bg,
    count(distinct day) as first_start_day_count
from
    (
    select
        day,
        app_id,
        device_uuid,
        first_start_bg	
    from
        t1
    where
        day>=20190304 and day<=20190310
        and app_id in ('2x1kfBk63z','2S5Wcx') 
    )a
group by
    device_uuid,
    app_id,
    first_start_bg

结果为:

device_uuid first_start_bg first_start_day_count
0 641004239 app 1
1 1145635593 push 2
2 000001F6-1F32-4E4A-B74C-8CD740572643 push 5
3 0000EFAB-A42B-4D6B-A4F0-8F02D84B3C89 app 7
4 00019AA3-617F-48DC-AC52-33436C6A13EF app 3
5 0002476A-F198-4A15-9A4E-E3D0638DA51C app 4
6 000262ED-2E7C-4D97-BA1E-07EBE7FB1DFC app 7
7 0002F2E3-4D36-4D71-AAD3-438A6860CBF2 app 4
8 00041656-4C0B-449F-8318-B4D99E16A259 app 1
9 0005F1A4-9CA5-4599-843C-AEF4C955C43F push 2

4.2 第二步

select 
	device_uuid,
	app_id,
	sum(case when first_start_bg = 'app' then first_start_day_count else 0 end) as app_first_count,
	sum(case when first_start_bg = 'push' then first_start_day_count else 0 end) as push_first_count,
	sum(case when first_start_bg = 'ug' then first_start_day_count else 0 end) as ug_first_count,
	sum(case when first_start_bg = 'back' then first_start_day_count else 0 end) as back_first_count

from
	(
	select
		device_uuid,
		app_id,
		first_start_bg,
		count(distinct day) as first_start_day_count
	from
		(
		select
			day,
			app_id,
			device_uuid,
			first_start_bg	
		from
			edm_user_device_activity_day
		where
			day>=20190304 and day<=20190310
			and app_id in ('2x1kfBk63z','2S5Wcx') 
		)a
	group by
		device_uuid,
		app_id,
		first_start_bg
	)b
group by
	device_uuid,
	app_id

结果为:

device_uuid app_first_count push_first_count ug_first_count back_first_count
0 00.0c.e7.41.0a.01 1 0 0 0
1 6446738 0 0 5 0
2 00067E04-C950-4E27-BFC2-BAF3C0D9A31C 3 0 0 0
3 0007602B-F75D-4F85-B3A4-B985FF8B76C4 2 1 0 0
4 0007C915-BF4C-4136-911A-6D328F2EED65 7 0 0 0
5 0007E5CA-E956-433C-997E-78544138318E 1 0 0 0
6 00082AA8-C9F6-4D33-A8A1-6E8C9B6B7DCB 4 0 0 0
7 00082E68-0D8E-464B-BF43-2DACA0151C75 1 0 0 0

大功告成!所以就两步解决问题!思路理清了实现就不是问题!

5 参考资料

猜你喜欢

转载自blog.csdn.net/qq_27782503/article/details/90478132