【网站点击流数据分析】07-统计分析

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/shenchaohao12321/article/details/82975362

每一种统计指标都可以跟各维度表进行叉乘,从而得出各个维度的统计结果 ,为了在前端展示时速度更快,每一个指标都事先算出各维度结果存入mysql。

提前准备好维表数据,在hive仓库中创建相应维表,如:

时间维表:

create table v_time(year string,month string,day string,hour string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hadoop/v_time.txt' into table v_time;

v_time.txt

2013,09,18,00
2013,09,18,01
2013,09,18,02
2013,09,18,03
2013,09,18,04
2013,09,18,05
2013,09,18,06
。。。

 在实际生产中,究竟需要哪些统计指标通常由相关数据需求部门人员提出,而且会不断有新的统计需求产生,以下为网站流量分析中的一些典型指标示例。

1、PV统计

1.1、多维度统计PV总量

1.1.1、时间维度

--计算指定的某个小时pvs
select count(*),month,day,hour from dw_click.ods_weblog_detail group by month,day,hour;

--计算该处理批次(一天)中的各小时pvs
drop table dw_pvs_hour;
create table dw_pvs_hour(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_hour partition(datestr='2016-03-18')
select a.month as month,a.day as day,a.hour as hour,count(1) as pvs from ods_weblog_detail a
where  a.datestr='2016-03-18'  group by a.month,a.day,a.hour;

或者用时间维表关联

维度:日

drop table dw_pvs_day;
create table dw_pvs_day(pvs bigint,month string,day string);

insert into table dw_pvs_day
select count(1) as pvs,a.month as month,a.day as day  from dim_time a
join ods_weblog_detail b
on b.dd='18/Sep/2013' and a.month=b.month and a.day=b.day
group by a.month,a.day;

--或者,从之前算好的小时结果中统计
insert into table dw_pvs_day
select sum(pvs) as pvs,month,day from dw_pvs_hour group by month,day having day='18';

维度:月

drop table t_display_pv_month;
create table t_display_pv_month (pvs bigint,month string);
insert into table t_display_pv_month
select count(*) as pvs,a.month from t_dim_time a
join t_ods_detail_prt b on a.month=b.month group by a.month;

1.1.2、按终端维度统计pv总量

注:探索数据中的终端类型

select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Mozilla%' limit 200;

终端维度:uc

drop table t_display_pv_terminal_uc;
create table t_display_pv_ terminal_uc (pvs bigint,mm string,dd string,hh string);

终端维度:chrome

drop table t_display_pv_terminal_chrome;
create table t_display_pv_ terminal_ chrome (pvs bigint,mm string,dd string,hh string);

终端维度:safari

drop table t_display_pv_terminal_safari;
create table t_display_pv_ terminal_ safari (pvs bigint,mm string,dd string,hh string);

1.1.3、按栏目维度统计pv总量

栏目维度:job

栏目维度:news

栏目维度:bargin

栏目维度:lane

1.2、人均浏览页数

需求描述:比如,今日所有来访者,平均请求的页面数。

总页面请求数/去重总人数

drop table dw_avgpv_user_d;
create table dw_avgpv_user_d(
day string,
avgpv string);

insert into table dw_avgpv_user_d
select '2013-09-18',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='2013-09-18' group by remote_addr) b;

1.3、按referer维度统计pv总量

需求:按照来源及时间维度统计PVS,并按照PV大小倒序排序。

 按照小时粒度统计,查询结果存入:( "dw_pvs_referer_h" )

drop table dw_pvs_referer_h;
create table dw_pvs_referer_h(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);

insert into table dw_pvs_referer_h partition(datestr='2016-03-18')
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;

按小时粒度统计各来访域名的产生的pv数并排序

drop table dw_ref_host_visit_cnts_h;
create table dw_ref_host_visit_cnts_h(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);

insert into table dw_ref_host_visit_cnts_h partition(datestr='2016-03-18')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

注:还可以按来源地域维度、访客终端维度等计算。

1.4、统计pv总量最大的来源TOPN  (取分组TOP)

需求描述:按照时间维度,比如,统计一天内各小时产生最多pvs的来源top N。

需要用到row_number函数

以下语句对每个小时内的来访host次数倒序排序标号:

select ref_host,ref_host_cnts,concat(month,hour,day),
row_number() over (partition by concat(month,hour,day) order by ref_host_cnts desc) as od
from dw_ref_host_visit_cnts_h;

效果如下:

2、受访分析

统计每日最热门的页面top10

drop table dw_pvs_d;
create table dw_pvs_d(day string,url string,pvs string);

insert into table dw_pvs_d
select '2013-09-18',a.request,a.request_counts from
(select request as request,count(request) as request_counts from ods_weblog_detail where datestr='2013-09-18' group by request having request is not null) a
order by a.request_counts desc limit 10;

结果如下:

注:还可继续得出各维度交叉结果

3、访客分析

3.1 独立访客

需求描述:按照时间维度比如小时来统计独立访客及其产生的pvCnts

对于独立访客的识别,如果在原始日志中有用户标识,则根据用户标识即很好实现;

此处,由于原始日志中并没有用户标识,以访客IP来模拟,技术上是一样的,只是精确度相对较低。

时间维度:时

drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(
remote_addr string,
pvs      bigint,
hour     string);

insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
Where datestr='2013-09-18'
group by concat(month,day,hour),remote_addr;

在此结果表之上,可以进一步统计出,每小时独立访客总数,每小时请求次数topn访客等

如每小时独立访客总数:

select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;

统计每小时请求次数topn的独立访客

时间维度:月

select remote_addr,count(1) as counts,month
from ods_weblog_detail
group by month,remote_addr;

时间维度:日

select remote_addr,count(1) as counts,concat(month,day) as day
from ods_weblog_detail
Where dd='18/Sep/2013'
group by concat(month,day),remote_addr;

注:还可以按来源地域维度、访客终端维度等计算

3.2、每日新访客

需求描述:将每天的新访客统计出来

实现思路:创建一个去重访客累积表,然后将每日访客对比累积表。

时间维度:日

--历日去重访客累积表
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
)
partitioned by(datestr string);

--每日新用户追加到累计表
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
)
partitioned by(datestr string);

--每日新用户追加到累计表
insert into table dw_user_dsct_history partition(datestr='2013-09-19')
select tmp.day as day,tmp.today_addr as new_ip from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
from
(select distinct remote_addr as remote_addr,"2013-09-19" as day from ods_weblog_detail where datestr="2013-09-19") today
left outer join
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;

验证:

select count(distinct remote_addr) from ods_weblog_detail;
-- 1005

select count(1) from dw_user_dsct_history where prtflag_day='18/Sep/2013';
--845

select count(1) from dw_user_dsct_history where prtflag_day='19/Sep/2013';
--160

时间维度:月

类似日粒度算法。

注:还可以按来源地域维度、访客终端维度等计算。

4、Visit分析(点击流模型)

4.1、回头/单次访客统计

需求描述:查询今日所有回头访客及其访问次数。

实现思路:上表中出现次数>1的访客,即回头访客;反之,则为单次访客

drop table dw_user_returning;
create table dw_user_returning(
day string,
remote_addr string,
acc_cnt string)
partitioned by (datestr string);

insert overwrite table dw_user_returning partition(datestr='2013-09-18')
select tmp.day,tmp.remote_addr,tmp.acc_cnt
from
(select '2013-09-18' as day,remote_addr,count(session) as acc_cnt from click_stream_visit group by remote_addr) tmp
where tmp.acc_cnt>1;

4.2、人均访问频次

需求:统计出每天所有用户访问网站的平均次数(visit)。

总visit数/去重总用户数

select sum(pagevisits)/count(distinct remote_addr) from click_stream_visit partition(datestr='2013-09-18');

6、关键路径转化率分析——漏斗模型

转化:在一条指定的业务流程中,各个步骤的完成人数及相对上一个步骤的百分比。

6.1 需求分析

6.2 模型设计

定义好业务流程中的页面标识,下例中的步骤为:

Step1、  /item%

Step2、  /category

Step3、  /index

Step4、  /order

6.3 开发实现

分步骤开发:

6.3.1、查询每一个步骤的总访问人数


create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where request like '/item%'
union
select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/category%'
union
select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where request like '/order%'
union
select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where request like '/index%';

6.3.2、查询每一步骤相对于路径起点人数的比例

思路:利用join

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from route_num rn
inner join
route_num rr

select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';

 

6.3.3、查询每一步骤相对于上一步骤的漏出率

select tmp.rrstep as rrstep,tmp.rrnumbs/tmp.rnnumbs as ration
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from route_num rn
inner join
route_num rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1

 

6.3.4、汇总以上两种指标

select abs.step,abs.numbs,abs.ratio as abs_ratio,rel.ratio as rel_ratio
from
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from route_num rn
inner join
route_num rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from route_num rn
inner join
route_num rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step

 

 

猜你喜欢

转载自blog.csdn.net/shenchaohao12321/article/details/82975362