Website/APP traffic analysis and user visit analysis


 Data Warehouse Design


2.Data warehouse design for this project (note: a star schema is used)
	1.Fact table design

 

	2.Dimension table design

	Note: 
		Dimension table data is usually generated with your own scripts according to business rules (or with a tool), to make later join-based analysis convenient. 
		For example, the time dimension table is typically populated in advance, spanning from the earliest date the business needs up to the current date. Depending on your analysis granularity,
		you can generate year, quarter, month, week, day, hour, and similar fields for analysis (see the sample sketch below).
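		As a concrete sketch: a time dimension file matching the t_dim_time table created in the hands-on section later in this article (date_key int, year, month, day, hour, comma-delimited) could contain rows like these (hypothetical sample values, one row per hour):
			2013091800,2013,09,18,00
			2013091801,2013,09,18,01
			2013091802,2013,09,18,02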



3.Module development -- ETL
	The essence of ETL is to extract data from the various sources, transform it, and finally load it into the dimensionally modeled tables of the data warehouse.
	The ETL work is only done once those dimension/fact tables are populated. 
	This project's analysis runs on a Hadoop cluster and mainly uses the Hive data warehouse tool, so the collected and preprocessed data must be loaded into Hive for the subsequent analysis steps.
	
	1.Create the ODS-layer tables 
		1.Raw log data table 
			1.drop table if exists ods_weblog_origin; 
			2.create table ods_weblog_origin( 
				valid string, 
				remote_addr string, 
				remote_user string, 
				time_local string, 
				request string, 
				status string, 
				body_bytes_sent string, 
				http_referer string, 
				http_user_agent string) 
			  partitioned by (datestr string) 
			  row format delimited fields terminated by '\001'; 

		2.Clickstream pageviews model table 
			1.drop table if exists ods_click_pageviews; 
			2.create table ods_click_pageviews( 
				session string, 
				remote_addr string, 
				remote_user string, 
				time_local string, 
				request string, 
				visit_step string, 
				page_staylong string, 
				http_referer string, 
				http_user_agent string, 
				body_bytes_sent string, 
				status string) 
			  partitioned by (datestr string) 
			  row format delimited fields terminated by '\001'; 

		3.Clickstream visit model table 
			1.drop table if exists ods_click_stream_visit; 
			2.create table ods_click_stream_visit( 
				session     string, 
				remote_addr string, 
				inTime      string, 
				outTime     string, 
				inPage      string, 
				outPage     string, 
				referal     string, 
				pageVisits  int) 
			  partitioned by (datestr string) 
			  row format delimited fields terminated by '\001'; 

	2.Load the ODS-layer data 
		1.Load the data: load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918'); 	
		2.Show the partitions: show partitions ods_weblog_origin;  
		3.Count the loaded rows: select count(*) from ods_weblog_origin; 
		4.The two clickstream model tables (pageviews and visit) are loaded the same way. 
		5.Note: in production the load commands should live in a script scheduled by Azkaban; mind the run time, which must be after data preprocessing has finished (a sketch follows). 
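		  As a minimal sketch (assuming the scheduler passes the partition date in via --hivevar datestr=...), the scheduled script body could be a .sql file like:
			load data inpath '/weblog/preprocessed/'
			overwrite into table ods_weblog_origin partition(datestr='${hivevar:datestr}');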
	
	3.Build the ODS-layer detail wide table 
		1.Requirement 
			The whole analysis proceeds layer by layer through the warehouse. In short, intermediate tables are derived from the ODS raw data
			(for example, to make later analysis easier, semi-structured fields in the raw data such as the time and the url are parsed into structured columns, refining the fields into a detail table),
			and the various metrics are then computed on top of those intermediate tables.

		2.ETL implementation: create the detail table ods_weblog_detail  
			1.drop table ods_weblog_detail; 
			2.create table ods_weblog_detail( 
				valid string,            --valid flag 
				remote_addr     string,  --client IP 
				remote_user     string,  --user identifier 
				time_local      string,  --full access timestamp 
				daystr          string,  --access date 
				timestr         string,  --access time 
				month           string,  --access month 
				day             string,  --access day 
				hour            string,  --access hour 
				request         string,  --requested url 
				status          string,  --response code 
				body_bytes_sent string,  --bytes sent 
				http_referer    string,  --referer url 
				ref_host        string,  --referer host 
				ref_path        string,  --referer path 
				ref_query       string,  --referer query string 
				ref_query_id    string,  --value of the referer query parameter 
				http_user_agent string)  --client user agent 
			  partitioned by(datestr string); 

			3.Populate the detail wide table ods_weblog_detail via insert-select 
				1.Extract the referer URL into the intermediate table t_ods_tmp_referurl, i.e., split the referring URL into host, path, query, and query id. 
				2.drop table if exists t_ods_tmp_referurl; 
				3.create table t_ods_tmp_referurl as 
				  SELECT a.*,b.* 
				  FROM ods_weblog_origin a  
				  LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b 
				  as host, path, query, query_id; 
				4.LATERAL VIEW is used together with UDTFs such as split and explode; it turns one column into multiple rows of data. 
				5.UDTF (User-Defined Table-Generating Function):
					solves the one-row-in, many-rows-out (one-to-many mapping) need.
					explode is one such function: explode(ARRAY) emits one row per element of the array. 
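				6.A minimal illustration (hypothetical table t with an array column arr):
					select id, item
					from t lateral view explode(arr) tmp as item;
					-- each element of arr becomes its own output row alongside id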

			4.Extract and transform the time_local field into the intermediate detail table t_ods_tmp_detail 
				1.drop table if exists t_ods_tmp_detail; 
				2.create table t_ods_tmp_detail as  
				  select b.*,substring(time_local,0,10) as daystr, 
					substring(time_local,12) as tmstr, 
					substring(time_local,6,2) as month, 
					substring(time_local,9,2) as day, 
					substring(time_local,11,3) as hour 
				  from t_ods_tmp_referurl b; 
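				Assuming time_local looks like '2013-09-18 06:49:18' (the format seen in the pageviews sample later in this article), and noting that Hive's substring is 1-indexed (a start of 0 is treated as 1), the expressions above evaluate to:
					substring(time_local,0,10)  -- '2013-09-18'  (daystr)
					substring(time_local,12)    -- '06:49:18'    (tmstr)
					substring(time_local,6,2)   -- '09'          (month)
					substring(time_local,9,2)   -- '18'          (day)
					substring(time_local,11,3)  -- ' 06'         (hour; note the leading space, also visible in the query output samples later)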

			5.The statements above can be combined into one overall statement 
				insert into table shizhan.ods_weblog_detail partition(datestr='2013-09-18') 
				select c.valid,c.remote_addr,c.remote_user,c.time_local, 
					substring(c.time_local,0,10) as daystr, 
					substring(c.time_local,12) as tmstr, 
					substring(c.time_local,6,2) as month, 
					substring(c.time_local,9,2) as day, 
					substring(c.time_local,11,3) as hour, 
					c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent 
				from (SELECT a.valid,a.remote_addr,a.remote_user,a.time_local, a.request,a.status,a.body_bytes_sent,a.http_referer,
					    a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id  
				       FROM shizhan.ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 
					    'PATH','QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c; 


4.Module development -- statistical analysis 
	Once the warehouse is built, users can query it with Hive SQL and analyze the data. 
	In practice, which metrics are needed is usually decided by the data-consuming departments, and new requirements keep coming; the following are some typical metrics in website traffic analysis.  
	Note: every metric can be drilled down against the dimension tables.

	1.Traffic analysis 
		1.Multi-dimensional total PV counts
			1.By time dimension 
				1.Compute hourly PVs; note the group by syntax 
				  select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour; 
		
				2.Approach 1: query the single table ods_weblog_detail directly 
					1.Compute the hourly PVs within this processing batch (one day) 
						1.drop table dw_pvs_everyhour_oneday; 
						2.create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) 
						  partitioned by(datestr string); 
						3.insert into table dw_pvs_everyhour_oneday partition(datestr='20130918') 
						  select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a 
						  where  a.datestr='20130918' group by a.month,a.day,a.hour; 
 
					2.Compute daily PVs 
						1.drop table dw_pvs_everyday; 
						2.create table dw_pvs_everyday(pvs bigint,month string,day string); 
						3.insert into table dw_pvs_everyday 
						  select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a 
						  group by a.month,a.day; 

				3.Approach 2: join against the time dimension table 
					1.Dimension: day 
						1.drop table dw_pvs_everyday; 
						2.create table dw_pvs_everyday(pvs bigint,month string,day string); 
						3.insert into table dw_pvs_everyday 
						  select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a 
						  join ods_weblog_detail b  
						  on a.month=b.month and a.day=b.day 
						  group by a.month,a.day; 
 
					2.Dimension: month 
						1.drop table dw_pvs_everymonth; 
						2.create table dw_pvs_everymonth (pvs bigint,month string); 
						3.insert into table dw_pvs_everymonth 
						4.select count(*) as pvs,a.month from (select distinct month from t_dim_time) a 
						  join ods_weblog_detail b on a.month=b.month group by a.month; 
 
					3.Alternatively, reuse earlier results, e.g. compute each day's total from the hourly results already computed 
						insert into table dw_pvs_everyday 
						select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';

			2.By terminal dimension 
				1.The field in the data that reflects the user's terminal is http_user_agent. 
				2.User Agent, or UA for short:
					1.is a special header string that identifies, to the visited site, the browser type and version, operating system and version, browser engine, and so on.
					2.例如:User-Agent,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) 
					        Chrome/58.0.3029.276 Safari/537.36 
					3.From the UA string above, the following can be extracted: 
						chrome 58.0: browser chrome, browser version 58.0, platform windows, engine webkit 
				3.The statement below gives an exploratory count; its accuracy is of course limited. A sketch of a slightly more structured approach follows the query. 
					select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200; 
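				4.As a rough sketch, browser-level PV counts could be derived with regexp_extract (the pattern below is an illustrative assumption, not a complete UA parser):
					select browser,count(*) as pvs
					from (select regexp_extract(http_user_agent,'(Chrome|Firefox|Safari|MSIE)',1) as browser
					      from ods_weblog_detail) t
					group by browser;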

			3.By site section dimension 
				A site section can be understood as a topical grouping of related content on the site. 
				In the URL this shows up as a different sub-directory per section.
				For example, for a site at www.xxxx.cn, its sections might be reached as: 
					section: ../job 
					section: ../news 
					section: ../sports 
					section: ../technology 
				The section can therefore be parsed out of the requested url and the statistics grouped by section, as in the sketch below. 
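				For example, a sketch that takes the first path segment of request as the section (assuming requests look like /job/xxx):
					select regexp_extract(request,'^/([^/]+)',1) as section,count(*) as pvs
					from ods_weblog_detail
					group by regexp_extract(request,'^/([^/]+)',1);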
			

			4.By referer dimension
				1.Count the hourly PVs generated by each referring url 
					1.drop table dw_pvs_referer_everyhour; 
					2.create table dw_pvs_referer_everyhour(
						referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) 
					  partitioned by(datestr string); 
					3.insert into table dw_pvs_referer_everyhour partition(datestr='20130918') 
					  select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt 
					  from ods_weblog_detail  
					  group by http_referer,ref_host,month,day,hour  
					  having ref_host is not null 
					  order by hour asc,day asc,month asc,pv_referer_cnt desc; 

				2.Count and rank the hourly PVs per referring host 
					1.drop table dw_pvs_refererhost_everyhour; 
					2.create table dw_pvs_refererhost_everyhour(
						ref_host string,month string,day string,hour string,ref_host_cnts bigint) 
					  partitioned by(datestr string); 
					3.insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918') 
					  select ref_host,month,day,hour,count(1) as ref_host_cnts 
					  from ods_weblog_detail 
					  group by ref_host,month,day,hour  
					  having ref_host is not null 
					  order by hour asc,day asc,month asc,ref_host_cnts desc; 

 				3.Note: the same can also be computed by visitor region, visitor terminal, and other dimensions

		2.Average pageviews per visitor 
			1.Requirement: compute the average number of pages requested per visitor today. 
			2.Also called average page depth, this metric reflects how sticky the site is for its users. 
			  It is the average number of pages a user views within a given period. 
			  Formula: total page requests / distinct visitor count 
			  remote_addr is used to distinguish users. 
			  First compute the PV count per distinct remote_addr, then sum all PVs as the total page requests, and count the remote_addr values as the distinct visitor count. 
			3.Total page requests / distinct visitor count 
				1.drop table dw_avgpv_user_everyday; 
				2.create table dw_avgpv_user_everyday(day string, avgpv string); 
 				3.insert into table dw_avgpv_user_everyday 
				  select '20130918',sum(b.pvs)/count(b.remote_addr) from 
				  (select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b; 

		3.Top-N referers by total PV (per-group top N) 
			1.Requirement: for each hour, find the top N referring hosts by PV count. 
			2.The row_number() function 
				1.Syntax: row_number() over (partition by xxx order by xxx) as rank.
				2.rank is the alias of the new column, i.e. a new field named rank is added. 
				3.partition by defines the grouping, e.g. grouping by a sex field 
				4.order by sorts within each group, e.g. group by sex and sort by age within each group 
				5.After sorting, each row within a group is numbered starting from 1 
				6.To pick rows within a group, wrap the query in a subquery and filter, e.g. "where t.rank <= N" in the outer query (the window alias cannot be referenced in the same query's WHERE clause) 
			3.The statement below numbers the referring hosts within each hour in descending order of PV count: 
				select ref_host,ref_host_cnts,concat(month,day,hour), 
				row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od 
				from dw_pvs_refererhost_everyhour; 
			4.Sample output appears in the hands-on walkthrough later in this article.

	2.Page analysis (from the pages' perspective) 
		1.Per-page visit statistics 
			This mainly analyzes the request field, e.g. PVs per page, UVs per page, etc. 
			These metrics are essentially a group by on the page field.
			Example: PVs per page 
				select request as request,count(request) as request_counts from ods_weblog_detail 
				group by request having request is not null order by request_counts desc limit 20;

		2.Hot page statistics 
			Find each day's top 10 hottest pages 
				1.drop table dw_hotpages_everyday; 
				2.create table dw_hotpages_everyday(day string,url string,pvs string); 
				3.insert into table dw_hotpages_everyday 
				  select '20130918',a.request,a.request_counts from 
				  (
					select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' 
					group by request having request is not null
				  ) a order by a.request_counts desc limit 10;  

	3.Visitor analysis 
		1.Unique visitors 
			1.Requirement: count unique visitors, and the PVs they generate, per time dimension, e.g. per hour. 
			2.If the raw logs carried a user identifier, unique visitors could be identified from it directly;
			  here the raw logs have none, so the visitor IP serves as a stand-in. Technically the approach is identical, just less accurate. 
			3.Time dimension: hour 
				1.drop table dw_user_dstc_ip_h; 
				2.create table dw_user_dstc_ip_h(remote_addr string, pvs bigint, hour string); 
				3.insert into table dw_user_dstc_ip_h  
				  select remote_addr,count(1) as pvs,concat(month,day,hour) as hour from ods_weblog_detail 
				  where datestr='20130918' 
				  group by concat(month,day,hour),remote_addr; 
				4.On top of this result table, further statistics are possible, e.g. total unique visitors per hour: 
					select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour; 
			4.Time dimension: day 
				select remote_addr,count(1) as counts,concat(month,day) as day 
				from ods_weblog_detail 
				where datestr='20130918' 
				group by concat(month,day),remote_addr; 

			5.Time dimension: month 
				select remote_addr,count(1) as counts,month  
				from ods_weblog_detail 
				group by month,remote_addr;

	4.Daily new visitors 
		1.Requirement: identify each day's new visitors. 
		2.Approach: keep a cumulative table of distinct historical visitors, and compare each day's visitors against it.  

		3.Cumulative table of distinct historical visitors 
			1.drop table dw_user_dsct_history; 
			2.create table dw_user_dsct_history(day string, ip string)  
			  partitioned by(datestr string); 

		4.Daily new visitor table 
			1.drop table dw_user_new_d; 
			2.create table dw_user_new_d (day string, ip string)  
			  partitioned by(datestr string); 
 
		5.Insert each day's new visitors into the new visitor table 
			1.insert into table dw_user_new_d partition(datestr='20130918') 
			  select tmp.day as day,tmp.today_addr as new_ip 
			  from(
				select today.day as day,today.remote_addr as today_addr,old.ip as old_addr 
			         from (
					select distinct remote_addr as remote_addr,"20130918" as day 
				      	from ods_weblog_detail where datestr="20130918"
				      ) today left outer join dw_user_dsct_history old on today.remote_addr=old.ip
			       ) tmp 
			  where tmp.old_addr is null;  
 
		6.Append each day's new visitors to the cumulative table 
			insert into table dw_user_dsct_history partition(datestr='20130918') 
			select day,ip from dw_user_new_d where datestr='20130918'; 

		7.Verification queries: 
			select count(distinct remote_addr) from ods_weblog_detail; 
			select count(1) from dw_user_dsct_history where datestr='20130918'; 
			select count(1) from dw_user_new_d where datestr='20130918'; 

		8.Note: the same can also be computed by visitor region, visitor terminal, and other dimensions 

	5.Visit analysis (clickstream model) 
		1.Returning vs one-time visitor statistics 
			1.Requirement: find today's returning visitors and their visit counts.

			2.Approach: in the visit model table, visitors whose visit count > 1 are returning visitors; the rest are one-time visitors. 
				1.drop table dw_user_returning; 
				2.create table dw_user_returning(day string, remote_addr string, acc_cnt string) 
				  partitioned by (datestr string); 
				3.insert overwrite table dw_user_returning partition(datestr='20130918') 
				  select tmp.day,tmp.remote_addr,tmp.acc_cnt 
				  from (select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp 
				  where tmp.acc_cnt > 1; 

		2.Average visit frequency per visitor 
			1.Requirement: the average number of visits per user per day 
			2.Total visits / distinct visitor count 
				select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918'; 

	6.Key-path conversion rate analysis (funnel model) 
		1.Requirement 
			Conversion: for a specified business flow, the number of users completing each step and the percentage relative to the previous step.

		2.Model design 
			Define the page identifiers of the business flow. The steps in the example below (matching the SQL that follows) are: 
				Step1、  /item 
				Step2、  /category 
				Step3、  /order 
				Step4、  /index

		3.Implementation 
			1.Count the total visitors at each step, storing the per-step counts into dw_oute_numbs 
				1.create table dw_oute_numbs as  
				  select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews 
				  where datestr='20130920' and request like '/item%' 
				  union 
				  select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews 
				  where datestr='20130920' and request like '/category%' 
				  union 
				  select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews 
				  where datestr='20130920' and request like '/order%' 
				  union 
				  select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews 
				  where datestr='20130920' and request like '/index%'; 
			  Note: UNION merges the result sets of multiple SELECT statements into one. In Hive, UNION deduplicates rows (older Hive versions support only UNION ALL, which keeps duplicates).

			2.Compute each step's share relative to the funnel start 
			  Idea: a cascaded query using a self-join 
				1.Join dw_oute_numbs with itself 
					select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  
					from dw_oute_numbs rn inner join dw_oute_numbs rr; 

				2.Each step's count / the first step's count == each step's share relative to the funnel start 
					select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio 
					from ( 
						select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs
						from dw_oute_numbs rn inner join dw_oute_numbs rr
					     ) tmp where tmp.rrstep='step1'; 

			3.Compute each step's leakage rate relative to the previous step: self-join the table and keep only adjacent step pairs 
				1.select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  
				  from dw_oute_numbs rn inner join dw_oute_numbs rr 
				  where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1; 

				2.select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate 
				  from ( 
					select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  
					from dw_oute_numbs rn inner join dw_oute_numbs rr
				        ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1; 

			4.Combine the two metrics above 
				select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate 
				from ( 
					select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate 
					from ( 
						select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs 
						from dw_oute_numbs rn inner join  dw_oute_numbs rr
					      ) tmp where tmp.rrstep='step1' 
				      ) 
				abs left outer join 
				( 
					select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate 
					from ( 
						select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  
						from dw_oute_numbs rn inner join dw_oute_numbs rr
					      ) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1 
				) rel on abs.step=rel.step; 

Website Traffic Log Analysis -- Module Development -- ETL -- Creating the ODS-Layer Tables

1.Time sync command: ntpdate ntp6.aliyun.com
2.Start Hive (MySQL-backed metastore); launch hive from its local path
  	1.Local connection:
		cd /root/hive/bin
		./hive
	2.Connecting from another Linux host to this host's Hive (note: the hiveserver2 server must be started first):
		1.Start hiveserver2 in the background:
			cd /root/hive/bin
			nohup ./hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err & 
			This returns the hiveserver2 server's process ID.
		2.Connect from the other Linux host to this host's Hive:
			cd /root/hive/bin
			./beeline -u jdbc:hive2://NODE1:10000 -n root 
			Then enter the username and password of the Linux account on NODE1.

3.Local mode:
	-- Run queries in local mode (on the current machine only); otherwise queries run on the yarn cluster (multiple machines).
	-- Local mode is recommended only in development, to speed up queries; in production, switch back to yarn cluster mode.
	set hive.exec.mode.local.auto=true;
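	-- Local mode only kicks in for small jobs; these thresholds (real Hive parameters, shown here with their usual default values) control when:
	set hive.exec.mode.local.auto.inputbytes.max=134217728;
	set hive.exec.mode.local.auto.input.files.max=4;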

4.Create the database:
	create database itheima;
	use itheima;

5.Create the tables:
	1.Raw data table: holds the MR-cleaned output, not the raw log data
		1.drop table if exists ods_weblog_origin;
		2.create table ods_weblog_origin(
			valid string,
			remote_addr string,
			remote_user string,
			time_local string,
			request string,
			status string,
			body_bytes_sent string,
			http_referer string,
			http_user_agent string)
		  partitioned by (datestr string)
		  row format delimited fields terminated by '\001';

	2.Clickstream pageviews table
		1.drop table if exists ods_click_pageviews;
		2.create table ods_click_pageviews(
			session string,
			remote_addr string,
			remote_user string,
			time_local string,
			request string,
			visit_step string,
			page_staylong string,
			http_referer string,
			http_user_agent string,
			body_bytes_sent string,
			status string)
		  partitioned by (datestr string)
		  row format delimited fields terminated by '\001';

	3.Clickstream visit table
		1.drop table if exists ods_click_stream_visit;
		2.create table ods_click_stream_visit(
			session     string,
			remote_addr string,
			inTime      string,
			outTime     string,
			inPage      string,
			outPage     string,
			referal     string,
			pageVisits  int)
			partitioned by (datestr string)
			row format delimited fields terminated by '\001';

	4.Dimension table example:
		1.drop table if exists t_dim_time;
		2.create table t_dim_time(date_key int,year string,month string,day string,hour string) row format delimited fields terminated by ',';

	5.show tables;


Website Traffic Log Analysis -- Module Development -- ETL -- Loading the ODS-Layer Data

1.Create the target directories in HDFS for the data files
	hdfs dfs -mkdir -p /weblog/preprocessed
	hdfs dfs -mkdir -p /weblog/clickstream/pageviews
	hdfs dfs -mkdir -p /weblog/clickstream/visits
	hdfs dfs -mkdir -p /weblog/dim_time

2.Browse the HDFS file system at 192.168.25.100:50070

3.Upload the data files to be imported to the target locations
	hdfs dfs -put /root/hivedata/weblog/output/part-m-00000 /weblog/preprocessed
	hdfs dfs -put /root/hivedata/weblog/pageviews/part-r-00000 /weblog/clickstream/pageviews
	hdfs dfs -put /root/hivedata/weblog/visitout/part-r-00000 /weblog/clickstream/visits
	hdfs dfs -put /root/hivedata/weblog/dim_time_dat.txt /weblog/dim_time

4.Load the data files from HDFS into the Hive tables:
	1.Load the cleaned data into the source table ods_weblog_origin 
		load data inpath '/weblog/preprocessed/' overwrite into table ods_weblog_origin partition(datestr='20130918');
		show partitions ods_weblog_origin; -- shows datestr=20130918 
		select count(*) from ods_weblog_origin; -- returns 13770

	2.Load the clickstream pageviews data into ods_click_pageviews
		load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='20130918');
		select count(*) from ods_click_pageviews; -- returns 76 

	3.Load the clickstream visit data into ods_click_stream_visit
		load data inpath '/weblog/clickstream/visits' overwrite into table ods_click_stream_visit partition(datestr='20130918');
		select count(*) from ods_click_stream_visit; -- returns 57  

	4.Load dim_time_dat.txt into the time dimension table 
		load data inpath '/weblog/dim_time' overwrite into table t_dim_time;
		select count(*) from t_dim_time; -- returns 29 

Website Traffic Log Analysis -- Module Development -- ETL -- The ODS Detail Wide Table

1.Create the detail wide table ods_weblog_detail
	1.drop table ods_weblog_detail;
	2.create table ods_weblog_detail(
		valid           string, --valid flag
		remote_addr     string, --client IP
		remote_user     string, --user identifier
		time_local      string, --full access timestamp
		daystr          string, --access date
		timestr         string, --access time
		month           string, --access month
		day             string, --access day
		hour            string, --access hour
		request         string, --requested url
		status          string, --response code
		body_bytes_sent string, --bytes sent
		http_referer    string, --referer url
		ref_host        string, --referer host
		ref_path        string, --referer path
		ref_query       string, --referer query string
		ref_query_id    string, --value of the referer query parameter
		http_user_agent string --client user agent
	  )partitioned by(datestr string);

2.Extract the referer URL into the intermediate table t_ods_tmp_referurl
	1.drop table if exists t_ods_tmp_referurl;
	2.create table t_ods_tmp_referurl as
		SELECT a.*,b.*
		FROM ods_weblog_origin a 
		LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b as host, path, query, query_id; 
	3.Explanation:
		regexp_replace(col, "\"", ""): replaces double quotes with the empty string
		parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id'): splits the referring url into four columns: host, path, query, query_id
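	4.A quick way to see parse_url_tuple in action (the URL literal is an illustrative example):
		select parse_url_tuple('http://blog.fens.me/about/?id=3','HOST','PATH','QUERY','QUERY:id')
		  as (host,path,query,query_id);
		-- returns: blog.fens.me	/about/	id=3	3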

3.Create the intermediate detail table t_ods_tmp_detail, extracting and transforming the time_local field into it
	1.drop table if exists t_ods_tmp_detail;
	2.create table t_ods_tmp_detail as 
	  select b.*,substring(time_local,0,10) as daystr,
		substring(time_local,12) as tmstr,
		substring(time_local,6,2) as month,
		substring(time_local,9,2) as day,
		substring(time_local,11,3) as hour
	  from t_ods_tmp_referurl b;

4.Insert the query results into the detail wide table ods_weblog_detail
	insert into table ods_weblog_detail partition(datestr='20130918')
	select c.valid,c.remote_addr,c.remote_user,c.time_local,
		substring(c.time_local,0,10) as daystr,
		substring(c.time_local,12) as tmstr,
		substring(c.time_local,6,2) as month,
		substring(c.time_local,9,2) as day,
		substring(c.time_local,11,3) as hour,
		c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
	from
	(SELECT a.valid,a.remote_addr,a.remote_user,a.time_local,
		a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id 
		FROM ods_weblog_origin a LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH','QUERY', 'QUERY:id') b 
		as ref_host, ref_path, ref_query, ref_query_id) c;

 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- PVs by Time & Referer Dimension 

1.Traffic analysis
	1.Compute hourly PVs; note the group by syntax
		select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour;

	2.Multi-dimensional total PV counts
		1.Approach 1: query the single table ods_weblog_detail directly
			1.Compute the hourly PVs within this processing batch (one day)
				1.drop table dw_pvs_everyhour_oneday;
				2.create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);
				3.insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
			  	  select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
			  	  where a.datestr='20130918' group by a.month,a.day,a.hour;

			2.Compute daily PVs
				1.drop table dw_pvs_everyday;
				2.create table dw_pvs_everyday(pvs bigint,month string,day string);
				3.insert into table dw_pvs_everyday
		  	  	  select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
		  	  	  group by a.month,a.day;

		2.Approach 2: join against the time dimension table
			1.Dimension: day
				1.drop table dw_pvs_everyday;
				2.create table dw_pvs_everyday(pvs bigint,month string,day string);
				3.insert into table dw_pvs_everyday
				  select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
				  join ods_weblog_detail b 
				  on a.month=b.month and a.day=b.day
				  group by a.month,a.day;

			2.Dimension: month
				1.drop table dw_pvs_everymonth;
				2.create table dw_pvs_everymonth (pvs bigint,month string);
				3.insert into table dw_pvs_everymonth
				  select count(*) as pvs,a.month from (select distinct month from t_dim_time) a
				  join ods_weblog_detail b on a.month=b.month group by a.month;

		3.Alternatively, reuse earlier results, e.g. compute each day's total from the hourly results already computed
			insert into table dw_pvs_everyday
			select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';


 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- Other Dimensions & Average Pageviews per Visitor

1.PVs by referer dimension
	1.Count the hourly PVs generated by each referring url, storing the results in dw_pvs_referer_everyhour
		1.drop table dw_pvs_referer_everyhour;
		2.create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);
		3.insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
		  select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
		  from ods_weblog_detail 
		  group by http_referer,ref_host,month,day,hour 
		  having ref_host is not null
		  order by hour asc,day asc,month asc,pv_referer_cnt desc;

+-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+
|            dw_pvs_referer_everyhour.referer_url             | dw_pvs_referer_everyhour.referer_host  | dw_pvs_referer_everyhour.month  | dw_pvs_referer_everyhour.day  | dw_pvs_referer_everyhour.hour  | dw_pvs_referer_everyhour.pv_referer_cnt  | dw_pvs_referer_everyhour.datestr  |
+-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+
| "http://blog.fens.me/r-density/"                            | blog.fens.me                           | 09                              | 19                            |  00                            | 26                                       | 20130918                          |
| "http://blog.fens.me/r-json-rjson/"                         | blog.fens.me                           | 09                              | 19                            |  00                            | 21                                       | 20130918                          |
| "http://blog.fens.me/vpn-pptp-client-ubuntu/"               | blog.fens.me                           | 09                              | 19                            |  00                            | 20                                       | 20130918                          |
| "http://blog.fens.me/hadoop-mahout-roadmap/"                | blog.fens.me                           | 09                              | 19                            |  00                            | 20                                       | 20130918                          |
| "http://blog.fens.me/hadoop-zookeeper-intro/"               | blog.fens.me                           | 09                              | 19                            |  00                            | 20                                       | 20130918                          |
| "http://www.fens.me/"                                       | www.fens.me                            | 09                              | 19                            |  00                            | 12                                       | 20130918                          |
| "http://h2w.iask.cn/jump.php?url=http%3A%2F%2Fwww.fens.me"  | h2w.iask.cn                            | 09                              | 19                            |  00                            | 5                                        | 20130918                          |
| "https://www.google.com.hk/"                                | www.google.com.hk                      | 09                              | 19                            |  00                            | 3                                        | 20130918                          |
| "http://angularjs.cn/A0eQ"                                  | angularjs.cn                           | 09                              | 19                            |  00                            | 2                                        | 20130918                          |
| "http://blog.fens.me/about/"                                | blog.fens.me                           | 09                              | 19                            |  00                            | 2                                        | 20130918                          |
+-------------------------------------------------------------+----------------------------------------+---------------------------------+-------------------------------+--------------------------------+------------------------------------------+-----------------------------------+--+


	2.Count and rank the hourly PVs per referring host
		1.drop table dw_pvs_refererhost_everyhour;
		2.create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);
		3.insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
		  select ref_host,month,day,hour,count(1) as ref_host_cnts
		  from ods_weblog_detail 
		  group by ref_host,month,day,hour 
		  having ref_host is not null
		  order by hour asc,day asc,month asc,ref_host_cnts desc;
+----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+
| dw_pvs_refererhost_everyhour.ref_host  | dw_pvs_refererhost_everyhour.month  | dw_pvs_refererhost_everyhour.day  | dw_pvs_refererhost_everyhour.hour  | dw_pvs_refererhost_everyhour.ref_host_cnts  | dw_pvs_refererhost_everyhour.datestr  |
+----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+
| blog.fens.me                           | 09                                  | 19                                |  00                                | 111                                         | 20130918                              |
| www.fens.me                            | 09                                  | 19                                |  00                                | 13                                          | 20130918                              |
| h2w.iask.cn                            | 09                                  | 19                                |  00                                | 6                                           | 20130918                              |
| www.google.com.hk                      | 09                                  | 19                                |  00                                | 3                                           | 20130918                              |
| angularjs.cn                           | 09                                  | 19                                |  00                                | 3                                           | 20130918                              |
| cnodejs.org                            | 09                                  | 19                                |  00                                | 1                                           | 20130918                              |
| www.leonarding.com                     | 09                                  | 19                                |  00                                | 1                                           | 20130918                              |
| www.itpub.net                          | 09                                  | 19                                |  00                                | 1                                           | 20130918                              |
| blog.fens.me                           | 09                                  | 19                                |  01                                | 89                                          | 20130918                              |
| cos.name                               | 09                                  | 19                                |  01                                | 3                                           | 20130918                              |
+----------------------------------------+-------------------------------------+-----------------------------------+------------------------------------+---------------------------------------------+---------------------------------------+--+

 Website Traffic Log Analysis -- Module Development -- Statistical Analysis -- Per-Group Top N (row_number)

1.Top-N referers by total PV
	1.Requirement: per time dimension, find the top-N referring hosts producing the most PVs in each hour of a day
	2.The row_number function
		select ref_host,ref_host_cnts,concat(month,day,hour),
		row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od 
		from dw_pvs_refererhost_everyhour;

+-------------------------+----------------+----------+-----+--+
|        ref_host         | ref_host_cnts  |   _c2    | od  |
+-------------------------+----------------+----------+-----+--+
| blog.fens.me            | 68             | 0918 06  | 1   |
| www.angularjs.cn        | 3              | 0918 06  | 2   |
| www.google.com          | 2              | 0918 06  | 3   |
| www.baidu.com           | 1              | 0918 06  | 4   |
| cos.name                | 1              | 0918 06  | 5   |
| blog.fens.me            | 711            | 0918 07  | 1   |
| www.google.com.hk       | 20             | 0918 07  | 2   |
| www.angularjs.cn        | 20             | 0918 07  | 3   |
| www.dataguru.cn         | 10             | 0918 07  | 4   |


	3.Putting the above together:
		1.drop table dw_pvs_refhost_topn_everyhour;
		2.create table dw_pvs_refhost_topn_everyhour(
		  hour string,
		  toporder string,
		  ref_host string,
		  ref_host_cnts string
		  )partitioned by(datestr string);
		3.insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')
		  select t.hour,t.od,t.ref_host,t.ref_host_cnts from
		  (select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
		  row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od 
		  from dw_pvs_refererhost_everyhour) t where od<=3;

+-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+
| dw_pvs_refhost_topn_everyhour.hour  | dw_pvs_refhost_topn_everyhour.toporder  | dw_pvs_refhost_topn_everyhour.ref_host  | dw_pvs_refhost_topn_everyhour.ref_host_cnts  | dw_pvs_refhost_topn_everyhour.datestr  |
+-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+
| 0918 06                             | 1                                       | blog.fens.me                            | 68                                           | 20130918                               |
| 0918 06                             | 2                                       | www.angularjs.cn                        | 3                                            | 20130918                               |
| 0918 06                             | 3                                       | www.google.com                          | 2                                            | 20130918                               |
| 0918 07                             | 1                                       | blog.fens.me                            | 711                                          | 20130918                               |
| 0918 07                             | 2                                       | www.google.com.hk                       | 20                                           | 20130918                               |
| 0918 07                             | 3                                       | www.angularjs.cn                        | 20                                           | 20130918                               |
| 0918 08                             | 1                                       | blog.fens.me                            | 1556                                         | 20130918                               |
| 0918 08                             | 2                                       | www.fens.me                             | 26                                           | 20130918                               |
| 0918 08                             | 3                                       | www.baidu.com                           | 15                                           | 20130918                               |
| 0918 09                             | 1                                       | blog.fens.me                            | 1047                                         | 20130918                               |
+-------------------------------------+-----------------------------------------+-----------------------------------------+----------------------------------------------+----------------------------------------+--+



2.Average pageviews per visitor
	1.Requirement: compute the average number of pages requested per visitor today.
	2.Total page requests / distinct visitor count
		1.drop table dw_avgpv_user_everyday;
		2.create table dw_avgpv_user_everyday(
			day string,
			avgpv string);
		3.insert into table dw_avgpv_user_everyday
		  select '20130918',sum(b.pvs)/count(b.remote_addr) from
		  (select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;


Per-page visit statistics

PVs per page
	select request as request,count(request) as request_counts from
	ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;


Website Traffic Log Analysis -- Module Development -- Page Analysis -- Hot Pages

Hot page statistics
Find each day's top 10 hottest pages
	1.drop table dw_hotpages_everyday;
	2.create table dw_hotpages_everyday(day string,url string,pvs string);
	3.insert into table dw_hotpages_everyday
	  select '20130918',a.request,a.request_counts from
	  (select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' group by request having request is not null) a
	  order by a.request_counts desc limit 10;


 Website Traffic Log Analysis -- Module Development -- Visitor Analysis -- Unique Visitors & New Visitors

1.Unique visitors
	1.Requirement: count unique visitors, and the PVs they generate, per time dimension
	2.Time dimension: hour
		1.drop table dw_user_dstc_ip_h;
		2.create table dw_user_dstc_ip_h(
			remote_addr string,
			pvs      bigint,
			hour     string);
		3.insert into table dw_user_dstc_ip_h 
		  select remote_addr,count(1) as pvs,concat(month,day,hour) as hour 
		  from ods_weblog_detail
		  where datestr='20130918'
		  group by concat(month,day,hour),remote_addr;

	3.On top of this, further analysis is possible, e.g. total unique visitors per hour
		select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
 
+---------------+----------+--+
| dstc_ip_cnts  |   hour   |
+---------------+----------+--+
| 19            | 0918 06  |
| 98            | 0918 07  |
| 129           | 0918 08  |
| 149           | 0918 09  |
| 107           | 0918 10  |
| 54            | 0918 11  |
| 52            | 0918 12  |
| 71            | 0918 13  |
| 62            | 0918 14  |
| 72            | 0918 15  |
| 93            | 0918 16  |
| 55            | 0918 17  |


	4.Time dimension: day
		select remote_addr,count(1) as counts,concat(month,day) as day
		from ods_weblog_detail
		where datestr='20130918'
		group by concat(month,day),remote_addr;

+------------------+---------+-------+--+
|   remote_addr    | counts  |  day  |
+------------------+---------+-------+--+
| 1.162.203.134    | 1       | 0918  |
| 1.202.186.37     | 28      | 0918  |
| 1.202.222.147    | 1       | 0918  |
| 1.202.70.78      | 1       | 0918  |
| 1.206.126.5      | 1       | 0918  |
| 1.34.23.44       | 1       | 0918  |
| 1.80.249.223     | 5       | 0918  |
| 1.82.139.173     | 24      | 0918  |
| 101.226.102.97   | 1       | 0918  |
| 101.226.166.214  | 1       | 0918  |
| 101.226.166.216  | 1       | 0918  |
| 101.226.166.222  | 1       | 0918  |
| 101.226.166.235  | 2       | 0918  |
| 101.226.166.236  | 1       | 0918  |
| 101.226.166.237  | 2       | 0918  |

	5.Time dimension: month
		select remote_addr,count(1) as counts,month 
		from ods_weblog_detail
		group by month,remote_addr;

+------------------+---------+--------+--+
|   remote_addr    | counts  | month  |
+------------------+---------+--------+--+
| 1.162.203.134    | 1       | 09     |
| 1.202.186.37     | 35      | 09     |
| 1.202.222.147    | 1       | 09     |
| 1.202.70.78      | 1       | 09     |
| 1.206.126.5      | 34      | 09     |
| 1.34.23.44       | 1       | 09     |
| 1.80.245.79      | 1       | 09     |
| 1.80.249.223     | 5       | 09     |
| 1.82.139.173     | 24      | 09     |
| 101.226.102.97   | 1       | 09     |
| 101.226.166.214  | 1       | 09     |


2.Daily new visitors
	1.Requirement: identify each day's new visitors.
	2.Cumulative table of distinct historical visitors
		1.drop table dw_user_dsct_history;
		2.create table dw_user_dsct_history(
			day string,
			ip string
		  ) partitioned by(datestr string);

	3.Daily new visitor table
		1.drop table dw_user_new_d;
		2.create table dw_user_new_d (
			day string,
			ip string
		  ) partitioned by(datestr string);

	4.Insert each day's new visitors into the new visitor table
		insert into table dw_user_new_d partition(datestr='20130918')
		select tmp.day as day,tmp.today_addr as new_ip from
		(
			select today.day as day,today.remote_addr as today_addr,old.ip as old_addr 
			from 
			(select distinct remote_addr as remote_addr,"20130918" as day from ods_weblog_detail where datestr="20130918") today
				left outer join 
				dw_user_dsct_history old
				on today.remote_addr=old.ip
			) tmp where tmp.old_addr is null;

	5.Append each day's new visitors to the cumulative table
		insert into table dw_user_dsct_history partition(datestr='20130918')
		select day,ip from dw_user_new_d where datestr='20130918';

	6.Verification:
		select count(distinct remote_addr) from ods_weblog_detail; -- returns 1027 
		select count(1) from dw_user_dsct_history where datestr='20130918';  -- returns 1027 
		select count(1) from dw_user_new_d where datestr='20130918'; -- returns 1027 

 Website Traffic Log Analysis -- Module Development -- Visitor Analysis -- Returning Visitors & Average Visit Frequency (Clickstream Model)

1.Returning vs one-time visitor statistics
	1.drop table dw_user_returning;
	2.create table dw_user_returning(
		day string,
		remote_addr string,
		acc_cnt string)
	  partitioned by (datestr string);
	3.insert overwrite table dw_user_returning partition(datestr='20130918')
	  select tmp.day,tmp.remote_addr,tmp.acc_cnt
	  from (select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp where tmp.acc_cnt>1;

2.Average visit frequency per visitor
	select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918'; -- returns 1.4339622641509433

 Website Traffic Log Analysis -- Module Development -- Conversion Analysis -- Step-by-Step Funnel Conversion Rates

1.Funnel model raw data: click-part-r-00000
	1.hdfs dfs -put /root/hivedata/weblog/click-part-r-00000 /weblog/clickstream/pageviews
	2.load data inpath '/weblog/clickstream/pageviews/click-part-r-00000' overwrite into table ods_click_pageviews partition(datestr='20130920');
	3.select * from ods_click_pageviews where datestr='20130920' limit 10;

+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
|      ods_click_pageviews.session      | ods_click_pageviews.remote_addr  | ods_click_pageviews.remote_user  | ods_click_pageviews.time_local  | ods_click_pageviews.request  | ods_click_pageviews.visit_step  | ods_click_pageviews.page_staylong  | ods_click_pageviews.http_referer  |           ods_click_pageviews.http_user_agent           | ods_click_pageviews.body_bytes_sent  | ods_click_pageviews.status  | ods_click_pageviews.datestr  |
+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:15:42             | /item/HZxEY8vF               | 1                               | 340                                | /item/qaLW7pa5                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:21:22             | /item/IyA5hVop               | 2                               | 1                                  | /item/MQtiwwhj                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:21:23             | /item/RDqibwBo               | 3                               | 44                                 | /item/RCbNqxIy                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:22:07             | /item/IzrJixZc               | 4                               | 101                                | /item/RCbNqxIy                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:23:48             | /item/yrZqXxfN               | 5                               | 19                                 | /item/1Wvc1NeH                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:24:07             | /item/hWBn8VCg               | 6                               | 442                                | /item/LwOziljH                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:31:29             | /item/1nQESbrT               | 7                               | 348                                | /item/GFDdR8SR                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:37:17             | /item/c                      | 8                               | 2                                  | /category/d                       | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:37:19             | /item/a                      | 9                               | 11                                 | /category/c                       | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
| 47826dd6-be71-42df-96b2-14ff65425975  |                                  | -                                | 2013-09-20 00:37:30             | /item/X2b5exuV               | 10                              | 348                                | /item/N2Pos96N                    | Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36  | 1800                                 | 200                         | 20130920                     |
+---------------------------------------+----------------------------------+----------------------------------+---------------------------------+------------------------------+---------------------------------+------------------------------------+-----------------------------------+---------------------------------------------------------+--------------------------------------+-----------------------------+------------------------------+--+


2.Count the total visitors at each step
	create table dw_oute_numbs as 
	select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'
	union
	select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'
	union
	select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%'
	union
	select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%';

3.Compute each step's share relative to the funnel start
	1.Cascaded query: join the table with itself
		select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
		inner join 
		dw_oute_numbs rr;

	The self-join result is shown below:
+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step1   | 1029     |
| step2   | 1029     | step1   | 1029     |
| step3   | 1028     | step1   | 1029     |
| step4   | 1018     | step1   | 1029     |
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step2   | 1029     |
| step3   | 1028     | step2   | 1029     |
| step4   | 1018     | step2   | 1029     |
| step1   | 1029     | step3   | 1028     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step3   | 1028     |
| step4   | 1018     | step3   | 1028     |
| step1   | 1029     | step4   | 1018     |
| step2   | 1029     | step4   | 1018     |
| step3   | 1028     | step4   | 1018     |
| step4   | 1018     | step4   | 1018     |
+---------+----------+---------+----------+--+

	2.Each step's count / the first step's count == each step's share relative to the funnel start
		select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
		from
		(
			select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
			inner join 
			dw_oute_numbs rr
		) tmp where tmp.rrstep='step1';

4.Compute each step's leakage rate relative to the previous step
	1.First, filter the self-join down to adjacent step pairs
		select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
		inner join 
		dw_oute_numbs rr
		where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;

+---------+----------+---------+----------+--+
| rnstep  | rnnumbs  | rrstep  | rrnumbs  |
+---------+----------+---------+----------+--+
| step1   | 1029     | step2   | 1029     |
| step2   | 1029     | step3   | 1028     |
| step3   | 1028     | step4   | 1018     |
+---------+----------+---------+----------+--+

	2.Then each step's rate relative to the previous step is straightforward to compute
		select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
		from
		(
			select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
			inner join 
			dw_oute_numbs rr
		) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;

5.Combine the two metrics above
	select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
	from 
	(
		select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
		from
		(
			select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
			inner join 
			dw_oute_numbs rr
		) tmp where tmp.rrstep='step1'
	) abs
	left outer join
	(
		select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
		from
		(
			select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
			inner join 
			dw_oute_numbs rr
		) tmp where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
	) rel on abs.step=rel.step;
 


 Website Traffic Log Analysis -- Module Development -- Conversion Analysis -- Cascading Sums (Running Totals)

1.Create the table
	create table t_access_times(username string,month string,salary int)
	row format delimited fields terminated by ',';

2.Load the data
	1.hdfs dfs -put /root/hivedata/weblog/t_access_times.dat /weblog
	2.load data inpath '/weblog/t_access_times.dat' overwrite into table t_access_times;
	3.select * from t_access_times limit 10;
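	A hypothetical t_access_times.dat might look like the comma-delimited rows below (values chosen so the monthly sums match the output of the next step):
		A,2015-01,5
		A,2015-01,15
		A,2015-01,13
		A,2015-02,4
		A,2015-02,6
		B,2015-01,30
		B,2015-02,10
		B,2015-02,5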

3.Step 1: compute each user's monthly total
	select username,month,sum(salary) as salary from t_access_times group by username,month;

+-----------+----------+---------+--+
| username  |  month   | salary  |
+-----------+----------+---------+--+
| A         | 2015-01  | 33      |
| A         | 2015-02  | 10      |
| B         | 2015-01  | 30      |
| B         | 2015-02  | 15      |
+-----------+----------+---------+--+

4.Step 2: join the monthly-total result with itself
	select A.*,B.* FROM
	(select username,month,sum(salary) as salary from t_access_times group by username,month) A 
	inner join 
	(select username,month,sum(salary) as salary from t_access_times group by username,month) B
	on A.username=B.username
	where B.month <= A.month;

+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username  | a.month  | a.salary  | b.username  | b.month  | b.salary  |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A           | 2015-01  | 33        | A           | 2015-01  | 33        |
| A           | 2015-01  | 33        | A           | 2015-02  | 10        |
| A           | 2015-02  | 10        | A           | 2015-01  | 33        |
| A           | 2015-02  | 10        | A           | 2015-02  | 10        |
| B           | 2015-01  | 30        | B           | 2015-01  | 30        |
| B           | 2015-01  | 30        | B           | 2015-02  | 15        |
| B           | 2015-02  | 15        | B           | 2015-01  | 30        |
| B           | 2015-02  | 15        | B           | 2015-02  | 15        |
+-------------+----------+-----------+-------------+----------+-----------+--+

5.Step 3: group the previous result
	by a.username and a.month,
	and compute the running monthly total by summing all b.salary where b.month <= a.month
	select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate 
	from 
	(select username,month,sum(salary) as salary from t_access_times group by username,month) A 
	inner join 
	(select username,month,sum(salary) as salary from t_access_times group by username,month) B 
	on A.username=B.username 
	where B.month <= A.month 
	group by A.username,A.month 
	order by A.username,A.month;
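6.Note: on Hive versions with window function support, the same running total can be written without the self-join:
	select username,month,salary,
	       sum(salary) over (partition by username order by month
	                         rows between unbounded preceding and current row) as accumulate
	from (select username,month,sum(salary) as salary
	      from t_access_times group by username,month) t;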

