MapReduce code: https://github.com/pickLXJ/analysisSogou.git
Log data: https://pan.baidu.com/s/112P_hR9FlQq7htyTVjxgwg
1. Log Format
The Sogou query log format is documented at https://www.sogou.com/labs/resource/q.php
Raw data:
20111230000418 e686beaf83faa9a106b1a023923edd74 黑镜头 9 2 http://bbs.tiexue.net/post_4161367_1.html
20111230000418 5467c699d1ae4a61b6d53bb2fe83c04a 搜索 WWW.MMPPTV.COM 6 3 http://9bc947d.he.artseducation.com.cn/
20111230000418 55623d0852a5161063c6d01f0856a814 百里挑一主题歌是什么 5 1 http://zhidao.baidu.com/question/169708995
20111230000418 8d737be3a9c125181bdd422287bee65f 钻石价格查询 4 2 http://tool.wozuan.com/
20111230000419 bbe344592ade912de81595d2ec140c0d 眉山电信 9 1 http://www.aibang.com/detail/1232487017-414995109
20111230000419 df79cc0c9a4c9faa1656023c5c12265e 好看的高干文 8 2 http://www.tianya.cn/publicforum/content/funinfo/1/1643841.shtml
20111230000419 ec0363079f36254b12a5e30bdc070125 AQVOX 8 7 http://www.erji.net/simple/index.php?t122047.html
2. Data Cleaning
Shell scripts strip blank records and convert part of the data.
Extend script (adds year, month, day fields):
vim log-extend.sh
[root@bigdata000 ~]# log-extend.sh /home/samba/sample/file/sogou.500w.utf8 /home/samba/sample/file/sogou_log.txt
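The contents of log-extend.sh are not shown above. A minimal sketch of what it might look like, assuming the first tab-separated field is a yyyyMMddHHmmss timestamp (the function name and layout here are hypothetical):

```shell
#!/bin/bash
# log-extend.sh -- hypothetical sketch; the original script is not shown.
# Assumes field 1 is a yyyyMMddHHmmss timestamp and appends four columns:
# year, month, day, hour.
extend_log() {
    awk -F '\t' 'NF {
        y = substr($1, 1, 4); m = substr($1, 5, 2)
        d = substr($1, 7, 2); h = substr($1, 9, 2)
        print $0 "\t" y "\t" m "\t" d "\t" h
    }' "$@"
}

# usage: log-extend.sh <infile> <outfile>
if [ $# -ge 2 ]; then
    extend_log "$1" > "$2"
fi
```

Note that some of the query output later in this post shows year/month/day values that do not match this derivation, so the original script may have computed them differently; treat this only as an illustration.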
Filter script (drops records whose query fields are empty):
vim log-filter.sh
#!/bin/bash
#infile=/home/sogou_log.txt
infile=$1
#outfile=/home/sogou_log.txt.flt
outfile=$2
awk -F "\t" '{if($2 != "" && $3 != "" && $2 != " " && $3 != " ") print $0}' "$infile" > "$outfile"
[root@bigdata000 ~]# log-filter.sh /home/samba/sample/file/sogou_log.txt /home/samba/sample/file/sogou_log.txt.flt
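As a quick sanity check (a hypothetical helper, not part of the original scripts), you can compare line counts before and after filtering to see how many records were dropped:

```shell
#!/bin/bash
# count_dropped: report how many lines the filter removed.
# Hypothetical helper -- not part of the original scripts.
count_dropped() {
    local in_lines out_lines
    in_lines=$(( $(wc -l < "$1") ))
    out_lines=$(( $(wc -l < "$2") ))
    echo "kept $out_lines of $in_lines (dropped $((in_lines - out_lines)))"
}

# usage: count_dropped /home/samba/sample/file/sogou_log.txt /home/samba/sample/file/sogou_log.txt.flt
```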
Building a Data Warehouse for the Log Data with Hive
- Create the database
hive> create database sogou;
- Use the database
hive> use sogou;
- Create an external table with 4 extra fields (year, month, day, hour):
hive> CREATE EXTERNAL TABLE sogou_data(
ts string,
uid string,
keyword string,
rank int,
sorder int,
url string,
year int,
month int,
day int,
hour int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
OK
Time taken: 0.412 seconds
- Load local data into the Hive table:
load data local inpath '/home/samba/sample/file/sogou_log.txt.flt' into table sogou_data;
- Create a partitioned table:
hive> CREATE EXTERNAL TABLE sogou_partitioned_data(
ts string,
uid string,
keyword string,
rank int,
sorder int,
url string)
> PARTITIONED BY(year int,month int,day int,hour int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
- Enable dynamic partitioning:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE sogou_partitioned_data partition(year,month,day,hour) SELECT * FROM sogou_data;
Query Tests
- Query the first ten rows:
> select * from sogou_data limit 10;
OK
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/ 2011 11 23 0
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1 2011 11 23 0
20111230000007 b97920521c78de70ac38e3713f524b50 本本联盟 1 1 http://www.bblianmeng.com/ 2011 11 23 0
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1 http://lib.scnu.edu.cn/ 2011 11 23 0
20111230000008 f2f5a21c764aebde1e8afcc2871e086f 在线代理 2 1 http://proxyie.cn/ 2011 11 23 0
20111230000009 96994a0480e7e1edcaef67b20d8816b7 伟大导演 1 1 http://movie.douban.com/review/1128960/ 2011 11 23 0
20111230000009 698956eb07815439fe5f46e9a4503997 youku 1 1 http://www.youku.com/ 2011 11 23 0
20111230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房产网 1 1 http://hf.house365.com/ 2011 11 23 0
20111230000010 f577230df7b6c532837cd16ab731f874 哈萨克网址大全 1 1 http://www.kz321.com/ 2011 11 23 0
20111230000010 285f88780dd0659f5fc8acc7cc4949f2 IQ数码 1 1 http://www.iqshuma.com/ 2011 11 23 0
Time taken: 2.522 seconds, Fetched: 10 row(s)
- Query what a given user searched for:
hive> select * from sogou_data where uid='6961d0c97fe93701fc9c0d861d096cd9';
OK
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1 http://lib.scnu.edu.cn/ 2011 11 23 0
20111230065007 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1 http://lib.scnu.edu.cn/ 2011 11 23 0
Time taken: 0.653 seconds, Fetched: 2 row(s)
- Total row count:
hive> select count(*) from sogou_partitioned_data;
Query ID = root_20181214010000_020e4437-b637-4861-bac3-21be3a0754b5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0001, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0001/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1544683093139_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
Ended Job = job_1544683093139_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 70.68 sec HDFS Read: 573691364 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 10 seconds 680 msec
OK
5000000
Time taken: 236.402 seconds, Fetched: 1 row(s)
- Count of rows with a non-empty keyword:
> select count(*) from sogou_partitioned_data where keyword is not null and keyword!='';
Query ID = root_20181214010606_d8a11bd2-3cbc-482b-ba0d-27bf65d1589c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0002, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0002/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1544683093139_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
MapReduce Total cumulative CPU time: 1 minutes 12 seconds 720 msec
Ended Job = job_1544683093139_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 72.72 sec HDFS Read: 573693021 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 12 seconds 720 msec
OK
5000000
Time taken: 90.678 seconds, Fetched: 1 row(s)
- Count of non-duplicated rows:
hive> select count(*) from(select count(*) as no_repeat_count from sogou_partitioned_data group by ts,uid,keyword,url having no_repeat_count=1) a;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 3 Cumulative CPU: 383.06 sec HDFS Read: 573702274 HDFS Write: 351 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 12.22 sec HDFS Read: 5186 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 35 seconds 280 msec
OK
4999272
Time taken: 448.265 seconds, Fetched: 1 row(s)
- Number of distinct UIDs:
hive> select count(distinct(uid)) from sogou_partitioned_data;
Ended Job = job_1544683093139_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 88.13 sec HDFS Read: 573691789 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 28 seconds 130 msec
OK
1352664
Time taken: 91.419 seconds, Fetched: 1 row(s)
Analysis Requirement 2: Keyword Analysis
(1) Query frequency ranking (top 50 most frequent keywords):
> select keyword,count(*) query_count from sogou_partitioned_data group by keyword order by query_count desc limit 50;
Total MapReduce CPU Time Spent: 3 minutes 10 seconds 30 msec
OK
百度 38441
baidu 18312
人体艺术 14475
4399小游戏 11438
qq空间 10317
优酷 10158
新亮剑 9654
馆陶县县长闫宁的父亲 9127
公安卖萌 8192
百度一下 你就知道 7505
百度一下 7104
4399 7041
魏特琳 6665
qq网名 6149
7k7k小游戏 5985
黑狐 5610
儿子与母亲不正当关系 5496
新浪微博 5369
李宇春体 5310
新疆暴徒被击毙图片 4997
hao123 4834
123 4829
4399洛克王国 4112
qq头像 4085
nba 4027
龙门飞甲 3917
qq个性签名 3880
张去死 3848
cf官网 3729
凰图腾 3632
快播 3423
金陵十三钗 3349
吞噬星空 3330
dnf官网 3303
武动乾坤 3232
新亮剑全集 3210
电影 3155
优酷网 3115
两次才处决美女罪犯 3106
电影天堂 3028
土豆网 2969
qq分组 2940
全国各省最低工资标准 2872
清代姚明 2784
youku 2783
争产案 2755
dnf 2686
12306 2682
身份证号码大全 2680
火影忍者 2604
Time taken: 240.291 seconds, Fetched: 50 row(s)
hive> select keyword,count(*)query_count from sogou_partitioned_data group by keyword order by query_count desc limit 50;
Analysis Requirement 3: UID Analysis
- Number of users with more than 2 queries:
hive> select count(*) from (select count(*) as query_count from sogou_partitioned_data group by uid having query_count > 2) a;
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 19 seconds 420 msec
OK
546353
Time taken: 249.635 seconds, Fetched: 1 row(s)
- Proportion of users with more than 2 queries:
A:
hive> select count(*) from(select count(*) as query_count from sogou_partitioned_data group by uid having query_count > 2) a;
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 minutes 13 seconds 250 msec
OK
546353
Time taken: 239.699 seconds, Fetched: 1 row(s)
B:
> select count(distinct(uid)) from sogou_partitioned_data;
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 106.46 sec HDFS Read: 573691789 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 46 seconds 460 msec
OK
1352664
Time taken: 109.001 seconds, Fetched: 1 row(s)
A/B
hive> select 546353/1352664;
OK
0.40390887907122536
Time taken: 0.255 seconds, Fetched: 1 row(s)
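The two-step A/B computation above can also be folded into a single HiveQL statement; a sketch against the same table (the alias `t` is arbitrary):

```sql
-- Ratio of users with more than 2 queries, in one pass.
-- The inner query yields one row per uid, so COUNT(*) over it equals
-- the number of distinct uids (the B value above).
SELECT SUM(IF(query_count > 2, 1, 0)) / COUNT(*) AS ratio
FROM (
  SELECT uid, COUNT(*) AS query_count
  FROM sogou_partitioned_data
  GROUP BY uid
) t;
```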
- Proportion of clicks with rank within 10 (rank is the fourth column):
A:
hive> select count(*) from sogou_partitioned_data where rank < 11;
4999869
Time taken: 29.653 seconds, Fetched: 1 row(s)
B:
hive> select count(*) from sogou_partitioned_data;
5000000
A/B
hive> select 4999869/5000000;
OK
0.9999738
- Proportion of queries entered directly as URLs:
A:
hive> select count(*) from sogou_partitioned_data where keyword like '%www%';
OK
73979
B:
hive> select count(*) from sogou_partitioned_data;
OK
5000000
A/B
hive> select 73979/5000000;
OK
0.0147958
Analysis Requirement 4: Individual User Behavior Analysis
(1) Find the uids that searched for "仙剑奇侠传" more than 3 times:
> select uid,count(*) as cnt from sogou_partitioned_data where keyword='仙剑奇侠传' group by uid having cnt > 3;
Query ID = root_20181214020303_dbf96d64-9f8e-4ed5-844d-711de957e8b8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1544683093139_0015, Tracking URL = http://bigdata000:8088/proxy/application_1544683093139_0015/
Kill Command = /app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1544683093139_0015
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3
MapReduce Total cumulative CPU time: 1 minutes 37 seconds 730 msec
Ended Job = job_1544683093139_0015
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 3 Cumulative CPU: 97.73 sec HDFS Read: 573703160 HDFS Write: 70 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 37 seconds 730 msec
OK
653d48aa356d5111ac0e59f9fe736429 6
e11c6273e337c1d1032229f1b2321a75 5
Time taken: 106.129 seconds, Fetched: 2 row(s)
hive> select uid,count(*) as cnt from sogou_partitioned_data where keyword='仙剑奇侠传' group by uid having cnt > 3;