Analyzing Website Logs with Hadoop

Copyright notice: this is the author's original article; do not repost without permission. https://blog.csdn.net/ZHJUNJUN93/article/details/78537654

A company website runs on an Apache web server. Apache rotates its log every hour, so each day produces 24 gzipped log files totalling about 200 MB.

  • Create the log table:
hive>

CREATE TABLE log (
  ipaddress STRING, identity STRING, user STRING, time STRING,
  method STRING, request STRING, protocol STRING, status STRING,
  size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^ ]*) ([^ ]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s")
STORED AS TEXTFILE;
  • Manually import a single Apache log file (this session loaded into a separate table, apachelog, while the batch import below targets the log table created above):
> load data local inpath '/home/hduser/log/access_log-20140223' overwrite into table apachelog;

Copying data from file:/home/hduser/log/access_log-20140223
Copying file: file:/home/hduser/log/access_log-20140223
Loading data to table default.apachelog
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /user/hive/warehouse/apachelog
Table default.apachelog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 10588368, raw_data_size: 0]
OK
Time taken: 3.057 seconds
  • Batch-import the January Apache logs with a Python script:
[hduser@master ~]$ vim loadlog.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os

# 24 hourly files per day, hours 00..23 (range(23) would miss hour 23)
hours = ["%02d" % (i) for i in range(24)]
for i in hours:
        print "Importing the January data for hour %s" % i
        cmd = "hive -S -e \"load data local inpath '/home/hduser/cachelog/cache.dayoo.com_1_20140101%s_log.gz' into table log\"" % i
        print cmd
        os.system(cmd)
[hduser@master ~]$ python loadlog.py
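The one-shot script above ignores missing files and failed loads. A variant that checks both is sketched below (same paths and table name as the original; `subprocess` replaces `os.system` so each `hive` invocation's exit code can be inspected — this is a sketch, not the original author's script):

```python
import os
import subprocess

LOG_DIR = "/home/hduser/cachelog"
# One gzipped file per hour: cache.dayoo.com_1_2014010100_log.gz .. ..23_log.gz
FILE_TMPL = "cache.dayoo.com_1_20140101%02d_log.gz"

def build_load_cmd(path, table="log"):
    """Build the argv for one 'hive -S -e' load of a single gz file."""
    return ["hive", "-S", "-e",
            "load data local inpath '%s' into table %s" % (path, table)]

def load_day(run=subprocess.call):
    """Load all 24 hourly files; skip missing files, stop on a failed load."""
    loaded = []
    for hour in range(24):
        path = os.path.join(LOG_DIR, FILE_TMPL % hour)
        if not os.path.exists(path):
            print("missing, skipped: %s" % path)
            continue
        if run(build_load_cmd(path)) != 0:
            raise RuntimeError("hive load failed for %s" % path)
        loaded.append(path)
    return loaded
```

The `run` parameter exists so the loop can be exercised without a Hive installation.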
  • Check how many rows were imported:
[hduser@master ~]$ hive

hive> add jar /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar;
Added /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar to class path
Added resource: /home/hduser/hive-0.11.0/lib/hive-contrib-0.11.0.jar

hive> select count(*) from log;
........................ (job output omitted)
5313762                # just over 5.3 million rows in total
Time taken: 292.139 seconds, Fetched: 1 row(s) 
  • Top pages by number of requests:

The busiest pages in this ranking are natural candidates for optimization.

hive> SELECT request, COUNT(1) AS numrequest FROM log GROUP BY request ORDER BY numrequest DESC LIMIT 10;

(ORDER BY rather than SORT BY: SORT BY only orders rows within each reducer, so combined with LIMIT it does not yield a global top 10 — the "Fetched: 20 row(s)" in the original run below is a symptom of two reducers each returning their own top 10.)

.......................... (job output omitted)
http://img0.dayoo.com/header.css    82538
http://life.dayoo.com/health/   76785
http://img0.dayoo.com/l/100901/dylist.css   66463
http://club.dayoo.com/topic_list_www.dy?b=cityonline&page=5&n=30    63705
http://club.dayoo.com/topic_list_www.dy?b=strangers&page=5&n=30 59187
http://www.dayoo.com/public/footer/copyright.html   42094
http://club.dayoo.com/topic_list_www.dy?b=cityonline&page=4&n=30    42012
http://img0.dayoo.com/dylc/ad/files/2010/zhangjiurong/shengyayiyuan/20131008/dyw.html   41677
http://img0.dayoo.com/dylc/ad/files/2010/linshitupianku/aiguangzhou/aiguangzhou.html    41482
http://rnd.dayoo.com/javascript/jquery/jquery-1.4.2.js  40061
Time taken: 1139.001 seconds, Fetched: 20 row(s)
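Before paying roughly 19 minutes of MapReduce time for a ranking, the same top-N logic can be prototyped locally on one uncompressed hourly file with `collections.Counter` (the sample lines are invented, and the field position is an assumption based on the combined log format used above):

```python
from collections import Counter

def top_requests(lines, n=10):
    """Return the n most common request paths; the path is the 7th
    space-separated field in a combined-format line ("GET <path> HTTP/1.1")."""
    counts = Counter()
    for line in lines:
        fields = line.split(" ")
        if len(fields) > 6:
            counts[fields[6]] += 1
    return counts.most_common(n)

sample = [
    '1.2.3.4 - - [01/Jan/2014:00:00:01 +0800] "GET /a HTTP/1.1" 200 10',
    '1.2.3.4 - - [01/Jan/2014:00:00:02 +0800] "GET /a HTTP/1.1" 200 10',
    '5.6.7.8 - - [01/Jan/2014:00:00:03 +0800] "GET /b HTTP/1.1" 200 10',
]
print(top_requests(sample, 2))  # -> [('/a', 2), ('/b', 1)]
```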
  • Top client IPs by number of requests:
    From the distribution of IP addresses you can derive data such as the types of users visiting the site.
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM log GROUP BY ipaddress ORDER BY numrequest DESC LIMIT 10;
.......................... (job output omitted)
123.151.47.114  214071
116.236.228.19  194238
59.42.241.199   92858
59.42.241.198   92639
123.108.110.196 75841
123.126.105.65  49279
123.151.43.97   31796
42.156.139.46   30928
42.120.160.46   30918
42.120.161.46   30913
Time taken: 443.762 seconds, Fetched: 10 row(s)
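One follow-up question this ranking raises is whether heavy traffic clusters in a few networks (59.42.241.198/.199 suggest it might). A small sketch that folds per-IP counts into /24 networks by their first three octets (IPv4-only, purely illustrative; the sample counts are taken from the ranking above):

```python
from collections import Counter

def top_networks(ip_counts, n=5):
    """Fold (ip, count) pairs into /24 networks keyed by the first three octets."""
    nets = Counter()
    for ip, count in ip_counts:
        prefix = ip.rsplit(".", 1)[0] + ".0/24"
        nets[prefix] += count
    return nets.most_common(n)

# Two IPs from the ranking above fall in the same /24:
ranking = [("59.42.241.199", 92858), ("59.42.241.198", 92639),
           ("123.151.47.114", 214071)]
print(top_networks(ranking, 2))
# -> [('123.151.47.0/24', 214071), ('59.42.241.0/24', 185497)]
```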
