Study Notes: Pig Basics

I. Introduction to Pig

1. Origin

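One drawback of MapReduce is its long development cycle. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the results is a time-consuming process. In fact, Pig was designed precisely because Yahoo! wanted its researchers and engineers to be able to mine large datasets easily.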

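2. Basics

Pig is a scripting language for exploring large datasets.

The advantage of Pig is that a few lines of Pig code typed at the console are enough to process terabytes of data.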

II. Pig Exercise

The file below is the access log of a website. Use Pig to compute the number of clicks for each IP.

1. Data source

119.146.220.12 - - [31/Jan/2012:23:59:44 +0800] "POST /forum.php?mod=post&action=reply&fid=53&tid=69&extra=page%3D1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1 HTTP/1.1" 200 397 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /forum.php?mod=viewthread&tid=69&viewpid=677&from=&inajax=1&ajaxtarget=post_new HTTP/1.1" 200 4794 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /static/js/common_extra.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/common.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_forum_forumdisplay.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /forum.php?mod=forumdisplay&fid=53&page=1 HTTP/1.1" 200 49334 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_widthauto.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/forum.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /static/js/seditor.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1328025588 HTTP/1.1" 200 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
220.181.94.221 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_11&touid=11&pmid=0&daterange=2&pid=77&tid=26 HTTP/1.1" 200 10074 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
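In every line of the sample, the client IP is the first space-separated field. As a small local illustration (plain Python, not part of the Pig job; the helper name `first_field` is made up for this sketch), splitting one of the sample lines on a space shows what the first column holds. Splitting on a space also breaks the quoted fields apart, which does not matter here because only the first column is used.

```python
# Hypothetical helper for illustration; it is not part of the original notes.
def first_field(line: str, delimiter: str = " ") -> str:
    """Return the text before the first delimiter in a log line."""
    return line.split(delimiter, 1)[0]


# One line copied from the sample log above.
sample = (
    '119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] '
    '"GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 '
    '"http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"'
)

print(first_field(sample))  # prints 119.146.220.12
```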

2. Pig commands

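-- Load the access log from HDFS, split each line on spaces, and keep only the first column (the IP).
-- Pig Latin comments start with "--" (or use /* ... */); "//" is not valid comment syntax.
records = LOAD 'hdfs://hadoop:9000/class7/input/website_log.txt' USING PigStorage(' ') AS (ip:chararray);

-- Group by IP and count the clicks for each IP.
records_b = GROUP records BY ip;
records_c = FOREACH records_b GENERATE group, COUNT(records) AS click;

-- Sort by click count in descending order and keep the 10 IPs with the most clicks.
records_d = ORDER records_c BY click DESC;
top10 = LIMIT records_d 10;

-- Store the result in the class7 directory on HDFS.
STORE top10 INTO 'hdfs://hadoop:9000/class7/out';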
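
The Pig statements above group by IP, count, sort in descending order, and keep ten rows. To sanity-check the result on a small sample before running the job on a cluster, the same steps can be sketched locally in Python. This is only a rough equivalent under stated assumptions: the function name `top_ips` is hypothetical, blank lines are skipped as a choice of this sketch, and the order of IPs with equal counts may differ from what Pig produces.

```python
from collections import Counter


# Hypothetical local cross-check; it does not use Pig or HDFS.
def top_ips(lines, n=10):
    """Count the first space-separated field of each line; return the n most frequent."""
    counts = Counter(line.split(" ", 1)[0] for line in lines if line.strip())
    return counts.most_common(n)


# Abbreviated stand-ins for log lines: only the leading fields are kept.
lines = [
    "119.146.220.12 - - [31/Jan/2012:23:59:44 +0800]",
    "119.146.220.12 - - [31/Jan/2012:23:59:45 +0800]",
    "220.181.94.221 - - [31/Jan/2012:23:59:49 +0800]",
]

for ip, clicks in top_ips(lines):
    print(f"{ip}\t{clicks}")
```

For the 20 sample lines listed in the data source section, the counts would be 19 for 119.146.220.12 and 1 for 220.181.94.221.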

Reposted from www.cnblogs.com/FrankZhou2017/p/9145419.html