Interactive Data Processing

  I. Data Preprocessing

  1. Inspecting the Data

  Copy sogou.500w.utf8 — a 547 MB file containing five million Sogou web-access log records — to /home/jun/Resources. You can then view its contents with the less command, paging up and down with PgUp/PgDn and pressing q to quit.
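  For reference, the copy and inspection steps might look like the following (a sketch; the source path of the dataset is an assumption):

cp /path/to/sogou.500w.utf8 /home/jun/Resources/
cd /home/jun/Resources
less sogou.500w.utf8

  The first few records look like this: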

20111230000013  b89952902d7821db37e8999776b32427        怎么骂一个人不带脏字    2       2       http://zhidao.baidu.com/question/224925866
20111230000013  072fa3643c91b29bd586aff29b402161        暴力破解无线网络密码    2       1       http://download.csdn.net/detail/think1919/3935722
20111230000014  f31f594bd1f3147298bd952ba35de84d        12306.cn        1       1       http://www.12306.cn/

  Each record consists of six fields: access time, user ID, query keywords, the rank of the result the user clicked, the sequence number of the user's click, and the clicked URL. The fields are separated by a single tab character ("\t").
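  As a quick sanity check that every line really has six tab-separated fields, awk can tabulate the field count per line (a one-liner sketch, not run here):

awk -F '\t' '{print NF}' sogou.500w.utf8 | sort | uniq -c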

  Use the wc command to count the file's lines (-l), words (-w), and bytes (-c). The output below confirms that there really are five million records — enough to reasonably call this big data.

[jun@master Resources]$ wc -l sogou.500w.utf8 
5000000 sogou.500w.utf8
[jun@master Resources]$ wc -w sogou.500w.utf8 
30436251 sogou.500w.utf8
[jun@master Resources]$ wc -c sogou.500w.utf8 
573670020 sogou.500w.utf8

  You can also use the head command to slice off part of the file:

[jun@master Resources]$ head -200 sogou.500w.utf8 > sogou.200.utf8
[jun@master Resources]$ wc -l sogou.200.utf8 
200 sogou.200.utf8

  2. Extending the Data

  The first field of each record has the form 20111230000013. To make later aggregation easier, split this string into four new fields for the year (2011), month (12), day (30), and hour (00), and append them to the end of each record.

  A shell script can do this work. Its first argument is the input file and its second is the output file; an awk program rewrites the fields of each line.

#!/bin/bash
# Usage: sogou-log-extend.sh <infile> <outfile>
infile=$1
outfile=$2
# awk's substr() is 1-indexed: characters 1-4 are the year, 5-6 the month,
# 7-8 the day, and 9-10 the hour.
awk -F '\t' '{print $0"\t"substr($1,1,4)"\t"substr($1,5,2)"\t"substr($1,7,2)"\t"substr($1,9,2)}' "$infile" > "$outfile"

  Make the script executable, then run it:

[jun@master Resources]$ chmod +x sogou-log-extend.sh
[jun@master Resources]$ ./sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext

  Inspecting the generated file shows that four fields have indeed been appended to each record. (Note: the sample output below was generated with off-by-one substr offsets, so its month and day columns read 11 and 23 instead of 12 and 30; the 1-indexed offsets in the script above extract the correct values from 20111230.)

20111230000013  e0d255845fc9e66b2a25c43a70de4a9a        无饶河,益慎职 意思     3       1       http://hanyu.iciba.com/wiki/1230433.shtml       2011    11      23      00
20111230000013  b89952902d7821db37e8999776b32427        怎么骂一个人不带脏字    2       2       http://zhidao.baidu.com/question/224925866      2011    11      23      00
20111230000013  072fa3643c91b29bd586aff29b402161        暴力破解无线网络密码    2       1       http://download.csdn.net/detail/think1919/3935722       2011    11      23      00

  3. Filtering the Data

  Analysis shows that some of the five million records are incomplete, missing one or more fields. To keep only reasonably complete data, filter out every record whose second or third field is empty. Again, a short shell script does the job.

#!/bin/bash
# Usage: sogou-log-filter.sh <infile> <outfile>
infile=$1
outfile=$2
# Keep only records whose 2nd and 3rd fields are neither empty nor a lone space
awk -F"\t" '{if($2 != "" && $3 != "" && $2 != " " && $3 != " ") print $0}' "$infile" > "$outfile"

  Make the script executable, then run it:

[jun@master Resources]$ chmod +x sogou-log-filter.sh 
[jun@master Resources]$ ./sogou-log-filter.sh sogou.500w.utf8.ext sogou.500w.utf8.flt
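  Comparing line counts before and after filtering shows how many incomplete records were dropped (not run here):

wc -l sogou.500w.utf8.ext sogou.500w.utf8.flt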

  4. Uploading the Data

  With the data cleaned up, the next step is to analyze it on the Hadoop platform, which of course means putting the file on HDFS. Make sure Hadoop is running, then create a target directory:

[jun@master Resources]$ hadoop fs -mkdir /sogou_ext
[jun@master Resources]$ hadoop fs -mkdir /sogou_ext/20180724

  Upload the file to the newly created directory:

[jun@master Resources]$ hadoop fs -put ~/Resources/sogou.500w.utf8.flt /sogou_ext/20180724/
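  Listing the target directory confirms the upload, for example:

[jun@master Resources]$ hadoop fs -ls /sogou_ext/20180724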

  II. Creating the Data Warehouse

  1. Create a Hive Database

  Start Hive:

[jun@master Resources]$ cd /home/jun/apache-hive-2.3.3-bin/bin         
[jun@master bin]$ ./hive

  Create the database, switch to it, and list the tables it contains:

hive> create database sogou;
OK
Time taken: 6.412 seconds
hive> show databases;
OK
default
sogou
test_db
Time taken: 0.146 seconds, Fetched: 3 row(s)
hive> use sogou;
OK
Time taken: 0.019 seconds
hive> show tables;
OK
Time taken: 0.035 seconds

  2. Create an external table that includes the extended fields:

hive> create external table sogou.sogou_ext_20180724(
    > time string,
    > uid string,
    > keywords string,
    > rank int,
    > ordering int,
    > url string,
    > year int,
    > month int,
    > day int,
    > hour int)
    > comment 'This is the sogou search data extend'
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile
    > location '/sogou_ext/20180724';
OK
Time taken: 0.758 seconds

  3. Create a partitioned table, partitioned by the last four time fields:

hive> create external table sogou.sogou_partition(
    > time string,
    > uid string,
    > keywords string,
    > rank int,
    > ordering int,
    > url string)
    > partitioned by (
    > year int,
    > month int,
    > day int,
    > hour int)
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile
    > ;
OK
Time taken: 0.416 seconds

  4. Load the data. Because the external table was created with location '/sogou_ext/20180724', Hive reads its data directly from that path. The dynamic partition mode is set to nonstrict so that all four partition columns (year, month, day, hour) can be derived from the SELECT instead of being specified statically:

hive> set hive.exec.dynamic.partition.mode=nonstrict;      
hive> insert overwrite table sogou.sogou_partition partition(year,month,day,hour) select * from sogou.sogou_ext_20180724;
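  Once the INSERT finishes, you can verify that dynamic partitioning created the expected partitions:

hive> show partitions sogou.sogou_partition;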

  Query a few of the imported rows:

hive> select * from sogou_ext_20180724 limit 5
    > ;
OK
20111230000005    57375476989eea12893c0c3811607bcf    奇艺高清    1    1    http://www.qiyi.com/    2011    11    23    0
20111230000005    66c5bb7774e31d0a22278249b26bc83a    凡人修仙传    3    1    http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1    2011    11    23    0
20111230000007    b97920521c78de70ac38e3713f524b50    本本联盟    1    1    http://www.bblianmeng.com/    2011    11    23    0
20111230000008    6961d0c97fe93701fc9c0d861d096cd9    华南师范大学图书馆    1    1    http://lib.scnu.edu.cn/    2011    11    23    0
20111230000008    f2f5a21c764aebde1e8afcc2871e086f    在线代理    2    1    http://proxyie.cn/    2011    11    23    0
Time taken: 0.187 seconds, Fetched: 5 row(s)

  III. Data Analysis

  1. Basic Statistics

  (1) Count the total number of records

hive> select count(*) from sogou_ext_20180724;
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2018-07-24 17:32:24,305 Stage-1 map = 0%,  reduce = 0%
2018-07-24 17:32:38,146 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 9.42 sec
2018-07-24 17:32:39,211 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 17.38 sec
2018-07-24 17:32:41,956 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 33.68 sec
2018-07-24 17:32:51,246 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 38.55 sec
MapReduce Total cumulative CPU time: 38 seconds 550 msec
Ended Job = job_1532414392815_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 38.55 sec   HDFS Read: 643703894 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 38 seconds 550 msec
OK
5000000
Time taken: 45.712 seconds, Fetched: 1 row(s)

  (2) Count the records whose keywords field is non-empty. Because records with an empty third field were already removed in the filtering step, this should again return the full five million.

hive> select count(*) from sogou_ext_20180724 where keywords is not null and keywords!='';
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2018-07-24 17:34:28,102 Stage-1 map = 0%,  reduce = 0%
2018-07-24 17:34:39,467 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 6.55 sec
2018-07-24 17:35:04,710 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 56.56 sec
2018-07-24 17:35:07,843 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 57.75 sec
2018-07-24 17:35:08,875 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 60.35 sec
MapReduce Total cumulative CPU time: 1 minutes 0 seconds 350 msec
Ended Job = job_1532414392815_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 60.35 sec   HDFS Read: 643705555 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 0 seconds 350 msec
OK
5000000
Time taken: 49.97 seconds, Fetched: 1 row(s)

  (3) Count the number of distinct uids

hive> select count(distinct(uid)) from sogou_ext_20180724;
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2018-07-24 17:36:46,624 Stage-1 map = 0%,  reduce = 0%
2018-07-24 17:37:00,919 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 10.85 sec
2018-07-24 17:37:02,995 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 46.25 sec
2018-07-24 17:37:10,351 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 54.47 sec
MapReduce Total cumulative CPU time: 54 seconds 470 msec
Ended Job = job_1532414392815_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 54.47 sec   HDFS Read: 643704766 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 54 seconds 470 msec
OK
1352664
Time taken: 33.131 seconds, Fetched: 1 row(s)

  (4) Average number of query terms. The query below splits keywords on whitespace and averages the number of resulting terms per record, so it measures term count rather than character length; since Chinese queries rarely contain spaces, the result is close to 1.

hive> select avg(a.cnt) from (select size(split(keywords,'\\s+')) as cnt from sogou.sogou_ext_20180724) a;
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1
2018-07-24 17:42:10,425 Stage-1 map = 0%,  reduce = 0%
2018-07-24 17:42:24,710 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 11.31 sec
2018-07-24 17:42:31,858 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 49.17 sec
2018-07-24 17:42:34,182 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 53.05 sec
2018-07-24 17:42:36,285 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 55.35 sec
MapReduce Total cumulative CPU time: 55 seconds 350 msec
Ended Job = job_1532414392815_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 55.35 sec   HDFS Read: 643705682 HDFS Write: 109 SUCCESS
Total MapReduce CPU Time Spent: 55 seconds 350 msec
OK
1.0869984
Time taken: 34.047 seconds, Fetched: 1 row(s)

  (5) The 20 most frequent keywords

hive> select keywords, count(*) as cnt from sogou.sogou_ext_20180724 group by keywords order by cnt desc limit 20;
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2018-07-24 17:45:47,831 Stage-2 map = 0%,  reduce = 0%
2018-07-24 17:45:56,536 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 6.49 sec
2018-07-24 17:45:57,613 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 10.82 sec
2018-07-24 17:46:01,833 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 12.62 sec
MapReduce Total cumulative CPU time: 12 seconds 620 msec
Ended Job = job_1532414392815_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 75.89 sec   HDFS Read: 643711008 HDFS Write: 62953072 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 12.62 sec   HDFS Read: 62961345 HDFS Write: 949 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 28 seconds 510 msec
OK
百度    38441
baidu    18312
人体艺术    14475
4399小游戏    11438
qq空间    10317
优酷    10158
新亮剑    9654
馆陶县县长闫宁的父亲    9127
公安卖萌    8192
百度一下 你就知道    7505
百度一下    7104
4399    7041
魏特琳    6665
qq网名    6149
7k7k小游戏    5985
黑狐    5610
儿子与母亲不正当关系    5496
新浪微博    5369
李宇春体    5310
新疆暴徒被击毙图片    4997
Time taken: 95.338 seconds, Fetched: 20 row(s)

  (6) Distribution of queries per user. The query below buckets users by how many queries they issued: exactly one, two, three, or more than three.

hive> select sum(if(uids.cnt=1,1,0)), sum(if(uids.cnt=2,1,0)), sum(if(uids.cnt=3,1,0)), sum(if(uids.cnt>3,1,0)) from (select uid, count(*) as cnt from sogou.sogou_ext_20180724 group by uid) uids;
MapReduce Total cumulative CPU time: 5 seconds 690 msec
Ended Job = job_1532414392815_0020
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 82.1 sec   HDFS Read: 643715334 HDFS Write: 384 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 5.69 sec   HDFS Read: 9325 HDFS Write: 127 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 27 seconds 790 msec
OK
549148    257163    149562    396791
Time taken: 62.601 seconds, Fetched: 1 row(s)
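  The four buckets sum to 549,148 + 257,163 + 149,562 + 396,791 = 1,352,664, which matches the distinct-uid count from (3). To express the distribution as fractions of all users, the same subquery can be reused (a sketch, not run here):

hive> select round(sum(if(uids.cnt=1,1,0))/count(*),4),
    >        round(sum(if(uids.cnt=2,1,0))/count(*),4),
    >        round(sum(if(uids.cnt=3,1,0))/count(*),4),
    >        round(sum(if(uids.cnt>3,1,0))/count(*),4)
    > from (select uid, count(*) as cnt from sogou.sogou_ext_20180724 group by uid) uids;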

  (7) Average number of queries per user

hive> select sum(a.cnt)/count(a.uid) from (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid) a;
MapReduce Total cumulative CPU time: 6 seconds 610 msec
Ended Job = job_1532414392815_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 70.07 sec   HDFS Read: 643712322 HDFS Write: 363 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 6.61 sec   HDFS Read: 9207 HDFS Write: 118 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 16 seconds 680 msec
OK
3.6964094557111005
Time taken: 89.135 seconds, Fetched: 1 row(s)

  (8) Count the users who issued more than two queries. The result, 546,353, is consistent with (6): 149,562 + 396,791 = 546,353.

hive> select count(a.cnt) from (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid having cnt > 2 ) a;
Ended Job = job_1532414392815_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 70.04 sec   HDFS Read: 643713027 HDFS Write: 351 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 4.79 sec   HDFS Read: 7712 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 14 seconds 830 msec
OK
546353
Time taken: 61.16 seconds, Fetched: 1 row(s)

  (9) Show sample records from users with more than two queries

hive> select b.* from 
    > (select uid,count(*) as cnt from sogou.sogou_ext_20180724 group by uid having cnt>2) a
    > join sogou.sogou_ext_20180724 b on a.uid=b.uid
    > limit 20;
MapReduce Total cumulative CPU time: 3 minutes 40 seconds 190 msec
Ended Job = job_1532414392815_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 73.96 sec   HDFS Read: 643711740 HDFS Write: 27591098 SUCCESS
Stage-Stage-2: Map: 5  Reduce: 3   Cumulative CPU: 220.19 sec   HDFS Read: 671324193 HDFS Write: 9785 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 54 seconds 150 msec
OK
20111230222158    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    6    3    http://bbs.17500.cn/thread-2453170-1-1.html    2011    11    23    2
20111230222603    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    10    5    http://www.18888.com/read-htm-tid-6069520.html    2011    11    23    2
20111230222128    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    5    2    http://www.zibocn.com/Infor/i8513.html    2011    11    23    2
20111230222802    000080fd3eaf6b381e33868ec6459c49    福彩3d单选号码走势图    1    1    http://zst.cjcp.com.cn/cjw3d/view/3d_danxuan.php    2011    11    23    2
20111230222417    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    7    4    http://bbs.18888.com/read-htm-tid-4017348.html    2011    11    23    2
20111230220953    000080fd3eaf6b381e33868ec6459c49    福彩3d单选一注法    4    1    http://www.55125.cn/3djq/20111103_352210.htm    2011    11    23    2
20111230211504    0000c2d1c4375c8a827bff5dab0cc0a6    穿越小说txt    3    2    http://www.booktxt.com/chuanyue/    2011    11    23    2
20111230213029    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦txt    1    1    http://ishare.iask.sina.com.cn/f/15694326.html?from=like    2011    11    23    2
20111230211319    0000c2d1c4375c8a827bff5dab0cc0a6    穿越小说txt    2    1    http://www.zlsy.net.cn/    2011    11    23    2
20111230213047    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦txt    2    2    http://www.txtinfo.com/txtshow/txt6105.html    2011    11    23    2
20111230205803    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心歌曲    4    1    http://www.tingge123.com/zhuanji/1606.shtml    2011    11    23    2
20111230205643    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心主题曲    4    1    http://bubujingxin.net/music.shtml    2011    11    23    2
20111230212531    0000c2d1c4375c8a827bff5dab0cc0a6    乱世公主txt    1    1    http://ishare.iask.sina.com.cn/f/20689380.html    2011    11    23    2
20111230210041    0000c2d1c4375c8a827bff5dab0cc0a6    步步惊心歌曲    5    2    http://www.yue365.com/mlist/10981.shtml    2011    11    23    2
20111230213911    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦小说在线阅读    2    1    http://www.readnovel.com/partlist/22004/    2011    11    23    2
20111230213835    0000c2d1c4375c8a827bff5dab0cc0a6    浮生若梦小说txt下载    2    1    http://www.2yanqing.com/f_699993/244670/download.html    2011    11    23    2
20111230195312    0000d08ab20f78881a2ada2528671c58    棉花价格    3    3    http://www.yz88.org.cn/jg/    2011    11    23    1
20111230195114    0000d08ab20f78881a2ada2528671c58    棉花价格    2    2    http://www.cnjidan.com/mianhua.asp    2011    11    23    1
20111230200339    0000d08ab20f78881a2ada2528671c58    棉花价格最新    2    2    http://www.yz88.org.cn/jg/    2011    11    23    2
20111230195652    0000d08ab20f78881a2ada2528671c58    棉花价格行情走势图    1    1    http://www.yz88.org.cn/jg/    2011    11    23    1

  2. User Behavior Analysis

  (1) The first ten results a search engine returns are exactly the ones on the first page, so let's count how many records have a click rank within the top ten:

hive> select count(*) from sogou.sogou_ext_20180724 where rank<11;
MapReduce Total cumulative CPU time: 23 seconds 180 msec
Ended Job = job_1532414392815_0015
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 23.18 sec   HDFS Read: 643705566 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 180 msec
OK
4999869
Time taken: 29.25 seconds, Fetched: 1 row(s)

  As the output shows, 4,999,869 of the 5,000,000 records have a rank of 10 or less; in other words, almost all users click only on results from the first page.

  (2) Some users type keywords, while others cannot remember a site's full domain name and use the search engine to find the site they want to visit. To count the latter, the LIKE query below matches records whose keywords contain "www". As the result shows, the vast majority of users do not query by URL.

hive> select count(*) from sogou.sogou_ext_20180724 where keywords like '%www%';
MapReduce Total cumulative CPU time: 27 seconds 960 msec
Ended Job = job_1532414392815_0016
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 27.96 sec   HDFS Read: 643705515 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 960 msec
OK
73979
Time taken: 28.339 seconds, Fetched: 1 row(s)
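  If a real regular expression is preferred, Hive's RLIKE operator yields an equivalent count (a sketch, not run here):

hive> select count(*) from sogou.sogou_ext_20180724 where keywords rlike 'www';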

  We can go further and count how many of these URL-style queries ended with the user clicking a result containing the queried URL. It turns out that 27,561 / 73,979 ≈ 37% of the users who submitted a URL as their query went on to click a matching result. These are likely users who could not remember the full URL and turned to the search engine to find the address they wanted. This suggests a possible improvement: when handling such queries, the search engine could return the matching complete URL first, which would very likely improve the user experience.

hive> select sum(if(instr(url,keywords)>0,1,0)) from (select * from sogou.sogou_ext_20180724 where keywords like '%www%' ) a;
MapReduce Total cumulative CPU time: 32 seconds 220 msec
Ended Job = job_1532414392815_0017
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 1   Cumulative CPU: 32.22 sec   HDFS Read: 643706391 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 32 seconds 220 msec
OK
27561
Time taken: 27.895 seconds, Fetched: 1 row(s)

  (3) To find out how many people are fans of 仙剑奇侠传, query the uids that searched for it more than three times. Two users qualify, with 6 and 5 queries respectively.

hive> select uid,count(*) as cnt from sogou.sogou_ext_20180724 where keywords='仙剑奇侠传' group by uid having cnt > 3;
Ended Job = job_1532414392815_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3  Reduce: 3   Cumulative CPU: 47.1 sec   HDFS Read: 643717341 HDFS Write: 355 SUCCESS
Total MapReduce CPU Time Spent: 47 seconds 100 msec
OK
653d48aa356d5111ac0e59f9fe736429    6
e11c6273e337c1d1032229f1b2321a75    5
Time taken: 39.244 seconds, Fetched: 2 row(s)

  3. Real-Time Data

  In a real application, in order to display up-to-date search statistics, you would first create staging tables, then process the data at the end of each day and insert the results into those tables for the presentation layer to read.

  (1) Create the staging table

hive> create table sogou.uid_cnt(uid string, cnt int)
    > comment 'This is the sogou search data of one day'
    > row format delimited
    > fields terminated by '\t'
    > stored as textfile;
OK
Time taken: 0.488 seconds

  (2) Insert the data

hive> insert overwrite table sogou.uid_cnt select uid, count(*) as cnt
    > from sogou.sogou_ext_20180724 group by uid;

  (3) Inspect the data

hive> select * from  uid_cnt limit 20;
OK
00005c113b97c0977c768c13a6ffbb95    2
000080fd3eaf6b381e33868ec6459c49    6
0000c2d1c4375c8a827bff5dab0cc0a6    10
0000d08ab20f78881a2ada2528671c58    9
0000e7482034da216ce878a9f16feb49    5
0001520a31ed091fa857050a5df35554    1
0001824d091de069b4e5611aad47463d    1
0001894c9f9de37ef9c90b6e5a456767    2
0001b04bf9473458af40acb4c13f1476    1
0001f5bacf60b0ff8c1c9e66e4905c1f    2
000202ae03f7acc86d5ae784b4bf56ba    1
0002b0dfc0b974b05f246acc590694ea    2
0002c93607740aa5919c0de3645639cb    1
000312ca0eaa91c30e5bafbcf2981bfd    21
00032480797f1578f8fc83f47e180a77    1
00032937ee88388581c86aa910b2a85b    1
0003dbdb7fca09669a9784c6aaaf3bb1    6
00043047d46f5e49dfcf15979b1bd49d    11
00043fcb1a34d32bb06c0dfa35fb199b    3
00047c0822b036bc1b473d9373fda199    1
Time taken: 0.16 seconds, Fetched: 20 row(s)

  Front-end developers can then read this staging table and present the data in whatever form the application requires, such as tables or charts.
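  As a sketch of how the nightly refresh might be automated (the script name, path, and schedule below are assumptions, not part of the original setup), the INSERT can be wrapped in a shell script and driven by cron:

#!/bin/bash
# refresh-uid-cnt.sh -- hypothetical nightly job that rebuilds sogou.uid_cnt
hive -e "
insert overwrite table sogou.uid_cnt
select uid, count(*) as cnt
from sogou.sogou_ext_20180724
group by uid;
"

  A matching (hypothetical) crontab entry would run it at 01:00 every day:

0 1 * * * /home/jun/Resources/refresh-uid-cnt.sh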

 
