Hadoop Study Notes - 20. Website Log Analysis Project Case (1): Project Introduction


I. Background and Data Overview

1.1 Project Source

  The log data for this practice comes from a domestic technical learning forum. The forum is run by a training organization and brings together many technology learners, with people posting and replying every day, as shown in Figure 1.

Figure 1: Project data source - a technical learning forum

  The purpose of this practice is to analyze the forum's Apache common log (access log) and compute some key indicators for the forum, as a reference for the operators' decision making.

PS: The purpose of developing this system is to obtain business-related indicators that third-party tools cannot provide.

1.2 Data Overview

  The forum data consists of two parts:

  (1) Historical data: about 56GB, with statistics up to 2012-05-29. This also means that before 2012-05-29 the logs were all kept in a single file, written in append mode.

  (2) From 2013-05-30 on, one data file is generated per day, about 150MB each. This also means that after 2013-05-30 the logs are no longer kept in a single file.

  Figure 2 shows the record format of the log data. Each line records five parts: the visitor's IP, the access time, the requested resource, the access status (HTTP status code), and the traffic (bytes) of this access.

Figure 2: Log data record format
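
  To make the format concrete, here is what one such record might look like. This is an illustrative line in the Apache common log style, not an actual record from the dataset, and the file name is a hypothetical example:

```bash
# An illustrative Apache-common-log-style record (not actual project data).
# Fields: visitor IP, access time, requested resource, HTTP status, bytes sent.
head -1 access_2013_05_30.log
# 60.208.6.156 - - [30/May/2013:17:38:20 +0800] "GET /forum.php?mod=viewthread&tid=123 HTTP/1.1" 200 1292
```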

II. Key Performance Indicators (KPI)

2.1 Page Views (PV)

  (1) Definition: Page Views (PV) is the total number of pages viewed by all users; every page a user opens is counted once.

  (2) Analysis: The site's total page views measure users' interest in the site, much like ratings measure interest in TV shows. For website operators, however, what matters more is the number of views under each section.

  Calculation: count the records; the number of accesses is obtained from the log, and it can be further broken down into the number of accesses per section.
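
  A minimal sketch of this count as a Hive query, assuming the cleaned logs have been loaded into a hypothetical partitioned Hive table techbbs(ip, atime, url) with partition column logdate (see Section III below for how such a table could be set up):

```bash
# PV for one day: simply count the records in that day's partition.
# Table and column names (techbbs, logdate) are assumptions for illustration.
hive -e "SELECT COUNT(1) AS pv FROM techbbs WHERE logdate='2013_05_30';"
```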

2.2 Number of Registered Users

  The forum's user registration page is member.php, and the URL requested when a user clicks register is member.php?mod=register.

  Calculation: count the accesses whose URL contains member.php?mod=register.
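
  Under the same hypothetical techbbs table, this count might look like:

```bash
# New registered users: count requests whose URL contains the register action.
hive -e "SELECT COUNT(1) AS reguser FROM techbbs
         WHERE logdate='2013_05_30'
           AND INSTR(url, 'member.php?mod=register') > 0;"
```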

2.3 Number of Independent IPs

  (1) Definition: the number of distinct independent IPs that visit the site within one day. No matter how many pages the same IP visits, it counts as one independent IP.

  (2) Analysis: This is the concept we are most familiar with. No matter how many computers or users sit behind the same IP, to some extent the number of independent IPs is the most direct measure of how well a site promotion campaign performs.

  Calculation: count the distinct visitor IPs.
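
  Sketched against the same hypothetical table:

```bash
# Independent IPs: count distinct visitor IPs within the day's partition.
hive -e "SELECT COUNT(DISTINCT ip) AS ip_count FROM techbbs WHERE logdate='2013_05_30';"
```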

2.4 Bounce Rate

  (1) Definition: the percentage of visits that viewed only one page before leaving the site out of the total number of visits, i.e. the number of single-page visits divided by the total number of visits.

  (2) Analysis: The bounce rate is a very important indicator of visitor stickiness. It shows how interested visitors are in the site: the lower the bounce rate, the better the traffic quality, the more interested visitors are in the site's content, and the more likely they are to become effective, loyal users.

PS: This indicator can also measure the effect of online marketing. It shows how many visitors were drawn by marketing to a product page or to the site and then lost - the cooked duck flew away, so to speak. For example, if the site advertises in some medium, the bounce rate of visitors arriving from that promotion source reflects whether the medium was well chosen, whether the ad copy is well written, and whether the landing page offers a good user experience.

  Calculation: ① count the IPs that appear in only one record within a day, called the bounce count; ② bounce count / PV.
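
  A sketch of both steps in one query, under the same hypothetical table:

```bash
# Bounce rate: IPs with exactly one record that day, divided by the day's PV.
# Table and column names are assumptions, as in the PV example above.
hive -e "
SELECT bounce.n / pv.n AS jumper_rate
FROM (SELECT COUNT(1) AS n
        FROM (SELECT ip FROM techbbs WHERE logdate='2013_05_30'
              GROUP BY ip HAVING COUNT(1) = 1) one_hit) bounce
CROSS JOIN (SELECT COUNT(1) AS n FROM techbbs WHERE logdate='2013_05_30') pv;
"
```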

2.5 Section Popularity Ranking

  (1) Definition: a ranking of the sections by number of visits.

  (2) Analysis: consolidate the success of popular sections and strengthen the development of quiet ones. This also informs the development of the subject areas.

  Calculation: count the visits per section and sort by the count.
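
  Assuming the section can be recovered from the requested URL (for example a board id parameter), a rough sketch using the URL itself as a stand-in:

```bash
# Section popularity: group accesses by URL (as a proxy for the board id)
# and rank by hit count; a real job would first parse the board id from the URL.
hive -e "SELECT url, COUNT(1) AS hits FROM techbbs
         WHERE logdate='2013_05_30'
         GROUP BY url ORDER BY hits DESC LIMIT 10;"
```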

III. Development Steps

3.0 Technologies Required

  (1) Linux shell programming

  (2) HDFS and MapReduce

  (3) HBase, Hive, and Sqoop frameworks

3.1 Upload Log Files to HDFS

  Uploading the log data into HDFS for processing falls into the following cases:

  (1) If the log server's data volume and load are small, the data can be uploaded to HDFS directly with shell commands, as in the sketch after this list;

  (2) If the log server's data volume and load are large, use NFS to upload the data from another server;

  (3) If there are many log servers and the data volume is large, use Flume to collect the data;
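
  A minimal sketch for case (1), pushing yesterday's log file to HDFS with plain shell commands; all paths and the file-naming scheme are assumptions for illustration:

```bash
#!/bin/bash
# Upload yesterday's access log to HDFS (hypothetical paths and file names).
yesterday=$(date -d '-1 day' +%Y_%m_%d)
hdfs dfs -mkdir -p /project/techbbs/data
hdfs dfs -put /usr/local/logs/access_${yesterday}.log /project/techbbs/data/
```

  In practice a script like this would be scheduled with cron (or a similar scheduler) so the upload runs once per day.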

3.2 Data Cleaning

  Use MapReduce to clean the raw data in HDFS so that it can be statistically analyzed afterwards;
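
  The cleaning job itself is covered in a later post; as a placeholder, launching such a job could look like this, where the jar, class, and paths are hypothetical:

```bash
# Run a (hypothetical) MapReduce job that parses each raw line, drops static
# resource requests, and writes tab-separated ip/time/url records.
hadoop jar logclean.jar com.example.LogCleanJob \
  /project/techbbs/data/access_2013_05_30.log \
  /project/techbbs/cleaned/2013_05_30
```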

3.3 Statistical Analysis

  Use Hive to run the statistical analysis on the cleaned data;
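
  One way to wire this up, sketched under the same assumed paths and names: expose the cleaned output as an external partitioned Hive table, against which the KPI queries from Section II can run.

```bash
# Create an external table over the cleaned data and register one day's
# partition; table name, columns, and locations are illustrative assumptions.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS techbbs (ip STRING, atime STRING, url STRING)
PARTITIONED BY (logdate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/project/techbbs/cleaned';
ALTER TABLE techbbs ADD PARTITION (logdate='2013_05_30')
LOCATION '/project/techbbs/cleaned/2013_05_30';
"
```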

3.4 Export Analysis Results to MySQL

  Use Sqoop to export the statistical results produced by Hive into MySQL;
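
  A sketch of such an export; the JDBC URL, credentials, table name, and warehouse path are assumptions for illustration:

```bash
# Export one Hive result table from its warehouse directory into MySQL.
sqoop export \
  --connect jdbc:mysql://localhost:3306/techbbs \
  --username root -P \
  --table daily_kpi \
  --fields-terminated-by '\t' \
  --export-dir /user/hive/warehouse/daily_kpi
```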

3.5 Provide a Viewing Tool

  Provide a viewing tool for users: KPI queries go to MySQL, while detail queries go to HBase;

IV. Table Schema Design

4.1 MySQL Table Schema

  MySQL is used here to store the statistical results for the key indicators.
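
  A possible shape for the results table, with one row per statistics day and one column per KPI from Section II; all names and types are assumptions:

```bash
# Create a hypothetical daily KPI results table in MySQL.
mysql -u root -p -e "
CREATE TABLE IF NOT EXISTS techbbs.daily_kpi (
  logdate     VARCHAR(10) PRIMARY KEY,  -- statistics date, e.g. 2013_05_30
  pv          BIGINT,                   -- page views
  reguser     BIGINT,                   -- newly registered users
  ip_count    BIGINT,                   -- independent IPs
  jumper_rate DOUBLE                    -- bounce rate
);
"
```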

4.2 HBase Table Schema

  HBase is used here to store the detailed logs, so that they can be queried by IP and time.
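
  One common way to support "query by IP and time" is to encode both into the row key. A sketch in the HBase shell, with the table name, column family, and row-key scheme all as assumptions:

```bash
# Create the detail table with a single column family; row keys of the form
# <ip>_<yyyyMMddHHmmss> keep one IP's records together, sorted by time.
echo "create 'techbbs_detail', 'cf'" | hbase shell
```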

  The concrete hands-on work starts in the following posts; this one, as the introduction, ends here!


Origin: blog.csdn.net/qq_35281775/article/details/52684602