Big Data Project in Practice --- KPI (Key Performance Indicator) Project for the Online Group-Buying Business of an E-commerce Website

I. Terminology
--------------------------------------------------------------
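    1.KPI: Key Performance Indicator.

    2.PV: page views, the number of pages viewed
        100000 --> 1 GB of logs            //typical site (per day)
        10 GB of logs                      //large site (per hour)
        10 GB x 10 x 30 = 3000 GB = 3 TB   //9T / 1T

    3.UV: unique visitors, the number of distinct IP addresses that visit a site or click a given news item.

    4.PR: PageRank, a page's ranking score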

    5.Log format
      222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
      It breaks down into the following 8 variables
        remote_addr: the client's IP address, 222.68.172.190
        remote_user: the client's user name, -
        time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]
        request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"
        status: the response status (200 means success), 200
        body_bytes_sent: the size of the response body sent to the client, 19939
        http_referer: the page from which the visitor followed a link, "http://www.angularjs.cn/A00n"
        http_user_agent: information about the client's browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
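
    As an illustration of the 8-variable breakdown, here is a minimal sketch that
    splits one log line with a regular expression. The class name AccessLogFields
    and the regex are assumptions made for this example, not part of the project
    code, and the regex assumes that quoted fields never contain a double quote.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogFields {
    // ip, ident (ignored), user, [time], "request", status, bytes, "referer", "agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the 8 named fields in order, or null if the line does not match. */
    public static Map<String, String> parse(String line) {
        if (line == null) {
            return null;
        }
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("remote_addr", m.group(1));
        fields.put("remote_user", m.group(2));
        fields.put("time_local", m.group(3));
        fields.put("request", m.group(4));
        fields.put("status", m.group(5));
        fields.put("body_bytes_sent", m.group(6));
        fields.put("http_referer", m.group(7));
        fields.put("http_user_agent", m.group(8));
        return fields;
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
            + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
            + "\"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1)\"";
        // Print one "name = value" pair per variable
        parse(line).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

    Returning null for a non-matching line lets the caller skip malformed records
    instead of counting them.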


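II. Case Introduction
--------------------------------------------------------------
    An e-commerce website runs an online group-buying business. It sees 1,000,000 PV and
    50,000 distinct IPs per day. Traffic usually peaks on workdays between 10:00-12:00 and
    15:00-18:00. During the day, visits come mainly from PC browsers; on days off and at
    night, visits from mobile devices are more common. Site search traffic accounts for 80%
    of the whole site's traffic. Fewer than 1% of PC users make a purchase, while 5% of
    mobile users do.
    From this short description we can get a rough picture of how the business is doing,
    and see where paying users come from, which potential users can still be tapped,
    whether the site is at risk of shutting down, and so on.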


III. KPI Design
--------------------------------------------------------------
      PV(PageView): page view count
      IP: count of distinct IPs per page
      Time: PV per hour
      Source: count by the user's referring domain
      Browser: count by the user's access device
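
    The Source metric groups visits by referring domain, while the log only stores
    the full referer URL. A minimal sketch of reducing a referer to its domain is
    shown here. The class name RefererDomain and the choice of "-" as the bucket for
    missing or unparsable referers are assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class RefererDomain {
    /** Returns the lower-cased host of a referer URL, or "-" if it is missing or unparsable. */
    public static String domainOf(String referer) {
        if (referer == null || referer.isEmpty() || "-".equals(referer)) {
            return "-";
        }
        try {
            String host = new URI(referer).getHost();
            return host == null ? "-" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            // Not a valid URI (for example, it contains spaces)
            return "-";
        }
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://www.angularjs.cn/A00n"));
        System.out.println(domainOf("-"));
    }
}
```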


IV. Algorithm Model: Hadoop Parallel Algorithm
-------------------------------------------------------------
    Design of the parallel algorithm:
    Note: use the 8 variables defined in Section I
    PV(PageView):  page view count
      Map: {key:$request,value:1}
      Reduce: {key:$request,value:sum}
    IP:  count of distinct IPs per page
      Map: {key:$request,value:$remote_addr}
      Reduce: {key:$request,value:deduplicate, then count (count(unique))}
    Time:  PV per hour
      Map: {key:$time_local truncated to the hour,value:1}
      Reduce: {key:$time_local truncated to the hour,value:sum}
    Source:  count by the user's referring domain
      Map: {key:$http_referer,value:1}
      Reduce: {key:$http_referer,value:sum}
    Browser:  count by the user's access device
      Map: {key:$http_user_agent,value:1}
      Reduce: {key:$http_user_agent,value:sum}
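
    To make the Map/Reduce pairs above concrete, here is a minimal in-memory sketch
    of the PV and IP metrics using plain Java collections instead of Hadoop. The
    class name KpiModelSketch and the sample records are made up for illustration;
    each record is a two-element array of {request, remote_addr}.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class KpiModelSketch {
    /** PV: map emits (request, 1); reduce sums the ones for each request. */
    public static Map<String, Integer> pageViews(List<String[]> hits) {
        Map<String, Integer> out = new TreeMap<>();
        for (String[] hit : hits) {
            out.merge(hit[0], 1, Integer::sum);
        }
        return out;
    }

    /** IP: map emits (request, remote_addr); reduce deduplicates, then counts. */
    public static Map<String, Integer> uniqueIps(List<String[]> hits) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] hit : hits) {
            grouped.computeIfAbsent(hit[0], k -> new HashSet<>()).add(hit[1]);
        }
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().size());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> hits = new ArrayList<>();
        hits.add(new String[] {"/a", "1.1.1.1"});
        hits.add(new String[] {"/a", "1.1.1.1"});
        hits.add(new String[] {"/a", "2.2.2.2"});
        hits.add(new String[] {"/b", "1.1.1.1"});
        System.out.println("PV: " + pageViews(hits));
        System.out.println("IP: " + uniqueIps(hits));
    }
}
```

    The Time, Source and Browser metrics follow the same shape as pageViews, only
    with a different field used as the key.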


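V. Architecture Design -- Log KPI System Architecture
-------------------------------------------------------------
    1. Logs are produced by the business system. We can configure the web server to create a new
    directory every day; each directory holds multiple log files of 64 MB each.
    2. Set up a CRON timer that, after midnight, imports yesterday's log files into HDFS.
    3. Once the import finishes, a timer starts the MapReduce program to extract and compute the metrics.
    4. Once the computation finishes, a timer exports the metric data from HDFS to a database, for fast queries later.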
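
    Step 2 depends on knowing which directory holds yesterday's logs. A minimal
    sketch of that date arithmetic follows. The class name DailyLogPaths and the
    directory layout /var/log/web/yyyy-MM-dd are hypothetical; the original does not
    say how the daily directories are named.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DailyLogPaths {
    // Hypothetical root directory under which the web server creates one directory per day
    private static final String LOCAL_ROOT = "/var/log/web/";

    /** Returns the local directory that holds the logs of the day before runDate. */
    public static String yesterdayDir(LocalDate runDate) {
        LocalDate yesterday = runDate.minusDays(1);
        // ISO_LOCAL_DATE formats a date as yyyy-MM-dd
        return LOCAL_ROOT + DateTimeFormatter.ISO_LOCAL_DATE.format(yesterday);
    }

    public static void main(String[] args) {
        // A job started shortly after midnight would pass today's date
        System.out.println(yesterdayDir(LocalDate.now()));
    }
}
```

    Passing the run date as a parameter, rather than reading the clock inside the
    method, makes it easy to re-run the import for an earlier day.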



VI. Starting the Project
--------------------------------------------------------------
    1.Create a new project named KPI

    2.Create a new module named kpi and add Maven support
        <?xml version="1.0" encoding="UTF-8"?>
        <project xmlns="http://maven.apache.org/POM/4.0.0"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
            <modelVersion>4.0.0</modelVersion>

            <groupId>test.com</groupId>
            <artifactId>kpi</artifactId>
            <version>1.0-SNAPSHOT</version>
            <dependencies>
                <dependency>
                    <groupId>junit</groupId>
                    <artifactId>junit</artifactId>
                    <version>3.8.1</version>
                    <scope>test</scope>
                </dependency>
                <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-client</artifactId>
                    <version>2.7.2</version>
                </dependency>
                <dependency>
                    <groupId>org.apache.hadoop</groupId>
                    <artifactId>hadoop-hdfs</artifactId>
                    <version>2.7.2</version>
                </dependency>
            </dependencies>

        </project>

    3.Analyze the log and abstract one record into the class KPI.java
        package model;

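        /**
         * One parsed access-log record
         */
        public class KPI {
           private String ip ;         //remote_addr
           private String time ;       //time_local, truncated to the hour
           private String agent ;      //http_user_agent
           private String url ;        //url taken from request
           private String userFrom ;   //http_referer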
           public String getIp() {
              return ip;
           }
           public void setIp(String ip) {
              this.ip = ip;
           }
           public String getTime() {
              return time;
           }
           public void setTime(String time) {
              this.time = time;
           }
           public String getAgent() {
              return agent;
           }
           public void setAgent(String agent) {
              this.agent = agent;
           }
           public String getUrl() {
              return url;
           }
           public void setUrl(String url) {
              this.url = url;
           }
           public String getUserFrom() {
              return userFrom;
           }
           public void setUserFrom(String userFrom) {
              this.userFrom = userFrom;
           }
        }


    4.Write a parsing utility class, LogParser.java, that parses each log line into a KPI instance
        package util;

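        import java.text.ParseException;
        import java.text.SimpleDateFormat;
        import java.util.Date;
        import java.util.Locale;

        import model.KPI;

        /**
         * Log parsing utility class
         */
        public class LogParser {

           /**
            * Parse one log line into a KPI; returns null if the line is malformed
            */
           public static KPI kpiParse(String log){
              if(log == null || log.length() == 0){
                 return null ;
              }
              String[] arr = log.split("\"");
              //validity check: a well-formed line splits into exactly 6 parts
              if(arr.length != 6){
                 return null ;
              }
              //"GET /images/my.jpg HTTP/1.1" -> method, url, protocol
              String[] req = arr[1].split(" ");
              //ip - - [time +zone
              String[] a = arr[0].split(" ");
              if(req.length < 2 || a.length < 4 || a[3].length() < 2){
                 return null ;
              }
              //time, truncated to the hour; drop the leading '['
              String time = dateParse(a[3].substring(1));
              if(time == null){
                 return null ;
              }
              KPI kpi = new KPI();
              //agent: keep at most the first 11 characters, e.g. "Mozilla/5.0"
              String agent = arr[5].trim();
              kpi.setAgent(agent.length() > 11 ? agent.substring(0, 11) : agent);
              //from
              kpi.setUserFrom(arr[3].trim());
              //url
              kpi.setUrl(req[1]);
              //ip
              kpi.setIp(a[0]);
              //time
              kpi.setTime(time);
              return kpi ;
           }

           /**
            * Parse a time string such as 18/Sep/2013:06:49:57 and return
            * a new string truncated to the hour, e.g. 2013-09-18 06
            */
           public static String dateParse(String src){
              //MMM with Locale.ENGLISH matches "Sep"; HH is the 24-hour clock
              SimpleDateFormat df = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
              try {
                 Date date = df.parse(src);
                 df.applyPattern("yyyy-MM-dd HH");
                 return df.format(date);
              } catch (ParseException e) {
                 e.printStackTrace();
              }
              return null ;
           }
        }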
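
    The timestamp in the sample log line, 18/Sep/2013:06:49:57, uses an abbreviated
    English month name and a 24-hour clock. The standalone sketch below shows a
    SimpleDateFormat pattern pair that handles both: MMM with Locale.ENGLISH for the
    month, and HH rather than hh so that 18:00 is not printed as 06. The class name
    HourBucket is an assumption made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class HourBucket {
    /** Converts "18/Sep/2013:06:49:57" to "2013-09-18 06"; returns null if src cannot be parsed. */
    public static String toHour(String src) {
        // MMM + Locale.ENGLISH parses month names such as "Sep"; HH is the hour of day, 0-23
        SimpleDateFormat in = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd HH", Locale.ENGLISH);
        try {
            Date date = in.parse(src);
            return out.format(date);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toHour("18/Sep/2013:06:49:57"));
        System.out.println(toHour("18/Sep/2013:18:05:00"));
    }
}
```

    Both formatters use the JVM's default time zone, so the hour printed is the
    hour written in the log line; the +0000 offset is not taken into account.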


    5.Write the Mapper class KPIMapper.java
        package mapreduce;

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        import model.KPI;
        import util.LogParser;

        /**
         * Mapper
         * PV  :url-> 1
         * UV  :url->ip
         * Time    :time->1
         * Browser:agent->1
         */
        public class KPIMapper extends Mapper<LongWritable, Text, Text, Text> {

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                //get one line of log text
                String log = value.toString();
                //parse it into a KPI object
                KPI kpi = LogParser.kpiParse(log);

                //skip lines that could not be parsed
                if(kpi != null && kpi.getUrl() != null){
                    String url = kpi.getUrl();
                    String ip = kpi.getIp();
                    String time = kpi.getTime();
                    String agent = kpi.getAgent();
                    //PV
                    context.write(new Text("PV:" + url),new Text("1"));
                    //UV
                    context.write(new Text("UV:" + url),new Text(ip));
                    //Time
                    context.write(new Text("TIME:" + time),new Text("1"));
                    //Browser
                    context.write(new Text("BROWSER:" + agent),new Text("1"));
                    //also track the browser in a job counter
                    context.getCounter("m1", "BROWSER:" + agent).increment(1);
                }
            }
        }


    6.Write the Reducer class KPIReducer.java
        package mapreduce;

        import java.io.IOException;
        import java.util.HashSet;
        import java.util.Set;

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        /**
         * KPIReducer
         */
        public class KPIReducer extends Reducer<Text, Text,Text, Text> {
            /**
             * Input keys carry the prefix written by the mapper, for example:
             * PV:/images/my.jpg
             * UV:/images/my.jpg
             * TIME:2013-09-18 06
             * BROWSER:Mozilla/5.0
             */
            @Override
            protected void reduce(Text key0, Iterable<Text> valueIn, Context context)
                    throws IOException, InterruptedException {

                //extract the key string
                String key = key0.toString();

                //PV, TIME and BROWSER: every value is "1", so count the values
                if(key.startsWith("PV:") || key.startsWith("TIME:") || key.startsWith("BROWSER:")){
                    int count = 0 ;
                    for(Text v : valueIn){
                        count ++ ;
                    }
                    context.write(key0, new Text(count + ""));
                    //also track each distinct browser key in a job counter
                    if(key.startsWith("BROWSER:")){
                        context.getCounter("r1",key0.toString()).increment(1);
                    }
                }
                //UV: deduplicate the IPs, then count them
                else if(key.startsWith("UV:")){
                    Set<String> ips = new HashSet<String>();
                    for(Text t : valueIn){
                        ips.add(t.toString());
                    }
                    context.write(key0, new Text(ips.size() + ""));
                }
            }
        }


    7.Write the driver class App.java
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        import mapreduce.KPIMapper;
        import mapreduce.KPIReducer;

        /**
         * Job driver.
         * args[0] is the input path, args[1] is the output path.
         */
        public class App {
           public static void main(String[] args) throws Exception {
              Job job = Job.getInstance();
              job.setJarByClass(KPIMapper.class);
              job.setJobName("KPI App");
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              job.setMapperClass(KPIMapper.class);
              job.setReducerClass(KPIReducer.class);

              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(Text.class);

              System.exit(job.waitForCompletion(true) ? 0 : 1);
           }
        }