I. Data Analysis

1.1 Taking stock data

　　The forum data has two parts:

　　(1) historical data about 56GB, statistics to 2012-05-29. This also shows that, prior to 2012-05-29, and the log file in a file inside, using the additional written way.

　　(2) Since 2013-05-30, generates a data file per day, about 150MB. This also shows that, from after 2013-05-30, the log file is no longer in a file inside.

　　Figure 1 shows a recording format of the log data, wherein each line is recorded 5 parts: the IP visitors, access time, access to the resource, the access status (HTTP status code), this traffic.

log

Logging data format of FIG. 1

　　The use of data from the two 2013 log files, respectively access_2013_05_30.log and access_2013_05_31.log, download address is: http://pan.baidu.com/s/1pJE7XR9

1.2 To clean data

　　(1) According to an analysis of key indicators before, we have to statistical analysis are not related to the state visit (HTTP status code) and the flow of this visit, so we can first of these two records clean out;

　　(2) the data format of the log records, we need to convert the date format to a common format, such as 20,150,426 usually see this, so we can write a class logging the date conversion;

　　(3) Since the static resource access request does not make sense for our data analysis, so we can "GET / staticsource /" at the beginning of access to records filtered out, and because the GET and POST string for us there is no sense, therefore, be which was omitted;

Second, the cleaning process data

2.1 regular uploads logs to the HDFS

　　First, upload the log data to HDFS for processing can be divided into the following situations:

　　(1) If the log server data smaller, less pressure may be used to upload the data directly to the command shell in HDFS;

　　(2) If the log server data is large, pressure, uploading data using NFS server on another;

　　（3）如果日志服务器非常多、数据量大，使用flume进行数据处理；

　　这里我们的实验数据文件较小，因此直接采用第一种Shell命令方式。又因为日志文件时每天产生的，因此需要设置一个定时任务，在第二天的1点钟自动将前一天产生的log文件上传到HDFS的指定目录中。所以，我们通过shell脚本结合crontab创建一个定时任务techbbs_core.sh，内容如下：

#!/bin/sh

#step1.get yesterday format string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)
#step2.upload logs to hdfs
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

　　结合crontab设置为每天1点钟自动执行的定期任务：crontab -e，内容如下（其中1代表每天1:00，techbbs_core.sh为要执行的脚本文件）：

* 1 * * * techbbs_core.sh

　　验证方式：通过命令 crontab -l 可以查看已经设置的定时任务

2.2 编写MapReduce程序清理日志

　　（1）编写日志解析类对每行记录的五个组成部分进行单独的解析

    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");/**
         * 解析英文时间字符串
         * 
         * @param string
         * @return
         * @throws ParseException
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * 解析日志的行记录
         * 
         * @param line
         * @return 数组含有5个元素，分别是ip、时间、url、状态、流量
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);

            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }

　　（2）编写MapReduce程序对指定日志文件的所有记录进行过滤

　　Mapper类：

        static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // step1.过滤掉静态资源访问请求
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step2.过滤掉开头的指定字符串
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step3.过滤掉结尾的特定字符串
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step4.只写入前三个记录类型项
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

　　Reducer类：

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        };
    }

　　（3）LogCleanJob.java的完整示例代码

 
    View Code

　　（4）导出jar包，并将其上传至Linux服务器指定目录中

2.3 定期清理日志至HDFS

　　这里我们改写刚刚的定时任务脚本，将自动执行清理的MapReduce程序加入脚本中，内容如下：

#!/bin/sh

#step1.get yesterday format string
yesterday=$(date --date='1 days ago' +%Y_%m_%d)
#step2.upload logs to hdfs
hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data
#step3.clean log data
hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}

　　这段脚本的意思就在于每天1点将日志文件上传到HDFS后，执行数据清理程序对已存入HDFS的日志文件进行过滤，并将过滤后的数据存入cleaned目录下。

2.4 定时任务测试

　　（1）因为两个日志文件是2013年的，因此这里将其名称改为2015年当天以及前一天的，以便这里能够测试通过。

　　（2）执行命令：techbbs_core.sh 2014_04_26

　　控制台的输出信息如下所示，可以看到过滤后的记录减少了很多：

15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
15/04/26 04:27:23 INFO mapred.JobClient: map 0% reduce 0%
15/04/26 04:28:01 INFO mapred.JobClient: map 29% reduce 0%
15/04/26 04:28:07 INFO mapred.JobClient: map 42% reduce 0%
15/04/26 04:28:10 INFO mapred.JobClient: map 57% reduce 0%
15/04/26 04:28:13 INFO mapred.JobClient: map 74% reduce 0%
15/04/26 04:28:16 INFO mapred.JobClient: map 89% reduce 0%
15/04/26 04:28:19 INFO mapred.JobClient: map 100% reduce 0%
15/04/26 04:28:49 INFO mapred.JobClient: map 100% reduce 100%
15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
15/04/26 04:28:50 INFO mapred.JobClient: Job Counters
15/04/26 04:28:50 INFO mapred.JobClient: Launched reduce tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=58296
15/04/26 04:28:50 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient: Launched map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: Data-local map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25238
15/04/26 04:28:50 INFO mapred.JobClient: File Output Format Counters
15/04/26 04:28:50 INFO mapred.JobClient: Bytes Written=12794925
15/04/26 04:28:50 INFO mapred.JobClient: FileSystemCounters
15/04/26 04:28:50 INFO mapred.JobClient: FILE_BYTES_READ=14503530
15/04/26 04:28:50 INFO mapred.JobClient: HDFS_BYTES_READ=61084325
15/04/26 04:28:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=29111500
15/04/26 04:28:50 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=12794925
15/04/26 04:28:50 INFO mapred.JobClient: File Input Format Counters
15/04/26 04:28:50 INFO mapred.JobClient: Bytes Read=61084192
15/04/26 04:28:50 INFO mapred.JobClient: Map-Reduce Framework
15/04/26 04:28:50 INFO mapred.JobClient: Map output materialized bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient: Map input records=548160
15/04/26 04:28:50 INFO mapred.JobClient: Reduce shuffle bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient: Spilled Records=339714
15/04/26 04:28:50 INFO mapred.JobClient: Map output bytes=14158741
15/04/26 04:28:50 INFO mapred.JobClient: CPU time spent (ms)=21200
15/04/26 04:28:50 INFO mapred.JobClient: Total committed heap usage (bytes)=229003264
15/04/26 04:28:50 INFO mapred.JobClient: Combine input records=0
15/04/26 04:28:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=133
15/04/26 04:28:50 INFO mapred.JobClient: Reduce input records=169857
15/04/26 04:28:50 INFO mapred.JobClient: Reduce input groups=169857
15/04/26 04:28:50 INFO mapred.JobClient: Combine output records=0
15/04/26 04:28:50 INFO mapred.JobClient: Physical memory (bytes) snapshot=154001408
15/04/26 04:28:50 INFO mapred.JobClient: Reduce output records=169857
15/04/26 04:28:50 INFO mapred.JobClient: Virtual memory (bytes) snapshot=689442816
15/04/26 04:28:50 INFO mapred.JobClient: Map output records=169857
Clean process success!

　　(3) interface to view log data in HDFS through the Web:

　　Log data stored unfiltered: / project / techbbs / data /

　　Stored log data filtered: / project / techbbs / cleaned /

Hadoop study notes -20. Web site log analysis project cases (b) data cleaning