文件清洗

题目:

Result文件数据说明:

Ip:106.39.41.166,(城市)

Date:10/Nov/2016:00:01:02 +0800,(日期)

Day:10,(天数)

Traffic: 54 ,(流量)

Type: video,(类型:视频video或文章article)

Id: 8701(视频或者文章的id)

测试要求:

1、 数据清洗:按照进行数据清洗,并将清洗后的数据导入hive数据库中。

两阶段数据清洗:

(1)第一阶段:把需要的信息从原始日志中提取出来

ip:    199.30.25.88

time:  10/Nov/2016:00:01:03 +0800

traffic:  62

文章: article/11325

视频: video/3235

1 2 4 5 6

(2)第二阶段:根据提取出来的信息做精细化操作

ip--->城市 city(IP)

date--> time:2016-11-10 00:01:03

day: 10

traffic:62

type:article/video

id:11325

(3)hive数据库表结构:

create table data(  ip string,  time string , day string, traffic bigint,

type string, id   string )

2、数据处理:

·统计最受欢迎的视频/文章的Top10访问次数 (video/article)

·按照地市统计最受欢迎的Top10课程 (ip)

·按照流量统计最受欢迎的Top10课程 (traffic)

3、数据可视化:将统计结果倒入MySql数据库中,通过图形化展示的方式展现出来。

完成情况:

目前完成了第一步

数据清洗代码:

package keshang11t13;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class qingxi1{
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Job job = Job.getInstance();
job.setJobName("WordCount1");
job.setJarByClass(qingxi1.class);
job.setMapperClass(doMapper.class);
job.setReducerClass(doReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
Path in = new Path("hdfs://localhost:9000/user/hadoop/input/result");
Path out = new Path("hdfs://localhost:9000/user/hadoop/output2");
FileInputFormat.addInputPath(job, in);
FileOutputFormat.setOutputPath(job, out);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class doMapper extends Mapper<Object, Text, Text,  NullWritable>{
public static Text word = new Text();


public static final SimpleDateFormat FORMAT = new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH); //原时间格式

public static final SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyy-MM-dd-HH:mm:ss");//现时间格式

private  static Date parseDateFormat(String string) {         //转换时间格式

   Date parse = null;

   try {

       parse = FORMAT.parse(string);

   } catch (Exception e) {

       e.printStackTrace();

   }

   return parse;

}

private  static  String parseTime(String line) {    //时间

    final int first = line.indexOf("");

    final int last = line.indexOf(" +0800");

    String time = line.substring(first + 1, last).trim();

    Date date = parseDateFormat(time);

    return dateformat1.format(date);

}
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
	String line = value.toString();
	String arr[] = line.split(",");
	System.out.println(arr[1]);
	 arr[1] = parseTime(arr[1]);   //时间
		System.out.println(arr[1]);
word.set(arr[0]+" "+arr[1]+" "+arr[2]+" "+arr[3]+" "+arr[4]+" "+arr[5]);
context.write(word,  NullWritable.get());
}	
}	
public static class doReducer extends Reducer<Text, NullWritable, Text,  NullWritable>{
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context)
throws IOException, InterruptedException {
context.write(key, NullWritable.get());
}
}
}

  在hive中建表并导入数据:

猜你喜欢

转载自www.cnblogs.com/xuange1/p/11853807.html