Writing a MapReduce Program (a Simple Called-Number Analysis Program)

Since Hadoop 2.2.0 does not yet have a well-working Eclipse plugin, the current workflow is to write the code in Eclipse and then move it to the Hadoop environment for execution.

Preparation:

1. Set up the Hadoop environment, create a project, and add all of Hadoop's jar packages to the project's Build Path;

2. Construct the data set: each line consists of two numbers, the calling number and the called number. Generate random test data and put the resulting file into HDFS;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Random;

public class GenerateTestData {

	public static void writeToFile(String fileName) throws Exception {
		OutputStream out = new FileOutputStream(new File(fileName));
		BufferedOutputStream bo = new BufferedOutputStream(out);
		Random rd = new Random();

		try {
			for (int i = 0; i < 10000; i++) {
				StringBuilder sb = new StringBuilder();
				// Calling number: '1' followed by 8 random digits
				sb.append(1);
				for (int j = 1; j < 9; j++) {
					sb.append(rd.nextInt(10));
				}
				sb.append(" ");
				// Called number: usually a well-known service number,
				// otherwise another random 9-digit number
				switch (rd.nextInt(10)) {
				case 1:
					sb.append("10086");
					break;
				case 2:
					sb.append("110");
					break;
				case 3:
					sb.append("120");
					break;
				case 4:
					sb.append("119");
					break;
				case 5:
					sb.append("114");
					break;
				case 6:
					sb.append("17951");
					break;
				case 7:
					sb.append("10010");
					break;
				case 8:
					sb.append("13323567897");
					break;
				default:
					sb.append(1);
					for (int j = 1; j < 9; j++) {
						sb.append(rd.nextInt(10));
					}
					break;
				}
				sb.append("\r\n");
				bo.write(sb.toString().getBytes());
			}
		} finally {
			// close() flushes the buffer; without it the tail of the file may be lost
			bo.close();
		}
	}

	public static void main(String[] args) {
		try {
			writeToFile("d://helloa.txt");
			System.out.println("finish!");
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
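For reference, the generated file contains one call record per line: a 9-digit calling number starting with 1, a space, and the called number. A few illustrative (made-up) lines:

```
112345678 10086
187654321 110
156789012 134567890
```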

The MapReduce program is as follows. It was written with reference to Hadoop: The Definitive Guide; note that it uses the newer org.apache.hadoop.mapreduce API (the older API lives in org.apache.hadoop.mapred):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FirstTest extends Configured implements Tool {

	// Counts input lines that could not be parsed
	enum Counter {
		LINESKIP,
	}

	public static class Map extends Mapper<LongWritable, Text, Text, Text> {
		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			try {
				// Input line: "caller callee"; emit (callee, caller) so the
				// reducer groups all callers of the same number together
				String[] arr = line.split(" ");
				context.write(new Text(arr[1]), new Text(arr[0]));
			} catch (Exception e) {
				// Malformed line: skip it and bump the counter
				context.getCounter(Counter.LINESKIP).increment(1);
			}
		}
	}

	public static class Reduce extends Reducer<Text, Text, Text, Text> {
		@Override
		public void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Concatenate all callers of this number, separated by '|'
			StringBuilder out = new StringBuilder();
			for (Text t : values) {
				out.append(t.toString()).append("|");
			}
			context.write(key, new Text(out.toString()));
		}
	}

	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();

		// Job.getInstance replaces the Job(conf, name) constructor,
		// which is deprecated in Hadoop 2.x
		Job job = Job.getInstance(conf, "First Map-Reduce Program");
		job.setJarByClass(getClass());

		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.setMapperClass(Map.class);
		job.setReducerClass(Reduce.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		job.waitForCompletion(true);
		return job.isSuccessful() ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		int exitCode = ToolRunner.run(new Configuration(), new FirstTest(), args);
		System.exit(exitCode);
	}
}
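To see what the job computes without a cluster, the map and reduce steps above can be sketched in plain Java: group callers by called number, just as the Mapper swaps the two fields and the Reducer concatenates the values with "|". The sample lines below are made up; a TreeMap stands in for MapReduce's sorted grouping by key:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalSimulation {
	public static void main(String[] args) {
		String[] lines = {
			"112345678 10086",
			"187654321 10086",
			"156789012 110"
		};
		// "Map + shuffle": emit (callee, caller) and group by callee;
		// TreeMap mimics MapReduce's sorted keys
		Map<String, StringBuilder> grouped = new TreeMap<>();
		for (String line : lines) {
			String[] arr = line.split(" ");
			grouped.computeIfAbsent(arr[1], k -> new StringBuilder())
			       .append(arr[0]).append("|");
		}
		// "Reduce" output: callee TAB caller1|caller2|...
		for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
			System.out.println(e.getKey() + "\t" + e.getValue());
		}
	}
}
```

For the three sample lines this prints `10086` followed by its two callers joined with "|", and `110` with its single caller, which matches the format of the job's part-r-00000 output.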


After compiling and building the jar file on Linux, run it in the Hadoop environment:
hadoop jar FirstTest.jar /input/helloa.txt /output

Problems encountered:

1. Because the program was written in Eclipse with a package declaration, but was packaged on Linux directly with a command like jar cvfm abc.jar .., Hadoop kept reporting that the main class could not be found when running the jar;
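For example, if the class were declared in a hypothetical package com.example, the manifest file passed to jar cvfm would need the fully qualified class name (and the file must end with a newline):

```
Main-Class: com.example.FirstTest
```

Alternatively, hadoop jar accepts the main class as a command-line argument (hadoop jar FirstTest.jar FirstTest /input/helloa.txt /output), which sidesteps the manifest entirely.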

2. When compiling on Linux, FirstTest.java was placed under HADOOP_CLASSPATH. Running hadoop jar FirstTest.jar /input/helloa.txt /output from that directory reported that the FirstTest$Map class could not be found; after moving the generated FirstTest.jar to a different directory, it ran normally.


Reposted from blog.csdn.net/wendll/article/details/21415031