【甘道夫】通过Mahout构建贝叶斯文本分类器案例详解

转载于：https://blog.csdn.net/u010967382/article/details/25368795

背景&目标：

1、sport.tar 是体育类的文章，一共有10个类别；

用这些原始材料构造一个体育类的文本分类器，并测试对比bayes和cbayes的效果；

记录分类器的构造过程和测试结果。

2、user-sport.tar 是用户浏览的文章，每个文件夹对应一个用户；

利用上题构造的文本分类器，计算每个用户浏览各类文章的占比；

记录计算过程和结果。

实验环境：

Hadoop-1.2.1

Mahout0.6

Pig0.12.1

Ubuntu12

Jdk1.7

原理&流程

建立文本分类器：

1.分类体系的确定

2.文本样本的积累

3.文本的预处理（分词）

4.划分训练集，测试集

5.对模型的训练

6.对模型准确性测试

测试分类器模型时，如果觉得模型效果不够满意，可以对过程进行调整，然后重新生成模型。

可调整的方面包括：

积累更多，更有具代表性的样本；
在文本预处理阶段选择更好的分词算法；
在训练分类器时，对训练参数进行调整。

不断重复以上过程，直到得到满意的模型为止。

对文本进行分类：

建立完文本分类器以后，就可以输入一个文本，输出一个分类。

Step1：将所需用到的原始数据sport和user-sport文件夹上传到hdfs

sport文件夹：

用于训练文本分类器
包含了多个子文件夹，每个子文件夹都是一个分类的文章
在现实项目中，该原始数据需要人工收集

user-sport：

待分类的文本

注意：user-sport文件夹下的子文件夹名称是用户id，子文件夹内包含了多个文本文件，都是该用户浏览过的文章。

step2：对sport文件夹进行分词

用到 MRTokenize.jar中的 tokenize.TokenizeDriver

到此为止，原始数据已经分好词，并且已经处理成Mahout训练文本分类器要求的输入格式：

每行一篇文章
每行的格式为：分类名称文章分词结果

Step3：划分训练集和测试集

我们把经过分词处理的原始数据划分为训练集和测试集，训练集用于训练模型，测试集用于测试模型效果。

该过程通过pig实现：

grunt> processed = load'/dataguru/hadoopdev/week8/fenciout/part-r-00000' as (category:chararray,doc:chararray);

grunt> test = sample processed 0.2;

grunt> jnt = join processed by (category,doc) left outer, test by (category,doc);

grunt> filt_test = filter jnt by test::category is null;

grunt> train = foreach filt_test generate processed::category as category,processed::doc as doc;

grunt> store test into '/dataguru/hadoopdev/week8/test';

grunt> store train into '/dataguru/hadoopdev/week8/train';

结果截图：

Step4：训练贝叶斯模型

我们分别训练bayes模型和cbayes模型，后面测试两者的效果做对比。

首先训练bayes模型：

casliyang@singlehadoop:~$ mahout trainclassifier -i /dataguru/hadoopdev/week8/train -o /dataguru/hadoopdev/week8/model-bayes -type bayes -ng 1 -source hdfs

然后训练cbayes模型：

casliyang@singlehadoop:~$ mahout trainclassifier -i /dataguru/hadoopdev/week8/train -o /dataguru/hadoopdev/week8/model- cbayes -type cbayes -ng 1 -source hdfs

训练结果：

Step5：测试模型

测试贝叶斯模型命令如下：

casliyang@singlehadoop:~$ mahout testclassifier -d /dataguru/hadoopdev/week8/test -m /dataguru/hadoopdev/week8/model-bayes -type bayes -ng 1 -source hdfs -method mapreduce

测试结果：

测试C贝叶斯模型命令如下：

casliyang@singlehadoop:~$ mahout testclassifier -d /dataguru/hadoopdev/week8/test -m /dataguru/hadoopdev/week8/model-cbayes -type cbayes -ng 1 -source hdfs -method mapreduce

测试结果：

Step5：处理待分类数据

我们的待分类数据全存储在user-sport文件夹下，每个子文件夹都存储了一个用户浏览过的文章，子文件夹的名称就是用户id：

Mahout的文本分类器要求输入数据为分词后的文章，我们直接使用训练分类器时用到的 MRTokenize.jar 中的 tokenize.TokenizeDriver 来对文章进行分词，输出格式为：

每行一篇文章
每行的格式为：用户ID 文章分词结果

执行命令对待分类数据进行分词：

casliyang@singlehadoop:~/Myfiles$ hadoop jar MRTokenize.jar tokenize.TokenizeDriver /dataguru/hadoopdev/week8/user-sport /dataguru/hadoopdev/week8/user-sport-fenciout

结果：

Step6：Hadoop环境下，对待分类数据进行分类，并统计每个用户浏览每个分类的次数

Hadoop环境下调用Mahout分类器的程序细节见最后！

将程序打jar包后拷贝到集群上执行。

执行命令对待分类数据进行分类：

casliyang@singlehadoop:~/Myfiles$ hadoop jar MRClassify.jar classifier.ClassifierDriver /dataguru/hadoopdev/week8/user-sport-fenciout /dataguru/hadoopdev/week8/user-sport-bayesout /dataguru/hadoopdev/week8/model-bayes bayes

说明：

参数1：输入路径，即上一步分词处理好的待分类的文章存储路径

参数2：输出路径，即统计好的用户浏览各个分类的数量

参数3：模型所在路径

参数4：模型的算法

分类并统计的结果：

结果的每行格式：用户ID | 分类 | 浏览次数

Step6：处理上一步的输出数据，得到每个用户访问次数最多的分类

使用pig处理：

grunt> u_ct = load'/dataguru/hadoopdev/week8/user-sport-bayesout' using PigStorage('|') as (user:chararray, category:chararray, times:int);

grunt> u_stat = foreach(group u_ct by user)

>> {

>> sorted = order u_ct by times desc;

>> top = limit sorted 1;

>> generate flatten(top),SUM(u_ct.times);

>> };

grunt> store u_stat into '/dataguru/hadoopdev/week8/user-sport-pigout';

结果（第一列是用户id，第二列是浏览量最多的类别，第三列是该类别的浏览次数，第四列是该用户总共的浏览量）：

附Hadoop环境下调用Mahout分类器的程序MRClassify：

Mapper:

package classifier;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.Algorithm;
import org.apache.mahout.classifier.bayes.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.BayesParameters;
import org.apache.mahout.classifier.bayes.CBayesAlgorithm;
import org.apache.mahout.classifier.bayes.ClassifierContext;
import org.apache.mahout.classifier.bayes.Datastore;
import org.apache.mahout.classifier.bayes.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.InvalidDatastoreException;
import org.apache.mahout.common.nlp.NGrams;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import classifier.Counter;

public class ClassifierMapper extends Mapper<Text, Text, Text, IntWritable> {

	private Text outKey = new Text();
	private static final IntWritable ONE = new IntWritable(1);
	
	private int gramSize = 1;
	private ClassifierContext classifier;
	private String defaultCategory;

	private static final Logger log = LoggerFactory.getLogger(ClassifierMapper.class);

	/**
	 * Parallel Classification
	 * 
	 * @param key
	 *          The label
	 * @param value
	 *          the features (all unique) associated w/ this label
	 * @param context
	 */
	public void map(Text key, Text value, Context context)
			throws IOException, InterruptedException {

		String docLabel = "";
		String userID = key.toString();
		List<String> ngrams = new NGrams(value.toString(), gramSize).generateNGramsWithoutLabel();
		try {
			ClassifierResult result;
			result = classifier.classifyDocument(ngrams.toArray(new String[ngrams.size()])
					, defaultCategory);
			docLabel = result.getLabel();			
		} catch (InvalidDatastoreException e) {
			log.error(e.toString(), e);
			context.getCounter(Counter.FAILDOCS).increment(1);
		}
		// key is userID and docLabel
		outKey.set(userID+"|"+docLabel);
		context.write(outKey, ONE);
	}


	/**
	 * read the model
	 * @throws IOException 
	 */
	@Override
	public void setup(Context context) throws IOException{
		// get bayes parameters
		Configuration conf = context.getConfiguration();
		BayesParameters params = new BayesParameters(conf.get("bayes.parameters", ""));
		log.info("Bayes Parameter {}", params.print());  

		Algorithm algorithm;
		Datastore datastore;
		if ("bayes".equalsIgnoreCase(params.get("classifierType"))) {
			algorithm = new BayesAlgorithm();
			datastore = new InMemoryBayesDatastore(params);
		} else if ("cbayes".equalsIgnoreCase(params.get("classifierType"))) {
			algorithm = new CBayesAlgorithm();
			datastore = new InMemoryBayesDatastore(params);
		} else {
			throw new IllegalArgumentException(
					"Unrecognized classifier type: " + params.get("classifierType"));
		}

		classifier = new ClassifierContext(algorithm, datastore);
		try {
			classifier.initialize();
		} catch (InvalidDatastoreException e) {
			log.error(e.toString(), e);
		}
		
		defaultCategory = params.get("defaultCat");
		gramSize = params.getGramSize();
	}



}

Reducer:

package classifier;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ClassifierReducer extends Reducer<Text, IntWritable, NullWritable, Text> {

	private Text outValue = new Text();
	public void reduce(Text key, Iterable<IntWritable> values, Context context) 
			throws IOException, InterruptedException {
		// get the number of labels that user read
		int num = 0;
		for(IntWritable value: values){
			num += value.get();
		}
		outValue.set(key.toString()+"|"+num);
		// output
		context.write(NullWritable.get(), outValue);
	}
}

Counter：

package classifier;

public enum Counter {
	FAILDOCS
}

Driver：

package classifier;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.BayesParameters;


public class ClassifierDriver {
	
	public static void main(String[] args) throws Exception {
		
		// set bayes parameter
		BayesParameters params = new BayesParameters();
		params.setBasePath(args[2]);
		params.set("classifierType", args[3]);
		params.set("alpha_i", "1.0");
		params.set("defaultCat", "unknown");
		params.setGramSize(1);
		
		// set configuration
		Configuration conf = new Configuration();
		conf.set("bayes.parameters", params.toString());
		
		// create job
		Job job = new Job(conf,"Classifier");
		job.setJarByClass(ClassifierDriver.class);

	    // specify input format
		job.setInputFormatClass(KeyValueTextInputFormat.class);
		
        //  specify mapper & reducer
		job.setMapperClass(classifier.ClassifierMapper.class);
		job.setReducerClass(ClassifierReducer.class);
		
		// specify output types of mapper and reducer
		job.setOutputKeyClass(NullWritable.class);
		job.setOutputValueClass(Text.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		
		// specify input and output DIRECTORIES 
		Path inPath = new Path(args[0]);
		Path outPath = new Path(args[1]);
		FileInputFormat.addInputPath(job, inPath);
        FileOutputFormat.setOutputPath(job,outPath);     //  output path

		// delete output directory
		try{
			FileSystem hdfs = outPath.getFileSystem(conf);
			if(hdfs.exists(outPath))
				hdfs.delete(outPath);
			hdfs.close();
		} catch (Exception e){
			e.printStackTrace();
			return ;
		}
		
		//  run the job
		System.exit(job.waitForCompletion(true) ? 0 : 1);
		
	}

}

【甘道夫】通过Mahout构建贝叶斯文本分类器案例详解

猜你喜欢