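Section 3, Advanced MapReduce: 4. Building an Inverted Index

Building an Inverted Index

Requirements Analysis

Requirement: given a large body of text (documents, web pages), build a search index.

The final result records how many times each word appears in each document.

Approach:

First, read the full content of every document. For each word, emit the word joined with the document name as the key and 1 as the value, so that the data is organized in the following form.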

Map output

hello-a.txt 1
tom-a.txt 1
hello-a.txt 1
jerry-a.txt 1

Input to the reduce phase, after values with the same key have been grouped
hello-a.txt <1,1>

Reduce output

hello-a.txt 2

tom-a.txt 1

jerry-a.txt 1
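
The data flow above can be traced without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class name InvertedIndexSketch, the helper methods mapLine and reduce, and the two sample lines "hello tom" and "hello jerry" are assumptions made for illustration. It emits one word-fileName key per word, then groups identical keys and sums their 1s.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // Map phase: emit one "word-fileName" key per word occurrence (the value 1 is implicit).
    static List<String> mapLine(String fileName, String line) {
        List<String> keys = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                keys.add(word + "-" + fileName);
            }
        }
        return keys;
    }

    // Shuffle and reduce phases: group identical keys and add up one 1 per occurrence.
    static Map<String, Long> reduce(List<String> keys) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample content of a.txt, chosen to match the map output shown above.
        List<String> emitted = new ArrayList<>();
        emitted.addAll(mapLine("a.txt", "hello tom"));
        emitted.addAll(mapLine("a.txt", "hello jerry"));
        reduce(emitted).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

Because the sketch collects results in a TreeMap, the keys print in sorted order, which may differ from the order of the listing above.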

Code:

IndexMain:
...

TextInputFormat.addInputPath(job,new Path("file:///D:\\Study\\BigData\\heima\\stage2\\5、大数据离线第五天\\倒排索引\\input"));

TextOutputFormat.setOutputPath(job,new Path("file:///D:\\Study\\BigData\\heima\\stage2\\5、大数据离线第五天\\倒排索引\\out_index"));
...

IndexMapper:

package cn.itcast.demo2.index;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

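public class IndexMapper extends Mapper<LongWritable,Text,Text,LongWritable> {
    // Reused for every record so that no new objects are allocated per word
    private final Text outKey = new Text();
    private final LongWritable one = new LongWritable(1L);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Determine which file this record comes from:
        // get the input split currently being processed
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        // and take the file name from its path
        String name = inputSplit.getPath().getName();

        String line = value.toString();
        // Split on any run of whitespace so that repeated spaces do not produce empty words
        String[] split = line.trim().split("\\s+");
        for(String word:split){
            if(word.isEmpty()){
                continue; // blank line
            }
            // Output format: tom-b.txt   1
            outKey.set(word+"-"+name);
            context.write(outKey,one);
        }
    }
}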

IndexReducer:

package cn.itcast.demo2.index;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

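public class IndexReducer extends Reducer<Text,LongWritable,Text,LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long num = 0L;
        for(LongWritable longWritable:values){
            // Add the value itself instead of counting elements,
            // so the total stays correct if a combiner pre-aggregates the 1s
            num += longWritable.get();
        }
        // Output format: hello-a.txt   2
        context.write(key,new LongWritable(num));
    }
}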
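
The job writes one line per word-file pair. To read the result as "which word appears in which document how many times", the lines can be regrouped by word. The following is a small plain-Java sketch of that regrouping, not part of the original job: the class name IndexRegroupSketch and the method regroup are hypothetical, the sample lines are made up for illustration, and the code assumes tab-separated output (the TextOutputFormat default) and file names that contain no hyphen.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexRegroupSketch {

    // Turns lines such as "hello-a.txt\t2" into word -> (file -> count).
    static Map<String, Map<String, Long>> regroup(List<String> reducerLines) {
        Map<String, Map<String, Long>> index = new TreeMap<>();
        for (String line : reducerLines) {
            String[] keyAndCount = line.split("\t");
            String key = keyAndCount[0];
            // Assumption: file names contain no '-', so the last '-' separates word from file.
            int sep = key.lastIndexOf('-');
            String word = key.substring(0, sep);
            String file = key.substring(sep + 1);
            long count = Long.parseLong(keyAndCount[1].trim());
            index.computeIfAbsent(word, w -> new TreeMap<>()).put(file, count);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample of reducer output lines.
        List<String> lines = Arrays.asList(
                "hello-a.txt\t2", "tom-a.txt\t1", "jerry-a.txt\t1", "tom-b.txt\t1");
        regroup(lines).forEach((word, files) -> System.out.println(word + "\t" + files));
    }
}
```

Splitting at the last hyphen keeps hyphenated words intact, at the cost of misparsing any file whose name itself contains a hyphen; a separator that cannot occur in words or file names would avoid the ambiguity.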
