Hadoop-based review sentiment prediction

For a Hadoop course assignment, the teacher asked us to build and evaluate a review-sentiment prediction system. This post records the core parts of the work for later reference by classmates.

hadoop-word-predict

A review sentiment prediction system based on Hadoop

The experiment requirements are as follows:

Write a Java program that trains a sentiment classifier from the training data set uploaded to HDFS as "study_upload_file.txt". During training, filter out tokens that contain non-Chinese characters (i.e., keep only tokens composed entirely of Chinese characters). Save the model to "study_model.txt" in the following format:

类标_词语1\t计数
类标_词语2\t计数
类标_词语3\t计数
……
类标1\t计数
类标2\t计数

Using the model parameters (Nc and Ncw obtained from training, where c denotes the sentiment class label, c ∈ {好评, 差评} (positive, negative), w ∈ V, and V is the set of Chinese words contained in the uploaded training data set), determine the sentiment label of each record in the "test.txt" data set. Write the results to "study_predictions.txt", where each line holds the line number of the corresponding "test.txt" record and its predicted sentiment label, in the following format:

1 情感标签 
2 情感标签 
3 情感标签 
……
2000 情感标签 
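The parameters Nc and Ncw are those of a multinomial Naive Bayes classifier. The standard prediction rule with add-one smoothing would be (my formulation of the textbook model; the implementation described later in this post actually uses a simpler count-ratio heuristic):

```latex
P(c \mid d) \propto P(c) \prod_{w \in d} P(w \mid c), \qquad
P(c) = \frac{N_c}{\sum_{c'} N_{c'}}, \qquad
P(w \mid c) = \frac{N_{cw} + 1}{\sum_{w' \in V} N_{cw'} + |V|}
```

where d is a review, c ∈ {好评, 差评}, Nc is the number of training reviews with label c, and Ncw is the count of word w in reviews with label c.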

The training and test data are formatted as follows:

好评	几乎 凌晨 才 到 包头 包头 没有 什么 特别 好 酒店 每次 来 就是 住 这家 所以 没有 忒 多 对比 感觉 行 下次 还是 得到 这里 来 住
好评	住 过 几次 东莞 酒店 海悦 地理位置 早餐 最棒 听说 朋友 说 请来 厨师 来头 呵呵 冲 这个 去
好评	酒店设施 比较 不错 就是 携程 价格 酒店 前台 一样 没有 竞争力
好评	房间 不算 大 中规中矩 北方 服务 真的 不敢恭维 CHECK IN 后 没有 服务生 帮 你 拿 行李 到 房间 去 周围 酒店 没 啥 逛 自己 吃 早饭 可以 去 万豪 喜来登 之间 那条 路 永和 豆浆店 很 便宜
好评	通过 朋友 介绍 住 苏州 南林 饭店 一进 酒店 大堂 感觉 很 好 酒店 行李 员 前台 服务员 大堂 经理 很 热情 有种 宾至如归 感觉 房间 很 特色 背景 墙上 金色 字体 诗词 我 住 朝南 景观 房 感觉 真的 很 好 一 出门 就是 娱乐 酒吧 一条街 美食 一条街 出门 很 方便 下次 来 苏州 我 会 选择 南林 我 会 介绍 我 朋友 入住 南林 饭店
好评	西宁 住 过 几个 酒店 此 酒店 虽然 比起 内地 四星级 差 一些 但 西宁 算是 不错 价格 不 高 房间 里 东西 倒 干净 地毯 有点 脏 用 地 暖 感觉 比 空调 舒服 多 没有 噪音 安全 周围环境 尚可
好评	房间 算 整齐 宽敞 我 住 标准间 大床 房 只是 浴室 淋浴 笼头 不太好 出水 不 均匀 洗澡 不 舒服 服务 不错 到 酒店 早上 点 让 我 提前 入住 而且 结账 速度 比较 快 不 耽误时间 酒店 靠近 号 地铁 算 方便

Description

Environment setup and file upload are not covered here; this post focuses on the main ideas of the implementation.

The prediction model is implemented with two MapReduce jobs.

Job 1: word frequency statistics. Count the occurrences of each word under each sentiment label, producing output in the format:

类标_词语1\t计数
类标_词语2\t计数
类标_词语3\t计数
...
好评_几乎 \t 23

Mapper implementation:

Split each input line on \t into the label and the keyword string, then split the keyword string on spaces. For each keyword, write <label_keyword, 1> to the context; also write a <统计_label, 1> record for the per-label review count.

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // split the line on the tab separator
        String[] words = value.toString().split("\t");
        // words[0] is the label (好评 or 差评)
        // words[1] is the review text

        // skip reviews that have no keywords after the label
        if (words.length == 2) {
            String[] pjs = words[1].split(" ");
            for (String pj : pjs) {
                // filter out tokens containing non-Chinese characters
                if (isAllChinese(pj)) {
                    context.write(new Text(words[0] + "_" + pj), new IntWritable(1));
                }
            }
            // count the number of positive/negative reviews
            context.write(new Text("统计_" + words[0]), new IntWritable(1));
        }
    }

To check whether a token is pure Chinese, a small helper method is used:

    /**
     * Returns true if the string consists entirely of Chinese characters.
     */
    public boolean isAllChinese(String str) {

        if (str == null) {
            return false;
        }
        for (char c : str.toCharArray()) {
            if (!isChinese(c)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if the single character is a Chinese character
     * (CJK Unified Ideographs, U+4E00..U+9FA5).
     */
    public boolean isChinese(char c) {
        return c >= 0x4E00 && c <= 0x9FA5;
    }
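The filter can be exercised standalone, independent of Hadoop (a minimal sketch with the same character-range logic; the sample tokens come from the test data above, e.g. "CHECK IN"):

```java
public class ChineseFilterDemo {

    // Same semantics as the article's helper: null is rejected,
    // every character must be in the CJK Unified Ideographs range.
    static boolean isAllChinese(String str) {
        if (str == null) {
            return false;
        }
        for (char c : str.toCharArray()) {
            if (c < 0x4E00 || c > 0x9FA5) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllChinese("房间"));  // true: pure Chinese, kept
        System.out.println(isAllChinese("CHECK")); // false: Latin letters, filtered
        System.out.println(isAllChinese("IN"));    // false: filtered
    }
}
```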

Reducer implementation:

Accumulate the values for each key until every label_keyword combination has its total count, then write <label_keyword, count> to the context:

    /**
     * Map output is shuffled so that all values with the same key
     * go to the same reducer, e.g.
     *   (hello, <1,1,1>)
     *   (welcome, <1>)
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }

Counting positive and negative reviews

In the mapper, when a record's label is 好评 (positive), a <统计_好评, 1> pair is written to the context; when it is 差评 (negative), a <统计_差评, 1> pair is written.

Note: the reducer sorts its keys automatically, so the statistics records are given the distinct 统计_ prefix so that they cannot collide with the word-count keys and stay clearly separated in the output.
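The end-to-end effect of Job 1 can be simulated in memory without a cluster (a plain-Java sketch under the same tab/space splitting rules; class and method names here are mine, not from the project):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSim {

    // Simulates Job 1: map each "label\ttext" line to label_word pairs,
    // then reduce by summing the counts in a map.
    static Map<String, Integer> run(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length != 2) continue; // skip lines without keywords
            for (String word : parts[1].split(" ")) {
                // drop tokens that are empty or contain non-Chinese characters
                if (word.isEmpty()
                        || !word.chars().allMatch(c -> c >= 0x4E00 && c <= 0x9FA5)) {
                    continue;
                }
                counts.merge(parts[0] + "_" + word, 1, Integer::sum);
            }
            counts.merge("统计_" + parts[0], 1, Integer::sum); // per-label review count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = run(new String[]{
                "好评\t房间 不错 不错",
                "差评\t房间 脏"
        });
        System.out.println(m.get("好评_不错")); // 2
        System.out.println(m.get("统计_差评")); // 1
    }
}
```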

Job 2: sentiment prediction. Produce the prediction results in the following format:

1 情感标签 
2 情感标签 
3 情感标签 
4 好评

Mapper implementation:

The mapper mainly tokenizes the test data; only the keyword part after the label is used for prediction. Split each line on \t first, then split the keyword part on spaces, and write each keyword as <line number, keyword> to the context.

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        /**
         * Decompose the test file into (line number, keyword) pairs.
         */

        // split the line on the tab separator
        String[] words = value.toString().split("\t");
        // words[0] is the label (好评 or 差评)
        // words[1] is the review text
        IntWritable lineCount = new IntWritable(PredictApp.lineCount++);
        // skip reviews that have no keywords after the label
        if (words.length == 2) {
            String[] pjs = words[1].split(" ");
            for (String pj : pjs) {
                // filter out tokens containing non-Chinese characters
                if (isAllChinese(pj)) {
                    context.write(lineCount, new Text(pj));
                }
            }
        }
    }

Reducer implementation:

First, load the trained model into a HashMap in a static initializer block so that the prediction code can look it up easily:

	static Map<String, Integer> wordMap = new HashMap<>();

    static {
        // 1. read the model file from HDFS via the HDFS API
        Path input = new Path("/output_2017081119/part-r-00000");
        try {
            // obtain the HDFS file system
            FileSystem fs = FileSystem.get(new URI("hdfs://192.168.199.200:8020"), new Configuration(), "hadoop");

            // list the files without recursing
            RemoteIterator<LocatedFileStatus> iterator = fs.listFiles(input, false);

            while (iterator.hasNext()) {
                LocatedFileStatus file = iterator.next();
                FSDataInputStream in = fs.open(file.getPath());
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] split = line.split("\\s+");
                    if (split.length == 2) {
                        wordMap.put(split[0], Integer.parseInt(split[1]));
                    }
                }
                reader.close();
                in.close();
            }

            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
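The parsing step itself (splitting each model line on whitespace and keeping well-formed pairs) can be checked independently of HDFS (a sketch; the model lines below are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ModelParseDemo {

    // Parses "key\tcount" model lines into a map, as the static block does;
    // lines that do not split into exactly two fields are skipped.
    static Map<String, Integer> parse(String[] lines) {
        Map<String, Integer> wordMap = new HashMap<>();
        for (String line : lines) {
            String[] split = line.split("\\s+");
            if (split.length == 2) {
                wordMap.put(split[0], Integer.parseInt(split[1]));
            }
        }
        return wordMap;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = parse(new String[]{
                "好评_不错\t23",
                "统计_好评\t1000",
                "malformed"          // single field: ignored
        });
        System.out.println(m.get("好评_不错")); // 23
        System.out.println(m.size());           // 2
    }
}
```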

Then write a function that decides, from a line's set of keywords, whether that line is predicted to be a positive review:

    /**
     * Predicts the likely label of a review from its keywords.
     */
    public Boolean checkIsGood(Iterable<Text> values) {

        int goodNum = 0;

        // iterate over the line's keywords
        for (Text value : values) {
            // look up the keyword's counts on the positive and negative sides
            Integer good = wordMap.get("好评_" + value);
            Integer bad = wordMap.get("差评_" + value);

            /**
             * Multiplier scoring:
             * if a count is missing, treat it as 1 (add-one smoothing);
             * if good > bad, add ceil(good / bad);
             * if good < bad, add -floor(bad / good);
             * if good == bad, the word contributes 0.
             * Result on the test set: 876 : 1124
             */
            good = good == null ? 1 : good + 1;
            bad = bad == null ? 1 : bad + 1;
            if (!good.equals(bad)) { // compare values, not Integer references
                // negative reviews dominate the data, so their weight is damped
                // by rounding the negative factor's magnitude down (floor)
                Double v = (good > bad) ? Math.ceil(good * 1.0 / bad) : -Math.floor(bad * 1.0 / good);
                goodNum += v.intValue();
            }
        }

        return goodNum >= 0;
    }

The reducer then writes <line number, predicted label> to the context. Because the first 1000 test records are positive and the next 1000 negative, the correctness check is hard-coded against that layout; finally, the positive/negative counts and the accuracy are also written to the context.

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        Boolean isGood = checkIsGood(values);
        context.write(key, new Text(isGood ? "好评" : "差评"));
        // count positive and negative predictions
        if (isGood) ++goodCount;
        else ++badCount;

        // count correct predictions (lines 1..1000 are 好评, 1001..2000 are 差评)
        int lineNum = key.get();
        if ((lineNum <= 1000 && isGood)
                || (lineNum > 1000 && lineNum <= 2000 && !isGood)) {
            correctCount++;
        }

        // after the last line, append the statistics
        if (lineNum == 2000) {
            context.write(new IntWritable(2017081119), new Text("好评统计:" + goodCount));
            context.write(new IntWritable(2017081119), new Text("差评统计:" + badCount));
            context.write(new IntWritable(2017081119), new Text("预测正确率:" + correctCount / 2000.0));
        }
    }

The scoring algorithm, in short:

Integer good = wordMap.get("好评_" + value);
Integer bad = wordMap.get("差评_" + value);
good = good == null ? 1 : good + 1;
bad = bad == null ? 1 : bad + 1;
if (!good.equals(bad)) {
       Double v = (good > bad) ? Math.ceil(good * 1.0 / bad) : -Math.floor(bad * 1.0 / good);
       goodNum += v.intValue();
}

Iterate over the line's keywords; for each keyword, prepend each label and look up its count in the trained model; if the count is missing, use 1, otherwise use n + 1.

goodNum is the line's overall positivity score: if it is >= 0 the line is judged a positive review, otherwise a negative one.

If a word's positive and negative counts are equal, its coefficient is 0.

If a word's positive count is greater than its negative count, its coefficient is positive count / negative count, rounded up.

If a word's positive count is less than its negative count, its coefficient is the negative of negative count / positive count (rounded down in the code, to damp the dominant negative class).

Summing the coefficients of all of a line's words gives the line's overall score.
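The scoring rules above, extracted into a standalone function with a hand-built model map (a sketch; the counts are illustrative, not from the real training run):

```java
import java.util.Map;

public class ScoreDemo {

    // Multiplier scoring: ceil(good/bad) when the positive side dominates,
    // -floor(bad/good) when the negative side dominates, 0 on a tie.
    static int score(String[] words, Map<String, Integer> model) {
        int goodNum = 0;
        for (String w : words) {
            Integer good = model.get("好评_" + w);
            Integer bad  = model.get("差评_" + w);
            good = good == null ? 1 : good + 1; // add-one smoothing
            bad  = bad  == null ? 1 : bad + 1;
            if (!good.equals(bad)) {            // value comparison, not ==
                double v = (good > bad) ? Math.ceil(good * 1.0 / bad)
                                        : -Math.floor(bad * 1.0 / good);
                goodNum += (int) v;
            }
        }
        return goodNum;
    }

    public static void main(String[] args) {
        // hypothetical counts for two words
        Map<String, Integer> model = Map.of(
                "好评_不错", 30,
                "差评_不错", 5,
                "差评_脏", 20);
        // 不错: ceil(31/6) = 6; 脏: -floor(21/1) = -21; total -15 → 差评
        System.out.println(score(new String[]{"不错", "脏"}, model)); // -15
        System.out.println(score(new String[]{"不错"}, model) >= 0);  // true → 好评
    }
}
```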

Counting predictions and computing the accuracy

Three counters are added to the reducer: the number of positive predictions, the number of negative predictions, and the number of correct predictions.

After each line is scored, the positive counter is incremented if the line was judged positive, otherwise the negative counter.

From the prediction file layout, the first 1000 lines are positive reviews and the following 1000 are negative; comparing each prediction against this ground truth, the correct counter is incremented on a match.

After line 2000, the three statistics are written to the context as <student ID, description: count (or ratio)> records.
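The hard-coded correctness check reduces to a single predicate (a sketch; the 1000/1000 split comes from the test file layout described above):

```java
public class AccuracyDemo {

    // Line numbers 1..1000 are labelled 好评, 1001..2000 are 差评.
    static boolean isCorrect(int lineNum, boolean predictedGood) {
        return (lineNum <= 1000 && predictedGood)
                || (lineNum > 1000 && lineNum <= 2000 && !predictedGood);
    }

    public static void main(String[] args) {
        System.out.println(isCorrect(1, true));     // true: positive half, predicted 好评
        System.out.println(isCorrect(1500, true));  // false: negative half, predicted 好评
        System.out.println(isCorrect(1500, false)); // true: negative half, predicted 差评
    }
}
```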

Web page implementation

The experiment also requires a file upload interface for the prediction input and an interface to display the prediction results and statistics.

SpringBoot is used to quickly build a backend with a separate frontend, exposing two RESTful interfaces for data exchange. The implementation is as follows.

File Upload

File upload happens in two steps: first the MultipartFile is saved to a folder on the platform where the application is deployed (or a file server) and its local path is obtained; then that file is uploaded to HDFS.

 @PostMapping(value = "/upLoadFile")
    public CommonreturnType upLoadFile(@RequestParam(value = "file") MultipartFile file) throws Exception {

        /**
         * Upload the file received by the platform on to HDFS.
         */
        // transfer the file from the browser to the server
        String fileUrl = uploadFile(file);
        System.out.println(fileUrl);
        // transfer the file from the server to Hadoop
        fileUpLoadToHdfs(fileUrl);
        // run the prediction and generate the output file
//        new PredictApp();
        return CommonreturnType.create(200);
    }

    /**
     * Uploads a file to HDFS.
     *
     * @param filePath path of the file on the local server
     * @throws URISyntaxException
     * @throws IOException
     * @throws InterruptedException
     */
    private void fileUpLoadToHdfs(String filePath) throws URISyntaxException, IOException, InterruptedException {

        Configuration configuration = new Configuration();
        // set the replication factor to 1
        configuration.set("dfs.replication", "1");
        /**
         * arg 1: the HDFS URI
         * arg 2: client-side configuration parameters
         * arg 3: the client identity, i.e. the user name to operate as
         */
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.199.200:8020"), configuration, "hadoop");
        Path src = new Path(filePath);
        Path dst = new Path("/predict_input_2017081119/test.txt");
        fileSystem.copyFromLocalFile(src, dst);
        fileSystem.close();
    }


    /**
     * File upload helper:
     * saves the uploaded file to the deployment platform.
     *
     * @param file the uploaded multipart file
     */
    private String uploadFile(MultipartFile file) {

        String fileName = file.getOriginalFilename();
        // path on the platform where this project runs
        String filePath = "H:/WorkSpace/intellijWorkspace/hadoop-word-predict/upload/";

        try {
            File targetFile = new File(filePath);
            if (!targetFile.exists()) {
                targetFile.mkdirs();
            }
            FileOutputStream out = new FileOutputStream(filePath + fileName);
            out.write(file.getBytes());
            out.flush();
            out.close();
        } catch (Exception e) {

            e.printStackTrace();
            return null;
        }
        return filePath + fileName;
    }

Data Display Interface

First, define the JSON response objects:

private Double goodCount; // number of positive predictions
private Double badCount;  // number of negative predictions
private Double correct;   // accuracy
private List<PredictResult> predictResults; // per-line results


PredictResult:
private String lineNum; // line number
private String pResult; // predicted label
private String tResult; // actual label
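A response of this shape might serialize as (all field values invented for illustration):

```json
{
  "goodCount": 1012,
  "badCount": 988,
  "correct": 0.85,
  "predictResults": [
    { "lineNum": "1", "pResult": "好评", "tResult": "好评" },
    { "lineNum": "2", "pResult": "差评", "tResult": "好评" }
  ]
}
```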

For display, the prediction results go into a paginated table, and the review counts and accuracy are drawn as a pie chart with echarts.js.


That is roughly the whole idea; for the complete code, see my GitHub.



Origin blog.csdn.net/qq_41170102/article/details/104485231