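Pair Project 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Course: Software Engineering 1916|W

Assignment link: Pair Project 2: Hot-Word Statistics for Paper Abstracts and Advanced Requirements

Pair student IDs: 221600421-孔伟民 | 221600422-李东权

Assignment Body

Performance Analysis and PSP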

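PSP 2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning
• Estimate	• Estimate how long the task will take	600
Development
• Analysis	• Requirements analysis (including learning new technologies)	200	240
• Design Spec	• Write the design document	60	60
• Design Review	• Review the design	30	40
• Coding Standard	• Coding standard (define suitable conventions for this development)
• Design	• Detailed design	60
• Coding	• Implementation	400	600
• Code Review	• Review the code	100	200
• Test	• Testing (self-testing, fixing code, committing changes)	60	60
Reporting
• Test Report	• Write the test report	60	100
• Size Measurement	• Measure the workload	30
• Postmortem & Process Improvement Plan	• Write a postmortem and propose a process improvement plan
	Total	1000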

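Division of Work

221600422 李东权
- Main code implementation
- Requirements analysis discussion
- Assisted with writing the blog post
- Unit testing
221600421 孔伟民
- Blog post writing
- Crawler implementation
- Requirements analysis discussion
- Unit testing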

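Basic Requirements

After reading the assignment, our first question was how words, characters, and separators were each defined. Several days of discussion in the class group chat did not produce a precise answer, so we went ahead and wrote a first version of the program. The approach is to read the input line by line, split each line into an array with a regular expression, and then walk the array to check each token against the word rule. The line and character counts only need a single pass over the data.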
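
The first-draft approach described above can be sketched roughly as follows. This is an illustration only, not the project's actual WordCount class: the class name BasicCounter, its method names, and the word rule used here (at least four leading letters followed by any letters or digits) are assumptions made for the example.

```java
import java.util.regex.Pattern;

class BasicCounter {
    // Assumed word rule: at least four letters first, then any letters or digits.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{4,}[a-zA-Z0-9]*");
    // Assumed separator rule: any run of characters that are not letters or digits.
    private static final Pattern SEPARATOR = Pattern.compile("[^a-zA-Z0-9]+");

    /** Counts the tokens in one line that satisfy the assumed word rule. */
    static int countWordsInLine(String line) {
        int count = 0;
        for (String token : SEPARATOR.split(line)) {
            if (WORD.matcher(token).matches()) {
                count++;
            }
        }
        return count;
    }

    /** Counts words over a whole text by processing it line by line. */
    static int countWords(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            count += countWordsInLine(line);
        }
        return count;
    }

    /** Counts the lines that contain at least one non-whitespace character. */
    static int countNonBlankLines(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "java is awesome!!!\n\n123file file123\n";
        System.out.println("words: " + countWords(text));         // words: 3
        System.out.println("lines: " + countNonBlankLines(text)); // lines: 2
    }
}
```

In the sample text, "is" is too short and "123file" starts with digits, so neither counts as a word under the assumed rule.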

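The main functionality all lives in the WordCount class, while Main handles the command-line arguments. The class diagram is as follows:

WordCount has three main methods, which count characters, words, and lines. The key function, the WordCount word-counting method, works as follows: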

Crawler

The crawler uses the Jsoup library. It first fetches the list of papers from the CVPR2018 website. The HTML structure of the list looks like this:

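Each paper link sits under a dt tag with the class ptitle, so Elements elements = doc.select(".ptitle a") selects all of the links. Note that these links are relative paths without a host name, so when fetching an individual paper we have to prepend the prefix http://openaccess.thecvf.com/ .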
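
As a small sketch of the prefixing step, the standard java.net.URI class can resolve a relative link against the site prefix instead of concatenating strings by hand. The class name LinkResolver and the relative path in main are hypothetical and only serve the illustration. Jsoup's absUrl("href") may be another option when the document was loaded with a base URI.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkResolver {
    // Site prefix mentioned in the text above.
    static final URI BASE = URI.create("http://openaccess.thecvf.com/");

    /** Turns relative hrefs from the listing page into absolute URLs. */
    static List<String> toAbsolute(List<String> relativeLinks) {
        List<String> result = new ArrayList<>();
        for (String href : relativeLinks) {
            // resolve() joins the relative path onto the base URI
            result.add(BASE.resolve(href).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical relative path, for illustration only.
        List<String> links = Arrays.asList("content/html/example_paper.html");
        System.out.println(toAbsolute(links));
    }
}
```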

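The paper detail page:

The HTML structure is again simple. The highlighted areas are the content we need: the title, authors, and abstract. Calls such as doc.select("#papertitle").text() retrieve each piece of information.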

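Crawling the pages one by one in sequence turned out to be too slow, so we used multithreading to speed things up. ExecutorService pool = Executors.newScheduledThreadPool(8); creates a thread pool, and while iterating over the links we submit one task per paper to the pool, which makes the crawl faster.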
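
The pattern of submitting one task per link to a pool of 8 threads can be sketched as below. It is a simplified stand-in rather than the crawler's real code: fetchPaper is a hypothetical placeholder for downloading and parsing one page, and the sketch uses Executors.newFixedThreadPool(8) instead of the newScheduledThreadPool(8) call quoted above, since no scheduling is needed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFetch {
    /** Hypothetical stand-in for downloading and parsing one paper page. */
    static String fetchPaper(String url) {
        return "fetched " + url;
    }

    /** Submits one task per URL to a pool of 8 threads; results come back in input order. */
    static List<String> fetchAll(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetchPaper(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that task has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown(); // lets the worker threads exit once the queue is empty
        }
    }
}
```

Collecting the Future objects first and calling get() afterwards keeps the downloads running concurrently while still returning the results in the same order as the input list.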

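Advanced Requirements

In the main program we first read the input arguments and then pass them to the constructor of the CountAchieve class. CountAchieve implements line, character, and word counting, plus the phrase counting and weighting from the advanced requirements. The flow of the implementation is shown below: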

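On top of the basic functionality we added phrase counting and weight calculation. For ordering, the output is required to follow dictionary order, so we used TreeMap, which keeps its keys sorted in dictionary order automatically.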
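
The sketch below shows how a TreeMap's key order can carry through to the final output. It assumes, beyond what is stated above, that the output lists the most frequent words first and falls back to dictionary order only for ties. The class name HotWords and the method top are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HotWords {
    /** Returns up to `limit` entries, highest count first; ties stay in dictionary order. */
    static List<Map.Entry<String, Integer>> top(List<String> words, int limit) {
        // TreeMap keeps its keys sorted, so iteration is already in dictionary order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so entries with equal counts keep the TreeMap's key order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("word", "data", "word", "code", "data", "apple");
        for (Map.Entry<String, Integer> entry : top(words, 3)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because the sort is stable, no explicit tie-breaking comparator is needed: "data" and "word" both occur twice and come out in that order, followed by "apple".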

Code Walkthrough

charCount and lineCount are simple to implement: start from the beginning of the file stream and count while reading through it.

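public int LineCount() throws IOException {
    int count=0;
    bufferedReader.reset();   // go back to the marked position in the stream
    String line;
    while ((line=bufferedReader.readLine())!=null){
        if(!line.isEmpty())   // only non-empty lines are counted
            count++;
    }
    return count;
}

public int CharCount() throws IOException {
    int count=0;
    bufferedReader.reset();   // go back to the marked position in the stream
    int temp;
    boolean previousWasCr=false;
    while ((temp=bufferedReader.read())!=-1){
        // Count "\r\n" as one character: skip a '\n' that directly follows '\r'.
        // A lone '\r' no longer swallows the character after it.
        if(!(temp=='\n'&&previousWasCr))
            count++;
        previousWasCr=(temp=='\r');
    }
    return count;
}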

The core function is WordCount.

@Override
    public int WordCount() throws IOException {
        int count=0;
        String line;
        bufferedReader.reset();
        StringBuffer stringBuffer=new StringBuffer();
        while ((line=bufferedReader.readLine())!=null)
            stringBuffer.append(line+"\n");
        String content=stringBuffer.toString();

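        // Split the text two ways, on separators and on alphanumeric runs,
        // to get an array of alphanumeric tokens and an array of separator runs
        String [] words=content.split("([^a-zA-Z0-9]|\n)+");//1
        String [] division=content.split("[a-zA-Z0-9]+");//2

        // Check whether alphanumerics or separators come first in the text;
        // this decides the separator position when counting phrases of length M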
        int whofirst=1;
        if(words.length>0&&division.length>0){
            if(content.indexOf(words[0])<content.indexOf(division[0]))
                whofirst=1;
            else
                whofirst=2;
        }
        else if(words.length>0)
            whofirst=1;
        else if(division.length>0)
            whofirst=2;


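        // Holds the words of the current phrase of length M
        List<String> wordgroup=new ArrayList<>();

        // Regular expression for a word: at least four letters, then optional digits, then optional letters
        Pattern pattern=Pattern.compile("^[a-zA-Z]{4,}[0-9]*[a-zA-Z]*");
        Integer value=0;
        String temp="";
        int weight=1; // weight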

//        for(int i=0;i<words.length;i++)
//            System.out.println(words[i].toLowerCase());

        for(int i=0,record=0;i<words.length;i++){
            //System.out.println(words[i].toLowerCase());
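            /*
            Weight handling: Title and Abstract are two separate parts, so the accumulated state is cleared at each.
            Variables:
                temp holds the phrase being built; with M=2 it may hold [A+]B, i.e. word, separator, word
                wordgroup is similar to temp, except that it holds [A+][B+], i.e. B is stored together with the separator that follows it
                record is the number of consecutive valid words collected so far
             */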

            if(weightjudge){
                if(words[i].equals("Title")){
                    weight=10;
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
                else if(words[i].equals("Abstract")){
                    weight=1;
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
            }
            else{
                if(words[i].equals("Title")||words[i].equals("Abstract")){
                    temp="";
                    record=0;
                    wordgroup.clear();
                    continue;
                }
            }

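            // Match a word
            if(pattern.matcher(words[i]).matches()){
                count++;   // one more word
                words[i]=words[i].toLowerCase(); // convert to lower case
                temp+=words[i];
                record++;    // one more word toward the phrase length

                // Decides whether temp (word + separator) needs a separator and where it sits: 1 means the word comes first, 2 means the word comes after
                if(this.wordlength>1&&record<this.wordlength){  // once the M-th word is appended, temp needs no separator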
                    if((whofirst==1)&&(i<division.length)){
                        temp+=division[i];
                        wordgroup.add((words[i]+division[i]));
                    }
                    else if((whofirst==2)&&((i+1)<division.length)){
                        temp+=division[i+1];
                        wordgroup.add((words[i]+division[i+1]));
                    }
                    else
                        wordgroup.add((words[i]));
                }
                else if (this.wordlength>1&&record==this.wordlength){ // wordgroup still needs the word plus its separator
                    if((whofirst==1)&&(i<division.length))
                        wordgroup.add((words[i]+division[i]));
                    else if((whofirst==2)&&((i+1)<division.length))
                        wordgroup.add((words[i]+division[i+1]));
                    else
                        wordgroup.add((words[i]));
                }
                if(record==this.wordlength){   // phrase length reached
                    if(treeMap.containsKey(temp)) {
                        value = treeMap.get(temp) + weight;   // already present: add the weight
                        treeMap.put(temp, value);
                    }
                    else{
                        treeMap.put(temp,weight);
                    }
                    temp="";
                    if(this.wordlength>1){
                        // For a b c d with M=3 there are two phrases, <abc> and <bcd>, so wordgroup
                        // is used to keep b and c: it drops a and leaves bc, and temp is rebuilt
                        // as b+c+. This is why wordgroup stores one more separator than temp.
                        for(int x=1;x<wordgroup.size();x++)
                            temp+=wordgroup.get(x);
                        wordgroup.remove(0);
                    }
                    record--;
                }
            }
            else{
                temp="";
                record=0;
                wordgroup.clear();
            }
        }
        return count;
    }
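
The sliding-window idea behind the phrase counting can be shown in a much smaller form. The sketch below is not the WordCount method above: it joins the words of a phrase with a single space instead of keeping the original separators, it assumes a simpler word rule (at least four letters, then any letters or digits), and the class name PhraseCounter and the method addPhrases are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class PhraseCounter {
    // Assumed word rule: at least four letters first, then any letters or digits.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{4,}[a-zA-Z0-9]*");

    /**
     * Adds every run of m consecutive valid words in tokens to counts,
     * increasing each phrase's count by weight. An invalid token breaks the run.
     */
    static void addPhrases(String[] tokens, int m, int weight, Map<String, Integer> counts) {
        Deque<String> window = new ArrayDeque<>();
        for (String token : tokens) {
            if (!WORD.matcher(token).matches()) {
                window.clear();           // an invalid token resets the window
                continue;
            }
            window.addLast(token.toLowerCase());
            if (window.size() > m) {
                window.removeFirst();     // slide: drop the oldest word
            }
            if (window.size() == m) {
                counts.merge(String.join(" ", window), weight, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        // Title tokens carry weight 10 and abstract tokens weight 1, as in the text above.
        addPhrases(new String[]{"Deep", "learning", "for", "image", "recognition"}, 2, 10, counts);
        addPhrases(new String[]{"deep", "learning", "models"}, 2, 1, counts);
        System.out.println(counts);
    }
}
```

Calling addPhrases once per Title or Abstract section with a fresh window plays the same role as clearing temp, record, and wordgroup at each section keyword.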

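Performance Analysis

We used the JProfiler profiling tool. It shows that most of the program's time goes to splitting strings and matching regular expressions, that is, to the split and matches calls. The WordCount function is the main function of the program and accounts for 15% of the running time.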

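Unit Tests

We built several sets of test data and ran the unit tests with the JUnit support built into IDEA. The tests mainly check whether the output of the three functions charCount, wordCount, and lineCount matches what we expect. The unit test class is as follows: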

package Test;

import demo.CountAchieve;
import org.junit.Assert;
import org.junit.Test;

import static org.junit.Assert.*;

public class CountAchieveTest {

    private String load = "D:\\program\\IntellijIdeaProjects\\WordCount\\src\\Test\\";

    // List of test files
    private String[] files = {
            "test1.txt", "test2.txt", "test3.txt", "test4.txt", "test5.txt",
            "test6.txt", "test7.txt", "test8.txt", "test9.txt", "test10.txt"
    };
    // Expected outputs
    private int[] chars = {
            33, 34, 0, 10, 67,
            40, 32, 56, 55, 19
    };
    private int[] lines = {
            2, 2, 0, 0, 3,
            3, 5, 3, 2, 5
    };
    private int[] words = {
            1, 1, 0, 0, 5,
            4, 3, 6, 6, 0
    };


    @Test
    public void charCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Wrong character count: "+files[i],chars[i],t.CharCount());
            t.CloseFile();
        }
    }

    @Test
    public void wordCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Wrong word count: "+files[i],words[i],t.WordCount());
            t.CloseFile();
        }
    }

    @Test
    public void lineCount() throws Exception {
        CountAchieve t;
        for (int i = 0; i < files.length; i++) {
            t = new CountAchieve(load + files[i], "1.txt", 1, 10, false);
            Assert.assertEquals("Wrong line count: "+files[i],lines[i],t.LineCount());
            t.CloseFile();
        }
    }
}

We feed the test files in through a list of file names, and each test method loops over them to check the line, character, and word counts.

Some sample test files:

blank line
java is awesome!!!

sp#ec(ial ch*arac>ters

Expected output: characters:56 words:6 lines:3

123ABC<>?abc123)(=abcd111
*&^%$#@

Expected output: characters:34 words:1 lines:2

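Reflections

At the start we had to agree on how to divide the work between the two of us. Later, whenever one person handed their part over to the other, the code had to be managed carefully, otherwise the two copies could drift apart. We learned that working as a pair takes some adjustment. We also picked up a few new things about Git. Unit testing and performance profiling were both new to us, and this assignment gave us a first taste of them.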