[ICSE 2018] Paper Reading: Are Mutation Scores Correlated with Real Fault Detection? A Large Scale Empirica ...

Preface

A new day starts with reading a paper. This post covers the ICSE 2018 paper "Are Mutation Scores Correlated with Real Fault Detection? A Large Scale Empirical study on the Relationship Between Mutants and Real Faults".

Basic information

Authors:
Mike Papadakis, Donghwan Shin, Shin Yoo, Doo-Hwan Bae

The first author is from the University of Luxembourg.
The second, third, and fourth authors are from the Korea Advanced Institute of Science and Technology.

1 The first time I have seen a paper title with a subtitle

Are Mutation Scores Correlated with Real Fault Detection?
A Large Scale Empirical study on the Relationship Between Mutants and Real Faults

The second line is the subtitle.
I am curious how the authors go about studying this correlation.

2 The background is very well written and immediately convincing

Empirical validation of software testing studies is increasingly relying on mutants.

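At first I wondered: why study mutation scores? Is this really important?
After reading the first sentence I understood: empirical studies rely on mutants this heavily.
One sentence is enough to see the bigger picture.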

3 What the authors did

1）Empirical validation of software testing studies is increasingly relying on mutants. This practice is motivated by the strong correlation between mutant scores and real fault detection that is reported in the literature. In contrast, our study shows that correlations are the results of the confounding effects of the test suite size.
2）In particular, we investigate the relation between two independent variables, mutation score and test suite size, with one dependent variable the detection of (real) faults.
3）We use two data sets, CoreBench and Defects4J, with large C and Java programs and real faults and provide evidence that all correlations between mutation scores and real fault detection are weak when controlling for test suite size.

These three points really boil down to one: investigate the correlation among mutation score, test suite size, and the detection of (real) faults.
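A minimal sketch of what "controlling for test suite size" can mean, using made-up numbers rather than the paper's data. It is not the authors' statistical procedure; it only compares a Pearson correlation computed over all test suites with the same correlation computed separately within each suite size. The records and the helper `pearson` are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt

# Made-up records: (test suite size, mutation score, 1 if the real fault is detected else 0).
records = [
    (10, 0.2, 1), (10, 0.3, 0), (10, 0.2, 0), (10, 0.3, 1), (10, 0.2, 0), (10, 0.3, 0),
    (50, 0.7, 1), (50, 0.8, 1), (50, 0.7, 1), (50, 0.8, 1), (50, 0.7, 0), (50, 0.8, 0),
]


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlation over all suites, ignoring size.
overall = pearson([r[1] for r in records], [r[2] for r in records])

# Correlation within each size group, i.e. with size held fixed.
by_size = defaultdict(list)
for size, score, detected in records:
    by_size[size].append((score, detected))

within = {
    size: pearson([s for s, _ in rows], [d for _, d in rows])
    for size, rows in by_size.items()
}

print("overall:", round(overall, 3))
for size, r in sorted(within.items()):
    print("size", size, ":", round(r, 3))
```

In this toy data larger suites have both higher mutation scores and a higher detection rate, so the overall correlation is positive (about 0.33), while within each size the correlation is zero up to rounding. The paper reports weak rather than zero correlations; the toy data is deliberately extreme.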

4 The authors' findings

1）We also find that both independent variables significantly influence the dependent one, with significantly better fits, but overall with relative low prediction power.
2）By measuring the fault detection capability of the top ranked, according to mutation score, test suites (opposed to randomly selected test suites of the same size), we find that achieving higher mutation scores improves significantly the fault detection.

Taken together, our data suggest that mutants provide good guidance for improving the fault detection of test suites, but their correlation with fault detection are weak

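The phrase "taken together" is not one I see often.
The conclusion puzzles me a little: why do mutants provide "good guidance" while their correlation with fault detection is called "weak"?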
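A minimal sketch of the comparison described in finding 2), on made-up numbers: among test suites of the same size, take the ones ranked highest by mutation score and compare their fault detection rate with the rate over all suites of that size, which is what a uniformly random pick would achieve on average. The data and the helper names `detection_rate` and `top_ranked` are assumptions for illustration, not the paper's data or code.

```python
# Made-up data: ten test suites of the SAME size, as (mutation score, detects the real fault).
suites = [
    (0.50, False), (0.51, True), (0.52, False), (0.53, True), (0.54, False),
    (0.55, True), (0.56, False), (0.57, False), (0.58, True), (0.59, True),
]


def detection_rate(group):
    """Fraction of suites in the group that detect the fault."""
    return sum(1 for _, detects in group if detects) / len(group)


def top_ranked(group, k):
    """The k suites with the highest mutation score."""
    return sorted(group, key=lambda suite: suite[0], reverse=True)[:k]


top_rate = detection_rate(top_ranked(suites, 2))
baseline_rate = detection_rate(suites)  # expected rate of a uniformly random pick

print("top ranked:", top_rate)
print("random baseline:", baseline_rate)
```

Here the two top-ranked suites both detect the fault, while only half of all suites do, even though detection is scattered across the lower scores. This asks a different question from a correlation over all suites, which may be one way to read "good guidance" alongside "weak correlation".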

5 Words that do not appear in the abstract can apparently still be used as keywords

mutation testing, real faults, test suite effectiveness

6 Introduction

1）What is the relation between mutants and real faults? To date, this fundamental question remains open and, to large extent, unknown if not controversial. Though, a large body (approximately 19% [34]) of the software testing studies rely on mutants.

[34] Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon. 2016. Threats to the validity of mutation-based test assessment. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saarbrücken, Germany, July 18-20, 2016. 354–365. https://doi.org/10.1145/2931037.2931040

Impressive. Publishing at top venues every year is remarkable; it must take sharp research instincts and a great deal of effort.

The same first author also had a top-venue paper in 2015.
[35] Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial Compiler Equivalence: A Large Scale Empirical Study of a Simple, Fast and Effective Equivalent Mutant Detection Technique. In 37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 1. 936–946. https://doi.org/10.1109/ICSE.2015.103

Much respect.

2）The second paragraph of the introduction reads a lot like related work
Recent research investigated certain aspects of the fault and mutant relation, such as the correlation between mutant kills with real fault detection [3, 24] and the fault detection capabilities of mutation testing [8]. Just et al. [24] report that there is “a statistically significant correlation between mutant detection and real fault detection, independently of code coverage”, while Chekam et al. [8] that “fault revelation starts to increase significantly only once relatively high levels of coverage are attained”.

This paragraph shows that the conclusions reached so far do not fully agree with each other, so there is still work to be done.

[24] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014. 654–665. https://doi.org/10.1145/2635868.2635929
A paper by leading researchers, for whom top venues are routine.

Although these studies provide evidence supporting the use of mutants in empirical studies, this is contradictory to the findings of other studies, e.g., study of Namin and Kakarla [28], and to some extent between themselves (as they do not agree on the strength and nature of the investigated relations). Furthermore, there are many aspects of the mutant-fault relation that still remain unknown.

3）

I think the way the authors structure the paper is excellent. I do not work on mutation testing and do not know it well, yet they state each problem clearly and the facts they present are convincing, which I admire a great deal.
Worth learning from.

The differences between these two evaluation metrics is important as they are extensively used in empirical studies [36]. Yet, it is unclear whether there are any practically significant differences between them. In case the differences are significant, one could draw different conclusions by using one metric over the other. Thus, investigating the potential differences between these metrics can be useful to other studies that compare test criteria and test methods.

This spells out why their own work matters.

4）Learning the pattern "as … as possible"

To perform our analysis in a reliable and as generic as possible way

Note the phrase "as generic as possible way".

5）The authors' concrete experimental setup
To perform our analysis in a reliable and as generic as possible way, we use the developer test suites, augmented by state-of-the-art test generation tools, KLEE for C [7], Randoop [32] and EvoSuite [15] for Java. These tools helped us composing a large, diverse and relatively strong test pool from which we sample multiple test suites. To ensure the validity of our analysis, we also repeat it with the developer and automatically generated test suites and found insignificant differences

They use many tools, and they point out: "These tools helped us composing a large, diverse and relatively strong test pool from which we sample multiple test suites."
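A minimal sketch of sampling test suites from a test pool and scoring each one, to make the setup above concrete. Everything here is hypothetical: the kill matrix `KILLS`, the mutant count `TOTAL_MUTANTS`, the set `FAULT_REVEALING`, and the helper names are invented for illustration and do not come from the paper's tooling. Mutation score is taken as killed mutants divided by all mutants, and a suite is assumed to detect the fault if at least one of its tests fails on the faulty version.

```python
import random

# Hypothetical kill matrix: test name -> ids of the mutants that the test kills.
KILLS = {
    "t1": {1, 2},
    "t2": {2, 3},
    "t3": {4},
    "t4": set(),
    "t5": {1, 5, 6},
}
TOTAL_MUTANTS = 8  # assumed number of mutants under consideration
FAULT_REVEALING = {"t3", "t5"}  # hypothetical: tests that fail on the faulty version


def mutation_score(suite):
    """Fraction of all mutants killed by at least one test in the suite."""
    killed = set().union(*(KILLS[test] for test in suite))
    return len(killed) / TOTAL_MUTANTS


def detects_fault(suite):
    """True if any test in the suite reveals the real fault."""
    return any(test in FAULT_REVEALING for test in suite)


def sample_suites(pool, size, count, seed=0):
    """Draw `count` suites of `size` distinct tests each from the pool."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.sample(pool, size) for _ in range(count)]


pool = sorted(KILLS)
sampled = sample_suites(pool, size=2, count=5)
for suite in sampled:
    print(suite, mutation_score(suite), detects_fault(suite))
```

Sampling many suites per size in this way yields the (size, mutation score, fault detected) observations that a correlation analysis can then work on.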

7 The authors' second section: Mutation Analysis

Andrews et al. [3, 4] used a C program (named space) of approximately 5,000 lines of code with 38 faults and demonstrated that mutant kills and fault detection ratios have similar trends. In a later study, Namin and Kakarla [28] used the same program and fault set and came to the conclusion that there is a weak correlation between mutants and fault detection ratios. Recently, Just et al. [24] used a large number of real faults from five Java projects and demonstrated that mutant detection rates have a strong positive correlation with fault detection rates. Since the study of Just et al. [24] did not consider test suite size and its results contradict the ones of Namin and Kakarla [28], it remains unclear whether mutation score actually correlates with fault detection when test suite size is controlled.

Papadakis and Malevris [37] used the space program, C program of approximately 5,000 lines of code, with 38 faults and found that mutants provide good guidance towards improving test suites independent of test suite size. Shin et al. [40] and Ramler et al. [38] came to similar conclusions (mutants can help improving the fault detection of test suites). However, both these studies did not account for the size effects of the test suites.

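These two paragraphs show that mutation scores and real defects are closely linked as research topics: people studied the relation long ago and are still studying it now.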

8 On the third section: experimental procedure

In our study, we use two sets of subjects, CoREBench [5] and Defects4J [23]. We choose these subjects as they form instances of relatively large projects that are accompanied by mature test suites as well as many real faults. CoREBench consists of four C programs named “Coreutils”, “Findutils”, “Grep” and “Make”. Defects4J consists of five Java programs named “JFreeChart”, “Closure”, “Commons-Lang”, “Commons-Math” and “Joda-Time”.

Why they did not choose the Siemens suite:
Most of the previous studies rely on the programs from the Software Infrastructure Repository (SIR) [11, 20], typically using the programs composing the Siemens Suite, space and Unix utilities. Many of these programs includes artificially seeded faults and, consequently, are less relevant to this study, simply because we investigate the representativeness of mutants (which are artificially seeded faults themselves).

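9 The first sentence of each paragraph is either a summary or closely tied to that paragraph's topic, which makes the paper easy to follow.

10 The correlation methods the authors use

To further investigate the association between a) and b), we used Kendall and Pearson correlation on the data where we had statistically significant fault detection improvements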

kendall
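A minimal sketch contrasting the two coefficients on made-up numbers. Pearson measures linear association, while Kendall's tau is rank-based and only looks at whether pairs of observations are ordered the same way. The function `kendall_tau_a` below is the simple tau-a variant with no tie correction, written out for illustration; the paper does not say which variant or tool it used. In practice a library routine such as `scipy.stats.kendalltau` (which computes tau-b by default) or `scipy.stats.pearsonr` would normally be used.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation (linear association)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up values: detection grows monotonically but not linearly with mutation score.
mutation_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
detection_ratios = [0.01, 0.02, 0.05, 0.20, 0.90]

tau = kendall_tau_a(mutation_scores, detection_ratios)
r = pearson(mutation_scores, detection_ratios)
print("Kendall tau-a:", tau)
print("Pearson r:", round(r, 3))
```

Because every pair is ordered the same way, tau-a is exactly 1.0 here, while Pearson's r stays below 1 since the relation is not linear. Reporting both, as the authors do, guards against reading too much into either one.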

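11 A hasty summary

I have not been in great shape lately, or maybe it has just been too long since I last read a paper.

There is a lot more in this paper worth reading, such as the correlation study, the threats to validity, how the box plots are drawn, how the conclusions are reached, and the experimental analysis.

I do not feel like reading further right now. This reading took about forty minutes, so I will stop here for now.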