Reading Notable Papers: An Investigation into the Use of Mutation Analysis for Automated Program Repair [SSBSE 2017]

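Preface

Having read this far, I find I have almost forgotten the first few papers I read...

How to overcome forgetting?
1) If you find you have forgotten a paper, go back and review it (I have kept notes on CSDN for all of them anyway);
2) I think you still need to be interested in the paper; when I am interested, I read especially fast.
3) Read tirelessly and repeatedly, because reviewing the old reveals the new.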

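What this post covers

This post introduces a paper from the field of search-based software engineering: An Investigation into the Use of Mutation Analysis for Automated Program Repair [from SSBSE 2017]

What is SSBSE?

The SSBSE is the premier international research symposium on Search-based Software Engineering. My impression is that SSBSE is a well-known symposium in the search-based software engineering field.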

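symposium
UK [sɪmˈpəʊziəm] US [sɪmˈpoʊziəm]
n. a conference or meeting on a particular subject; a collection of papers on a subject; (in ancient Greece) a drinking party, banquet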

SSBSE 18:
http://ssbse18.irisa.fr/

SSBSE 17:
http://ssbse17.github.io/
Judging from the papers accepted in 2017, there are not many of them; only 7 are full papers.

1 Basic information

Authors: Christopher Steven Timperley, Susan Stepney, and Claire Le Goues

2 What does the abstract say?

1) The state of APR
Research in Search-Based Automated Program Repair has demonstrated promising results, but has nevertheless been largely confined to small, single-edit patches using a limited set of mutation operators.

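Even Angelix, which advertises multi-line repair, probably does not really achieve multi-location repair, so I suspect it also falls into this single-edit category? This is worth looking into. I feel the authors are ahead of the field. This one sentence shows how well they understand APR; I would not have thought of this aspect of APR myself.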

2) How can more complex problems (bugs) be tackled?

Tackling a broader spectrum of bugs will require multiple edits and a larger set of operators, leading to a combinatorial explosion of the search space. Hence: This motivates the need for more efficient search techniques.

The search space explodes combinatorially, so it has to be searched more efficiently.

combinatorial
UK [ˌkɒmbɪnə’tɔ:rɪəl] US [kəmˌbaɪnə’tɒrɪrl]
adj. relating to combinations

3) The authors' idea
We propose to use the test case results of candidate patches to localise suitable fix locations. We analysed the test suite results of single-edit patches, generated from a random walk across 28 bugs in 6 programs.

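3 Curious: is the authors' workload a bit light? 28 bugs in 6 programs? Worth looking into.

4 Surprising: this is the first time I have seen authors whose work did not meet expectations still write it up, and even analyse why. That is rare, and worth a read.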

Based on the findings of this analysis, we propose a number of mutation-based fault localisation techniques, which we subsequently evaluate by measuring how accurately they locate the statements at which the search was able to generate a solution.

After demonstrating that these techniques fail to result in a significant improvement,
we discuss why this may be the case, despite the successes of mutation-based fault localisation in previous studies.

5 What does the introduction say?

1) The cost of fixing software bugs
The worldwide cost of debugging and repairing software bugs is estimated to be
$312 billion per year; on average, programmers spend roughly 50% of their time
finding and fixing bugs [1].

[1] Cambridge University Study States Software Bugs Cost Economy $312 Billion Per Year. http://www.prweb.com/releases/2013/1/prweb10298185.htm, Accessed April, 2017

I really admire these people; they always manage to find the latest references.

2) On SBFL and G&V repair

Research in automated program repair (APR) seeks to tackle this problem. Generate-and-validate (G&V) is one approach to APR, also known as search-based program repair, which uses meta-heuristics (such as random search [18] or genetic programming [2,8]) to discover patches that lead a program to pass a given set of test cases. At a high level, G&V begins with fault localisation, followed by continual processes of generation and validation. Fault localisation is typically performed using spectra-based fault localisation techniques (SBFL) [25]. SBFL assigns suspiciousness values to statements in the program, based on their dynamic association with the failing tests. Patches are generated by selecting statements according to their suspiciousness, and sampling edits at those statements from the repair space. This repair space is defined by a set of transformation schemas, describing transformation shapes (e.g., insert statement, tighten if condition, replace call argument), and transformation ingredients, supplying the parameters necessary to complete shapes (e.g. a particular statement). Candidate patches are evaluated for correctness by running the patched program on the original test suite; repair is indicated by passing all of the tests.
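To make the SBFL step concrete for myself, here is a minimal sketch. It uses the Ochiai formula purely as one well-known example of a suspiciousness metric; the quoted passage does not say which metric is used. The coverage data, test names, statement ids and the `ochiai` helper are all made up for illustration.

```python
import math

def ochiai(coverage, outcomes):
    """Score each statement from test coverage and pass/fail outcomes.

    coverage: test name -> set of statement ids executed by that test
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    statements = set().union(*coverage.values())
    scores = {}
    for stmt in statements:
        # ef / ep: failing / passing tests that execute this statement
        ef = sum(1 for t, cov in coverage.items() if stmt in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if stmt in cov and outcomes[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[stmt] = ef / denom if denom else 0.0
    return scores

# Hypothetical data: one failing test (t1) and two passing tests.
coverage = {"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {1}}
outcomes = {"t1": False, "t2": True, "t3": True}

scores = ochiai(coverage, outcomes)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # statement 3 is executed only by the failing test, so it comes first
```

The more a statement is executed by failing tests and the less by passing ones, the higher its score; a G&V tool would then prefer to edit the high-scoring statements.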

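This paragraph is well written and worth learning from. It also shows how the authors understand automated repair.

3) Which search strategies are currently used in search-based repair?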
Different G&V approaches vary in their mutation operators and traversal
techniques. For example, GenProg [8] constructs patches that may append, replace or delete statements within the program, reusing existing statements within
the program as fix ingredients. Other transformation schemas have been proposed based on human-produced patches [6] or a value search to reduce the cost
of patch evaluation [10]. Search space traversal schemes employed include genetic
programming [8], random search [18], and a deterministic walk [23].

[6] Kim, D., Nam, J., Song, J., Kim, S.: Automatic Patch Generation Learned from Human-written Patches. In: International Conference on Software Engineering. pp. 802-811. ICSE ’13 (2013)
[10] Long, F., Rinard, M.: Staged program repair with condition synthesis. In: Joint Meeting on Foundations of Software Engineering. pp. 166-178. ESEC/FSE ’15 (2015)

4) Limitations
Despite promising early results, most G&V techniques are currently limited
to generating patches for a relatively small sub-set of single-line bugs [23,18,11].
To repair a wider variety of bugs, techniques will need to use richer, more granular transformation schemas, and to construct multiple-line patches. However,
this produces a combinatorial explosion in the size of the search space. This
motivates a need for methods to prune the exploded search space.
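A quick back-of-the-envelope calculation shows how fast this blows up. The figure of 1000 candidate single edits is a made-up number, and treating a k-edit patch as an unordered choice of k distinct single edits is my own simplifying assumption, not the paper's model.

```python
import math

def count_patches(single_edits, k):
    """Number of ways to pick k distinct single edits (order ignored)."""
    return math.comb(single_edits, k)

SINGLE_EDITS = 1000  # hypothetical size of the single-edit search space

sizes = {k: count_patches(SINGLE_EDITS, k) for k in (1, 2, 3)}
for k, n in sizes.items():
    print(f"{k}-edit patches: {n:,}")
```

Going from one edit to three takes the space from a thousand candidates to well over a hundred million, which is why pruning matters.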

[23] Weimer, W., Fry, Z.P., Forrest, S.: Leveraging program equivalence for adaptive program repair: Models and first results. In: International Conference on Automated Software Engineering. pp. 356-366. ASE ’13 (2013)
[18] Qi, Y., Mao, X., Lei, Y., Dai, Z., Wang, C.: The Strength of Random Search on Automated Program Repair. In: International Conference on Software Engineering. pp. 254-265. ICSE ’14 (2014)
[11] Long, F., Rinard, M.: Automatic patch generation by learning correct code. In: Principles of Programming Languages. pp. 298-312. POPL ’16 (2016)

5) Where the idea comes from, and the authors' own work

Inspired by recent work in mutation testing [16,14], we propose to use candidate test suite evaluations to identify suitable fix locations online. Mutation-based fault localisation show promising results when ranking statements as candidates for human modification; we explicitly evaluate their utility in assigning suspiciousness scores to candidate repair locations, the key concern in localisation for repair. To determine whether the results of candidate patch evaluations may be used to localise the fault, we first perform a mutation analysis on a sample of a particular G&V repair search space across 28 bugs in six real-world C programs. We use the same ground truth as previous studies on fault localisation, assuming the location(s) of the human-written repair or the injected fault to be a suitable fix location [14,16,25]. For the sake of convenience, we refer to these locations as "faulty"; non-modified statements are considered to be "correct".

I feel this really is a classic paper, and I have learned a lot. With a paper like this I do not want to skip a single sentence...

[14] Moon, S., Kim, Y., Kim, M., Yoo, S.: Ask the Mutants: Mutating Faulty Programs for Fault Localization. In: International Conference on Software Testing, Verification and Validation. pp. 153-162. ICST ’14 (2014)
[16] Papadakis, M., Le Traon, Y.: Metallaxis-FL: mutation-based fault localization. Software Testing, Verification and Reliability 25(5-7), 605-628 (2015)

6) The authors' contributions (worth mentioning)

@1 A detailed mutation analysis sampled from GenProg’s search space, covering
28 bugs across six real-world C programs.
@2 An evaluation of several alternative fault localisation techniques which use
the test case outcomes of mutants produced during the search.
@3 An informed discussion of the limitations of GenProg’s statement-level mutation operators in identifying faulty locations, and how these limitations
might be addressed by alternative mutation operators

6 Why are the authors so confident that the principles generalise to most existing techniques in APR? Worth looking into.

We focus this discussion primarily on GenProg’s approach
to fault localisation, for illustration, but the principles generalise to most existing
techniques in APR.

Alternative weighting schemes have been explored since [9],
including those that draw directly on advances in spectrum-based fault localisation [19]. Statements are sampled in proportion to their weight.
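"Sampled in proportion to their weight" can be sketched in a few lines. The statement names and the weights 1.0, 0.1 and 0.0 below are illustrative values I picked, and `sample_statement` is a hypothetical helper, not GenProg's actual code.

```python
import random
from collections import Counter

def sample_statement(weights, rng):
    """Pick one statement with probability proportional to its weight."""
    statements = list(weights)
    return rng.choices(statements, weights=[weights[s] for s in statements], k=1)[0]

# Hypothetical fault localisation output: statement -> weight.
weights = {"s1": 1.0, "s2": 0.1, "s3": 0.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_statement(weights, rng) for _ in range(10_000))
print(counts)
```

A statement with weight 0 is never edited, and a statement with ten times the weight is picked roughly ten times as often, so the actual weight values shape where the search spends its effort.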

[9] Le Goues, C., Weimer, W., Forrest, S.: Representations and Operators for Improving Evolutionary Software Repair. In: Genetic and Evolutionary Computation Conference. pp. 959-966. GECCO ’12 (2012)
[19] Qi, Y., Mao, X., Lei, Y., Wang, C.: Using Automated Program Repair for Evaluating the Effectiveness of Fault Localization Techniques. In: International Symposium on Software Testing and Analysis. pp. 191-201. ISSTA ’13 (2013)

seminal
UK [ˈsemɪnl] US [ˈsɛmənəl]
adj. of seed; of semen
1 (formal) very important and having a strong influence on later developments
a seminal work/article/study
2 [usually before noun] (technical) of or containing semen

Two seminal approaches to MBFL are MUSE [14] and Metallaxis [16].

7 How do MUSE and Metallaxis work?

Both of these approaches share a common intuition: mutants generated at the fault location should exhibit different test suite outcomes to those generated at non-faulty locations. Despite sharing this intuition, each technique generates its suspiciousness values according to a contradictory set of assumptions.

I do not quite follow this yet.

8 I suddenly noticed: this paper's writing style differs from other papers

For example: We include these bugs to determine whether GenProg’s repair operators may be used to perform MBFL, rather than traditional mutation testing operators, used by existing approaches [16,14].

We use GenProg, a search-based program repair technique with well-established
and commonly-used mutation operators, to focus this evaluation.

They like to insert a clause in the middle of a sentence.

The same happens later on:
To encourage further investigation, all results from this study, together with
the files used to conduct it, are available at

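anticipate
UK [ænˈtɪsɪpeɪt] US [ænˈtɪsəˌpet]
vt.
to foresee; to expect; to have a presentiment of; to act ahead of
vi.
to predict; to raise, consider, say or do something prematurely; to foretell (orally or in writing)

9 Experimental hardware setup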

We used a C4.Large instance on Amazon EC2 for the artificial bugs, and a DS1 V2 instance on Microsoft Azure for
the real-world bugs.

10 Source code

Source code and a Docker image for the version of GenProg used by this study is
publicly available at: https://bitbucket.org/ChrisTimperley/gp3

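Nice, they actually use Docker these days; I do think it is much more convenient than a virtual machine.

Apart from the lack of a graphical interface...

11 SBFL is the mainstream fault localisation technique in automated repair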

To date, most automated repair techniques
exclusively use SBFL; it is general and low-cost. SBFL approaches, to which
GenProg’s default fault localisation method belongs, use the test case coverage information for the program to assign suspiciousness values to each of its
locations.

12 This paper actually says Jaccard is not the best localisation technique?

Qi et al. [19] conduct a study of the effectiveness of various SBFL techniques when used with GenProg, finding that the Jaccard suspiciousness metric produced the best fault localisation information, as measured by the number of candidate repairs required to find a solution. In contrast to their study, we find no one approach to fault localisation is dominant.
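For reference, this is the Jaccard metric as it is usually defined in the SBFL literature: the number of failing tests that execute a statement, divided by the number of all failing tests plus the passing tests that execute it. The `jaccard` helper and the example counts are my own illustration.

```python
def jaccard(ef, ep, total_failed):
    """Jaccard suspiciousness of one statement.

    ef: failing tests that execute the statement
    ep: passing tests that execute the statement
    total_failed: all failing tests in the suite
    """
    denom = total_failed + ep
    return ef / denom if denom else 0.0

# Hypothetical statement: run by both failing tests and one passing test.
print(jaccard(ef=2, ep=1, total_failed=2))
```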

13 Impressive: they even make publishing the source code and results sound this meaningful

To encourage further investigation, all results from this study, together with
the files used to conduct it, are available at:
https://bitbucket.org/ChrisTimperley/ssbse-2017-data.

14 I have been reading this paper for an hour now; I think once I understand how MUSE and Metallaxis work, I can wrap up.

Prior to computing suspiciousness values for each statement s, Metallaxis first computes explicit suspiciousness values for each mutant m.

MUSE, on the other hand, computes statement suspiciousness directly, based
on the average passing-to-failing rate p2f and failing-to-passing f2p rate of mutants at that statement. p2f describes the fraction of previously passing tests
that are failed by the mutant; f2p describes the fraction of previously failing
tests that are passed by the mutant. MUSE discards all of neutral mutants (i.e.,
mutants whose test outcomes are the same as the original program), and computes suspiciousness as:
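The formula itself did not make it into my notes. As a sketch that follows the description above, and assuming the form published in the MUSE paper [14] (the average over a statement's mutants of the f2p rate minus alpha times the p2f rate, where alpha is the overall f2p rate divided by the overall p2f rate), it could look like this. The exact notation in this paper may differ, and the mutant data and function name are made up.

```python
from collections import defaultdict

def muse_suspiciousness(mutants, n_passing, n_failing):
    """mutants: list of (statement, p2f_count, f2p_count) tuples.

    p2f_count: previously passing tests that the mutant fails
    f2p_count: previously failing tests that the mutant passes
    """
    # Discard neutral mutants: no test outcome changed.
    live = [m for m in mutants if m[1] > 0 or m[2] > 0]
    total_p2f = sum(p2f for _, p2f, _ in live)
    total_f2p = sum(f2p for _, _, f2p in live)
    # alpha balances the two rates; 0 if no mutant ever broke a passing test.
    alpha = (total_f2p * n_passing) / (total_p2f * n_failing) if total_p2f else 0.0

    per_statement = defaultdict(list)
    for stmt, p2f, f2p in live:
        per_statement[stmt].append(f2p / n_failing - alpha * p2f / n_passing)
    return {stmt: sum(v) / len(v) for stmt, v in per_statement.items()}

# Hypothetical suite: 4 passing tests, 2 failing tests.
mutants = [("s1", 0, 1), ("s1", 1, 1), ("s2", 3, 0), ("s3", 0, 0)]
scores = muse_suspiciousness(mutants, n_passing=4, n_failing=2)
print(scores)
```

Mutants that turn failing tests into passing ones push a statement's score up, while mutants that break passing tests pull it down; a statement whose mutants are all neutral (s3 here) gets no score at all in this sketch.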

These behaviours highlight a contradiction between the techniques’ underlying assumptions: MUSE sees partial solutions as signs of a repair, whilst Metallaxis views them as either irrelevant, or the result of overfitting.

15 Oh, important news: the MUSE paper [ICST 2014] gets criticised, because its evaluation relied on a rank-based metric instead of accounting for random sampling

Both techniques have demonstrated significant improvement over previous fault localisation approaches. However, evaluations have been limited to manually-seeded faults in small-to-medium sized programs, and use metrics that have been shown to be inappropriate for automated program repair [19], where the degree of difference in suspiciousness is more important than rank.
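A toy example helped me see why rank alone is not enough when statements are sampled by weight. Both score tables below are invented; they rank the statements identically but give the faulty one very different chances of being picked.

```python
def ranking(scores):
    """Statements ordered from most to least suspicious."""
    return sorted(scores, key=scores.get, reverse=True)

def sampling_probabilities(scores):
    """Chance of picking each statement when sampling in proportion to score."""
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()}

# Two hypothetical localisation results with the same ranking.
sharp = {"faulty": 0.90, "a": 0.10, "b": 0.05}
flat = {"faulty": 0.40, "a": 0.39, "b": 0.38}

p_sharp = sampling_probabilities(sharp)["faulty"]
p_flat = sampling_probabilities(flat)["faulty"]
print(ranking(sharp) == ranking(flat), p_sharp, p_flat)
```

A rank-based metric would call these two results equally good, yet a repair search driven by the first spends most of its candidates on the faulty statement while the second spreads them almost evenly.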

This paper is excellent. Genuinely useful.
This point is worth keeping an eye on.