Title: A New Dataset for Automatic Program Repair — Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies
Preface
This post introduces Bears, a new dataset of buggy Java programs for automatic program repair research.
Basic Information
Madeiral, F., Urli, S., Maia, M., & Monperrus, M. (2019, February). Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 468-478). IEEE.
The fourth author is a familiar name: Martin Monperrus.
Before Reading (Thoughts on Method)
Reading this paper in full, start to finish, would clearly take at least half an hour to an hour.
Facing the vast sea of automatic program repair literature, one naturally has to separate the essential from the incidental and build one's own framework.
In my view, whether I have truly digested a paper
does not depend on whether I read it word by word,
but rather on:
1) whether I grasp its central idea;
2) where its idea came from;
3) why the paper was accepted (its highlights);
4) its shortcomings.
Coming at a paper from these angles seems to call more for thinking than for plain reading.
Future Reading Plan
This paper already has 6 citations; progress on datasets is clearly speeding up, and the bar keeps rising.
For a long time there was only Defects4J, which dominated the academic market. Now everyone has realized that patch generation is no longer so easy to pull off,
while building a dataset is comparatively easier, and fewer people are doing it.
Hence this burst of new datasets.
Next, from the list of papers citing this one (and cited by it), I picked out a few that I want to read closely:
- Dmeiri, N., Tomassi, D. A., Wang, Y., Bhowmick, A., Liu, Y. C., Devanbu, P., … & Rubio-González, C. (2019). BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. arXiv preprint arXiv:1903.06725.
- Majd, A., Vahidi-Asl, M., Khalilian, A., Baraani-Dastjerdi, A., & Zamani, B. (2019). Code4Bench: A Multidimensional Benchmark of Codeforces Data for Different Program Analysis Techniques. Journal of Computer Languages.
- Durieux, T., Madeiral, F., Martinez, M., & Abreu, R. (2019). Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts. arXiv preprint arXiv:1905.11973.
- Durieux, T., & Abreu, R. (2019). Critical Review of BugSwarm for Fault Localization and Program Repair. arXiv preprint arXiv:1905.09375.
- Le Goues, C., Holtschulte, N., Smith, E. K., et al. (2015). The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Transactions on Software Engineering, 41(12), 1236-1256.
- Saha, R., Lyu, Y., Lam, W., Yoshida, H., & Prasad, M. (2018, May). Bugs.jar: a large-scale, diverse dataset of real-world Java bugs. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) (pp. 10-13). IEEE.
- Benton, S., Ghanbari, A., & Zhang, L. (2019). Defexts: A Curated Dataset of Reproducible Real-World Bugs for Modern JVM Languages. In Proceedings of International Conference on Software Engineering (ICSE’19). To appear.
- Ponta, S. E., Plate, H., Sabetta, A., Bezzi, M., & Dangremont, C. (2019). A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software. arXiv preprint arXiv:1902.02595.
- Tan, S. H., Yi, J., Mechtaev, S., & Roychoudhury, A. (2017, May). Codeflaws: a programming competition benchmark for evaluating automated program repair tools. In Proceedings of the 39th International Conference on Software Engineering Companion (pp. 180-182). IEEE Press.
- Lin, D., Koppel, J., Chen, A., & Solar-Lezama, A. (2017, October). QuixBugs: a multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity (pp. 55-56). ACM.
- Sobreira, V., Durieux, T., Madeiral, F., Monperrus, M., & de Almeida Maia, M. (2018, March). Dissection of a bug dataset: Anatomy of 395 patches from Defects4J. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 130-140). IEEE.
To skim:
- Gyimesi, P., Vancsics, B., Stocco, A., Mazinanian, D., Beszédes, A., Ferenc, R., & Mesbah, A. (2019). BUGSJS: A Benchmark of JavaScript Bugs. In Proceedings of the 12th International Conference on Software Testing, Verification, and Validation, ICST. To appear.
- Lu, S., Li, Z., Qin, F., Tan, L., Zhou, P., & Zhou, Y. (2005, June). Bugbench: Benchmarks for evaluating bug detection tools. In Workshop on the evaluation of software defect detection tools (Vol. 5).
- Dallmeier, V., & Zimmermann, T. (2007, November). Extraction of bug localization benchmarks from history. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (pp. 433-436). ACM.
1 Core Content
Bears uses CI (continuous integration), which has become very popular on GitHub in recent years, together with the Maven build tool, to reproduce builds from GitHub and collect pairs of buggy and patched programs.
Bears as a whole consists of:
- BEARS-COLLECTOR: automatically identifies buggy and patched programs from CI builds (collects and stores the programs); a minimal sketch of this idea follows the list;
- BEARS-BENCHMARK: the benchmark proper.
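To make the collector's job concrete, here is a minimal Java sketch of the core idea, not the actual BEARS-COLLECTOR code: the types (Build, BuildStatus, BugPatchCandidate) and the pairing heuristic are my own simplifications of what the paper describes.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (NOT the real BEARS-COLLECTOR): pair a failing CI build
// (buggy version candidate) with the next passing build on the same branch
// (patched version candidate). All types here are hypothetical.
public class CollectorSketch {

    enum BuildStatus { PASSED, FAILED }

    record Build(String commitSha, String branch, BuildStatus status) {}

    record BugPatchCandidate(Build buggy, Build patched) {}

    static List<BugPatchCandidate> findCandidates(List<Build> buildsOldestFirst) {
        List<BugPatchCandidate> candidates = new ArrayList<>();
        for (int i = 0; i + 1 < buildsOldestFirst.size(); i++) {
            Build current = buildsOldestFirst.get(i);
            Build next = buildsOldestFirst.get(i + 1);
            // A failing build followed by a passing build on the same branch
            // suggests "bug exposed by tests, then fixed by a later commit".
            if (current.status() == BuildStatus.FAILED
                    && next.status() == BuildStatus.PASSED
                    && current.branch().equals(next.branch())) {
                candidates.add(new BugPatchCandidate(current, next));
            }
        }
        return candidates;
    }
}
```

Each candidate pair would then still have to be checked out and actually reproduced with Maven before being stored, which is where much of the workload comes from.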
Quite impressive; the workload is considerable.
2 Where the Idea Came From
1) The authors have substantial experience with CI; for example, Nopol was integrated via CI (compilation and testing are fully automated, which is very convenient; I introduced CI, this automatic integration tool, in an earlier post);
2) Mainly, they identified the problem with Defects4J: the repair community cannot keep overfitting to those 395 bugs forever, so new datasets are bound to appear. But what should a new dataset's selling point be? (It has to offer something Defects4J does not.) So the authors lead with extensibility; that is the one core idea, and they implemented it.
I feel this idea could cut either way; I can't quite tell how strong it is.
The paper also proposes a pipeline, built on top of CI:
The uniqueness of Bears is the usage of CI (builds) to identify buggy and patched program version candidates.
This should also count as a highlight (a sketch of what this pipeline implies in practice follows this list).
After all, it made it into the abstract.
3)
Durieux et al. [4] also pointed out that creating a benchmark of bugs is challenging. They reported that it is difficult to reproduce failures, and it can take up to one day to find and reproduce a single null pointer exception bug
So this point actually grew out of prior work's future work:
building a dataset (because building one is such a hassle).
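As promised above, here is a hypothetical sketch (my own simplification, not Bears's actual code) of the reproduction gate that a CI-based pipeline like this needs: a buggy/patched candidate pair survives only if `mvn test` fails on the buggy checkout and passes on the patched one.

```java
import java.io.File;
import java.io.IOException;

// Sketch of the reproduction check implied by a CI-based pipeline; the
// checkout directories are assumed to exist already, and error handling
// is deliberately omitted.
public class ReproductionSketch {

    static boolean mvnTestPasses(File projectDir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("mvn", "test")
                .directory(projectDir)
                .inheritIO()
                .start();
        return p.waitFor() == 0; // Maven exits non-zero when tests fail
    }

    static boolean isReproducible(File buggyCheckout, File patchedCheckout)
            throws IOException, InterruptedException {
        // The bug is reproducible only if the buggy version fails the test
        // suite AND the patched version passes it.
        return !mvnTestPasses(buggyCheckout) && mvnTestPasses(patchedCheckout);
    }
}
```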
3 Highlights of the Paper
1) It uses CI;
2) It is extensible, and easy to extend;
3) It delivers a dataset. I hadn't realized building a dataset was this hard:
They reported that it is difficult to reproduce failures, and it can take up to one day to find and reproduce a single null pointer exception bug. …
4) Many details had to be handled, which also counts as a highlight (they ran into and solidly solved real difficulties during the experiments).
4 Shortcomings of the Paper
1) It is limited to Maven projects; Gradle and Ant projects are not considered (see the sketch at the end of this section).
(Granted, covering those would add an enormous amount of work.)
Sure enough, at the end of the paper, the authors themselves acknowledge this limitation.
For more discussion of the shortcomings, see my Evernote notes.
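To illustrate why this restriction is not trivial to lift, here is a hypothetical sketch (not from the paper): a collector supporting more build tools would first have to detect the build system and then drive a different test command per tool — every branch below except Maven is unsupported territory for Bears.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the Maven-only limitation: each build
// system has its own marker file and its own test command, so supporting
// Gradle or Ant means more than just swapping one string.
public class BuildSystemSketch {
    static String detectBuildSystem(Path projectRoot) {
        if (Files.exists(projectRoot.resolve("pom.xml"))) return "maven";        // supported by Bears
        if (Files.exists(projectRoot.resolve("build.gradle"))) return "gradle";  // not supported
        if (Files.exists(projectRoot.resolve("build.xml"))) return "ant";        // not supported
        return "unknown";
    }
}
```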
Summary
A pretty serious read-through, I'd say = =
Current time: June 17, 2019, 10:56:01.
Almost 11 already. Time really flies!