The Patch Overfitting Problem in Automated Program Repair: Practical Magnitude and a Baseline for Realistic Benchmarking
Created by W.Langdon from
gp-bibliography.bib Revision:1.8120
@InProceedings{Petke:2024:FSEIVR,
author = "Justyna Petke and Matias Martinez and
Maria Kechagia and Aldeida Aleti and Federica Sarro",
title = "The Patch Overfitting Problem in Automated Program
Repair: Practical Magnitude and a Baseline for
Realistic Benchmarking",
booktitle = "FSE, Ideas, Visions and Reflections",
year = "2024",
address = "Porto de Galinhas, Brazil",
month = "15-19 " # jul,
keywords = "genetic algorithms, genetic programming, genetic
improvement, APR, Overfitting, Automated Program
Repair, Patch Assessment",
URL = "https://2024.esec-fse.org/details/fse-2024-ideas--visions-and-reflections/1/The-Patch-Overfitting-Problem-in-Automated-Program-Repair-Practical-Magnitude-and-a-",
URL = "https://2024.esec-fse.org/profile/matiasmartinez2#",
URL = "http://www.cs.ucl.ac.uk/staff/J.Petke/papers/Petke_2024_FSEIVR.pdf",
size = "5 pages",
abstract = "Automated program repair techniques aim to generate
plausible patches for software bugs, mainly relying on
testing to check their validity. It is possible that
such a plausible patch may not be correct, as there may
still exist test inputs under which the bug prevails.
The generation of a large number of plausible yet
incorrect patches is widely believed to hinder wider
application of APR in practice, which has motivated
research in automated patch assessment techniques.
Herein, we reflect on the validity of this motivation
and carry out an empirical study to analyse the extent
to which 10 program repair tools suffer from the
overfitting problem in practice. We observe that the
number of plausible patches generated by any of the
program repair tools analysed for a given bug from the
Defects4J dataset is remarkably low, a median of 2,
indicating that a developer only needs to consider 2
patches in most cases to be confident to find a fix or
confirming its nonexistence. This study unveils that
the overfitting problem might not be as bad as
previously thought. We also reflect on current
evaluation strategies of patch assessment techniques
and propose a Random Selection baseline to assess
whether and when using automated patch overfitting
assessment is beneficial for reducing human effort in
patch assessment. We thus advocate future work
proposing/using automated patch overfitting assessment
should evaluate the benefit arising from its usage
against the random baseline.",
}
Genetic Programming entries for
Justyna Petke
Matias Sebastian Martinez
Maria Kechagia
Aldeida Aleti
Federica Sarro