ABSTRACT
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but score low or even zero on unit tests due to small imperfections, such as a wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase that addresses the failed tests. To apply this approach effectively to instruction-driven LLMs, one needs to determine which prompts perform best as instructions, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
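The SED loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `synthesize` and `repair` are hypothetical stand-ins for LLM calls (the toy `synthesize` deliberately produces a "near miss" with a wrong output format), and the hybrid strategy shown (repair first, regenerate if the repair still fails) is one of the strategies the abstract names.

```python
# Minimal sketch of a Synthesize-Execute-Debug (SED) loop.
# `synthesize` and `repair` are hypothetical stand-ins for LLM calls.

def synthesize(description):
    # Stand-in for an LLM draft: a "near miss" returning a string, not an int.
    return lambda a, b: str(a + b)

def repair(program, failure):
    # Stand-in for an LLM repair call prompted with the failing test case.
    return lambda a, b: a + b

def run_tests(program, tests):
    """Execute phase: return the first failing (args, expected) pair, or None."""
    for args, expected in tests:
        try:
            if program(*args) != expected:
                return (args, expected)
        except Exception:
            return (args, expected)
    return None

def sed(description, tests, max_iterations=3):
    program = synthesize(description)                  # Synthesize
    for _ in range(max_iterations):
        failure = run_tests(program, tests)            # Execute
        if failure is None:
            return program                             # all tests pass
        candidate = repair(program, failure)           # Debug: try a repair
        if run_tests(candidate, tests) is None:
            program = candidate                        # repair succeeded
        else:
            program = synthesize(description)          # fall back to replace
    return program
```

The `max_iterations` budget and the repair-then-replace ordering are the tunable knobs; shifting how often the loop repairs versus regenerates yields the replace-focused, repair-focused, and hybrid strategies compared in the paper.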
Fully Autonomous Programming with Large Language Models