ABSTRACT
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but score low or even zero on unit tests due to small imperfections, such as a wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase that addresses the failed tests. To apply this approach effectively to instruction-driven LLMs, one needs to determine which prompts perform best as instructions, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
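The SED loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `synthesize` and `repair` are hypothetical stand-ins for LLM calls (the toy `synthesize` deliberately produces a "near miss" with a wrong output format), and the hybrid strategy shown (repair first, regenerate if the repair still fails) is one of the strategies the abstract names.

```python
# Minimal sketch of a Synthesize-Execute-Debug (SED) loop.
# `synthesize` and `repair` are hypothetical stand-ins for LLM calls.

def synthesize(description):
    # Stand-in for an LLM draft: a "near miss" returning a string, not an int.
    return lambda a, b: str(a + b)

def repair(program, failure):
    # Stand-in for an LLM repair call prompted with the failing test case.
    return lambda a, b: a + b

def run_tests(program, tests):
    """Execute phase: return the first failing (args, expected) pair, or None."""
    for args, expected in tests:
        try:
            if program(*args) != expected:
                return (args, expected)
        except Exception:
            return (args, expected)
    return None

def sed(description, tests, max_iterations=3):
    program = synthesize(description)                  # Synthesize
    for _ in range(max_iterations):
        failure = run_tests(program, tests)            # Execute
        if failure is None:
            return program                             # all tests pass
        candidate = repair(program, failure)           # Debug: try a repair
        if run_tests(candidate, tests) is None:
            program = candidate                        # repair succeeded
        else:
            program = synthesize(description)          # fall back to replace
    return program
```

The `max_iterations` budget and the repair-then-replace ordering are the tunable knobs; shifting how often the loop repairs versus regenerates yields the replace-focused, repair-focused, and hybrid strategies compared in the paper.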
Fully Autonomous Programming with Large Language Models