DOI: 10.1145/3583131.3590481
research-article
Open Access

Fully Autonomous Programming with Large Language Models

Published: 12 July 2023

ABSTRACT

Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
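
The Synthesize, Execute, Debug loop and the repair-versus-replace trade-off described above can be illustrated with a minimal sketch. It is not the authors' implementation: `synthesize` and `repair` are hypothetical callables standing in for LLM prompts, the convention that a candidate program defines a function named `solve` is an assumption, and the `max_drafts` and `repairs_per_draft` parameters are invented here to show one simple way of balancing repair against replacement.

```python
from typing import Callable, List, Optional, Tuple

Test = Tuple[tuple, object]  # (arguments, expected result)


def failed_tests(program: str, tests: List[Test]) -> List[Test]:
    """Execute `program` (assumed to define `solve`) and return the tests it fails."""
    namespace: dict = {}
    try:
        exec(program, namespace)  # unsandboxed; acceptable only for illustration
        solve = namespace["solve"]
    except Exception:
        return list(tests)  # a program that does not load fails every test
    failures = []
    for args, expected in tests:
        try:
            ok = solve(*args) == expected
        except Exception:
            ok = False
        if not ok:
            failures.append((args, expected))
    return failures


def sed(
    synthesize: Callable[[], str],
    repair: Callable[[str, List[Test]], str],
    tests: List[Test],
    max_drafts: int = 3,         # hypothetical budget: how many fresh drafts to try
    repairs_per_draft: int = 2,  # hypothetical budget: repairs before replacing a draft
) -> Optional[str]:
    """Return the first program that passes all tests, or None if the budget runs out."""
    for _ in range(max_drafts):
        program = synthesize()                    # Synthesize
        for attempt in range(repairs_per_draft + 1):
            failures = failed_tests(program, tests)  # Execute
            if not failures:
                return program
            if attempt < repairs_per_draft:
                program = repair(program, failures)  # Debug
    return None


# Stub "models": the draft is a near miss (wrong output type), the repair fixes it.
def draft() -> str:
    return "def solve(x):\n    return str(x * 2)\n"


def fix(program: str, failures: List[Test]) -> str:
    return program.replace("str(x * 2)", "x * 2")


tests = [((1,), 2), ((3,), 6)]
result = sed(draft, fix, tests)
```

Setting `repairs_per_draft` to 0 makes the loop replace-only, while a large value with `max_drafts=1` makes it repair-only; hybrid settings fall in between.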

Published in

GECCO '23: Proceedings of the Genetic and Evolutionary Computation Conference
July 2023
1667 pages
ISBN: 9798400701191
DOI: 10.1145/3583131

Copyright © 2023 Owner/Author(s)

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%
