Abstract
For the past seven years, researchers in genetic programming and other program synthesis disciplines have used the General Program Synthesis Benchmark Suite (PSB1) to benchmark many aspects of systems that conduct programming by example, where the specifications of the desired program are given as input/output pairs. PSB1 has been used to make notable progress toward the goal of general program synthesis: automatically creating the types of software that human programmers code. Many of the systems that have attempted the problems in PSB1 have used it to demonstrate performance improvements granted through new techniques. Over time, the suite has gradually become outdated, hindering the accurate measurement of further improvements. The field needs a new set of more difficult benchmark problems to move beyond what was previously possible and ensure that systems do not overfit to one benchmark suite. In this paper, we describe the 25 new general program synthesis benchmark problems that make up PSB2, a new benchmark suite. These problems are curated from a variety of sources, including programming katas and college courses. We selected these problems to be more difficult than those in the original suite, and give results using PushGP showing this increase in difficulty. We additionally give an example of benchmarking using a state-of-the-art parent selection method, showing improved performance on PSB2 while still leaving plenty of room for improvement. These new problems will help guide program synthesis research for years to come.
Notes
Program synthesis is also known as automatic programming or software synthesis.
Reference implementation, datasets, and other resources can be found on this paper’s companion website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html.
Acknowledgements
The authors would like to thank Lee Spector, Grace Woolson, and Amr Abdelhady for discussions that helped shape this work.
Appendix A Push Instructions
Our experiments using PushGP use instruction sets based on the data types relevant to the problem, as discussed in Section 8. Tables 7 and 8 contain the Push instructions for each of the types referenced in Tables 3 and 4. Experiments using PSB2 do not need to match this instruction set exactly, and indeed for non-Push program synthesis systems, there may be wildly different instruction sets. However, we include the exhaustive list of instructions that we used in order to give the full details of the system and allow comparisons with instruction sets in other systems.
Cite this article
Helmuth, T., Kelly, P. Applying genetic programming to PSB2: the next generation program synthesis benchmark suite. Genet Program Evolvable Mach 23, 375–404 (2022). https://doi.org/10.1007/s10710-022-09434-y
DOI: https://doi.org/10.1007/s10710-022-09434-y