Abstract
In some situations, the interpretability of a machine learning model can be as important as its accuracy. Interpretability stems from the need to trust a prediction model, verify some of its properties, or even enforce them to improve fairness. Many model-agnostic explanatory methods exist to provide explanations for black-box models. For the regression task, practitioners can instead use white-box or gray-box models to achieve more interpretable results, which is the case of symbolic regression. Since interpretability lacks a rigorous definition, there is a need to evaluate and compare the quality of different explainers. This paper proposes a benchmark scheme to evaluate explanatory methods for regression models, with a focus on symbolic regression. Experiments were performed using 100 physics equations with different interpretable and non-interpretable regression methods and popular explanatory methods, evaluating the explainers with several explanation quality measures. In addition, we analyzed four benchmarks from the GP community. The results show that symbolic regression models can be an interesting alternative to white-box and black-box models, capable of returning accurate models with appropriate explanations. Regarding the explainers, Partial Effects and SHAP were the most robust explanatory methods, with Integrated Gradients being unstable only with tree-based models. The benchmark is publicly available for further experiments.
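The core idea of the benchmark can be illustrated with a minimal sketch: given a ground-truth equation and a fitted model, compute a local feature-importance explanation for each and score their agreement. The sketch below is purely illustrative, not the paper's implementation or the iirsBenchmark API; it approximates Partial Effects by finite differences (the paper computes them analytically on the symbolic expression), and the `truth`/`model` functions and the cosine-similarity quality measure are hypothetical stand-ins.

```python
import numpy as np

def partial_effects(f, x, eps=1e-6):
    """Local attribution via central finite differences: the partial
    derivative of the model output w.r.t. each feature, evaluated at x."""
    x = np.asarray(x, dtype=float)
    pe = np.zeros_like(x)
    for j in range(len(x)):
        hi, lo = x.copy(), x.copy()
        hi[j] += eps
        lo[j] -= eps
        pe[j] = (f(hi) - f(lo)) / (2 * eps)
    return pe

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "physics" ground truth: kinetic energy E = m * v**2 / 2,
# and a surrogate standing in for a fitted regression model.
truth = lambda x: 0.5 * x[0] * x[1] ** 2
model = lambda x: 0.5 * x[0] * x[1] ** 2 + 0.01 * x[0]

point = np.array([2.0, 3.0])                 # m = 2, v = 3
expl_truth = partial_effects(truth, point)   # analytic gradient: [4.5, 6.0]
expl_model = partial_effects(model, point)

# One possible quality measure: agreement with the ground-truth gradient.
quality = cosine_similarity(expl_truth, expl_model)
print(round(quality, 4))
```

Repeating this over many sample points, equations, regressors, and explainers, and aggregating the scores, gives a benchmark of explanation quality in the spirit described above.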
Notes
Open source module available at https://github.com/gAldeia/iirsBenchmark.
Also called Marginal Effects in the literature, but this term can be misleading, as mentioned in [80].
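The distinction matters because, for a nonlinear symbolic model, the partial effect of a feature is itself a function of the inputs, so averaging the local derivatives over the sample and evaluating the derivative at the sample means give different summaries. A small illustrative sketch, with a hypothetical model chosen only to make the two quantities differ:

```python
import numpy as np

# Hypothetical symbolic model f(x0, x1) = x0 + x1**3 / 3,
# whose partial effect w.r.t. x1 is df/dx1 = x1**2.
def d_f_d_x1(x1):
    return x1 ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))

# Average Partial Effect: mean of the local derivatives over the sample,
# estimating E[x1^2] = 1/3 for x1 ~ U(0, 1).
ape = d_f_d_x1(X[:, 1]).mean()

# Partial Effect at the means: derivative at the mean input,
# estimating (E[x1])^2 = 1/4 -- a different number.
pe_at_means = d_f_d_x1(X[:, 1].mean())

print(round(ape, 3), round(pe_at_means, 3))
```

For a linear model the two coincide; the gap between them grows with the curvature of the expression, which is why reporting which summary is used matters when comparing explainers.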
References
M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the European Court of Human Rights. Artif. Intell. Law 28(2), 237–266 (2020). https://doi.org/10.1007/s10506-019-09255-y
G. Winter, Machine learning in healthcare: a review. Br. J. Health Care Manag. 25(2), 100–101 (2019). https://doi.org/10.12968/bjhc.2019.25.2.100
R. Roscher, B. Bohn, M.F. Duarte, J. Garcke, Explainable machine learning for scientific insights and discoveries. IEEE Access 8, 42200–42216 (2020). https://doi.org/10.1109/ACCESS.2020.2976199. arXiv:1905.08883
C. Modarres, M. Ibrahim, M. Louie, J. Paisley, Towards explainable deep learning for credit lending: a case study, 1–8 (2018) arXiv:1811.06471
S. Yoo, X. Xie, F.-C. Kuo, T.Y. Chen, M. Harman, Human competitiveness of genetic programming in spectrum-based fault localisation: theoretical and empirical analysis. ACM Trans. Softw. Eng. Methodol. (2017). https://doi.org/10.1145/3078840
M.A. Lones, J.E. Alty, J. Cosgrove, P. Duggan-Carter, S. Jamieson, R.F. Naylor, A.J. Turner, S.L. Smith, A new evolutionary algorithm-based home monitoring device for Parkinson's dyskinesia. J. Med. Syst. (2017). https://doi.org/10.1007/s10916-017-0811-7
D. Lynch, M. Fenton, D. Fagan, S. Kucera, H. Claussen, M. O’Neill, Automated self-optimization in heterogeneous wireless communications networks. IEEE/ACM Trans. Netw. 27(1), 419–432 (2019). https://doi.org/10.1109/TNET.2018.2890547
D. Izzo, L.F. Simões, M. Märtens, G.C.H.E. de Croon, A. Heritier, C.H. Yam, Search for a grand tour of the Jupiter Galilean moons. In: proceedings of the 15th annual conference on genetic and evolutionary computation. GECCO '13, pp. 1301–1308. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2463372.2463524
Y. Semet, B. Berthelot, T. Glais, C. Isbérie, A. Varest, Expert competitive traffic light optimization with evolutionary algorithms. In: VEHITS, pp. 199–210 (2019)
M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15(90), 3133–3181 (2014)
R. Guidotti, A. Monreale, F. Giannotti, D. Pedreschi, S. Ruggieri, F. Turini, Factual and counterfactual explanations for black box decision making. IEEE Intell. Syst. 34(6), 14–23 (2019). https://doi.org/10.1109/MIS.2019.2957223
S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions. In: proceedings of the 31st international conference on neural information processing systems. NIPS’17, pp. 4768–4777. Curran Associates Inc., Red Hook, NY, USA (2017)
A. Adadi, M. Berrada, Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018). https://doi.org/10.1109/ACCESS.2018.2870052
G.S.I. Aldeia, F.O. de França, Measuring feature importance of symbolic regression models using partial effects. In: proceedings of the genetic and evolutionary computation conference. GECCO ’21. ACM, New York, NY, USA (2021). https://doi.org/10.1145/3449639.3459302
P. Orzechowski, W.L. Cava, J.H. Moore, Where are we now?: A large benchmark study of recent symbolic regression methods. In: proceedings of the genetic and evolutionary computation conference. GECCO ’18, pp. 1183–1190. ACM, New York, NY, USA (2018). https://doi.org/10.1145/3205455.3205539
W.L. Cava, P. Orzechowski, B. Burlacu, F.O. de França, M. Virgolin, Y. Jin, M. Kommenda, J.H. Moore, Contemporary symbolic regression methods and their relative performance. In: thirty-fifth conference on neural information processing systems datasets and benchmarks track (Round 1) (2021). https://openreview.net/forum?id=xVQMrDLyGst
G. Kronberger, F.O. de França, B. Burlacu, C. Haider, M. Kommenda, Shape-constrained symbolic regression: improving extrapolation with prior knowledge. Evol. Comput. 1–24
M. Affenzeller, S.M. Winkler, G. Kronberger, M. Kommenda, B. Burlacu, S. Wagner, Gaining deeper insights in symbolic regression, in Genetic Programming Theory and Practice XI. ed. by R. Riolo, J.H. Moore, M. Kotanchek (Springer, New York, NY, 2014), pp. 175–190. https://doi.org/10.1007/978-1-4939-0375-7_10
F.O. de França, A greedy search tree heuristic for symbolic regression. Inf. Sci. 442–443, 18–32 (2018). https://doi.org/10.1016/j.ins.2018.02.040
L.A. Ferreira, F.G. Guimaraes, R. Silva, Applying genetic programming to improve interpretability in machine learning models. In: 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, New York (2020). https://doi.org/10.1109/cec48606.2020.9185620
F.O. de França, G.S.I. Aldeia, Interaction-transformation evolutionary algorithm for symbolic regression. Evolut. Comput. 29(3), 367–390 (2021). https://doi.org/10.1162/evco_a_00285
F.O. de França, M.Z. de Lima, Interaction-transformation symbolic regression with extreme learning machine. Neurocomputing 423, 609–619 (2021)
D. Kantor, F.J. Von Zuben, F.O. de França, Simulated annealing for symbolic regression. In: proceedings of the genetic and evolutionary computation conference, pp. 592–599 (2021)
R.M. Filho, A. Lacerda, G.L. Pappa, Explaining symbolic regression predictions. In: 2020 IEEE Congress on Evolutionary Computation (CEC). IEEE, New York (2020). https://doi.org/10.1109/cec48606.2020.9185683
R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models. ACM Comput. Surv. 51(5), 1–45 (2018). https://doi.org/10.1145/3236009. arXiv:1802.01933
L. Ljung, Perspectives on system identification. Ann. Rev. Control 34(1), 1–12 (2010). https://doi.org/10.1016/j.arcontrol.2009.12.001
R. Marcinkevičs, J.E. Vogt, Interpretability and Explainability: A Machine Learning Zoo Mini-tour, 1–24 (2020) arXiv:2012.01805
C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), 206–215 (2019) arXiv:1811.10154. https://doi.org/10.1038/s42256-019-0048-x
Z.F. Wu, J. Li, M.Y. Cai, Y. Lin, W.J. Zhang, On membership of black-box or white-box of artificial neural network models. In: 2016 IEEE 11th conference on industrial electronics and applications (ICIEA), pp. 1400–1404 (2016). https://doi.org/10.1109/ICIEA.2016.7603804
O. Loyola-González, Black-box vs. white-box: understanding their advantages and weaknesses from a practical point of view. IEEE Access 7, 154096–154113 (2019). https://doi.org/10.1109/ACCESS.2019.2949286
J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine bias: there's software used across the country to predict future criminals. And it's biased against blacks. ProPublica (2016)
A. Datta, M.C. Tschantz, A. Datta, Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. CoRR abs/1408.6491 (2014) arXiv:1408.6491
Z.C. Lipton, The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (2018). https://doi.org/10.1145/3236386.3241340
D.V. Carvalho, E.M. Pereira, J.S. Cardoso, Machine learning interpretability: a survey on methods and metrics. Electronics 8(8), 832 (2019). https://doi.org/10.3390/electronics8080832
A.B. Arrieta, N. Díaz-Rodríguez, J.D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, 82–115 (2020). https://doi.org/10.1016/j.inffus.2019.12.012
F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017)
L.H. Gilpin, D. Bau, B.Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th international conference on data science and advanced analytics (DSAA), pp. 80–89 (2018). https://doi.org/10.1109/DSAA.2018.00018. arXiv:1806.00069
M. Sendak, M.C. Elish, M. Gao, J. Futoma, W. Ratliff, M. Nichols, A. Bedoya, S. Balu, C. O’Brien, The human body is a black box. In: proceedings of the 2020 conference on fairness, accountability, and transparency. ACM, New York, NY, USA (2020). https://doi.org/10.1145/3351095.3372827
M. Ghassemi, L. Oakden-Rayner, A.L. Beam, The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3(11), 745–750 (2021). https://doi.org/10.1016/s2589-7500(21)00208-9
I. Banerjee, A.R. Bhimireddy, J.L. Burns, L.A. Celi, L.-C. Chen, R. Correa, N. Dullerud, M. Ghassemi, S.-C. Huang, P.-C. Kuo, M.P. Lungren, L. Palmer, B.J. Price, S. Purkayastha, A. Pyrros, L. Oakden-Rayner, C. Okechukwu, L. Seyyed-Kalantari, H. Trivedi, R. Wang, Z. Zaiman, H. Zhang, J.W. Gichoya, Reading Race: AI Recognises Patient’s Racial Identity In Medical Images (2021)
M. Yang, B. Kim, Benchmarking Attribution Methods with Relative Feature Importance (2019)
O.-M. Camburu, E. Giunchiglia, J. Foerster, T. Lukasiewicz, P. Blunsom, The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets (2020)
T. Laugel, M.-J. Lesot, C. Marsala, X. Renard, M. Detyniecki, The dangers of post-hoc interpretability: unjustified counterfactual explanations. In: proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI-19, pp. 2801–2807. International joint conferences on artificial intelligence organization, California, USA (2019). https://doi.org/10.24963/ijcai.2019/388
O. Camburu, E. Giunchiglia, J. Foerster, T. Lukasiewicz, P. Blunsom, Can I trust the explainer? verifying post-hoc explanatory methods. CoRR abs/1910.02065 (2019) arXiv:1910.02065
D. Alvarez-Melis, T.S. Jaakkola, On the robustness of interpretability methods. ICML workshop on human interpretability in machine learning (WHI) (2018). arXiv:1806.08049
G. Hooker, L. Mentch, Please stop permuting features: an explanation and alternatives, 1–15 (2019) arXiv:1905.03151
M. Orcun Yalcin, X. Fan, On evaluating correctness of explainable AI algorithms: an empirical study on local explanations for classification (2021)
R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, N. Elhadad, Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’15, pp. 1721–1730. Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2783258.2788613
C. Molnar, G. König, J. Herbinger, T. Freiesleben, S. Dandl, C.A. Scholbeck, G. Casalicchio, M. Grosse-Wentrup, B. Bischl, General pitfalls of model-agnostic interpretation methods for machine learning models (2020). arXiv:2007.04131
M. Yang, B. Kim, BIM: towards quantitative evaluation of interpretability methods with ground truth. CoRR abs/1907.09701 (2019) arXiv:1907.09701
R. Guidotti, Evaluating local explanation methods on ground truth. Artif. Intell. 291, 103428 (2021). https://doi.org/10.1016/j.artint.2020.103428
S. Hooker, D. Erhan, P.-J. Kindermans, B. Kim, A benchmark for interpretability methods in deep neural networks. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 9737–9748. Curran Associates, Inc., Red Hook, NY, USA (2019). https://proceedings.neurips.cc/paper/2019/file/fe4b8556000d0f0cae99daa5c5c5a410-Paper.pdf
J.W. Vaughan, H. Wallach, A human-centered agenda for intelligible machine learning (Getting Along with Artificial Intelligence, Machines We Trust, 2020)
D.R. White, J. McDermott, M. Castelli, L. Manzoni, B.W. Goldman, G. Kronberger, W. Jaśkowski, U.-M. O’Reilly, S. Luke, Better GP benchmarks: community survey results and proposals. Genet. Program. Evolv. Mach. 14(1), 3–29 (2012). https://doi.org/10.1007/s10710-012-9177-2
J. McDermott, K.D. Jong, U.-M. O’Reilly, D.R. White, S. Luke, L. Manzoni, M. Castelli, L. Vanneschi, W. Jaskowski, K. Krawiec, R. Harper, Genetic programming needs better benchmarks. In: proceedings of the fourteenth international conference on genetic and evolutionary computation conference—GECCO ’12. ACM Press, New York, NY, USA (2012). https://doi.org/10.1145/2330163.2330273
S.-M. Udrescu, M. Tegmark, AI Feynman: a physics-inspired method for symbolic regression. Sci. Adv. 6(16) (2020). https://doi.org/10.1126/sciadv.aay2631. arXiv:1905.11481
S.-M. Udrescu, A. Tan, J. Feng, O. Neto, T. Wu, M. Tegmark, AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. Adv. Neural Inf. Process. Syst. 33, 4860–4871 (2020)
Y. Yasui, X. Wang, Statistical Learning from a Regression Perspective 65, 1309–1310 (2009). https://doi.org/10.1111/j.1541-0420.2009.01343_5.x
D. Kuonen, Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis (book review). Stat. Methods Med. Res. 13, 415–416 (2004). https://doi.org/10.1177/096228020401300512
M.Z. Asadzadeh, H.-P. Gänser, M. Mücke, Symbolic regression based hybrid semiparametric modelling of processes: an example case of a bending process. Appl. Eng. Sci. 6, 100049 (2021). https://doi.org/10.1016/j.apples.2021.100049
J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. A Bradford Book (MIT Press, Cambridge, MA, 1992). https://books.google.com.br/books?id=Bhtxo60BV0EC
M. Kommenda, B. Burlacu, G. Kronberger, M. Affenzeller, Parameter identification for symbolic regression using nonlinear least squares. Genet. Program. Evolv. Mach. 21(3), 471–501 (2019). https://doi.org/10.1007/s10710-019-09371-3
M. Kommenda, B. Burlacu, G. Kronberger, M. Affenzeller, Parameter identification for symbolic regression using nonlinear least squares. Genet. Program. Evolv. Mach. 21(3), 471–501 (2020)
B. Burlacu, G. Kronberger, M. Kommenda, Operon c++: an efficient genetic programming framework for symbolic regression. In: proceedings of the genetic and evolutionary computation conference companion. GECCO ’20, pp. 1562–1570. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3377929.3398099
S. Luke, Two fast tree-creation algorithms for genetic programming. Trans. Evol. Comp. 4(3), 274–283 (2000). https://doi.org/10.1109/4235.873237
G.S.I. Aldeia, Avaliação da interpretabilidade em regressão simbólica [Evaluation of interpretability in symbolic regression]. Master's thesis, Universidade Federal do ABC, Santo André, SP (December 2021)
L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/a:1010933404324
M.T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": explaining the predictions of any classifier. In: proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD '16, pp. 1135–1144. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2939672.2939778
R. Miranda Filho, A. Lacerda, G.L. Pappa, Explaining symbolic regression predictions. In: 2020 IEEE congress on evolutionary computation (CEC), pp. 1–8 (2020). IEEE
I. Covert, S. Lundberg, S.-I. Lee, Understanding global feature contributions with additive importance measures (2020)
M.D. Morris, Factorial sampling plans for preliminary computational experiments. Technometrics 33(2), 161–174 (1991)
H. Nori, S. Jenkins, P. Koch, R. Caruana, InterpretML: a unified framework for machine learning interpretability. CoRR abs/1909.09223 (2019). arXiv:1909.09223
M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks (2017)
R.J. Aumann, L.S. Shapley, Values of Non-atomic Games (Princeton University Press, Princeton, NJ, USA, 2015)
D. Lüdecke, ggeffects: tidy data frames of marginal effects from regression models. J. Open Source Softw. 3(26), 772 (2018)
E.C. Norton, B.E. Dowd, M.L. Maciejewski, Marginal effects-quantifying the effect of changes in risk factors in logistic regression models. JAMA 321(13), 1304–1305 (2019). https://doi.org/10.1001/jama.2019.1954
J.S. Long, S.A. Mustillo, Using predictions and marginal effects to compare groups in regression models for binary outcomes. Sociol. Methods Res. 50(3), 1284–1320 (2018). https://doi.org/10.1177/0049124118799374
T.D. Mize, L. Doan, J.S. Long, A general framework for comparing predictions and marginal effects across models. Sociol. Methodol. 49(1), 152–189 (2019). https://doi.org/10.1177/0081175019852763
E. Onukwugha, J. Bergtold, R. Jain, A primer on marginal effects—part i: theory and formulae. PharmacoEconomics 33(1), 25–30 (2015). https://doi.org/10.1007/s40273-014-0210-6
A. Agresti, C. Tarantola, Simple ways to interpret effects in modeling ordinal categorical data. Stat. Neerl. 72(3), 210–223 (2018). https://doi.org/10.1111/stan.12130
E.C. Norton, B.E. Dowd, M.L. Maciejewski, Marginal effects—quantifying the effect of changes in risk factors in logistic regression models. JAMA 321(13), 1304 (2019). https://doi.org/10.1001/jama.2019.1954
G. Plumb, M. Al-Shedivat, E.P. Xing, A. Talwalkar, Regularizing black-box models for improved interpretability. CoRR abs/1902.06787 (2019). arXiv:1902.06787
D. Alvarez-Melis, T.S. Jaakkola, Towards Robust Interpretability with Self-Explaining Neural Networks (2018)
C.K. Yeh, C.Y. Hsieh, A.S. Suggala, D.I. Inouye, P. Ravikumar, On the (In)fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems 32(NeurIPS) (2019) arXiv:1901.09392
Z. Zhou, G. Hooker, F. Wang, S-LIME: stabilized-LIME for model explanation. In: proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining (2021). https://doi.org/10.1145/3447548.3467274
W.-L. Loh, On Latin hypercube sampling. Ann. Stat. 24(5), 2058–2080 (1996)
J. Demšar, Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(1), 1–30 (2006)
S. Lee, D.K. Lee, What is the proper way to apply the multiple comparison test? Korean J. Anesthesiol. 71(5), 353–360 (2018). https://doi.org/10.4097/kja.d.18.00242
Acknowledgements
This work was funded by Federal University of ABC (UFABC), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Grant Number 2018/14173-8.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Cite this article
Aldeia, G.S.I., de França, F.O. Interpretability in symbolic regression: a benchmark of explanatory methods using the Feynman data set. Genet Program Evolvable Mach 23, 309–349 (2022). https://doi.org/10.1007/s10710-022-09435-x