Article

On Comprehension of Genetic Programming Solutions: A Controlled Experiment on Semantic Inference

1 Faculty of Computer and Information Science, University of Ljubljana, Večna Pot 113, 1000 Ljubljana, Slovenia
2 Department of Computer Science and Informatics, Zagreb University of Applied Sciences, Vrbik 8, 10000 Zagreb, Croatia
3 Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46, 2000 Maribor, Slovenia
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(18), 3386; https://doi.org/10.3390/math10183386
Submission received: 17 August 2022 / Revised: 14 September 2022 / Accepted: 15 September 2022 / Published: 18 September 2022
(This article belongs to the Section Mathematics and Computer Science)

Abstract: Applied to the problem of automatic program generation, Genetic Programming often produces code bloat or unexpected solutions, which are, according to common belief, difficult to comprehend. To study the comprehensibility of the code produced by Genetic Programming, attribute grammars obtained by Genetic Programming-based semantic inference were compared to manually written ones. According to the established procedure, the research was carried out as a controlled classroom experiment that involved two groups of students from two universities, and consisted of a background questionnaire, two tests, and a feedback questionnaire after each test. The tasks included in the tests required the identification of various properties of attributes and grammars, the identification of the correct attribute grammar from a list of choices, or correcting a semantic rule in an attribute grammar. It was established that solutions automatically generated by Genetic Programming in the field of semantic inference, in this study attribute grammars, are indeed significantly harder to comprehend than manually written ones. This finding holds regardless of whether comprehension correctness is considered, i.e., how many attribute grammars were correctly comprehended, or comprehension efficiency, i.e., how quickly attribute grammars were correctly comprehended.

1. Introduction

Genetic Programming (GP) [1,2,3] has been used as an automatic programming approach to efficiently solve many NP problems and real-world challenges [4,5,6,7,8,9]. However, solutions found by GP are often unnatural due to code bloat and/or unexpected solutions [10,11,12]. Code bloat is a phenomenon of variable-length solution representations (e.g., tree-based GP) in which code tends to grow rapidly. Often, such growth does not change the semantics of the code (it is so-called neutral code, or introns); a toy illustration is given after the quotations below. Some common observations about code bloat are:
  • “Genetic programming (GP), a widely used evolutionary computing technique, suffers from bloat—the problem of excessive growth in individuals’ sizes. As a result, its ability to efficiently explore complex search spaces reduces. The resulting solutions are less robust and generalizable. Moreover, it is difficult to understand and explain models which contain bloat” [11].
  • “Unnecessary growth in program size is known as the bloat problem in Genetic Programming. Bloat not only increases computational expenses during evolution, but also impairs the understandability and execution performance of evolved final solutions” [12].
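As a toy illustration of neutral code (our own, not taken from the cited studies), consider two expression trees that compute the same function; the second carries an intron that a GP run can accumulate without any change in fitness:

def compact(x):
    return x + 1

def bloated(x):
    # 0 * (...) is neutral code (an intron): it never affects the result,
    # but it enlarges the expression tree and obscures the solution.
    return x + 1 + 0 * (x * x - (x - x) * 42)

# Semantically identical on every input, yet very different in size.
assert all(compact(x) == bloated(x) for x in range(-100, 100))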
One of the most famous and early examples of an unexpected solution was the application of symbolic regression to the discovery of a trigonometric identity for cos 2x, where an automatically found solution was [1]

sin((2 − 2x) − sin(sin(sin(sin(sin(sin(sin(sin 1) · sin(sin 1)))))))).
According to Koza [1], this solution was mysterious, and required additional testing and implementation in the Mathematica software package to verify that it indeed closely fits the cos 2x curve. Extra effort and mathematical skills were needed to show that the mysterious solution was actually equivalent to a well-known trigonometric identity involving a phase shift: cos 2x = sin(π/2 − 2x). As such, it is a common belief that automatically generated GP solutions are often difficult to comprehend, due to code bloat and unexpected solutions [10,11,12]. This paper takes a more skeptical approach to the acceptance of those beliefs. For this purpose, we resorted to controlled experiments, which we also used in our previous work to examine a similar belief, namely that programs written in domain-specific languages (DSLs) [13,14] are easier to comprehend than equivalent programs written in general-purpose languages (GPLs) [15,16]. To test the comprehension correctness and comprehension efficiency of automatically generated GP solutions, a controlled experiment was designed in the area of semantic inference [17,18]. Semantic inference is an extension of grammatical inference [19], which aims to induce not only the syntax of a language but also its semantics. As such, a complete compiler/interpreter can be automatically generated from formal specifications [20]. To automatically generate a compiler/interpreter for small DSLs solely from programs, GP was applied in our previous research [18,21,22], where a complete compiler/interpreter was produced from automatically generated attribute grammars (AGs) [23,24] using the tool LISA [25]. Because the participants in the controlled experiment were students, we followed the guidelines for conducting experiments in a classroom environment [26].
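The closeness of the fit is easy to check numerically. The following Python sketch evaluates our reconstruction of Koza's expression (with the operators restored as reported in [1]; treat the constant part as approximate) and compares it with cos 2x:

import math

def gp_solution(x):
    # Reconstruction of the GP-discovered expression for cos 2x [1].
    c = math.sin(math.sin(1.0)) * math.sin(math.sin(1.0))
    for _ in range(6):                       # six nested sin(...) applications
        c = math.sin(c)
    # c is approximately 0.4328, so the result is sin(1.5672 - 2x),
    # i.e., almost exactly sin(pi/2 - 2x) = cos 2x.
    return math.sin((2.0 - 2.0 * x) - c)

max_err = max(abs(gp_solution(i / 100.0) - math.cos(2.0 * i / 100.0))
              for i in range(-314, 315))
print(f"maximum deviation from cos 2x: {max_err:.4f}")   # about 0.0036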
The main contributions of this work are:
  • Providing empirical evidence about the comprehension correctness of manually written AGs vs. automatically generated AGs using GP.
  • Providing empirical evidence about the comprehension efficiency of manually written AGs vs. automatically generated AGs using GP.
The remainder of this paper is organized as follows. Section 2 describes related work on program comprehension and on the code bloat problem in GP. Section 3 highlights the main ingredients of the controlled experiment. Section 4 gives the experimental results and data analysis. Section 5 discusses threats to validity. Section 6 summarizes the key findings and outlines future work.

2. Related Work

The program comprehension of GP solutions has, to the best of our knowledge, not been investigated before. On the other hand, program comprehension [27] in other research fields has been examined extensively: for example, the comprehension of block-based vs. text-based programming [28,29], the impact of anti-patterns on program comprehension [30], the comprehension of DSL programs vs. GPL programs [15,16,31,32], model comprehension [33], the comparison of a textual vs. a graphical notation in domain models [34], the influence of visual notation on requirement specification comprehension [35], the comprehension of charts by visually impaired individuals [36], program comprehension in virtual reality [37], software maintenance [38], and software engineering [39].
Code bloat is an undesired property of GP, since it increases computational complexity [1,22] and, according to common belief, hampers the comprehension of GP solutions. Therefore, many researchers have tackled the code bloat problem in GP; a variety of their approaches are described briefly below. Most of these approaches can roughly be classified as tree-based, working at the individual or the population level: for example, limiting the maximum tree depth of individual solutions [1], or limiting the maximum number of tree nodes over the entire population [40]. Furthermore, there are approaches based on controlling the distribution of program sizes [41], techniques that punish large individuals during evaluation using the fitness function [42], and many other approaches that do not belong to any specific category. Note that none of the current approaches eliminates code bloat completely, but they do reduce it significantly.
The Dynamic Limits approach was presented in [43,44]. It is a collective name for the Dynamic Maximum Tree Depth technique [43] and its variations, in which a dynamically adjusted limit on the depth or size of individuals (programs in GP) is used [44]. Any individual that exceeds the current limit is rejected, unless it is the best-rated individual so far; in that case, the limit is raised to match that individual. The same authors also proposed the resource-limited GP approach [45], a resource-limiting technique applied to the number of tree nodes or lines of code. In contrast to the previous approach, this one works not on an individual but on a population level.
Substituting a subtree with an Approximate Terminal (SAT-GP) is yet another tree-based code bloat prevention approach, based on the concept of approximate trees [46]. It is applied to a portion of the largest individuals in the population by replacing a random subtree of such individuals with an approximate terminal of similar semantics. SAT-GP generally performs better than standard GP and the neat-GP bloat control approach [47]. The latter is based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm adapted for GP and the Flat Operator Equalization bloat control method (Flat-OE) [48], which explicitly shapes the program size distribution towards a uniform, or flat, shape. In general, the Operator Equalization technique [41] controls the distribution of sizes in a population by accepting individuals based on their size, directing exploration towards smaller solutions.
To address the code bloat problem in GP, the authors of [49] used the principle of Minimum Description Length (MDL) [50] within the fitness function, where the size of the composite operator is taken into account during the evaluation process. A hard size limit can restrict the GP search and lower the likelihood of discovering effective composite operators. However, according to MDL, large composite operators do not have high fitness and will eventually be eliminated in the selection process; the hard limit can therefore be removed in such cases.
GP programs can also be simplified online during the evolutionary process [51]. Programs are represented as expressions stored in a tree representation [1]. During evaluation, programs are traversed recursively in postfix order, and algebraic simplification rules and hashing are applied to simplify programs and reduce code bloat [52]. To tackle the code bloat problem in standard tree-based GP, it is also possible to substitute the standard subtree crossover with the one-point crossover (OPX) developed by Poli and Langdon [53], while maintaining all other standard aspects of GP [54].
Bloat and diversity can be positively affected by using multi-objective methods, with the idea of searching for multiple solutions, each of which satisfies different objectives to a different extent. The selection of the final solution with a certain combination of objective values is postponed until it is known which combinations of objective values exist [55]. In addition to the above, there are many other methods to control code bloat (a sketch of two of the techniques from this section follows the list below), such as:
  • Tarpeian (individuals with above-average size are assigned a poor fitness with a predetermined probability) [10,56].
  • Double Tournament (one tournament based on fitness, and the second based on the size of the individuals).
  • The Waiting Room (larger individuals wait longer to enter the population) [57].
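To make these ideas concrete, the sketch below (our own, not taken from the cited papers) shows how two of the techniques described above, Dynamic Maximum Tree Depth [43] and the Tarpeian method [10,56], can be expressed as an acceptance test and a fitness hook in a tree-based GP loop:

import random

# A minimal tree-based individual: leaves are terminals, tuples are operators.
def depth(t):
    return 1 if not isinstance(t, tuple) else 1 + max(depth(c) for c in t[1:])

def size(t):
    return 1 if not isinstance(t, tuple) else 1 + sum(size(c) for c in t[1:])

def accept_dynamic_depth(child, child_fitness, limits):
    # Dynamic Maximum Tree Depth [43]: reject a child deeper than the current
    # limit, unless it is the best individual seen so far; in that case the
    # limit is raised to the child's depth.
    if depth(child) <= limits["max_depth"]:
        return True
    if child_fitness > limits["best_fitness"]:
        limits["max_depth"] = depth(child)
        limits["best_fitness"] = child_fitness
        return True
    return False

def tarpeian_fitness(individual, raw_fitness, avg_size, p=0.3):
    # Tarpeian method [10,56]: with probability p, assign the worst possible
    # fitness to individuals of above-average size, so selection removes them.
    if size(individual) > avg_size and random.random() < p:
        return float("-inf")
    return raw_fitness

# Example: the tree ('+', 'x', ('*', 'x', 'x')) has depth 3 and size 5.
limits = {"max_depth": 4, "best_fitness": float("-inf")}
print(accept_dynamic_depth(('+', 'x', ('*', 'x', 'x')), 0.9, limits))  # True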
The code bloat problem in semantic inference [18,21,22] affects performance, because more CPU time is required to evaluate a solution. Improved long-term memory assistance (ILTMA) [22,58], multi-threaded fitness evaluation, and a reduced number of input programs were used to address this [22]. After evaluating each individual in the population, ILTMA generates a set of similar individuals that share the same fitness value by creating a Cartesian product [59,60] of the individual's equivalent genes. The hash table then grows faster, increasing the chances that the remaining individuals are already in the hash table, instead of spending CPU time on their evaluation. Depending on the number of available CPU cores, multi-threaded fitness evaluation can significantly improve semantic inference performance by performing evaluations in parallel, while a reduced number of fitness cases can lower the CPU time required for evaluation by using a smaller number of structurally complete input programs [22]. In addition, limiting the depth of the expression tree was used to control code bloat in semantic inference.
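In the spirit of this idea (a simplified sketch of fitness caching, not the actual ILTMA implementation from [22,58]), fitness values can be stored in a hash table keyed by a canonical form of an individual, so that re-encountering an already evaluated, or equivalent, individual costs a lookup instead of CPU time:

fitness_cache = {}  # canonical individual -> fitness (the long-term memory)

def canonical(individual):
    # Map equivalent genotypes to one key; here simply the printed form.
    # ILTMA additionally inserts whole sets of equivalent individuals,
    # obtained as a Cartesian product of equivalent genes [22,58].
    return repr(individual)

def cached_fitness(individual, evaluate):
    key = canonical(individual)
    if key not in fitness_cache:       # evaluate only unseen individuals
        fitness_cache[key] = evaluate(individual)
    return fitness_cache[key]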

3. Method

The goal of the controlled experiment was to compare the comprehension of programming language specifications written manually by a language designer vs. the comprehension of programming language specifications generated automatically by GP [1] in the field of semantic inference [17,18]. In particular, we wanted to verify whether the comprehension of automatically generated GP solutions is indeed more difficult due to the code bloat problem and unexpected solutions. For this purpose, we prepared a controlled experiment consisting of several tasks, using AGs [23,24] as the specification formalism. Since it is difficult to obtain a larger group of language designers, the participants in our study were undergraduate students in the Computer Science programs at the University of Maribor and the University of Ljubljana. Both programs are leading undergraduate Computer Science programs in Slovenia, attracting the best students, and have similar, although not identical, curricula. The experiment was conducted as part of the Compiling Programming Languages course (second year) at the University of Maribor, Faculty of Electrical Engineering and Computer Science (FERI), taught by the third author, and of the Compilers course (third year) at the University of Ljubljana, Faculty of Computer and Information Science (FRI), taught by the first author. Both courses cover similar topics. In particular, the topics at the University of Maribor were: regular expressions and lexical analysis, context-free grammars (CFGs), Backus–Naur form (BNF), top–down and bottom–up parsing, backtracking parsers, LL(1) grammars, left recursion elimination, left factoring, recursive-descent parsers, LR(0) parsers, BNF for simple DSLs (e.g., [61]), AGs (S-attributed grammars, L-attributed grammars, absolutely non-circular attribute grammars), compiler generators, denotational semantics, and operational semantics. At the University of Ljubljana, the topics cover all phases of a non-optimizing compiler: lexical and syntax analysis (regular-expression-based lexers, LL and LR parsers), type systems and semantic analysis, stack frames and calling conventions, translation into intermediate code, basic blocks and traces, instruction selection, liveness analysis, and register allocation; throughout the course, each student individually writes a compiler.
The experiment was performed simultaneously for both groups, close to the end of the course, so that all the material needed had been covered before the experiment. The students performed the experimental tasks as part of the course assignments. Participation was optional and rewarded: based on the correctness of their answers, the participants earned up to a 10% bonus on their assignment grade for the course. We chose to base the bonus on correctness to motivate the participants to devote adequate effort to the tasks.
The experiment consisted of a background questionnaire, two tests, and a feedback questionnaire after each test. The aim of the background questionnaire was to obtain the participants' perceptions/beliefs about their own knowledge of programming, compilers, and AGs, and about their interest in programming and compilers. The first test was common to both groups of participants, with the aim of measuring general knowledge about AGs. It consisted of seven tasks:
  • Identification of the kind of an attribute (synthesized or inherited).
  • Identification of a missing semantic rule for an unknown attribute.
  • What is the meaning of a simple AG?
  • Identification of AG type (S-attributed grammar, L-attributed grammar, absolutely non-circular AG, circular AG).
  • Is the provided specification an L-attributed AG?
  • Identification of a wrong AG among several correctly provided AGs.
  • Identification of a superfluous semantic rule.
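For instance, regarding the first kind of task: in Listing 1 further below, inx and iny are inherited attributes (the robot's position is passed down the semantic tree), whereas outx and outy are synthesized attributes (the computed position is passed back up); the participants had to make exactly this kind of distinction.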
The second test was different for the two groups. We made a conscious decision that the group that performed better on the first test would be confronted with the automatically generated AGs. Therefore, Group I (students from the University of Maribor, FERI) received tasks with AGs written by a language designer, whilst Group II (students from the University of Ljubljana, FRI) received tasks with automatically generated AGs. The tasks were the same, but the AGs were different (manually written vs. automatically generated). Hence, this was a between-subjects study. The second test consisted of seven tasks:
  • Identification of a correct AG for the aⁿbⁿcⁿ language.
  • Identification of a correct AG for simple expressions.
  • Identification of a correct AG for the Robot language.
  • Identification of a correct AG for a simple where statement.
  • Correct a wrong semantic rule in AG for the aⁿbⁿcⁿ language.
  • Correct a wrong semantic rule in AG for simple expressions.
  • Correct a wrong semantic rule in AG for the Robot language.
After each test, a feedback questionnaire was given to the participants, measuring their individual perspectives on the experiment. Namely, the participants expressed their opinions about the simplicity of the AGs in the experimental tasks.
Figure 1 summarizes the overall design of the controlled experiment. After the lectures on AGs, the participants took the first test, which lasted 60 min, followed by 5 min for the background and feedback questionnaires. In Group I (FERI), there were 33 participants, and in Group II (FRI), there were 16 participants. Seven participants from Group I (FERI) decided not to participate in the second test, which lasted 75 min and was followed by a 5 min feedback questionnaire. Note that participation in both tests was voluntary.
We used Moodle as the online system for conducting the tests and the background/feedback questionnaires. The code inside the questions was rendered as figures with indentation and syntax highlighting. Code execution was prevented in this manner; thus, debugging support (watch, variable list, etc.), editor functionality (auto-complete, find and replace), and other specialized auxiliary tools (producing, e.g., semantic trees such as those in Figure 2 or Figure 3) were also unavailable. Only manual program comprehension was possible.
As the basis for this experiment, we formulated two hypotheses, one on correctness and one on efficiency:
  • H1-null: There is no significant difference in the correctness of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.
  • H1-alt: There is a significant difference in the correctness of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.
  • H2-null: There is no significant difference in the efficiency of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.
  • H2-alt: There is a significant difference in the efficiency of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.
The background questionnaire, two tests, feedback questionnaires, and the participants’ results are available at https://github.com/slivnik/AG_experiement2022, (accessed on 12 August 2022, commit cf47740).
To give the reader a glimpse of the difference between a manually written and an automatically generated AG, we provide an example for the Robot language. Listing 1 shows a manually written AG which is easy to comprehend. The initial robot position (0, 0) is set in the rule start and represented with two inherited attributes, inx and iny. The final position is stored in two synthesized attributes, outx and outy, whilst the propagation of a position from one command to the next is described in the rule commands. Changing the position after one command (left, right, up, down) is described in the rule command. In Figure 2, the semantic tree for the program begin right up left down end is shown. The propagation of the position can be observed nicely. The robot starts at position (0, 0), and the command right changes it to the position (1, 0) and passes it to the command up, which results in the intermediate position (1, 1). This position is then passed to the command left, which changes the robot's position to (0, 1). Finally, this position is passed to the command down, which moves the robot back to the position (0, 0). This is the final position that is propagated to the root node (Figure 2).
Listing 1. Manually written AG for the Robot language.
language Robot {
        lexicon { keywords  begin | end
                               operation left | right | up | down
                               ignore [\0x0D\0x0A\ ]
        }
       attributes int *.inx; int *.iny; int *.outx; int *.outy;
       rule start {
                     START ::= begin COMMANDS end compute {
                         START.outx = COMMANDS.outx;
                         START.outy = COMMANDS.outy;
                         COMMANDS.inx = 0;
                         COMMANDS.iny = 0;
                   };
       }
       rule commands {
                     COMMANDS ::= COMMAND COMMANDS compute {
                         COMMANDS[0].outx = COMMANDS[1].outx;
                         COMMANDS[0].outy = COMMANDS[1].outy;
                         COMMAND.inx = COMMANDS[0].inx;
                         COMMAND.iny = COMMANDS[0].iny;
                         COMMANDS[1].inx = COMMAND.outx;
                         COMMANDS[1].iny = COMMAND.outy;
                     }
               | epsilon compute {
                         COMMANDS[0].outx = COMMANDS[0].inx;
                         COMMANDS[0].outy = COMMANDS[0].iny;
                     };
       }
       rule command {
                     COMMAND ::= left compute {
                         COMMAND.outx = COMMAND.inx - 1;
                         COMMAND.outy = COMMAND.iny;
                     };
                     COMMAND ::= right compute {
                         COMMAND.outx = COMMAND.inx + 1;
                         COMMAND.outy = COMMAND.iny;
                     };
                     COMMAND ::= up compute {
                         COMMAND.outx = COMMAND.inx;
                         COMMAND.outy = COMMAND.iny + 1;
                     };
                     COMMAND ::= down compute {
                         COMMAND.outx = COMMAND.inx;
                         COMMAND.outy = COMMAND.iny - 1;
                     };
        }
}
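To relate the attribute flow in Listing 1 to ordinary code, the following Python sketch (our rendering, not produced by LISA) mirrors its semantics: the variables x and y play the role of the inherited attributes inx and iny, and the values returned play the role of the synthesized attributes outx and outy:

# Each command maps the incoming position (inherited inx, iny) to an
# outgoing position (synthesized outx, outy), exactly as in rule "command".
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def run_manual(commands):
    x, y = 0, 0                  # rule "start": COMMANDS.inx = COMMANDS.iny = 0
    for cmd in commands:         # rule "commands": thread the position along
        dx, dy = MOVES[cmd]
        x, y = x + dx, y + dy
    return x, y                  # START.outx, START.outy

print(run_manual(["right", "up", "left", "down"]))   # (0, 0), as in Figure 2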
On the other hand, an automatically generated AG using semantic inference (Listing 2) is not intuitive anymore. The solution found by GP is an unexpected solution and contains code bloat (e.g., COMMANDS[1].inx = 0 + COMMANDS[1].outy; in the rule commands). The robot's position is not propagated anymore. Instead, the important information about one command (left, right, up, down) is stored in a command node. For example, the meaning of the command up is always (1, 0). After a sequence of two commands, both positions need to be summed up. The computation is further complicated by the exchange of the x and y coordinates (see Figure 3). Just comparing the code sizes of Listings 1 and 2, it can be noticed that, in the 20 semantic equations which appear in both listings, there are 28 operands and operators in Listing 1, and 42 operands and operators in Listing 2. The average number of operands and operators per semantic rule is thus 1.4 in Listing 1 and 2.1 in Listing 2. This is a 50% increase, which shows how the complexity of automatically generated solutions grows compared to manually written ones. The increased complexity can be attributed to both code bloat and unexpected solutions.
Listing 2. Automatically generated AG for the Robot language.
language Robot {
          lexicon {
                  keywords  begin | end
                  operation left | right | up | down
                          ignore [\0x0D\0x0A\ ]
          }
 
         attributes int *.inx; int *.iny;
                                int *.outx; int *.outy;
 
         rule start {
                  START ::= begin COMMANDS end compute {
                      START.outx = COMMANDS.outy;
                      START.outy = COMMANDS.iny+COMMANDS.outx;
                      COMMANDS.inx=0;
                      COMMANDS.iny=0;
             };
           }
 
         rule commands {
                  COMMANDS ::= COMMAND COMMANDS compute {
                      COMMANDS[0].outx = COMMANDS[1].outx+COMMAND.outx;
                      COMMANDS[0].outy = COMMANDS[1].outy-COMMAND.outy;
                      COMMAND.inx = 0;
                      COMMAND.iny = 0;
                      COMMANDS[1].inx = 0+COMMANDS[1].outy;
                      COMMANDS[1].iny = COMMAND.iny;
                  }
                  | epsilon compute {
                             COMMANDS[0].outx = COMMANDS[0].iny-COMMANDS[0].outy;
                             COMMANDS[0].outy = 0;
                  };
         }
 
         rule command {
                  COMMAND ::= left compute {
                      COMMAND.outx = COMMAND.iny-0;
                      COMMAND.outy = 1+0;
                  };
                  COMMAND ::= right compute {
                      COMMAND.outx = COMMAND.inx-COMMAND.iny;
                      COMMAND.outy = 0-1;
                  };
                  COMMAND ::= up compute {
                      COMMAND.outx = 1;
                      COMMAND.outy = 0+0;
                  };
                  COMMAND ::= down compute {
                      COMMAND.outx = 0-1;
                      COMMAND.outy = COMMAND.iny;
                  };
        }
}
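Although Listing 2 computes correct final positions, its attribute flow is far from obvious. The following sketch (our Python transcription of the semantic rules in Listing 2, not part of the generated AG) makes the oddities explicit and checks that it agrees with the manually written AG on the program from Figure 2:

# Synthesized attributes (outx, outy) of the rule "command" in Listing 2;
# COMMAND.inx and COMMAND.iny are always set to 0, so they are folded in.
GP_OUT = {"left": (0, 1), "right": (0, -1), "up": (1, 0), "down": (-1, 0)}

def run_generated(commands):
    outx, outy = 0, 0                 # epsilon alternative: (0, 0), since iny = 0
    for cmd in reversed(commands):    # innermost COMMANDS node = last command
        cx, cy = GP_OUT[cmd]
        outx, outy = outx + cx, outy - cy   # COMMANDS[0].outx/outy in Listing 2
        # COMMANDS[1].inx = 0 + COMMANDS[1].outy is pure bloat: inx is never read.
    return outy, 0 + outx             # rule "start" swaps the coordinates back

assert run_generated(["right", "up", "left", "down"]) == (0, 0)   # as in Figure 2
assert run_generated(["right"]) == (1, 0) and run_generated(["up"]) == (0, 1)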

4. Results

This section compares the comprehension correctness and comprehension efficiency of programming language specifications written manually by a language designer vs. those of specifications automatically generated by GP in the field of semantic inference [17,18]. It also presents the results from the background and feedback questionnaires. All observations were statistically tested with α = 0.05 as the threshold for significance. The Shapiro–Wilk test of normality was performed on all the data. If the data were not normally distributed, we performed the non-parametric Mann–Whitney test for two independent samples; if the data were normally distributed, we performed the parametric independent samples t-test.
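For concreteness, this decision procedure can be written down in a few lines of Python using SciPy (a sketch; scores_a and scores_b stand for hypothetical per-group result arrays, not the actual data, which are available in the repository cited in Section 3):

from scipy import stats

def compare_groups(scores_a, scores_b, alpha=0.05):
    # Shapiro-Wilk normality check on both samples, then Mann-Whitney U
    # (non-parametric) or an independent samples t-test, as described above.
    normal = (stats.shapiro(scores_a).pvalue > alpha and
              stats.shapiro(scores_b).pvalue > alpha)
    result = (stats.ttest_ind(scores_a, scores_b) if normal
              else stats.mannwhitneyu(scores_a, scores_b))
    return result.pvalue < alpha   # True = significant difference at alpha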
Table 1 shows the results from the background questionnaire, in which the participants' beliefs about their own knowledge of programming, compilers, and AGs were captured. Here, we used a five-point Likert scale, with 1 representing low and 5 representing high knowledge. From Table 1, it can be observed that there was no statistically significant difference in the participants' opinions about their knowledge of programming and AGs. However, the participants in Group II (FRI) rated their knowledge of compilers statistically significantly higher. Similarly, from Table 2, where the participants' interest in programming and compilers is presented, it can be observed that there was no statistically significant difference in the participants' opinions about their interest in programming, but there was one in their interest in compilers. Both statistically significant differences regarding compilers can be explained by the fact that the compilers course at the University of Ljubljana is an elective course. Therefore, the participants from Group II (FRI) had a higher interest in compilers, and their opinion of their knowledge of compilers was higher than that of the participants from Group I (FERI).
Table 3 shows the results of comprehension correctness for the first test, which measured general knowledge about AGs. Both groups, Group I (FERI) and Group II (FRI), solved the same seven tasks. It can be observed from Table 3 that Group II (FRI) performed statistically significantly better. Its average comprehension correctness was slightly above 70%, whilst the average comprehension correctness of Group I (FERI) was only slightly above 50%. Some possible explanations for why Group II (FRI) performed better than Group I (FERI) are:
  • Participants in Group II (FRI) are more experienced (third-year students) than participants in Group I (FERI) (second-year students).
  • The course of Group II (FRI) was elective, whilst the course of Group I (FERI) was mandatory. Therefore, the participants of Group II (FRI) had higher interest in the compiler course.
The threats to validity section (Section 5) discusses other possible reasons why Group II (FRI) performed better than Group I (FERI) on the first test.
Table 4 shows the results of comprehension correctness for the second test. Both groups had seven tasks. However, Group I (FERI) was exposed to manually written AGs, whilst Group II (FRI) was exposed to AGs automatically generated using GP in the field of semantic inference [17,18]. As such, those solutions suffered from the code bloat problem, and many of them were unexpected. It can be observed from Table 4 that Group I (FERI) performed statistically significantly better than Group II (FRI) on the second test. Its average comprehension correctness was slightly below 80%, whilst the average comprehension correctness of Group II (FRI) was only slightly above 50%. The AGs for simple languages such as aⁿbⁿcⁿ, simple expressions, the Robot language, and the simple where statement, manually written by a language designer, were intuitive and easy to understand. On the other hand, Group II (FRI) performed statistically significantly worse on the second test, since the automatically generated AGs were no longer intuitive, presumably due to the code bloat problem and unexpected solutions. Although Group II (FRI) had better results on the first test, which measured general knowledge about AGs, it was statistically significantly outperformed by Group I (FERI) on the second test. Hence, we can conclude that solutions automatically generated by GP in the field of semantic inference [17,18] are indeed harder to comprehend than manually written solutions.
Table 5 shows the average correctness per task in both tests. In the first test, Group II's (FRI) comprehension correctness was much higher than Group I's (FERI) for every task except Q4, which concerned the identification of the AG type (S-attributed grammar, L-attributed grammar, absolutely non-circular AG, circular AG). On task Q4, Group I (FERI) was only slightly better (57.58%) than Group II (FRI) (50.00%). The Mann–Whitney test (Table 3) shows that the comprehension correctness of Group II (FRI) was statistically significantly better than that of Group I (FERI) in the first test. In the second test, the opposite was true. Group I (FERI), which was exposed to manually written AGs, performed better than Group II (FRI), which was exposed to automatically generated AGs, on all tasks except Q4 and Q7. Task Q4 in the second test concerned the identification of a correct AG for a simple where statement. It turned out that the automatically generated AG for task Q4 was the same as the manually written AG; the GP solution, in this case, produced neither code bloat nor an unexpected solution. Since the knowledge about AGs was higher in Group II (FRI) than in Group I (FERI) (Table 3), the correctness on Q4 in the second test was again higher within Group II (FRI) (87.50%) than within Group I (FERI) (69.23%). Task Q7 in the second test required correcting a wrong semantic rule in an AG for the Robot language. The robot's initial position was wrong, and finding the correct semantic rule was not difficult, even in the automatically generated AG. Therefore, the correctness on task Q7 of Group II (FRI) was again better (87.50%) than that of Group I (FERI) (80.77%). On all other tasks (Q1–Q3, Q5–Q6), Group I (FERI) performed better than Group II (FRI). Task Q3, the identification of a correct AG for the Robot language (Listing 2), was especially difficult for the participants of Group II (FRI): nobody solved it correctly. The Mann–Whitney test (Table 4) shows that the comprehension correctness of Group I (FERI) was statistically significantly better than that of Group II (FRI) in the second test.
As in many other studies (e.g., [15,16,33]), comprehension efficiency is measured as the ratio of the percentage of correctly answered questions to the amount of time spent answering the questions (hence, higher values are better). Table 6 shows that Group II (FRI) was statistically significantly more efficient than Group I (FERI) on the first test, whilst the opposite was true for the second test (Table 7). Overall, the results on comprehension efficiency (Table 6 and Table 7) corroborated the results on comprehension correctness (Table 3 and Table 4).
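For example, a hypothetical participant who correctly answered five of seven tasks (71.4%) in 60 min would have a comprehension efficiency of 71.4/60 ≈ 1.19, whereas the same correctness reached in 40 min would yield 71.4/40 ≈ 1.79.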
Table 8 shows the results from the feedback questionnaires administered after both tests. In the feedback questionnaire, we captured the participants' individual perspectives on the simplicity of the AGs in each of the seven tasks. Again, we used a five-point Likert scale, with 1 representing low and 5 representing high simplicity. In the statistical analysis (Table 8), we used the average result on the simplicity of the seven AGs (tasks). In the first test, the participants of Group II (FRI) found the AGs simpler than the participants of Group I (FERI), and the difference was statistically significant. The comprehension correctness (Table 3) and comprehension efficiency (Table 6) of Group II (FRI) were also statistically significantly better than those of Group I (FERI). The results for the second test did not surprise us. There was a statistically significant difference in the participants' opinions about the simplicity of AGs. Group I (FERI) participants found the manually written AGs simple (note that five participants did not return the feedback questionnaire after the second test), whilst Group II (FRI) participants found the automatically generated AGs no longer simple (note that one participant did not return the feedback questionnaire after the second test). This opinion was tested and verified in a scientific manner in our controlled experiment. The results from Table 3, Table 4, Table 5, Table 6 and Table 7 allow us to reject both null hypotheses H1-null and H2-null and accept the alternative hypotheses formulated in Section 3:
  • H1-alt: There is a significant difference in the correctness of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.
  • H2-alt: There is a significant difference in the efficiency of the participants' comprehension of AGs when using manually written AGs vs. automatically generated AGs.

5. Threats to Validity

In this section, we discuss the construct, internal and external validity threats [62] to the results of our controlled experiment [63].
Construct validity concerns how well we can measure the properties under consideration [64,65]. In our experiment, we wanted to measure the comprehension, or understandability, of manually written AGs and of AGs automatically generated using GP in the field of semantic inference. To this end, we designed several tasks in which the participants had to understand the provided AGs. We started with simple tasks, such as the identification of the kind of an attribute (synthesized or inherited), identification of a missing semantic rule, identification of the AG type (S-attributed grammar, L-attributed grammar, absolutely non-circular AG, and circular AG), and identification of a superfluous semantic rule. Only a partial understanding of the whole AG is needed for these tasks. In the subsequent tasks, the participants needed to understand complete AGs (e.g., for the Robot language) in order to solve a task correctly. Given the variety of tasks, we believe that construct validity has been addressed well, since solving various tasks measures the participants' ability to comprehend AGs, albeit indirectly. In addition to comprehension, efficiency was measured as well, as the ratio of correctly answered questions to the amount of time spent answering them. Both the correct answers and the time were easy to measure. Overall, we believe that the threats to construct validity were minimized successfully.
Internal validity concerns inferences between the treatment and the outcome: are there other confounding factors that could have caused the outcome? Since the possible answers were given to the participants, there is always a chance that a participant guessed the correct answer. This threat was mitigated by the large number of possible answers (five options) and the large number of tasks (2 × 7 = 14). Note that the order of the tasks was fixed and the same for all participants, without the possibility of returning to a previous task; however, the possible answers for each task were presented to the participants in random order. From the aforementioned course descriptions at both universities (Section 3), it can be observed that the topics were similar, but not the same. To eliminate some threats to internal validity, the same teaching material about AGs, in the form of slides, was prepared and presented to the students in both groups (Group I: second-year students at the University of Maribor, FERI; Group II: third-year students at the University of Ljubljana, FRI). Although the material was the same, there were still two different instructors. Hence, we did not eliminate all threats to internal validity. However, we have no reason to believe that the different styles of presenting AGs to the participants and the slightly different topics covered in the two courses (Compiling Programming Languages and Compilers) influenced the outcome of the experiment. There is no other reasonable explanation for why Group II (FRI) performed worse on the second test than that the automatically generated AGs are harder to comprehend due to code bloat and unexpected solutions. Note that Group II (FRI) performed statistically significantly better on the first test; therefore, despite better knowledge about AGs, Group II (FRI) performed statistically significantly worse on the second test. After the first test, seven students from Group I (FERI) chose not to participate in the second test. We re-ran the statistics for the first test after eliminating those students, and the result was the same: Group II (FRI) still performed statistically significantly better than Group I (FERI). Hence, we believe that the students who dropped out of the second test did not change the sample considerably. Another threat to internal validity could be the different tasks given to Group I (FERI) and Group II (FRI). The first test was identical for both groups; in the second test, the tasks were the same, but Group I (FERI) was exposed to manually written AGs and Group II (FRI) to automatically generated AGs. Since the tasks were the same, this internal threat was eliminated, or at least minimized.
External validity refers to the extent to which the results of a study can be generalized. Since only students participated in our experiment, there is an external validity threat concerning whether the results can also be generalized to professional programmers. Another external validity threat concerns generalization to other applications of GP (e.g., Lisp programming [1,3], event processing rules [66], trading rules [7], model-driven engineering artifacts [9]). In our case, GP was used in semantic inference, wherein AGs were automatically generated. Due to code bloat and unexpected solutions, many believe that such solutions are harder to comprehend, and this study shows that there is indeed a statistically significant difference in the comprehension correctness and comprehension efficiency of manually written AGs vs. automatically generated AGs.

6. Conclusions

There is a common belief that programs (specifications, models) [67] automatically generated using the Genetic Programming (GP) approach [1,2,3] are harder to comprehend than manually written programs, due to code bloat and unexpected solutions [10,11,12]. In this paper, to the best of our knowledge, a scientific approach is used to test these claims for the first time. A controlled experiment was designed in which the participants were exposed to manually written and automatically generated attribute grammar (AG) specifications [23,24], written using the LISA domain-specific language (DSL). From such programs (specifications), as can be seen, for example, in Listings 1 and 2, various language-based tools (e.g., compilers/interpreters, debuggers, profilers) can be automatically generated [20,68]. The main findings of our experiment are that there is a significant difference in the comprehension correctness and comprehension efficiency of manually written AGs vs. automatically generated AGs. Hence, we can conclude that solutions automatically generated by GP in the field of semantic inference [17,18] are indeed harder to comprehend than manually written solutions. On the other hand, by comparing manually written and automatically generated solutions, we can observe how machines can complement us in solving interesting and complex problems, even if their solutions are harder for humans to understand.
As with any empirical study, these results must be taken with caution, and additional replications of this study are needed. Study replications, if performed correctly, can increase the trustworthiness of the results of empirical studies [69]. Hence, we urge other researchers to perform such studies using GP in other research fields (e.g., Lisp programming [1,3], event processing rules [66], trading rules [7], and model-driven engineering artifacts [9]). Furthermore, additional controlled experiments are needed, involving not only students but also practitioners (e.g., professional programmers).
In this study, the controlled experiment was designed in such a manner that lower comprehension correctness and comprehension efficiency can be attributed to both code bloat and unexpected solutions. As future work, we would like to determine how much code bloat alone affects comprehension correctness and comprehension efficiency. Another future direction is to study how code bloat and unexpected solutions influence explainability: users are more likely to apply artificial intelligence (AI) if they can understand the results of AI artifacts, and explainable artificial intelligence (XAI) refers to techniques for understanding AI artifacts better [70].

Author Contributions

Conceptualization, B.S., Ž.K., M.M. and T.K.; methodology, B.S., Ž.K., M.M. and T.K.; software, Ž.K. and T.K.; validation, B.S., Ž.K., M.M. and T.K.; investigation, B.S., Ž.K., M.M. and T.K.; writing—original draft preparation, B.S., Ž.K., M.M. and T.K.; writing—review and editing, B.S., Ž.K., M.M. and T.K.; visualization, Ž.K. All authors have read and agreed to the published version of the manuscript.

Funding

The second author wishes to thank the University of Maribor, Faculty of Electrical Engineering and Computer Science, the Zagreb University of Applied Sciences and the Zagreb University Computing Center for providing computer resources. The third and fourth authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0041).

Institutional Review Board Statement

Ethical review and approval were waived for this study because the tests had the form of a midterm exam.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
  2. Banzhaf, W.; Nordin, P.; Keller, R.E.; Francone, F.D. Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998. [Google Scholar]
  3. Langdon, W.B.; Poli, R. Foundations of Genetic Programming; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
  4. Javed, F.; Bryant, B.R.; Črepinšek, M.; Mernik, M.; Sprague, A. Context-Free Grammar Induction Using Genetic Programming. In Proceedings of the 42nd Annual Southeast Regional Conference, ACM-SE 42, Huntsville, AL, USA, 2–3 April 2004; Association for Computing Machinery: New York, NY, USA, 2004; pp. 404–405. [Google Scholar]
  5. Hrnčič, D.; Mernik, M.; Bryant, B.R.; Javed, F. A memetic grammar inference algorithm for language learning. Appl. Soft Comput. 2012, 12, 1006–1020. [Google Scholar] [CrossRef]
  6. Cramer, S.; Kampouridis, M.; Freitas, A.A. Decomposition genetic programming: An extensive evaluation on rainfall prediction in the context of weather derivatives. Appl. Soft Comput. 2018, 70, 208–224. [Google Scholar] [CrossRef]
  7. Michell, K.; Kristjanpoller, W. Strongly-typed genetic programming and fuzzy inference system: An embedded approach to model and generate trading rules. Appl. Soft Comput. 2020, 90, 106169. [Google Scholar] [CrossRef]
  8. Contador, S.; Velasco, J.M.; Garnica, O.; Hidalgo, J.I. Glucose forecasting using genetic programming and latent glucose variability features. Appl. Soft Comput. 2021, 110, 107609. [Google Scholar] [CrossRef]
  9. Batot, E.R.; Sahraoui, H. Promoting social diversity for the automated learning of complex MDE artifacts. Softw. Syst. Model. 2022, 21, 1159–1178. [Google Scholar] [CrossRef]
  10. Poli, R. A Simple but Theoretically-Motivated Method to Control Bloat in Genetic Programming. In Genetic Programming; Ryan, C., Soule, T., Keijzer, M., Tsang, E., Poli, R., Costa, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 204–217. [Google Scholar]
  11. Javed, N.; Gobet, F.; Lane, P. Simplification of genetic programs: A literature survey. Data Min. Knowl. Discov. 2022, 36, 1279–1300. [Google Scholar] [CrossRef]
  12. Song, A.; Chen, D.; Zhang, M. Contribution based bloat control in Genetic Programming. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar] [CrossRef]
  13. Mernik, M. Formal and Practical Aspects of Domain-Specific Languages: Recent Developments; IGI Global: Hershey, PA, USA, 2013. [Google Scholar]
  14. Kosar, T.; Bohra, S.; Mernik, M. Domain-Specific Languages: A Systematic Mapping Study. Inf. Softw. Technol. 2016, 71, 77–91. [Google Scholar] [CrossRef]
  15. Kosar, T.; Mernik, M.; Carver, J.C. Program comprehension of domain-specific and general-purpose languages: Comparison using a family of experiments. Empir. Softw. Eng. 2012, 17, 276–304. [Google Scholar] [CrossRef]
  16. Kosar, T.; Gaberc, S.; Carver, J.C.; Mernik, M. Program comprehension of domain-specific and general-purpose languages: Replication of a family of experiments using integrated development environments. Empir. Softw. Eng. 2018, 23, 2734–2763. [Google Scholar] [CrossRef]
  17. Law, M.; Russo, A.; Bertino, E.; Broda, K.; Lobo, J. Representing and Learning Grammars in Answer Set Programming. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA, 27 January–1 February 2019; pp. 229–240. [Google Scholar]
  18. Kovačević, Ž.; Mernik, M.; Ravber, M.; Črepinšek, M. From Grammar Inference to Semantic Inference—An Evolutionary Approach. Mathematics 2020, 8, 816. [Google Scholar] [CrossRef]
  19. de la Higuera, C. Grammatical Inference: Learning Automata and Grammars; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  20. Henriques, P.R.; Varanda Pereira, M.J.; Mernik, M.; Lenič, M.; Avdičaušević, E.; Žumer, V. Automatic Generation of Language-based Tools. Electron. Notes Theor. Comput. Sci. 2002, 65, 77–96. [Google Scholar] [CrossRef]
  21. Ravber, M.; Kovačević, Ž.; Črepinšek, M.; Mernik, M. Inferring Absolutely Non-Circular Attribute Grammars with a Memetic Algorithm. Appl. Soft Comput. 2021, 100, 106956. [Google Scholar] [CrossRef]
  22. Kovačević, Ž.; Ravber, M.; Liu, S.H.; Črepinšek, M. Automatic compiler/interpreter generation from programs for Domain-Specific Languages: Code bloat problem and performance improvement. J. Comput. Lang. 2022, 70, 101105. [Google Scholar] [CrossRef]
  23. Deransart, P.; Jourdan, M. (Eds.) Proceedings of the International Conference WAGA on Attribute Grammars and Their Applications; Springer: Berlin/Heidelberg, Germany, 1990. [Google Scholar]
  24. Alblas, H.; Melichar, B. (Eds.) Attribute Grammars, Applications and Systems. In Proceedings of the International Summer School SAGA, Prague, Czechoslovakia, 4–13 June 1991; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1991; Volume 545. [Google Scholar]
  25. Mernik, M.; Korbar, N.; Žumer, V. LISA: A Tool for Automatic Language Implementation. SIGPLAN Not. 1995, 30, 71–79. [Google Scholar] [CrossRef]
  26. Carver, J.C.; Jaccheri, L.; Morasca, S.; Shull, F. A checklist for integrating student empirical studies with research and teaching goals. Empir. Softw. Eng. 2010, 65, 35–59. [Google Scholar] [CrossRef]
  27. Storey, M.A. Theories, methods and tools in program comprehension: Past, present and future. In Proceedings of the 13th International Workshop on Program Comprehension (IWPC’05), St. Louis, MO, USA, 15–16 May 2005; pp. 181–191. [Google Scholar] [CrossRef]
  28. Weintrop, D.; Wilensky, U. How block-based, text-based, and hybrid block/text modalities shape novice programming practices. Int. J.-Child-Comput. Interact. 2018, 17, 83–92. [Google Scholar] [CrossRef]
  29. Lin, Y.; Weintrop, D. The landscape of Block-based programming: Characteristics of block-based environments and how they support the transition to text-based programming. J. Comput. Lang. 2021, 67, 101075. [Google Scholar] [CrossRef]
  30. Politowski, C.; Khomh, F.; Romano, S.; Scanniello, G.; Petrillo, F.; Guéhéneuc, Y.G.; Maiga, A. A large scale empirical study of the impact of Spaghetti Code and Blob anti-patterns on program comprehension. Inf. Softw. Technol. 2020, 122, 106278. [Google Scholar] [CrossRef]
  31. Johanson, A.; Hasselbring, W. Effectiveness and efficiency of a domain-specific language for high-performance marine ecosystem simulation: A controlled experiment. Empir. Softw. Eng. 2017, 22, 2206–2236. [Google Scholar] [CrossRef]
  32. Fronchetti, F.; Ritschel, N.; Holmes, R.; Li, L.; Soto, M.; Jetley, R.; Wiese, I.; Shepherd, D. Language impact on productivity for industrial end users: A case study from Programmable Logic Controllers. J. Comput. Lang. 2022, 69, 101087. [Google Scholar] [CrossRef]
  33. Nugroho, A. Level of detail in UML models and its impact on model comprehension: A controlled experiment. Inf. Softw. Technol. 2009, 51, 1670–1685. [Google Scholar] [CrossRef]
  34. Meliá, S.; Cachero, C.; Hermida, J.; Aparicio, E. Comparison of a textual versus a graphical notation for the maintainability of MDE domain models: An empirical pilot study. Softw. Qual. J. 2016, 24, 709–735. [Google Scholar] [CrossRef]
  35. Gardner, H.; Blackwell, A.F.; Church, L. The patterns of user experience for sticky-note diagrams in software requirements workshops. J. Comput. Lang. 2020, 61, 100997. [Google Scholar] [CrossRef]
  36. Mishra, P.; Kumar, S.; Chaube, M.K.; Shrawankar, U. ChartVi: Charts summarizer for visually impaired. J. Comput. Lang. 2022, 69, 101107. [Google Scholar] [CrossRef]
  37. Dominic, J.; Tubre, B.; Houser, J.; Ritter, C.; Kunkel, D.; Rodeghero, P. Program Comprehension in Virtual Reality. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Korea, 13–15 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 391–395. [Google Scholar]
  38. Al-Saiyd, N.A. Source code comprehension analysis in software maintenance. In Proceedings of the 2017 2nd International Conference on Computer and Communication Systems (ICCCS), Krakow, Poland, 11–14 July 2017; pp. 1–5. [Google Scholar] [CrossRef]
  39. Brooks, R. Using a Behavioral Theory of Program Comprehension in Software Engineering. In Proceedings of the 3rd International Conference on Software Engineering, ICSE ’78, Atlanta, GA, USA, 10–12 May 1978; IEEE Press: Piscataway, NJ, USA, 1978; pp. 196–201. [Google Scholar]
  40. Wagner, N.; Michalewicz, Z. Genetic programming with efficient population control for financial time series prediction. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation Late Breaking Papers, San Francisco, CA, USA, 7–11 July 2001; Volume 1, pp. 458–462. [Google Scholar]
  41. Dignum, S.; Poli, R. Operator Equalisation and Bloat Free GP. In Genetic Programming; O’Neill, M., Vanneschi, L., Gustafson, S., Esparcia Alcázar, A.I., De Falco, I., Della Cioppa, A., Tarantino, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 110–121. [Google Scholar]
  42. Poli, R.; McPhee, N.F. Parsimony Pressure Made Easy. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (GECCO’08), Atlanta, GA, USA, 12–16 July 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 1267–1274. [Google Scholar] [CrossRef]
  43. Silva, S.; Almeida, J. Dynamic Maximum Tree Depth: A Simple Technique for Avoiding Bloat in Tree-Based GP. In Proceedings of the Genetic and Evolutionary Computation-GECCO 2003: Part II, Chicago, IL, USA, 12–16 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 1776–1787. [Google Scholar]
  44. Silva, S.; Costa, E. Dynamic Limits for Bloat Control: Variations on Size and Depth. In Proceedings of the Genetic and Evolutionary Computation—GECCO 2004, Seattle, WA, USA, 26–30 June 2004; Volume 3103, pp. 666–677. [Google Scholar]
  45. Silva, S.; Silva, P.J.; Costa, E. Resource-Limited Genetic Programming: Replacing Tree Depth Limits. In Adaptive and Natural Computing Algorithms; Springer: Berlin/Heidelberg, Germany, 2005; pp. 243–246. [Google Scholar]
  46. Chu, T.H.; Nguyen, Q.U. Reducing Code Bloat in Genetic Programming based on Subtree Substituting Technique. In Proceedings of the 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES), Hanoi, Vietnam, 15–17 November 2017; pp. 25–30. [Google Scholar] [CrossRef]
  47. Trujillo, L.; Muñoz, L.; Galván-López, E.; Silva, S. neat Genetic Programming: Controlling bloat naturally. Inf. Sci. 2016, 333, 21–43. [Google Scholar] [CrossRef]
  48. Silva, S. Reassembling Operator Equalisation: A Secret Revealed. ACM SIGEVOlution 2011, 5, 10–22. [Google Scholar] [CrossRef]
  49. Lin, Y.; Bhanu, B. MDL-based Genetic Programming for Object Detection. In Proceedings of the 2003 Conference on Computer Vision and Pattern Recognition Workshop, Madison, WI, USA, 16–22 June 2003; Volume 6, p. 60. [Google Scholar] [CrossRef]
  50. Rissanen, J. A Universal Prior for Integers and Estimation by Minimum Description Length. Ann. Stat. 1983, 11, 416–431. [Google Scholar] [CrossRef]
  51. Wong, P.; Zhang, M. Effects of program simplification on simple building blocks in Genetic Programming. In Proceedings of the 2007 IEEE Congress on Evolutionary Computation, Singapore, 25–28 September 2007; pp. 1570–1577. [Google Scholar] [CrossRef]
  52. Wong, P.; Zhang, M. Algebraic Simplification of GP Programs during Evolution. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO’06), Seattle, WA, USA, 8–12 July 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 927–934. [Google Scholar] [CrossRef]
  53. Poli, R.; Langdon, W.B. Genetic Programming with One-Point Crossover. In Soft Computing in Engineering Design and Manufacturing; Chawdhry, P.K., Roy, R., Pant, R.K., Eds.; Springer: London, UK, 1998; pp. 180–189. [Google Scholar]
  54. Trujillo, L. Genetic Programming with One-Point Crossover and Subtree Mutation for Effective Problem Solving and Bloat Control. Soft Comput. 2011, 15, 1551–1567. [Google Scholar] [CrossRef]
  55. de Jong, E.D.; Watson, R.A.; Pollack, J.B. Reducing Bloat and Promoting Diversity Using Multi-Objective Methods. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation (GECCO’01), San Francisco, CA, USA, 7–11 July 2001; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001; pp. 11–18. [Google Scholar]
  56. Poli, R. Covariant Tarpeian Method for Bloat Control in Genetic Programming. In Genetic Programming Theory and Practice VIII; Springer: New York, NY, USA, 2011; pp. 71–89. [Google Scholar] [CrossRef]
  57. Luke, S.; Panait, L. A Comparison of Bloat Control Methods for Genetic Programming. Evol. Comput. 2006, 14, 309–344. [Google Scholar] [CrossRef]
  58. Črepinšek, M.; Liu, S.H.; Mernik, M.; Ravber, M. Long Term Memory Assistance for Evolutionary Algorithms. Mathematics 2019, 7, 1129. [Google Scholar] [CrossRef]
  59. Johnston, T. Chapter 3—The Relational Paradigm: Mathematics. In Bitemporal Data; Johnston, T., Ed.; Morgan Kaufmann: Boston, MA, USA, 2014; pp. 35–41. [Google Scholar] [CrossRef]
  60. Dwyer, B. Chapter 2—Mathematical background. In Systems Analysis and Synthesis; Dwyer, B., Ed.; Morgan Kaufmann: Boston, MA, USA, 2016; pp. 23–78. [Google Scholar] [CrossRef]
  61. Fister, I.; Fister, I.; Mernik, M.; Brest, J. Design and implementation of domain-specific language easytime. Comput. Lang. Syst. Struct. 2011, 37, 151–167. [Google Scholar] [CrossRef]
  62. Feldt, R.; Magazinius, A. Validity Threats in Empirical Software Engineering Research—An Initial Survey. In Proceedings of the 22nd International Conference on Software Engineering & Knowledge Engineering (SEKE’2010), Redwood City, San Francisco Bay, CA, USA, 1–3 July 2010; Knowledge Systems Institute Graduate School: San Francisco Bay, CA, USA, 2010; pp. 374–379. [Google Scholar]
  63. Wohlin, C.; Runeson, P.; Höst, M.; Ohlsson, M.C.; Regnell, B.; Wesslén, A. Experimentation in Software Engineering; Springer Publishing Company: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  64. Ralph, P.; Tempero, E. Construct Validity in Software Engineering Research and Software Metrics. In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018, EASE’18, Christchurch, New Zealand, 27–29 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 13–23. [Google Scholar] [CrossRef]
  65. Sjoberg, D.I.; Bergersen, G.R. Construct Validity in Software Engineering. IEEE Trans. Softw. Eng. 2022. [Google Scholar] [CrossRef]
  66. Bruns, R.; Dunkel, J. Bat4CEP: A bat algorithm for mining of complex event processing rules. Appl. Intell. 2022. [Google Scholar] [CrossRef]
  67. Sun, Y.; Demirezen, Z.; Mernik, M.; Gray, J.G.; Bryant, B.R. Is My DSL a Modeling or Programming Language. In Proceedings of the 2nd International Workshop on Domain-Specific Program Development (DSPD), Nashville, TN, USA, 22 October 2008; p. 4. [Google Scholar]
  68. Wu, H.; Gray, J.; Roychoudhury, S.; Mernik, M. Weaving a Debugging Aspect into Domain-Specific Language Grammars. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, Santa Fe, New Mexico, 13–17 March 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 1370–1374. [Google Scholar] [CrossRef]
  69. Carver, J.C. Towards Reporting Guidelines for Experimental Replications: A Proposal. In Proceedings of the 1st International Workshop on Replication in Empirical Software Engineering, Cape Town, South Africa, 1–8 May 2010. [Google Scholar]
  70. Evans, B.P.; Xue, B.; Zhang, M. What is inside the Black-Box? A Genetic Programming Method for Interpreting Complex Machine Learning Models. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO’19, Prague, Czech Republic, 13–17 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1012–1020. [Google Scholar] [CrossRef]
Figure 1. Protocol of the controlled experiment.
Figure 2. The semantic tree for the program “begin right up left down end” based on a manually written AG.
Figure 3. The semantic tree for the program “begin right up left down end” based on an automatically generated AG.
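To make the program in Figures 2 and 3 concrete, the sketch below shows the semantics such an AG computes, assuming the usual robot-language formulation in which each of the four commands shifts an (x, y) position by one unit; the attribute names and the actual semantic rules of the experiment's AGs are not reproduced here, so this is an illustrative reading only.

```python
# Minimal sketch (assumption): the semantics of the robot language, where a
# manually written AG would thread an (x, y) position through the command
# list as inherited/synthesized attributes. Command names follow the
# program shown in Figures 2 and 3; everything else is illustrative.

# Effect of each command on the position, as a semantic rule would encode it.
MOVES = {
    "left":  (-1, 0),
    "right": (1, 0),
    "up":    (0, 1),
    "down":  (0, -1),
}

def evaluate(program: str) -> tuple[int, int]:
    """Fold the position attribute over the command list."""
    tokens = program.split()
    assert tokens[0] == "begin" and tokens[-1] == "end"
    x, y = 0, 0  # inherited position at the first command
    for command in tokens[1:-1]:
        dx, dy = MOVES[command]
        x, y = x + dx, y + dy  # synthesized position passed to the next command
    return x, y

# The program from Figures 2 and 3 returns to the origin: prints (0, 0).
print(evaluate("begin right up left down end"))
```

Under this reading, the example program ends where it started, at (0, 0).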
Table 1. Background study on knowledge (Mann–Whitney test).

Knowledge on        Part            N   Mean  Std. Dev.  Median  Mean Rank  Z       p-Value
Programming         Group I (FERI)  33  4.03  0.68       4       25.30      −0.239  0.811
                    Group II (FRI)  16  3.94  0.77       4       24.38
Compilers           Group I (FERI)  33  2.88  0.89       3       21.58      −2.559  0.011
                    Group II (FRI)  16  3.63  0.72       3.5     32.06
Attribute Grammars  Group I (FERI)  33  2.60  0.86       2       22.50      −1.871  0.061
                    Group II (FRI)  16  3.06  0.68       3       30.16
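Tables 1–4, 6 and 8 all report Mann–Whitney comparisons of the two groups. As a reading aid only, the following is a minimal sketch of such a comparison; SciPy and the sample ratings are assumptions (the Z and mean-rank columns suggest SPSS-style output, but the software used for the analysis is not restated here).

```python
# Minimal sketch (assumption): a two-sided Mann-Whitney comparison of two
# groups' ratings, in the style of Table 1. The data are placeholders,
# not the experiment's raw data.
from math import sqrt
from scipy.stats import mannwhitneyu

group_1 = [4, 4, 5, 3, 4, 4, 5, 4]  # Group I (FERI), illustrative only
group_2 = [4, 3, 4, 5, 4, 4, 3, 4]  # Group II (FRI), illustrative only

u_stat, p_value = mannwhitneyu(group_1, group_2, alternative="two-sided")

# SPSS-style Z statistic via the normal approximation (no tie correction).
n1, n2 = len(group_1), len(group_2)
z = (u_stat - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

print(f"U = {u_stat:.1f}, Z = {z:.3f}, p = {p_value:.3f}")
```

Note that the exact p-values in Tables 1–3 are consistent with their Z statistics (e.g., Z = −0.239 gives a two-tailed p of 0.811), which is how the spurious “<” signs in the extracted cells were resolved.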
Table 2. Background study on interest (Mann–Whitney test).

Interest in  Part            N   Mean  Std. Dev.  Median  Mean Rank  Z       p-Value
Programming  Group I (FERI)  33  4.36  0.74       4       25.42      −0.335  0.737
             Group II (FRI)  16  4.38  0.50       4       24.13
Compilers    Group I (FERI)  33  2.45  1.15       3       19.80      −3.785  <0.001
             Group II (FRI)  16  3.94  0.85       4       35.72
Table 3. Comprehension correctness of the first test (Mann–Whitney test).

Part            Mean   N   Std. Dev.  Median  Mean Rank  Z       p-Value
Group I (FERI)  51.96  33  28.29      42.90   21.47      −2.513  0.012
Group II (FRI)  72.31  16  19.84      71.40   32.28
Table 4. Comprehension correctness of the second test (Mann–Whitney test).

Part            Mean   N   Std. Dev.  Median  Mean Rank  Z       p-Value
Group I (FERI)  77.46  26  24.30      85.70   26.38      −3.347  <0.001
Group II (FRI)  51.79  16  19.41      42.90   13.56
Table 5. Average correctness of tasks in both tests.

         1st Test                         2nd Test
Task     Group I (FERI)  Group II (FRI)   Group I (FERI)  Group II (FRI)
Q1       60.61%          87.50%           92.31%          37.50%
Q2       51.52%          87.50%           65.38%          25.00%
Q3       54.55%          75.00%           76.92%          0.00%
Q4       57.58%          50.00%           69.23%          87.50%
Q5       24.24%          37.50%           84.62%          81.25%
Q6       51.52%          93.75%           73.08%          43.75%
Q7       60.61%          75.00%           80.77%          87.50%
Average  51.52%          72.32%           77.47%          51.79%
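The Average row is the per-column arithmetic mean of tasks Q1–Q7; for instance, for Group I (FERI) in the first test, (60.61 + 51.52 + 54.55 + 57.58 + 24.24 + 51.52 + 60.61) / 7 ≈ 51.52%.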
Table 6. Comprehension efficiency of the first test (Mann–Whitney test).

Part            Mean  N   Std. Dev.  Median  Mean Rank  Z       p-Value
Group I (FERI)  1.14  33  0.58       1.15    20.17      −3.401  <0.001
Group II (FRI)  2.08  16  1.02       1.945   34.97
Table 7. Comprehension efficiency of the second test (independent-samples t-test).

Part            Mean  N   Std. Dev.  Median  t      df  p-Value
Group I (FERI)  1.98  26  0.77       1.86    4.771  40  <0.001
Group II (FRI)  0.99  16  0.40       1.05
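Table 7 switches to an independent-samples t-test, presumably because these scores met the test's assumptions; note that df = 26 + 16 − 2 = 40, which matches the table and is how the garbled “4.77140” cell was split into t = 4.771 and df = 40. A minimal sketch, again assuming SciPy and placeholder data:

```python
# Minimal sketch (assumption): an independent-samples t-test in the style
# of Table 7, comparing the comprehension efficiency scores of the two
# groups. Placeholder data; with the paper's group sizes of 26 and 16,
# the degrees of freedom would be 26 + 16 - 2 = 40.
from scipy.stats import ttest_ind

efficiency_1 = [1.9, 2.3, 1.6, 2.1, 1.8, 2.4]  # Group I (FERI), illustrative
efficiency_2 = [1.0, 0.8, 1.2, 0.9, 1.1, 0.7]  # Group II (FRI), illustrative

t_stat, p_value = ttest_ind(efficiency_1, efficiency_2, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```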
Table 8. Feedback on simplicity of AGs (Mann–Whitney test).

Test      Part            N   Mean  Std. Dev.  Median  Mean Rank  Z       p-Value
1st Test  Group I (FERI)  33  2.66  0.45       2.71    19.56      −3.852  <0.001
          Group II (FRI)  16  3.25  0.42       3.14    36.22
2nd Test  Group I (FERI)  28  3.38  0.76       3.43    25.93      −2.811  0.005
          Group II (FRI)  15  2.73  0.36       2.71    14.67