
1 Introduction

Browsers have become the main interface to almost all online content for almost all users. As a result, they have also become extremely sophisticated. A modern browser renders content using a wide variety of interconnected components with interpreters for a growing set of languages such as JavaScript, Flash, Java, and XSLT. Small wonder that browsers have turned into prime targets for attackers, who routinely exploit the embedded interpreters to launch sophisticated attacks [1]. For instance, the JavaScript interpreter in modern browsers (e.g., SpiderMonkey in Firefox) is responsible for many high-impact vulnerabilities [2]. Unfortunately, the nature and complexity of these interpreters are currently well beyond state-of-the-art bug-finding techniques, and further research is therefore necessary [3]. In this paper, we propose a novel evolutionary fuzzing technique that explicitly targets interpreters.

Fuzz testing is a common approach for finding vulnerabilities in software [48]. Many fuzzers exist, ranging from simple random input generators to highly sophisticated testing tools. In this paper, we build on evolutionary fuzzing, which has proven particularly effective in improving fuzzing efficiency [5, 9, 10] and uses evolutionary computing to generate inputs that expose vulnerabilities. While fuzzing is an efficient testing technique in general, applying it to interpreters brings its own challenges. Below, we list a few of the issues that we observed in our investigations:

  1. Traditionally, fuzzing mutates the input that a program consumes. In the case of an interpreter, the input is itself a program (code), which must be mutated.

  2. Interpreter fuzzers must generate syntactically valid inputs; otherwise, the inputs will not pass the interpreter's elementary checks (mainly the parsing phase) and testing will be restricted to the input-checking part of the interpreter. The input grammar is therefore a key consideration in this scenario. For instance, if the JavaScript interpreter is the target, the fuzzed input must follow the syntax specification of the JavaScript language, lest the inputs be discarded early in the parsing phase.

  3. An interpreter may use a somewhat different (or evolved) version of the grammar than the one publicly known. These small variations are important to consider when attempting to fuzz the interpreter fully.

Genetic Programming is a variant of evolutionary algorithms, inspired by biological evolution, that brings transparency to its decision making. It follows Darwin's theory of evolution and generates new individuals in the ecosystem by recombining characteristics of the current individuals with the highest fitness. Fitness is a value computed by an objective function that directs the evolution process. Genetic Programming exploits the modularity and re-usability of solution fragments within the problem space to improve the fitness of individuals. This approach has been shown to be very appropriate for generating code fragments [11–13], but has not been used for fuzz testing in general, as program inputs are typically unstructured and highly inter-dependent. However, our key insight is that, as described before, interpreter fuzzing is a special case: with code as input, Genetic Programming is a natural fit!

In this paper, we introduce a framework called IFuzzer, which generates code fragments using Genetic Programming [14], allowing us to test interpreters with a black-box fuzzing technique that mainly looks for vulnerabilities such as memory corruption. IFuzzer takes a language's context-free grammar as input for test generation. It uses the grammar to generate parse trees and to extract code fragments from a given test suite. For instance, IFuzzer can take the JavaScript grammar and the test suite of the SpiderMonkey interpreter as input and generate parse trees and relevant code fragments for in-depth testing. IFuzzer leverages the fitness improvement mechanism within Genetic Programming to improve the quality of the generated code fragments.

Fig. 1. Overview of the IFuzzer approach

Figure 1 gives an overview of IFuzzer. The fuzzer takes as input a test suite, a language grammar, and sample code. The parser module uses the language grammar to parse each program and generate an abstract syntax tree. The fragment pool extractor builds a pool of code fragments, extracted from the sample code inputs, for the different nodes (non-terminals) in the grammar. The code generator generates new code fragments by performing genetic operations on the test suite. The interpreter executes all the generated code fragments. Based on the feedback from the interpreter, the fragments are evaluated by the fitness evaluator and accordingly used (or discarded) for future generations of inputs. We evaluated IFuzzer on two versions of the Mozilla JavaScript interpreter. Initially, we configured it to target SpiderMonkey 1.8.5 in order to compare with LangFuzz [3], a state-of-the-art mutation fuzzer for interpreter testing. In another experiment, we configured IFuzzer to target the latest builds of SpiderMonkey. Apart from finding several bugs that were also found by LangFuzz, IFuzzer found new exploitable bugs in these versions.

In summary, this paper makes the following contributions:

  1. We introduce a fully automated and systematic approach to code generation for interpreter testing by mapping the problem of interpreter fuzz testing onto the space of evolutionary computing for code generation. By doing so, we establish a path for applying advances in evolutionary approaches to the field of interpreter fuzzing.

  2. We show that Genetic Programming techniques for code generation produce a diverse range of code fragments, making the approach very suitable for interpreter fuzzing. We attribute this to the inherent randomness in Genetic Programming.

  3. We propose a fitness (objective) function, based on an analysis of different code parameters, that guides the fuzzer towards generating inputs that can trigger uncommon behavior within interpreters.

  4. We implement these techniques in a full-fledged, to-be-open-sourced fuzzing tool called IFuzzer that can target any language interpreter with minimal configuration changes.

  5. We show the effectiveness of IFuzzer empirically by finding new bugs in Mozilla's JavaScript engine SpiderMonkey, including several exploitable security vulnerabilities.

The rest of the paper is organized as follows. Section 2 presents the motivation for choosing Genetic Programming for code generation. We explain the implementation of IFuzzer in Sect. 3. Section 4 discusses the experimental setup and evaluation of IFuzzer. Section 5 remarks on IFuzzer's design decisions, Sect. 6 surveys related work, and Sect. 7 concludes the work with comments on possible future work.

2 Genetic Programming

Evolutionary algorithms build up a search space for finding solutions to optimization problems by evolving a population of individuals. An objective function evaluates the fitness of these individuals and provides feedback for the next generations. These algorithms build on the Darwinian principle of natural selection and biologically inspired genetic operations. In prior work, genetic algorithms proved successful in the generation of test cases [13, 15].

Genetic Programming (GP) [14, 16] generates a population by following a similar process to most genetic algorithms, but it represents the individuals it manipulates as tree structures. Out of the many variants of GP in the literature, we follow Grammar-based Genetic Programming (GGP). In GGP, the individuals are programs generated according to the rules of a context-free grammar, and they are represented by their parse trees. This procedure is a natural fit for interpreters. All individuals in a new generation are the result of applying the genetic operators, crossover and mutation, to the parse tree structures.

Search Space: The search space is the set of all feasible solutions. Each point in the space represents a solution, characterized by its fitness value or other values related to the individual. Based on the fitness constraints, the individual with the highest fitness is considered the best feasible solution.

Bloating: Bloating [16] is a phenomenon that adversely affects input generation in evolutionary computing. There are two types of bloating: structural and functional bloating.

  • Structural Bloating: After a certain number of generations, the average size of the individuals (i.e., the code) grows rapidly due to uncontrolled growth [17]. This results in inefficient code, while the growth hardly contributes to any improvement in fitness. Moreover, large programs require more computation to process.

  • Functional Bloating: In functional bloating [18], the range of fitness values becomes narrow, which effectively reduces the search space. Over time, many different individuals end up with the same fitness, and as a result it becomes hard to distinguish and select among them.

As the process of fuzzing may run for a very long period, neglecting or failing to handle the bloating problem may lead to very unproductive results.

2.1 Representation of the Individuals

We consider input code to be the individuals manipulated by GP. Each individual is represented by its parse tree, generated using the corresponding language grammar. IFuzzer performs all its genetic operations on these parse trees and generates new individuals (input code) from the resulting parse trees. Figure 2 illustrates an example of a valid program for the simple language grammar (Listing 1.1) and the corresponding derived parse tree.

[Listing 1.1: grammar of the simple example language]

Fig. 2. Example of a syntactically valid program and its derived parse tree

2.2 Fragment Pool

The fragment pool is a collection of code fragments for each non-terminal in the grammar specification. Each code fragment in a program can be tagged with a non-terminal in the grammar specification. Using the parser, IFuzzer parses all the input files in the test suite and extracts the corresponding code fragments for the different non-terminals. With a sufficient number of input files, we can collect code fragments for all non-terminals in the language grammar. IFuzzer stores these code fragments as trees rooted at the corresponding non-terminal and later uses them for mutation and code generation. The same process of generating parse trees is followed in the crossover operation to identify code fragments for a selected common non-terminal between the participating individuals. An example of a fragment pool for the parse tree of Fig. 2 is shown in the box below.

[Box: example fragment pool extracted from the parse tree of Fig. 2]
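
To make the construction of the fragment pool concrete, the following is a minimal sketch in Python. It assumes parse trees are encoded as nested [label, children] lists, where leaves are terminals; the function name and the toy tree are illustrative, not IFuzzer's actual API.

```python
from collections import defaultdict

def extract_fragments(tree, pool):
    """Record every subtree of `tree` in `pool`, keyed by its root non-terminal.

    `tree` is a [label, children] list; a leaf has an empty children list
    and represents a terminal token.
    """
    label, children = tree
    if children:                      # internal node, i.e., a non-terminal
        pool[label].append(tree)      # store the whole fragment, root included
        for child in children:
            extract_fragments(child, pool)

# Example: fragments of the parse tree for "x = 1" under a toy grammar.
assign = ["Assign", [["Identifier", [["x", []]]], ["=", []],
                     ["Expr", [["1", []]]]]]
pool = defaultdict(list)
extract_fragments(assign, pool)
# pool now maps Assign, Identifier, and Expr to their subtrees;
# over a whole test suite, the same walk is repeated per parsed program.
```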

3 Implementation

We implement IFuzzer as a proof of concept based on the methods discussed in the previous sections. It works as described in the overview diagram of Fig. 1; in the following, we elaborate on IFuzzer's individual components.

3.1 Code Generation

In this section, we explain various genetic operators that IFuzzer uses for input generation. After each genetic operation, the objective function, discussed in Sect. 3.3, evaluates the fitness of the offspring.

IFuzzer generates a parser for the target language using the ANTLR parser generator framework [19], with the language grammar as input. The initial population, the fragment pool generation (discussed in Sect. 2), and the crossover and mutation operations all make use of the parse trees returned by this parser.
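
For illustration, the snippet below obtains a parse tree with ANTLR's Python runtime. The ECMAScriptLexer/ECMAScriptParser modules and the program start rule are assumptions about classes generated beforehand from a hypothetical ECMAScript.g4 grammar; they are not names fixed by IFuzzer.

```python
from antlr4 import FileStream, CommonTokenStream

# Assumed to be generated by: antlr4 -Dlanguage=Python3 ECMAScript.g4
from ECMAScriptLexer import ECMAScriptLexer
from ECMAScriptParser import ECMAScriptParser

def parse_program(path):
    """Parse a JavaScript file and return its ANTLR parse tree."""
    lexer = ECMAScriptLexer(FileStream(path, encoding="utf-8"))
    parser = ECMAScriptParser(CommonTokenStream(lexer))
    return parser.program()  # 'program' assumed to be the grammar's start rule
```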

Initial Population. The initial population consists of a random selection of programs, equal in number to the population size, from the input test samples. This forms the first generation. After each generation, individuals from the parent set undergo genetic operations and thereby evolve into offspring.

Mutation. During mutation, IFuzzer selects random code fragments of the input code for replacement. It performs the replacement either by selecting a random member of the fragment pool that corresponds to the same non-terminal, or by generating a new fragment using an expansion algorithm. Our expansion algorithm assumes that all production rules have equal, fixed selection probabilities. We use the following expansion algorithm (a code sketch follows the list):

  1. Select the non-terminal n from the parse tree to expand.

  2. From the production rules of the grammar, select a production for n and replace n with its right-hand side.

  3. Repeat the following steps for up to num iterations:

    (a) Identify a random set N of non-terminals in the resulting incomplete parse tree.

    (b) For each identified non-terminal \(n \in N\), extract its set of production rules \(P_{n}\) (i.e., \(P_{n}\) \(\subseteq \) P) from the production rules P listed in the grammar specification.

    (c) Select a production \(P_{selected}\) \(\in \) \(P_{n}\) at random for each identified non-terminal \(n \in N\).

    (d) Replace each non-terminal's occurrence with the right-hand side of its \(P_{selected}\).

  4. After expansion, replace all remaining occurrences of non-terminals with corresponding code fragments, selected randomly from the fragment pool. Note that steps 3 and 4 also solve the problem of non-termination in the presence of mutually recursive production rules.
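
The following sketch is one way to implement steps 1–4. For brevity it operates on a sentential form (a flat list of grammar symbols) rather than an explicit tree; the grammar and fragment-pool encodings are assumptions for illustration.

```python
import random

def expand(n, grammar, fragment_pool, num=4):
    """Randomly expand non-terminal `n` to a bounded depth (steps 1-4 above).

    grammar: non-terminal -> list of productions; a production is a list of
             symbols, each either a non-terminal key or a terminal string.
    fragment_pool: non-terminal -> list of terminal-only replacement strings.
    """
    # Step 2: replace n by the right-hand side of a random production.
    form = list(random.choice(grammar[n]))
    # Step 3: for `num` iterations, expand a random subset of non-terminals.
    for _ in range(num):
        idxs = [i for i, s in enumerate(form) if s in grammar]
        if not idxs:
            break
        chosen = set(random.sample(idxs, random.randint(1, len(idxs))))
        nxt = []
        for i, s in enumerate(form):
            nxt.extend(random.choice(grammar[s]) if i in chosen else [s])
        form = nxt
    # Step 4: close any remaining non-terminals with pool fragments.
    return [random.choice(fragment_pool[s]) if s in grammar else s
            for s in form]   # join with ' ' to obtain program text
```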

Fig. 3. Example of stepwise expansion on the parse tree: all dark nodes represent non-terminals and white nodes represent terminals. A particular node is selected and expanded as shown.

Figure 3 illustrates an example of the expansion algorithm. Dark nodes in the parse tree represent non-terminals and white nodes represent terminals. A dark node from the parse tree is selected during the mutation process and expanded to a certain depth (num) as discussed above. Even with more iterations, this algorithm is not guaranteed to yield a fully expanded tree: after expansion, we may still have unexpanded non-terminals. IFuzzer handles this by choosing code fragments from the fragment pool and replacing each remaining non-terminal with a fragment rooted at the same non-terminal. In this way, the tree converges to terminals and results in a valid parse tree.

Crossover. During crossover, for a given pair of individuals (parents), IFuzzer selects a common non-terminal n from the parse trees of the individuals and extracts a random fragment originating from n from each individual. The selected fragment from one individual is exchanged with the fragment of the other, generating two new parse trees. From these trees, IFuzzer generates two new offspring.
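
A minimal sketch of the crossover step, reusing the [label, children] tree encoding from Sect. 2.2; swapping exactly one fragment pair, as done here, is an illustrative choice.

```python
import copy
import random

def crossover(tree_a, tree_b):
    """Exchange two fragments rooted at a common non-terminal (on copies)."""
    a, b = copy.deepcopy(tree_a), copy.deepcopy(tree_b)

    def collect(tree, acc):
        label, children = tree
        if children:                       # internal node => non-terminal
            acc.setdefault(label, []).append(tree)
            for child in children:
                collect(child, acc)
        return acc

    nodes_a, nodes_b = collect(a, {}), collect(b, {})
    common = sorted(set(nodes_a) & set(nodes_b))
    if not common:
        return a, b                        # no shared non-terminal: unchanged
    label = random.choice(common)
    frag_a = random.choice(nodes_a[label])
    frag_b = random.choice(nodes_b[label])
    frag_a[1], frag_b[1] = frag_b[1], frag_a[1]   # swap the subtrees' bodies
    return a, b
```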

Replacement. During the process of offspring generation, it is important to retain the features of the best individuals (parents) participating in evolution. Therefore, IFuzzer adopts the common technique of fitness elitism to retain the best individuals among the parents in the next generation. IFuzzer generates the remaining population in the next generation by crossover and mutation. Elitism prevents losing the best configurations in the process.
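
A sketch of the replacement policy with elitism; the fitness, crossover, and mutate callables are placeholders for the components described elsewhere in this section.

```python
import random

def next_generation(population, fitness, elite_count, crossover, mutate):
    """Elitism: the best parents survive unchanged; breeding fills the rest."""
    survivors = sorted(population, key=fitness, reverse=True)[:elite_count]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        parent_a, parent_b = random.sample(population, 2)
        for child in crossover(parent_a, parent_b):
            if len(survivors) + len(offspring) < len(population):
                offspring.append(mutate(child))
    return survivors + offspring
```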

Reusing Literals. The code generation operations may result in semantically invalid fragments or a loss of context. For instance, after a modification, a statement in the program may use an identifier a that is not declared in the program. Introducing language semantics would tie IFuzzer to a language specification, so we instead perform generic semantic improvements at the syntactic level. Specifically, IFuzzer reduces the errors due to undeclared identifiers by renaming the identifiers around the modification points to identifiers declared elsewhere in the program. Since it knows the grammar rules that contain them, IFuzzer can easily extract such identifiers from the parse tree automatically. In our example of the undeclared variable a, IFuzzer maps it to another identifier b declared elsewhere and replaces all occurrences of a with b.
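
A hedged sketch of this repair on the same tree encoding. The node labels 'Identifier' and 'VariableDeclaration' are assumptions about the grammar; real scoping is ignored, and all occurrences of the same undeclared name are mapped to the same declared one, mirroring the a-to-b example above.

```python
import random

def collect_declared(tree, declared):
    """Harvest names from (assumed) declaration nodes in a [label, children] tree."""
    label, children = tree
    if label == "VariableDeclaration" and children:
        ident = children[0]               # assume first child is the Identifier
        if ident[0] == "Identifier" and ident[1]:
            declared.add(ident[1][0][0])
    for child in children:
        collect_declared(child, declared)

def repair_identifiers(tree, declared, mapping=None):
    """Consistently rename undeclared identifier uses to declared names."""
    if mapping is None:
        mapping = {}
    label, children = tree
    if label == "Identifier" and children:
        name = children[0][0]
        if declared and name not in declared:
            mapping.setdefault(name, random.choice(sorted(declared)))
            children[0][0] = mapping[name]
    for child in children:
        repair_identifiers(child, declared, mapping)
```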

3.2 Bloat Control

Bloat control can be applied at different levels [20]; IFuzzer uses it during the fitness evaluation and breeding stages:

Stage 1: Fitness Evaluation. Applying bloat control at the level of fitness evaluation is a common technique. In IFuzzer, we use parsimony pressure [21, 22] to alter the selection probability of individuals by penalizing larger ones.

Calculating the Parsimony Coefficient: The parsimony coefficient c(t) at each generation t is given by the ratio of the covariance between fitness and length to the variance of length [23]:

$$\begin{aligned} c(t) = \frac{Covariance(f,l)}{Variance(l)} = \frac{\sum _{i=0}^{n}(f_{i}-\bar{f})(l_{i}-\bar{l})}{n-1} \times \frac{n-1}{\sum _{i=0}^{n}(l_{i}-\bar{l})^2} \end{aligned}$$
(1)

where \(\bar{f}\) and \(\bar{l}\) are the mean fitness and length of all individuals in the population, and \(f_{i}\) and \(l_{i}\) are the original fitness and length of an individual i. Covariance(f, l) is the covariance between an individual's fitness and length, while Variance(l) is the variance of the lengths of the individuals. In Sect. 3.3, we will see that IFuzzer uses the parsimony coefficient to add a penalty to the fitness value.
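
In code, Eq. (1) reduces to the ratio of the sample covariance of fitness and length to the sample variance of length (the (n-1) factors cancel); a direct, illustrative transcription:

```python
def parsimony_coefficient(fitnesses, lengths):
    """c(t) = Covariance(f, l) / Variance(l) over the current generation (Eq. 1)."""
    n = len(lengths)
    f_bar = sum(fitnesses) / n
    l_bar = sum(lengths) / n
    cov = sum((f - f_bar) * (l - l_bar) for f, l in zip(fitnesses, lengths))
    var = sum((l - l_bar) ** 2 for l in lengths)
    return cov / var if var else 0.0  # constant lengths: no penalty needed
```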

Stage 2: Breeding. We also apply bloat control at the breeding level by means of fair size generation [16], which limits the growth of the offspring's program size. In our approach, we restrict the relative increase in program size to a biased value:

$$\begin{aligned} length_{generated\_code}/length_{original\_code} < bias_{threshold} \end{aligned}$$

where \(length_x\) is the number of non-terminals in the parse tree of x and \(bias_{threshold}\) is the threshold value for fair size generation. This restricts the size of the code: if the generated program fails to meet the constraint, IFuzzer discards it as invalid. In that case, it re-generates the program using the same GA operator with which it started. After a certain number of failed attempts, it discards the individual completely and excludes it from further consideration for offspring generation.
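
The breeding-level check and the retry-then-discard policy described above can be sketched as follows; the default bias threshold of 1.5 and the retry limit of 5 are illustrative values, not the paper's settings.

```python
def fair_size_ok(generated_len, original_len, bias_threshold=1.5):
    """length_generated / length_original < bias_threshold (lengths = #non-terminals)."""
    return generated_len / original_len < bias_threshold

def breed_with_retries(parent, operator, size_of, max_attempts=5):
    """Re-apply the same GA operator until the fair-size constraint holds."""
    for _ in range(max_attempts):
        child = operator(parent)
        if fair_size_ok(size_of(child), size_of(parent)):
            return child
    return None  # give up: individual excluded from offspring generation
```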

Finally, we use the delta debugging algorithm [24, 25] to determine the code fragments that are relevant for reproducing a failure and to filter out irrelevant fragments from the test cases, further reducing their size. This essentially isolates the part of the test case that is relevant to the failure [26] and, by reducing the number of lines of code executed, yields a small, valid test case that still triggers it.
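
For completeness, a compact version of the classic ddmin algorithm [24, 25], here reducing a list of program lines; fails is a predicate that re-runs the interpreter on a candidate and reports whether the original failure still reproduces.

```python
def ddmin(lines, fails):
    """Minimize `lines` to a smaller subset on which `fails` still holds."""
    n = 2
    while len(lines) >= 2:
        chunk = max(1, len(lines) // n)
        subsets = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = [ln for j, s in enumerate(subsets) if j != i for ln in s]
            if fails(complement):          # chunk i is irrelevant to the failure
                lines = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(lines):
                break                      # finest granularity reached
            n = min(len(lines), n * 2)
    return lines
```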

3.3 Fitness Evaluation

The evolutionary process is objective driven: the fitness function that defines this objective plays a vital role in code generation. After the crossover and mutation phases, the generated code fragments are evaluated for fitness.

As IFuzzer aims to generate uncommon code structures that trigger exceptional behavior, we consider both structural aspects of the generated program and interpreter feedback as inputs to the objective function. The interpreter feedback includes warnings, execution timeouts, errors, crashes, etc.; in other words, the goal itself. Moreover, during the fitness evaluation, we calculate structural metrics such as the cyclomatic complexity of the program. The cyclomatic complexity [27] captures the structural complexity of the code. For instance, nested (or otherwise complex) structures have a tendency to trigger uncommon behavior [28], so such structures score higher than less complex programs.

At its core, IFuzzer calculates the base fitness value \(f_b(x)\) of an individual x as the sum of its structural score (\(score_{structure}\)) and its feedback score (\(score_{feedback}\)).

$$\begin{aligned} f_b(x) = score_{structure} + score_{feedback} \end{aligned}$$

Finally, as discussed in Sect. 3.2, IFuzzer's bloat control re-calculates the fitness with a penalty determined by the product of the parsimony coefficient c and the length l of the individual:

$$\begin{aligned} f_{final}(x) = f_b(x) - c*(l(x)) \end{aligned}$$

where \(f_{final}(x)\) is the updated fitness value of an individual x.
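
Putting Sects. 3.2 and 3.3 together, the fitness computation can be sketched as below. The feedback weights are illustrative placeholders, since the paper does not spell out their exact values; the structural score would come from metrics such as the cyclomatic complexity.

```python
FEEDBACK_SCORES = {          # illustrative weights, not the paper's values
    "ok": 0, "warning": 1, "timeout": 2, "error": 3, "crash": 10,
}

def final_fitness(score_structure, outcome, length, c):
    """f_final(x) = f_b(x) - c * l(x), with f_b(x) = structure + feedback.

    score_structure: structural metric (e.g., cyclomatic complexity);
    outcome: interpreter verdict, mapped through FEEDBACK_SCORES;
    length: number of non-terminals in the individual's parse tree;
    c: parsimony coefficient of the current generation (Eq. 1).
    """
    f_b = score_structure + FEEDBACK_SCORES.get(outcome, 0)
    return f_b - c * length
```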

Parameters. IFuzzer contains many adjustable GP and fitness parameters, including the mutation rate, crossover rate, population size, and the number of generations. In order to arrive at a set of optimal values, we ran the application under test with various combinations of these parameters and observed properties such as input diversity and structural variety, adhering to the policy that the higher the values of such properties, the better the combination of parameters. In the experiments, we use the best combination based on observations made during a fixed profiling period. We note, however, that it should be possible to fine-tune all these parameters further for optimal results.

4 Experimentation

In this section, we evaluate the effectiveness of IFuzzer through experiments on real-world applications. IFuzzer is a cross-platform tool that runs on UNIX and Windows operating systems. All experiments were performed on a standalone machine with a quad-core 1.6 GHz Intel i5-4200 CPU and 8 GB RAM. Our experiments aim to answer the following questions:

  1. Does IFuzzer perform better than the known state-of-the-art tools? How effective is IFuzzer?

  2. What are the benefits of using GP? What drives GP to reach its objective?

  3. Does our objective function encourage the generation of uncommon code?

  4. How important is it that IFuzzer generates uncommon code? How does this relate to achieving high coverage of the interpreter?

In order to answer these questions, we performed two experiments. In the first, we evaluate IFuzzer and compare it against the state-of-the-art LangFuzz on the same test software [3]. In the second, we run IFuzzer against the latest build of SpiderMonkey. We have also run IFuzzer with different configurations in order to evaluate the effect of separate code generation strategies; the results of these experiments are in the Appendix.

We also ran IFuzzer on the Chrome JavaScript engine V8 and reported a few bugs. However, these bugs do not appear to be security bugs (as per the Chrome V8 team) and we therefore do not report them in detail in this paper. To establish the applicability of IFuzzer to other interpreters, we also configured IFuzzer for Java using the Java grammar specification available at [29]. However, we have not tested this environment to its full extent; the main intention of this exercise is to show the flexibility of IFuzzer with respect to other grammars.

4.1 Testing Environment

In our experiments, we used the Mozilla development test suite as the initial input set. The test suite consists of approximately 3000 programs chosen from a target version. We used the same test suite for fragment pool generation and program generation. Fragment pool generation is a one-time process that reads all programs at the start of the fuzzing process and extracts fragments for the different non-terminals. We assume that the test suite contains inputs (i.e., code fragments) that have been used in testing in the past and have triggered bugs. We chose SpiderMonkey as the target JavaScript interpreter and derived the input grammar specification and grammar rules from the ECMAScript 262 standard specification [30].

4.2 IFuzzer vs. LangFuzz

Our first experiment evaluated IFuzzer by running it against interpreters with the aim of finding exploitable bugs and comparing our results to those of LangFuzz, both in terms of the time taken to find bugs and in terms of the overlap in bugs found by the two fuzzers. Since we do not have access to the LangFuzz code, we base our comparison on the results reported in [3]. For a meaningful comparison, we chose SpiderMonkey 1.8.5 as the interpreter, as this was the current version of SpiderMonkey when LangFuzz was introduced in [3].

In the experiment on SpiderMonkey 1.8.5, IFuzzer found 40 bugs in a time span of one month, while LangFuzz found 51 bugs in 3 months. More importantly, when comparing the bugs found by the two fuzzers, the overlap is "only" 24 bugs. In other words, a large fraction of the bugs found by IFuzzer is unique.

With roughly 36% overlap in the bugs (Fig. 4), IFuzzer clearly finds different bugs, ones missed by today's state-of-the-art interpreter fuzzer, in comparable time frames.

We speculate that IFuzzer will find even more bugs if we further fine-tune its parameters and run it for a longer period. We also note that many build configurations are possible for SpiderMonkey, and LangFuzz tries to run on all of them. In contrast, due to resource constraints, we configured IFuzzer to run on only two configurations (with and without debugging enabled). Trying more configurations may well uncover more bugs [31].

Fig. 4. Number of defects found by IFuzzer (40) and LangFuzz (51) in SpiderMonkey version 1.8.5

In order to determine the severity of the bugs, we investigated them manually with gdb-exploitable [32], a widely used tool for classifying a given application crash (core dump) as exploitable or non-exploitable. Out of IFuzzer's 40 bugs, gdb-exploitable classified no fewer than 12 as exploitable.

Example of a Defect Triggered by IFuzzer: Listing 1.2 shows an example of a generated program triggering an assertion violation in SpiderMonkey 1.8.5. The JavaScript engine crashes at line 6, as it fails to build an argument array for the given argument value abcd*&^%$$. Instead, one would expect an exception or error stating that the argument is invalid.

[Listing 1.2: generated program triggering the assertion violation]
Table 1. Bugs found in the latest version of Mozilla’s SpiderMonkey

Another example (shown in Listing 1.3) exposes a security issue in SpiderMonkey 1.8.5 related to the strict mode changes to the JavaScript semantics [33]. Line 8 enables strict mode, which changes the way SpiderMonkey executes the code. On execution, the JavaScript engine crashes due to an access violation, resulting in a stack overflow.

4.3 SpiderMonkey Version 38

We also ran an instance of IFuzzer targeting SpiderMonkey 38 (the latest version at the time of experimentation). Table 1 shows the results. IFuzzer detected 17 bugs, of which 4 were confirmed to be exploitable. Five of the crashes (marked with \(*\)) are due to assertion failures (which may be fixed in subsequent versions), unhandled out-of-memory crashes, or spurious crashes that we could not reproduce. The remaining ones are significant bugs in the interpreter itself.

For instance, the code in Listing 1.4 looks like an ordinary infinite loop, except that one of the interconnected components fails to handle the memory management: the JavaScript engine keeps consuming heap memory, creating a denial of service by crashing the machine within a few seconds.

[Listing 1.4: code fragment responsible for the out-of-memory crash]

Our efforts were also rewarded by Mozilla's bounty program for one of the bugs detected by IFuzzer. The bug received an advisory from Mozilla [34] and the CVE number CVE-2015-4507; it concerns a crash due to a getSlotRef assertion failure and application exit in the SavedStacks class of the JavaScript implementation. Mozilla classified its security severity as "moderate".

The results discussed so far establish that the evolutionary approach followed by IFuzzer is capable of generating programs with a given objective and of triggering significant bugs in real-world applications.

Other Interpreters. When evaluating our work on the Chrome JavaScript engine V8, IFuzzer worked out of the box and reported a few bugs that resulted in crashes (see Table 2). As far as we can tell, these do not appear to be security bugs; they require further scrutiny.

Table 2. IFuzzer crashes found on Chrome V8 [4.7.0]

To further establish the applicability of IFuzzer to other languages, we also configured IFuzzer for Java using the Java grammar specification available at [29]. However, as noted earlier, we have not tested this environment to its full extent.

5 Remarks on IFuzzer’s Design Decisions

Recall that IFuzzer uses an evolutionary approach for code generation, guiding the evolution process to generate uncommon code fragments. As stated earlier, several parameters are available to fine-tune IFuzzer for better performance. For example, we use a subset of the initial test suite (of cardinality equal to the population size), rather than the whole suite, as the first generation in order to make effective use of the available resources. The remaining inputs from the test suite can be used in later generations when IFuzzer gets stuck in a local optimum, a known obstacle in evolutionary algorithms.

The generation in which a bug is identified depends on different factors, including the size of the input test sample, the size of the fragment considered for a genetic operation, and the size of the newly induced fragment. As discussed, the higher the complexity of the inputs, the higher the probability of finding a bug. Bloat control and the time taken by the parser to process the generated programs (one of the fitness parameters) restrict larger programs from making it into the next generations. IFuzzer does not completely discard larger programs, but deprioritizes them.

We also observed that almost all the bugs in SpiderMonkey 1.8.5 were triggered in the range of 3–120 generations, with an average of 35–40 generations. With the increase in complexity and the number of language features added to the interpreter, the latest version requires more uncommonness to trigger bugs, which implies more time to evolve inputs. As an example, all the bugs in the latest version were found after 90–95 generations on average.

While there are some similarities between LangFuzz and IFuzzer, the differences are significant, and it is difficult to make a fair comparison on all aspects. IFuzzer's GP-based approach is a guided evolutionary approach steered by a fitness function, whereas LangFuzz follows a pure mutation-based approach of changing the input and testing. IFuzzer's main strength is its feedback loop: the evolution of inputs as dictated by its new fitness function makes the design of IFuzzer very different from that of LangFuzz.

Both IFuzzer and LangFuzz are generative fuzzers that use grammars in order to be language independent, but they differ in their code generation processes: LangFuzz uses code mutation, whereas IFuzzer uses GP. The use of GP gives IFuzzer the flexibility of tuning various parameters for efficient code generation.

Intuitively, the fitness function (objective function) uses structural information about the program, along with interpreter feedback, to calculate the fitness. The structural information measures the singularity and complexity of the generated code: the chances of introducing errors are higher with larger and more complex code. Hence, the inputs that triggered bugs are not entirely new inputs but have evolved through generations, starting from the initial test cases. We observed this evolutionary manifestation repeatedly during our experimentation.

In a nutshell, we observed that the uncommonness of the input code (such as its structural complexity or the presence of type casts and type conversions) correlates well with the likelihood of triggering exceptional behavior in the interpreter. Throughout this work, the driving intuition has been that most tests during the development of an interpreter focus on the common cases. Testing the interpreter on uncommon ("weird") test cases is therefore promising, as generating such test cases manually is not straightforward, and thereby some failure cases are missed.

6 Related Work

Fuzz testing has been transformed from a small research project for testing UNIX system utilities [35] into an important and widely adopted technique.

Early fuzzers were brute-forcing tools [36] for discovering flaws, which were then analyzed for the possibility of security exploitation. Later, the community realized that such simple approaches have many limitations in discovering complex flaws. Smart fuzzers overcame some of these limitations and proved more effective [37].

In 2001, Kaksonen et al. [38] used an approach known as mini-simulation, a simplified description of protocols and syntax, to generate inputs that nearly match the protocol in use. This approach is generally known as a grammar-based approach; it provides the fuzzer with sufficient information to understand protocol specifications. Kaksonen's mini-simulation ensures that the protocol checksums are always valid and systematically checks which rules are broken. In contrast, IFuzzer uses the grammar specification to generate valid inputs.

Yang et al. [39] presented CSmith, a random C program generator. It uses a grammar to produce programs with different features, thereby testing and analyzing C compilers. CSmith is a language-specific fuzzer and uses semantic information during the generation process.

In the area of security, Zalewski presented ref_fuzz [40] and crossfuzz [41], aiming at the DOM component in browsers. JsFunFuzz [42] is a language-dependent generative tool written by Ruderman in 2007, which targets JavaScript interpreters in web browsers and has led to the discovery of more than 1800 bugs in SpiderMonkey. It is considered one of the most relevant works in the field of web interpreters. LangFuzz, a language-independent tool presented by Holler et al. [3], uses the language grammar and code mutation for test generation. In contrast, IFuzzer uses the grammar specification and code generation. Proprietary fuzzers include Google's ClusterFuzz [43], which tests a variety of functionalities in Chrome. It is tuned to generate almost 5 million tests a day and has detected several unique vulnerabilities in Chrome components.

However, all these approaches may let the code generation process drift away from generating the required test data, degenerating into random search and providing low code coverage. Feedback fuzzers, on the other hand, adjust and generate inputs dynamically based on information from the target system.

An example of feedback-based fuzzing is evolutionary fuzzing, which uses evolutionary algorithms to create the required search space of data and operates based on an objective function that controls the test input generation. One of the first published evolutionary fuzzers is by DeMott et al. in 2007 [10]: a grey-box technique that measures code-block coverage and uses it to generate new inputs with better coverage in order to find bugs.

Search-based test generation using metaheuristic search techniques and evolutionary computation has been explored before for generating test data [44, 45]. In the context of using GP for code generation (as also adopted by IFuzzer), recent work by Kifetew et al. [46] combines stochastic grammars with GP to evolve test suites for system-level branch coverage of the system under test.

Our approach differs from existing work in many aspects. First, it uses GP with a uniquely designed guiding objective function, directed towards generating uncommon code combinations, which makes it more suitable for fuzzing. In order to generate code that is syntactically correct but still uncommon, we apply several heuristics in the mutation and crossover operations. Our approach is implemented as a language-independent black-box fuzzer. To the best of our knowledge, IFuzzer is the first prototype to use GP for interpreter fuzzing, with very encouraging results on real-world applications.

7 Conclusion and Future Work

In this paper, we elaborated on the difficulties of efficiently fuzzing an interpreter, as well as our ideas for mitigating them. The main challenge comes from the fact that we need to generate code that reveals bugs buried deep inside the interpreter implementation. Several of these bugs turn out to be exploitable security bugs, which makes an interpreter a very attractive target for future attacks.

In this work, we proposed an effective, fully automated, and systematic approach for interpreter fuzzing and evaluated a prototype, IFuzzer, on real-world applications. IFuzzer uses an evolutionary code generation strategy that applies to any computer language for which we have an appropriate grammar specification and a set of test cases for the code generation process. IFuzzer introduces a novel objective function that helps the fuzzer reach its goal of generating valid but uncommon code fragments in an efficient way. In our evaluation, we show that IFuzzer is fast at discovering bugs when compared with a state-of-the-art fuzzer of its class. IFuzzer found several security bugs in the SpiderMonkey JavaScript interpreter used in the Mozilla Firefox browser. The approach is generic enough to automate code generation for testing any targeted language interpreter or compiler for which a grammar specification is available, and it serves as a framework for generating fuzzers for any interpreted language and the corresponding interpreters.

IFuzzer is still evolving, and we envision several avenues for further improvement. We plan to investigate more code (property) parameters to be considered in the fitness evaluation. In our experiments, we observed that the parameters for the genetic operations (mutation and crossover) should be tuned further to improve the evolutionary process. Another improvement is to keep track of more information during program execution, which would help to guide the fuzzer in a more fine-grained manner. For example, through dynamic program analysis we can gather information about the program paths traversed, which gives coverage information as well as the correlation between program paths and the bugs they lead to. This information could be used to refine the fitness function, thus improving the quality of code generation.