Paper Reading: Boosting Fuzzer Efficiency: An Information Theoretic Perspective

NOTE: This is a Paper Reading for Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis. The original paper can be found here.

What is the problem being tackled?

Finding a solid theoretical foundation for fuzzing and using it to boost fuzzing efficiency is a direction of research of great practical value.

How was it addressed by prior work?

Previous works have proposed various heuristics to boost fuzzing efficiency, such as assigning more energy to seeds that have previously been observed to crash, to seeds that maximize execution counts (to discover algorithmic complexity vulnerabilities), or to seeds that exercise low-probability paths. Furthermore, there has also been prior theoretical research on fuzzing, such as probabilistic analyses of the efficiency of blackbox versus whitebox fuzzing and empirical investigations of the scalability of non-deterministic black- and greybox fuzzing.

What are the innovation(s) proposed in this paper?

First, the paper develops an information-theoretic foundation for non-deterministic fuzzing.

Assumptions

Fuzzing Heuristics remain constant throughout the fuzzing process.

Concepts

Neighborhood
All inputs generated from mutating a seed.
Species
A branch within a program.
Species Discovery
Program execution traverses a previously untraversed branch when some input is provided to the program.
Incidence Frequency
The number of times a species is covered.
Energy
The probability the fuzzer chooses a seed for mutation.
Power Schedule
The procedure of assigning energy to a seed.
Local Species Distribution of a Seed
Given a seed, the probability of each species being covered, when an input generated by mutation from the seed is fed to the program.

Entropy in the Context of Fuzzing

Using the metaphor of a "language" with "words" of varying frequencies, entropy in the context of fuzzing can be understood as:

  • "Sentences" of the "language": Program executions resulting from generated inputs.
  • "Words" of the "language": Species.
  • Frequencies of the "words": The frequencies of each species being traversed.

Entropy can be calculated from the frequencies of the "words" and summarizes how evenly the "words" are distributed. Since high entropy implies that the program's species have all been well covered, it can be used as a proxy for fuzzing efficiency.
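As a minimal Python sketch (function and species names are hypothetical, not from the paper), Shannon entropy over observed species frequencies:

```python
from math import log

def entropy(frequencies):
    """Shannon entropy (in nats) of a species ("word") frequency distribution.

    `frequencies` maps each species to the number of times it was traversed
    across the program executions (the "sentences").
    """
    total = sum(frequencies.values())
    return -sum((y / total) * log(y / total)
                for y in frequencies.values() if y > 0)

# A distribution where every species is covered equally has higher entropy
# than a skewed one, matching the intuition that high entropy means the
# species are all well covered.
uniform = {"branch_a": 5, "branch_b": 5, "branch_c": 5}
skewed = {"branch_a": 13, "branch_b": 1, "branch_c": 1}
assert entropy(uniform) > entropy(skewed)
```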

Local Entropy of a Seed

Still using the metaphor of a "language" with "words" of varying frequencies, local entropy of a seed can be understood as:

  • "Sentences" of the "language": Program executions resulting from inputs within the seed's neighborhood.
  • "Words" of the "language": Species.
  • Frequencies of the "words": The frequencies of each species being traversed.

The local entropy of a seed quantifies the information that feeding the inputs within the seed's neighborhood into the program reveals about the species.

Second, the paper presents the first entropy-based power schedule to boost the efficiency of greybox fuzzers. More energy is assigned to seeds that elicit more information about the program's species: each time a seed is randomly chosen for mutation, every seed is assigned energy proportional to its local entropy.

However, a new seed that has never been fuzzed would always be assigned zero energy and would never be chosen for mutation. To solve this problem, add-one (Laplace) smoothing is applied to the species frequencies.

Specifically, the frequency of species \(i\) used to calculate the local entropy of seed \(t\) is:

\(p_i^t = \frac{Y_i^t + 1}{S + Y_1^t + \dots + Y_S^t}\)

Where:

  • \(Y_i^t\) is the number of times species \(i\) has been traversed by the neighborhood of \(t\).
  • \(S\) is the total number of species at the time of calculation.
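Putting the definitions above together, a seed's smoothed local entropy might be estimated as follows (a Python sketch under the paper's definitions; the function name is hypothetical):

```python
from math import log

def smoothed_local_entropy(incidence, total_species):
    """Estimate a seed's local entropy with add-one (Laplace) smoothing.

    `incidence[i]` is Y_i^t: how often species i was traversed by inputs in
    the seed's neighborhood; `total_species` is S, the number of species
    known at the time of calculation. An unfuzzed seed (all Y_i^t = 0) gets
    the uniform distribution 1/S, hence nonzero entropy, so it can still be
    chosen for mutation.
    """
    denom = total_species + sum(incidence)
    probs = [(y + 1) / denom for y in incidence]  # p_i^t from the formula
    return -sum(p * log(p) for p in probs)
```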

Furthermore, in the experiments, the authors noticed that the local entropies for different seeds were almost the same, because a small number of very abundant species had a huge impact on the local entropies. Thus, the authors defined an abundance threshold \(\theta\) which is an upper bound for \(Y_i^t\).
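A sketch of how the abundance threshold and the power schedule might fit together (Python with hypothetical names; the paper's actual implementation, Entropic, lives in C++ within libFuzzer):

```python
import random
from math import log

def clamped_local_entropy(incidence, total_species, theta):
    """Smoothed local entropy with each Y_i^t capped at the abundance
    threshold theta, so a few very abundant species cannot dominate the
    estimate and flatten the differences between seeds."""
    clamped = [min(y, theta) for y in incidence]
    denom = total_species + sum(clamped)
    return -sum(((y + 1) / denom) * log((y + 1) / denom) for y in clamped)

def choose_seed(seeds, incidences, total_species, theta):
    """Entropy-based power schedule (sketch): pick a seed with probability
    proportional to its clamped, smoothed local entropy."""
    energies = [clamped_local_entropy(incidences[s], total_species, theta)
                for s in seeds]
    return random.choices(seeds, weights=energies, k=1)[0]
```

With the cap in place, an extremely abundant species contributes no more than one counted up to \(\theta\), which restores meaningful differences between the seeds' local entropies.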

How are those innovations evaluated? How does the paper's evaluation match with the proposed problem statement?

The paper provides an open-source implementation, Entropic, within LLVM libFuzzer, and presents a substantial empirical evaluation on over 250 widely-used, open-source C/C++ programs, producing over two CPU-years' worth of data.

Four research questions were asked to evaluate the hypothesis that increasing information per generated input increases fuzzer efficiency.

  1. What is the empirical coverage improvement over the baseline?
  2. How much faster are bugs detected compared to the baseline?
  3. How does the choice of abundance threshold influence the performance of our technique?
  4. What is the cost of maintaining incidence frequencies?

The answers to these research questions strongly support the hypothesis; thus, the evaluation matches the proposed problem statement well.

Your opinion of the paper

Which technical innovations are most compelling to you?

Developing an information-theoretic foundation for non-deterministic fuzzing, in which entropy in the context of fuzzing is calculated using the probability distribution of species (branches). This is both intuitive and allows entropy, which has "really nice properties, and a principled origin", to be used effectively as a "convenient proxy" for fuzzing efficiency.

What remains unclear after reading the paper? Are there any clarification questions whose answers would substantially change your opinion of the paper? Which problems remain unsolved after this paper?

The paper develops an information-theoretic foundation for non-deterministic fuzzing, before presenting the first entropy-based power schedule to boost the efficiency of greybox fuzzers. I have questions regarding both aspects.

  1. Entropy is calculated using the probability distribution of species, which are defined as branches. Is it possible to utilize a different definition of "species"?
  2. The entropy-based power schedule assigns each seed energy proportional to its local entropy. However, the authors noticed that the local entropies of different seeds were almost the same, because a small number of very abundant species had a huge impact on them. They therefore introduced the abundance threshold \(\theta\), an upper bound for \(Y_i^t\) and a task-relevant hyperparameter. Is there a better approach to calculating the local entropies?

Do you foresee any barriers to the applicability of the technique proposed in the paper? If so, how could these barriers be overcome?

As stated above, the main barrier concerns the entropy-based power schedule: its abundance threshold \(\theta\) is a task-relevant hyperparameter, and a poor choice could flatten the differences between seeds' local entropies, so it may need to be tuned per target.


https://abbaswu.github.io/2022/09/20/Paper-Reading-Boosting-Fuzzer-Eficiency-An-Information-Theoretic-Perspective/
Author: Jifeng Wu
Posted on: September 20, 2022