Paper Reading: Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software

NOTE: This is a Paper Reading for Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis. The original paper can be found here.

To combat worms spread by the Internet exploiting software vulnerabilities, the paper proposes TaintCheck, a dynamic taint analysis technique for automatic detection of exploits on software.

Summary of TaintCheck:

TaintCheck directly operates on an arbitrary executable and does not require its source code. It uses Valgrind to translate basic blocks being executed into Valgrind's RISC-like instruction set (UCode), inserts UCode instructions for instrumentation, and passes the modified UCode back to Valgrind for execution.
TaintCheck by default considers data originating from the network as "untrusted" and taints it. It keeps track of "the propagation of tainted data as the program executes", which involves monitoring data movement instructions and arithmetic instructions, with the exception of constant functions such as xor eax, eax.
To accomplish this, TaintCheck associates "each byte of memory" with a Taint data structure. Different instances of such a data structure are "chained" to record "how tainted data is propagated".
TaintCheck checks whether tainted data is used in ways it considers illegitimate, such as being used as a return address, a function pointer, a format string, and (optionally) as an argument of a system call. When such illegitimate uses are detected, it is possible to collect information about a software vulnerability, especially "the execution path from tainted data's entry and its use in a probable exploit".

The paper also proposes a new semantic-based automatic signature generation approach on top of TaintCheck.

There are several questions that came to my mind when I was reading this paper:

The paper mentions that "the current implementation slows program execution between 1.5 and 40 times", but also mentions that "the prototype has not been optimized", and proposes optimization techniques. Why didn't the authors implement these optimization techniques and conduct experiments on the optimized TaintCheck?
There is no doubt that using Valgrind to translate basic blocks being executed into UCode greatly simplifies dynamic taint analysis on an arbitrary executable, as TaintCheck deals with an RISC-like instruction set instead of raw machine code. However, this incurs significant overhead. Would directly performing dynamic taint analysis on machine code at runtime using a dynamic binary instrumentation tool such as Intel Pin boost performance (like the case of QSym)? What about generating UCode, inserting instructions for instrumentation, and passing the modified UCode back to Valgrind before the executable is executed?
What is the overhead of using the Taint data structure? Would the total size of all Taint data structures explode for long-running processes? And why do they use this Taint data structure, instead of using a conventional Data Flow Graph?
What is the list of constant functions that TaintCheck supports? Is it representative, and is it extensible?
Are the ways tainted data is used considered by TaintCheck to be illegitimate representative of real exploits? How well can TaintCheck discriminate from "illegitimate" uses with intentional uses, and/or uses with checks? Specifically, the paper mentions that TaintCheck can "untaint the data immediately after it has been sanity checked", but how is this situation detected?
In the evaluation section, why are the benchmarks used in assessing "compatibility and false positives" different from those used in assessing "attack detection" on actual exploits?
What does a "signature" look like, and how is it used to filter attacks?

Feedback from the Class Discussion

Performance is not a priority. The paper is more of a proof-of-concept, and even "reads like a grant proposal", especially Section 6.
The "taint data structure" includes more information than a dataflow graph (snapshots of the stacks etc.). It can also use a "memory arena" instead of vanilla heap allocation to improve performance.
Some of the detected attacks may not be present in "safe", managed languages.
Due to the large overhead, the technique cannot be used to handle requests in production, but requests can be forked to it instead.
Dynamic taint analysis can have applications outside of the security domain.

Paper Reading > Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis

Paper Reading: Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software

https://abbaswu.github.io/2022/10/29/Paper-Reading-Dynamic-Taint-Analysis-for-Automatic-Detection-Analysis-and-Signature-Generation-of-Exploits-on-Commodity-Software/

Author

Jifeng Wu

Posted on

October 29, 2022

Licensed under

Various Solutions for Different Types of Reverse Proxying Previous

Python in a Functional Style: Closures, Generators, and Coroutines Next