-
Notifications
You must be signed in to change notification settings - Fork 8
Graph Matching: Challenges
For the matching following are the prerequisite tasks:
Task 1: Data flow graph of the binary Issue1
The data flow graph for the binary program must include the values(register or memory) as nodes. This requires enumerating all the registers and memory in the program and run the data flow analysis with it.
For example: for the #0: addq $0, %rax; #1: movq %rax, %rbx
, the dfg should have the following relations:
%rax (from #0) --> %rbx (from #1)
The current representation is like
#0 --> #1
and only captures the df relation across instructions and hence will not expose the input-output relation within an instruction. On the other hand, the dfg of the decompiled LLVM program, being derived from the instruction semantics, captures the df relations within an instruction.
Task 2: Enhance Stoke APIs to provide memory access information Issue2
Consider the instruction pushq %rbp
which pushes the value of rbp
at sack location %rsp-8
and also updates the value of %rsp
to %rsp-8
. The read-set of the instruction is %rbx and %rsp
and the write-set is %rsp and memory
. Currently, the stoke APIs, to extract the read/write-set of instruction, is not proving the memory read or write information. Make changes in the API to accommodate that.
Task 3: Prune out spurious df edges of the dfg of the binary Issue3
Binary instruction are multi-input and multi-output. Consider the instruction pushq %rbp
which pushes the value of rbp
at sack location %rsp-8
and also updates the value of %rsp
to %rsp-8
. The read-set of the instruction is %rbx and %rsp
and the write-set is %rsp and memory
. A naive approach of generating dfg will create two edges from each read-set to each one of the write-set. However, the edge from %rbp
to %rsp
is spurious as %rbp
is not contributing towards the update of the %rsp
value. These spurious edges can be avoided if we take the semantics of the pushq
instruction into account. In fact, the dfg of the decompiled LLVM program does not have this spurious information as it is constructed following the semantics of an instruction.
In order to remove the spurious edges, we need to know which members of the read-set are actually contributing to the write-set. Such, information can be extracted from the semantic definition of an instruction.
Task 4: Discover aliasing load/stores to normalise the dfg on the decompiled LLVM program Issue4
In the unoptimized decompiled LLVM program, there is no store forwarding, which makes it hard to see the df relation between the aliasing store and subsequent load. Discovering the aliasing relation will drive the LLVM dfg closer to the binary dfg.
Cosider the scenario:
X86
===========
addq $10, %rbx
addq $20, %rbx
jc .L2
L2:
subq %rbx, %rdx // use-set %rbx
LLVM
===========
/*
addq $10, %rbx
*/
%1 = load *RBX
%2 = add 10, %1
/*
addq $20, %rbx
*/
%3 = load *RBX
%4 = add 20, %3
jc .L2
L2:
/*
subq %rbx, %rdx // use-set %rbx
*/
%5 = sub %4, ...
Case I: Unoptimized: Both llvm dfg is present; unordered
Our intention: Before L2 we need to add an invariant rbx == %4
Case II: Optimized: Both llvm dfg is present; re-ordered
Our intention: Before L2 we need to add an invariant rbx == %4
Case III: Optimized: Only one of the lvm dfgs are present;
Our intention: Before L2 we need to add an invariant rbx == %4
test