Skip to content

Program Structure

Thomas-Malcolm edited this page Jun 30, 2023 · 12 revisions

ADT to BASIL IR

Analysis is performed over the tool's own internal IR, BASIL IR. To do so, we need a way to take a program (assumed: binary), and transform it into this IR. To do this we use a tool to lift the Binary into an ADT (Abstract Data Type) - BAP lifts the ARM binary into its own IR, which it can then output as an ADT. We parse this ADT into an internal representation of what BAP lifted, and then translate this into our IR. This is then what is used to perform analysis on, and later translate into Boogie. An overview of the translating processes is given here.

Parsing Phase

ANTLR is a tool which takes a grammar file (BilAdt.g4) and creates a parser in Java that exposes tokens of the grammar through classes. The parser is autogenerated at runtime here.

val adtLexer = BilAdtLexer(CharStreams.fromFileName(fileName))
val tokens = CommonTokenStream(adtLexer)
val parser = BilAdtParser(tokens)

Walking Phase

The parser object then stores the parsed ADT (Abstract Data Tree) of the program. We then walk the ADT in AdtStatementLoader, which happens here.

val program = AdtStatementLoader.visitProject(parser.project())

As the ADT is walked we transform tokenised ADT nodes into internal BAP IR structures, all defined here. For example, a parsed direct call, stored by the parser as a DirectCallContext, will be converted into a BAPDirectCall.

These are all collated into a BAPProgram, which is essentially a list of functions in the program, and known memory sections, as represented by BAP.

Translating Phase

We now have an internal representation of the program we want to analyse, stored as a mapping of how BAP lifted it. Instead of performing analysis on this, we translate this into our own IR. The BASIL IR was designed to be easily translated into Boogie, meaning any transformations of the program structure resulting from analyses should not hamper our ability to translate into Boogie.

With the ADT parsed, we then match up symbol names with procedures in the ADT via the output of readelf, stored in the associated .relf. This helps us identify external functions, internal functions, and global variables.

The BAPProgram is then translated into an IR Program via BAPToIR.

We then remove external functions that we wish to ignore for analysis (such as printf), and clean up some BAP naming artefacts. At this point we have a BASIL IR of the program which we can then perform analyses over, and later translate to Boogie IVL.

Derivation of the ADT grammar

The grammar of the parser is inspired by the existing parser found at BinaryAnalysisPlatform/bap-python, in particular, the nodes found in adt.py, br.py and bil.py. The grammar attempts to reflect the inheritance hierarchy - the alternatives of a token are its subclasses.

For example, exp and term are subclasses of ADT, and thus in the grammar we have adt : exp | term | ....

General Notes

  • Negative numbers are encoded somewhat strangely. It takes the 2's complement representation of the number, and interprets the binary string as if it were an unsigned integer.
    • For example, the number -2 in 64 bits is 1111111111111111111111111111111111111111111111111111111111111110, which is interpreted in the ADT as Int(18446744073709551614, 64).

BASIL IR

The internal IR is designed such that it will translate easily to Boogie's IVL. The layout of a program is given:

Similar to BAP, this simply stores a list of known functions (procedures) and memory regions.

A procedure is a representation of a function call, e.g. __libc_start_main. It stores

  • name : the function's string representation (e.g. "__libc_start_main").
  • address : offset into the binary.
  • blocks : a list of blocks in the procedure. See Block for an explanation of what constitutes a block.
  • calls : a list of function calls made by blocks in this procedure.

Represents an abstract region of memory. It has an address and a number of bytes in size. It can also store a sequence of bytes, representing data in the memory region.

A basic block. Basic blocks in the context of this IR are consecutive sequences of instructions without a jump.

  • label : unique identifier for this block. This is the label that BAP gave it which has propagated through (its effectively a "BAP Address").
  • address : offset of the start of this block in the binary.
  • statements : sequence of statements in the block.
  • jumps : a jump from this block to another block - because our blocks are basic blocks, this will either be a single (direct|indirect) call, one GoTo call representing a loop, or two GoTo calls representing a conditional branch.

These sets are calculated and stored in the block also:

  • calls : names of functions (jumps) from this block which have been resolved (i.e. direct calls).
  • modifies : memory regions that are modified within this block.
  • locals : registers accessed during this block.

An expression (rhs) which is evaluated and stored into a register (lhs). For example, R2 := 3bv32 + 20bv32 is a statement, with lhs: R2 and rhs: Expression(3bv32 + 20bv32).

  • modifies : memory regions that this statement modifies.
  • locals : registers that this statement accesses (writing or reading).

Assigning a value to a local variable. Local in this context means local to the procedure scope.

  • modifies : empty for local.
  • locals : registers that this statement accesses (writing or reading).

Assigning a value to some memory region. A memory region can be on the stack, in the heap, or global.

  • modifies : memory regions that this statement modifies (lhs of statement).
  • locals : registers that this statement accesses (writing or reading).

A jump from this point in the program to another part in the program.

  • modifies : currently unused. NOTE: is this deprecated?
  • locals : dependencies for this jump. For example, a goto instruction will depend on some expression involving CPU flags. For procedure calls, these are parameters passed to the function call.
  • calls : currently unused. NOTE: is this deprecated?

These all then depend Expressions, which contain representations for instructions to be evaluated / left symbolic, such as literals and binary operations. Boogie uses BitVectors for register values, and so operations in the BASIL IR are also represented as BitVector operations.

Analyses

Analysis and Solvers

Supported Analyses

Clone this wiki locally