Memory is allocated via a set of arenas. At the start the compiler reserves a big chunk of memory (186 GiB at the moment), and bits of that memory are given to each arena.
On Windows memory is reserved via a call to `VirtualAlloc` with the `MEM_RESERVE` flag. Each arena is then responsible for committing memory on demand with a `VirtualAlloc` + `MEM_COMMIT` call. I settled on a 64 KiB commit size, as larger sizes do not improve compile time much.
On Linux I wasn't able to achieve a pure reserve, so the compiler relies on overcommit being enabled and commits the whole arena pool at the start. The OS is then responsible for allocating memory pages on first access.
- Read source - reads source files into memory
- Lex - produces token array per source file
- Parse - produces AST
- Register names - register definitions within scopes
- Lookup names - resolves symbol references
- Type check
- Generate IR - conversion of AST into linear IR in SSA form
- Optimize - optimizes machine-independent IR
- inline
- DCE - Dead Code Elimination: removes instructions with unused results and no side-effects
- IR lower
- lower ABI
- lower aggregates to stack slots
- lower switch
- lower GEP
- IR to LIR AMD64 - Conversion of high-level IR to amd64 machine code IR in SSA form (LIR)
- Live intervals - Collects liveness info about values for use in register allocation
- Linear Scan Register Allocation - Replaces virtual registers with physical ones
- Stack layout - Calculates offsets for stack slots
- Code gen - Converts LIR into machine code
- Link executable
- Write executable (optional, used in non-JIT mode)
For me TAC means a single operation per instruction; an instruction takes arguments and returns some number of results (0 or 1 currently). In the CFG, basic blocks contain a list of instructions with a single entrance and a single exit, and track a list of predecessor/successor basic blocks.
Generate IR creates the IR with those properties and uses abstract instructions and virtual registers. Lower ABI introduces machine registers. IR to LIR switches from abstract to machine instructions. Linear scan replaces virtual registers with machine registers, deconstructs SSA form and also fixes two-operand form instructions. Code gen emits actual machine code.
Files use 2 arenas: one for file info records (`SourceFileInfo`) and one for file contents.
Source files are read sequentially into an arena.
Before each source there is a start-of-input character (`char SOI_CHAR = '\2'`), and after the last source character there is an end-of-input character (`char EOI_CHAR = '\3'`).
During lexing those turn into the corresponding tokens: `TokenType.SOI` and `TokenType.EOI`.
The lexer produces a `TokenType` and a `SourceLocation` per generated token. Those are put into 2 arenas (`tokenBuffer` and `tokenLocationBuffer`). Tokens of all source files live in the same arenas, so they share index space. A token index is a 32-bit number. AST nodes then store just a single `TokenIndex`.
At the moment `SourceLocation` is 16 bytes:
uint start;
uint end;
uint line;
uint col;
and `TokenType` is a single-byte enumeration.
`SourceFileInfo` also stores the index of the first token of the source file. Since all files are loaded sequentially, `firstTokenIndex` grows monotonically with each file added. We can then use `firstTokenIndex` to find the file that contains any given token. This is used for error location printing.
Parsing uses recursive descent plus Pratt parsing for expressions. Types are treated as expressions too.
AST nodes are allocated in `Arena!uint astBuffer` with 4-byte allocation/addressing granularity. AST nodes use 32-bit indices to refer to each other. This way the arena has capacity for up to 16 GiB of nodes.
When AST nodes need to allocate arrays, `ArrayArena` is used, which has multiple pools of fixed sizes.
All identifiers are interned, and a 32-bit identifier index is used everywhere (`struct Identifier`).
All AST node allocations are sequential in memory, and the template system takes advantage of that. Templates store the range of AST arena slots that belong to the template. When a new template instance is created, a copy of that slot range is made at the end of the arena. Then a tree walk fixes indices and duplicates the arrays.
Named declaration nodes (`NameUseExprNode`/`AliasDeclNode`/`EnumDeclaration`/`EnumMemberDecl`/`FunctionDeclNode`/`StructDeclNode`/`TemplateDeclNode`/`VariableDeclNode`) keep track of their parent scope, so that in the Register names pass they can register themselves in that scope.
This pass walks the AST nodes in all files, and each node registers itself in its parent scope.
A single node simply registers itself and then requests name registration from its nested nodes.
Because Vox supports conditional compilation in the form of `#if` and `#foreach`, the name registering pass happens in two steps for arrays of nodes (for example struct members, or block statements):
First we request self-registration of the nodes. This yields a linked list of all nodes implementing conditional compilation. Then we walk the conditional nodes and expand them in place. Right after insertion we recursively repeat the first step.
At the end all conditional nodes have been deleted and replaced with some of their children. All of the nodes in the sequence are self-registered, and no nested nodes have been registered yet.
Then we walk all the nodes in the array and request registration of their children.
One non-ordinary thing here is that the `#if` condition or `#foreach` expression needs to be evaluated, which means that the condition subtree needs to be processed up to the Type check stage and then evaluated.
Every name use expression needs to resolve the node it references. To do this, every name use node stores the parent scope in which it occurred in the source code. The parser tracks scopes and passes them into the node for future use.
So, when the name lookup pass comes to a name use node, it consults the scope tree.
The algorithm is the same as in D (see https://dlang.org/spec/module.html#name_lookup):
- Walk up the scope tree, looking the symbol up in each scope along the way. Once found, that is the symbol.
- If not found, restart the walk up the scope tree, but this time check imports as well.
Another place where name resolution happens is member lookup in a member expression (`parent.member`). This only checks the parent's scope for names.
Performs implicit type conversions. Checks types where a specific type is expected, like a function call argument (`foo(arg)`), variable assignment (`i32 a = 42;`) etc. Finds the common type for binary operations (`a + b`).
Uses the algorithm from Simple and Efficient Construction of Static Single Assignment Form by Braun et al., 2013.
It does 3 subpasses:
- ABI lowering
- Aggregate lowering
- GEP-instruction lowering
For each IR function it transforms `parameter` and `ret_val` instructions to follow the function's own calling convention, and converts `call` instructions to follow the callee's calling convention.
Currently the Vox backend implements 3 CCs: `win64`, `sysv64`, and `sysv64_syscall`.
Rewrites virtual registers that contain aggregates that do not fit into a register into stack slots. At the moment the algorithm is unfinished and produces redundant copies.
Written as part of the aggregate lowering pass. Currently translates the switch instruction into a chain of branches.
Rewrites Get Element Pointer instructions into pointer arithmetic instructions.
aka instruction selection
The Vox compiler uses a basic macro expansion approach, where each IR instruction expands into one or more instructions of the target architecture (currently only amd64).
- Empty LIR is created
- An array is created to store the mapping from IR virtual registers to LIR virtual registers
- Stack slots are duplicated from IR to LIR without changes
- Basic blocks and their phi functions are recreated in LIR. New basic blocks have identical indices to the old ones.
- Old basic blocks are iterated
- BB instructions are iterated
- For each instruction we emit a single instruction or a short sequence of instructions
Vox uses linear scan with interval splitting over SSA form. I followed the papers:
[2002] Linear Scan Register Allocation in the Context of SSA Form and Register Constraints
[2004] Christian Wimmer. Linear Scan Register Allocation for the Java HotSpot™ Client Compiler
[2005] Optimized Interval Splitting in a Linear Scan Register Allocator
[2010] Linear Scan Register Allocation on SSA Form
Parameter layout is determined by the ABI lowering pass.
Other stack slots are arranged by alignment (big to small). This way they are tightly packed and their alignment is respected.
Here the actual machine code is emitted. So far only amd64 is supported.
First, global variables are allocated in the static data sections (read-only and read-write). Symbols that are zeroed are allocated at the end of those sections. In case we are producing an executable, the file will not contain those zeroes.
Then we walk all functions in the program and generate their machine code.
First the function prolog is emitted. It inserts code for frame pointer creation (if needed), then stack reservation (if needed).
The function body is compiled. We walk all basic blocks and the instructions in each BB. The appropriate instruction encoding is selected depending on argument kind (register, stack slot, global, function, constant), register class (GPR, XMM) and instruction size. For arguments that reference globals or functions a new `ObjectSymbolReference` is created (needed for linking later).
Some instructions modify the stack pointer (`push`es before a function call, `add`/`sub` on the stack pointer), so we track that as an additional offset. Then, when compiling instructions that reference stack slots, we add that offset to the stack slot offset to compensate.
Jumps between basic blocks need further fixup. The code generator allocates a table of 2 pointers per basic block, allowing 0-2 successors per basic block. When compiling branch/jump instructions it stores the address of the 4-byte offset into that buffer. No jump is generated if the target is the next block.
Because the IR guarantees a single return instruction, the epilog is generated as part of that instruction's codegen. It deallocates the stack and frame pointer as needed. Then the return instruction is inserted.
After that, jump fixup is performed and the size of the final machine code is calculated.
When generating IR, the frontend creates a new object symbol (`ObjectSymbol`) per function and per global variable. An `ObjectModule` is created per module. There is also `ObjectSection`, which tracks section info of the resulting executable file, and `ObjectSymbolReference`, which represents a reference from one symbol to another.
Symbols are abstracted as arrays of bytes within a section and module. A reference is a slice of a symbol that contains an address (absolute or relative) of another symbol, plus extra data.
After the code generation step all globals are in the globals arena and have finalized addresses in their section (read-only or read-write data sections). The same goes for functions, whose machine code is in the code arena.
Linking pass walks all local modules, and for each reference performs a fixup.
When compiling in JIT mode the host can export its own functions as symbols to the compiled program. They are put into a host module. If a registered function's address is too far from the code that references it, it is put into the import table. The linker handles both relative references and indirection through the import table.
When compiling a standalone executable, the user can also pass `.dll` files to the compiler, and all external functions will be resolved to one of them. Such references are implemented by generating an import table in the executable (`.idata` for PE/COFF).
Generation of `win64` and `elf64` executables is supported. For Windows the compiler can generate references to `.dll` files provided through the CLI.
The following sections are generated (if they are non-empty):
- `.text` - Executable code section
- `.idata` - Import table section (only on win64)
- `.data` - Read-write static data section
- `.rdata` - Read-only static data section
First the sections are created and the import table is filled. Then the linking pass is run. The necessary data is filled into the PE and COFF headers and written to the buffer. Section headers are written, then the data of each section. Now the final executable is fully in the buffer.
The executable that was stored in the arena buffer is written to the target file. On POSIX platforms it is given `rwx` permissions.
AST nodes are allocated sequentially in an arena.
Arena!uint astBuffer;
AST nodes are addressed with indices. Indexing goes in 4-byte granularity. 32-bit indices x 4 bytes gives 16 GiB of addressable memory.
All arrays/hashmaps for the AST are stored in a separate set of arenas. For example, the list of child node indices is stored in such an array. IR arrays use a dedicated arena.
Each AST node has a common header (8 bytes in total):
struct AstNode
{
// Index to the token from the lexer
TokenIndex loc; // 4 bytes
// Type of AST node. Code often switches on it
AstType astType; // 1 byte
// Tracks node state as it progresses through semantic passes
// This is needed because nodes can be analysed in any order
// Needed to find cyclic dependencies in the program
AstNodeState state; // 4 bits
// Some nodes may need to further differentiate node type
uint subType; // 4 bits
// Bitflags
ushort flags; // 2 bytes
}
Program IR consists of globals, constants, types and functions. Functions belong to modules, but modules are a pretty much unnecessary abstraction at this point.
Function IR is stored in a set of arenas that reside in `CompilationContext`:
Arena!IrInstrHeader instrHeaderBuffer;
Arena!IrIndex instrPayloadBuffer;
Arena!IrIndex instrNextBuffer;
Arena!IrIndex instrPrevBuffer;
Arena!IrPhi phiBuffer;
Arena!IrVirtualRegister vregBuffer;
Arena!uint arrayBuffer;
Arena!IrBasicBlock basicBlockBuffer;
Arena!StackSlot stackSlotBuffer;
Each IR function then points to a slice of items within each of those arenas (see the `IrFunction` struct). The length of each slice is stored in the corresponding field of the `IrFunction`:
// Pointers
IrInstrHeader* instrPtr;
IrIndex* instrPayloadPtr;
IrIndex* instrNextPtr;
IrIndex* instrPrevPtr;
IrPhi* phiPtr;
uint* arrayPtr;
IrVirtualRegister* vregPtr;
IrBasicBlock* basicBlockPtr;
StackSlot* stackSlotPtr;
// Length of:
uint numInstructions; // instrPtr, instrNextPtr, instrPrevPtr
uint numPayloadSlots; // instrPayloadPtr
uint numPhis; // phiPtr
uint arrayLength; // arrayPtr
uint numVirtualRegisters; // vregPtr
uint numBasicBlocks; // basicBlockPtr
uint numStackSlots; // stackSlotPtr
Because each arena works as a virtually infinite array, all new items are added at the end.
This imposes a restriction that only one function IR can be created/modified at a time. A modification that needs to add new items requires a function copy to be created at the end of the arenas.
Copying function IR requires only one memcopy per arena, without the need for any fixups, since all references are relative to the start of the respective item slice.
This also makes inlining fast. The function IR that is being inlined is appended to the end of the arenas and all references in those items are offset by the size of the original IR. The offset is applied via an offset table, thanks to the clever design of `IrIndex`.
To add a new item to any of the function arrays we tell the corresponding arena to provide space for one more item at the end and grow the corresponding length field in the `IrFunction`.
For example adding a new instruction looks like:
ir.numInstructions += numSlots;
context.irStorage.instrHeaderBuffer.voidPut(numSlots);
context.irStorage.instrNextBuffer.voidPut(numSlots);
context.irStorage.instrPrevBuffer.voidPut(numSlots);
adding a new phi function:
ir.numPhis += 1;
context.irStorage.phiBuffer.put(IrPhi());
etc. See `ir_builder.d` for more.
All function IR entities are accessed by index into the corresponding array.
A function consists of a linked list of basic blocks (BBs). There are two special BBs: the `entry` and `exit` blocks.
BBs are stored in the memory described by the `basicBlockPtr`/`numBasicBlocks` pair.
The `entry` BB always has index `0`.
The `exit` BB always has index `1`.
Each BB has a linked list of phi functions and a linked list of instructions.
Each BB has pointers to the next and previous BBs in a function. They specify the order of BBs in a function.
Each BB also has a list of predecessor BBs and successor BBs. They represent the control flow edges between BBs. The order of predecessors matches the order of phi functions. Number of arguments of each phi function in that BB matches the number of predecessors and they correspond by index.
Each BB has at least one instruction. Each BB must end in an instruction that transfers control (jump, branch or return). The successors specify the target BBs of that last instruction.
IrIndex firstInstr; // first instruction
IrIndex lastInstr; // last instruction
IrIndex prevBlock; // null only if this is entryBasicBlock
IrIndex nextBlock; // null only if this is exitBasicBlock
IrIndex firstPhi; // may be null
IrSmallArray predecessors;
IrSmallArray successors;
All instructions have a common header (`IrInstrHeader`), which is stored in the `instrPtr`/`numInstructions` arena.
The instruction format supports instructions with 0 or 1 results and with up to 255 arguments.
The header contains:
- 16-bit opcode field
- 8-bit number of arguments
- 8-bit bitfield for various purposes:
  - 1 bit for the `hasResult` flag (on all instructions)
  - 4 bits for the condition kind (for instructions with a condition, like comparison or conditional jump)
- 32-bit payload offset index. Needed because the number of arguments is not fixed.
ushort op;
ubyte numArgs;
ubyte flags;
uint payloadOffset;
Separating the header from the payload gives us dense indexing, which is important if we want to associate extra data with each instruction. This is handy during liveness analysis and register allocation, when we need to map between an instruction and its global index for the linear scan register allocation algorithm.
Dense indexing is also used during instruction selection, but with virtual registers.
Instruction payload is stored in the `instrPayloadPtr`/`numPayloadSlots` arena.
The layout is the following:
[0-or-1 result][0-255 arguments][optional payload]
`payloadOffset` points to the 0-th argument.
If the instruction has a result, it is located before the arguments and can be accessed with `instrPayloadPtr[payloadOffset-1]`.
Optional payload can be used to store additional data specific to a concrete opcode. Its size is opcode-specific. At the moment only the `parameter` instruction uses it, to store the parameter index.
The `result` field (if present) specifies the virtual register that stores the result of the instruction. In some cases it may specify a physical register, as in the case of machine instructions.
If the result is of virtual register kind, the virtual register points its `definition` field at the defining instruction.
Arguments specify instruction arguments; they may be of `virtualRegister`/`constant`/`constantAggregate`/`constantZero`/`func`/`stackSlot`/`global` kinds.
Since we store all instructions in an array, we cannot add/remove/reorder them easily and cheaply. To solve that, each instruction has associated indices of the next/prev instruction. They are stored separately, so that only indices for the relevant iteration direction are loaded into the data cache, as passes usually iterate over all BB instructions either forwards or backwards.
Most of the instructions remain in the order they were generated by the frontend, so in most cases the next instruction will be adjacent in memory. (Disclaimer: I didn't measure different layouts of next/prev instruction indices).
Next/prev indices are stored in the `instrNextPtr`/`numInstructions` and `instrPrevPtr`/`numInstructions` arenas.
Phi functions are stored in the `phiPtr`/`numPhis` arena.
IrIndex blockIndex;
IrIndex result;
IrIndex var;
IrIndex nextPhi;
IrIndex prevPhi;
IrSmallArray args;
Each BB has the index of its first phi function (the `firstPhi` field). All phi functions of a BB are linked into an intrusive linked list via `nextPhi`/`prevPhi`.
`blockIndex` specifies the BB that the phi function resides in.
The order of arguments matches `blockIndex.predecessors`.
The `var` field is used during initial SSA IR construction, and is ignored afterwards.
`args` stores the array of arguments of the phi function; one argument per predecessor BB.
`result` specifies the virtual register that stores the result.
IrIndex definition;
IrIndex type;
IrSmallSet users;
`definition` stores the index of the instruction or phi that defines this register. Since we have SSA form, each vreg has a single definition point.
`type` stores the type index.
`users` is a multi-hash-set that stores the set of all instruction or phi indices that use this register. It used to be an array, but that caused O(n^2) time in the DCE pass when many instructions were dead. The precise number of users is needed because this info is used in liveness analysis.
The way allocation works is that the IR of the function which is currently being created must be at the end of each arena. This way adding new items to any arena is O(1), with no reallocation.
In the `arrayPtr`/`arrayLength` arena we store the array payload for big `IrSmallArray`/`IrSmallSet` instances.
`IrSmallArray` consists of 2 `IrIndex` slots, which either encode 0-2 items inline, or `offset + length` into the `arrayPtr`.
`IrSmallSet` consists of 4 `IrIndex` slots, which either encode 0-4 items inline, or `offset + length + unique length + capacity` into the `arrayPtr`.
`IrIndex` represents the index of any IR entity.
It is a 32-bit value.
The top 4 bits of an `IrIndex` store the `kind`:
enum IrValueKind {
none
instruction
basicBlock
constant
constantAggregate
constantZero
global
phi
stackSlot
virtualRegister
physicalRegister
type
variable
func
array
}
At the moment 15 out of 16 possible values are used.
The meaning of the remaining 28 bits depends on the `kind` value:
- `none` - when the whole `IrIndex` is zero it represents an undefined index. In some cases the lowest 28 bits can be used to store plain integer data. For example it is used this way in `IrSmallArray`/`IrSmallSet` to store length/capacity. Values of this kind will not be fixed during IR appending for inlining.
- `instruction` - index into `IrFunction.instrPtr`
- `basicBlock` - index into `IrFunction.basicBlockPtr`
- `constant` - see the IR Constant section
- `constantAggregate` - the lower 28 bits are an index into the `IrConstantStorage.aggregateBuffer` arena, which stores `IrAggregateConstant` values. `IrAggregateConstant` has `IrIndex type` and `uint numMembers`. Then follow the values of the members of this aggregate. In the case of an array/struct the value of each member follows. In the case of a union there are always 2 members: member 0 is an index, member 1 is a value.
- `constantZero` - a constant of some type where all value bits are zero. The type of the constant is encoded in the same way as when the `IrIndex` is of `IrValueKind.type` kind.
- `global` - index of a global (`IrGlobal`). Globals are stored in the `CompilationContext.globals.buffer` arena.
- `phi` - index into `IrFunction.phiPtr`
- `stackSlot` - index into `IrFunction.stackSlotPtr`
- `virtualRegister` - index into `IrFunction.vregPtr`
- `physicalRegister` - the next highest 8 bits specify the register class, the next 8 bits the register size, and the lowest 12 bits the register index
- `type` - the next highest 4 bits specify the type kind (`IrTypeKind`), and the lowest 24 bits are an index of a type. The index points into the `CompilationContext.types.buffer` arena.
- `variable` - used by the frontend to refer to a specific variable in the source language; the IR builder uses it in its SSA construction process. After the IR is created, `variable` is no longer used anywhere.
- `func` - at the moment the lowest 28 bits just store an index of the AST node. Needs to point to backend-owned data instead.
- `array` - used by `IrSmallArray`/`IrSmallSet` to refer to an offset into `IrFunction.arrayPtr` when a big array/map is needed.
The next 2 highest bits specify the kind of the constant (`IrConstantKind`):
- `intUnsignedSmall` and `intSignedSmall` specify that the lowest 24 bits store the actual constant bits. Signedness is needed to correctly sign-extend the value to its full size.
- `intUnsignedBig` and `intSignedBig` specify that the value doesn't fit into the lowest 24 bits of the `IrIndex` and is instead stored in the `IrConstantStorage.buffer` arena. That arena stores `IrConstant` 64-bit values.
Next 2 highest bits specify the size of a constant (8/16/32/64 bits)
enum IrTypeKind {
basic,
pointer,
array,
struct_t,
func_t
}
- `basic` - the lowest 24 bits encode one of the basic types (misnamed as `IrValueType` atm): `enum IrValueType { noreturn_t, void_t, i8, i16, i32, i64, f32, f64 }`
- `pointer` - points into the type arena (`IrTypePointer`), where the base type of the pointer is stored (`IrIndex baseType`).
- `array` - points into the type arena (`IrTypeArray`), where the base type of the array (`IrIndex elemType`) and the number of elements (`uint numElements`) are stored.
- `struct_t` - points into the type arena (`IrTypeStruct`): `struct IrTypeStruct { uint size; ubyte alignmentPower; bool isUnion; uint numMembers; IrTypeStructMember[0] members_payload; /* all struct/union members follow in memory */ } struct IrTypeStructMember { IrIndex type; uint offset; }`
- `func_t` - points into the type arena (`IrTypeFunction`): `struct IrTypeFunction { uint numResults; uint numParameters; CallConvention callConv; ushort syscallNumber; IrIndex[0] payload; /* result types followed by parameter types */ }`
Stores its type and symbol index for the linking process.
Before inlining starts, the caller is already in modifiable state, meaning the function's data is located at the end of the arenas.
We copy the callee's IR to the end of the caller's IR.
Now we need to increment indices in the appended IR to account for the caller's IR entities.
In the `instrHeaderBuffer` only `_payloadOffset` needs an increment, by `callerIr.numPayloadSlots`.
All other items are laid out so that they can be viewed as a valid `IrIndex[]`, and a single increment per `IrIndex` is performed. The increment depends on the kind of the index. Currently there are only 16 kinds, so we build an offset table with 16 entries, populated from the arena sizes of the caller IR.
uint[16] offsets;
offsets[IrValueKind.instruction] = callerIr.numInstructions;
offsets[IrValueKind.basicBlock] = callerIr.numBasicBlocks;
...
// items that need no increment remain 0 and do not affect IrIndex being fixed
Then we walk the `IrIndex[]` and for each item do:
index.asUint = oldIndex.asUint + offsets[oldIndex.kind];
Before register allocation, the ABI lowering pass replaces calls of the form:
call fun, i32 50, i32 50, i32 32, i8 0, i8 255, i8 0, {i64, i8*} {6, g0}, i8 1
into
0: store {i64, i8*}* s0, {i64, i8*} {6, g0}
2: push i8 1
4: push {i64, i8*}* s0
6: push i8 0
8: push i8 255
10: r1<c0 s2> = move i32 50
12: r2<c0 s2> = move i32 50
14: r8<c0 s2> = move i32 32
16: r9<c0 s0> = move i8 0
18: grow_stack i8 32
20: call fun, r1<c0 s3>, r2<c0 s3>, r8<c0 s3>, r9<c0 s3>
22: shrink_stack i8 64
Liveness analysis then needs to assign correct live intervals. r1, for example, must be live in the [10; 20) interval, so that no virtual register gets this physical register.
The liveness analysis pass iterates basic blocks from end to start, and it needs to handle any instruction that takes/returns data in physical registers. In general such an instruction can be preceded by movs to physical registers and followed by movs from physical registers.
Function calls can also have an instruction between the call and the movs, for stack alignment. This needs to be accounted for: the call instruction has 2 bit flags saying to extend live ranges before or after the instruction for physical registers.
Example:
2: eax = v.20 // args[0]
4: ecx = v.45 // args[2]
6: optional non-mov instruction (for call, which has extendFixedArgRange)
8: edx, ecx = some_instr(eax, v.100, ecx) // edx is results[0]
10: optional non-mov instruction (for call, which has extendFixedResultRange)
12: v.200 = edx // results[0] (aka result)
14: v.300 = ecx // results[1]
The algorithm is to iterate through the arguments and count the number of physical registers. Then for each physical register argument a fixed live interval of the correct length is added.
For example, eax is the 0-th physical argument, defined at position 2, and extendFixedArgRange is set.
Then we add the live range [2; 8) for eax: the length is calculated as ((NUM_REGS - REG_INDEX) + extendFixedArgRange) * STEP == ((2 - 0) + 1) * 2 == 6, giving [8-6; 8).
Alternatively, one can reverse the order of the movs or the order of iteration to simplify the equation.
For function calls, liveness analysis also adds fixed intervals for all volatile registers.
Here are the resulting live ranges:
| 26| ecx = mov i32 50
|| 28| edx = mov i32 50
|| | 30| r8d = mov i32 32
|| | || 32| r9b = mov i8 0
|| | || 34| rsp = sub rsp, i8 32
||| | |||| 36| call fun, rcx, rdx, r8, r9
| 38| rsp = add rsp, i8 64
0 rax [rax]: [36; 37)
1 rcx [rcx]: [26; 37)
2 rdx [rdx]: [28; 37)
4 rsp [rsp]: [32; 40)
8 r8 [r8]: [30; 37)
9 r9 [r9]: [32; 37)
10 r10 [r10]: [36; 37)
11 r11 [r11]: [36; 37)
The same idea works with instructions that use fixed registers for arguments and results, such as `idiv`.
Example of results:
// source code
void fibonacci() {
i32 lo = 0;
i32 hi = 1;
while (hi < 10000) {
hi = hi + lo;
lo = hi - lo;
print(lo);
}
}
// amd64 LIR
func fibonacci() 656 bytes ir:"LIR Amd64" pass:"Live" {
# 0| | @0 SF.
# 2| 0| jmp @2
# 4| | @2 SF. in(@0)
# 6| 1| jmp @3
# 8| | @3 SFL in(@2, @4)
#|DU 8| | v1 i32 = phi1(@2 i32 1, @4 i32 v2) users [i2, i3]
#D| U 8| | v0 i32 = phi0(@2 i32 0, @4 i32 v3) users [i3, i4]
#|U 10| 2| if i32 v1 s>= i32 10000 then @5 else @4
#++ 12| | @4 SF. in(@3)
#UUD 14| 3| v2 i32 = add i32 v1, i32 v0 users [i4, phi1]
#U UD 16| 4| v3 i32 = sub i32 v2, i32 v0 users [i5, phi0]
| | # |U 18| 5| ecx = mov i32 v3
| | # || 20| 6| rsp = sub rsp, i8 32
||| | |||| # || 22| 7| call print, rcx
| # || 24| 8| rsp = add rsp, i8 32
| # || 26| 9| jmp @3
# ^^ 28| | @5 SF. in(@3)
# 30| 10| jmp @1
# 32| | @1 SF. in(@5)
# 34| 11| ret
^^^^^^^^^^^^^^^ ^^^^ ^^ ^^ ^^^^^^^^^^^^^
| | | | `- IR
| | | `- IR instruction index
| | ` linear instruction index
| `- vreg intervals
`-phys reg intervals
// Live intervals
.- interval kind (vreg, phys reg)
| .- interval index
| | .- definition
| | | .- allocated register
| | | | .- live intervals
| | | | | .- users
v vv vvv vvvvv vvvvvvvv vvvvv
p 0 rax [rax]: [22; 23)
p 1 rcx [rcx]: [18; 23)
p 2 rdx [rdx]: [22; 23)
p 4 rsp [rsp]: [18; 26)
p 8 r8 [r8]: [22; 23)
p 9 r9 [r9]: [22; 23)
p 10 r10 [r10]: [22; 23)
p 11 r11 [r11]: [22; 23)
v 32 v0 [no reg]: [8; 17) (14 instr) (16 instr)
v 33 v1 [no reg]: [8; 14) (10 instr) (14 instr)
v 34 v2 [no reg]: [14; 28) (14 instr) (16 instr) (28 phi)
v 35 v3 [no reg]: [16; 28) (16 instr) (18 instr) (28 phi)
Two-address form is used in instructions of the form `dst = op src0 src1`, where `dst` is the same register as `src0`.
It is handled after register allocation.
Ideally the register allocator should be hinted to allocate `dst` to the same register as `src0` (I do not do this), and if they are still different, the basic idea is to insert a `mov dst, src0` before the instruction, so you get:
mov dst, src0
dst = op dst src1
But you need to make sure that `dst != src1`, otherwise that `mov` will overwrite `src1`. To avoid this problem the live interval of `src1` is extended by 1, preventing the same register from being allocated for both `src1` and `dst`. This check is done in the liveness analysis code.
Here is the detailed algorithm (implemented in `linear_scan.d`):
Rewrite
input: r1 = r2 op r3
as
output: r1 = r2
output: r1 = r1 op r3
if r2 != r3
{
if r1 == r2 { // input: r1 = r1 op r3
output: r1 = r1 op r3
} else if r1 == r3 { // input: "r1 = r2 op r1"
if (op.isCommutative) { // input: "r1 = r3 op r2"
output: "r1 = r1 op r2"
} else { // input: "r1 = r2 op r1"
// error, we cannot swap r2 with r1 here to get r2 = r1 op r2
// because r2 will be overwritten. But r2 may be used later
// this case is handled inside liveness analysis pass by extending
// the live range of r3 by 1. This way register allocator will never
// allocate r1 to be the same as r3
}
} else { // input: "r1 = r2 op r3"
output: "mov r1 r2"
output: "r1 = op r1 r3"
}
}
else // input: "r1 = op r2 r2"
{
output: "mov r1 r2" if r1 != r2
output: "r1 = op r1 r1"
}
Source code:
@extern(module, "host")
void print(i32); // external
void fibonacci() {
i32 lo = 0;
i32 hi = 1;
while (hi < 10000) {
hi = hi + lo;
lo = hi - lo;
print(lo);
}
}
Initial IR generated from the AST:
func fibonacci() 576 bytes ir:"IR" pass:"IR gen" {
| @0
1| jmp @2
| @2 in(@0)
2| jmp @3
| @3 in(@2, @4)
| v1 i32 = phi1(@2 i32 0, @4 i32 v3) users [i.4, i.5]
| v0 i32 = phi0(@2 i32 1, @4 i32 v2) users [i.3, i.4]
3| if i32 v0 s< i32 10000 then @4 else @5
| @4 in(@3)
4| v2 i32 = add i32 v0, i32 v1 users [i.5, phi0]
5| v3 i32 = sub i32 v2, i32 v1 users [i.6, phi1]
6| call print, i32 v3
7| jmp @3
| @5 in(@3)
8| jmp @1
| @1 in(@5)
0| return
}
LIR created by instruction selection pass from IR:
func fibonacci() 656 bytes ir:"LIR Amd64" pass:"IR -> LIR" {
| @0
0| jmp @2
| @2 in(@0)
1| jmp @3
| @3 in(@2, @4)
| v1 i32 = phi1(@2 i32 1, @4 i32 v2) users [i2, i3]
| v0 i32 = phi0(@2 i32 0, @4 i32 v3) users [i3, i4]
2| if i32 v1 s>= i32 10000 then @5 else @4
| @4 in(@3)
3| v2 i32 = add i32 v1, i32 v0 users [i4, phi1]
4| v3 i32 = sub i32 v2, i32 v0 users [i5, phi0]
5| ecx = mov i32 v3
6| rsp = sub rsp, i32 32
7| call print, rcx
8| rsp = add rsp, i32 32
9| jmp @3
| @5 in(@3)
10| jmp @1
| @1 in(@5)
11| ret
}
LIR showing live intervals:
func fibonacci() 656 bytes ir:"LIR Amd64" pass:"Live" {
# 0| | @0 SF.
# 2| 0| jmp @2
# 4| | @2 SF. in(@0)
# 6| 1| jmp @3
# 8| | @3 SFL in(@2, @4)
#|DU 8| | v1 i32 = phi1(@2 i32 1, @4 i32 v2) users [i2, i3]
#D| U 8| | v0 i32 = phi0(@2 i32 0, @4 i32 v3) users [i3, i4]
#|U 10| 2| if i32 v1 s>= i32 10000 then @5 else @4
#++ 12| | @4 SF. in(@3)
#UUD 14| 3| v2 i32 = add i32 v1, i32 v0 users [i4, phi1]
#U UD 16| 4| v3 i32 = sub i32 v2, i32 v0 users [i5, phi0]
| | # |U 18| 5| ecx = mov i32 v3
| | # || 20| 6| rsp = sub rsp, i32 32
||| | |||| |||||| # || 22| 7| call print, rcx
| # || 24| 8| rsp = add rsp, i32 32
| # || 26| 9| jmp @3
# ^^ 28| | @5 SF. in(@3)
# 30| 10| jmp @1
# 32| | @1 SF. in(@5)
# 34| 11| ret
}
intervals fibonacci
p 0 rax [rax]: [22; 23)
p 1 rcx [rcx]: [18; 23)
p 2 rdx [rdx]: [22; 23)
p 4 rsp [rsp]: [18; 26)
p 8 r8 [r8]: [22; 23)
p 9 r9 [r9]: [22; 23)
p 10 r10 [r10]: [22; 23)
p 11 r11 [r11]: [22; 23)
p 16 xmm0q [xmm0q]: [22; 23)
p 17 xmm1q [xmm1q]: [22; 23)
p 18 xmm2q [xmm2q]: [22; 23)
p 19 xmm3q [xmm3q]: [22; 23)
p 20 xmm4q [xmm4q]: [22; 23)
p 21 xmm5q [xmm5q]: [22; 23)
v 32 v0 [no reg]: [8; 17) (14 instr) (16 instr)
v 33 v1 [no reg]: [8; 14) (10 instr) (14 instr)
v 34 v2 [no reg]: [14; 28) (14 instr) (16 instr) (28 phi)
v 35 v3 [no reg]: [16; 28) (16 instr) (18 instr) (28 phi)
LIR after register allocation:
func fibonacci() 896 bytes ir:"LIR Amd64" pass:"RA" {
| @0
20| store i64* s1, rbp
18| store i64* s0, rbx
0| jmp @2
| @2 in(@0)
14| eax = mov i32 0
15| ecx = mov i32 1
1| jmp @3
| @3 in(@2, @4)
2| if ecx s>= i32 10000 then @5 else @4
| @4 in(@3)
12| ebx = mov ecx
3| ebx = add ebx, eax
13| ebp = mov ebx
4| ebp = sub ebp, eax
5| ecx = mov ebp
6| rsp = sub rsp, i32 32
7| call print, rcx
8| rsp = add rsp, i32 32
16| eax = mov ebp
17| ecx = mov ebx
9| jmp @3
| @5 in(@3)
10| jmp @1
| @1 in(@5)
19| rbx = load i64* s0
21| rbp = load i64* s1
11| ret
}
intervals fibonacci
v 32 v0 [eax]: [8; 17) (14 instr) (16 instr)
v 33 v1 [ecx]: [8; 14) (10 instr) (14 instr)
v 34 v2 [ebx]: [14; 28) (14 instr) (16 instr) (28 phi)
v 35 v3 [ebp]: [16; 28) (16 instr) (18 instr) (28 phi)
Final x64 machine code:
sub rsp,0x18
mov QWORD PTR [rsp+0x0],rbp
mov QWORD PTR [rsp+0x8],rbx
xor eax,eax
mov ecx,0x1
cmp ecx,0x2710
jge 0x42
mov ebx,ecx
add ebx,eax
mov ebp,ebx
sub ebp,eax
mov ecx,ebp
sub rsp,0x20
call QWORD PTR [rip+0x0] // call print
add rsp,0x20
mov eax,ebp
mov ecx,ebx
jmp 0x15
mov rbx,QWORD PTR [rsp+0x8]
mov rbp,QWORD PTR [rsp+0x0]
add rsp,0x18
ret
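The register assignment shown in the RA dump above comes from the linear scan pass. As an illustration only, the classic linear scan core (Poletto & Sarkar) can be sketched like this; the actual linear_scan.d additionally handles SSA phis, fixed physical intervals, spilling, and the two-address constraint, and the register pool below is an assumption chosen for the example:

```python
# Illustrative sketch of classic linear scan register allocation,
# NOT the actual linear_scan.d code.
def linear_scan(intervals, registers):
    """intervals: dict name -> half-open (start, end); registers: free pool."""
    free = list(registers)
    active = []                      # (end, name) pairs currently in registers
    assignment = {}
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        for e, n in list(active):    # expire intervals that ended by `start`
            if e <= start:
                active.remove((e, n))
                free.append(assignment[n])
        if not free:
            raise RuntimeError("spill needed (not modeled in this sketch)")
        assignment[name] = free.pop(0)
        active.append((end, name))
    return assignment

# The four virtual intervals from the dump above, as [start; end) pairs:
regs = linear_scan(
    {"v0": (8, 17), "v1": (8, 14), "v2": (14, 28), "v3": (16, 28)},
    ["eax", "ecx", "ebx", "ebp"])
```

With this pool the sketch happens to reproduce the assignment from the dump: v0 in eax, v1 in ecx, v2 in ebx, v3 in ebp.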
Tests in Vox are defined in the source code via string literals annotated with a TestInfo value. The test runner then iterates all definitions in test modules and picks those that are marked with @TestInfo.
Here is what is inside TestInfo:
struct TestInfo
{
void function(ref TestContext) tester;
HostSymbol[] hostSymbols;
}
Here is what the simplest test case looks like:
@TestInfo() // annotation
immutable test = q{}; // string with the source code
immutable test = q{}; is equivalent to immutable string test = "";. In D you can omit string in this case. q{} denotes a token literal, which for most purposes just makes the editor syntax-highlight the string content.
Inside the test string I use the https://github.com/marler8997/har (HAR) format. Its basic idea is that you can concatenate multiple text files into one, like an archive:
--- main.vx
import kernel32;
void main() { ExitProcess(42); }
--- kernel32.vx
void ExitProcess(u32 uExitCode);
Each file begins with a header of the form --- <path>. This way we can define test cases with as many files as we want, in a single string.
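As an illustration (this is not the actual test-runner code), splitting a HAR-style string into files takes only a few lines of Python:

```python
# Minimal sketch of splitting a HAR-style string into (path, content)
# pairs, NOT the actual test-runner code. Every line that starts with
# "--- " opens a new file; the following lines belong to it.
def split_har(text):
    files = []
    for line in text.splitlines():
        if line.startswith("--- "):
            files.append((line[4:].strip(), []))
        elif files:
            files[-1][1].append(line)
    return [(path, "\n".join(body)) for path, body in files]

src = """--- main.vx
import kernel32;
void main() { ExitProcess(42); }
--- kernel32.vx
void ExitProcess(u32 uExitCode);
"""
files = split_har(src)   # [("main.vx", ...), ("kernel32.vx", ...)]
```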
Now we can define a test case that defines a test.vx module with a get42 function in it:
@TestInfo()
immutable test = q{
--- test.vx
i32 get42() {
return 42;
}
};
Since we JIT-compile the test case, we can also run the code we compiled:
@TestInfo(&tester)
immutable test = q{
--- test.vx
i32 get42() {
return 42;
}
};
void tester(ref TestContext ctx) {
auto run = ctx.getFunctionPtr!(int)("get42"); // get pointer to get42
assert(run() == 42); // call it
}
Here we define a D function called tester and pass a pointer to it to the @TestInfo annotation. After the test runner compiles the test, it checks whether TestInfo.tester is set, and if so, tester is called.
Inside tester we can introspect the compiled test case. Usually test cases retrieve a pointer to some function in the test case and call it. This is possible because Vox uses the C calling convention by default, and D can call functions marked with extern(C), which is what getFunctionPtr returns.
After we get the function pointer we can call it with any arguments for testing purposes.
But what if we want the Vox program to call D code? This is possible too. There are two options:
- bind D function to an external declaration within compiled test case
- or pass function pointer to the Vox program
Example of the first option:
@TestInfo(&tester, [HostSymbol("add", cast(void*)&external_func)])
immutable test = q{
--- test.vx
i32 run(i32 par) {
return add(par, 10);
}
i32 add(i32, i32); // external declaration
};
extern(C) int external_func(int par1, int par2) {
return par1 + par2;
}
void tester(ref TestContext ctx) {
auto run = ctx.getFunctionPtr!(int, int)("run");
int result = run(13);
assert(result == 23);
}
We declare a D function called external_func that sums two ints; extern(C) makes it use the C calling convention. With HostSymbol("add", cast(void*)&external_func) we define a binding from the name add to the external_func function. When the compiler stumbles upon the external declaration of add, it links it to the provided pointer.
Here is an example of the second option, passing the function pointer manually:
@TestInfo(&tester)
immutable test = q{
--- test.vx
i32 run(i32 par, i32 function(i32, i32) add) {
return add(par, 10);
}
};
extern(C) int external_func(int par1, int par2) {
return par1 + par2;
}
void tester(ref TestContext ctx) {
auto run = ctx.getFunctionPtr!(int, int, extern(C) int function(int, int))("run");
int result = run(13, &external_func);
assert(result == 23);
}
To define a test case that must fail compilation, you add a special file named <error>. The test runner looks for it when setting up the compiler. If it is present, the content of this file must match the error message that the compiler generates.
@TestInfo()
immutable failing = q{
--- failing.vx
i32 fun(){
bar();
}
--- <error>
failing.vx:2:3: Error: undefined identifier `bar`
};
The test runner also supports building executables, which I use to test executable generation. I only have 2 tests that produce an executable, because that is much slower than doing everything in memory.
At the time of writing I have 310 tests, all of which run in ~90ms on Windows; the 2 executable tests account for ~10ms of that. So it is on average 5ms per executable test, which compiles, writes the executable file, and runs it, and 0.19ms per JIT test, which performs no IO. Together with compiling the compiler from scratch, it takes me 3.6s to build and run the compiler with the whole test suite, so the iteration time is quite nice.
The calling convention of a regular function on x86-64 specifies that the stack is aligned to 16/32/64 bytes before the call, and the call instruction pushes the 8-byte return address. This is not the case with the entry point: there the stack is always aligned to 16 bytes.
See section 3.4.1 of the System V Application Binary Interface AMD64 Architecture Processor Supplement.
It says the following:
%rsp: The stack pointer holds the address of the byte with lowest address which is part of
the stack. It is guaranteed to be 16-byte aligned at process entry.
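To make the arithmetic concrete, here is a small sanity check (illustration only) of rsp modulo 16 through the fibonacci machine code shown earlier: the caller aligns rsp to 16 bytes, call pushes an 8-byte return address, the prologue's sub rsp,0x18 restores 16-byte alignment, and the sub rsp,0x20 around the print call keeps the call site aligned.

```python
# rsp modulo 16, tracked through the fibonacci machine code above.
rsp = 0                    # 16-byte aligned before the caller's `call`
rsp = (rsp - 8) % 16       # `call` pushes the 8-byte return address
assert rsp == 8            # regular functions enter with rsp % 16 == 8
rsp = (rsp - 0x18) % 16    # prologue: sub rsp,0x18
assert rsp == 0            # 16-byte aligned again
rsp = (rsp - 0x20) % 16    # sub rsp,0x20 before `call print`
assert rsp == 0            # call site is 16-byte aligned, as required
```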