B-minor is a C-like, strictly-typed, imperative programming language. It is most notably used in undergraduate computer science compiler courses as a semester-long project.
This repository contains my own personal implementation of the language's specifications, completed as partial requirement of TCNJ's CSC-425: Compilers and Interpreters course in the Spring 2021 semester. The project was an individual, semester-long guide through our rigorous study of how programming language specifications are implemented as actual, executable compilers. This implementation targets AArch64 or ARM64 as its target architecture, meaning that any computer from this family of architectures can actually execute the generated machine instructions that result from this compiler!
- Implemented Features
- Tools Used
- Compilation and Usage
- Files in this Repository
- Implementation Details
- Example Output
Suprisingly, B-Minor features a large number of common programming language constructs. While more complex features such as classes, objects, and multi-dimensional arrays are not supported, I was proud of the long list of features that are implemented and working as of the completion of the project. These include:
- Strict typing system:
- Four primitive data types:
integer
,boolean
,char
, andstring
- Multiple supported escape sequences (
\n
,\t
, etc.) forchar
andstring
- Four primitive data types:
- Memory abstraction via variables:
- Global and local primitive variable declaration and assignments
- Global declaration and local access of single-dimensional arrays
- Static scoping:
- Accessing both local and non-local variables within a function
- Nested scope, such as control statements within a function
- Variable shadowing in the case of nested scopes
- Expressions:
- Arithmetic expressions, including exponentiation, unary negation, and modulus
- Conditional expressions through the use of comparison operators
- Logical expressions with
&&
,||
, and!
- Assignment expressions for variable value assignment
- Flow of program control:
- Multiple-selection statements in the form of
if
andelse
statements - Iteration statements in the form of
for
andwhile
statements
- Multiple-selection statements in the form of
- Support for subprograms:
- Global declarations of functions and function prototypes
- Function calls with up to 6 arguments
- Returning a value from a function
- Simple program output:
- Printing single and multiple expressions in a single line
- Compatibility with the standard C library
- Single and multi-line comments
- Stack-dynamic local variables, allowing for recursion
Though the language's implementation was written mostly from scratch, there were a number of tools and concepts used in order to make the project manageable for a single-semester course. These tools are listed below:
- C - The compiler was written entirely in the C programming language. Its low-level nature was a perfect fit for the project, and ability to create more advanced data structures using pointers was essential for features such as symbol tables and abstract syntax trees.
- Regular Expressions - Regular expressions were used to describe the patterns of tokens and lexemes in the language. They greatly simplify the process of detecting valid and invalid programs based on their most basic units.
- Flex - A scanner generator for C. This tool was largely used to create the lexical analyzer for the compiler. Regular expressions were used to descripe the different tokens and lexemes of the language, and Flex could then generate the code for the lexical analyzer.
- Bison - A parser generator for C. This tool was used to create the syntax analyzer for the compiler. The language was described using a context-free grammar in BNF so that Bison could then generate the code for the parser.
- Make and Makefile - A build automation tool for C. This tool was used to manage compiling program components in the correct order due to a number of dependency concerns between modules of the program.
- Git and GitHub - The VCS and hosting platform used to manage the project's source code.
In order to compile this B-Minor compiler, you will need to ensure that you have Make, Flex, and Bison installed on your machine.
- Compile the compiler by running
make
in the root directory of the project. This will generate the executablebminor
in the root directory. - Write your B-Minor source code that you would like to use. This source code must use the
.bminor
file extension. - Run the compiler on your source code with the following command. The output file must have the
.s
file extension. The step can be any of the named steps mentioned in the pipeline (scan
,parse
, etc.), but to generate the final machine code, you must use thecodegen
step:
./bminor -<step> <source_file>.bminor <output_name>.s
- On a machine with an AArch64 or ARM64 processor, you can compile the generated assembly code into an executable binary with the following command. Note that the
library.c
file is required for theprint
function to work properly:
gcc -g <source_file>.s library.c -o <output_name>
- Finally, you can run your executable binary as you would expect:
./<output_name>
In order to keep the project more organized, each logical component of the compiler would be kept in its own pair of header and implementation files. The following is a list of the files in this repository and a brief description of their purpose. Note that files with an asterisk (*) are the result of some intermediate compilation step, and are not present in the repository:
- General Project Files:
main.c
- This file serves as the entry point for the compiler. It contains the main functions for scanning, parsing, pretty printing, type checking, and generating machine code for a B-Minor program.library.c
- This file contains the implementation of the B-Minor standard library, which is used to provide the compiler with the necessary functions for printing to the console.Makefile
- This file contains the build instructions for the compiler. It is used by the Make tool to compile the compiler.
- Lexical Analysis:
scanner.flex
- This file contains the regular expressions used to describe the tokens and lexemes of the language. It is used by Flex to generate the lexical analyzer for the compiler.- *
scanner.c
- This file contains the implementation of the lexical analyzer for the compiler. It was generated by Flex from thescanner.flex
file.
- Syntax Analysis:
parser.bison
- This file contains the context-free grammar used to describe the syntax of the language. It is used by Bison to generate the parser for the compiler.- *
parser.c
- This file contains the implementation of the parser for the compiler. It was generated by Bison from theparser.bison
file. - *
token.h
- This file contains the token definitions for the language. It was generated by Bison from theparser.bison
file.
- Abstract Syntax Tree Node Types:
decl.h
anddecl.c
- These files contain the data structures and functions used to represent, print, typecheck, and generate machine code for declarations in the language.expr.h
andexpr.c
- These files contain the data structures and functions used to represent, print, typecheck, and generate machine code for expressions in the language.param_list.h
andparam_list.c
- These files contain the data structures and functions used to represent, print, typecheck, and generate machine code for parameter lists in the language.stmt.h
andstmt.c
- These files contain the data structures and functions used to represent, print, typecheck, and generate machine code for statements in the language.type.h
andtype.c
- These files contain the data structures and functions used to represent, print, and typecheck types in the language.
- Scope and Name Resolution:
hash_table.h
andhash_table.c
- These files contain the data structures and functions used to represent hash tables in the scope stack for typechecking and name resolution. They were provided by the course instructor.scope.h
andscope.c
- These files contain utility functions for entering, exiting, and managing scopes during the typechecking and name resolution phases of the compiler.symbol.h
andsymbol.c
- These files contain the data structures and functions used to represent symbols in the symbol table for typechecking and name resolution.
- Machine Code Generation:
scratch.h
andscratch.c
- These files contain utility functions for determining which registers could be used for variable storage during the code generation phase of the compiler.label.h
andlabel.c
- These files contain utility functions for generating unique labels during the code generation phase of the compiler.- *
bminor.exe
- This executable file is the result of compiling the compiler. It is used to compile B-Minor programs into machine code.
- Testing:
run-tests.sh
- A shell script used to run the test suite for the compiler. It compiles the compiler, runs it on each test file, and compares the output to the expected output.tests/
- A directory containing the test suite for the compiler. Half of the tests were written by the course instructor and the other half were written by me.
The compilation process of a B-Minor program (and most other programming languages) is like a pipeline, where the input is the source code and the output is the machine code. The pipeline for B-Minor is broken down into five main stages: the lexical analyzer, the syntax analyzer, the 'pretty printer', the type checker / name resolver, and the code generator. The following sections will describe each stage in more detail.
A B-Minor program is nothing more than a sequence lexemes, or the smallest meaningful units of a language. char
, 45
, myVar
and for
are all valid lexemes, as per the rules of the language. These rules are specified in the scanner.flex
file, which is used by Flex to generate the lexical analyzer for the program.
As per the specification of a Flex input file, scanner.flex
contains a mapping of regular expressions to their associated tokens in the language. For example, take a look at the following lines from this file:
DIGIT [0-9]
LETTER [a-zA-Z]
({LETTER}|"_")({LETTER}|"_"|{DIGIT}){0,255} { return TOKEN_IDENT; }
The first two lines are constants used to simplify the rest of the regular expressions and reduce redundancy. As you can see, a LETTER
is any character that satisfies the regular expression [a-zA-Z]
.
The third line is a Flex rule used to match a valid identifier in B-Minor. According to the B-Minor language specification, an identifier is a sequence of characters that begins with a letter or underscore and is followed by no more than 255 letters, underscores, or digits. The curly braces {}
to the right of the regex are used to specify the action to be taken when a match is found. In this case, the action is to return the token TOKEN_IDENT
to the parser.
The scanner.flex
file contains rules for the simplest language constructs like keywords...
else { return TOKEN_ELSE; }
false { return TOKEN_FALSE; }
for { return TOKEN_FOR; }
...as well as more complex tokens such as string and char literals...
'((\\?[^\\\'\n])|(\\\\)|(\\\'))' { return TOKEN_CHARLIT; }
\"((\\.|[^\\"\n]){0,255})\" { return TOKEN_STRINGLIT; }
After a program has been scanned by the lexical analyzer, and no invalid lexemes have been found in the source program, the syntax analyzer is used to parse the program into a tree-like structure called an abstract syntax tree (AST). The input for the parser is the output of the scanner: a stream of tokens. You can see how the pipeline analogy is starting to make sense now!
The parser ensures that the syntax of the input program is valid: for example, this is not a valid B-Minor expression:
x: integer + 'e' == foo(1, --);
While all of the individual lexemes of this statement are valid (by the way, they are x
, :
, integer
, +
, 'e'
, ==
, foo
, (
1
, ,
, --
, )
, and ;
!), they are not in an order that makes any sense for the language. The parser will catch this error and report it to the user.
The syntax rules for the language are defined in the Bison input file, named parser.bison
. Just like the scanner, it contains a number of BNF-like rules that contain terminal and non-terminal symbols. The terminal symbols are the tokens that are output by the scanner, and the non-terminal symbols are the rules that describe the structure of the language. For example, take a look at the syntax rule for an paramslist
, or parameter list, in B-Minor:
paramslist : ident TOKEN_COLON type TOKEN_COMMA paramslist
| ident TOKEN_COLON type
| ident TOKEN_COLON nosizearr TOKEN_COMMA paramslist
| ident TOKEN_COLON nosizearr
;
Additionally, each rule, when matched, has an associated action. Here are the actions for each of the rules above:
{$$ = param_list_create($1, $3, $5);}
{$$ = param_list_create($1, $3, 0);}
{$$ = param_list_create($1, $3, $5);}
{$$ = param_list_create($1, $3, 0);}
There's a lot here, but let's break it down:
- The symbol to the left of the colon
:
is the name of the non-terminal symbol, or the rule that is being defined. In this case, we are defining the syntax for aparamslist
. - Everything to the right of the colon
:
is a production, or a rule that describes how the non-terminal symbol can be constructed. These can be made up of terminal and/or non-terminal symbols. Each production is separated by a pipe|
character, telling us that there are multiple ways to construct a the non-terminal. In this case, there happens to be 4 different ways to construct aparamslist
. - Each production has an associated action inside of curly braces
{}
. This action is executed when the parser encounters this production. Each action in the parser is generally the creation of a new node in the abstract syntax tree, or parse tree, which is the output of this step in the pipeline. - The rules are sometimes defined recursively, as is made obvious by the fact that the
paramslist
non-terminal is used in the definition of itself. This is a common pattern in the parser, and is used to describe the structure of the language. A parameter list has an arbitrary number of parameters, so the best way to define a list of parameters is as a parameter followed by a comma and another parameter list, or just a parameter by itself.
Each major node type for the AST was given its own set of header and implementation files. These included:
decl.h
anddecl.c
for declarationsexpr.h
andexpr.c
for expressionsparam_list.h
andparam_list.c
for parameter listsstmt.h
andstmt.c
for statementstype.h
andtype.c
for types
Within these files, functions for creating AST nodes were defined. In fact, they are called in the Bison specification file, parser.bison
, to create the nodes in the parse tree. Take another look at the production rules for paramslist
, and you'll notice that the action associated with each matched rule is actually a call to a function called param_list_create()
.
This step in the compilation process is quite complex, and requires complete knowledge of the language's specifications. You can see how even a simple language such as B-Minor has a number of production rules that must be defined in the parser. The parser is also responsible for operator associativity and precedence, defined in part by the order that production rules are applied to make up the expr
node in the abstract syntax tree.
Though not required for a real compiler, we were tasked with the development of a 'pretty printer'. This component in the pipeline really served as a test of the parser's output: the AST. The pretty printer would traverse the nodes of the abstract syntax tree and print each node to the console in a human-readable format, which included proper indentation for nested language constructs and scopes.
The pretty printer could also be seen as an optimizer, as it removed some redundancy in the AST before typechecking and code generation would occur. For example, expressions with redundant or simply unneccesary groupings for precedence, such as (1 * 2) + 3
, would be printed as 1 + 2 * 3
. This is a simple example, but the pretty printer would also remove redundant parentheses in more complex expressions, such as (((((((1 + 2) * 3)))) + 4))
. This served to reduce the number of nodes in the AST, as well as ensure that the AST was being traversed in the proper order to take care of the language's associativity and precedence rules.
In order to implement this, each of the AST node files (decl.c
, expr.c
, etc.) contains a print function that is called by the pretty printer. The AST is traversed in a depth-first manner, and each visited node is printed to the console. The pretty printer also keeps track of the current indentation level, which is used to print the correct number of spaces before each line of output. By traversing the AST in the same manner that the code generator eventually will, this component of the pipeline was also used to ensure that the parser was outputting the correct intermediate representation of the language.
Now that the AST was known to have been traversed correctly, it could be passed to this step in the compilation pipeline. Issues including scope, type, and name resolution are all handled in this step. As the AST was being traversed, sometimes a new scope would be found. This could occur in a number of instances, such as nested loops, selection statements within functions, any plenty of other common language constructs. In order to handle name resolution to either local or non-local variables, a series of symbol tables within a stack were used. All of the code for scope resolution and name resolution was contained within the scope.c
and scope.h
files. A hash table implementation was provided by the course staff, which can be found in the hash_table.c
and hash_table.h
files.
Each time a new scope was found, a new symbol table was pushed onto the stack. This symbol table would contain all of the variables that were declared within the scope. When a variable was declared, it would be added to the symbol table. When a variable was accessed, the symbol table stack would be searched from top to bottom, and the first instance of the variable would be returned. This ensured that the correct variable was accessed, and that the variable was declared in the correct scope. Non-local variables were also taken care of: not finding a variable in the current scope would cause the symbol table stack to be searched for the variable in the next scope up, and so on.
When a scope was exited, the symbol table was popped off of the stack. This ensured that the symbol table stack would always contain the correct symbol tables for the current scope. This was also used to ensure that the correct variable was accessed, as the symbol table stack would be searched from top to bottom, and the first instance of the variable would be returned.
Declarations, expressions, and statements all required the proper typechecking, implemented through the inclusion of an appropriate typecheck()
function, such as decl_typecheck()
. For example, any declaration required the type to the right of the assignment operator to match the type to the left of the assignment operator. This was done by checking the type of the expression on the right side of the assignment operator, and comparing it to the type of the variable on the left side of the assignment operator. If the types did not match, an error was thrown.
Take the following declaration as an example:
x: integer = (true && b);
The following steps would occur in order to ensure that the proper typeing rules were enforced:
- The type of
x
, indicated by the declaration asinteger
, is compared against the type of(true && b)
, which cannot be determined without typechecking this sub-expression. - The type of
(true && b)
should beboolean
, as indicated by the operator (&&
) being used. However, the type checker needs to confirm that both operands of the&&
operator are of typeboolean
. This is done by typechecking the left and right operands of the&&
operator, which aretrue
andb
, respectively. true
is easy to typecheck, butb
must be searched for in the symbol table, through the process of name resolution described in the scoping section above. If we suppose thatb
is of typeboolean
, then the type of(true && b)
isboolean
.- With the types of
x
and(true && b)
now both known, they are compared, and because they are not the same, an error is thrown.
This same process must be repeated for every expression, statement, and declaration in the abstract syntax tree. If at any point the types do not match, it means that the language specifications have been violated, as per the strict typing system of B-Minor as a language.
Finally, the abstract syntax tree could be traversed for a final time in order to produce the final output of the compiler: the assembly code. This was done by traversing the AST in a depth-first manner, and calling the appropriate code generation function for each node. The code generation functions, much like the AST node creation, printing, and typechecking functions, were contained within files such as decl.c
, expr.c
, etc.
Because the target architecture was ARM64 or AArch64, the code generated was in the form of ARM64 assembly code. Each node of the AST was called in the same depth-first manner as the typechecker, and the appropriate code generation function was called. This function would determine what machine instructions should be written to the output file, and would also call the code generation functions for any sub-expressions, statements, or declarations that were required.
scratch.c
, scratch.h
, label.c
, and label.h
were a few additional utility files that were written in order to keep track of register allocation and proper assembly labels to use.
In the end, after the machine code was generated, it could be assembled, linked, and executed on the appropriate architecture.
Below is a sample B-Minor program that features a number of complex features of the language. Following it is the output of the compiler, which contains the raw assembly code that was generated.
// simple program testing arrays; calculates some basic statistics for a global array
// of positive integer test scores, including average, maximum, and minimum scores
// global array of test scores
scores : array [10] integer = {97, 87, 83, 99, 77, 80, 72, 100, 92, 88};
main : function integer (argc : integer, argv : array [] string) = {
// variables used for calculating test statistics
i : integer = 0;
total : integer = 0;
avg : integer = 0;
max : integer = scores[0];
min : integer = scores[0];
// looks through each of the test scores
for(i = 0; i < 10; i++) {
// totals up all of the test scores
total = total + scores[i];
// updates max if applicable
if(scores[i] > max) {
max = scores[i];
}
// updates min if applicable
if(scores[i] < min) {
min = scores[i];
}
}
// compute average using total from above
avg = total / 10;
// print the statistics
print "The average test score is ", avg, "\n";
print "The maximum test score is ", max, "\n";
print "The minimum test score is ", min, "\n";
return 0;
}
# ARMv8-a assembly code generated by
# 'B-Minor Compiler' v1.0,
# written by:
#
# PAYTON JAMES SHALTIS
# COMPLETED MAY 4TH, 2021 for
#
# CSC-425: "Compilers and Interpreters",
# Professor John DeGood, Spring 2021 at
# The College of New Jersey
.arch armv8-a
.file "output.s"
.data
.global scores
.align 3
scores:
.word 97
.word 87
.word 83
.word 99
.word 77
.word 80
.word 72
.word 100
.word 92
.word 88
.text
.global main
main:
stp x29, x30, [sp, #-120]!
str x0, [sp, 16]
str x1, [sp, 24]
mov x9, 0
str x9, [sp, 32]
mov x9, 0
str x9, [sp, 40]
mov x9, 0
str x9, [sp, 48]
adrp x9, scores
add x9, x9, :lo12:scores
mov x10, 0
mov x11, 4
mul x10, x10, x11
ldr w0, [x9, x10]
mov x9, x0
str x9, [sp, 56]
adrp x9, scores
add x9, x9, :lo12:scores
mov x10, 0
mov x11, 4
mul x10, x10, x11
ldr w0, [x9, x10]
mov x9, x0
str x9, [sp, 64]
mov x9, 0
str x9, [sp, 32]
.LSTMT1:
ldr x9, [sp, 32]
mov x10, 10
cmp x9, x10
b.lt .LCOND1
mov x10, 0
b .LCOND2
.LCOND1:
mov x10, 1
.LCOND2:
cmp x10, 0
b.eq .LSTMT2
ldr x9, [sp, 40]
adrp x10, scores
add x10, x10, :lo12:scores
ldr x11, [sp, 32]
mov x12, 4
mul x11, x11, x12
ldr w0, [x10, x11]
mov x10, x0
add x10, x9, x10
str x10, [sp, 40]
adrp x9, scores
add x9, x9, :lo12:scores
ldr x10, [sp, 32]
mov x11, 4
mul x10, x10, x11
ldr w0, [x9, x10]
mov x9, x0
ldr x10, [sp, 56]
cmp x9, x10
b.gt .LCOND3
mov x10, 0
b .LCOND4
.LCOND3:
mov x10, 1
.LCOND4:
cmp x10, 0
b.eq .LSTMT3
adrp x9, scores
add x9, x9, :lo12:scores
ldr x11, [sp, 32]
mov x12, 4
mul x11, x11, x12
ldr w0, [x9, x11]
mov x9, x0
str x9, [sp, 56]
.LSTMT3:
adrp x9, scores
add x9, x9, :lo12:scores
ldr x10, [sp, 32]
mov x11, 4
mul x10, x10, x11
ldr w0, [x9, x10]
mov x9, x0
ldr x10, [sp, 64]
cmp x9, x10
b.lt .LCOND5
mov x10, 0
b .LCOND6
.LCOND5:
mov x10, 1
.LCOND6:
cmp x10, 0
b.eq .LSTMT4
adrp x9, scores
add x9, x9, :lo12:scores
ldr x11, [sp, 32]
mov x12, 4
mul x11, x11, x12
ldr w0, [x9, x11]
mov x9, x0
str x9, [sp, 64]
.LSTMT4:
ldr x9, [sp, 32]
add x9, x9, 1
str x9, [sp, 32]
b .LSTMT1
.LSTMT2:
ldr x9, [sp, 40]
mov x10, 10
sdiv x10, x9, x10
str x10, [sp, 48]
adrp x9, .LSLIT1
add x9, x9, :lo12:.LSLIT1
mov x0, x9
bl print_string
ldr x9, [sp, 48]
mov x0, x9
bl print_integer
adrp x9, .LSLIT2
add x9, x9, :lo12:.LSLIT2
mov x0, x9
bl print_string
adrp x9, .LSLIT3
add x9, x9, :lo12:.LSLIT3
mov x0, x9
bl print_string
ldr x9, [sp, 56]
mov x0, x9
bl print_integer
adrp x9, .LSLIT4
add x9, x9, :lo12:.LSLIT4
mov x0, x9
bl print_string
adrp x9, .LSLIT5
add x9, x9, :lo12:.LSLIT5
mov x0, x9
bl print_string
ldr x9, [sp, 64]
mov x0, x9
bl print_integer
adrp x9, .LSLIT6
add x9, x9, :lo12:.LSLIT6
mov x0, x9
bl print_string
mov x9, 0
mov x0, x9
ldp x29, x30, [sp], #120
ret
ldp x29, x30, [sp], #120
ret
.section .rodata
.align 8
.LSLIT1:
.string "The average test score is "
.section .rodata
.align 8
.LSLIT2:
.string "\n"
.section .rodata
.align 8
.LSLIT3:
.string "The maximum test score is "
.section .rodata
.align 8
.LSLIT4:
.string "\n"
.section .rodata
.align 8
.LSLIT5:
.string "The minimum test score is "
.section .rodata
.align 8
.LSLIT6:
.string "\n"