We already know that a program is a state machine, and if the program becomes complex, so do the state machine's transfers. Not to mention the details of each state in the state machine, it's hard to fully understand how the state machine transfers between states.
It would be inefficient to use GDB to understand this process.
To improve efficiency, we can use printf()
to output some of the information we care about.
The fact that we care about the details of how this state-machine transfer in its states it also means that we care about the execution of the program.
In the field of software engineering, the information that records the execution of a program is called trace.
With the trace information, we can determine whether the program execution process is as expected, and thus diagnose bugs.
NEMU has implemented a simple trace function -- itrace (instruction trace), which can record every instruction executed by the client program.
The implementation of itrace is very simple, the code just records each instruction fetched by instr_fetch()
, and then calls the disassembly function provided by the llvm project (implemented in nemu/src/utils/disasm.cc
).
Itrace outputs the PC, binary representation of the instruction and the disassembly result.
The framework code has this feature turned on by default, and instructions executed by the client program are logged to build/nemu-log.txt
.
By looking at this file, you can see how the client program is running.
NEMU can restrict when we output traces, you can manually specify when to output them, and even customize the conditions for outputting traces. For how to do this, please RTFSC. Since the current behavior of the program is deterministic, multiple runs will yield the same result. This is helpful when the program went in an error state.
For neat traces, we can also filter them with text processing tools such as grep
, awk
, sed
, etc. So if you understand how to use shell commands for text processing, you can further improve your debugging efficiency.
Generally speaking, we only care about the trace before the error scene, when running some big programs, the trace much earlier in the run is not necessary to check or even output. A natural idea is that we can output the most recent instructions executed right before the client program error happens (such as out of bounds memory access).
It is not difficult to implement this feature, all we need to do is to maintain a very simple data structure - a ring buffer.
Specifically, each time an instruction is executed, the information about that instruction is written to the ring buffer; if the buffer is full, the old contents are overwritten.
When the client program encounters the error, the instructions in the ring buffer are printed out for debugging purposes.
An example of output is as follows, where -->
indicates the instruction causing the error.
0x80002b00: srli a2, a2, 1 00 16 56 13
0x80002b04: slli a1, a1, 1 00 15 95 93
0x80002b08: or a0, s1, a5 00 f4 e5 33
0x80002b0c: sub s0, s0, a5 40 f4 04 33
0x80002b10: jal -112 f9 1f f0 ef
0x80002aa0: lui a5, 524295 80 00 77 b7
0x80002aa4: lw a5, 2028(a5) 7e c7 a7 83
0x80002aa8: addi sp, sp, -32 fe 01 01 13
--> 0x80002aac: sw s2, 16(sp) 01 21 28 23
0x80002b3c: ret 00 00 80 67
0x80002b14: add s2, s2, a0 00 a9 09 33
0x80002b18: bnez s0, -40 fc 04 1c e3
0x80002af0: neg a5, s0 40 80 07 b3
0x80002af4: and a5, a5, s0 00 87 f7 b3
0x80002af8: or a2, s3, a5 00 f9 e6 33
0x80002afc: or a1, s4, a5 00 fa 65 b3
Based on the above, implement iringbuf in NEMU. You can design the output format according to your own preference. If you want to output the disassembly of an instruction, you can refer to the code for itrace. If you don't know what code to add where, you need RTFSC.
Memory access accounts for a large portion of program execution, and if you have encountered some memory access related errors (e.g. access physical memory that is out of bounds), you will want to know the exact behavior of the program's memory accesses, and then identify incorrect accesses to help you diagnose the bug. In fact, we can easily track the results of memory accesses and compile a memory trace.
This function is so simple that you've already figured out how to implement it: Just log in
paddr_read()
andpaddr_write()
. You can define the format of the mtrace output yourself.However, unlike iringbuf, which only outputs once at the end, programs generally execute many access instructions, This means that turning on mtrace will generate a lot of output, so it's best to turn mtrace off when you don't need it. Oh, well, let's look at the itrace implementation: try adding the appropriate code to Kconfig and the associated files to enables us to turn mtrace on and off via menuconfig. It is also possible to implement conditions on the mtrace output, e.g., you may only care about accessing a certain section of memory. With the associated conditional control, mtrace is much more flexible to use.
Itrace and mtrace both trace the program from the underlying state machine perspective, but if we want to understand the semantic behavior of the program, itrace and mtrace can't help us. Therefore, we need a tool that traces the semantic behavior of a program.
The question is, what semantics should we choose? In programming class, we learned that a program consists of a number of functions, and a function consists of a number of statements, and a statement is compiled into a number of instructions, so the only thing that can clearly carry the semantics of a program is a function. A statement is compiled into a number of instructions, so the only thing that clearly carries the semantics of a program is a function. Imagine if we could design a tool, ftrace, to trace function calls and returns during program execution. Wouldn't we be able to tell how the program is working?
This is not difficult, because itrace already traces all the instructions executed by the program. To implement ftrace, all we need to care about is the function call and return instructions. We can record the destination address in the function call instruction, indicating that a function will be called. Then we can record the current PC in the function return instruction, indicating that we are returning from the function where the PC is located. It's easy to accomplish this. But the target address and PC value still lack program semantics, and would be easier to understand if we could translate them into function names!
Given an address in a code segment, how do you know which function it is in? This is where the symbol table in the ELF file comes in. The symbol table is a section of the executable file that records compile-time information about the program, including variables and functions. In order to implement ftrace, we first need to know what information is recorded in the symbol table.
Using the user program add
in cpu-tests
as an example, use the readelf command to view information about the ELF executable.
riscv64-linux-gnu-readelf -a add-riscv32-nemu.elf
You will see that the readelf command outputs a lot of information that is useful for understanding the structure of ELF, and we recommend that you read it carefully after class. For now, we only need to be concerned with the symbol table information listed below, which is found in the output。
Symbol table '.symtab' contains 28 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 80000000 0 SECTION LOCAL DEFAULT 1
2: 80000108 0 SECTION LOCAL DEFAULT 2
3: 8000010c 0 SECTION LOCAL DEFAULT 3
4: 8000020c 0 SECTION LOCAL DEFAULT 4
5: 00000000 0 SECTION LOCAL DEFAULT 5
6: 00000000 0 FILE LOCAL DEFAULT ABS add.c
7: 00000000 0 FILE LOCAL DEFAULT ABS trm.c
8: 80000108 1 OBJECT LOCAL DEFAULT 2 mainargs
9: 800000e8 32 FUNC GLOBAL DEFAULT 1 _trm_init
10: 80009000 0 NOTYPE GLOBAL DEFAULT 4 _stack_pointer
11: 80000108 0 NOTYPE GLOBAL DEFAULT 1 _etext
12: 80000000 0 NOTYPE GLOBAL DEFAULT ABS _pmem_start
13: 8000022c 0 NOTYPE GLOBAL DEFAULT 4 _bss_start
14: 80000109 0 NOTYPE GLOBAL DEFAULT 2 edata
15: 80009000 0 NOTYPE GLOBAL DEFAULT 4 _heap_start
16: 80001000 0 NOTYPE GLOBAL DEFAULT 4 _stack_top
17: 80009000 0 NOTYPE GLOBAL DEFAULT 4 end
18: 80000010 24 FUNC GLOBAL DEFAULT 1 check
19: 80000108 0 NOTYPE GLOBAL DEFAULT 1 etext
20: 80000000 0 FUNC GLOBAL DEFAULT 1 _start
21: 00000000 0 NOTYPE GLOBAL DEFAULT ABS _entry_offset
22: 80000028 180 FUNC GLOBAL DEFAULT 1 main
23: 80000109 0 NOTYPE GLOBAL DEFAULT 2 _data
24: 8000010c 256 OBJECT GLOBAL DEFAULT 3 ans
25: 80009000 0 NOTYPE GLOBAL DEFAULT 4 _end
26: 800000dc 12 FUNC GLOBAL DEFAULT 1 halt
27: 8000020c 32 OBJECT GLOBAL DEFAULT 4 test_data
Each row represents a table entry, and each column lists some attributes of the table entry. Now we only need to care about the table entries whose Type
attribute is FUNC
. If you look at the Name
attribute, you'll see that these table entries correspond to the functions defined in the program. The corresponding Value
attribute is their starting address (you can compare this with the disassembly result), and the corresponding Size
attribute gives the size of the function.
We have defined the macro
NR_DATA
inam-kernels/tests/cpu-tests/tests/add.c
, the local variablec
and the parametersa
,b
are also defined in theadd()
function. But you won't find them in the symbol table, why is that? Think about it, what is a symbol?
Oh, with the symbol table, we can establish a mapping between function names and their addresses!
But the readelf output is already parsed, in fact the Name
attribute in the symbol table holds the offset of the string in the string table.
To look at the string table, let's look at the Section Headers in the readelf output.
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 80000000 001000 000108 00 AX 0 0 4
[ 2] .sdata2.mainargs PROGBITS 80000108 001108 000001 00 A 0 0 4
[ 3] .data.ans PROGBITS 8000010c 00110c 000100 00 WA 0 0 4
[ 4] .data.test_data PROGBITS 8000020c 00120c 000020 00 WA 0 0 4
[ 5] .comment PROGBITS 00000000 00122c 00001c 01 MS 0 0 1
[ 6] .symtab SYMTAB 00000000 001248 0001c0 10 7 9 4
[ 7] .strtab STRTAB 00000000 001408 00009b 00 0 0 1
[ 8] .shstrtab STRTAB 00000000 0014a3 000055 00 0 0 1
As you can see from the information in Section Headers, the string table is stored at the ELF file location with offset 0x1408
from the beginning.
We can output the hexadecimal form of the ELF file directly with the following command:
hd add-riscv32-nemu.elf
Looking at the output of this command around offset 0x1408
, we can see that the string table is nothing more than a string of identifiers stitched together.
Now we can clarify the relationship between the symbol table and the string table:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 7] .strtab STRTAB 00000000 001408 00009b 00 0 0 1
|
+--------------+
++ V
00001400 |V 00 61 64 64 2e 63 00 74 | ........add.c.t|
00001410 72 6d 2e 63 00|6d 61 69 6e 61 72 67 73 00 5f 74 |rm.c.mainargs._t|
00001420 72 6d 5f 69 6e|69 74 00 ^ |rm_init._stack_p|
| |
| +----------------+
| |
+-----------------------------------------+ |
Symbol table '.symtab' contains 10 entries: | |
Num: Value Size Type Bind Vis Ndx Name | |
7: 00000000 0 FILE LOCAL DEFAULT ABS 7 (trm.c) | |
8: 80000108 1 OBJECT LOCAL DEFAULT 2 13(mainargs) --+ |
9: 800000e8 32 FUNC GLOBAL DEFAULT 1 22(_trm_init) -----+
Write a Hello World program under Linux, compile it and find the string table of the ELF file by the above method. Where do you find the "Hello World!" string in the string table? Why is it there?
Now we can translate a given address into a function name. Since the address ranges of functions are disjoint, we can check if the current function address falls in any interval [Value, Value + Size)
specified in symbol table entry with Type
attribute equal to FUNC
. If it does, find the corresponding string in the string table according to the Name
attribute of the table entry, and return it as the function name. If no symbol table entry is found, the string "???" is returned. But this is probably due to a bug in your implementation, and you need to check your implementation again.
Taking recursion
from cpu-tests
as an example, some sample output from ftrace is as follows (for reference only, different compiler versions may get different output, you can try to understand it using the disassembly results).
0x8000000c: call [_trm_init@0x80000260]
0x80000270: call [main@0x800001d4]
0x800001f8: call [f0@0x80000010]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x80000058: ret [f0] # note(2)
0x800000fc: ret [f2] # note(1)
0x80000180: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x80000058: ret [f0]
0x800000fc: ret [f2]
0x800001b0: ret [f3] # note (3)
0x800000fc: ret [f2]
0x80000180: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x8000016c: call [f2@0x800000a4]
0x800000e8: call [f1@0x8000005c]
0x80000058: ret [f0]
Based on the above, implement ftrace in NEMU. You can decide the format of the output. You need to pay attention to the following:
- You need to pass in an ELF file for NEMU, and you can do this by adding the relevant code to
parse_args()
- You may need to read the symbol and string tables from the ELF file when you initialize ftrace, for your subsequent use.
- For information on how to parse ELF files, see
man 5 elf
.- If you choose riscv32, you also need to consider how to correctly recognize function call instructions and function return instructions from
jal
andjalr
instructions
If you look closely at the sample output of
recursion
above, you'll notice something interesting. Specifically, the functionret
at note (1) is matched to the correspondingcall
, i.e., thecall
callsf2
, and the correspondingret
returns fromf2
. But the set ofcall
s andret
s indicated by note (2) is different, in the sense thatcall
callsf1
, but returns fromf0
. A similar phenomenon occurs with the set ofcall
s andret
s indicated in comment (3), wherecall
callsf1
, but returns fromf3
.Try to analyze why this phenomenon occurs together with the disassembly results.
Write a Hello World program under Linux, and then use the
strip
command to strip the symbol table in the executable:gcc -o hello hello.c strip -s helloIf you look at hello program with
readelf
, you'll see that the symbol table has been stripped, so can the hello program run successfully?There is also a symbol table in the object file, which we can strip as well.
gcc -c hello.c strip -s hello.oLook at the information for hello.o with
readelf
, and you'll see that the symbol table has been stripped. Try linking to hello.o:gcc -o hello hello.oWhat problems did you find? Try to compare the two situations and analyze the reasons behind it.
We asked you to implement the trace tool in NEMU as an infrastructure to help you debug. In fact, in addition to helping you understand how a program works, tracer can also guide developers in optimizing their programs and systems such as:
- Based on ftrace, we can further analyze the arguments when calling
memcpy()
, such as whetherdest
andsrc
are aligned or not, and whether the length of the copied memory is long or short, and then optimize the algorithm implementation ofmemcpy()
according to the frequent combinations.- You can count the number of function calls based on ftrace, and optimize the function with more accesses, which can significantly improve the performance of the program.
- You can filter the branch jump instruction execution based on itrace and use it as an input to the branch predictor (a performance-enhancing component of modern processors) to adjust the branch predictor's implementation, thus improving the processor performance.
- The sequence of memory accesses of the program can be obtained from mtrace and used as input to a cache (another performance-enhancing component of modern processors) model, which guides the optimization of prefetching and substitution algorithms (as you will see in Lab4).
trace is so important for performance optimization, because trace reflects the real behavior of the program, if you blindly optimize a function that will only be called once, you can imagine that such optimization will not improve the overall performance of the program. Trace actually shows us the details of the program execution events (such as function call), if we do statistical analysis of these events, we can tell which events are occurring frequently, and optimize these frequent events to improve the overall system performance. And this is the scientific method of performance optimization.
You can write the klib, and then run the string
program on NEMU to see if it passes the test. On the surface, this doesn't seem like a bad idea, but if it doesn't pass the test, you're going to be thinking when you're debugging it: did you write the klib wrong, or is there a bug in NEMU? If this problem is not solved, debugging becomes more difficult: it is possible to debug in NEMU for a week and find out that the klib implementation is buggy.
The reason why this is a problem is that both the software (klib) and the hardware (NEMU) are written by you, and their correctness is not 100% guaranteed. We all learned about the control variant method in high school: if you can replace one side with an implementation that you think is correct, you can test the correctness of the other side alone! For example, if we test klib on a real machine, if the test doesn't pass, then klib is the problem, because we can trust that the hardware implementation of the real machine will always be correct; on the other hand, if the test passes, then klib isn't the problem, but rather NEMU is buggy.
A new question is, can we really easily port software to other hardware for testing? If you're smart, you'll remember the core idea of AM: decoupling the program from the architecture through a set of abstract APIs. The idea of AM ensures that the code running on top of AM (including klib) is architecture-independent, which increases the portability of the code. Imagine if the code in string.c
had a nemu_trap
instruction that could only be executed in NEMU, then it would not run on a real machine.
The abstract-machine
has a special architecture called native
, which implements the AM API using the GNU/Linux default runtime environment. For example, when we compile a program with gcc hello.c
, it compiles to the runtime environment provided by GNU/Linux. Super Mario, the program you tried in PA1, is also compiled to native
and runs. Compared to $ISA-nemu
, native
has the following advantages:
- Running directly on the real machine, you can trust that the real machine will always behave correctly.
- Even if there are bugs in the software, it is easier to debug them on
native
(e.g. you can use GDB, which is much more convenient than NEMU's monitor).
Therefore, instead of debugging the software directly in $ISA-nemu
, it is better to test NEMU by tuning the software on native
and then switching to $ISA-nemu
to run it. In abstract-machine
, we can easily compile the program target another architecture, e.g., in the am-kernels/tests/cpu-tests/
directory, by running
make ALL=string ARCH=native run
You can compile the string
program to native
and run it. Since we will be compiling the program into a different architecture, you need to pay attention to the ARCH
parameter in the make
command. If the string
program does not pass the test, the terminal will output
make[1]: *** [run] Error 1
Of course, it is also possible to output information such as segment errors.
Read the relevant Makefile, and try to understand how
abstract-machine
generatesnative
executables.
Why is the error code
1
? Do you know how themake
program gets this error code?
Don't get too excited, when compiling to native
glibc is linked by default, we need to switch to klib that we have written for testing. We can do this by defining the macro __NATIVE_USE_KLIB__
in abstract-machine/klib/include/klib.h
. If this macro is not defined, the library functions will be linked to glibc, which can be used as a correct reference implementation for comparison.
How did we link these library functions on
native
to klib after defining the macro__NATIVE_USE_KLIB__
? How exactly does this happen? Try to explain this phenomenon based on what you learned about linking in class.
Ok, now you can test/debug your klib implementation on native
, and use putch()
for character output to help you with debugging, or even GDB into it. Once implemented correctly, compile the program to $ISA-nemu
(remembering to remove the putch()
inserted during debugging), and test it against NEMU.
In order not to damage the portability of your program, you can no longer write your program with architectural assumptions, for example, "the length of a pointer is 4 bytes" will no longer hold, because the length of a pointer on
native
is 8 bytes, and a program written according to this assumption will most likely trigger a segmentation error when run onnative
.Of course, there are ways to solve the problem, and how to do it is, as always, STFW.
The string
program simply calls a function in klib, which itself acts as a client program to test the NEMU implementation, but it does not adequately test the klib implementation. For this reason, you need to write some adequate test cases to specifically test the klib implementation. Test cases consist of test inputs and test outputs, and if we want to construct test cases efficiently, we need to find a way to get test outputs independently of the test objects.
+----> Test Objects ----> Actual Output
| |
Input +----> Same?
| |
+----> Some Method ----> Expected Output
In klib, there are three main categories of functions that you need to implement.
- Read-Write functions for memory and strings, e.g.
memset()
,strcpy()
, etc.. - Read-only functions for memory and strings, e.g.
memcmp()
,strlen()
, etc.. - Formatted output functions, such as
sprintf()
, etc.
For the first class of functions, how should we construct a test session so that there are methods to easily obtain the test output? Note that these functions all write to an area of memory, consider the following array.
#define N 32
uint8_t data[N];
void reset() {
int i;
for (i = 0; i < N; i ++) {
data[i] = i + 1;
}
}
Each element of such an array is 1 byte and has a different value. If we test on this array, any one byte of the actual output that is incorrect can be detected with a high probability. In order to get the expected output, we also have to think about the expected behavior of the test function: The above functions all write to a contiguous segment of the array, so we can examine the expected output in three segments.
- The first segment is the left-hand side of the interval that the function writes to, which is not written to, so there should be
assert(data[i] == i + 1)
- The second segment is the interval itself that the function writes to, and the expected result of this segment is related to the specific behavior of the function.
- The third segment is the right-hand side of the interval that the function writes to, and this segment of the interval is not written to, so there should be
assert(data[i] == i + 1)
So we can write two helper functions for checking this:
// Check that the values in the [l,r) interval are in order of val, val + 1, val + 2...
void check_seq(int l, int r, int val) {
int i;
for (i = l; i < r; i ++) {
assert(data[i] == val + i - l);
}
}
// Check that the values in the [l,r) interval are all equal to val
void check_eq(int l, int r, int val) {
int i;
for (i = l; i < r; i ++) {
assert(data[i] == val);
}
}
With these two functions, we can iterate over various inputs and easily write the expected output of the function we are testing. For example, for memset()
, we can write the following test code.
void test_memset() {
int l, r;
for (l = 0; l < N; l ++) {
for (r = l + 1; r <= N; r ++) {
reset();
uint8_t val = (l + r) / 2;
memset(data + l, val, r - l);
check_seq(0, l, 1);
check_eq(l, r, val);
check_seq(r, N, r + 1);
}
}
}
Try to understand how the above test code performs the tests and add a new test set
klib-tests
for klib in theam-kernels/tests/
directory, the file structure of the test set can be referred toam-kernels/tests/am-tests
oram-kernels/kernels/hello
.Then write the test code for the first type of write function described above. There are a few things to keep in mind when writing the tests.
- The behavior of
memcpy()
is UB when the intervals overlap, you can check if the intervals overlap while traversing, and if so, skip this check; or use another identical array assrc
so that there is no overlap.- String handler functions need extra attention for
\0
and buffer overflowsOnce written, you can test your test code with the glibc libraries on native, then test your klib implementation with the test code on native, and finally run the test code on NEMU to test your NEMU implementation.
Yes, you can. But untested code is always wrong, and you'll be using these libraries to write more complex programs (e.g., OS) later, so if you don't want to get screwed over later, take the time to test them now. Also, if you want to optimize these libraries later (e.g.,
memcpy()
andmemset()
are two functions that will be used a lot later on), you may write a slightly more complex algorithm, and at that point, you'll see how important these tests are.
Try to add tests for
klib-tests
for the second class of read-only functions, e.g.memcmp()
,strlen()
etc. Think about it, how do you get the expected output of a function?
Finally, let's look at the formatted output function. Taking %d
as an example, we need to construct some inputs. But the range of integers is too large to go through them all, so we need to pick some representative integers. The C standard header file limits.h
contains the definitions of the maximum and minimum numbers, which you can read by opening /usr/include/limits.h
. Some representative integers could be:
int data[] = {0, INT_MAX / 17, INT_MAX, INT_MIN, INT_MIN + 1,
UINT_MAX / 17, INT_MAX / 17, UINT_MAX};
In order to get the desired output, we can write a native
program to output them using printf
, and then organize the output into test code. The expected output in cpu-tests
is generated in the same way.
Try adding tests for formatted output functions to
klib-tests
. You can print the actual output to a buffer withsprintf()
, and then compare it to the expected output withstrcmp()
.You could also consider implementing width, precision, length modifiers, etc., and generating test cases to test them.
After understanding the execution process of instructions, adding various instructions is more of an engineering implementation. Engineering implementations will inevitably encounter bugs, how to quickly debug when the implementation is incorrect, in fact, also belongs to the scope of the infrastructure. Think about it, there are so many instructions (x86 instructions itself is a lot), some instructions may have more complex behavior (most of the x86 instructions are very complex), if one of the implementation is wrong, how do we find it?
Intuitively this doesn't seem like an easy thing to do, but let's discuss why. Suppose we accidentally fill in the wrong type for an instruction, and NEMU executes that instruction, decodes it with the wrong type, and either gets the wrong source operand, or writes the right result to the wrong destination operand. As a result, the NEMU executes the instruction in a way that violates its original semantics, which in turn causes other instructions that depend on the instruction to fail to execute correctly. From the perspective of the violation, the result is a UB. Eventually, we see the client program accessing memory out of bounds, getting stuck in a dead loop, or HIT BAD TRAP, or even the NEMU triggering a segmentation error.
We have already discussed debugging methods in PA1, but for bugs in the implementation of instructions, we find that these debugging methods are not very effective: it is difficult to express the correct behavior of instructions through assert()
, and printf()
and GDB don't actually reduce the distance between error and failure.
If there is a way to express the correct behavior of an instruction, we can base our assert()
-like checks on that method. So, what exactly expresses the correct behavior of an instruction? The most direct, of course, is the ISA manual, but we implement instructions in NEMU based on the behavior of the instructions in the ISA manual, and the same set of methods can't be used for both implementation and checking. It would be nice to have a reference implementation of the ISA manual. Hey! Isn't the real machine we are using implemented according to the ISA manual? We let each instruction executed in the NEMU be executed in the real machine, and then compare the state of the NEMU and the real machine, and if the state of the NEMU and the real machine don't match, we catch the error!
This is actually a very effective testing methodology, known in the software testing field as [differential testing] difftest (subsequently referred to as DiffTest). Typically, DiffTest is performed by providing a REF (Reference) that has the same functionality as the DUT (Design Under Test), but with a different implementation, and then letting them accept the same defined inputs to see if they behave the same way.
We have just mentioned "state", what exactly does "state" mean? As we recognized in PA1, a program and a computer can be viewed as a state machine, and the state can be represented as a pair S = <R, M>
, where R
is the value register, and M
is the value of memory. To check whether the implementation of an instruction is correct, just check whether the states of DUT and REF are the same after executing the instruction! DiffTest can catch error in a very timely manner, and whenever you first notice that the state of the NEMU is different from the real machine, it is due to an error in the implementation of the instruction currently being executed. This is actually very close to the error, preventing the error from propagating further, and at the same time, it is much easier to backtrack to find the fault.
What a wonderful feature that illustrated the theory behind the computer! But unfortunately, don't forget that the real machine is running the operating system GNU/Linux, and we can't run AM programs compiled to x86-nemu
in native
, and mips32 and riscv32 programs can't be run directly on the real machine using x86. So, what we need is not only a correct implementation of the ISA manual, but we also need to be able to run AM programs compiled to $ISA-nemu
correctly on it.
As we introduced in PA1, NEMU is a full-system emulator. We can consider other full-system emulators as REFs. They are a simulation of a complete computer system, and the goal of NEMU is to simulate only a subset of it hence programs that can run on NEMU can also run on it. Therefore, in order to test the correctness of NEMU's implementation through the DiffTest method, we let NEMU and another emulator execute the same client program instruction by instruction at the same time. After each instruction, both sides check the status of their registers and memory. If any inconsistency is found, an error is reported and the execution of the client program is stopped.
The idea of DiffTest is very simple: find a correct implementation and compare the results with it. In fact, the expression generator you implemented in PA1 has the same idea of DiffTest in it: the C program is the REF. The
cpu-tests
provided inam-kernels
are also the REF: these tests will be run on thenative
first to obtain the correct results, so you're actually using thenative
as a REF to compare the results of the program execution (HIT GOOD/BAD TRAP).Of course, the granularity of the DiffTest introduced here is much more detailed: we don't just compare the results of the program execution, but the behavior of each instruction. This can help us quickly find and locate bugs in the implementation of instructions. We inherited this idea in the LoongArch competition, using the CPU written in verilog/chisel as the DUT, and the implemented NEMU as the REF, we can quickly find and fix the bugs in verilog/chisel. With the help of DiffTest, we achieved the epic that
Implement a full out-of-order processor correctly and run your own time-sharing multitasking operating system, Nanos. Then run a complex application, Chinese paladin, on it.Another example of the power of DiffTest is a hot topic in July 2020, when five undergraduates from the Chinese Academy of Sciences graduated with their own processor chip designs. Under the guidance of a team of faculty members, the students used DiffTest to perform online comparisons of their processor designs with NEMU, successfully booting Linux and running the Busybox in 5 days, and booting Debian and running complex applications such as GCC and QEMU in 4 days. Students also presented at thecrvs2020(video, slide) and riscv gf(video, slide) to share processor design experience. DiffTest is presented as a key technique.
The above examples repeatedly illustrate the importance of infrastructure: a well-developed infrastructure makes CPU design efficient and simple, and even accomplishes things that no one has been able to do before. With infrastructure, daunting processor hardware designs can be transformed and revitalized: you don't need to look at waveforms that make you dizzy to debug your hardware code. Recently, there has also been a surge in agile hardware design, and the role of infrastructure in this has been obvious. If you're interested, please contact us and let's explore the direction of infrastructure improvement.
To facilitate the implementation of DiffTest, we define the following set of APIs between DUT and REF.
// Copy `n` bytes between `buf` of DUT host memory and `addr` of REF guest memory.
// `direction` specifies the direction of copying, `DIFFTEST_TO_DUT` means copy to DUT, `DIFFTEST_TO_REF` means copy to REF.
void difftest_memcpy(paddr_t addr, void *buf, size_t n, bool direction);
// When `direction` is `DIFFTEST_TO_DUT`, get register state of the REF and write to `dut`.
// When `direction` is `DIFFTEST_TO_REF`, set the register state of REF to `dut`.
void difftest_regcpy(void *dut, bool direction);
// Have REF execute `n` instructions
void difftest_exec(uint64_t n);
// Initializing REF's DiffTest function
void difftest_init();
Where the register state dut
requires the registers to be in a certain order, without the required order, the behavior of difftest_regcpy()
is undefined (which dumps the blame on you guys ^_^).The REF needs to implement these APIs, and the DUT will use these APIs for DiffTest. In this case, DUT and REF are NEMU and other emulators respectively.
The NEMU framework code is ready for DiffTest functionality, turn on the appropriate option in menuconfig.
Testing and Debugging
[*] Enable differential testing
Then just recompile NEMU and run it. The NEMU configuration system selects the appropriate emulator as the REF based on the ISA.
- x86: KVM. KVM can use hardware virtualization technology to create a virtual computer system directly on the hardware, just like the real machine. KVM provides a set of
ioctl()
based APIs for Linux user programs. Which allow us to put the hardware into virtualization mode, run client programs on it, and get the state of the client programs. - mips32: QEMU. QEMU is a complete system-wide emulator, supporting multiple ISAs. However, we can only communicate with QEMU via the socket-based GDB protocol to get its status, which has a high communication overhead. In order to run QEMU, you need to install it.
apt-get install qemu-system
- riscv32: Spike. Spike is a full-system emulator from the RISC-V community that works very similar to NEMU. We have added a small number of interfaces to Spike to implement the DiffTest API, thanks to class of 2018 Chenlu. Since Spike contains a lot of source files, the compilation process may take a few minutes. In order to run Spike, you need to install another tool.
apt-get install device-tree-compiler
We updated Spike's code at 01:15 on April 8, 2023 . If you get NEMU's code before that time, but Spike's code after that time. You may have problems with tools/spike-diff/difftest.cc compilation errors. Please update the relevant files according to this page.
If you get your Spike code before the above time, or your NEMU code after the above time, you don't need to worry about this.
The following information does not affect the compilation result of Spike, you can ignore them.
Makefile:349: warning: overriding recipe for target 'disasm.o' Makefile:349: warning: ignoring old recipe for target 'disasm.o' make[2]: Circular libcustomext.so <- libcustomext.so dependency dropped. make[2]: Circular libsoftfloat.so <- libsoftfloat.so dependency dropped.
The appropriate rules and parameters have been set in nemu/tools/difftest.mk
, which will automatically go to the appropriate subdirectory under nemu/tools/
(kvm-diff, qemu-diff or spike-diff) to compile the dynamic libraries and pass them in as parameters to NEMU's --diff
option.
After enabling DiffTest, init_difftest()
in nemu/src/cpu/difftest/dut.c` performs the following additional initialization:
- Open the incoming dynamic library file
ref_so_file
. - Resolve and relocate the API symbols in the dynamic libraries through dynamic loading, and return their addresses.
- Initialize the DIffTest function of the REF, the specific behavior varies from REF to REF.
- Copy the guest memory of the DUT into the REF.
- Copy the register state of the DUT into the REF.
After the above initialization, the DUT and REF are in the same state. The next step is to do an instruction-by-instruction comparison, which is done by the difftest_step()
function (defined in nemu/src/cpu/difftest/dut.c
). It will be called in the main loop of cpu_exec()
, and after executing an instruction in NEMU, it will let REF execute the same instruction in difftest_step()
, and then read out the registers in REF and compare them. Since the registers are different for different ISAs, the framework code abstracts the register comparison into an ISA-related API, the isa_difftest_checkregs()
function (defined in nemu/src/isa/$ISA/difftest/dut.c
). You need to implement the isa_difftest_checkregs()
function, which compares the general-purpose registers and the PC with the values of the registers read from the DUT. If the comparison results are the same, the function returns true
; if the values are found to be different, the function returns false
and the framework code automatically stops the client program. Especially, if the comparison result of isa_difftest_checkregs()
is inconsistent, the second parameter pc
should point to the instruction that caused the inconsistency, which can be used to print a debug message.
In the introduction to the API conventions above, it was mentioned that register state
r
needs to put the registers in a certain order. You first need to RTFSC, to find this ordering, and check that your NEMU implementation already meets the constraints.Then add the appropriate code to
isa_difftest_checkregs()
to implement the core functionality of DiffTest. Once implemented correctly, you'll have an incredibly powerful testing tool.After realizing the power of DiffTest, think about this. As an infrastructure, how much time can DiffTest save you in debugging?
Huh? Don't we need to compare the state of the memory? In fact, it's not easy to get the memory locations modified by instructions in the REF, and comparing the whole memory would be a huge overhead, so we don't compare the memory state. In fact, the simplified implementation of NEMU can also cause the state of some registers to be inconsistent with the results of the REF, for example, x86 EFLAGS, NEMU only implements a small number of flags in EFLAGS, and also simplifies the updating of EFLAGS by some instructions. In addition, some special system registers are not fully implemented. Therefore, our implementation of DiffTest is not a complete comparison of the state of the REF and NEMU, however, both memory and special registers will be used again in the near future as long as they have been modified by an instruction of the client program, and the difference can be detected. Therefore, we sacrifice some comparison accuracy for performance improvement, but even so, since DiffTest needs to communicate with the REF and let the REF execute instructions, it will still slow down NEMU's operation. Therefore, it is not recommended to turn on DiffTest to run NEMU unless you are debugging.
The simplification of NEMU leads to a difference in the behavior of some commands from the REF, which makes it impossible to compare them. In order to solve this problem, the framework has prepared two functions, difftest_skip_ref()
and difftest_skip_dut()
:
-
There are commands that cannot be executed directly by REF, or the behavior of the executed commands must be different from NEMU.For example, the
nemu_trap
instruction, in REF, will throw a debugging exception. This can be calibrated withdifftest_skip_ref()
, after executing it,difftest_step()
will cause the REF to skip the execution of the current instruction. At the same time, the current register state of NEMU will be synchronized to REF directly. The effect is equivalent to "the execution result of this instruction is based on the state of NEMU". -
Due to implementation specificity, QEMU may on rare occasions package several instructions to be executed together. When we call
difftest_step()
once, QEMU will execute several instructions. However, since NEMU'sfetch_decode_exec_updatepc()
executes one instruction at a time, this introduces a bias. This can be calibrated withdifftest_skip_dut(int nr_ref, int nr_dut)
. Immediately after executing it, the REF is told to single-stepnr_ref
times. The NEMU is then expected to catch up with the REF withinnr_dut
instructions, skipping the checking of all the instructions in the process.
Instructions in PA2 that require calibration:
- x86: N/A
- riscv32: N/A
- mips32: Various jump commands
- This is because
mips32-NEMU
does not implement a branch delay slot, which can be calibrated withdifftest_skip_dut(2, 1)
In some older versions of mips32-QEMU, instruction packing is done only if the last 12 bits of the PC value of the above instruction are
0xffc
. This packing condition seems very strange, do you know what could be the reason?
RISC-V, as a RISC architecture, does not normally support unaligned access, and executing an unaligned access instruction in Spike will throw an exception, causing the PC to jump to
0
. The NEMU does not do this for simplicity, so if you execute such an instruction in both the NEMU and Spike, DiffTest will throw an error. However, this is probably a problem with your software implementation (e.g. klib), so please check and correct the code.
DiffTest will connect to QEMU through a fixed port, running two copies of NEMU with DiffTest open at the same time will result in the following message.
Failed to find an available port: Address already in use
If you are sure that you are not running two copies of NEMU at the same time, but still encounter the above message, you can kill the QEMU left in the background by executing the following command
pkill -9 qemu
In the process of implementing the instructions, you need to run the test cases one by one. But once the instructions are implemented correctly, does that mean you can say goodbye to the test cases? Obviously not. You will need to add new functionality to NEMU in the future, and in order to make sure that the new functionality doesn't affect the existing functionality, you will need to re-run the test cases. In order to ensure that the new functionality does not affect the implementation of the existing functionality, you need to re-run the test cases. In software testing, this process is called regression test.
Since these test cases will be re-run in the future, it would be inefficient to re-run each test manually. To improve efficiency, we provide cpu-tests
with a one-click regression test command:
make ARCH=$ISA-nemu run
With the above command you can automatically run all the tests in cpu-tests
in a batch and report the results for each test case.
As you already know, a NEMU is a program that executes other programs. In computational theory, there is a special term for such a program, called a Universal Program, which means, in common parlance, that it can do what any other program can do. The existence of universal programs has been proved, and we won't go into it here, but the idea of universal programs is behind the fact that we can write NEMUs, do experiments with Docker/Virtual Machines, and even do all sorts of things on our computers: NEMUs and simulators are just instantiations of universal programs. It is not an exaggeration to say that a computer is a physical instance of a universal program. The existence of universal programs laid the theoretical foundation for the emergence of computers, and is an extremely important conclusion in computational theory. If the existence of a universal program is not proved, we cannot use computers with confidence, and we cannot say "the machine is always right". ``
The NEMU we write will eventually be compiled into x86 machine code, using x86 instructions to simulate the execution of client instructions. In fact, in 1983, [Prof. Martin Davis] martin davis published a book entitled "Computability, complexity, and languages: fundamentals of theoretical computer science", L programming language, was proposed in the book, and it was shown that L is equivalent to all other programming languages. The three instructions in L are.
V = V + 1 V = V - 1 IF V != 0 GOTO LABELIn terms of x86 instructions, these are
inc
,dec
andjne
.Even more surprisingly, Prof. Martin Davis proved that it is possible to write a general-purpose program similar to NEMU in L, without taking physical limitations into account (think of unlimited memory capacity, where each memory cell can hold an arbitrarily large number of numbers)! And the framework of this general purpose program written in L is similar to the
cpu_exec()
function in NEMU: fetch, decode, execute... This is not a coincidence, but an application of simulation simulation in computer science.Long before Prof. Martin Davis proposed the L-language, scientists have been exploring what problems are computable. Going back to the 1830s, in an attempt to answer this question, different scientists proposed and investigated different models of computation, including the recursive function by Gödel, Herbrand, and Kleen, the λ-lambda by Church, and Turing machine by Turing, which were later found to be equivalent in terms of computational power; and by the 1840s computers were being built. It was even shown that if an infinite number of abacuses were spliced together, they would be equivalent to a Turing machine! We can deduce from this that general-purpose programs have different manifestations in different models of computation. NEMU as a general-purpose program was of great significance in the 1830's. If you could have designed NEMU 90 years ago, maybe the Turing Award would have been named after you. Limits of Computation is a series of popular science articles about the development of computable theory that you can read to get a sense of human civilization (some math skills will be required). If you are interested in computational theory, you can take Mr. Fangmin Song's Introduction to Computational Theory course.
Returning our thoughts to PA, the nature of a general purpose program tells us that the potential of NEMU is unlimited. What do you think is missing from NEMU to create a colorful world?
In addition to being an emulator, NEMU has a simple debugging function, which allows you to set breakpoints and check the status of your program. If you add the following feature to NEMU
When the guest program is trapped in a infinite loop, let the guest program pause, and output some corresponding log.
How do you think it should be done? If you are confused, search for information on the Internet.
This concludes phase 2 of PA2.