manual/fwc_other_implementation_details.tex at master · ForwardCom/manual · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
% chapter included in forwardcom.tex
\documentclass[forwardcom.tex]{subfiles}
\begin{document}
\RaggedRight


\chapter{Other implementation details}
\section{Endianness} \label{endianness}
The memory organization is little endian. Instructions for byte swapping are provided for reading and writing big endian binary data files.

\subsubsection{Rationale}
The storage of vectors in memory would depend on the element size if the organization was big endian. Assume, for example, that we have a 128 bit vector register containing four 32-bit integers, named A, B, C, D. With little endian organization, they are stored in memory in the order:
\vv

A0, A1, A2, A3, B0, B1, B2, B3, C0, C1, C2, C3, D0, D1, D2, D3,
\vv

where A0 is the least significant byte of A and D3 is the most significant byte of D. With big endian organization we would have:
\vv

A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0, D3, D2, D1, D0.
\vv

This order would change if the same vector register is organized, for example, as eight integers of 16 bits each or two integers of 64 bits each. In other words, we would need different read and write instructions for different vector organizations.
\vv

Little endian organization is more common for a number of reasons that have been discussed many times elsewhere.
\vv


\section{Pointers and addresses} \label{PointersAndAddresses}
ForwardCom is using a flat memory model with 64-bit addresses.
Pointers stored in registers contain absolute addresses with 64 bits. A lower number of bits is allowed for systems with a smaller address space.
\vv

Function pointers and code pointers must be divisible by 4 because the code consists of 4-byte words.
\vv

Data pointers and function pointers stored in data memory may contain absolute or relative addresses.
Absolute pointers need to be initialized before they are used. There are three different ways to accomplish this:
\begin{enumerate}
  \item Initialize the data to an absolute address. This is possible, but it is not recommended because the data section containing an absolute address will be position-dependent. A particular platform may or may not support position-dependent code.

  \item Make a start-up code that initializes the pointer, using a relative address for calculation.

  \item Store a relative pointer and calculate the full address in the running
        program. This is explained below

\end{enumerate}
\vspace{4mm}


Relative pointers can be more compact than absolute pointers.
A relative pointer is storing the address of a function, code label, or data label relative to an arbitrary reference point. The reference point must be placed in a section with the same address regime as the target that we want to address -- either IP-based or DATAP-based or THREADP-based.
\vv

A relative pointer may be scaled by a power of 2. For example, it is useful to scale a relative code pointer by 4 because all code addresses are divisible by 4. This allows us to use fewer bits for the relative pointer. The relative pointer may be a signed integer of 8, 16, or 32 bits. The number of bits must be sufficient to contain the distance between the target and the reference point, divided by the scale factor, in a signed integer. The calculation of the scaled relative address is done by the linker.
\vv

A relative pointer must be converted to an absolute pointer before it can be used for addressing a target. The sign\_extend\_add instruction is useful for converting a relative pointer to an absolute pointer. This instruction will read the relative pointer, sign-extend it to 64 bits, optionally scale it by a factor of 2, 4, or 8, and then add the address of the reference point. This method can be used for data pointers as well as code pointers.
See page \pageref{relativeDataPointer} for details.
\vv

Relative code pointers can also be used directly in the jump\_relative or call\_relative instruction.
This instruction will read one entry from an indexed list of relative addresses, sign extend the relative address, scale it by 4, add an arbitrary reference point, and then jump or call to the calculated address. This is a very compact and efficient way of implementing a multiway jump or call, such as a switch/case statement or function table.
See page \pageref{exampleSwitchCase} for an example. The jump\_relative  instruction can also be used without an index if you have just one relative code pointer. (page \pageref{exampleRelFuncPointer})
\vspace{4mm}

Direct jump and call instructions and conditional jump instructions use relative addresses of 8, 16, 24, or 32 bits stored as part of the instruction. The least significant two bits of relative code addresses are not stored because they are always zero. The target address of a relative jump or call is calculated in the following way. The relative address is multiplied by 4 and sign-extended to 64 bits. The address of the end of the instruction (beginning of next instruction) is added to this value to get the address of the target. \vv

Indirect jump and call instructions can use a pointer containing an absolute address.
The pointer can be a register or a memory operand initialized as described above.
\vv

\section{Implementation of call stack} \label{callStackImplementation}
The ForwardCom design defines two separate stacks, one for local data and one for function return addresses. There are several advantages to having separate stacks for data and return addresses.
\vv

A separate return stack improves the security by preventing buffer overflow errors from overwriting return addresses. This makes a common hacking method impossible and prevents programming errors from leading the program astray.
\vv

The separate return stack will usually be small enough to fit on the CPU chip. This makes function call and return efficient. Only in case of deeply recursive function calls will it be necessary to spill the stack to RAM memory.
\vv

There is no need for a branch predictor to predict return addresses because the return address is immediately available in the CPU.
\vv

A joint stack for data and return addresses has the disadvantage that the stack pointer is modified by the ALU at the end of the pipeline, while it is needed at the beginning of the pipeline for pushing and popping return addresses. Today, advanced CPUs have a complicated "stack engine" to mirror or predict the stack pointer at the beginning of the pipeline to avoid the pipeline-length delay between changing the stack pointer and using it in call and return instructions. A separate on-chip call stack eliminates the need for such a stack engine. Call and return instructions can be handled at an early stage in the pipeline without having to wait for any other instructions that might change the stack pointer at the end of the pipeline.
\vv

A function is always sure to return to the correct place even if the data stack is compromised by a mismatch between caller and callee regarding the number or type of function parameters.
\vv

There is no need to save and restore a link register in non-leaf functions, such as several RISC architectures require.
\vv

A disadvantage of a separate on-chip call stack is that it has to be saved and restored during a task switch, unless the chip has multiple call stacks. Special instructions are needed to read and write the call stack for the sake of task switching and stack unwinding.
\vv

The function calling conventions defined in chapter \ref{chap:functionCallingConventions} will
work with a separate call stack as well as with a unified stack, but it is recommended to have a separate call stack. The programmer is free to make one or more data stacks.
\vv


\section{Performance monitoring and error tracking}
ForwardCom supports a set of performance counter registers. These registers count the number of instructions of different kinds that are executed, for example double-size instructions, vector instructions, or branch instructions. The read\_perf instruction can read and reset these registers. This instruction is described on page \pageref{table:readPerfInstruction}. Specific hardware implementations may have different performance counter registers. These will be described in the manual for the specific hardware implementation or soft core.
\vv

Tracking of floating point errors is described on page \pageref{ADifferentWayOfTrackingErrors}. Tracking of integer overflow is described on page \pageref{integerOverflowDetection}.
\vv

A special set of performance counters may be used for tracking non-recoverable errors such as unknown instructions, wrong parameters for instructions, and memory access violations. These errors will normally generate a trap or stop execution, but the error traps can be disabled using the capabilities registers described on page \pageref{table:capabilitiesRegisters}. This makes it possible to execute a piece of code tentatively and afterwards check if any errors have occurred. The number of errors of each type is counted, and the position and type of the first error is recorded.
\vv

Specific hardware implementations may have additional features for debugging and error tracking. These will be described in the manual for the specific hardware implementation or soft core.
\vv


\section{Multithreading}
The ForwardCom design makes it possible to implement very large vector registers to process large data sets. However, there are practical limits to how much you can speed up the performance by using larger vectors. First, the actual data structures and algorithms often limit the vector length that can be used. And second, large vectors mean longer physical distances on the semiconductor chip and longer transport delays.
\vv

Additional parallelism can be obtained by running multiple threads in each their CPU core. The design should allow multiple CPU chips or multiple CPU cores on the same physical chip.
\vv

Communication and synchronization between threads can be a performance problem. The system should have efficient means for these purposes, perhaps including speculative synchronization.
\vv

It is probably not worthwhile to allow multiple threads to share the same CPU core and level-1 cache simultaneously (this is what Intel calls hyper-threading) because this could allow a low priority thread to steal resources from a high priority thread, and it is difficult for the operating system to determine which threads might be competing for the same execution resources if they are run in the same CPU core. Simultaneous multithreading may also involve security problems by making the design vulnerable to side-channel attacks.
\vv

\section{Security features} \label{securityFeatures}
Security is included in the fundamental design of both hardware and software. This includes the following features.

\begin{itemize}
\item A flexible and efficient memory protection mechanism.

\item Separation of call stack and data stack so that return addresses cannot be compromised by buffer overflow.

\item Each thread has its own protected memory space, except where compatibility with legacy
software requires a shared memory space for all threads in an application.

\item Device drivers and system functions have limited memory access.
Buffer overrun errors can be effectively prevented by controlling which memory area a device driver has access to, since all program input goes through device drivers. A calling program can limit the memory access of a device driver to only a specific input buffer.

\item Device drivers and system functions have carefully controlled access rights to input/output ports and other system resources. These access rights are defined in the executable file header of the device driver and controlled by the system core.

\item A fault in a device driver should not generate a ``blue screen of death'', but generate an error message and close the application that called it and free its resources.

\item Application programs have only access to specific resources as specified in the executable file header and controlled by the system.

\item Array bounds checking is simple and efficient, using an addressing mode with built-in bounds checking or a simple branch.

\item There are various optional methods for checking integer overflow.

\item There is no ``undefined'' behavior. There is always a limited set of permissible responses to an error condition.
\end{itemize}

\subsection{How to improve the security of applications and systems}
Several methods for improving security are listed below. These methods may be useful in ForwardCom applications and operating systems where security is important.

\subsubsection{Protect against buffer overflow}
Input buffers must be protected against overflow. This can be done efficiently by limiting the memory access of the device driver that handles the input. Furthermore, it is possible to allocate an isolated block of memory for a data buffer. See page \pageref{isolatedMemoryBlocks}.

\subsubsection{Protect arrays} Array bounds should be checked.

\subsubsection{Protect against integer overflow} Use one of the methods for detecting integer overflow mentioned on page \pageref{integerOverflowDetection}.

\subsubsection{Protect thread memory} Each thread in an application should have its own protected memory space. See page \pageref{threadMemoryProtection}.

\subsubsection{Protect code pointers}
Function pointers and other pointers to code are vulnerable to control flow hijack attacks. These include:

\begin{description}
\item[Return addresses.] Return addresses on the stack are particularly vulnerable to buffer overflow attacks. ForwardCom has separate call stack and data stack so that return addresses are isolated from other data.

\item[Jump tables.] Switch/case multiway branches are often implemented as tables of jump addresses. These should use the jump table instruction with the table placed in a read-only section. See page \pageref{table:indirectJumpInstruction}.

\item[Virtual function tables.] Programming languages with object polymorphism, such as C++, use tables of pointers to virtual functions. These should use the call table instruction with the table placed in the a read-only section. See page \pageref{table:IndirectCallInstruction}.

\item[Procedure linkage tables.] Procedure linkage tables, import tables and symbol interposition are not used in ForwardCom. See page \pageref{libraryLinkMethods}.

\item[Callback function pointers.] If a function receives a pointer to a callback function as parameter, then keep this pointer in a register rather than saving it to memory.

\item[State machines.] If a state machine or similar algorithm is implemented with function pointers then place these function pointers in a read-only array, use a state variable as index into this array and check the index for overflow. The compiler should have support for defining an array of relative function pointers in a read-only section and access them with the call table instruction.

\item[Other function pointers.] Most uses of function pointers can be covered by the methods described above. Other uses of function pointers should be avoided in high security applications, or the pointers should be placed in protected memory areas or with unpredictable addresses.
% DEAD LINK: (See \href{http://dslab.epfl.ch/proj/cpi/}{Code-Pointer Integrity link}).

\end{description}

\subsubsection{Control access rights of application programs}
The executable file header of an application program should include information about which kinds of operations the application needs permission to. This may include permission to various network activities, access to particular sensitive files, permission to write executable files and scripts, permission to install drivers, permission to spawn other processes, permission to inter-process communication, etc. The user should have a simple way of checking if these access rights are acceptable. We may implement a system for controlling the access rights of scripts as well. Web page scripts should run in a sandbox.

\subsubsection{Control access rights of device drivers}
Many operating systems are giving very extensive rights to device drivers. Rather than having a bureaucratic centralized system for approval of device drivers, we should have a more careful control of the access rights of each device driver. The system call instruction in ForwardCom gives a device driver access to only a limited area of application memory (see page \pageref{systemCallInstruction}). The executable file header of a device driver should have information about which ports and system registers the device driver has access to. The user should have a simple way of checking if these access rights are acceptable.

\subsubsection{Standardized installation procedure}
The operating system should provide a standardized way of installing and uninstalling applications. The system should refuse to run any program, script or driver that has not been installed through this procedure. This will make it possible for the user to review the access requirements of all installed programs and to remove any malware or other unwanted software through the normal uninstallation procedure.
\vv

Malware protection should be an integral part of the operating system, not a third-party add on with possible compatibility problems and performance problems.

\end{document}