-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
496 lines (392 loc) · 22.4 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
| | -- | | /
| | |
| | -- | |
| | | | |
-- ---- | | |
This document attempts to fully explain the implementation and behavior
of the Ubik runtime. When the code and this document disagree, the code
is incorrect; on the other hand, don't trust anything in this document.
Structure of the project -----------------------------------------------
The project itself is divided into a few software components. By far the
most interesting is libubik, which is the shared library that contains
the entirety of the Ubik runtime. There's runubik, which is a simple
thing that uses libubik to load and run Ubik bytecode. There's ubikc
and ubiki, which are a compiler and interpreter, respectively; both are
thin wrappers around libubik, which contains the actual compiler
implementation. And then there's pyasm, which is a hilarious "assembler"
written as a Python DSL; pyasm will be gone the moment there's a working
Ubik compiler, but for now, with my focus on the runtime, pyasm is here
to stay.
There are unit tests and "integration tests"; unit tests are haphazard
but the integration tests (in the form of pyasm scripts in
test/pyasm/*.upy) are pretty comprehensive. You can run all of the
tests by running `make check`.
make check
When you're done with all that, the compiler will be present at
bin/ubic, the interpreter will be at bin/ubi and the runtime will
be at bin/ubik.
One last thing: before contributing, please ensure you have installed
the pre-commit hook in the `res` directory before you commit anything.
It does some simple checks to make sure that you're not doing anything
dumb.
For completeness, building Ubik requires:
- GCC 4.9 or later (it's possible that the build works with other
versions of GCC and with clang, but they're untested)
- Bison 3.0 or later
- Flex 2.6 or later
Ubik has no runtime dependencies outside of libc. The rest of this
document is more interesting, I promise.
Runtime representations ------------------------------------------------
Ubik is based on a simple primitive: unbalanced binary trees of 64-bit
values (called "words"). Everything is expressed in this way, from
integers to types to functions; this homogeneity of data representation
allows us to simplify otherwise-complicated tasks and has a satisfying
purity. Note that the in-memory representation is extremely close to the
on-disk representation of compiled Ubik programs.
Each node in the tree has a left value, a right value, and a tag. Each
value can be one of two things: a pointer to another node or a word; the
tag tells you how to interpret the left and right values.
Tags
Tags are 16-bit integers which contain types and subtypes for a given
binary blob. The available tag types are:
0x1000: object is a tree
0x2000: object is a graph
0x3000: object is a URI
Values that are trees have the following lower bytes:
0b00000001 (0x01): right child is a node
0b00000010 (0x02): right child is a word
0b00000100 (0x04): right child is a graph
0b00010000 (0x10): left child is a node
0b00100000 (0x20): left child is a word
0b01000000 (0x40): left child is a graph
Thus, a node whose left value is a node and whose right value is a word
would have the lower byte `0b00001001`, and its tag would be `0x1009`.
Graphs have the following lower bytes:
0b00000001 (0x01): graph is native
0b00000010 (0x02): graph is unresolved
0b00000100 (0x04): graph is a modinit graph
TODO: what do those mean.
Types
Types can be base types or derived types. Base types are special in that
they are identified only by an integer constant; derived types (quite
predictably) require quite a bit more information about the type itself.
The base types of Ubik are a few flavors of integer, lists and tuples.
Each of these has a base type code and comes with a tree encoding. The
full list of base type codes is:
Code Description
....word unsigned 64-bit integer
...sword signed 64-bit integer
....bool boolean
..packed packed list
..lambda function
...float 64-bit floating-point number
Each type code is actually a word that contains the numeric equivalent
of a 8-byte string, with the first letter in the byte with the lowest
address. Note that periods in the above table stand in for space
characters (`0x20`).
The tree encoding of integer types is very simple; on its left is a word
containing the type code of the relevant integer type and on its right
is the representation of the value. The value representation is a word
in which the represented value is stored in the lowest bits.
The tree encoding of list types puts a list type descriptor on its left
and the value of the list on its right. The list type descriptor has a
constant word `list` on its left and a node representing the type of the
elements of the list on its right. Each item on the right of the list
should have a value on its left and either a node representing a list or
a word set to zero on its right; finding the zero-word is a sentinel
that the end of the list has been reached.
The tree encoding of tuple types is similar to lists; there's a tuple
type descriptor on the left and a tuple value on the right. The values
are cons cells just like lists. However, the type descriptor is more
complicated, as tuple elements are not homogenous. Instead, the type
descriptor is a node whose left is the constant word `tuple` and whose
right is a zero-word-terminated list of type encodings.
The tree encoding of packed list types is provided for better space
usage than a straight list would provide. The left type descriptor for a
packed list is a node whose left is the constant word `packed` and whose
right is the type of the elements of the list; this type is restricted
to be a built-in integral type or `char`. The right value of a packed
list is a node whose left is a word containing the number of items in
the list, and whose right is a list of packed values encoded as cons
cells, where each cell has a left of 8 bytes of packed data, and a right
that either points to the next node or a zero-word. If the data ends off
of an 8-byte boundary, the remaining bytes in the left word are zeroed;
it is up to the user of the tree value to respect the length given in
the first word of the value.
Strings are stored as UTF-8 and are represented as packed lists of
bytes; the represented string is the concatenation of all left values in
the list. Note that this means that each individual left value may in
fact be poorly-formed UTF-8, as a multibyte code point may be split
across words.
A type descriptor describes a type on its own; its left is the constant
word `type` and its right is the descriptor itself. Anything that would
appear on the left of a typed value is a type descriptor value, and
could appear on the right of a type node.
A resource identifier is how all entities within the ubik runtime are
identified. Resources all have a name, author, version and scope; all of
these are encoded in the URI tree. The left of a URI tree is the
constant word `uri`, and the right is a cons-cell list, where the first
element in the list is the name of the resource as a packed string, the
second element in the list is the author (also as a packed string), the
third item is a 64-bit integer representing the version number and the
final element is a 64-bit integer represending the scope of the
resource. The meanings of these fields will be covered in the Bindings
section below.
Building logic ---------------------------------------------------------
Logic in Ubik is encoded in a directed acyclic graph of computations
(DAGC); nodes in this graph then reference values stored as trees. There
are four kinds of nodes in a DAGC:
`apply` Points to a node with a function value and a node with
an argument. Applies the argument to the function
value. If the function is then fully applied the value
of the node is the result of the function; if the
function is partially applied the value of the node is
a function that has an arity of one fewer than the
input function.
`const` A constant type and value; the value can be either a
tree or a graph.
`load` A load operation. Has a URI that references where the
value should come from.
`store` A store operation. Has a URI and a reference to a node
with a value, and stores the given value at that URI.
`input` The input to a DAGC.
`cond` Points to a condition, a true value node and a false
value node. If the condition evaluates as true, the
value of the node is the value of the true value
node; if it evaluates as false, it as the value of
the false value node.
`native` This one is sneaky; the native node is a node that
is filled in by native code in "native DAGCs".
Native DAGCs are backed by native code and
evaluating them just calls the associated native
code block. That code block then fills in the result
of the compuation over the graph on the native node.
Reading native nodes in from ubik bytecode is not
supported; this is only a construct used internally
in the runtime.
`ref` This node has the value of another node in the
graph, and is used primarily as a simplification in
the code generation phase before it is optimized
away.
Each individual DAGC represents a function; each DAGC contains an input
node for each of the function's arguments, and a single result node that
represents the function's result. However, the graph may have multiple
nodes whose value must be evaluated for the graph to have been called
fully evaluated; these nodes are called terminal nodes, and the result
node is one of them. The minimal set of nodes required to evaluate all
terminal nodes is found, and all nodes in that set are evaluated in
turn. This gives the result for the DAGC. Once evaluated, the DAGC is
semantically and computationally equivalent to a single value; this is
possible because all functions in Ubik are idempotent, and thus a
zero-arity function is equivalent to a value.
Logic is then built up by creating a number of graphs and referencing
them from each other. When a node with a graph value is applied, a copy
of the referenced DAGC is made and that DAGC's `input` nodes are
completed. Once all required input nodes have been completed, the DAGC
can be evaluated.
An ubik bytecode blob contains some number of graphs; the various
graphs can reference each other, and in general their ordering in the
bytecode does not matter. There is only one exception: when evaluating a
bytecode blob, the first graph is considered to represent the desired
computation; the other graphs are only evaluated if it is necessary to
do so to computate the value of the first graph. In this sense, the
first graph is the "main method" of other languages.
Encoding in-flight and at-rest -----------------------------------------
Ubik bytecode has a single binary storage format that is used for
network operations and for on-disk storage. The format begins with a
header, and then contains a number of encoded graphs.
All numerically-encoded data is stored in big-endian ("network") byte
order, which means the most significant byte comes first.
The header has the following fields:
Byte index Field
0-3 The constant bytes 0x65 0x78 0x70 0x6C ("expl")
4-7 The version number of the encoding format, as a
big-endian 4-byte unsigned integer.
8-15 The number of graphs in the file.
16-23 The number of values in the file.
The version of the encoding specified in this document is 1, and thus
bytes 4-7 should be 0x00000001. The decoding proceeds by concatenating a
number of value encodings followed by a number of graph encodings. Each
graph or value that is encoded is identified by its index in the file.
Immediately after the bytecode header comes a series of tree-encoded
values. The tree encoding proceeds as follows: Begin with the root of
the tree to be encoded. The node's 8-bit tag is written to the stream,
followed by the encoding of its left child, then its right child. If
a child is a value, the value is written to the stream in
little-endian format as an 8-byte integer. If a child is another
value, recurse and apply this same operation. If a child is a
reference to a graph, the word-encoded index of the graph in the blob
is written. The tag of the value is used to disambiguate the types of
the children of the node.
Tree decoding proceeds in much the same way: Take a byte off of the head
of the stream. If the byte indicates the left is a node, recurse. If it
indicates the left is a value, read the value as a little-endian 8-byte
integer. Repeat the same operation for the right node.
The graphs are specified after the values have all been read.
The graph block has a header of its own, that is given once for each
graph:
Byte index Field
0-3 The tag of the graph
8-15 The number of nodes in the graph
16-23 The index of the result node in the graph
24-31 The value index of the identity URI of the graph
32-39 The value index of the type of the graph
The identity URI represents the meaning of the graph; in most cases this
will be the URI of whatever the graph is bound to. The result node
should be set to all-ones if the graph has no identity.
Following the graph header, the specified number of node encodings
follows. The order of encoding matters; each node is identified by other
nodes by its index.
A node is encoded by a type-independent header followed by a
type-dependent body. The type-independent node header has the following
format:
Byte index Field
0-7 Node type (word-encoded)
8-15 The ID of the node (for error reporting)
16 Set to 0x01 if the node is a terminal node, meaning
that its value is intended as an output of
computation. Set to 0x00 otherwise.
17-23 Unused
Once the header has been loaded, the type-dependent decoding of the body
is given.
Nodes with node type ` apply` have the following body:
Byte index Field
0-7 Node index of function node
8-15 Node index of argument node
Nodes with node type ` load` have the following body:
Byte index Field
0-7 Value index of the URI value
Nodes with node type ` store` have the following body:
Byte index Field
0-7 Node index of value node
8-15 Value index of the URI value
Nodes with node type ` const` have the following subheader:
Byte index Field
0-7 The subtype of the associated value (either
`cvalue` or `cgraph`) which informs whether this
constant points at a DAGC or a tree.
8-15 Value index of the tree-encoded type of the constant
Then, depending on the associated subtype, this header is either
followed by a word-encoded index of a value (if the subtype is `cvalue`)
or contains a word-encoded index of another graph in the encoding (if
the subtype is `cgraph`).
Nodes with node type ` input` have the following body:
Byte index Field
0-7 Argument index of the node
Nodes with node type ` cond` have the following body:
Byte index Field
0-7 Node index of the condition
8-15 Node index of the true value
16-23 Node index of the false value
Nodes with node type ` ref` have the following body:
Byte index Field
0-7 Node index of referrent node
Ubik bytecode blobs may have arbitrary data following the data encoded
by the standard; conforming parsers will ignore the data that follows
the encoded graphs. This is provided as a means for tools to attach
metadata onto the end of the bytecode if so desired.
Compilation ------------------------------------------------------------
The compiler for Ubik is distributed as part of the Ubik library; a
compiler executable can be built using:
$ make
$ ub/ubic path/to/file.uk out.ub
The process of creating an executable is divided into two phases, not
unlike compilation for C and other native languages: compilation and
linking. However, unlike C, the user has less control over the build
process; object files are always created for each compilation unit,
"headers" (called "interface files") are automatically generated from
the source, and all linking is static. The compiler abstracts most of
the details away from the user, but they still get the advantage of
fast, incremental builds.
Compilation begins by loading the given Ubik file and generating code
for the module (see below). In general, at the end of this process there
are still unresolved references; we collect all of the imports that
provide those references, and find the files that provide the requisite
units among the compiler's search path. If the file is found, we check
to see if there already exists an interface file for it in the build
cache directory with the correct content hash; if there is, definitions
are loaded from the interface file, and if not, we compile the module
from scratch.
Code generation --------------------------------------------------------
Code generation begins by lexing and parsing the provided source code,
using the lexer and parser given in token.l and grammar.y. At the end of
parsing, an in-memory representation of the AST is available. That AST
is walked and DAGC nodes are attached to each AST element. Graphs are
then built from these elements and a DAGC structure is emitted.
In addition, each compiled file contains a "modinit", which is a DAGC
to be executed upon the loading of the file. This modinit contains
everything required to add the contents of the module to an environment,
as well as any code the user has specified should be run upon module
load using the immediate operator. Special care must be taken when
calling the modinit to ensure that it is not called with an ephemeral
child environment, which will cause the modue's contents to be
discarded. The convention held by the compiler is that the modinit is
always the first graph returned by the compilation phase.
When assigning nodes to AST elements, the first applicable rule applies:
1. Constant AST atoms (strings, integers, floats) are assigned a CONST
node with the given value.
a. At the moment, no common subexpression elimination is in place
to allow constant nodes with the same value to point at the same
value in-memory.
2. AST atoms that are names, where the name is in the argument list that
is in the nearest enclosing function in scope, are assigned REF nodes
whose referrent is the input node corresponding to that name.
3. AST atoms that are names, where the name is defined in a scope that
is within the current function, are assigned LOAD nodes whose URI has
the given name, a NULL source and the LOCAL scope.
4. AST atoms that are names, where the name is in the scope of an
enclosing function that is not the nearest function in scope, raise
the nearest function to a closure. The closure transformation adds an
argument to the head of the argument list for the closure and adds a
REF node from the AST element that raises the function to a closure
to the new INPUT node created corresponding to the argument. The
function is then immediately applied to the argument from the
enclosing scope. This process continues recursively through the stack
until all modified functions have been raised if necessary.
This transforms this:
\x -> (\y -> + x y)
To this:
\x -> ((\x0 y -> + x0 y) x)
It transforms this:
\x -> (\y -> (\z -> x))
Into this:
\x -> ((\x0 y -> ((\x1 z -> x1) x0)) x)
It transforms this:
\x -> {
: y = + x 10
! \z -> y
}
Into this:
\x -> {
: y = + x 10
! (\y0 z -> y0) y
}
This closes these functions over the enclosing scope by making all
enclosing scope information an explicit argument to the function, and
then partially applying the function over that information.
5. AST atoms that are qualified names are checked against the imports
that correspond to the source of the qualifier. If a corresponding
element in the module is found, the AST atom is assigned a LOAD node
whose URI has the given name, whose source is the name of the
package, and the PACKAGE source.
If no corresponding import is found, or there is no item within the
imported module with the given name, compilation fails.
6. AST atoms that are names are assigned LOAD nodes whose URI has the
given name and the UNKNOWN scope.
7. AST expressions that represent application are assigned an APPLY node
with the func and args of the appropriate AST substructures.
8. Bindings in the AST are assigned a STORE node with the location given
by the name of the binding and the LOCAL scope, and the value given
by the appropriate AST substructure.
9. Lambda statements in the AST are assigned a CONST node that points at
the graph represented by the lambda's body (the "lambda subgraph").
The lambda subgraph contains one input node for each item in the
lambda statement's argument list.
After nodes have been assigned to all AST elements, all URIs present
within the combined AST are resolved. At the end of this process, there
do not exist any URIs in the graph structures with an UNKNOWN scope.
When resolving URIs, the first applicable rule applies:
1. URIs whose name occurs as a binding in the module that is currently
being compiled are assigned the LOCAL scope.
2. URIs whose name occurs as a binding in the native library are
assigned NATIVE scope.