README


                     |  |   --     |   |  /
                     |  |              |
                     |  |     --   |   |
                     |  |          |   |  |
                      --    ----   |   |  |

This document attempts to fully explain the implementation and behavior
of the Ubik runtime. When the code and this document disagree, the code
is incorrect; on the other hand, don't trust anything in this document.


Structure of the project -----------------------------------------------

The project itself is divided into a few software components. By far the
most interesting is libubik, which is the shared library that contains
the entirety of the Ubik runtime. There's runubik, which is a simple
thing that uses libubik to load and run Ubik bytecode. There's ubikc
and ubiki, which are a compiler and interpreter, respectively; both are
thin wrappers around libubik, which contains the actual compiler
implementation. And then there's pyasm, which is a hilarious "assembler"
written as a Python DSL; pyasm will be gone the moment there's a working
Ubik compiler, but for now, with my focus on the runtime, pyasm is here
to stay.

There are unit tests and "integration tests"; unit tests are haphazard
but the integration tests (in the form of pyasm scripts in
test/pyasm/*.upy) are pretty comprehensive. You can run all of the
tests by running `make check`.


        make check

When you're done with all that, the compiler will be present at
bin/ubic, the interpreter will be at bin/ubi and the runtime will
be at bin/ubik.

One last thing: before contributing, please ensure you have installed
the pre-commit hook in the `res` directory before you commit anything.
It does some simple checks to make sure that you're not doing anything
dumb.

For completeness, building Ubik requires:

    - GCC 4.9 or later (it's possible that the build works with other
      versions of GCC and with clang, but they're untested)
    - Bison 3.0 or later
    - Flex 2.6 or later

Ubik has no runtime dependencies outside of libc. The rest of this
document is more interesting, I promise.


Runtime representations ------------------------------------------------

Ubik is based on a simple primitive: unbalanced binary trees of 64-bit
values (called "words"). Everything is expressed in this way, from
integers to types to functions; this homogeneity of data representation
allows us to simplify otherwise-complicated tasks and has a satisfying
purity. Note that the in-memory representation is extremely close to the
on-disk representation of compiled Ubik programs.

Each node in the tree has a left value, a right value, and a tag. Each
value can be one of two things: a pointer to another node or a word; the
tag tells you how to interpret the left and right values.


Tags

Tags are 16-bit integers which contain types and subtypes for a given
binary blob. The available tag types are:

                           0x1000: object is a tree
                           0x2000: object is a graph
                           0x3000: object is a URI

Values that are trees have the following lower bytes:

                0b00000001 (0x01): right child is a node
                0b00000010 (0x02): right child is a word
                0b00000100 (0x04): right child is a graph
                0b00010000 (0x10): left child is a node
                0b00100000 (0x20): left child is a word
                0b01000000 (0x40): left child is a graph

Thus, a node whose left value is a node and whose right value is a word
would have the lower byte `0b00001001`, and its tag would be `0x1009`.

Graphs have the following lower bytes:

                0b00000001 (0x01): graph is native
                0b00000010 (0x02): graph is unresolved
                0b00000100 (0x04): graph is a modinit graph

TODO: what do those mean.

Types

Types can be base types or derived types. Base types are special in that
they are identified only by an integer constant; derived types (quite
predictably) require quite a bit more information about the type itself.

The base types of Ubik are a few flavors of integer, lists and tuples.
Each of these has a base type code and comes with a tree encoding. The
full list of base type codes is:

                         Code   Description
                     ....word   unsigned 64-bit integer
                     ...sword   signed 64-bit integer
                     ....bool   boolean
                     ..packed   packed list
                     ..lambda   function
                     ...float   64-bit floating-point number

Each type code is actually a word that contains the numeric equivalent
of a 8-byte string, with the first letter in the byte with the lowest
address. Note that periods in the above table stand in for space
characters (`0x20`).

The tree encoding of integer types is very simple; on its left is a word
containing the type code of the relevant integer type and on its right
is the representation of the value. The value representation is a word
in which the represented value is stored in the lowest bits.

The tree encoding of list types puts a list type descriptor on its left
and the value of the list on its right. The list type descriptor has a
constant word `list` on its left and a node representing the type of the
elements of the list on its right. Each item on the right of the list
should have a value on its left and either a node representing a list or
a word set to zero on its right; finding the zero-word is a sentinel
that the end of the list has been reached.

The tree encoding of tuple types is similar to lists; there's a tuple
type descriptor on the left and a tuple value on the right. The values
are cons cells just like lists. However, the type descriptor is more
complicated, as tuple elements are not homogenous. Instead, the type
descriptor is a node whose left is the constant word `tuple` and whose
right is a zero-word-terminated list of type encodings.

The tree encoding of packed list types is provided for better space
usage than a straight list would provide. The left type descriptor for a
packed list is a node whose left is the constant word `packed` and whose
right is the type of the elements of the list; this type is restricted
to be a built-in integral type or `char`. The right value of a packed
list is a node whose left is a word containing the number of items in
the list, and whose right is a list of packed values encoded as cons
cells, where each cell has a left of 8 bytes of packed data, and a right
that either points to the next node or a zero-word. If the data ends off
of an 8-byte boundary, the remaining bytes in the left word are zeroed;
it is up to the user of the tree value to respect the length given in
the first word of the value.

Strings are stored as UTF-8 and are represented as packed lists of
bytes; the represented string is the concatenation of all left values in
the list. Note that this means that each individual left value may in
fact be poorly-formed UTF-8, as a multibyte code point may be split
across words.

A type descriptor describes a type on its own; its left is the constant
word `type` and its right is the descriptor itself. Anything that would
appear on the left of a typed value is a type descriptor value, and
could appear on the right of a type node.

A resource identifier is how all entities within the ubik runtime are
identified. Resources all have a name, author, version and scope; all of
these are encoded in the URI tree. The left of a URI tree is the
constant word `uri`, and the right is a cons-cell list, where the first
element in the list is the name of the resource as a packed string, the
second element in the list is the author (also as a packed string), the
third item is a 64-bit integer representing the version number and the
final element is a 64-bit integer represending the scope of the
resource. The meanings of these fields will be covered in the Bindings
section below.


Building logic ---------------------------------------------------------

Logic in Ubik is encoded in a directed acyclic graph of computations
(DAGC); nodes in this graph then reference values stored as trees. There
are four kinds of nodes in a DAGC:

        `apply`     Points to a node with a function value and a node with
                    an argument. Applies the argument to the function
                    value. If the function is then fully applied the value
                    of the node is the result of the function; if the
                    function is partially applied the value of the node is
                    a function that has an arity of one fewer than the
                    input function.
        `const`     A constant type and value; the value can be either a
                    tree or a graph.
        `load`      A load operation. Has a URI that references where the
                    value should come from.
        `store`     A store operation. Has a URI and a reference to a node
                    with a value, and stores the given value at that URI.
        `input`     The input to a DAGC.
        `cond`      Points to a condition, a true value node and a false
                    value node. If the condition evaluates as true, the
                    value of the node is the value of the true value
                    node; if it evaluates as false, it as the value of
                    the false value node.
        `native`    This one is sneaky; the native node is a node that
                    is filled in by native code in "native DAGCs".
                    Native DAGCs are backed by native code and
                    evaluating them just calls the associated native
                    code block. That code block then fills in the result
                    of the compuation over the graph on the native node.
                    Reading native nodes in from ubik bytecode is not
                    supported; this is only a construct used internally
                    in the runtime.
        `ref`       This node has the value of another node in the
                    graph, and is used primarily as a simplification in
                    the code generation phase before it is optimized
                    away.

Each individual DAGC represents a function; each DAGC contains an input
node for each of the function's arguments, and a single result node that
represents the function's result. However, the graph may have multiple
nodes whose value must be evaluated for the graph to have been called
fully evaluated; these nodes are called terminal nodes, and the result
node is one of them. The minimal set of nodes required to evaluate all
terminal nodes is found, and all nodes in that set are evaluated in
turn. This gives the result for the DAGC. Once evaluated, the DAGC is
semantically and computationally equivalent to a single value; this is
possible because all functions in Ubik are idempotent, and thus a
zero-arity function is equivalent to a value.

Logic is then built up by creating a number of graphs and referencing
them from each other. When a node with a graph value is applied, a copy
of the referenced DAGC is made and that DAGC's `input` nodes are
completed.  Once all required input nodes have been completed, the DAGC
can be evaluated.

An ubik bytecode blob contains some number of graphs; the various
graphs can reference each other, and in general their ordering in the
bytecode does not matter. There is only one exception: when evaluating a
bytecode blob, the first graph is considered to represent the desired
computation; the other graphs are only evaluated if it is necessary to
do so to computate the value of the first graph. In this sense, the
first graph is the "main method" of other languages.

Encoding in-flight and at-rest -----------------------------------------

Ubik bytecode has a single binary storage format that is used for
network operations and for on-disk storage. The format begins with a
header, and then contains a number of encoded graphs.

All numerically-encoded data is stored in big-endian ("network") byte
order, which means the most significant byte comes first.

The header has the following fields:

        Byte index   Field
               0-3   The constant bytes 0x65 0x78 0x70 0x6C ("expl")
               4-7   The version number of the encoding format, as a
                     big-endian 4-byte unsigned integer.
              8-15   The number of graphs in the file.
             16-23   The number of values in the file.

The version of the encoding specified in this document is 1, and thus
bytes 4-7 should be 0x00000001. The decoding proceeds by concatenating a
number of value encodings followed by a number of graph encodings. Each
graph or value that is encoded is identified by its index in the file.

Immediately after the bytecode header comes a series of tree-encoded
values. The tree encoding proceeds as follows: Begin with the root of
the tree to be encoded. The node's 8-bit tag is written to the stream,
followed by the encoding of its left child, then its right child.  If
a child is a value, the value is written to the stream in
little-endian format as an 8-byte integer. If a child is another
value, recurse and apply this same operation. If a child is a
reference to a graph, the word-encoded index of the graph in the blob
is written. The tag of the value is used to disambiguate the types of
the children of the node.

Tree decoding proceeds in much the same way: Take a byte off of the head
of the stream. If the byte indicates the left is a node, recurse. If it
indicates the left is a value, read the value as a little-endian 8-byte
integer. Repeat the same operation for the right node.

The graphs are specified after the values have all been read.

The graph block has a header of its own, that is given once for each
graph:

        Byte index   Field
               0-3   The tag of the graph
              8-15   The number of nodes in the graph
             16-23   The index of the result node in the graph
             24-31   The value index of the identity URI of the graph
             32-39   The value index of the type of the graph

The identity URI represents the meaning of the graph; in most cases this
will be the URI of whatever the graph is bound to. The result node
should be set to all-ones if the graph has no identity.

Following the graph header, the specified number of node encodings
follows. The order of encoding matters; each node is identified by other
nodes by its index.

A node is encoded by a type-independent header followed by a
type-dependent body. The type-independent node header has the following
format:

        Byte index   Field
               0-7   Node type (word-encoded)
              8-15   The ID of the node (for error reporting)
                16   Set to 0x01 if the node is a terminal node, meaning
                     that its value is intended as an output of
                     computation. Set to 0x00 otherwise.
             17-23   Unused

Once the header has been loaded, the type-dependent decoding of the body
is given.

Nodes with node type `   apply` have the following body:

        Byte index   Field
               0-7   Node index of function node
              8-15   Node index of argument node

Nodes with node type `    load` have the following body:

        Byte index   Field
               0-7   Value index of the URI value

Nodes with node type `   store` have the following body:

        Byte index   Field
               0-7   Node index of value node
              8-15   Value index of the URI value

Nodes with node type `   const` have the following subheader:

        Byte index   Field
               0-7   The subtype of the associated value (either
                     `cvalue` or `cgraph`) which informs whether this
                     constant points at a DAGC or a tree.
              8-15   Value index of the tree-encoded type of the constant

Then, depending on the associated subtype, this header is either
followed by a word-encoded index of a value (if the subtype is `cvalue`)
or contains a word-encoded index of another graph in the encoding (if
the subtype is `cgraph`).

Nodes with node type `   input` have the following body:

        Byte index   Field
               0-7   Argument index of the node

Nodes with node type `    cond` have the following body:

        Byte index   Field
               0-7   Node index of the condition
              8-15   Node index of the true value
             16-23   Node index of the false value

Nodes with node type `     ref` have the following body:

        Byte index   Field
               0-7   Node index of referrent node

Ubik bytecode blobs may have arbitrary data following the data encoded
by the standard; conforming parsers will ignore the data that follows
the encoded graphs. This is provided as a means for tools to attach
metadata onto the end of the bytecode if so desired.

Compilation ------------------------------------------------------------

The compiler for Ubik is distributed as part of the Ubik library; a
compiler executable can be built using:

        $ make
        $ ub/ubic path/to/file.uk out.ub

The process of creating an executable is divided into two phases, not
unlike compilation for C and other native languages: compilation and
linking. However, unlike C, the user has less control over the build
process; object files are always created for each compilation unit,
"headers" (called "interface files") are automatically generated from
the source, and all linking is static. The compiler abstracts most of
the details away from the user, but they still get the advantage of
fast, incremental builds.

Compilation begins by loading the given Ubik file and generating code
for the module (see below). In general, at the end of this process there
are still unresolved references; we collect all of the imports that
provide those references, and find the files that provide the requisite
units among the compiler's search path. If the file is found, we check
to see if there already exists an interface file for it in the build
cache directory with the correct content hash; if there is, definitions
are loaded from the interface file, and if not, we compile the module
from scratch.

Code generation --------------------------------------------------------

Code generation begins by lexing and parsing the provided source code,
using the lexer and parser given in token.l and grammar.y. At the end of
parsing, an in-memory representation of the AST is available. That AST
is walked and DAGC nodes are attached to each AST element. Graphs are
then built from these elements and a DAGC structure is emitted.

In addition, each compiled file contains a "modinit", which is a DAGC
to be executed upon the loading of the file. This modinit contains
everything required to add the contents of the module to an environment,
as well as any code the user has specified should be run upon module
load using the immediate operator. Special care must be taken when
calling the modinit to ensure that it is not called with an ephemeral
child environment, which will cause the modue's contents to be
discarded. The convention held by the compiler is that the modinit is
always the first graph returned by the compilation phase.

When assigning nodes to AST elements, the first applicable rule applies:

1. Constant AST atoms (strings, integers, floats) are assigned a CONST
   node with the given value.
     a. At the moment, no common subexpression elimination is in place
        to allow constant nodes with the same value to point at the same
        value in-memory.

2. AST atoms that are names, where the name is in the argument list that
   is in the nearest enclosing function in scope, are assigned REF nodes
   whose referrent is the input node corresponding to that name.

3. AST atoms that are names, where the name is defined in a scope that
   is within the current function, are assigned LOAD nodes whose URI has
   the given name, a NULL source and the LOCAL scope.

4. AST atoms that are names, where the name is in the scope of an
   enclosing function that is not the nearest function in scope, raise
   the nearest function to a closure. The closure transformation adds an
   argument to the head of the argument list for the closure and adds a
   REF node from the AST element that raises the function to a closure
   to the new INPUT node created corresponding to the argument.  The
   function is then immediately applied to the argument from the
   enclosing scope. This process continues recursively through the stack
   until all modified functions have been raised if necessary.

   This transforms this:
        \x -> (\y -> + x y)
   To this:
        \x -> ((\x0 y -> + x0 y) x)

   It transforms this:
        \x -> (\y -> (\z -> x))
   Into this:
        \x -> ((\x0 y -> ((\x1 z -> x1) x0)) x)

   It transforms this:
        \x -> {
            : y = + x 10
            ! \z -> y
        }
   Into this:
        \x -> {
            : y = + x 10
            ! (\y0 z -> y0) y
        }

   This closes these functions over the enclosing scope by making all
   enclosing scope information an explicit argument to the function, and
   then partially applying the function over that information.

5. AST atoms that are qualified names are checked against the imports
   that correspond to the source of the qualifier. If a corresponding
   element in the module is found, the AST atom is assigned a LOAD node
   whose URI has the given name, whose source is the name of the
   package, and the PACKAGE source.

   If no corresponding import is found, or there is no item within the
   imported module with the given name, compilation fails.

6. AST atoms that are names are assigned LOAD nodes whose URI has the
   given name and the UNKNOWN scope.

7. AST expressions that represent application are assigned an APPLY node
   with the func and args of the appropriate AST substructures.

8. Bindings in the AST are assigned a STORE node with the location given
   by the name of the binding and the LOCAL scope, and the value given
   by the appropriate AST substructure.

9. Lambda statements in the AST are assigned a CONST node that points at
   the graph represented by the lambda's body (the "lambda subgraph").
   The lambda subgraph contains one input node for each item in the
   lambda statement's argument list.

After nodes have been assigned to all AST elements, all URIs present
within the combined AST are resolved. At the end of this process, there
do not exist any URIs in the graph structures with an UNKNOWN scope.
When resolving URIs, the first applicable rule applies:

1. URIs whose name occurs as a binding in the module that is currently
   being compiled are assigned the LOCAL scope.
2. URIs whose name occurs as a binding in the native library are
   assigned NATIVE scope.