Type inference
The theoretical model of Nemerle is described in my MSc thesis. This page explains it from a more practical standpoint, though reading at least the first chapters of the thesis wouldn't hurt.
Most languages, even plain good ol' C, support a form of bottom-up type inference. It happens when you have an expression, let's say a function call f(): if f's return type is int, then we know the type of f() is int. A more involved example would be Singleton of type T -> Set[T] (that is, Singleton is a function taking an element of type T and returning a Set[T]). Then Singleton(f()) clearly has type Set[int]. This example is more interesting because the type of the function call depends on the types of its arguments.
From this kind of reasoning there is only a single little step to allowing:
def s = Singleton (f ());
instead of requiring the type to be spelled out, C#-style:
Set<int> s = Singleton (f ());
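For concreteness, the whole example could be written out roughly like this (a hypothetical sketch; Set, Singleton and f are made-up illustrations here, not real library members):

class Set[T]
{
  private mutable items : list[T] = [];
  public Add (elem : T) : void { items = elem :: items }
}

module Example
{
  public f () : int { 42 }                   // f's return type is int

  public Singleton[T] (elem : T) : Set[T]    // Singleton : T -> Set[T]
  {
    def s = Set ();
    s.Add (elem);
    s
  }

  public Main () : void
  {
    def s = Singleton (f ());                // s is inferred to have type Set[int]
    System.Console.WriteLine (s);
  }
}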
You can even do this in GCC compiling C++ (or C, but as C lacks templates it doesn't make much sense):
#define DEF(v,e) __typeof(e) v = e
int main (void) { DEF(x, 21 + 21); return x; }
The situation becomes more involved when there are type parameters that are used in the return type but not in the types of the function parameters. For example, consider a MakeSet function returning Set[T] and taking no arguments.
def s = MakeSet ();
And what type does s have? It is Set[T] for some T. We do not yet know what kind of set it will be. But after adding the next line, which uses s:
def s = MakeSet ();
s.Add (42);
T is int, because we have added an element of type int to it. In fact we know this because Set[T].Add has type T -> void.
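To go with it, MakeSet could be declared along these lines (again a hypothetical sketch, reusing the made-up Set[T] class from above):

module SetOps
{
  public MakeSet[T] () : Set[T]
  {
    Set ()   // a fresh, empty set; at a call site T starts out as a free type variable
  }
}

With such a declaration in scope, def s = MakeSet () gives s the type Set[T] for a fresh T, and it is the subsequent call to Add that pins T down to int.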
Such yet-unknown types are called free type variables. They are free, that is, not yet bound; and they are type variables, so real types can be put in place of them later.
The Nemerle type inference engine very often produces such type variables just to kill them a moment later by replacing them with a real type.
Note that type variables are not exactly the type parameters used in a method definition. The difference is that we need a fresh copy of all the type parameters upon each call, and it is this copy that is called a free type variable. Otherwise we couldn't say:
def s1 = MakeSet ();
def s2 = MakeSet ();
s1.Add (42);
s2.Add ("42");
Here s1 : Set[T1] and s2 : Set[T2] for two distinct fresh type variables T1 and T2, so everything is OK.
Let's have a look at the following classes:
class BaseClass {}
class Derived1 : BaseClass {}
class Derived2 : BaseClass {}
Now we create a set and add an element:
def s = MakeSet ();
s.Add (Derived1 ()); // add a new Derived1 instance
After this we should know, as in the int example above, that s has type Set[Derived1]. But such a conclusion would be premature, because we may also want to write:
s.Add (Derived2 ());
This is all OK as long as we consider s to be of type Set[BaseClass]. Clearly, if the programmer adds a Derived1 and later a Derived2, the intention was to have a set of BaseClass.
Therefore we cannot eagerly replace T with Derived1 upon first usage. Instead, we just place a constraint that T has to be a supertype of Derived1. Next we place an additional constraint that T is a supertype of Derived2. Now, because all supertypes of both Derived1 and Derived2 are also supertypes of BaseClass, we can simplify these two constraints to a single one stating that T is a supertype of BaseClass.
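Spelled out with the constraints as comments (a recap of the steps above; MakeSet is still the hypothetical helper sketched earlier):

def s = MakeSet ();      // s : Set[T] for a fresh type variable T
s.Add (Derived1 ());     // constraint: T must be a supertype of Derived1
s.Add (Derived2 ());     // constraint: T must be a supertype of Derived2
                         // both constraints collapse into: T is a supertype of BaseClass,
                         // so s ends up typed as Set[BaseClass]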
There are other places where free type variables can be introduced. For example, the null literal does it (in fact it additionally adds an anti-value-type constraint), but probably the most interesting of such places are function parameters.
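A small sketch of the null case (just an illustration; the exact constraint machinery is described in the thesis):

mutable s = null;   // s : T for a fresh type variable T, which must not be a value type
s = "42";           // the assignment lets the compiler fix T at string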
Nemerle supports type inference for the parameters of local functions (that is, functions defined and visible only inside some other function). We plan to extend this to private class members; there are no conceptual problems here, only the implementation will be hard.
For example:
def foo (x) {
x + 3
}
We first assign a free type variable as the type of x, and later infer, probably a bit too eagerly, that x has type int, because it is added to 3.
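A sketch of what that eagerness means at the call sites of the foo defined above (hypothetical usage):

def a = foo (4);       // fine: x was inferred as int, and a == 7
// def b = foo (4.0);  // would be rejected, because x has already been fixed at int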
There are some more problems with free type variables. For example:
def foo (x) {
x + x
}
After seeing such a definition we do not know what type x should have. Would it be an int? A double? A string? This is because the addition operator is overloaded: it has more than one signature.
So we cannot type such a definition alone. But hey, who defines a function and never uses it at all [1]? If we have this definition and a usage like foo ("42"), we know that x is string, therefore the + overload to be chosen is string * string -> string, and therefore the return type of foo is also string.
Now, because we can assign only one type to x in the generated code, if after foo ("42") the user writes foo (42.0), then we report an error. This is, however, not a problem in most circumstances.
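The same scenario as code (hypothetical calls to the foo defined above, with the outcome of each call as a comment):

def a = foo ("42");    // fixes x : string, so this + is string concatenation and a == "4242"
// def b = foo (42.0); // reported as an error: foo's parameter is already fixed at string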
Similar reasoning can be applied to:
def get_length (x) {
x.Length
}
We cannot know what type x.Length has, or even whether it is valid at all, until we know the type of x. But again we can tell by looking at the function's usage: get_length (array [1,2]).
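Put together (a sketch using the get_length defined above; Length is the standard .NET array property):

def n = get_length (array [1, 2]);   // x : array[int], so x.Length is valid and n : int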
Of course it is easy to say "we look at" and "we can deduce", but the compiler requires a somewhat stricter algorithm. The basic idea is that the compiler proceeds with type inference in top-down, left-to-right order, until it sees something it cannot handle. Examples of such things include:
- a member reference on an expression with a yet unknown type
- an overloaded call that still has more than one maximally specific solution
- a call to an operator whose operands have unknown types (because the resolution rules for operators depend on the types of the operands)
- an indexer reference on an expression with an unknown type
- a situation where a macro decides that it doesn't yet have enough information (this is macro-specific; for example the foreach macro needs to know the type of the collection it is iterating over)
Once the entire typing process for a method is over, we look into our FIFO queue. We ask each action in it to resolve itself. When it succeeds, it is removed from the queue; otherwise it is added back. Once we have processed each of the elements initially in the queue, we check whether the queue is empty. If it is, we have succeeded.
Otherwise, we check whether we resolved something, and if so, we start the queue iteration again (because resolution of one expression can make resolution of some other expression possible).
If we didn't resolve anything, we ask the first element in the queue to explain to the user why it failed. This also happens to be the first failed expression in source-program order, so the error reported is the top-most one.
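In code, the loop could look roughly like this (a schematic sketch of the process described above, not the actual compiler sources; IDelayedAction and its members are made-up names):

using System.Collections.Generic;

// Made-up stand-in for the compiler's delayed typing actions.
interface IDelayedAction
{
  Resolve () : bool;          // try to resolve this action; true on success
  ExplainFailure () : void;   // tell the user why the action could not be resolved
}

module DelayedTyping
{
  // Returns true when every delayed action was eventually resolved.
  public Run (queue : Queue[IDelayedAction]) : bool
  {
    mutable progress = true;
    while (progress && queue.Count > 0)
    {
      progress = false;
      // go through the elements currently in the queue exactly once
      mutable remaining = queue.Count;
      while (remaining > 0)
      {
        def action = queue.Dequeue ();
        if (action.Resolve ())
          progress = true;          // resolved: it stays out of the queue
        else
          queue.Enqueue (action);   // not yet resolvable: put it back at the end
        remaining = remaining - 1;
      }
      // if anything was resolved, iterate again: that may unblock other actions
    }
    // an empty queue means success; otherwise the head of the queue is the first
    // failed expression in source order, so let it report the error
    when (queue.Count > 0)
      queue.Peek ().ExplainFailure ();
    queue.Count == 0
  }
}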