port concepts from mem #22

Open · orbisvicis opened this issue Apr 25, 2016 · 10 comments

orbisvicis commented Apr 25, 2016

Mem has a better memoization framework, and I think it might be worth porting some of its concepts over. This is a long-term project, so it's more of a note than a real issue; right now I don't have time, and I doubt anyone else is interested. Overview:

Consider the following (mostly equivalent, from a memoizing standpoint) fbuild functions:

def obj(ctx, target:fbuild.db.DST, source:fbuild.db.SRC):
def obj(ctx, source:fbuild.db.SRC) -> fbuild.db.DST:
  1. Fbuild blurs inputs and outputs. The only requirements to enable determinism are: input path, input contents, and output path. However, fbuild uses: input path, input contents, output path, and whether the output path exists. Not only does this confuse the concept of pure, deterministic functions, it has a major drawback (below).

  2. Fbuild doesn't handle target modification. For example, assume obj copies source->target, in this case test.in->test.out. Consider:

    Initial execution:

    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    $ fbuild
    Copying test.in to test.out...
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    1
    2
    

    That was the initial memoization, so not much to see. Let's try the only condition supported by fbuild's fbuild.db.DST - removing test.out.

    $ rm test.out
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    $ fbuild
    Copying test.in to test.out...
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    1
    2
    

    While the end result is acceptable, unfortunately fbuild had to rerun the obj function. Now let's try modifying test.out. In real-world scenarios, this could easily be an unintended side effect of a build command.

    $ echo "44" >test.out
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    44
    $ fbuild
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    44
    

    Well, that's not good at all.

    In fact, whether the target was removed or modified, the memoized function should never be run again. Instead, the target should be restored from the cache if and only if it was modified or removed. Let's compare fbuild to mem:

    • target unmodified: fbuild does not rerun the memoized function. [1/1]
    • target removed: fbuild detects this, but reruns the memoized function. [1/2]
    • target modified: fbuild doesn't detect this. [0/1]

    Like fbuild, mem memoizes function outputs. Now obviously no function should be expected to return a byte-for-byte copy of a file, suitable for pickling. Instead, mem introduces an extra processing step if the output object defines the functions hash, store, and restore. If the output hasn't been memoized, mem will call store(). If it has, mem will first call hash(). If the hash remains unchanged from the cached version, mem does nothing. Otherwise, it calls restore() - see the sketch after the File class below. For example, this is mem's file class:

    class File:
       def __init__(self, path):
           # notice the file's contents won't be serialized
           self.path = path
    
       def __hash__(self):
           """ checksum of self.path """
    
       def store(self):
           """ store a copy of the file in the build cache """

       def restore(self):
           """ restore the file from the build cache """
  3. Fbuild depends on python annotations to memoize file contents. While helpful, it is also obfuscating and confusing. Why not depend on the standard object-oriented paradigm, like mem does? Not only is this more expected, it is also less verbose and simpler:

    obj_b(obj_a("file.a", "file.b"))
    @mem.memoize
    def obj_a(source_path_string, target_path_string):
       # unfortunately, the inputs are plain python strings, without
       # store()/restore(), and with a __hash__() that doesn't depend on
       # contents - so explicitly add a dependency on each path's contents
       mem.add_dep(source_path_string)
       mem.add_dep(target_path_string)
       # process the input, determine the outputs
       output = ...
       return mem.nodes.File(output)
    
    @mem.memoize
    def obj_b(source_path_node):
      # the inputs are already node objects, no need to use mem.add_dep()
      pass

    Now for convenience and backwards compatibility, I do like parameter annotations.

    @mem.memoize
    def obj_a_alternative(source_path_string:fbuild.file.to_node, target_path_string:fbuild.file.to_node):
       pass

    Also, why not add an annotation to prevent certain parameters from being memoized? Mem acknowledges this as a shortcoming of its design, but also notes that it has never needed such functionality:

    @mem.memoize
    def obj_c(source_path_string:fbuild.file.to_node, dont_memoize:fbuild.db.ignore):
       pass
  4. Fbuild ties the build environment (compiler flags) to a complicated data structure (list(tuple(set, dict))) and a complicated class hierarchy. While this simplifies most build targets, the complexity makes edge-cases more difficult to implement. On the other hand, mem provides a much "flatter" hierarchy.

    1. Mem doesn't differentiate between extraneous and required environment (or environment and command-line options). The merged dictionary of both shell environment and specific flags (overrides) can be passed to any build target function decorated with mem.util.with_env:

      @mem.util.with_env(CFLAGS=[])         # only pass-in CFLAGS from the environment
      @mem.memoize
      def obj(target, source, CFLAGS):
          pass
      
      obj(target, source, env={k: v for d in (os.environ, {"CFLAGS": "-O3"}) for k, v in d.items()})

      The decorator ensures that only the required flags are memoized.
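
      A rough sketch of how such a decorator could be implemented (illustrative only, not mem's actual code):

      import functools

      def with_env(**defaults):
          # pull only the named keys out of the caller-supplied env dict,
          # so that only those flags reach (and are memoized with) the
          # decorated function
          def decorator(fn):
              @functools.wraps(fn)
              def wrapper(*args, env=None, **kwargs):
                  env = env or {}
                  for key, default in defaults.items():
                      kwargs.setdefault(key, env.get(key, default))
                  return fn(*args, **kwargs)
              return wrapper
          return decorator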

    2. Mem provides a single compile operation, and a single link operation. You just need to make sure you pass the correct flags to each operation, depending on your needs:

      • build, program: []
      • build, static: []
      • build, shared: ["-fPIC"]
      • link, program: []
      • link, static: []
      • link, shared: ["-shared"] (at the very minimum)

      Compare to fbuild's over-engineered guess_static and guess_shared with either build_lib or build_exe. Yes, the guess_ function has a secondary use of finding the correct compiler, but the process of deciding static/shared then lib/exe makes the class hierarchy more complicated than it should be. An independent class maintaining a database of compiler flags would be more appropriate.
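
      For instance, such a flag database could be as simple as this sketch (names invented here):

      # map (phase, kind) to the extra flags listed above
      FLAGS = {
          ("build", "program"): [],
          ("build", "static"):  [],
          ("build", "shared"):  ["-fPIC"],
          ("link",  "program"): [],
          ("link",  "static"):  [],
          ("link",  "shared"):  ["-shared"],
      }

      def flags_for(phase, kind, base=()):
          return list(base) + FLAGS[(phase, kind)]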

  5. Support for building a single object from multiple sources (link-time optimization):

    All mem build targets support multiple sources. If the output target is unspecified, instead of compiling an object for each input source, the input sources will be agglomerated (link-time optimization) and a single optimized output target will be produced. Admittedly, because mem is unmaintained, this depends on the outdated '-combine' flag.
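
    A sketch of the dispatch (illustrative; since '-combine' was removed from GCC long ago, a modern port would need a different mechanism):

    import subprocess

    def build(sources, target=None):
        if target is not None:
            # a target was named: compile an object for each input source
            for src in sources:
                subprocess.check_call(["gcc", "-c", src])
        else:
            # target unspecified: agglomerate the sources into a single
            # optimized object (mem relied on the outdated '-combine' flag)
            subprocess.check_call(["gcc", "-combine", "-c", *sources,
                                   "-o", "combined.o"])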

  6. Just a tiny nitpick, but I find the term "cache" confusing, as the standard and pythonic term is "memoize".

Overall mem feels more pythonic. I've only mentioned its advantages, but in terms of features - as an unmaintained project - mem lags far behind fbuild.

  • path objects

    Fbuild overrides __truediv__ for convenience.

  • logging

    Fbuild provides loggers, though I'm not too impressed. Setting up tee-styled redirections might be better.

  • commands

    Using the provided execute function is required to log command output. Once again, I'm not impressed. I'd rather use the subprocess module directly, and have implicit logging at the program level from tee-style redirections.

  • command line options

    Fbuild uses argparse, but requires the definition of the pre_options function, which is magically loaded. It would be more transparent to explicitly pass the ArgumentParser object to fbuild.

  • installing files

    I'm not sure if install is memoized - haven't checked.

  • configuration testing

    Fbuild really shines here - and I mean really.

  • command-line targets

    This is well implemented - the decorator is sufficient, so interacting directly with argparse isn't necessary.

  • python 3

  • supports many more builders

  • cross platform

    While mem is technically also cross platform by virtue of python, per platform code must still be written to handle differences of compiler and environment.

I don't see the point of requiring a context object passed around. If the namespace was becoming too polluted, why not put all configuration into a global container object (sub-module)?
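
For instance, a minimal sketch of that alternative (a hypothetical sub-module, not fbuild's actual API):

# fbuild/config.py - a global container sub-module
db = None
logger = None
options = None

def init(db_, logger_, options_):
    global db, logger, options
    db, logger, options = db_, logger_, options_

# callers would then `from fbuild import config` and use config.db
# directly, instead of threading a ctx parameter through every function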

...

With a memoization framework like mem's, it would be possible to support an uninstall target. Even more impressive, uninstall would be able to restore files overwritten during installation.
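
A sketch of what that could look like on top of a blob cache (all names here are hypothetical):

import os
import shutil

def install(cache, src, dst):
    if os.path.exists(dst):
        # remember the file we are about to overwrite
        cache.store_blob(dst)        # hypothetical: copy dst into the cache
    shutil.copy2(src, dst)

def uninstall(cache, dst):
    if cache.has_blob(dst):          # hypothetical: was something overwritten?
        cache.restore_blob(dst)      # put the original file back
    elif os.path.exists(dst):
        os.remove(dst)               # nothing was there before: just remove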

refi64 (Contributor) commented Apr 25, 2016

I'm slightly confused as to some parts of this issue. Do you think you could elaborate on the following?

  1. Fbuild blurs inputs and outputs. The only requirements to enable determinism are: input path, input contents, and output path. However, fbuild uses: input path, input contents, output path, and whether the output path exists. Not only does this confuse the concept of pure, deterministic functions, it has a major drawback (below).

What exactly do you mean by "requirements to enable determinism"?

  1. Fbuild doesn't handle target modification.

...

In fact, whether the target was removed or modified, the memoized function should never be run again. Instead, the target should be restored from the cache if and only if it was modified or removed. Let's compare fbuild to mem:

What exactly do you mean by "should be restored from the cache"?

I gathered that you're saying that build functions shouldn't be run if the output is removed or modified, and that, instead, a copy of the file should be restored from the cache. Is this correct?

refi64 (Contributor) commented Apr 25, 2016

Continuation:

  1. Support for building a single object from multiple sources (link-time optimization):

All mem build targets support multiple sources. If the output target is unspecified, instead of compiling an object for each input source, the input sources will be agglomerated (link-time optimization) and a single optimized output target will be produced. Admittedly, because mem is unmaintained, this depends on the outdated '-combine' flag.

This sort of makes sense, but probably more as an additional argument to build_objects, named combine or something.

Thanks for all the feedback! I'm just trying to make sure I fully understand what you're saying.

orbisvicis (Author) commented Apr 26, 2016

What exactly do you mean by "requirements to enable determinism"?

Generally, the function of a memoizing build system is to transform input files into output files. By determinism, I mean the output can always be predicted from the input. This is the first requirement for purity (the second is that there should be no side effects). If you define a file as both path and contents, then only three factors determine the output:

  • input path
  • input contents
  • output path

Input path and contents are easy to reason about; both fbuild and mem define them the same way. If the output path changes, then the output file changes (file is defined as both path and contents). The output file (path & contents) of a function, however, is not affected by either the current contents or whether the path is already in use. Now technically certain compilers or commands may refuse to overwrite an existing file; this is immaterial, because given the same inputs, the function will never be run twice. Instead mem will itself restore the file from the cache, overwriting the existing file.

Purity here is a bit different than in most functional languages, which consider IO impure. Furthermore, neither mem nor fbuild can enforce purity - their utility is that they attempt to. Accepting inputs and creating outputs is the purpose of a memoizing build system, so you have to imagine you are already inside the IO monad (a haskell reference); from that context, IO is no longer impure. Still, it isn't possible for a function in fbuild or mem to claim it has no side-effects - compilers may create multiple outputs (ocaml), for example. So the job of fbuild or mem is to provide utilities to track these external dependencies manually, preventing side-effects.
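
For example, a compiler that implicitly produces a second output (like ocaml's .cmi interface files) could register it explicitly. A sketch, assuming fbuild's add_external_dependencies_to_call also accepts a dsts keyword mirroring the srcs usage shown later in this thread:

@fbuild.db.caches
def ocaml_compile(ctx, src: fbuild.db.SRC) -> fbuild.db.DST:
    obj = src.replace(".ml", ".cmo")
    iface = src.replace(".ml", ".cmi")  # produced implicitly by ocamlc
    ctx.execute(["ocamlc", "-c", src])
    # track the extra output so it is a known dependency, not a side-effect
    ctx.db.add_external_dependencies_to_call(dsts=[iface])
    return obj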

What exactly do you mean by "should be restored from the cache"?

I gathered that you're saying that build functions shouldn't be run if the output is removed or modified, and that, instead, a copy of the file should be restored from the cache. Is this correct?

Well, mem keeps both a build directory (set via the environment variable BUILD_DIR, defaulting to build) and a cache directory (.mem). Function outputs are stored in <cache-dir>/results, and objects implementing the store function use <cache-dir>/blob. It seems git-inspired:

$ tree build .mem
build
├── test
└── test.o
.mem
├── blob
│   ├── 05
│   │   └── <md5sum #1>
│   │           contents:   Same as test.
│   │           name:       md5sum of contents.
│   └── fc
│       └── <md5sum #2>
│               contents:   Same as test.o.
│               name:       md5sum of contents.
├── deps
│   ├── 0e
│   │   └── <pickled-file #1>
│   │           contents:   Pickled list of mem.nodes.File.
│   │           contents:   The list contains the dependencies of test.c as
│   │                       determined by `gcc -M`
│   │           note:       mem.nodes.File stores path and hash. Dependencies
│   │                       are inputs to the builder. If they change, the
│   │                       builder will be called again.
│   │           name:       md5sum, of what I don't know...
│   └── 1a
│       └── <pickled-file #2>
│               contents:   Pickled list of mem.nodes.File.
│               contents:   The list contains the dependencies of test.o as
│                           determined by `gcc -M`
│               note:       mem considers itself as a dependency
│               name:       md5sum, of what I don't know...
└── results
    ├── 1f
    │   └── <pickled-file #3>
    │           contents:   Pickled mem.nodes.File.
    │           contents:   Path and hash of test.o.
    │           name:       md5sum, probably of function inputs and external
    │                       dependencies
    └── 7e
        └── <pickled-file #4>
                contents:   Pickled mem.nodes.File.
                contents:   Path and hash of test.
                name:       md5sum, probably of function inputs and external
                            dependencies
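
The path scheme looks like the usual content-addressed layout, roughly (a sketch; whether the file name is the full digest or just the remainder, I haven't checked):

import hashlib
import os

def blob_path(cache_dir, data):
    # git-style sharding: the first two hex digits of the md5sum
    # become a subdirectory, limiting entries per directory
    digest = hashlib.md5(data).hexdigest()
    return os.path.join(cache_dir, "blob", digest[:2], digest[2:])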

So yes, that's exactly what I'm saying.

skaller (Member) commented Apr 26, 2016

Technically there is a problem with the assumption that the output is not affected by the current contents of the output file: clearly this assumption is false if the output file is also an input file! Indeed this happens in some systems, such as LaTeX, which re-reads AND re-generates auxiliary data to get cross references correct.

In fact, such iteration is the basis of the only sane building concept: fixpoint building. A fixpoint builder just runs build steps until the system converges to a fixpoint. It needs to detect outputs, but it does not need to be told what they are. Nor does it need to even know the inputs: knowing that is an optimisation. Similarly, ordering build steps is an optimisation. Semantically, fixpoint building (a) may not converge and (b) may converge to an unintended fixpoint, which can be fixed by (a) an iteration limit and (b) hints. As an example, LaTeX is known not to always converge (changing page numbers from 1 to 2 digits can spill an item onto another page, which again changes the reference).
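
A minimal sketch of that loop (purely illustrative):

def fixpoint_build(steps, snapshot, limit=10):
    # run every step until a complete pass changes no outputs,
    # or the iteration limit is hit
    for _ in range(limit):
        before = snapshot()   # e.g. a hash of every file in the tree
        for step in steps:
            step()            # ordering is only an optimisation
        if snapshot() == before:
            return            # converged: a full pass changed nothing
    raise RuntimeError("build did not converge; add hints or raise the limit")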

Fbuild isn't a fixpoint builder as such, but the key factor is that it is an executable builder.

skaller (Member) commented Apr 26, 2016

OK, so here is a problem I have: most build steps produce multiple outputs. You gave Ocaml as an example, and said "So the job of fbuild or mem is to provide utilities to track these external dependencies manually, to prevent side-effects."

I'm not sure what you mean. Producing multiple outputs isn't a side effect, it's producing a product (cartesian product). This is standard; pretty much ALL tools do this. Even a C compiler does it: it produces diagnostics, and some compilers, e.g. on Apple, produce external debug symbol table files (.dSYM). Similarly, Windows linkers produce DLLs and static link thunks.

What's worse is that for some building functions, the outputs CANNOT be known in advance: Felix fdoc packages and in general ALL literate programming tools are like this, as well as all package managers (Debian, etc), and all archive decoders (tar, zip, etc). And for some build functions, the inputs aren't known either! Felix has plenty of build steps that process "all the *.flx files" in a particular directory. And of course in C, you may know the translation unit, but do you know all the #include files? What about a programming language where the include files are calculated using a complex function? That includes C even: conditional compilation!

So basically, it seems to me you may have made a mistake specifying that the effect of a build function is determined only by the input paths, input contents, and output path: although this is logically correct (except for the recursion case mentioned in previous post), it assumes the output path is known. And that the inputs are known. They're not.

Which is one reason why Make and all related build tools are rubbish. A sane builder can NOT require the programmer to specify anything except the build steps and the initial inputs. The programmer cannot know the outputs, final, or intermediate, and they cannot know the dependencies -- in fact this last is clearly true since they cannot know the outputs.

The build tool can detect the outputs (and cache them) but most build tools cannot do even that, let alone detect dependencies. So building is extremely hard. In particular, functional models of building don't work. In theory such a model COULD work if it could handle products. That is, each build step has a product type for input and output, e.g. A * B -> C * D, AND the whole build expression is given, which is very hard to do and requires at least let bindings. An executable model using channels is much better. However BOTH these models fail because they assume something that is completely wrong: that the build system knows how to connect the outputs of one step to the inputs of another. The thing is, the build system not only doesn't know the outputs .. it doesn't know the inputs either.

And that's the core problem. It is not possible to require that the build system know the connections between the steps because, quite simply, the programmer doesn't know them. As Ryan's sig says:

"[ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong."

Fbuild is constructed on the model that the functional view of building is totally wrong. Instead, the programmer specifies an executable procedure which is built from other executable procedures that do things, and which the programmer knows will work from a clean start. It then allows certain parts of that process to be cached to speed the building up. Those parts will be pieces where the programmer knows enough to specify inputs, output, and dependencies, etc.

One should note that even this is quite hard because the action of a build step is not necessarily determined by the inputs. In particular the Felix build system (and possibly fbuild as well) can use features of certain C compilers to calculate the inputs for a C program by running the C preprocessor in a special mode that lists all the #include files. If all the inputs have not changed (including command line switches and relevant environment variables) the outputs won't change either, and the compile to object file step can be skipped. The problem is, you have to RUN the compiler step to determine this and the step is self optimising, so there's no point in the build system trying to do the optimisation.

Similarly the "flx" tool of Felix does automatic dependency checking so there's no point trying to optimise the use of "flx" as a compiler because it optimises itself, and it does it more correctly than any build system could (because, in effect, it is a nested build system).

So again: build systems are NOT functional. It's completely the wrong idea. They're intrinsically imperative. At the fine grained level they're indeterminable: I do not mean non-deterministic, I mean that you cannot predict their actions because you simply don't know enough. The problem is that certain low level build steps, such as compiling a C program, appear to be functional because they're fairly simple. One C translation unit + header files -> object file. So "make" and friends are built on this, but in the end the functional model is unsustainable. A build system is just a program. You have to run it. The best a build tool can do is provide a library that allows the programmer to optimise it.

orbisvicis (Author) commented Apr 26, 2016

I don't see any problems stemming from what you've mentioned. The outputs aren't known - specify them manually. The outputs are recursively used as inputs - specify them manually as both inputs and outputs. The inputs can't be known - that's a real problem. But in this respect automake, fbuild, and mem are all identical.

The iterative case (fixpoint building) can be handled at the build level, recursively:

@...
def obj(sources, output):
    # specify the output as an input,
    # then run a single pass of the iterative process
    output, converged = call_build_tool_iterative_single_pass(sources, output)
    if converged:
        return output
    else:
        return obj(sources, output)

but usually the build tools called by the builder handle this:

@...
def obj(sources):
    # call the iterative builder, then return the output
    output = call_build_tool_iterative_multi_pass(sources)
    return output

Every build system requires predetermined inputs and outputs, be it make, omake, waf, mem, or even fbuild. You can't get anything done without doing so. For example, this is how I'm interpreting what you're saying:

  • GNU Make:

    all: ???
    
    ???: ???.o ???.o ???.o
    	g++ ??? ??? ??? -o ???
  • Or fbuild:

    def build(ctx):
        static = fbuild.builders.c.guess_static(ctx)
        static.build_exe("???", ["???", "???", "???"])

Outputs are always known:

  • zip: zip -r output.predetermined.zip directory.to.zip/

    For example, the inputs and outputs become known during the build, and can be added as external fbuild dependencies:

    with ZipFile("output.predetermined.zip", "w") as myzip:
       for root, dirs, files in os.walk("directory.to.zip"):
           for file in files:
               file = os.path.join(root, file)
               ctx.db.add_external_dependencies_to_call(srcs=[file])
               myzip.write(file)
  • tar: tar -cvzf output.predetermined.tar.gz directory.to.tar/

  • c: gcc -o output.predetermined output.predetermined.c

Inputs are usually known beforehand. If not, they can be calculated at runtime and added as external dependencies. For example, both fbuild and mem use gcc -M. This only runs the preprocessor, so no compilation or optimization. These dependencies are memoized along with the inputs (see above, .mem), and behave like inputs.
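
For reference, a small sketch of turning `gcc -M` output into a dependency list (illustrative):

import subprocess

def c_dependencies(source, cflags=()):
    # -M runs only the preprocessor and prints a make-style rule:
    #   source.o: source.c header1.h header2.h \
    #             header3.h ...
    out = subprocess.check_output(["gcc", "-M", *cflags, source], text=True)
    return out.replace("\\\n", " ").split(":", 1)[1].split()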

Your post made me realize a very tricky situation in which inputs are not always known beforehand - linking. This is a situation that neither fbuild nor mem handles, I think. Say several libraries are linked, as -lc -lgpm -lfpc -lfl. Normally, these are links to the most recent version of a shared library - so if these shared libraries are updated, the build system doesn't have to be re-run. In other words, they are almost never external dependencies. However, libfl.a (from flex) is almost always distributed as a static library. It will be included in the target's output. If the flex library is updated, then the build system should be re-evaluated.

edit: Probably not as much of a problem as I originally thought. Any significant differences to the static library will be expressed in the header, and any changes to the header will trigger a re-build.

All build systems track inputs and outputs to short-circuit the build - that is, files are only rebuilt when necessary (if you don't care about this, you might as well use a shell script). This is done using the dependency graph. In fbuild, the dependency graph is equivalent to the function call graph. I consider side-effects to be dependencies (inputs or outputs) that fbuild doesn't know about. A build with side-effects unknown to fbuild prevents its dependency tracking from working as intended. To fix this, use ctx.db.add_external_dependencies_to_call.

I'm not saying fbuild or mem are functional. They're not - but they could be. They're just more closely tied to the underlying programming language - "pedal to the metal". This is so they can handle all the edge cases you mention. In fact, the only significant difference between memoizing build systems (fbuild or mem) and all the "declarative" build systems is in the expression of the dependency graph (function application vs predetermined).

To reiterate: for fbuild or mem, from the perspective of file mutations, individual transformations should be deterministic and pure.

P.S. Imagine using xonsh with fbuild or mem, instead of the standard subprocess modules or ctx.execute...

refi64 (Contributor) commented Apr 26, 2016

@orbisvicis

  • Holy crap, xonsh looks amazing. Where has that been all my life...?

  • I'm worried that maintaining an extra file cache directory has the potential to significantly balloon disk usage. My idea is to add an extra argument, --extra-cache-dir, that says whether or not it should be enabled. It's definitely a neat idea for certain situations. How does that sound?

  • I completely agree that managing compiler flags with caching is a total, complete mess. Changing build system flags requires explicitly passing --configure, even though it shouldn't. However, I would like to add something sort of "magic" to implicitly do this. For instance:

    @fbuild.db.caches
    def myfunc(ctx, ...):
        myopt = ctx.options.xyz # This would implicitly make myfunc depend on xyz's value; if xyz changes, then this is re-run.
        myvar = ctx.env['CFLAGS'] # Same thing here.
  • Actually, you may not know any outputs until the build step is run. For instance, with Shiboken, you have to first run it, then inspect the output directory to find all the output files.

skaller (Member) commented Apr 26, 2016

"I don't see any problems stemming from what you've mentioned. The outputs aren't known - specify them manually"

I don't understand. The outputs are NOT KNOWN. They can't be specified in advance at all. Here is a tarball. Unpack it, and you can see the outputs. There's no way to put the list of outputs in the build system. You have to unpack it to find that list, and that's the operation anyhow.

skaller (Member) commented Apr 26, 2016

"Every build system requires predetermined inputs and outputs, be it make, omake, waf, mem, or even fbuild. You can't get anything done without doing so."

NO, that is not true. Interscript had neither, it was a fixpoint builder. Similarly fbuild does NOT require the inputs or outputs be known, they're only required if you want to cache a build step. This is why, when running Felix build under fbuild, the source packages are unpacked every time, precisely because the outputs are NOT known by fbuild. The unpacking tool, however, does not overwrite unchanged files, so time stamps are only modified if the output is changed. This allows subsequent steps to be optimised.

It's simply not true that all build systems track dependencies from pre-determined data: more precisely all BROKEN build systems rely on that. As I said, the only correct way to build is with a series of arbitrary EXECUTABLE steps. The build is bottom up. It is driven by the inputs. Goal oriented (functional) build systems cannot work.

Fbuild does not use user provided dependency information. It captures execution steps as they're done. In other words it tracks inputs, outputs, and dependencies by actually executing the build: this information is NOT specified by the author of the build script. Instead, if you need A to build B, you simply have a Python function for building B that calls the function for building A.

Some of these builders, such as the C builder, can cache inputs and outputs and they can check for changes to avoid the compilation being done.

I think you should look at a REAL fbuild build system: here is the root:

https://github.com/felix-lang/felix/blob/master/fbuildroot.py

and the modules:

https://github.com/felix-lang/felix/tree/master/buildsystem

There are NO inputs specified, there are NO outputs specified and there are NO dependencies specified. It's an executable Python program. That's how fbuild works, by design. It just executes your script. SOME of the steps are cached, and SOME of the low level build steps invoke C++ or Ocaml compilers which are supported with dependency checking, and of course at that point the command line for the compiler has to tell the compiler the inputs so they're known, and the outputs, generally, are also known. But even this is not entirely true as you yourself pointed out, since many operations of linkers, for example, use dynamic searches that can't be predicted easily.

Of course you are basically correct, that the inputs have to be known for a given function, in order to use the cached result and avoid recalculation. But getting up to such a step in fbuild and all sane builders (which is no other builder I know of than fbuild) is determined by procedural control flow, not by prediction based on user supplied dependency data.

I'm trying to explain that functional building is simply WRONG. Make fails almost immediately you get slightly complex builds of even C programs. So someone came along and threw in auto-shit, but auto-shit only works for C, and frankly, it often fails even then. Ocaml also has its own dependency detectors which can be used to create Makefiles, and they don't work a lot of the time either because specifying the paths correctly is extremely difficult. Now consider a "compiler" which doesn't follow the pattern, such as an archive extractor, or a literate programming tool, or a document generator, all of which have multiple modes and options, and which generate more or less non-determinable outputs and may also have non-determinable inputs, and you begin to grasp reality.

A build system is a general purpose programming language. Any optimisations are either special cases the programmer codes INTO the program, or they're special optimisations which the compiler knows about. In this case the "compiler" is the build system. It basically just executes the build program. It does the user coded optimisations because that's what the user said to do, and it does the automatic ones because it can. The Felix build system bootstrap is a Python program. It happens to use the fbuild library for some steps to help optimise the build. But not all the steps are optimised, not all can be, and some aren't worth the effort.

So let me repeat: NO working build system can require the user to provide all the inputs, outputs, intermediate files, and dependencies. That is why all the other build systems out there simply do not work. No programmer can do it. The ONLY viable way to specify dependency graphs is not as a data structure, but as a CONTROL structure: specified by the execution of the build program.

And in theory a fixpoint builder doesn't even need that. It can work by executing the build steps in a random order. It just keeps doing it until a complete set of steps makes no changes to the outputs. So a fixpoint builder requires ONLY to know the outputs, and they can usually be detected by execution, and do not have to be specified.

The key point is that with executable builders like fbuild, optimisations have to be conservative: if there's any doubt, rebuild it. The key thing is that skipping build steps and using cached results is an optimisation which does not change the result. If there is no dependency data, you can't optimise the step. Who cares? All the other build systems care because they require this data, and they require it to be exact and precise, and that is untenable. It can't be done. That's why all the other build systems are broken.

orbisvicis (Author) commented Apr 26, 2016

xonsh

Yes xonsh looks amazing. I found out about it and have always been meaning to switch... sigh. I've been torn between xonsh and the haskell shells (there are several).

--extra-cache-dir

I'd argue it is ideal for most situations, though it does require careful consideration of which functions to memoize - too many and disk usage could balloon. I think most projects would benefit from it rather than be hurt by it.

--configure

You should check out mem's papers:

Mem had to decide between passing in the full environment or only the required flags. The full environment would require ignoring extraneous variables while declaring the requisite ones as external dependencies. Passing in only the required flags would get automatic memoization, at the expense of some tedium; mem alleviates this via a helpful decorator. As you can guess, mem chose the second option whereas you prefer the first. I don't see any problems because the context object sidesteps the problems mem had - it isn't memoized.

However, to keep the interface simple, why not reuse existing methods of specifying external dependencies:

@fbuild.db.caches
def myfunc(ctx, ...):
    # isn't ctx.options already used by command-line options? 
    myenv_a = ctx.db.add_external_dependencies_to_call(ctx.environment.CFLAGS)
    myenv_b = ctx.environment["CFLAGS"] # already added as a dependency, no need to re-add
    myenv_c = ctx.environment.get_memoized(ctx.environment.xyz)

Explicit over implicit?

orbisvicis: The outputs aren't known - specify them manually

kirbyfan64: Actually, you may not know any outputs until the build step is run. For instance, with Shiboken, you have to first run it, then inspect the output directory to find all the output files.

skaller: The outputs are NOT KNOWN. They can't be specified in advance at all. Here is a tarball. Unpack it, you can see the outputs. There's no way to put the list of outputs in the build system

I don't think this changes much. While these are both situations that standard build systems don't handle well (at least not in a straightforward fashion), both mem and fbuild do, gracefully.

  1. Return multiple outputs.

    Fbuild supports this through fbuild.db.DSTS. Mem, by virtue of being a young project, never had the need to support multiple outputs - not to say that doing so wouldn't be simple:

    @mem.memoize
    def obj(...):
      # to return a single file (already supported)
      return mem.nodes.File(target)
      # to return multiple files (option one, recommended)
      # check with isinstance(result, collections.abc.Sequence)
      return [mem.nodes.File(t) for t in targets]
      # to return multiple files (option two, not recommended)
      return mem.nodes.Files(targets)
  2. External dependencies not known beforehand

    Both fbuild and mem support adding external dependencies dynamically. The example uses mem, because I find it more straightforward:

    @mem.memoize
    def obj(target, source):
       # either run an external utility (e.g. gcc -M) ...
       externals = get_dependencies_from_utility(source)
       # ... or parse the build output - as demonstrated by QuxBuilder
       # in fbuild's manual
       output = subprocess.check_output(["compiler", target, source])
       externals = get_dependencies_from_output(source, output)
    
       for external in externals:
           mem.add_dep(mem.nodes.File(external))
       return mem.nodes.File(target)
  3. Inputs not known beforehand

    Irrespective of the build system, it isn't possible to build a project without knowing the top-level inputs - in the previous case, source.

  4. Outputs not known beforehand

    If a build command accepts an output path, then three factors determine the target output: input path, input contents, and output path. If the build command doesn't accept such a path, then clearly the output is determined only by input path and input contents (hopefully - see evil_compiler below).

    This can be handled quite simply in mem or fbuild, for example by calling the previously defined function as so:

    obj(None, source)

    Though hopefully such a memoized function would be specialized to the specific builder, and not require a target:

    @mem.memoize
    def obj(source):
       output = subprocess.check_output(["compiler", source])
       # perhaps it is necessary to parse output to determine the output targets
       targets = get_output_targets_by_parse(output, source)
       # or perhaps it can be done by calling an external utility
       targets = get_output_targets_by_external(source)
       # or perhaps it can be done within python (ie, using ZipFile)
       targets = get_output_targets_by_function(source)
    
       return [mem.nodes.File(t) for t in targets]

    However, every build system will fail - including fbuild and mem - if it is impossible to determine target outputs, even dynamically. This is an example of a non-deterministic compiler:

    def random_string(a, b):
       return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(random.randint(a,b)))
    
    def evil_compiler(source):
       out = random_string(5,50)
       out_fake = random_string(5,50)
       print("Compiling {} to {}".format(source, out_fake)
       with open(out, "w") as file:
           file.write(str(random.randint(10**10, 10**20)) + do_compile(source) + str(time.time()))
       return

    However, this compiler can't be supported by mem or fbuild, though technically it would work with memoize.py - but that's a different story. For the purposes of build systems, input->output transformations must remain deterministic and pure. Delayed values (inputs and outputs not known until runtime) do not break these requirements.
