Redesign task and target infrastructure #1

ibab · 2015-04-15T20:19:59Z

The task and target infrastructure will need to be reimplemented to be easier to understand, more robust and more compatible with additional targets (like remote files).

It turned out to be quite useful to detect changes in the dependency graph through a checksum mechanism:

Each initial target (i.e. not created by a task) computes a checksum of its contents
Each task computes a checksum calculated from the checksums of its inputs and its code
Each created target gets passed a deterministic key by its task
Each intermediate or final target computes its checksum from the checksum of its task and its deterministic key
All checksums are saved in a database, and changes in a target's checksum force the target to be recreated
Old checksums must be cleaned up properly; otherwise switching back and forth between two states won't trigger a rebuild
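
The propagation rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the length-prefixed hashing scheme, and the plain dict standing in for the database are all hypothetical and not taken from datapipe.

```python
import hashlib


def sha1_hex(*parts):
    """Combine several strings into one SHA1 hex digest."""
    h = hashlib.sha1()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


def initial_checksum(content):
    """Initial target: checksum of its contents (bytes)."""
    return hashlib.sha1(content).hexdigest()


def task_checksum(input_checksums, code):
    """Task: checksum derived from its code and the checksums of its inputs."""
    return sha1_hex(code, *input_checksums)


def created_target_checksum(task_sum, key):
    """Created target: checksum derived from its task's checksum and its key."""
    return sha1_hex(task_sum, key)


def needs_rebuild(db, target_id, current):
    """A target is rebuilt when its checksum differs from the saved one."""
    return db.get(target_id) != current
```

Keeping exactly one checksum per target in the store (overwriting on every successful run) is one way to avoid the stale-entry problem: going back to an earlier state then no longer matches the saved value.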

How do we compute the initial checksums?
A relatively nice way is to use pickle to dump the object into a string, which can be hashed with SHA1.
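
A minimal sketch of that idea, assuming a hypothetical helper named `object_checksum`. The pickle protocol is pinned because the byte stream can differ between protocols, and pickled output is not guaranteed to be stable for every object (sets and objects with unordered state are examples), so this is only reliable for simple, deterministic values.

```python
import hashlib
import pickle


def object_checksum(obj):
    """Pickle an object and return the SHA1 hex digest of the result.

    `object_checksum` is a hypothetical name used for illustration.
    """
    # Pin the protocol so the serialized bytes do not depend on the default.
    payload = pickle.dumps(obj, protocol=2)
    return hashlib.sha1(payload).hexdigest()
```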

How do timestamps (and possibly other info) fit in with this?
They could be treated as a fallback, as in

IF a target's checksum hasn't changed AND it offers timestamps THEN check if the timestamp has changed compared to the last saved state
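
One possible reading of this rule in code. The function name and the layout of the saved state (a dict with `checksum` and `mtime` keys) are assumptions made for the sketch.

```python
def is_up_to_date(saved, checksum, mtime=None):
    """Decide whether a target is up to date.

    saved    -- dict with the state from the last run, or None if never run
    checksum -- the target's current checksum
    mtime    -- the target's current timestamp, or None if it offers none
    """
    if saved is None:
        return False  # nothing recorded yet, so the target must be built
    if saved.get("checksum") != checksum:
        return False  # the checksum is the primary signal
    # Fallback: only consulted when the checksum is unchanged
    # and the target actually offers a timestamp.
    if mtime is not None and saved.get("mtime") != mtime:
        return False
    return True
```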

ibab · 2015-04-20T11:23:58Z

The Target and Task classes have been rewritten to make them more flexible:

Targets have access to their version after the last run (using LevelDB) and have full control over deciding whether they are up to date
Initial targets (without a creating task) no longer compute their hash from their contents (this led to many problems). Instead, every target must now have a unique identifier (what this means depends on the Target, e.g. a filepath). This also allows us to scrap the unique_key field set on a task's outputs.
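
To illustrate the shape of such a design, here is a rough sketch and not the actual datapipe classes. The method names, the `LocalFile` subclass, and the plain dict used in place of LevelDB are all assumptions.

```python
import os
import tempfile


class Target:
    """Hypothetical base class: a target knows its identifier and its state."""

    def __init__(self, store):
        self.store = store  # a dict standing in for the LevelDB database

    def identifier(self):
        raise NotImplementedError

    def current_state(self):
        raise NotImplementedError

    def last_state(self):
        """State recorded after the last run, or None."""
        return self.store.get(self.identifier())

    def is_up_to_date(self):
        # Subclasses may override this to take full control of the decision.
        state = self.current_state()
        return state is not None and state == self.last_state()

    def save_state(self):
        self.store[self.identifier()] = self.current_state()


class LocalFile(Target):
    """A file target identified by its absolute path."""

    def __init__(self, path, store):
        super().__init__(store)
        self.path = os.path.abspath(path)

    def identifier(self):
        return "LocalFile:" + self.path

    def current_state(self):
        if not os.path.exists(self.path):
            return None
        st = os.stat(self.path)
        return {"mtime": st.st_mtime, "size": st.st_size}


store = {}
with tempfile.TemporaryDirectory() as tmp:
    target = LocalFile(os.path.join(tmp, "data.txt"), store)
    before_creation = target.is_up_to_date()  # file does not exist yet
    with open(target.path, "w") as f:
        f.write("a,b\n")
    target.save_state()
    after_save = target.is_up_to_date()       # state matches the saved one
    with open(target.path, "a") as f:
        f.write("1,2\n")
    after_edit = target.is_up_to_date()       # size changed since save_state()
```

Because the identifier comes from the target itself, the file contents never have to be read just to decide whether anything changed.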

ibab · 2015-04-22T09:19:45Z

To make datapipe a lot more efficient, the only thing saved in LevelDB now is the memory dict that contains the relevant data for checking the state of a target.
The memory is also serialized through simplejson instead of pickle, which should make things more robust.
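
A small sketch of storing such a dict as JSON. The standard-library `json` module is used here instead of simplejson to keep the example dependency-free (both offer `dumps` and `loads`). The helper names, the example key, and the dict standing in for LevelDB are assumptions.

```python
import json


def encode_memory(memory):
    """Serialize a target's memory dict to bytes for a key-value store."""
    # sort_keys gives a stable byte representation for equal dicts.
    return json.dumps(memory, sort_keys=True).encode("utf-8")


def decode_memory(raw):
    """Inverse of encode_memory."""
    return json.loads(raw.decode("utf-8"))


db = {}  # stand-in for the LevelDB database (bytes keys, bytes values)
key = b"LocalFile:/data/input.csv"  # hypothetical target identifier
db[key] = encode_memory({"mtime": 1429696398.0, "size": 2048})
restored = decode_memory(db[key])
```

Unlike pickle, JSON only round-trips plain data (for instance, tuples come back as lists), so the memory dict has to stay limited to simple types.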

ibab · 2015-04-22T11:53:18Z

This could also be the basis of a REST API that allows targets to be queried and synced across several machines.

ibab added the enhancement label Apr 20, 2015
