Redesign task and target infrastructure #1

Open
ibab opened this issue Apr 15, 2015 · 3 comments

ibab commented Apr 15, 2015

The task and target infrastructure will need to be reimplemented to be easier to understand, more robust and more compatible with additional targets (like remote files).

It turned out to be quite useful to detect changes in the dependency graph through a checksum mechanism:

  • Each initial target (i.e. not created by a task) computes a checksum of its contents
  • Each task computes a checksum calculated from the checksums of its inputs and its code
  • Each created target gets passed a deterministic key by its task
  • Each intermediate or final target computes its checksum from the checksum of its task and its deterministic key
  • All checksums are saved in a database, and changes in a target's checksum force the target to be recreated
  • Old checksums need to be cleaned up properly; otherwise, switching back and forth between two states won't cause a rebuild
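
As a rough sketch of how the checksums in the list above could chain together (not the actual datapipe code): all function names here are made up for illustration, and a plain `dict` stands in for the checksum database.

```python
import hashlib


def sha1_hex(*parts: str) -> str:
    """Combine several strings into one SHA1 hex digest."""
    h = hashlib.sha1()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(f"{len(data)}:".encode("ascii"))
        h.update(data)
    return h.hexdigest()


def initial_target_checksum(contents: bytes) -> str:
    """Initial targets hash their own contents."""
    return hashlib.sha1(contents).hexdigest()


def task_checksum(input_checksums, code: str) -> str:
    """A task hashes its code together with the checksums of its inputs."""
    return sha1_hex(code, *input_checksums)


def created_target_checksum(task_sum: str, key: str) -> str:
    """A created target hashes its task's checksum and its deterministic key."""
    return sha1_hex(task_sum, key)


def needs_rebuild(db: dict, target_id: str, new_checksum: str) -> bool:
    """A target is rebuilt when its checksum differs from the saved one."""
    return db.get(target_id) != new_checksum
```

In this sketch the database keeps a single checksum per target, overwritten after each build. That is one possible way to handle the last point: switching an input back to an earlier state produces a checksum that no longer matches the saved one, so the target is rebuilt.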

How do we compute the initial checksums?
A relatively nice way is to use pickle to dump the object into a string, which can be hashed with SHA1.
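
A minimal sketch of that idea, using only the standard library. The function name `object_checksum` and the fixed protocol number are assumptions for illustration.

```python
import hashlib
import pickle


def object_checksum(obj) -> str:
    """Pickle an object and return the SHA1 hex digest of the resulting bytes."""
    # Pinning the protocol keeps the byte stream from changing just because
    # the default pickle protocol changed between Python versions.
    data = pickle.dumps(obj, protocol=2)
    return hashlib.sha1(data).hexdigest()
```

One caveat: pickle does not promise identical bytes for objects that merely compare equal (sets and dicts built in a different order are typical examples), so a checksum like this can change even when the data is logically the same.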

How do timestamps (and possibly other info) fit in with this?
They could be treated as a fallback, as in

  • IF a target's checksum hasn't changed AND it offers timestamps THEN check if the timestamp has changed compared to the last saved state
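
The fallback rule could look roughly like this. The `saved` dict layout and the function name are assumptions, not part of datapipe.

```python
from typing import Optional


def is_up_to_date(
    saved: Optional[dict],
    checksum: str,
    timestamp: Optional[float] = None,
) -> bool:
    """Compare a target's current state with its last saved state."""
    if saved is None:
        # Never seen before, so it has to be built.
        return False
    if saved.get("checksum") != checksum:
        # A changed checksum always forces a rebuild.
        return False
    if timestamp is not None:
        # Fallback: the checksum matches, so consult the timestamp if offered.
        return saved.get("timestamp") == timestamp
    return True
```

Targets that cannot provide a timestamp pass `None` and are judged on the checksum alone.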

ibab commented Apr 20, 2015

The Target and Task classes have been rewritten to make them more flexible:

  • Targets have access to their version after the last run (using LevelDB) and have full control over deciding whether they are up to date
  • Initial targets (without a creating task) no longer compute their hash from their contents (this led to many problems). Instead, every target must now have a unique identifier (what this means depends on the Target, e.g. a filepath). This also allows us to scrap the unique_key field set on a task's outputs.
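
To illustrate the shape of such a design (this is a sketch, not the rewritten classes themselves): the method names `identifier`, `current_state`, `stored_state` and `remember`, as well as the `LocalFile` class, are hypothetical, and a plain `dict` stands in for LevelDB.

```python
import os


class Target:
    """Base class: knows its identifier and the state saved after the last run."""

    def __init__(self, store):
        # `store` is any dict-like mapping; it stands in for LevelDB here.
        self.store = store

    def identifier(self):
        raise NotImplementedError

    def current_state(self):
        raise NotImplementedError

    def stored_state(self):
        return self.store.get(self.identifier())

    def is_up_to_date(self):
        # Subclasses are free to override this with their own rule.
        return self.stored_state() == self.current_state()

    def remember(self):
        self.store[self.identifier()] = self.current_state()


class LocalFile(Target):
    """A file on disk, identified by its absolute path."""

    def __init__(self, path, store):
        super().__init__(store)
        self.path = path

    def identifier(self):
        return os.path.abspath(self.path)

    def current_state(self):
        if not os.path.exists(self.path):
            return None
        return {"timestamp": os.path.getmtime(self.path)}

    def is_up_to_date(self):
        # A missing file is never up to date, even if nothing was stored.
        current = self.current_state()
        return current is not None and current == self.stored_state()
```

Because the identifier comes from the target itself, the state lookup needs no key supplied by the task.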


ibab commented Apr 22, 2015

To make datapipe a lot more efficient, the only thing saved in LevelDB now is the memory dict that contains the relevant data for checking the state of a target.
The memory is also serialized through simplejson instead of pickle, which should make things more robust.
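
A small sketch of what storing only the memory dict might look like. It uses the standard library `json` module in place of simplejson (both provide `dumps` and `loads`), and a plain `dict` with bytes keys and values as a stand-in for LevelDB. The function names are hypothetical.

```python
import json


def save_memory(db, identifier: str, memory: dict) -> None:
    """Serialize a target's memory dict to JSON and store it under its identifier."""
    # sort_keys gives a stable byte representation for equal dicts.
    payload = json.dumps(memory, sort_keys=True)
    db[identifier.encode("utf-8")] = payload.encode("utf-8")


def load_memory(db, identifier: str):
    """Return the stored memory dict, or None if the target is unknown."""
    raw = db.get(identifier.encode("utf-8"))
    if raw is None:
        return None
    return json.loads(raw.decode("utf-8"))
```

Unlike pickle, JSON restricts the memory to plain data (strings, numbers, lists, dicts), which keeps the stored state readable from other tools.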


ibab commented Apr 22, 2015

This could also be the basis of a REST API that allows targets to be queried and synced across several machines.
