Avoid running a chain if there are no changes at its sources #605

shcheklein · 2024-11-16T20:51:59Z

These days I quite often do this:

if "dclm-raw-text" not in datasets:
    (
        DataChain.from_dataset("dclm-index")
        .settings(cache=True)
        .limit(1)
        .gen(extract, output={"file": File, "json": dict})
        .save("dclm-raw-text")
    )

to avoid running that code again if the dataset is ready.

The downside is that I still need to rerun it from time to time (e.g. when I change params, or when something changes at its source, dclm-index in this case).

I think we can make save() analyze the chain's dependencies (including the query itself) and skip execution when nothing has changed (behind a flag, or by default?).

This adds a lot of value compared to basic data processing libs, because we are able to analyze the graph of dependencies.
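To make the idea concrete, here is a rough sketch of the check I have in mind. It is not DataChain code: `chain_fingerprint`, `save_if_changed`, and the in-memory `registry` are hypothetical names made up for illustration, and it assumes we can get a version number for each source dataset and some stable text representation of the query.

```python
import hashlib
import json
from typing import Callable


def chain_fingerprint(source_versions: dict, query: str, params: dict) -> str:
    """Hash everything the result depends on (hypothetical helper).

    source_versions: e.g. {"dclm-index": 3}; query: a stable text form of
    the chain; params: settings such as {"cache": True}.
    """
    payload = json.dumps(
        {"sources": source_versions, "query": query, "params": params},
        sort_keys=True,  # make the hash independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_if_changed(
    name: str,
    fingerprint: str,
    registry: dict,
    run: Callable[[], None],
) -> bool:
    """Run the chain only if its fingerprint differs from the stored one.

    Returns True if the chain ran, False if it was skipped.
    """
    if registry.get(name) == fingerprint:
        return False  # sources, query, and params are unchanged: skip
    run()
    registry[name] = fingerprint  # remember what the saved dataset was built from
    return True
```

With something like this, a second call with the same sources, query, and params would be skipped, while bumping the version of dclm-index or changing a param would produce a different fingerprint and trigger a rerun. In a real implementation the fingerprint would presumably be stored alongside the saved dataset rather than in a dict.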
