These days I quite often do this:

    if "dclm-raw-text" not in datasets:
        (
            DataChain.from_dataset("dclm-index")
            .settings(cache=True)
            .limit(1)
            .gen(extract, output={"file": File, "json": dict})
            .save("dclm-raw-text")
        )

to avoid running that code again if the dataset is already there.
The downside is that I still need to rerun it from time to time (e.g. when I change params, or when something changes in its source, dclm-index in this case).
I think we can make save() analyze the dependencies (including the query) and skip the run when nothing has changed (behind a flag, or by default?).
This would add a lot of value compared to basic data processing libs: it builds on our ability to analyze the graph of dependencies.
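To make the idea concrete, here is a rough sketch of the kind of check save() could do. It is not how DataChain works today: `fingerprint`, `needs_rerun`, the in-memory `saved` registry, and the integer dataset versions are all hypothetical names and assumptions made up for illustration. The point is only that hashing the query together with the versions of its sources is enough to tell "nothing changed" apart from "params or source changed".

```python
import hashlib
import json
from typing import Dict


def fingerprint(query_repr: str, dependencies: Dict[str, int]) -> str:
    """Hash the query text together with the versions of its source datasets.

    Both arguments are hypothetical: a string form of the chain and a
    mapping of source dataset name -> version.
    """
    payload = json.dumps(
        {"query": query_repr, "deps": dependencies}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rerun(
    name: str,
    query_repr: str,
    dependencies: Dict[str, int],
    saved_fingerprints: Dict[str, str],
) -> bool:
    """Return True if the dataset is missing or was built from different inputs."""
    return saved_fingerprints.get(name) != fingerprint(query_repr, dependencies)


# Hypothetical registry of fingerprints recorded at save time.
saved: Dict[str, str] = {}
query = 'from_dataset("dclm-index").limit(1).gen(extract)'
deps = {"dclm-index": 1}

# Dataset has never been saved, so the chain must run.
first_run = needs_rerun("dclm-raw-text", query, deps, saved)
saved["dclm-raw-text"] = fingerprint(query, deps)

# Same query, same source version: the run can be skipped.
unchanged = needs_rerun("dclm-raw-text", query, deps, saved)

# The source got a new version, or a param changed: run again.
source_changed = needs_rerun("dclm-raw-text", query, {"dclm-index": 2}, saved)
params_changed = needs_rerun(
    "dclm-raw-text", query.replace("limit(1)", "limit(2)"), deps, saved
)
```

With something like this in place, the manual `if "dclm-raw-text" not in datasets` guard would no longer be needed, and the stale-dataset case would be handled too.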