Is your feature request related to a problem? Please describe.
A common pattern in pipelines is to wake up, extract data from a data lake or object store, and then turn it into a graph. However, this gets tricky when the pipeline throws an unhandled exception, is unscheduled, or is otherwise interrupted before it finishes. If the user restarts the pipeline in some way, it starts over from the beginning and repeats all of the work it had already done.
A quick Google search of "data pipeline checkpoints" will reveal that this is a common pattern/solution. While nodestream doesn't endeavor to have every bell and whistle, this one seems like an obvious win for long-running pipelines.
Describe the solution you'd like
The proposed solution has several components:
Allow users to specify the interval at which checkpoints occur.
Provide an interface that allows extractors to produce a checkpoint: async def checkpoint(self). The extractor can return anything that is pickle-able.
Provide an interface that allows extractors to resume from a checkpoint: async def resume_checkpoint(self, checkpoint). If this method throws an exception, the error will be logged and extract_records will be called as normal.
Introduce a pluggable ObjectStore API that can be used for more than just checkpoints; [REQUEST] Record Schema Inference and Enforcement #37 would also benefit from this. Possible implementations are null, tempfile, and s3. Sketches of both interfaces follow below.
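To make the checkpoint hooks concrete, here is a minimal sketch. The PagedExtractor class, its fields, and its page-based record shape are hypothetical examples, not an existing nodestream extractor; only the checkpoint, resume_checkpoint, and extract_records method names come from the proposal above.

```python
# Hypothetical extractor that pages through a listing and exposes the
# proposed checkpoint hooks. A sketch of the idea, not nodestream's API.
from typing import Any, AsyncIterator


class PagedExtractor:
    """Remembers the last page index it was working on."""

    def __init__(self, page_keys: list[str]) -> None:
        self.page_keys = page_keys
        self.current_index = 0

    async def extract_records(self) -> AsyncIterator[dict]:
        # Start from wherever resume_checkpoint() left us (0 by default).
        for index in range(self.current_index, len(self.page_keys)):
            self.current_index = index
            yield {"page": self.page_keys[index]}

    async def checkpoint(self) -> Any:
        # Anything pickle-able is fine; here, just the index of the page
        # currently being processed.
        return self.current_index

    async def resume_checkpoint(self, checkpoint: Any) -> None:
        # If this raises, the error would be logged and extract_records
        # would be called from the beginning, per the proposal.
        self.current_index = int(checkpoint)
```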
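And a rough sketch of what the pluggable ObjectStore API could look like, assuming simple put/get semantics; the class and method names here are assumptions for illustration, not an existing interface. The null and tempfile implementations are shown, and an s3 variant would wrap a client in the same shape.

```python
# Hypothetical pluggable ObjectStore with null and tempfile backends.
import pickle
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Optional


class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[Any]: ...


class NullObjectStore(ObjectStore):
    """Discards everything; effectively disables checkpointing."""

    def put(self, key: str, value: Any) -> None:
        pass

    def get(self, key: str) -> Optional[Any]:
        return None


class TempfileObjectStore(ObjectStore):
    """Pickles values to files under a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, value: Any) -> None:
        (self.root / key).write_bytes(pickle.dumps(value))

    def get(self, key: str) -> Optional[Any]:
        path = self.root / key
        return pickle.loads(path.read_bytes()) if path.exists() else None
```

Keeping the store pluggable means the checkpointing machinery never has to know where the bytes land, and other features (like #37) could reuse it.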
Describe alternatives you've considered
The only other alternative I can think of is for each step to implement that behavior on its own. However, this is troublesome: every step would end up duplicating its own persistence and recovery logic.