Add Checkpointer step #1114
… specified points during the life of the pipeline
…frequency of file writes
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1114/
CodSpeed Performance Report
Merging #1114 will degrade performances by 35.85%
LGTM! just some comments
src/distilabel/steps/checkpointer.py
Outdated
from huggingface_hub import HfApi


class Checkpointer(Step):
Maybe it's better if we call it HuggingFaceHubCheckpointer?
if not self._api.repo_exists(repo_id=self.repo_id, repo_type="dataset"):
    self._logger.info(f"Creating repo {self.repo_id}")
    self._api.create_repo(
        repo_id=self.repo_id, repo_type="dataset", private=self.private
    )
let's move this logic to a method that can be unit tested
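One way the suggestion above could look, as a minimal sketch rather than the change that was actually made in the PR: the repo check is pulled into a standalone helper that receives the API client as an argument, so a test can pass in a stub instead of a real `HfApi`. The helper name `ensure_repo_exists` and its return value are assumptions for illustration; only `repo_exists` and `create_repo` come from the snippet under review.

```python
import logging
from typing import Any


def ensure_repo_exists(
    api: Any, repo_id: str, private: bool, logger: logging.Logger
) -> bool:
    """Create the dataset repo if it is missing.

    `api` is expected to behave like `huggingface_hub.HfApi` (it only needs
    `repo_exists` and `create_repo`). The function name and the boolean
    return value are hypothetical, chosen to make the logic easy to test.

    Returns True if the repo was created, False if it already existed.
    """
    if api.repo_exists(repo_id=repo_id, repo_type="dataset"):
        return False
    logger.info(f"Creating repo {repo_id}")
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private)
    return True
```

Inside the step this could be called as `ensure_repo_exists(self._api, self.repo_id, self.private, self._logger)`, and a unit test only needs a small fake object that records the calls.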
src/distilabel/steps/checkpointer.py
Outdated
for i, input in enumerate(inputs):
    # Each section of *inputs corresponds to a different configuration of the pipeline
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".jsonl", delete=False
why delete=False instead of letting the context manager handle that?
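For reference, a sketch of one way to let a context manager own the cleanup. It swaps `NamedTemporaryFile` for `tempfile.TemporaryDirectory`, which is a different technique from the one in the diff: the file is written and closed normally, uploaded while the directory still exists, and removed together with the directory on exit. The function name `write_jsonl_and_upload`, the file name `checkpoint.jsonl`, and the `upload` callback are hypothetical; in the step the callback would presumably wrap the actual Hub upload call, which the snippet above does not show.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def write_jsonl_and_upload(
    rows: List[Dict[str, Any]], upload: Callable[[str], None]
) -> None:
    """Write rows as JSON Lines to a temporary file and hand its path to `upload`.

    `upload` is a hypothetical callback standing in for the real upload logic.
    The temporary directory, and the file inside it, are deleted when the
    outer `with` block exits, even if `upload` raises.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "checkpoint.jsonl"
        with path.open("w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        # The file is closed (and therefore flushed) before it is uploaded.
        upload(str(path))
```

The upload has to happen inside the outer `with` block; once it exits, the path no longer exists.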
Description
Adds a first version of the Checkpointer step, which writes data to the Hub while the pipeline executes. This is useful for longer pipelines.
Logs for the sample pipeline: