[DATA] Extract data preprocessing code and add DVC pipeline #228
Open
Description
User Story:
As a data scientist,
I want to extract our monolithic code into discrete pipeline stages managed by DVC,
So that I can run steps independently, track dependencies clearly, and maintain a clean, reproducible workflow.
Acceptance Criteria:
- Data download logic extracted to src/data_download.py
- Training logic extracted to src/train.py
- DVC pipeline with two stages in dvc.yaml
- Pipeline supports independent execution of each stage
- Clear input/output contracts between stages
- Parameter management via config.yaml
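One way to make the input/output contract and the config.yaml parameter handling explicit inside a stage script is sketched below. This is only an illustration, not the project's actual code: the helper names `check_stage_contract` and `load_params`, and the idea of one top-level section per stage in config.yaml, are assumptions. Reading the config relies on the third-party PyYAML package.

```python
from pathlib import Path


def check_stage_contract(inputs, outputs):
    """Fail fast if a declared input is missing; prepare output directories.

    Hypothetical helper: `inputs` and `outputs` are iterables of paths that
    mirror the deps/outs a stage would declare in dvc.yaml.
    """
    missing = [str(p) for p in map(Path, inputs) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"missing stage inputs: {missing}")
    for out in map(Path, outputs):
        # Create only the parent directory; the stage itself writes the output.
        out.parent.mkdir(parents=True, exist_ok=True)


def load_params(section, config_path="config.yaml"):
    """Return one top-level section of config.yaml (assumed layout)."""
    import yaml  # PyYAML, imported lazily so the contract check stays stdlib-only

    with open(config_path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)[section]
```

In src/train.py, for example, the script could call `check_stage_contract(["data/raw"], ["models/trained_model.pth"])` before training starts, so a missing upstream output surfaces as a clear error rather than a failure halfway through the run.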
Definition of Done:
- All acceptance criteria met
- Code reviewed and approved
- Both stages run successfully, both independently and sequentially, e.g. using dvc repro <stage>
- Documentation updated in docs/dvc_workflow.md
- Git-tracked pipeline files
Pipeline Architecture:
graph LR
A[data_download.py] -->|outputs| B[data/raw]
B --> C[train.py]
C -->|outputs| D[models/trained_model.pth]
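The graph above could map onto a dvc.yaml roughly like the following. This is a sketch under assumptions, not the final pipeline definition: the stage names `data_download` and `train`, the `python ...` commands, and the `data_download` / `train` parameter sections in config.yaml are illustrative placeholders. Only the file paths come from this issue.

```yaml
stages:
  data_download:            # assumed stage name
    cmd: python src/data_download.py
    deps:
      - src/data_download.py
    params:
      - config.yaml:
          - data_download   # assumed top-level section in config.yaml
    outs:
      - data/raw
  train:                    # assumed stage name
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/raw            # output of the previous stage, which links the two
    params:
      - config.yaml:
          - train           # assumed top-level section in config.yaml
    outs:
      - models/trained_model.pth
```

With a layout like this, `dvc repro train` would rerun the train stage along with any upstream stage whose dependencies changed, `dvc repro --single-item train` would run only that stage, and `dvc dag` would print the resulting graph for comparison with the diagram.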