
[DATA] Extract data preprocessing code and add DVC pipeline #228

@rojberr


User Story:

As a data scientist,
I want to extract our monolithic code into discrete pipeline stages managed by DVC,
So that I can run steps independently, track dependencies clearly, and maintain a clean, reproducible workflow.


Acceptance Criteria:

  1. Data download logic extracted to src/data_download.py
  2. Training logic extracted to src/train.py
  3. DVC pipeline with two stages in dvc.yaml
  4. Pipeline supports independent execution of each stage
  5. Clear input/output contracts between stages
  6. Parameter management via config.yaml
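A rough sketch of what criteria 3 to 6 could look like in practice, written as a small Python scaffold that writes a two-stage dvc.yaml. The stage names (data_download, train), the config.yaml sections of the same names, and the helper write_pipeline are assumptions made for illustration; only the script paths, data/raw, and models/trained_model.pth come from this issue.

```python
from pathlib import Path

# Hypothetical dvc.yaml contents. Stage names and the config.yaml
# sections ("data_download", "train") are assumptions, not agreed names.
DVC_YAML = """\
stages:
  data_download:
    cmd: python src/data_download.py
    deps:
      - src/data_download.py
    params:
      - config.yaml:
          - data_download
    outs:
      - data/raw
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/raw
    params:
      - config.yaml:
          - train
    outs:
      - models/trained_model.pth
"""


def write_pipeline(repo_root: Path) -> Path:
    """Write dvc.yaml into repo_root unless one already exists."""
    target = repo_root / "dvc.yaml"
    if not target.exists():
        target.write_text(DVC_YAML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Pipeline definition at {write_pipeline(Path.cwd())}")
```

With a file like this in place, dvc repro train would rerun train plus any out-of-date upstream stage, while dvc repro --single-item train restricts the run to that one stage.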

Definition of Done:

  • All acceptance criteria met
  • Code reviewed and approved
  • Both stages run successfully, independently and in sequence, e.g. using dvc repro <stage>
  • Documentation updated in docs/dvc_workflow.md
  • Pipeline files tracked in Git

Pipeline Architecture:

graph LR
    A[data_download.py] -->|outputs| B[data/raw]
    B --> C[train.py]
    C -->|outputs| D[models/trained_model.pth]
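The input/output contract in the graph above could be sketched as two functions whose only coupling is the data/raw directory. This is a minimal illustration, not the project's actual code: download_stage, train_stage, the sample.csv file, and the placeholder model bytes are all hypothetical stand-ins for the real download and training logic.

```python
from pathlib import Path

RAW_DIR = Path("data/raw")                      # output of stage 1, input of stage 2
MODEL_PATH = Path("models/trained_model.pth")   # output of stage 2


def download_stage(raw_dir: Path = RAW_DIR) -> Path:
    """Stage 1: populate raw_dir. The file written here is a placeholder."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: real code would fetch the dataset instead of writing a stub.
    (raw_dir / "sample.csv").write_text("x,y\n1,2\n", encoding="utf-8")
    return raw_dir


def train_stage(raw_dir: Path = RAW_DIR, model_path: Path = MODEL_PATH) -> Path:
    """Stage 2: read raw_dir, write model_path. Fails fast if inputs are missing."""
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"{raw_dir} is missing; run the download stage first")
    inputs = sorted(p for p in raw_dir.iterdir() if p.is_file())
    if not inputs:
        raise FileNotFoundError(f"no input files found in {raw_dir}")
    model_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: real code would train a model and serialize it here.
    model_path.write_bytes(b"placeholder-model")
    return model_path


if __name__ == "__main__":
    download_stage()
    print(f"Wrote {train_stage()}")
```

Keeping each stage's inputs and outputs as explicit paths means train_stage can be run on its own whenever data/raw already exists, which is the behaviour criterion 4 asks for.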
