[Bounty: 50 RTC] Training data generator for SFT pipeline #2

@Scottcjn

Description

Port the SophiaCore data generation pattern to ShaprAI for supervised fine-tuning (SFT) training data generation.

Requirements

  • Port the proven pattern from sophiacore_data_generator.py to work with ShaprAI's template system
  • Generate ChatML-formatted training data with proper <|im_start|> / <|im_end|> tokens
  • Support identity-weighted examples (personality-defining responses weighted higher in training)
  • Customizable personality templates — users define their agent's voice, values, and behavioral boundaries
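
As a rough illustration of the ChatML requirement above, the sketch below renders one conversation with `<|im_start|>` / `<|im_end|>` delimiters. The function name `render_chatml` and the sample persona are hypothetical and not taken from `sophiacore_data_generator.py`; the layout assumed here is the common one, with the role on the first line, the content after it, and the end token closing each turn.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render a list of {"role", "content"} dicts as a single ChatML string."""
    turns = []
    for msg in messages:
        # Each turn: start token + role, newline, content, end token.
        turns.append(f"{IM_START}{msg['role']}\n{msg['content']}{IM_END}")
    # Turns are separated by newlines; a trailing newline ends the sample.
    return "\n".join(turns) + "\n"


if __name__ == "__main__":
    # Hypothetical persona, for illustration only.
    conversation = [
        {"role": "system", "content": "You are Ada, a patient and direct assistant."},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm Ada. I give short, honest answers."},
    ]
    print(render_chatml(conversation))
```

Keeping the rendering in one small function would make it easy to unit test separately from sampling and template loading.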

Acceptance Criteria

  • shaprai/training/sft_generator.py module created
  • Generates valid ChatML JSONL output
  • Identity-weighted sampling: core personality examples appear 3-5x more frequently
  • Template-driven: personality defined via YAML/JSON config, not hardcoded
  • CLI command: shaprai generate-sft --template my_agent.yaml --output train.jsonl --count 1000
  • Includes at least 3 example personality templates
  • Compatible with HuggingFace TRL SFTTrainer format
  • Unit tests for generator logic
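
One possible reading of the identity-weighting and JSONL criteria is sketched below. Everything about the template schema (`system_prompt`, `identity_examples`, `general_examples`, `identity_weight`) and the function names is an assumption made for illustration; the issue does not define them. The sketch emits records with a `messages` list, which is the conversational dataset layout documented for TRL's `SFTTrainer`. Whether the final module should instead write pre-rendered ChatML into a `text` field is left open.

```python
import json
import random
from typing import Dict, List


def build_record(system_prompt: str, pair: Dict[str, str]) -> Dict:
    """Wrap one user/assistant pair in a conversational 'messages' record."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": pair["user"]},
            {"role": "assistant", "content": pair["assistant"]},
        ]
    }


def generate(template: Dict, count: int, seed: int = 0) -> List[Dict]:
    """Sample `count` records; each identity example is drawn
    `identity_weight` times as often as each general example."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    identity = template["identity_examples"]
    general = template["general_examples"]
    weight = template.get("identity_weight", 4)  # assumed default inside 3-5x
    pool = identity + general
    weights = [weight] * len(identity) + [1] * len(general)
    chosen = rng.choices(pool, weights=weights, k=count)
    return [build_record(template["system_prompt"], pair) for pair in chosen]


def to_jsonl(records: List[Dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    # Hypothetical template; in practice this would be loaded from YAML/JSON.
    demo_template = {
        "system_prompt": "You are Ada, a patient and direct assistant.",
        "identity_weight": 4,
        "identity_examples": [
            {"user": "Who are you?", "assistant": "I'm Ada."},
        ],
        "general_examples": [
            {"user": "What is 2 + 2?", "assistant": "4."},
        ],
    }
    print(to_jsonl(generate(demo_template, count=3)), end="")
```

Because sampling is random, the 3-5x ratio holds only on average over a large `count`; a unit test would likely check the ratio on a few thousand samples with a fixed seed.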

Bounty

50 RTC — Paid on merge to main.

How to Claim

Comment on this issue to claim it. Submit a PR referencing this issue.

Metadata

Assignees

No one assigned

Labels

bounty (RTC bounty available), enhancement (New feature or request), training (Training and fine-tuning)
