Dataset hashing: RFC 6962 Merkle tree with domain separation #5

@joelteply

Summary

Implement deterministic dataset hashing for attestation.

Specification (from ATTESTATION.md)

  1. List all files in dataset directory
  2. Sort filenames lexicographically
  3. Leaf: SHA-256(0x00 || file_contents) (domain prefix)
  4. Internal node: SHA-256(0x01 || left || right) (domain prefix)
  5. Odd node count at any level: promote the last unpaired node unchanged to the next level (do not duplicate it)
  6. Record root hash + number of items evaluated
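The steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the function names (`leaf_hash`, `node_hash`, `merkle_root`, `hash_dataset`) are hypothetical, filenames are assumed to sort by their UTF-8 bytes, only the top level of the directory is listed, the "number of items" is assumed to be the file count, and an empty dataset is rejected because the steps above do not define that case.

```python
import hashlib
from pathlib import Path


def leaf_hash(data: bytes) -> bytes:
    # Step 3: leaf = SHA-256(0x00 || file_contents)
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Step 4: internal node = SHA-256(0x01 || left || right)
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        # Assumption: the spec steps do not cover an empty dataset.
        raise ValueError("cannot hash an empty dataset")
    level = list(leaves)
    while len(level) > 1:
        # Pair adjacent nodes left to right.
        nxt = [node_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        # Step 5: an unpaired last node is promoted unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def hash_dataset(directory: str) -> tuple[str, int]:
    # Steps 1-2: list files, then sort by filename.
    # Assumption: top-level files only, sorted by UTF-8 bytes of the name.
    files = sorted(
        (p for p in Path(directory).iterdir() if p.is_file()),
        key=lambda p: p.name.encode("utf-8"),
    )
    leaves = [leaf_hash(p.read_bytes()) for p in files]
    # Step 6: record the root hash and the item count
    # (assumed here to be the number of files).
    return merkle_root(leaves).hex(), len(files)
```

Calling `hash_dataset("path/to/subset")` returns the hex root and the count. Whatever sort order and directory-traversal rules the Rust implementation settles on, the Python and TypeScript versions would need to match them byte for byte, since any difference changes the root.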

Why domain separation

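Without the 0x00/0x01 prefixes (the RFC 6962 pattern), leaves and internal nodes are hashed the same way, so a two-file dataset could produce the same root as a single file whose contents are the concatenation of the two leaf hashes. With the prefixes, a leaf hash and an internal-node hash are always computed over different inputs.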
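The ambiguity can be reproduced in a few lines of Python. The file contents below are made-up placeholders and `sha256` is a local helper, not part of any existing codebase; the sketch only contrasts an unprefixed scheme with the prefixed one.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


file_a, file_b = b"question set A", b"question set B"

# Unprefixed scheme: leaf = SHA-256(contents), node = SHA-256(left || right).
naive_two_file_root = sha256(sha256(file_a) + sha256(file_b))
# A single forged file whose contents are the two leaf hashes concatenated.
forged_file = sha256(file_a) + sha256(file_b)
naive_one_file_root = sha256(forged_file)
collides_without_prefix = naive_two_file_root == naive_one_file_root

# Prefixed scheme: leaf = SHA-256(0x00 || contents),
# node = SHA-256(0x01 || left || right).
leaf_a = sha256(b"\x00" + file_a)
leaf_b = sha256(b"\x00" + file_b)
safe_two_file_root = sha256(b"\x01" + leaf_a + leaf_b)
# The same forgery attempt now gets hashed as a leaf, with the 0x00 prefix.
safe_one_file_root = sha256(b"\x00" + leaf_a + leaf_b)
collides_with_prefix = safe_two_file_root == safe_one_file_root

print(collides_without_prefix, collides_with_prefix)
```

In the unprefixed case both roots are SHA-256 over the identical 64-byte input, so they are equal by construction. In the prefixed case the two inputs differ in their first byte, so equal roots would require a SHA-256 collision.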

Subset coverage

The hash covers ONLY the subset used. For example, if the evaluation uses an MMLU-Pro subset of MMLU, hash the subset's files, not the full dataset.

Implementation needed in

  • Rust (source of truth)
  • Python (sentinel-ai uses this during forge)
  • TypeScript (verification)
