Summary
Implement deterministic dataset hashing for attestation.
Specification (from ATTESTATION.md)
- List all files in dataset directory
- Sort filenames lexicographically
- Leaf: SHA-256(0x00 || file_contents) (domain prefix)
- Internal node: SHA-256(0x01 || left || right) (domain prefix)
- Odd number of nodes at a level: promote the last unpaired node to the next level
- Record root hash + number of items evaluated
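The steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation. The function names are hypothetical, and the sketch assumes several things the list does not state: the directory is flat (no recursion), filenames are ordered by their UTF-8 bytes, a promoted node moves up unchanged, an empty directory is an error, and the recorded item count is the number of files hashed.

```python
import hashlib
from pathlib import Path

LEAF_PREFIX = b"\x00"  # domain prefix for leaves
NODE_PREFIX = b"\x01"  # domain prefix for internal nodes


def leaf_hash(contents: bytes) -> bytes:
    """SHA-256(0x00 || file_contents)."""
    return hashlib.sha256(LEAF_PREFIX + contents).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """SHA-256(0x01 || left || right)."""
    return hashlib.sha256(NODE_PREFIX + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Reduce leaf hashes to one root, pairing left to right."""
    if not leaves:
        # Assumption: the spec does not define a root for zero files.
        raise ValueError("cannot hash an empty dataset")
    level = list(leaves)
    while len(level) > 1:
        # Hash adjacent pairs (0,1), (2,3), ...
        next_level = [
            node_hash(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            # Assumption: the unpaired last node is promoted unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]


def hash_dataset(directory: str) -> tuple[str, int]:
    """Return (root hash as hex, number of files hashed)."""
    # Assumption: flat directory, regular files only, sorted by UTF-8 bytes.
    files = sorted(
        (p for p in Path(directory).iterdir() if p.is_file()),
        key=lambda p: p.name.encode("utf-8"),
    )
    leaves = [leaf_hash(p.read_bytes()) for p in files]
    return merkle_root(leaves).hex(), len(leaves)
```

Calling `hash_dataset("path/to/subset")` would return the pair to record in the attestation. Sorting on raw bytes rather than locale-aware strings is one way to keep the ordering identical across the Rust, Python, and TypeScript ports, but the spec would need to confirm the exact ordering rule.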
Why domain separation
Without the 0x00/0x01 prefixes (the RFC 6962 pattern), leaves and internal nodes would be hashed the same way, so a two-file dataset would produce the same root as a single file whose contents are the concatenation of the two leaf hashes.
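The collision can be reproduced with a small script. This is a sketch for illustration only: `toy_root` is a hypothetical helper that handles just one or two files, and the file contents are made up.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def toy_root(files: list[bytes], domain_separated: bool) -> bytes:
    """Root of a one- or two-file dataset, with or without prefixes."""
    leaf_prefix = b"\x00" if domain_separated else b""
    node_prefix = b"\x01" if domain_separated else b""
    leaves = [sha256(leaf_prefix + contents) for contents in files]
    if len(leaves) == 1:
        return leaves[0]
    left, right = leaves  # this toy helper supports at most two files
    return sha256(node_prefix + left + right)


file_a, file_b = b"alpha", b"beta"

# Without prefixes: a single forged file made of the two leaf hashes
# hashes to exactly the same value as the two-file root.
forged_plain = sha256(file_a) + sha256(file_b)
collides_without_prefix = (
    toy_root([file_a, file_b], domain_separated=False)
    == toy_root([forged_plain], domain_separated=False)
)

# With prefixes: the same trick fails, because the forged file is hashed
# as a leaf (0x00) while the real root is hashed as a node (0x01).
forged_prefixed = sha256(b"\x00" + file_a) + sha256(b"\x00" + file_b)
collides_with_prefix = (
    toy_root([file_a, file_b], domain_separated=True)
    == toy_root([forged_prefixed], domain_separated=True)
)

print(collides_without_prefix, collides_with_prefix)
```

The first comparison holds by construction, since both sides compute SHA-256 over identical bytes. The second compares SHA-256 over inputs that differ in their first byte, so a match would require a SHA-256 collision.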
Subset coverage
The hash covers ONLY the subset used. For example, if the evaluation uses the MMLU-Pro subset of MMLU, hash the subset files, not the full dataset.
Implementation needed in
- Rust (source of truth)
- Python (sentinel-ai uses this during forge)
- TypeScript (verification)