
feat: enhance vocabulary inspector with dump functionality #58

Open
yxjiang wants to merge 1 commit into main from basics-multi-query-attention

yxjiang (Member) commented Oct 26, 2025

Summary

This PR adds a vocabulary dump option to the vocabulary inspector tool and removes an unused test file.

Changes

✨ New Features

  • Vocabulary Dump: Added --dump-vocab option to vocab_inspect.py that exports complete vocabulary to JSON
  • Rich Metadata: Dumped vocabulary includes token metadata (special tokens, punctuation, digits, etc.)
  • Example Script: Added example_vocab_dump.py demonstrating usage patterns
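For reviewers who want to see how an optional dump flag can sit next to the existing arguments, here is a minimal sketch using argparse. Only the flag names --model-path and --dump-vocab come from this PR; the function name build_parser, the help strings, and the default of None are assumptions for illustration and may not match what vocab_inspect.py actually does.

```python
import argparse


def build_parser():
    """Hypothetical parser; not the actual vocab_inspect.py implementation."""
    parser = argparse.ArgumentParser(
        description="Inspect a tokenizer vocabulary (illustrative sketch)."
    )
    parser.add_argument(
        "--model-path",
        required=True,
        help="Model name or local path of the tokenizer to inspect.",
    )
    # Optional flag: when omitted, the value stays None and no dump is written,
    # so existing invocations keep behaving as before.
    parser.add_argument(
        "--dump-vocab",
        metavar="OUTPUT_JSON",
        default=None,
        help="If given, write the complete vocabulary to this JSON file.",
    )
    return parser


args = build_parser().parse_args(
    ["--model-path", "Qwen/Qwen2-0.5B-Instruct", "--dump-vocab", "qwen_vocab.json"]
)
print(args.model_path, args.dump_vocab)
```

Keeping the flag optional with a None default is one simple way to preserve backward compatibility: code paths that do not check args.dump_vocab are unaffected.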

🧹 Cleanup

  • Removed modeling/basics/test_attention.py (unused test file)

Usage Examples

# Dump vocabulary to JSON
python vocab_inspect.py --model-path Qwen/Qwen2-0.5B-Instruct --dump-vocab qwen_vocab.json

# Run example demonstrations
python example_vocab_dump.py

JSON Output Structure

The dumped vocabulary includes:

  • Model metadata (path, vocab size)
  • Special token mappings
  • Complete token list with metadata:
    • Token ID and string
    • Character length
    • Token type flags (special, punctuation, digit, etc.)
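To make the structure above concrete, the following sketch builds a dump of that general shape from a toy vocabulary. All key names (model_path, vocab_size, special_tokens, tokens, is_special, and so on), the helper functions, and the classification rules are assumptions chosen for illustration; the actual keys and rules used by vocab_inspect.py may differ.

```python
import json
import string


def token_record(token_id, token, special_set):
    """Build one per-token record; field names are illustrative assumptions."""
    return {
        "id": token_id,
        "token": token,
        "length": len(token),  # character length of the token string
        "is_special": token in special_set,
        # Treat a token as punctuation only if every character is ASCII punctuation.
        "is_punctuation": bool(token) and all(ch in string.punctuation for ch in token),
        "is_digit": token.isdigit(),
    }


def build_vocab_dump(model_path, vocab, special_tokens):
    """vocab maps token string -> id; special_tokens maps role name -> token string."""
    special_set = set(special_tokens.values())
    ordered = sorted(vocab.items(), key=lambda item: item[1])  # order by token ID
    return {
        "model_path": model_path,
        "vocab_size": len(vocab),
        "special_tokens": special_tokens,
        "tokens": [token_record(i, t, special_set) for t, i in ordered],
    }


# Toy vocabulary standing in for a real tokenizer's vocabulary.
toy_vocab = {"<|endoftext|>": 0, "hello": 1, "!": 2, "42": 3}
dump = build_vocab_dump("toy-model", toy_vocab, {"eos_token": "<|endoftext|>"})
print(json.dumps(dump, indent=2, ensure_ascii=False))
```

Passing ensure_ascii=False keeps non-ASCII token strings readable in the output file instead of escaping them, which tends to matter for multilingual vocabularies.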

Testing

  • ✅ Vocabulary dump functionality works with various model types
  • ✅ JSON output is properly formatted and complete
  • ✅ Example script demonstrates usage patterns
  • ✅ Backward compatibility maintained for existing functionality

- Add vocabulary dump feature to vocab_inspect.py
- Add example_vocab_dump.py demonstration script
- Remove test_attention.py (cleanup)
- Support dumping complete vocabulary to JSON with metadata
