Graphtage is a semantic diff/merge utility for tree-like structured data formats (JSON, XML, HTML, YAML, plist, CSS, CSV). It works both as a command-line tool and as a Python library.
Key capabilities:
- Semantic understanding of tree structures (recognizes key vs value changes)
- Cross-format diffing (e.g., JSON vs YAML with output in any format)
- Extensible architecture for custom node types and file formats
- HTML output support for visual diffs
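
To make the "key vs value" distinction concrete, here is a minimal sketch in plain Python. It does not use Graphtage's API; `diff_dicts` and its operation names are hypothetical, and the rename heuristic (a removed key whose value reappears under a new key) is an assumption chosen for illustration, far simpler than Graphtage's actual matching.

```python
from typing import Any, Dict, List, Optional, Tuple

Pair = Tuple[str, Any]
ToyEdit = Tuple[str, Optional[Pair], Optional[Pair]]


def diff_dicts(old: Dict[str, Any], new: Dict[str, Any]) -> List[ToyEdit]:
    """Toy diff of two flat dicts: returns (operation, before, after) tuples."""
    edits: List[ToyEdit] = []
    removed = [k for k in old if k not in new]
    inserted = [k for k in new if k not in old]

    # Keys present on both sides are either unchanged or a value change.
    for key in old:
        if key in new:
            op = "match" if old[key] == new[key] else "replace_value"
            edits.append((op, (key, old[key]), (key, new[key])))

    # A removed key whose value reappears under a new key is treated as a rename.
    for r in list(removed):
        partner = next((i for i in inserted if new[i] == old[r]), None)
        if partner is not None:
            edits.append(("rename_key", (r, old[r]), (partner, new[partner])))
            removed.remove(r)
            inserted.remove(partner)

    # Whatever is left over is a plain removal or insertion.
    edits.extend(("remove", (k, old[k]), None) for k in removed)
    edits.extend(("insert", None, (k, new[k])) for k in inserted)
    return edits


old = {"name": "graphtage", "size": 1, "colour": "red", "tmp": True}
new = {"name": "graphtage", "size": 2, "color": "red", "extra": None}
for op, before, after in diff_dicts(old, new):
    print(op, before, after)
```

A line-based diff would report `colour` to `color` as an unrelated delete and add; a tree-aware diff can report it as a change to the key alone.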
Core abstractions:
- TreeNode: Protocol for tree node implementations; all nodes must implement it
- Edit: Protocol for edit operations with cost bounds
- GraphtageFormatter: Base formatter for printing nodes and edits
Node types:
- LeafNode: Terminal nodes (strings, numbers, booleans, null)
- ListNode: Ordered sequences
- DictNode: Key-value mappings
- KeyValuePairNode: Individual dict entries
Edit types:
- Match: No change needed
- Replace: Substitute one value for another
- Insert: Add new element
- Remove: Delete element
- CompoundEdit: Multiple edits grouped together
Algorithm modules:
- matching.py: Bipartite matching for optimal node correspondences
- levenshtein.py: String edit distance with Unicode combining marks
- search.py: Iterative tightening search for edit cost optimization
- bounds.py: Cost range calculations (Range class)
- fibonacci.py: Fibonacci search for optimization
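
As a rough picture of what a string edit distance computes, the sketch below implements the textbook Levenshtein dynamic program. It is not the code in levenshtein.py: it treats every code point as one character and makes no attempt at the combining-mark handling mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and substitutions
    needed to turn `a` into `b` (classic two-row dynamic program)."""
    # previous[j] holds the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows brings memory down to O(len(b)) while the running time stays O(len(a) * len(b)).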
Each format implements its own TreeNode subclasses and parser:
- json.py, yaml.py, xml.py, csv.py, plist.py, pickle.py
# Install with dev dependencies
pip install -e .[dev]
# Or just the package
pip install graphtage
pytest # All tests
pytest test/test_graphtage.py # Specific module
pytest -q # Quiet output
# Ruff is configured in pyproject.toml
ruff check graphtage test
ruff check --fix graphtage test
# CI currently uses flake8
flake8 graphtage test --select=E9,F63,F7,F82
cd docs && make html
# Output in docs/_build/html/
To add a new file format:
- Create graphtage/newformat.py
- Define TreeNode subclasses for format-specific structures
- Implement a build_tree(content: str) -> TreeNode function
- Register the filetype in graphtage/__init__.py
- Add tests in test/test_newformat.py
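
The shape of a `build_tree` function can be sketched without Graphtage itself. Everything below is hypothetical: the `key = value` line format, and the `ToyLeaf` and `ToyDict` classes, which stand in for real TreeNode subclasses. A real format module would build Graphtage's own node types and register its filetype, and neither step is shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class ToyLeaf:
    """Stand-in for a terminal node holding a string or integer."""
    value: Union[str, int]


@dataclass
class ToyDict:
    """Stand-in for a key-value mapping node."""
    children: Dict[str, ToyLeaf] = field(default_factory=dict)


def build_tree(content: str) -> ToyDict:
    """Parse a hypothetical `key = value` format into a toy tree."""
    root = ToyDict()
    for line_number, line in enumerate(content.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, raw = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_number}: expected key = value, got {line!r}")
        raw = raw.strip()
        try:
            value: Union[str, int] = int(raw)  # integers become numeric leaves
        except ValueError:
            value = raw  # everything else stays a string leaf
        root.children[key.strip()] = ToyLeaf(value)
    return root


tree = build_tree("# settings\nname = demo\nport = 8080\n")
print(tree)
```

Keeping the parser a single function from text to tree makes it straightforward to cover in test/test_newformat.py with small string inputs.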
Working with edits:
- Edit costs are computed lazily via the bounds() method
- Use has_non_zero_cost() to check whether an edit represents a change
- initial_bounds stores the first computed bounds for optimization
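
The lazy-cost pattern described above can be illustrated with a small stand-alone sketch. `ToyRange`, `LazyEdit`, and `narrow` are hypothetical names, and treating an upper bound above zero as "non-zero cost" is an assumption made here; only `bounds()`, `has_non_zero_cost()`, and `initial_bounds` come from the notes above, and their real implementations may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRange:
    """Closed cost interval [lower, upper]."""
    lower: int
    upper: int


class LazyEdit:
    """Toy edit whose cost range is computed only when first requested."""

    def __init__(self, compute: Callable[[], ToyRange]):
        self._compute = compute
        self._bounds: Optional[ToyRange] = None
        self.initial_bounds: Optional[ToyRange] = None

    def bounds(self) -> ToyRange:
        if self._bounds is None:
            # First access: run the (possibly expensive) computation once
            # and remember the result as the initial bounds.
            self._bounds = self._compute()
            self.initial_bounds = self._bounds
        return self._bounds

    def narrow(self, lower: int, upper: int) -> None:
        """Tighten the current range; initial_bounds is left untouched."""
        current = self.bounds()
        self._bounds = ToyRange(max(current.lower, lower), min(current.upper, upper))

    def has_non_zero_cost(self) -> bool:
        # Assumption: an edit may represent a change if its cost can exceed zero.
        return self.bounds().upper > 0


calls = []
edit = LazyEdit(lambda: calls.append("computed") or ToyRange(0, 5))
print(calls)                     # nothing computed yet
print(edit.has_non_zero_cost())  # triggers the computation
edit.narrow(2, 3)
print(edit.initial_bounds, edit.bounds())
```

Retaining the first range separately lets later code compare how far a search has tightened the estimate without recomputing it.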
The printing system is extensible and resolves how to print an edit in this order:
- Check for a specialized formatter for the edit type
- Fall back to the edit's print() method
- Fall back to the node's print() method
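
The fallback chain can be sketched as a three-step dispatch. All class names below are hypothetical, and the dictionary of specialized handlers is an assumption made for brevity; how GraphtageFormatter actually locates a specialized formatter is not described here.

```python
from typing import Callable, Dict, List


class DemoNode:
    def __init__(self, value):
        self.value = value

    def print(self, out: List[str]) -> None:
        out.append(f"node:{self.value}")


class DemoEdit:
    def __init__(self, node: DemoNode):
        self.node = node


class DemoMatch(DemoEdit):
    """Defines no print() of its own, so the node's print() is used."""


class DemoInsert(DemoEdit):
    """Also has no print(); the demo registers a specialized handler for it."""


class DemoReplace(DemoEdit):
    def __init__(self, node: DemoNode, new_value):
        super().__init__(node)
        self.new_value = new_value

    def print(self, out: List[str]) -> None:
        out.append(f"replace:{self.node.value}->{self.new_value}")


class DemoFormatter:
    def __init__(self):
        self.out: List[str] = []
        # Hypothetical registry mapping an edit type to a specialized handler.
        self.specialized: Dict[type, Callable[[DemoEdit, List[str]], None]] = {}

    def print_edit(self, edit: DemoEdit) -> None:
        handler = self.specialized.get(type(edit))
        if handler is not None:
            handler(edit, self.out)   # 1. specialized formatter for the edit type
        elif hasattr(edit, "print"):
            edit.print(self.out)      # 2. the edit's own print() method
        else:
            edit.node.print(self.out) # 3. the node's print() method


fmt = DemoFormatter()
fmt.specialized[DemoInsert] = lambda edit, out: out.append(f"insert:{edit.node.value}")
for e in (DemoInsert(DemoNode(1)), DemoReplace(DemoNode(1), 2), DemoMatch(DemoNode(3))):
    fmt.print_edit(e)
print(fmt.out)
```

With this ordering a new output format only needs handlers for the edits it wants to render differently; everything else falls through to the default behavior.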
Coding conventions:
- Line length: 120 characters (configured in ruff)
- Python version: 3.8+ compatibility required
- Type hints: Use typing_extensions for Protocol support
- Docstrings: Google style for public APIs
- Tests: Mirror package structure in test/ directory
# Basic diff
graphtage original.json modified.json
# Cross-format diff
graphtage file.json file.yaml --format yaml
# Condensed output
graphtage -j original.json modified.json
# Show only edits
graphtage -e original.json modified.json
# HTML output
graphtage --html original.json modified.json > diff.html
Testing:
- Test files are in the test/ directory
- Use the test_*.py naming convention
- Tests are organized by module (test_matching.py tests matching.py)
- Performance tests in timing.py (not run by default)