feat(sbom): symlink-aware SBOM filesystem graph (fs_tree) #459
Open: willis89pr wants to merge 212 commits into main from fstree
- Add `fs_tree: nx.DiGraph` to `SBOM`, excluded from JSON serialization
- Populate `fs_tree` in SBOM constructor via `_add_software_to_fs_tree`, splitting each `installPath` into parent–child edges and tagging leaf nodes with `software_uuid`
- Introduce `SBOM._record_symlink(link, target, subtype)` to record symlink edges in both:
- the main relationship graph (`MultiDiGraph`) with `type="symlink"`
- the filesystem graph (`fs_tree`) with `type="symlink"` and optional `subtype` ("file" or "directory")
- Enhance `add_software_entries()` to scan each `installPath` and its immediate children for symlinks, invoking `_record_symlink` for both file- and directory-level symlinks
- Update `generate.py` to inject filename- and install-path symlinks into each `Software` entry before adding to SBOM, so they’re captured by `add_software_entries()`
- Refactor `elf_relationship` plugin to:
- Prefer `fs_tree`-based `get_software_by_path()` lookups for ELF dependencies
- Fall back to legacy `installPath` matching, then a directory-based symlink heuristic
- Emit detailed `logger.debug()` statements (via Loguru) indicating which resolution path was used
- Improve docstrings around RPATH/RUNPATH, DST substitution, and relationship phases
- Expand DST-handling helpers (`generate_search_paths`, `generate_runpaths`, `substitute_all_dst`) with clearer comments, normalization, and debug traces
- Update `.NET` relationship plugin to use `get_software_by_path` for absolute imports and cleaned-up probing logic
- Add comprehensive unit tests:
- `tests/sbomtypes/test_fs_tree.py` to verify `fs_tree` population and `get_software_by_path`
- `tests/relationships/test_elf_relationship.py` covering absolute, relative, system, origin, RPATH, and symlink heuristics
- Minor cleanup: prevent `fs_tree` from being serialized and remove unused whitespace
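The `fs_tree` population and symlink lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it swaps plain dictionaries in for `nx.DiGraph` so it runs without NetworkX, and the class `FsTreeSketch` and its method bodies are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional, Set


class FsTreeSketch:
    """Toy stand-in for fs_tree; plain dicts replace nx.DiGraph (assumption)."""

    def __init__(self) -> None:
        self.children: Dict[str, Set[str]] = {}  # parent path -> child paths
        self.symlinks: Dict[str, str] = {}       # link path -> target path
        self.software_uuid: Dict[str, str] = {}  # leaf path -> software UUID

    def add_software(self, install_path: str, uuid: str) -> None:
        """Split install_path into parent-child edges and tag the leaf node."""
        path = PurePosixPath(install_path)
        # path.parents runs leaf-to-root; reverse it so edges go root-to-leaf.
        chain = [p.as_posix() for p in reversed(path.parents)] + [path.as_posix()]
        for parent, child in zip(chain, chain[1:]):
            self.children.setdefault(parent, set()).add(child)
        self.software_uuid[path.as_posix()] = uuid

    def record_symlink(self, link: str, target: str) -> None:
        """Record a file-level symlink edge from link to target."""
        self.symlinks[link] = target

    def get_software_by_path(self, path: str) -> Optional[str]:
        """Follow symlinks until a tagged node is found or a cycle is hit."""
        seen: Set[str] = set()
        while path not in self.software_uuid and path in self.symlinks:
            if path in seen:
                return None
            seen.add(path)
            path = self.symlinks[path]
        return self.software_uuid.get(path)
```

In this sketch, looking up an alias such as `/opt/alt/lib/libalias.so` resolves to the UUID tagged on its target, which is the behavior the relationship plugins rely on when they prefer path lookups over string matching.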
for more information, see https://pre-commit.ci
Add `# pylint: disable=redefined-outer-name` to the top of:
- `tests/relationships/test_elf_relationship.py`
- `tests/sbomtypes/test_fs_tree.py`
This silences warnings about pytest fixtures shadowing outer-scope names.
- Documented `_add_software_to_fs_tree` method with explanation of behavior, arguments, and side effects
- Enhanced safety: ensure final install path node exists before tagging
- Normalized install paths to POSIX format for consistency
- Added type hints for clarity
- No logic changes to other methods; only added minor inline comments and spacing
…ationship
- Introduced `normalize_path` utility in `surfactant.utils.paths` to standardize path handling across components.
- Replaced all raw `PurePosixPath` and `PureWindowsPath` calls with `normalize_path` in:
  - `SBOM` class (`_sbom.py`): install path processing, software lookup, and symlink handling.
  - `dotnet_relationship.py`: resolving absolute paths for dependency resolution.
- Added new utility module `utils.paths` and test suite `test_paths.py` to verify path normalization behavior across various cases.
- Removed redundant single-argument shortcut that bypassed normalization.
- Updated `normalize_path()` to explicitly replace backslashes in all path parts.
- Ensures consistent POSIX-style output for inputs like "C:\\Program Files\\App".
- Fixes test failures caused by improper handling of Windows-style paths.
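A minimal sketch of the normalization behavior this commit describes, assuming that backslashes are replaced in every part before joining. The name `normalize_path_sketch` is hypothetical and the body is an illustration, not the code in `surfactant.utils.paths`.

```python
from pathlib import PurePosixPath


def normalize_path_sketch(*parts: str) -> str:
    """Join parts and return a POSIX-style string (hypothetical helper)."""
    # Replace backslashes in every part first, so Windows-style input
    # such as "C:\\Program Files\\App" splits on the same separator.
    cleaned = [str(part).replace("\\", "/") for part in parts]
    return PurePosixPath(*cleaned).as_posix()
```

Because every part goes through the same replacement, there is no single-argument shortcut that could skip normalization, which is the failure mode the commit message mentions.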
…tion
- Replaced all manual `.as_posix()` conversions with `normalize_path(...)` to ensure consistent POSIX-style lookup keys.
- Normalized candidate paths used in `sbom.get_software_by_path()` during .NET relationship resolution.
- Updated codeBase path resolution to use structured path objects instead of prematurely stringifying.
- Refactored `get_dotnet_probedirs()` to normalize all output paths and avoid path handling inconsistencies.
- Added docstring to `get_dotnet_probedirs()` for clarity.
Fixes failing .NET relationship tests caused by inconsistent path formats in `installPath` vs lookup paths.
✅ No SBOM Changes Detected
For commit 2dca48d (Run 20787774601)
… heuristic test
- In `example_sbom` fixture, record a symlink from `/opt/alt/lib/libalias.so` to `/opt/alt/lib/libreal.so` for `sw8` to exercise the symlink handling logic
- Add a new parametrized test `test_symlink_heuristic_match_edge` that clears existing fs_tree entries and verifies that the heuristic correctly matches symlinked dependencies when no direct matches exist
…d suppress pylint protected-access warning
…ionally removing fs_tree edge and node
Updated `test_symlink_heuristic_match_edge` to defensively check for the existence of the symlink edge and node in `fs_tree` before attempting to remove them. This avoids `KeyError` raised by NetworkX when the edge does not exist, ensuring the test remains stable even if the graph structure changes upstream.
Improves test resilience and correctness by explicitly targeting the intended symlink edge (`/opt/alt/lib/libalias.so` → `/opt/alt/lib/libreal.so`).
…k and logging
- Updated `get_windows_pe_dependencies()` to use a modern three-phase resolution strategy:
  1. Primary: Exact path match using `sbom.get_software_by_path()` (fs_tree)
  2. Secondary: Legacy string-based matching on `installPath` and `fileName`
  3. Tertiary: Heuristic fallback using shared directories and `fileName` match
- Replaced `find_installed_software()` usage with normalized path lookups.
- Introduced detailed Loguru `logger.debug()` logging to trace each match attempt and outcome.
- Enhanced `establish_relationships()` with structured import phase handling and debug output.
- Improved `has_required_fields()` using a cleaner `any(...)` check with docstring and type hint.
- Added full docstrings to clarify purpose and logic for maintainability.
These changes bring PE relationship handling in line with ELF and .NET plugins, ensuring consistency, improved symlink resolution, and better match accuracy across Windows-style paths.
- Introduced test suite for `pe_relationship.py` covering:
  - Primary resolution via `fs_tree` using `get_software_by_path()`
  - Legacy fallback using `installPath` + `fileName` matching
  - Heuristic fallback using same-directory + filename pattern
  - Negative test case for unmatched DLLs
  - Unit test for `has_required_fields()` utility function
- Includes thorough docstrings and inline comments for clarity and maintainability.
- Ensures consistent behavior with ELF/.NET plugin resolution logic.
File added: tests/relationships/test_pe_relationship.py
…d tests
- Replaced legacy class-based resolution with dynamic 3-phase import matching:
  1. Exact path resolution via sbom.get_software_by_path() (fs_tree)
  2. Legacy fallback via installPath + fileName match
  3. Heuristic fallback via shared directory and filename
- Removed static _ExportDict and global class-to-UUID mapping
- Added detailed logging and comments for maintainability
- Introduced helper `class_to_path()` for FQCN to class file path
test:
- Added pytest suite covering all resolution phases:
  - fs_tree match
  - legacy installPath fallback
  - heuristic directory-based fallback
  - negative case with no match
New file: tests/relationships/test_java_relationship.py
This change removes the symlink-aware directory heuristic from ELF dependency resolution. The heuristic could incorrectly match dependencies when a software entry shared a directory with a candidate search path but did not actually contain the required file. This resulted in false positives such as resolving "somedir/abc" to an entry that only provided "somedir/def".
With the heuristic removed, dependency resolution now relies exclusively on:
1. fs_tree-based lookup (symlink-aware, hash-aware, evidence-based)
2. strict legacy installPath matching (filename + exact installPath)
The docstring has been rewritten to reflect the two-phase model, and detailed inline comments have been added to explain ELF path interpretation, candidate path generation, fs_tree behavior, and the legacy fallback rules.
This change improves correctness and prevents inferred relationships that are not supported by the filesystem snapshot.
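The two-phase model can be illustrated with a small sketch. Everything here is a simplified assumption: `SoftwareSketch`, `resolve_elf_dependency`, and the `path_index` dictionary (standing in for the fs_tree lookup) are hypothetical names, not the plugin's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SoftwareSketch:
    """Hypothetical, pared-down stand-in for a Software entry."""
    uuid: str
    file_names: List[str] = field(default_factory=list)
    install_paths: List[str] = field(default_factory=list)


def resolve_elf_dependency(
    candidates: List[str],
    path_index: Dict[str, str],
    software: List[SoftwareSketch],
) -> Optional[Tuple[str, str]]:
    """Return (uuid, phase) for the first candidate path that resolves."""
    # Phase 1: exact path lookup (path_index stands in for fs_tree).
    for candidate in candidates:
        if candidate in path_index:
            return path_index[candidate], "fs_tree"
    # Phase 2: strict legacy match; the entry must list both the file
    # name and the exact install path, so sharing a directory is not enough.
    for candidate in candidates:
        name = candidate.rsplit("/", 1)[-1]
        for entry in software:
            if name in entry.file_names and candidate in entry.install_paths:
                return entry.uuid, "legacy"
    return None
```

With only an entry for `/somedir/def` present, a candidate of `/somedir/abc` returns `None` in this sketch, which mirrors the false positive the removed heuristic used to produce.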
The ELF relationship resolver no longer includes the Phase 3 heuristic
(same-directory + filename fallback). Three test blocks depended on that
behavior (`symlink_heuristic_sbom`, `test_symlink_heuristic_match`, and
`test_symlink_heuristic_match_edge`) and have been removed.
The remaining `example_sbom` fixture has been updated to mirror the actual
SBOM construction pipeline. In particular:
• All software entries are added before recording symlinks, ensuring that
fs_tree contains the target file nodes.
• The symlink `/opt/alt/lib/libalias.so → libreal.so` is recorded only
after software insertion.
• `expand_pending_file_symlinks()` is called so the alias path becomes
visible to fs_tree traversal, matching generate.py behavior.
These adjustments ensure that the “symlink” test case resolves correctly
through fs_tree rather than relying on the removed heuristic. The full test
suite now reflects the intended two-phase ELF resolution model.
…mments
Phase 3 of PE dependency resolution (the symlink-aware heuristic) has been removed. This logic is now redundant due to improvements in SBOM generation, including fs_tree integration, symlink injection, and hash-equivalence mapping. The remaining two phases (direct fs_tree resolution and legacy installPath-based matching) fully cover the intended behavior.
The docstring has been rewritten to describe the updated two-phase approach, clarify the scope of static DLL resolution, and better document Windows loader semantics. Inline comments were added to Phase 1 to explain how probed directories are derived and how fs_tree resolution leverages symlink and hash-equivalence metadata.
This simplifies the resolver while maintaining correctness and improving maintainability.
…matching
This change updates PE dependency resolution to correctly emulate
Windows filesystem behavior and eliminate a known false-positive case
in legacy matching.
Phase 1 now invokes get_software_by_path() with case_insensitive=True,
enabling Windows-style case-insensitive basename matching in the fs_tree
lookup.
Phase 2 has been updated to:
• normalize DLL names and installPath basenames using casefold()
• require that fileName[] contains the DLL name (case-insensitive)
• match installPath entries only when both the parent directory is in
probedirs and the basename equals the DLL name
• use PureWindowsPath for Windows path handling
These changes preserve the intent of the historical
find_installed_software() logic while preventing the false-positive
scenario identified by reviewers.
The get_software_by_path() docstring was updated to document the new
optional case-insensitive fallback behavior.
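The Phase 2 rules listed above can be sketched as a single predicate. This is an illustration under assumptions: the function name `legacy_pe_match` and its flat argument list are hypothetical, and the real resolver works on SBOM `Software` entries rather than plain lists.

```python
from pathlib import PureWindowsPath
from typing import Iterable, List


def legacy_pe_match(
    dll_name: str,
    file_names: List[str],
    install_paths: List[str],
    probedirs: Iterable[str],
) -> bool:
    """Return True only if the entry really provides dll_name in a probed dir."""
    wanted = dll_name.casefold()
    # The entry must list the DLL name in fileName[] (case-insensitive).
    if wanted not in (name.casefold() for name in file_names):
        return False
    # PureWindowsPath compares case-insensitively, matching Windows semantics.
    probed = {PureWindowsPath(d) for d in probedirs}
    for install_path in install_paths:
        path = PureWindowsPath(install_path)
        # Both conditions are required: the parent directory is probed
        # AND the basename equals the imported DLL name.
        if path.parent in probed and path.name.casefold() == wanted:
            return True
    return False
```

Requiring the basename check is what rules out the false positive: an entry whose `fileName[]` mentions the DLL but whose install path in the probed directory has a different basename no longer matches.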
…tness
This update strengthens the PE relationship resolver test suite by adding two
targeted cases:
• test_pe_no_false_positive_mismatched_basename:
Ensures the resolver no longer produces a false positive when the
imported DLL name appears in fileName[] but the installPath in the
probed directory has a different basename.
• test_pe_case_insensitive_matching:
Verifies correct Windows-style case-insensitive matching between the
imported DLL name and installPath basename.
These tests validate recent fixes to the PE resolver and prevent regression.
In addition, the unused Phase-3 heuristic fallback was removed from the Java
resolver, as it duplicated behavior now handled by earlier phases.
This change restores full legacy behavior in relationship inference by
introducing an explicit export-dict–based Phase 2. The new phase mirrors
java_relationship_legacy, ensuring that imports are resolved via
javaExports whenever fs_tree (Phase 1) cannot locate the supplier.
Key updates:
- Add _ExportDict with create_export_dict() and get_supplier() helpers.
- Build the export dictionary once per process before relationship resolution.
- Replace old installPath+fileName fallback with legacy export-dict logic.
- Align resolution phases with documented behavior and improve robustness
via safer metadata checks.
This guarantees that the modern resolver does not regress relative to
legacy behavior, while still benefiting from fs_tree-based path-aware
resolution.
The former Phase 3 heuristic (fileName + shared directory fallback) was removed from the java_relationship implementation when Phase 2 was updated to exclusively use the legacy javaExports export-dict mechanism. As a result, the heuristic behavior is no longer valid or reachable. This commit deletes test_phase_3_heuristic_match, which asserted the old directory-based matching logic. Remaining tests continue to cover fs_tree resolution, legacy export-dict fallback, and no-match behavior.
…solution
This change cleans up unused code in the Java relationship resolver:
- Remove the now-unused `fname` variable from import resolution.
- Comment out the Phase 1 fs_tree/path-based lookup logic, effectively disabling it while retaining the code for future re-enablement.
- Leave Phase 2 (legacy export-dict fallback) as the sole active resolution mechanism.
These updates reflect the temporary shift to legacy-only behavior and eliminate dead code paths.
The legacy-style _ExportDict previously cached its contents across calls, using a `created` flag to avoid rebuilding. This caused stale export mappings to persist between SBOMs and across test runs, leading to incorrect relationship resolution. This change removes the `created` flag and forces create_export_dict() to rebuild the export→supplier map for each SBOM invocation. The update mirrors legacy behavior while ensuring correctness in multi-SBOM and pytest environments. No functional changes to export lookup; only state isolation is improved.
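A rough sketch of the state-isolation fix, assuming the export map is a simple class-name-to-UUID dictionary. `ExportDictSketch` and the shape of its `entries` argument are hypothetical; only the method names `create_export_dict()` and `get_supplier()` come from the commit messages.

```python
from typing import Dict, Iterable, Optional, Tuple


class ExportDictSketch:
    """Hypothetical stand-in for _ExportDict without a `created` flag."""

    def __init__(self) -> None:
        self.supplied_by: Dict[str, str] = {}  # exported class -> supplier UUID

    def create_export_dict(
        self, entries: Iterable[Tuple[str, Iterable[str]]]
    ) -> None:
        """Rebuild the map from (uuid, exported class names) pairs."""
        # Start from an empty dict on every call so mappings from a
        # previous SBOM cannot leak into the current one.
        self.supplied_by = {}
        for uuid, exports in entries:
            for class_name in exports:
                self.supplied_by[class_name] = uuid

    def get_supplier(self, class_name: str) -> Optional[str]:
        """Return the UUID of the entry exporting class_name, if any."""
        return self.supplied_by.get(class_name)
```

Rebuilding costs one pass over the entries per SBOM, which trades a small amount of repeated work for correctness across multiple SBOMs and test runs.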
nightlark reviewed Dec 11, 2025 (4 reviews)
…ph-first fallback
Ensure .NET unmanaged (dotnetImplMap) absolute-path imports preserve legacy PureWindowsPath equality semantics while still benefiting from the fs_tree when available. The resolver now:
- Attempts a graph-first lookup via sbom.get_software_by_path() to leverage fs_tree symlink resolution.
- Falls back to the legacy strict scan across Software.installPath entries when the graph lookup does not find a match.
- Preserves legacy behavior by skipping probing and filename variants for absolute paths.
- Avoids self-dependencies.
This keeps legacy outputs stable while safely incorporating graph-based resolution when present.
Update .NET relationship tests to match legacy behavior by removing Culture="neutral" from the basic fixture, placing the subdir DLL in a legacy-probed location, and disabling the version-mismatch filtering test (since version filtering isn’t enforced).
willis89pr (Collaborator, Author) commented Dec 30, 2025
I reverted the variable names back to legacy and made sure it followed legacy as close as possible.
Restore legacy behavior in ELF relationship resolution while keeping fs_tree as the primary lookup mechanism. Key changes:
- Ensure DT_NEEDED slash detection is performed on the raw dependency string, not a normalized path object.
- Normalize relative dependency paths after joining with installPath parents, matching legacy behavior.
- Use legacy filename filtering semantics during installPath fallback.
- Revert RUNPATH/RPATH selection logic to legacy precedence rules.
- Preserve literal RPATH/RUNPATH entries by returning paths without DST tokens unchanged in substitute_all_dst().
- Align docstrings with actual fs_tree and DST behavior.
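The rule that literal RPATH/RUNPATH entries pass through unchanged can be sketched as follows. This is a simplified, single-value illustration: `substitute_dst_sketch` is a hypothetical name, it handles only `$ORIGIN` and `$LIB`, and the fixed `lib64` default is an assumption; it does not reproduce `substitute_all_dst()`.

```python
def substitute_dst_sketch(path: str, origin: str, lib: str = "lib64") -> str:
    """Expand $ORIGIN and $LIB in one RPATH/RUNPATH entry (hypothetical helper)."""
    # Entries without any DST token are returned exactly as written.
    if "$" not in path:
        return path
    for token, value in (("ORIGIN", origin), ("LIB", lib)):
        # Handle the braced form first, then the bare form.
        path = path.replace("${" + token + "}", value)
        path = path.replace("$" + token, value)
    return path
```

Here `origin` would be the directory containing the ELF file being analyzed; returning untouched literals avoids normalizing away details that legacy matching depended on.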
Symlink-aware SBOM filesystem graph; fs_tree lookup across relationship plugins; path utils & tests
Summary
This PR makes relationship resolution symlink-aware and more accurate by introducing a first-class filesystem graph (`fs_tree`) inside the SBOM model and teaching the .NET/ELF/PE/Java plugins to resolve dependencies via exact path lookups before falling back to legacy heuristics. It also adds small ergonomics (path utils), targeted logging, safer error handling, and a comprehensive test suite.
Motivation
Encoding the install tree and symlink edges in a graph lets us: (1) resolve by canonical path, (2) follow links deterministically, and (3) avoid spurious edges.
What changed
1) SBOM model: new `fs_tree` and helper APIs
- Add `fs_tree: nx.DiGraph` tracking directory hierarchy and symlink edges (`type="symlink"`, optional `subtype="file|directory"`).
- New path and lookup helpers:
  - `_add_software_to_fs_tree()` builds path hierarchy and tags nodes with `software_uuid`.
  - `get_software_by_path()` normalizes paths and resolves entries via `fs_tree` with symlink traversal.
  - `get_symlink_sources_for_path()` performs reverse traversal to find all symlinks pointing to a given target.
  - `record_symlink()`, `_add_symlink_edge()`, and `expand_pending_dir_symlinks()` / `expand_pending_file_symlinks()` handle immediate and deferred symlink creation.
  - `record_hash_node()` and `get_hash_equivalents()` track content-equivalent files via SHA-256 nodes.
  - `inject_symlink_metadata()` regenerates legacy-style `fileNameSymlinks` and `installPathSymlinks` fields from the graph.
- Extend `add_software_entries` to merge duplicates, discover symlinks, attach `Contains` edges, and link identical hashes.
- Split graph builders: `build_rel_graph()` for logical relationships; `fs_tree` is kept separate; filter out `Path`/`symlink` edges from `to_dict_override()`.
- Added docstrings and safety checks across all new helpers.
2) `generate.py`: symlink capture during crawl
- … `Software` entries before adding them.
- … `inject_symlink_metadata()`.
3) Relationship plugins
.NET (`dotnet_relationship.py`):
- … `normalize_path`.
- … `sbom.get_software_by_path`, handle `app.config` (`probing.privatePath`, `<codeBase href=...>`), unmanaged imports via `dotnetImplMap`.
ELF (`elf_relationship.py`):
- … `DF_1_NODEFLIB`.
- … `$ORIGIN` and `$LIB`.
- … `"[ELF][final] {dependent} Uses {name} → UUID={match} [phase]"` or `"... → no match"`.
- … `DF_1_NODEFLIB` check and explicitly logs default search paths when the flag is not set.
- … `print` with `logger.debug` for expanded runpaths.
PE (`pe_relationship.py`):
Java (`java_relationship.py`):
4) Merge and graph hygiene
- … `Path` nodes in root computation, merges, and relationship output.
5) New path utilities
- `surfactant/utils/paths.py`
  - `normalize_path(*parts) → str` ensures consistent POSIX normalization across Windows/Unix.
  - `basename_posix(path) → str`.
6) Tests
.NET:
- … `fs_tree.add_node()`.
- … `test_dotnet_culture_subdir` docstring (filtering only).
- … `test_dotnet_heuristic_match` for Phase 3.
ELF:
- … `_record_symlink` to public API; expanded docstrings for clarity.
- … (`uuid-4-consumer`) to exercise system-path fallback; fixture also demonstrates alias mapping for `libalias.so`.
- … `test_symlink_heuristic_match_edge` to force heuristic after clearing direct symlink edge.
PE:
Java:
Added `tests/utils/test_paths.py` for `normalize_path`.
Added `tests/sbomtypes/test_fs_tree.py` to validate `fs_tree` population and lookup.
Risk & compatibility
- `fs_tree` is internal (non-serialized).
Performance considerations
Logging & DX
- `logger.debug` traces for resolution phases.
SBOM model note (post-deserialization)
- … `__post_init__` so `fs_tree` lookups work consistently across load/merge workflows. This scans each `installPath` (and directory children) to re-register symlinks and hashes into both graphs.
Test plan
pytest -q tests/relationships/test_dotnet_relationship.py
pytest -q tests/relationships/test_elf_relationship.py
pytest -q tests/relationships/test_java_relationship.py
pytest -q tests/relationships/test_pe_relationship.py
pytest -q tests/utils/test_paths.py
pytest -q tests/sbomtypes/test_fs_tree.py
pytest -q  # full suite