@willis89pr willis89pr commented Jul 17, 2025

Symlink-aware SBOM filesystem graph; fs_tree lookup across relationship plugins; path utils & tests

Summary

This PR makes relationship resolution symlink-aware and more accurate by introducing a first-class filesystem graph (fs_tree) inside the SBOM model and teaching the .NET/ELF/PE/Java plugins to resolve dependencies via exact path lookups before falling back to legacy heuristics. It also adds path utilities, targeted debug logging, safer error handling, and a comprehensive test suite.

Motivation

Encoding the install tree and symlink edges in a graph lets us: (1) resolve by canonical path, (2) follow links deterministically, and (3) avoid spurious edges.

What changed

1) SBOM model: new fs_tree and helper APIs

  • Add fs_tree: nx.DiGraph tracking directory hierarchy and symlink edges (type="symlink", optional subtype="file|directory").

  • New path and lookup helpers:

    • _add_software_to_fs_tree() builds path hierarchy and tags nodes with software_uuid.
    • get_software_by_path() normalizes paths and resolves entries via fs_tree with symlink traversal.
    • get_symlink_sources_for_path() performs reverse traversal to find all symlinks pointing to a given target.
    • record_symlink(), _add_symlink_edge(), and expand_pending_dir_symlinks() / expand_pending_file_symlinks() handle immediate and deferred symlink creation.
    • record_hash_node() and get_hash_equivalents() track content-equivalent files via SHA-256 nodes.
    • inject_symlink_metadata() regenerates legacy-style fileNameSymlinks and installPathSymlinks fields from the graph.
  • Extend add_software_entries to merge duplicates, discover symlinks, attach Contains edges, and link identical hashes.

  • Split graph builders: build_rel_graph() for logical relationships; fs_tree is kept separate; filter out Path/symlink edges from to_dict_override().

  • Added docstrings and safety checks across all new helpers.
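For reviewers who want a mental model of the fs_tree helpers listed above, here is a minimal sketch. It is not the code in this PR: it uses plain dictionaries instead of nx.DiGraph to stay dependency-free, the class name FsTreeSketch is hypothetical, and it only handles symlinks recorded for a full path (no directory-symlink expansion, no hash equivalence).

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Set


class FsTreeSketch:
    """Toy stand-in for fs_tree: hierarchy edges plus symlink edges (hypothetical)."""

    def __init__(self) -> None:
        self.children: Dict[str, Set[str]] = {}  # parent dir -> child paths
        self.symlinks: Dict[str, str] = {}       # link path -> target path
        self.software_uuid: Dict[str, str] = {}  # tagged node -> software UUID

    def add_software(self, install_path: str, uuid: str) -> None:
        """Create one node per path component and tag the leaf with the UUID."""
        path = PurePosixPath(install_path)
        chain = [path, *path.parents]  # leaf first, root last
        for child, parent in zip(chain, chain[1:]):
            self.children.setdefault(str(parent), set()).add(str(child))
        self.software_uuid[str(path)] = uuid

    def record_symlink(self, link: str, target: str) -> None:
        """Record a symlink edge from link to target."""
        self.symlinks[link] = target

    def get_software_by_path(self, path: str) -> Optional[str]:
        """Follow symlink edges, with a cycle guard, until a tagged node is hit."""
        seen: Set[str] = set()
        current = path
        while current not in seen:
            seen.add(current)
            if current in self.software_uuid:
                return self.software_uuid[current]
            if current not in self.symlinks:
                return None
            current = self.symlinks[current]
        return None  # symlink cycle

    def get_symlink_sources_for_path(self, target: str) -> Set[str]:
        """Reverse traversal: every link that reaches target, directly or not."""
        found: Set[str] = set()
        frontier: List[str] = [target]
        while frontier:
            node = frontier.pop()
            for link, dest in self.symlinks.items():
                if dest == node and link not in found:
                    found.add(link)
                    frontier.append(link)
        return found
```

With a chain such as libfoo.so → libfoo.so.1 → libfoo.so.1.2, looking up either link returns the UUID tagged on the real file, and the reverse traversal from the real file returns both links.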

2) generate.py: symlink capture during crawl

  • Inject filename- and install-path symlinks into Software entries before adding them.
  • Record hash-based equivalence edges for symlink pairs.
  • After crawl, expand deferred symlinks and inject legacy-style symlink metadata using inject_symlink_metadata().
  • Debug logs added; broad exceptions tightened.

3) Relationship plugins

  • .NET (dotnet_relationship.py):

    • Unified path normalization with normalize_path.
    • Resolve via sbom.get_software_by_path, handle app.config (probing.privatePath, <codeBase href=...>), unmanaged imports via dotnetImplMap.
    • Added culture/version filtering; clarified that Culture is used for filtering, not probing subdirs.
    • Fixed probe directory construction to normalize consistently.
  • ELF (elf_relationship.py):

    • Introduced structured docstrings and debug logs.
    • RPATH/RUNPATH resolution respects DF_1_NODEFLIB.
    • Expanded DST substitution with $ORIGIN and $LIB.
    • Phased resolution: fs_tree → legacy → heuristic.
    • New: standardized final-emission debug logs and clarified when no match is found, e.g. "[ELF][final] {dependent} Uses {name} → UUID={match} [phase]" or "... → no match".
    • New: condensed DF_1_NODEFLIB check and explicitly logs default search paths when the flag is not set.
    • New: replaced ad-hoc print with logger.debug for expanded runpaths.
  • PE (pe_relationship.py):

    • Uses fs_tree → legacy → same-directory match (Phase 2/3).
    • Clarified test case docstring and renamed misleading “symlink heuristic” test to same-directory match.
  • Java (java_relationship.py):

    • Corrected Phase 1 test to align importer path with fs_tree lookups.
    • Corrected Phase 3 heuristic test to require same parent dir + filename, with Phase 2 failing.
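For the ELF bullets above, the interaction between RPATH, RUNPATH, and DF_1_NODEFLIB can be sketched as follows. This follows the search order documented in the ld.so(8) manual page (DT_RPATH is used only when DT_RUNPATH is absent; default directories are skipped when the object was linked with -z nodeflib), not necessarily the exact logic in elf_relationship.py. The function name and the DEFAULT_DIRS list are hypothetical, and LD_LIBRARY_PATH and the ld.so cache are left out because they are runtime state.

```python
from typing import List, Sequence

DF_1_NODEFLIB = 0x800  # flag value from <elf.h>

# Illustrative defaults only; real systems add lib64 and multiarch directories.
DEFAULT_DIRS = ["/lib", "/usr/lib"]


def search_dirs_sketch(
    rpath: Sequence[str], runpath: Sequence[str], flags_1: int = 0
) -> List[str]:
    """Return candidate directories in the order a loader would try them."""
    dirs: List[str] = []
    if runpath:
        # DT_RUNPATH present: DT_RPATH is ignored.
        dirs.extend(runpath)
    else:
        dirs.extend(rpath)
    if not flags_1 & DF_1_NODEFLIB:
        # Default directories are searched only when NODEFLIB is not set.
        dirs.extend(DEFAULT_DIRS)
    return dirs
```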

4) Merge and graph hygiene

  • Skip Path nodes in root computation, merges, and relationship output.

5) New path utilities

  • surfactant/utils/paths.py

    • normalize_path(*parts) → str ensures consistent POSIX normalization across Windows/Unix.
    • basename_posix(path) → str.

6) Tests

  • .NET:

    • Removed redundant manual fs_tree.add_node().
    • Clarified test_dotnet_culture_subdir docstring (filtering only).
    • Added test_dotnet_heuristic_match for Phase 3.
  • ELF:

    • Switched from private _record_symlink to public API; expanded docstrings for clarity.
    • Updated example fixture: includes an explicit consumer (uuid-4-consumer) to exercise system-path fallback; fixture also demonstrates alias mapping for libalias.so.
    • Added test_symlink_heuristic_match_edge to force heuristic after clearing direct symlink edge.
  • PE:

    • Cleaned up commented fs_tree code.
    • Renamed and clarified same-directory test; added explicit path normalization assert.
  • Java:

    • Corrected Phase 1 test to actually hit fs_tree resolution.
    • Corrected Phase 3 test to properly require same-dir heuristic.
  • Added tests/utils/test_paths.py for normalize_path.

  • Added tests/sbomtypes/test_fs_tree.py to validate fs_tree population and lookup.

Risk & compatibility

  • Behavioral improvements only: JSON outputs exclude path/symlink edges.
  • fs_tree is internal (non-serialized).

Performance considerations

  • Linear path-tree construction, O(1) symlink recording.
  • Faster lookups due to direct node queries; legacy/heuristic only if needed.

Logging & DX

  • Concise logger.debug traces for resolution phases.
  • Failures to record symlinks warn, not error.

SBOM model note (post-deserialization)

  • New: after constructing from existing data, SBOM now rebuilds symlink and hash edges during __post_init__ so fs_tree lookups work consistently across load/merge workflows. This scans each installPath (and directory children) to re-register symlinks and hashes into both graphs.

Test plan

pytest -q tests/relationships/test_dotnet_relationship.py
pytest -q tests/relationships/test_elf_relationship.py
pytest -q tests/relationships/test_java_relationship.py
pytest -q tests/relationships/test_pe_relationship.py
pytest -q tests/utils/test_paths.py
pytest -q tests/sbomtypes/test_fs_tree.py
pytest -q  # full suite

willis89pr and others added 14 commits July 17, 2025 14:35
- Add `fs_tree: nx.DiGraph` to `SBOM`, excluded from JSON serialization
- Populate `fs_tree` in SBOM constructor via `_add_software_to_fs_tree`, splitting each `installPath` into parent–child edges and tagging leaf nodes with `software_uuid`
- Introduce `SBOM._record_symlink(link, target, subtype)` to record symlink edges in both:
  - the main relationship graph (`MultiDiGraph`) with `type="symlink"`
  - the filesystem graph (`fs_tree`) with `type="symlink"` and optional `subtype` ("file" or "directory")
- Enhance `add_software_entries()` to scan each `installPath` and its immediate children for symlinks, invoking `_record_symlink` for both file- and directory–level symlinks
- Update `generate.py` to inject filename- and install-path symlinks into each `Software` entry before adding to SBOM, so they’re captured by `add_software_entries()`
- Refactor `elf_relationship` plugin to:
  - Prefer `fs_tree`–based `get_software_by_path()` lookups for ELF dependencies
  - Fall back to legacy `installPath` matching, then a directory-based symlink heuristic
  - Emit detailed `logger.debug()` statements (via Loguru) indicating which resolution path was used
  - Improve docstrings around RPATH/RUNPATH, DST substitution, and relationship phases
- Expand DST-handling helpers (`generate_search_paths`, `generate_runpaths`, `substitute_all_dst`) with clearer comments, normalization, and debug traces
- Update `.NET` relationship plugin to use `get_software_by_path` for absolute imports and cleaned-up probing logic
- Add comprehensive unit tests:
  - `tests/sbomtypes/test_fs_tree.py` to verify `fs_tree` population and `get_software_by_path`
  - `tests/relationships/test_elf_relationship.py` covering absolute, relative, system, origin, RPATH, and symlink heuristics
- Minor cleanup: prevent `fs_tree` from being serialized and remove unused whitespace
Add “# pylint: disable=redefined-outer-name” to the top of:
- tests/relationships/test_elf_relationship.py
- tests/sbomtypes/test_fs_tree.py

This silences warnings about pytest fixtures shadowing outer-scope names.
- Documented _add_software_to_fs_tree method with explanation of behavior, arguments, and side effects
- Enhanced safety: ensure final install path node exists before tagging
- Normalized install paths to POSIX format for consistency
- Added type hints for clarity
- No logic changes to other methods; only added minor inline comments and spacing
…ationship

- Introduced `normalize_path` utility in `surfactant.utils.paths` to standardize path handling across components.
- Replaced all raw `PurePosixPath` and `PureWindowsPath` calls with `normalize_path` in:
  - `SBOM` class (`_sbom.py`): install path processing, software lookup, and symlink handling.
  - `dotnet_relationship.py`: resolving absolute paths for dependency resolution.
- Added new utility module `utils.paths` and test suite `test_paths.py` to verify path normalization behavior across various cases.
- Removed redundant single-argument shortcut that bypassed normalization.
- Updated normalize_path() to explicitly replace backslashes in all path parts.
- Ensures consistent POSIX-style output for inputs like "C:\\Program Files\\App".
- Fixes test failures caused by improper handling of Windows-style paths.
…tion

- Replaced all manual `.as_posix()` conversions with `normalize_path(...)` to ensure consistent POSIX-style lookup keys.
- Normalized candidate paths used in `sbom.get_software_by_path()` during .NET relationship resolution.
- Updated codeBase path resolution to use structured path objects instead of prematurely stringifying.
- Refactored `get_dotnet_probedirs()` to normalize all output paths and avoid path handling inconsistencies.
- Added docstring to `get_dotnet_probedirs()` for clarity.

Fixes failing .NET relationship tests caused by inconsistent path formats in `installPath` vs lookup paths.


github-actions bot commented Jul 24, 2025

✅ No SBOM Changes Detected

For commit 2dca48d (Run 20787774601)
Compared against commit 465b000 (Run 20728878973)

willis89pr and others added 15 commits July 24, 2025 18:33
… heuristic test

- In `example_sbom` fixture, record a symlink from `/opt/alt/lib/libalias.so` to `/opt/alt/lib/libreal.so` for `sw8` to exercise the symlink handling logic
- Add a new parametrized test `test_symlink_heuristic_match_edge` that clears existing fs_tree entries and verifies that the heuristic correctly matches symlinked dependencies when no direct matches exist
…ionally removing fs_tree edge and node

Updated `test_symlink_heuristic_match_edge` to defensively check for the existence of the symlink edge and node
in `fs_tree` before attempting to remove them. This avoids `KeyError` raised by NetworkX when the edge does not exist,
ensuring the test remains stable even if the graph structure changes upstream.

Improves test resilience and correctness by explicitly targeting the intended symlink edge
(`/opt/alt/lib/libalias.so` → `/opt/alt/lib/libreal.so`).
…k and logging

- Updated `get_windows_pe_dependencies()` to use a modern three-phase resolution strategy:
  1. Primary: Exact path match using `sbom.get_software_by_path()` (fs_tree)
  2. Secondary: Legacy string-based matching on `installPath` and `fileName`
  3. Tertiary: Heuristic fallback using shared directories and `fileName` match

- Replaced `find_installed_software()` usage with normalized path lookups.
- Introduced detailed `logger.debug()` logging (via Loguru) to trace each match attempt and outcome.
- Enhanced `establish_relationships()` with structured import phase handling and debug output.
- Improved `has_required_fields()` using a cleaner `any(...)` check with docstring and type hint.
- Added full docstrings to clarify purpose and logic for maintainability.

These changes bring PE relationship handling in line with ELF and .NET plugins, ensuring consistency,
improved symlink resolution, and better match accuracy across Windows-style paths.
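
The phase ordering described in this commit can be sketched generically: try each lookup in order and report which phase produced the match. The helper name `resolve_in_phases` and the dictionary-backed lookups are hypothetical stand-ins for `sbom.get_software_by_path()` and the legacy/heuristic scans.

```python
from typing import Callable, Optional, Sequence, Tuple

# A phase is a label plus a lookup that returns a UUID or None.
Phase = Tuple[str, Callable[[str], Optional[str]]]


def resolve_in_phases(
    name: str, phases: Sequence[Phase]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (uuid, phase_label) for the first phase that finds a match."""
    for label, lookup in phases:
        match = lookup(name)
        if match is not None:
            return match, label
    return None, None


# Stand-in lookups: plain dictionaries instead of real SBOM queries.
fs_tree_hits = {"C:/App/a.dll": "uuid-a"}
legacy_hits = {"C:/App/b.dll": "uuid-b"}
phases = [
    ("fs_tree", fs_tree_hits.get),
    ("legacy", legacy_hits.get),
    ("heuristic", lambda _path: None),
]

print(resolve_in_phases("C:/App/b.dll", phases))  # ('uuid-b', 'legacy')
```

Returning the phase label alongside the match is what makes the per-phase debug logging straightforward.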
- Introduced test suite for `pe_relationship.py` covering:
  - Primary resolution via `fs_tree` using `get_software_by_path()`
  - Legacy fallback using `installPath` + `fileName` matching
  - Heuristic fallback using same-directory + filename pattern
  - Negative test case for unmatched DLLs
  - Unit test for `has_required_fields()` utility function

- Includes thorough docstrings and inline comments for clarity and maintainability.
- Ensures consistent behavior with ELF/.NET plugin resolution logic.

File added: tests/relationships/test_pe_relationship.py
…d tests

- Replaced legacy class-based resolution with dynamic 3-phase import matching:
  1. Exact path resolution via sbom.get_software_by_path() (fs_tree)
  2. Legacy fallback via installPath + fileName match
  3. Heuristic fallback via shared directory and filename

- Removed static _ExportDict and global class-to-UUID mapping
- Added detailed logging and comments for maintainability
- Introduced helper `class_to_path()` for FQCN to class file path

test:
- Added pytest suite covering all resolution phases:
  - fs_tree match
  - legacy installPath fallback
  - heuristic directory-based fallback
  - negative case with no match

New file: tests/relationships/test_java_relationship.py
pre-commit-ci bot and others added 7 commits December 9, 2025 16:03
This change removes the symlink-aware directory heuristic from ELF dependency
resolution. The heuristic could incorrectly match dependencies when a software
entry shared a directory with a candidate search path but did not actually
contain the required file. This resulted in false positives such as resolving
"somedir/abc" to an entry that only provided "somedir/def".

With the heuristic removed, dependency resolution now relies exclusively on:

  1. fs_tree-based lookup (symlink-aware, hash-aware, evidence-based)
  2. strict legacy installPath matching (filename + exact installPath)

The docstring has been rewritten to reflect the two-phase model, and detailed
inline comments have been added to explain ELF path interpretation, candidate
path generation, fs_tree behavior, and the legacy fallback rules.

This change improves correctness and prevents inferred relationships that are
not supported by the filesystem snapshot.
The ELF relationship resolver no longer includes the Phase 3 heuristic
(same-directory + filename fallback). Three test items depended on that
behavior (`symlink_heuristic_sbom`, `test_symlink_heuristic_match`, and
`test_symlink_heuristic_match_edge`) and have been removed.

The remaining `example_sbom` fixture has been updated to mirror the actual
SBOM construction pipeline. In particular:

  • All software entries are added before recording symlinks, ensuring that
    fs_tree contains the target file nodes.
  • The symlink `/opt/alt/lib/libalias.so → libreal.so` is recorded only
    after software insertion.
  • `expand_pending_file_symlinks()` is called so the alias path becomes
    visible to fs_tree traversal, matching generate.py behavior.

These adjustments ensure that the “symlink” test case resolves correctly
through fs_tree rather than relying on the removed heuristic. The full test
suite now reflects the intended two-phase ELF resolution model.
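
Why the ordering matters can be shown with a small sketch of deferred symlinks. The assumption here, made purely for illustration, is that a link whose target is not yet known is parked as pending and only becomes usable after an expansion pass; the class and method names are hypothetical and the real `expand_pending_file_symlinks()` may work differently.

```python
from typing import Dict, Optional, Set


class DeferredSymlinksSketch:
    """Hypothetical model of immediate vs. deferred symlink recording."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}    # install path -> software UUID
        self.links: Dict[str, str] = {}    # active link -> target
        self.pending: Dict[str, str] = {}  # links whose target is unknown so far

    def add_software(self, install_path: str, uuid: str) -> None:
        self.files[install_path] = uuid

    def _known(self, path: str) -> bool:
        return path in self.files or path in self.links

    def record_symlink(self, link: str, target: str) -> None:
        """Activate the link now if the target exists, otherwise defer it."""
        if self._known(target):
            self.links[link] = target
        else:
            self.pending[link] = target

    def expand_pending(self) -> int:
        """Activate pending links whose targets now exist; return the count."""
        activated = 0
        progress = True
        while progress:  # repeat so chains of pending links resolve too
            progress = False
            for link, target in list(self.pending.items()):
                if self._known(target):
                    self.links[link] = self.pending.pop(link)
                    activated += 1
                    progress = True
        return activated

    def lookup(self, path: str) -> Optional[str]:
        """Follow active links (with a cycle guard) and return the UUID, if any."""
        seen: Set[str] = set()
        while path in self.links and path not in seen:
            seen.add(path)
            path = self.links[path]
        return self.files.get(path)
```

In this model, a link recorded before its target software entry stays invisible to lookups until the expansion pass runs, which is the failure mode the fixture reordering avoids.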
…mments

Phase 3 of PE dependency resolution (the symlink-aware heuristic) has been
removed. This logic is now redundant due to improvements in SBOM generation,
including fs_tree integration, symlink injection, and hash-equivalence mapping.
The remaining two phases—direct fs_tree resolution and legacy installPath-based
matching—fully cover the intended behavior.

The docstring has been rewritten to describe the updated two-phase approach,
clarify the scope of static DLL resolution, and better document Windows loader
semantics. Inline comments were added to Phase 1 to explain how probed
directories are derived and how fs_tree resolution leverages symlink and
hash-equivalence metadata.

This simplifies the resolver while maintaining correctness and improving
maintainability.
…matching

This change updates PE dependency resolution to correctly emulate
Windows filesystem behavior and eliminate a known false-positive case
in legacy matching.

Phase 1 now invokes get_software_by_path() with case_insensitive=True,
enabling Windows-style case-insensitive basename matching in the fs_tree
lookup.

Phase 2 has been updated to:
  • normalize DLL names and installPath basenames using casefold()
  • require that fileName[] contains the DLL name (case-insensitive)
  • match installPath entries only when both the parent directory is in
    probedirs and the basename equals the DLL name
  • use PureWindowsPath for Windows path handling

These changes preserve the intent of the historical
find_installed_software() logic while preventing the false-positive
scenario identified by reviewers.

The get_software_by_path() docstring was updated to document the new
optional case-insensitive fallback behavior.
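
The Phase 2 rules listed above can be sketched as a single predicate. This is an illustration of the stated rules only (casefold comparison, fileName membership, parent directory in probedirs, basename equal to the DLL name); the function name and argument shapes are hypothetical, not the signature used in pe_relationship.py.

```python
from pathlib import PureWindowsPath
from typing import Iterable


def legacy_pe_match_sketch(
    dll_name: str,
    file_names: Iterable[str],
    install_paths: Iterable[str],
    probedirs: Iterable[str],
) -> bool:
    """Return True only when a probed installPath really is the imported DLL."""
    wanted = dll_name.casefold()
    # Rule 1: fileName[] must contain the DLL name, ignoring case.
    if wanted not in {name.casefold() for name in file_names}:
        return False
    # PureWindowsPath compares and hashes case-insensitively.
    probe_set = {PureWindowsPath(d) for d in probedirs}
    for raw in install_paths:
        candidate = PureWindowsPath(raw)
        # Rule 2: parent must be a probed directory AND basename must match.
        if candidate.parent in probe_set and candidate.name.casefold() == wanted:
            return True
    return False
```

The second rule is what blocks the false positive: an entry that lists foo.dll in fileName[] but is installed as bar.dll in the probed directory no longer matches.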
@willis89pr willis89pr requested a review from nightlark December 10, 2025 22:00
willis89pr and others added 8 commits December 10, 2025 22:07
…tness

This update strengthens the PE relationship resolver test suite by adding two
targeted cases:

  • test_pe_no_false_positive_mismatched_basename:
      Ensures the resolver no longer produces a false positive when the
      imported DLL name appears in fileName[] but the installPath in the
      probed directory has a different basename.

  • test_pe_case_insensitive_matching:
      Verifies correct Windows-style case-insensitive matching between the
      imported DLL name and installPath basename.

These tests validate recent fixes to the PE resolver and prevent regression.

In addition, the unused Phase-3 heuristic fallback was removed from the Java
resolver, as it duplicated behavior now handled by earlier phases.
This change restores full legacy behavior in relationship inference by
introducing an explicit export-dict–based Phase 2. The new phase mirrors
java_relationship_legacy, ensuring that imports are resolved via
javaExports whenever fs_tree (Phase 1) cannot locate the supplier.

Key updates:
- Add _ExportDict with create_export_dict() and get_supplier() helpers.
- Build the export dictionary once per process before relationship resolution.
- Replace old installPath+fileName fallback with legacy export-dict logic.
- Align resolution phases with documented behavior and improve robustness
  via safer metadata checks.

This guarantees that the modern resolver does not regress relative to
legacy behavior, while still benefiting from fs_tree-based path-aware
resolution.
The former Phase 3 heuristic (fileName + shared directory fallback) was
removed from the java_relationship implementation when Phase 2 was
updated to exclusively use the legacy javaExports export-dict mechanism.
As a result, the heuristic behavior is no longer valid or reachable.

This commit deletes test_phase_3_heuristic_match, which asserted the
old directory-based matching logic. Remaining tests continue to cover
fs_tree resolution, legacy export-dict fallback, and no-match behavior.
…solution

This change cleans up unused code in the Java relationship resolver:

- Remove the now-unused `fname` variable from import resolution.
- Comment out the Phase 1 fs_tree/path-based lookup logic, effectively
  disabling it while retaining the code for future re-enablement.
- Leave Phase 2 (legacy export-dict fallback) as the sole active
  resolution mechanism.

These updates reflect the temporary shift to legacy-only behavior and
eliminate dead code paths.
The legacy-style _ExportDict previously cached its contents across calls,
using a `created` flag to avoid rebuilding. This caused stale export
mappings to persist between SBOMs and across test runs, leading to
incorrect relationship resolution.

This change removes the `created` flag and forces create_export_dict()
to rebuild the export→supplier map for each SBOM invocation. The update
mirrors legacy behavior while ensuring correctness in multi-SBOM and
pytest environments.

No functional changes to export lookup; only state isolation is improved.
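
The state-isolation fix can be illustrated with a sketch that builds the export→supplier map from scratch on every call instead of caching it behind a flag. The entry shape (dictionaries with "UUID" and "javaExports" keys) and the function name are hypothetical simplifications of the real Software metadata.

```python
from typing import Any, Dict, Iterable, Mapping


def create_export_dict_sketch(entries: Iterable[Mapping[str, Any]]) -> Dict[str, str]:
    """Build a fresh export -> supplier UUID map for one SBOM."""
    exports: Dict[str, str] = {}
    for entry in entries:
        for name in entry.get("javaExports", []):
            # First supplier wins; later duplicates are ignored.
            exports.setdefault(name, entry["UUID"])
    return exports


sbom_a = [{"UUID": "uuid-a", "javaExports": ["com.example.Foo"]}]
sbom_b = [{"UUID": "uuid-b", "javaExports": ["com.example.Bar"]}]

first = create_export_dict_sketch(sbom_a)
second = create_export_dict_sketch(sbom_b)

# Nothing from sbom_a leaks into the map built for sbom_b.
print("com.example.Foo" in second)  # False
```

Because no module-level state survives between calls, consecutive SBOMs (or consecutive pytest cases) cannot see each other's exports.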
willis89pr and others added 5 commits December 30, 2025 19:58
…ph-first fallback

Ensure .NET unmanaged (dotnetImplMap) absolute-path imports preserve legacy
PureWindowsPath equality semantics while still benefiting from the fs_tree
when available.

The resolver now:
- Attempts a graph-first lookup via sbom.get_software_by_path() to leverage
  fs_tree symlink resolution.
- Falls back to the legacy strict scan across Software.installPath entries
  when the graph lookup does not find a match.
- Preserves legacy behavior by skipping probing and filename variants for
  absolute paths.
- Avoids self-dependencies.

This keeps legacy outputs stable while safely incorporating graph-based
resolution when present.
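
A minimal sketch of the graph-first, legacy-second flow described in this commit, assuming a lookup callable in place of `sbom.get_software_by_path()` and a simple list of (UUID, installPaths) pairs in place of the SBOM. All names are hypothetical. `PureWindowsPath` equality ignores case and separator style, which is the legacy semantics the commit refers to.

```python
from pathlib import PureWindowsPath
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def resolve_absolute_import_sketch(
    ref: str,
    dependent_uuid: str,
    graph_lookup: Callable[[str], Optional[str]],
    software: Iterable[Tuple[str, Sequence[str]]],
) -> List[str]:
    """Return supplier UUIDs for an absolute-path unmanaged import."""
    found: List[str] = []

    # Graph-first: use the fs_tree-style lookup when it finds something.
    hit = graph_lookup(ref)
    if hit is not None and hit != dependent_uuid:
        found.append(hit)

    # Legacy fallback: strict scan over installPath entries, no probing.
    if not found:
        wanted = PureWindowsPath(ref)
        for uuid, install_paths in software:
            if uuid == dependent_uuid:
                continue  # avoid self-dependencies
            if any(PureWindowsPath(path) == wanted for path in install_paths):
                found.append(uuid)
    return found
```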
Update .NET relationship tests to match legacy behavior by removing Culture="neutral" from the basic fixture, placing the subdir DLL in a legacy-probed location, and disabling the version-mismatch filtering test (since version filtering isn’t enforced).

I reverted the variable names back to legacy and made sure it followed legacy as close as possible.

@willis89pr willis89pr requested a review from nightlark December 30, 2025 20:54
willis89pr and others added 3 commits December 30, 2025 22:26
Restore legacy behavior in ELF relationship resolution while keeping fs_tree
as the primary lookup mechanism.

Key changes:
- Ensure DT_NEEDED slash detection is performed on the raw dependency string,
  not a normalized path object.
- Normalize relative dependency paths after joining with installPath parents,
  matching legacy behavior.
- Use legacy filename filtering semantics during installPath fallback.
- Revert RUNPATH/RPATH selection logic to legacy precedence rules.
- Preserve literal RPATH/RUNPATH entries by returning paths without DST tokens
  unchanged in substitute_all_dst().
- Align docstrings with actual fs_tree and DST behavior.
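
The last two points (literal entries returned unchanged, DST tokens expanded) can be sketched as below. This is a simplified illustration, not the body of `substitute_all_dst()`: the function name is hypothetical, only `$ORIGIN` and `$LIB` (with their `${...}` forms) are handled, and `$LIB` is expanded to both `lib` and `lib64` because the right value depends on the target platform.

```python
from typing import List, Sequence


def substitute_dst_sketch(
    path: str, origin: str, lib_dirs: Sequence[str] = ("lib", "lib64")
) -> List[str]:
    """Expand $ORIGIN and $LIB in one RPATH/RUNPATH entry."""
    if "$" not in path:
        # Literal entry with no DST tokens: return it unchanged.
        return [path]
    # Handle the braced form first so "${ORIGIN}" is not left half-replaced.
    expanded = path.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    if "$LIB" not in expanded and "${LIB}" not in expanded:
        return [expanded]
    # One candidate per possible $LIB value.
    return [
        expanded.replace("${LIB}", lib).replace("$LIB", lib) for lib in lib_dirs
    ]


print(substitute_dst_sketch("/opt/app/lib", "/opt/app/bin"))
# ['/opt/app/lib']
print(substitute_dst_sketch("$ORIGIN/../lib", "/opt/app/bin"))
# ['/opt/app/bin/../lib']
```

The expanded strings are left unnormalized here; collapsing `..` segments would be a separate normalization step.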

Labels

enhancement New feature or request