
Rssa large files #194

Merged
dodu94 merged 7 commits into developing from rssa-large-files
Feb 10, 2026

Conversation


@AlvaroCubi AlvaroCubi commented Feb 10, 2026

Adds functions to scan RSSA files instead of loading them into memory all at once, enabling work with very large files that do not fit into memory.

Summary by CodeRabbit

  • New Features

    • Added file persistence capabilities to save and reload RSSA data and parameters.
    • Introduced efficient track scanning functionality for streaming large datasets.
  • Refactor

    • Restructured RSSA with explicit parameters and tracks attributes for improved usability.
  • Tests

    • Added comprehensive tests for new persistence and track scanning workflows.


coderabbitai bot commented Feb 10, 2026

Walkthrough

This PR refactors RSSA from a path-based class to a dataclass with explicit public fields (parameters, tracks) and introduces IO methods (read_from_file, save_to_files, load_from_saved_files). FileParameters gains JSON serialization support. The rssa_reader is enhanced with streaming via scan_tracks, and parse_header/parse_tracks are refactored to accept file paths and return typed results.

Changes

Cohort / File(s) Summary
RSSA Dataclass Refactoring
src/f4enix/output/rssa/rssa.py
Converted class to dataclass with public fields (parameters: FileParameters, tracks: pl.DataFrame). Added read_from_file(), save_to_files(), and load_from_saved_files() methods for IO. Removed reliance on internal path attribute; get_summary now builds output without file path reference.
FileParameters Serialization
src/f4enix/output/rssa/rssa_helpers.py
Added save_to_json() and load_from_json() methods to FileParameters for JSON persistence. Serialization converts numpy-backed fields to Python basic types and handles nested SurfaceParameters reconstruction.
Reader Enhancement
src/f4enix/output/rssa/rssa_reader.py
Refactored parse_header() and parse_tracks() to accept file paths (Path | str) and return typed results (FileParameters, pl.DataFrame). Added BYTES_PER_TRACK constant and SCHEMA definition. Introduced scan_tracks() for streaming large files via Polars IO plugin with batch reading support.
Test Updates
tests/test_rssa.py
Updated fixture creation to use read_from_file() instead of direct path construction. Added tests for save_to_files/load_from_saved_files round-trip and scan_tracks workflow with header parsing and track streaming.
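
The batched, fixed-record-size reading that scan_tracks builds on can be sketched with the standard library alone. This is an illustrative sketch, not the f4enix implementation: the helper name, the batch size, and the synthetic file are all made up here; only the 96-byte record size comes from the BYTES_PER_TRACK constant described above.

```python
import io

BYTES_PER_TRACK = 96  # fixed record size used by the RSSA reader

def iter_track_batches(file_like, batch_size=1024):
    """Yield byte chunks, each holding up to batch_size complete records."""
    while True:
        chunk = file_like.read(BYTES_PER_TRACK * batch_size)
        if not chunk:
            break
        # Trim trailing bytes that do not form a complete record,
        # mirroring the defensive trim suggested later in the review.
        remainder = len(chunk) % BYTES_PER_TRACK
        if remainder:
            chunk = chunk[:-remainder]
        if chunk:
            yield chunk

# Hypothetical example: 10 full records plus 5 stray trailing bytes,
# streamed in batches of 4 records.
fake_file = io.BytesIO(bytes(BYTES_PER_TRACK * 10 + 5))
batch_sizes = [
    len(b) // BYTES_PER_TRACK
    for b in iter_track_batches(fake_file, batch_size=4)
]
# batch_sizes is [4, 4, 2]
```

Reading a bounded number of records per iteration is what keeps memory usage flat regardless of file size, which is the point of the streaming refactor.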

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • dodu94

Poem

🐰 A dataclass springs forth, with fields so clean and bright,
From path-bound paths we hop to JSON's crystalline light,
Tracks stream in batches, save and load with grace,
The RSSA hops forward to a brighter place! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title is vague and generic ('large files') and does not clearly convey the main technical changes. Resolution: consider a more descriptive title such as 'Add streaming support for large RSSA files' or 'Refactor RSSA to dataclass with streaming I/O capabilities'.
✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.





codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 95.89041% with 3 lines in your changes missing coverage. Please review.

Files with missing lines:
  • src/f4enix/output/rssa/rssa_reader.py — patch 92.30%, 3 lines missing ⚠️

Coverage changes:
  • src/f4enix/output/rssa/rssa.py — 100.00% <100.00%> (ø)
  • src/f4enix/output/rssa/rssa_helpers.py — 100.00% <100.00%> (ø)
  • src/f4enix/output/rssa/rssa_reader.py — 87.35% <92.30%> (+2.74%) ⬆️

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/f4enix/output/rssa/rssa.py (1)

107-108: ⚠️ Potential issue | 🟡 Minor

Minor: trailing comma in summary output.

Line 108 produces " photons: 0, " with a trailing comma and space. This looks like a formatting oversight.

📝 Suggested fix
-        summary += f" photons: {self.photon_tracks.shape[0]}, "
+        summary += f" photons: {self.photon_tracks.shape[0]}.\n"
src/f4enix/output/rssa/rssa_reader.py (1)

200-231: ⚠️ Potential issue | 🟡 Minor

The "b" column transformation loses semantic information for multi-digit photon indicators.

The formula abs(b) / 10^floor(log10(abs(b))) extracts only the leading single digit. Photon tracks documented as having indicator 16 are transformed to 1, not 16. While the binary filtering logic (b == 8 for neutrons, b != 8 for photons) still works correctly, the transformed b column no longer matches the documented particle-type encoding stated at line 37 of rssa.py.

This appears intentional for filtering efficiency, but the docstring is misleading: after transformation, photons become 1, not 16. Either clarify the documentation or preserve the original b value if the full encoded information is needed elsewhere.

Additionally, guard against b == 0 to prevent log10(0) from producing -inf.
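
The transformation under discussion can be reproduced in plain Python to see the collapse from 16 to 1 and why a zero guard is needed. This is an illustrative sketch (the helper name and the return value for b == 0 are choices made here, not part of the reviewed code):

```python
import math

def leading_digit(b: int) -> int:
    """Reduce a track-type indicator to its leading digit, following the
    reviewed formula abs(b) / 10**floor(log10(abs(b)))."""
    if b == 0:
        # Guard: math.log10(0) raises ValueError (and numpy's log10(0)
        # yields -inf). Returning 0 here is an illustrative choice.
        return 0
    magnitude = 10 ** math.floor(math.log10(abs(b)))
    return abs(b) // magnitude

# A photon indicator of 16 collapses to 1, while 8 (neutron) stays 8,
# which is why the b == 8 / b != 8 filtering still works after the
# transformation even though the documented encoding (16) is lost.
```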

🤖 Fix all issues with AI agents
In `@src/f4enix/output/rssa/rssa.py`:
- Around line 49-60: The docstring example for RSSA shows a filename in the
get_summary output but the current get_summary() implementation (in class RSSA)
is path-agnostic and emits "RSSA file was recorded..." without the filename;
update the example in the docstring (or adjust RSSA.get_summary) so they match:
either remove the filename from the example output or change RSSA.get_summary to
include a filename attribute when available (consult RSSA.read_from_file and the
RSSA instance attributes to include the original filename). Ensure the docstring
example and the get_summary() behavior are consistent and reference
RSSA.read_from_file, the RSSA class, and get_summary() for locating the code to
change.
- Line 1: The module docstring is stale and mentions "parsing of D1S-UNED
meshinfo files" while this file handles RSSA parsing; update the top-level
docstring in rssa.py to accurately describe the module's purpose (e.g., parsing
RSSA files, expected inputs/outputs, and any relevant classes or functions like
parse_rssa, RSSAParser or similar symbols present in the file) so it matches the
actual functionality.

In `@tests/test_rssa.py`:
- Around line 71-83: The test_scan_tracks_file currently calls .with_columns()
and .filter() with no arguments which will raise a TypeError; update the test to
either remove these no-arg calls or replace them with concrete Polars
expressions (e.g., .with_columns([...]) or .filter(some_predicate)) so
scanned_tracks remains a valid LazyFrame before .collect(); look for the
test_scan_tracks_file function and modify the chain using scan_tracks,
scanned_tracks, and RSSA to eliminate the empty .with_columns()/.filter() calls.
🧹 Nitpick comments (4)
src/f4enix/output/rssa/rssa_helpers.py (1)

66-81: Consider using @classmethod instead of @staticmethod for the alternate constructor.

load_from_json is a factory that returns a FileParameters instance — the idiomatic Python pattern for this is @classmethod so subclasses work correctly and the intent (alternate constructor) is clearer. Not blocking, but a nice improvement.

♻️ Suggested change
-    @staticmethod
-    def load_from_json(path: Path) -> "FileParameters":
+    @classmethod
+    def load_from_json(cls, path: Path) -> "FileParameters":
         """Loads the file parameters from a JSON file."""
         with open(path) as infile:
             data = json.load(infile)
             data["surfaces"] = [
                 SurfaceParameters(
                     id=surface["id"],
                     info=surface["info"],
                     type=surface["type"],
                     num_parameters=surface["num_parameters"],
                     parameters=surface["parameters"],
                 )
                 for surface in data["surfaces"]
             ]
-            return FileParameters(**data)
+            return cls(**data)
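
The subclassing argument can be shown with a minimal, self-contained sketch. Params and Surface below are hypothetical stand-ins for FileParameters and SurfaceParameters, reduced to two fields each:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Surface:
    id: int
    type: int

@dataclass
class Params:
    np1: int
    surfaces: list

    def save_to_json(self, path) -> None:
        with open(path, "w") as outfile:
            json.dump(asdict(self), outfile)

    @classmethod
    def load_from_json(cls, path) -> "Params":
        with open(path) as infile:
            data = json.load(infile)
        # Rebuild the nested dataclasses, then construct via cls so a
        # subclass calling load_from_json gets an instance of itself —
        # a hard-coded Params(**data) would always return the base class.
        data["surfaces"] = [Surface(**s) for s in data["surfaces"]]
        return cls(**data)

class ExtendedParams(Params):
    pass
```

With the `@staticmethod` version, `ExtendedParams.load_from_json(...)` would silently return a plain `Params`; with `cls(**data)` it returns an `ExtendedParams`.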
src/f4enix/output/rssa/rssa_reader.py (2)

200-210: Reshape at EOF may fail if the file has trailing bytes not aligned to 96.

When number_of_bytes_to_read > 0 and the file has fewer bytes remaining than requested, np.fromfile returns whatever is left. If that remainder isn't a multiple of 96, data.reshape(-1, 96) at line 209 raises a ValueError. For well-formed RSSA files this shouldn't occur, but a brief guard would improve robustness.

🛡️ Suggested defensive trim
     data = np.fromfile(file, BYTE, number_of_bytes_to_read)
 
+    # Trim any trailing bytes that don't form a complete track record
+    remainder = len(data) % 96
+    if remainder != 0:
+        data = data[:-remainder]
+
+    if len(data) == 0:
+        return pl.DataFrame(schema=SCHEMA)
+
     data = data.reshape(-1, 96)

45-51: parse_tracks re-parses the header just to skip it — minor inefficiency.

The header is parsed and discarded. This is fine for correctness, but callers who already have the header (e.g., after calling parse_header) pay the cost twice. Consider accepting an optional pre-parsed offset or combining both into a single call. Not blocking given the header is small.

src/f4enix/output/rssa/rssa.py (1)

65-70: read_from_file opens and parses the file header twice.

parse_header(path) opens the file and reads the header, then parse_tracks(path) opens the file again and re-parses the header just to skip it. For small headers this is negligible, but it's easy to avoid by reading header + tracks in a single open:

♻️ Suggested single-pass read
     @staticmethod
     def read_from_file(path: Path | str) -> "RSSA":
         """Loads the RSSA file from the given path as generated by MCNP."""
-        parameters = parse_header(path)
-        tracks = parse_tracks(path)
+        with open(path, "rb") as infile:
+            from f4enix.output.rssa.rssa_reader import _parse_header_binary, _parse_tracks_binary
+            parameters = _parse_header_binary(infile)
+            tracks = _parse_tracks_binary(infile)
         return RSSA(parameters, tracks)

Alternatively, the reader module could expose a single parse_file function that returns both in one pass.

@@ -1,79 +1,95 @@
"""This module is related to the parsing of D1S-UNED meshinfo files."""

⚠️ Potential issue | 🟡 Minor

Stale module docstring.

The docstring says "parsing of D1S-UNED meshinfo files" but this module handles RSSA files. Please update.

📝 Suggested fix
-"""This module is related to the parsing of D1S-UNED meshinfo files."""
+"""This module is related to the parsing and representation of RSSA files."""

Comment on lines +49 to +60
Examples
--------
>>> from f4enix.output.rssa import RSSA
... my_rssa = RSSA.read_from_file('small_cyl.w')
... print(my_rssa)
RSSA file small_cyl.w was recorded using the following surfaces:
Surface ID: 1, type: 1
The total number of tracks recorded is 72083.
Neutrons: 72083 photons: 0.
The simulation that produced this RSSA run 100000 histories.
The amount of independent histories that reached the RSSA surfaces was 70797.
"""

⚠️ Potential issue | 🟡 Minor

Docstring example output doesn't match get_summary() implementation.

The example on line 54 shows "RSSA file small_cyl.w was recorded..." but get_summary() at line 102 produces "RSSA file was recorded..." (no filename, since the class is now path-agnostic).

📝 Suggested fix for the example output
-    RSSA file small_cyl.w was recorded using the following surfaces:
+    RSSA file was recorded using the following surfaces:

Comment on lines +71 to +83
def test_scan_tracks_file(rssa):
    path = Path(RESOURCES.joinpath("small_cyl.w"))  # type: ignore
    file_parameters = parse_header(path)
    scanned_tracks = scan_tracks(path)
    scanned_tracks = (
        scanned_tracks.head(3)
        .with_columns()  # We can filter by columns
        .filter()  # We can apply any predicate to the LazyFrame
    )
    small_rssa = RSSA(file_parameters, scanned_tracks.collect())

    assert small_rssa.tracks.shape == (3, 11)
    assert rssa.tracks.head(3).equals(small_rssa.tracks)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Polars LazyFrame filter method no arguments behavior TypeError

💡 Result:

In Polars, LazyFrame.filter is not a “no-arg”/no-op method.

  • In older versions (e.g., docs for 0.18), the signature is filter(predicate: IntoExpr), so calling lf.filter() raises Python’s TypeError because the required predicate argument is missing. (docs.pola.rs)
  • In current stable/dev docs, the signature is filter(*predicates, **constraints), but it’s still defined as “filter rows … based on a predicate expression”, so calling it with no predicates/constraints is meaningless and you should avoid it. (docs.pola.rs)

What to do instead

  • If you meant “don’t filter”, just don’t call filter:
    • lf = lf if predicate is None else lf.filter(predicate)
  • If you need a deliberate no-op filter, pass a trivial predicate:


🌐 Web query:

Polars LazyFrame with_columns no arguments

💡 Result:

In Python Polars, LazyFrame.with_columns is defined with only optional varargs/kwargs:

LazyFrame.with_columns(*exprs, **named_exprs) -> LazyFrame

So calling it with no arguments is valid and is effectively a no-op (it returns a new LazyFrame representing “add zero columns”). [1]

If you’re conditionally building expressions, the common pattern is:

exprs: list[pl.Expr] = [...]

lf = lf.with_columns(exprs) if exprs else lf

(You can pass a list of expressions directly; Polars accepts that form too.) [1]

Sources
[1] Polars docs: polars.LazyFrame.with_columns signature and usage: (docs.pola.rs)


🌐 Web query:

Polars documentation LazyFrame filter required arguments

💡 Result:

In current Polars Python docs, LazyFrame.filter is defined as:

LazyFrame.filter(
    *predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray,
    **constraints: Any,
) -> LazyFrame

So there are no strictly required named parameters in the signature; instead you pass:

  • one or more positional predicates (boolean expression(s)), and/or
  • zero or more keyword constraints like colname=value (equivalent to pl.col("colname") == value, AND-ed together). [1][2]

In older docs (e.g., 0.18) the signature was:

LazyFrame.filter(predicate: IntoExpr) -> Self

where the single predicate argument was required. [3]

Sources: Polars LazyFrame.filter (stable/current) [1], (py-polars reference) [2], and version 0.18 docs [3].



Remove the no-arg calls to .with_columns() and .filter().

.with_columns() with no arguments is a valid no-op in Polars, but .filter() without a predicate expression will raise a TypeError. If the intent is to demonstrate these methods are available, use actual expressions or remove the calls entirely.

♻️ Suggested fix
     scanned_tracks = (
         scanned_tracks.head(3)
-        .with_columns()  # We can filter by columns
-        .filter()  # We can apply any predicate to the LazyFrame
     )

@dodu94 dodu94 merged commit 3cfeaa4 into developing Feb 10, 2026
8 checks passed
@dodu94 dodu94 deleted the rssa-large-files branch February 10, 2026 15:07
