
Conversation

@Copilot Copilot AI commented Sep 19, 2025

Addressed comment: fixed linting issue RUF017 by avoiding quadratic list summation.

Issue Identified

The linter RUF017 flagged a performance issue with quadratic list summation:

systems = sum([_process_single_system(system) for system in systems], [])

Solution Implemented

  • Replaced sum() with itertools.chain.from_iterable() for efficient list flattening
  • Added itertools import to support the new implementation
  • Maintained identical functionality while improving performance

Before/After Code

Before (quadratic complexity):

systems = sum([_process_single_system(system) for system in systems], [])

After (linear complexity):

systems = list(itertools.chain.from_iterable(
    _process_single_system(system) for system in systems
))
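As a quick standalone sanity check (using a hypothetical _process_single_system stub, not the PR's actual helper), both flattening forms produce identical output, while the chained version avoids repeatedly copying the accumulator:

import itertools

# Hypothetical stub standing in for the real helper: expands a
# multisystem HDF5 path into its contained systems, passes others through.
def _process_single_system(system):
    if system == "multi.h5":
        return [f"{system}#/sys1", f"{system}#/sys2"]
    return [system]

systems = ["a", "multi.h5", "b"]

# sum() copies the accumulator list on every addition: quadratic overall
quadratic = sum([_process_single_system(s) for s in systems], [])
# chain.from_iterable() walks each sublist once: linear overall
linear = list(itertools.chain.from_iterable(_process_single_system(s) for s in systems))

assert quadratic == linear
print(linear)  # ['a', 'multi.h5#/sys1', 'multi.h5#/sys2', 'b']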

Performance Benefits

  • Linear time complexity instead of quadratic for list flattening
  • Memory efficient - no intermediate concatenated lists are built
  • Generator-based - sublists are consumed lazily before the final list is built

Validation

  • ✅ All 22 tests pass with identical behavior
  • ✅ Functionality preserved across all scenarios
  • ✅ Performance improved for large lists of systems
  • ✅ Code follows linting best practices

This change addresses the format/linting issue while maintaining all existing functionality and improving performance for scenarios with many HDF5 files.



@Copilot Copilot AI changed the title [WIP] i would like to support a new feature to help load a list of hdf5 files as training/validation data. each of the hdf5 file stores a multisystem dataset, and the user may not specify the system in such files using xxxx.hdf5/#xxxx, they just use `xxx.h... feat(data): support HDF5 multisystem files in training data lists Sep 19, 2025
@Copilot Copilot AI requested a review from anyangml September 19, 2025 03:31
Copilot finished work on behalf of anyangml September 19, 2025 03:31
set_group = sys_group.create_group("set.000")

# Add minimal required data
natoms = 2

Check notice: Code scanning / CodeQL

Unused local variable (Note, test)

Variable natoms is not used.

Copilot Autofix (AI, 5 days ago)

To fix the "unused local variable" error flagged for natoms, the assignment natoms = 2 should be either removed altogether or, if it is intended solely for documentation, renamed to unused_natoms = 2 in accordance with the recommended naming conventions. However, since the code does not use its value at all and the number of atoms is already obvious from the data provided, the most straightforward fix while maintaining the existing code functionality is to delete the assignment line. This change is localized to line 41 of source/tests/common/test_process_systems.py. No new imports or definitions are needed. If the author wants to preserve the documentary intention, they can add a comment above the relevant code instead.

Suggested changeset 1
source/tests/common/test_process_systems.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/source/tests/common/test_process_systems.py b/source/tests/common/test_process_systems.py
--- a/source/tests/common/test_process_systems.py
+++ b/source/tests/common/test_process_systems.py
@@ -38,7 +38,6 @@
                 set_group = sys_group.create_group("set.000")
 
                 # Add minimal required data
-                natoms = 2
                 nframes = 1
 
                 coords = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
EOF
Unable to commit as this autofix suggestion is now outdated

# Add minimal required data
natoms = 2
nframes = 1

Check notice: Code scanning / CodeQL

Unused local variable (Note, test)

Variable nframes is not used.

Copilot Autofix (AI, 5 days ago)

To fix the problem, simply remove the assignment to nframes from line 42 in _create_hdf5_file in source/tests/common/test_process_systems.py. Since the right-hand side is a simple integer assignment with no side effects, the entire line can be deleted safely; no other changes or imports are required. The removal does not affect the rest of the logic, as the variable is not referenced elsewhere.

Suggested changeset 1
source/tests/common/test_process_systems.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/source/tests/common/test_process_systems.py b/source/tests/common/test_process_systems.py
--- a/source/tests/common/test_process_systems.py
+++ b/source/tests/common/test_process_systems.py
@@ -39,7 +39,6 @@
 
                 # Add minimal required data
                 natoms = 2
-                nframes = 1
 
                 coords = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
                 set_group.create_dataset("coord.npy", data=coords)
EOF
Unable to commit as this autofix suggestion is now outdated

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 59.42029% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.19%. Comparing base (417915a) to head (108393f).

Files with missing lines         Patch %   Lines
deepmd/utils/data_system.py       80.39%   10 Missing ⚠️
deepmd/pd/utils/dataloader.py      0.00%    9 Missing ⚠️
deepmd/pt/utils/dataloader.py      0.00%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #4984      +/-   ##
==========================================
- Coverage   84.21%   84.19%   -0.03%     
==========================================
  Files         705      705              
  Lines       69315    69378      +63     
  Branches     3576     3577       +1     
==========================================
+ Hits        58372    58411      +39     
- Misses       9803     9827      +24     
  Partials     1140     1140              

☔ View full report in Codecov by Sentry.

@anyangml (Collaborator)

@copilot address the comments and complete missing unit test coverage


Copilot AI commented Sep 19, 2025

@copilot address the comments and complete missing unit test coverage

I've addressed the comments and completed comprehensive unit test coverage for the HDF5 multisystem support feature.

What was added:

  1. Enhanced test_process_systems.py - 15 tests total (was 7):

    • Edge case testing for all helper functions (_is_hdf5_file, _is_hdf5_format, _is_hdf5_multisystem)
    • Error handling tests (corrupted files, read errors, graceful fallbacks)
    • Complex scenarios (mixed group types, empty files, non-HDF5 with .h5 extension)
  2. New test_hdf5_dataloader.py - 7 tests covering PyTorch dataloader:

    • Single-system vs multisystem HDF5 file handling
    • Integration with process_systems function
    • Error handling and import fallbacks
  3. New test_paddle_hdf5_dataloader.py - 4 tests for Paddle dataloader:

    • Parallel coverage for Paddle backend
    • Ensures consistent behavior across backends

Results: 26 total tests (22 passed, 4 skipped for optional dependencies). All critical paths, edge cases, and error scenarios are now comprehensively tested.

The commit includes full test coverage and addresses all missing unit test scenarios for the HDF5 multisystem feature.
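For reference, a minimal sketch of a multisystem test fixture like the ones these tests create, assuming h5py and numpy; the layout (type.raw plus a set.* group per system) is inferred from the snippets quoted later in this thread, and group names and shapes are illustrative:

import h5py
import numpy as np

def create_multisystem_h5(path):
    # Each system-like group carries type.raw and at least one set.* group
    with h5py.File(path, "w") as f:
        for name in ("sys0", "sys1"):
            sys_group = f.create_group(name)
            sys_group.create_dataset("type.raw", data=np.array([0, 0]))
            set_group = sys_group.create_group("set.000")
            coords = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]  # 1 frame, 2 atoms
            set_group.create_dataset("coord.npy", data=coords)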

@anyangml (Collaborator)

@copilot try again


Copilot AI commented Sep 19, 2025

@copilot try again

I've reorganized the test structure to follow project conventions. The issue was that dataloader tests should be organized by backend rather than all in the common directory.

Fixed in commit 3677558:

Test Organization:

  • Moved PyTorch HDF5 dataloader tests → source/tests/pt/test_hdf5_dataloader.py
  • Moved Paddle HDF5 dataloader tests → source/tests/pd/test_hdf5_dataloader.py
  • Kept test_process_systems.py in source/tests/common/ (appropriate for shared utility)

Now follows project patterns:

  • Backend-specific tests in respective directories (pt/, pd/, tf/)
  • Consistent with existing structure (e.g., source/tests/pt/test_dploaderset.py)
  • All 22 tests still pass with proper organization

The comprehensive test coverage is maintained with proper organization by framework.

Copilot finished work on behalf of anyangml September 19, 2025 05:46
Comment on lines 816 to 846
expanded_systems = []
for system in systems:
    # Check if this is an HDF5 file without explicit system specification
    if _is_hdf5_file(system) and "#" not in system:
        # Only expand if it's a multisystem HDF5 file
        if _is_hdf5_multisystem(system):
            # Expand HDF5 file to include all systems within it
            try:
                with h5py.File(system, "r") as file:
                    for key in file.keys():
                        if isinstance(file[key], h5py.Group):
                            # Check if this group looks like a system
                            group = file[key]
                            group_has_type = "type.raw" in group
                            group_has_sets = any(
                                subkey.startswith("set.")
                                for subkey in group.keys()
                            )
                            if group_has_type and group_has_sets:
                                expanded_systems.append(f"{system}#{key}")
            except OSError as e:
                log.warning(f"Could not read HDF5 file {system}: {e}")
                # If we can't read as HDF5, treat as regular system
                expanded_systems.append(system)
        else:
            # Single system HDF5 file, don't expand
            expanded_systems.append(system)
    else:
        # Regular system or HDF5 with explicit system specification
        expanded_systems.append(system)
systems = expanded_systems
Member

It seems that only one line will work:

systems = sum([process_systems(system) for system in systems], [])

(btw, I don't understand what patterns is according to the doc...)

Collaborator

@copilot address the comments

Contributor Author

I've simplified the code as suggested. The complex nested logic has been reduced to essentially one line:

Before: 30+ lines of nested if/else logic for list processing

After: systems = sum([_process_single_system(system) for system in systems], [])

Changes made:

  • Extracted _process_single_system() helper function to handle individual system processing
  • Simplified list case to the functional style you suggested (avoiding infinite recursion by using helper)
  • Fixed path format consistency: both string and list processing now use file.h5#/system format
  • All 22 tests pass with the cleaner implementation

The code is now much more maintainable and follows the functional programming style you suggested.
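For readers following along, a hedged reconstruction of what _process_single_system plausibly looks like, inferred from the expansion logic quoted above; the helper's actual body is not shown in this thread:

def _process_single_system(system: str) -> list[str]:
    # Sketch only; inferred from the quoted expansion logic, not the PR's exact code.
    if _is_hdf5_file(system) and "#" not in system and _is_hdf5_multisystem(system):
        with h5py.File(system, "r") as file:
            return [
                f"{system}#/{key}"
                for key in file.keys()
                if isinstance(file[key], h5py.Group)
                and "type.raw" in file[key]
                and any(sub.startswith("set.") for sub in file[key].keys())
            ]
    # Regular path, single-system HDF5, or explicit file.h5#system spec
    return [system]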

Comment on lines 91 to 93
if isinstance(systems, str):
    with h5py.File(systems) as file:
        systems = [os.path.join(systems, item) for item in file.keys()]
Member

Indeed, I don't understand why DpLoaderSet doesn't reuse process_systems, but uses its own logic to search systems... Who designed this behavior?

Contributor Author

I've refactored the dataloaders to reuse process_systems instead of duplicating the HDF5 expansion logic.

Changes made:

  • PyTorch dataloader now calls process_systems(systems) instead of custom HDF5 logic
  • Paddle dataloader now calls process_systems(systems) instead of custom HDF5 logic
  • This eliminates code duplication and ensures consistent behavior
  • All tests updated and passing

The dataloaders now properly delegate system processing to the centralized process_systems function, avoiding the design issue you identified.
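A minimal sketch of the delegation described above, with the constructor signature abbreviated (the real DpLoaderSet takes more parameters):

from deepmd.utils.data_system import process_systems

class DpLoaderSet:
    def __init__(self, systems, batch_size, type_map, **kwargs):
        # Delegate HDF5 multisystem expansion to the shared utility
        # instead of re-implementing h5py key scanning in each backend.
        self.systems = process_systems(systems)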

@anyangml (Collaborator)

@copilot try again


Copilot AI commented Sep 22, 2025

@copilot try again

I've identified and fixed a critical inconsistency in HDF5 path format handling.

Issue: String and list processing produced different results for single-system HDF5 files:

  • process_systems("file.h5")["file.h5#/"]
  • process_systems(["file.h5"])["file.h5"]

Fix: Updated _process_single_system() to use consistent file.h5#/ format for all single-system HDF5 files.

Result: Both processing paths now produce identical, predictable results. All 22 tests pass with the consistent behavior and backward compatibility is maintained.

The implementation now provides consistent HDF5 path formatting regardless of input type.
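Illustratively, based on the results stated above:

# Both input forms now yield the same normalized path for a single-system HDF5 file
process_systems("file.h5")    # → ["file.h5#/"]
process_systems(["file.h5"])  # → ["file.h5#/"]  (now identical to the string form)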

@anyangml (Collaborator)

@copilot fix format

RUF017 Avoid quadratic list summation
   --> deepmd/utils/data_system.py:867:19
    |
865 |     elif isinstance(systems, list):
866 |         # Process each system individually and flatten results
867 |         systems = sum([_process_single_system(system) for system in systems], [])
    |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
868 |     return systems
    |
help: Replace with `functools.reduce`


Copilot AI commented Sep 22, 2025

@copilot fix format

RUF017 Avoid quadratic list summation

...

Fixed the RUF017 linting issue by replacing quadratic list summation with itertools.chain.from_iterable().

Change made:

# Before (quadratic complexity)
systems = sum([_process_single_system(system) for system in systems], [])

# After (linear complexity)  
systems = list(itertools.chain.from_iterable(
    _process_single_system(system) for system in systems
))

This provides better performance for large lists while maintaining identical functionality. All 22 tests pass.

Co-authored-by: anyangml <[email protected]>
Copilot finished work on behalf of anyangml September 22, 2025 05:11