Skip to content

Conversation

@hoangsonww
Copy link
Member

SUMMARY

This PR introduces a transformation pipeline that converts ML-format input data (CSV) into standardized OMOP-format CSVs, enabling compatibility with downstream clinical data analytics tools.

Key features:

  • Adds a robust convert.py script that reads data.csv and generates four OMOP-compliant files: person.csv, measurement.csv, observation.csv, and visit_occurrence.csv.
  • Ensures reproducibility and safety by appending rows with fixed starting IDs and without modifying or reading existing rows.
  • Includes verification.py, a lightweight validation script to check primary key uniqueness and referential integrity across all output files.
  • Enhances analytics and answer repository/service logic with UPSERT behavior and transactional integrity (e.g., add_or_update() with PostgreSQL's ON CONFLICT, session commits).

TEST PLAN

  • ✅ Ran convert.py locally on a sample data.csv, verified correct file generation and row counts.

  • ✅ Verified content and structure of all four OMOP output files (.csv) via manual inspection and sample slicing.

  • ✅ Executed verification.py to validate:

    • uniqueness of primary keys,
    • referential integrity of foreign keys across tables.
  • ✅ Confirmed that new logic in analytics_repository.py and answer_repository.py preserves data as expected across repeated API invocations (tested with mocked sessions and live PostgreSQL).

  • ✅ All modified services and repository functions were manually tested for correctness, including access control checks and commit persistence.


Pre-merge author checklist

  • I've clearly explained:

    • What problem this PR is solving.
    • How this problem was solved.
    • How reviewers can test my changes.
  • I've included tests I've run to ensure my changes work.

  • I've added unit tests for any new code, if applicable.

  • I've documented any added code.

@hoangsonww hoangsonww requested a review from Copilot July 11, 2025 21:28
@hoangsonww hoangsonww self-assigned this Jul 11, 2025
@hoangsonww hoangsonww added documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers duplicate This issue or pull request already exists labels Jul 11, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a complete ML-to-OMOP CSV conversion pipeline with validation, updates service and repository layers to enforce transactional commits and UPSERT behavior, and fixes a typo in the answer service.

  • Introduces convert.py to append OMOP-compliant CSVs and verification.py to validate them.
  • Enhances answer_repository and analytics_repository with session commits and PostgreSQL ON CONFLICT logic.
  • Adjusts service layers (configuration_service, answer_service, analytics_service) to ensure proper commits, typo fixes, and upsert calls.

Reviewed Changes

Copilot reviewed 7 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/user/service/configuration_service.py Added session.commit() after CSV processing in the service.
src/answer/service/answer_service.py Fixed typo user_eamiluser_email and authorization check.
src/answer/repository/answer_repository.py Changed add_answer to commit session; added docstring.
src/analytics/service/analytics_service.py Imported timezone, updated to use add_or_update, added doc.
src/analytics/repository/analytics_repository.py Implemented add_or_update with PostgreSQL ON CONFLICT.
script/transform_csv/verification.py New script to verify primary keys, FKs, and print summary.
script/transform_csv/convert.py New script to transform data.csv into OMOP-style CSVs.
Comments suppressed due to low confidence (3)

script/transform_csv/verification.py:1

  • There are no tests covering this validation script. Adding unit or integration tests would help ensure these sanity checks remain correct as the pipeline evolves.
#!/usr/bin/env python3

src/user/service/configuration_service.py:39

  • [nitpick] Directly committing the session in the service layer may blur transaction ownership. Consider delegating commit responsibility to the repository or using a transaction manager to maintain clear API boundaries.
        self.repository.session.commit()

src/analytics/service/analytics_service.py:37

  • Missing import: get_user_email_from_jwt is not imported in this module, which will raise a NameError at runtime. Please add the appropriate import (e.g., from src.common.auth_utils import get_user_email_from_jwt).
        user_email: str = get_user_email_from_jwt()

@hoangsonww hoangsonww merged commit 4d07236 into main Jul 11, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation duplicate This issue or pull request already exists enhancement New feature or request good first issue Good for newcomers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants