Skip to content

Conversation

@hoangsonww
Copy link
Member

SUMMARY

This PR introduces a transformation pipeline that converts ML-format input data (CSV) into standardized OMOP-format CSVs, enabling compatibility with downstream clinical data analytics tools.

Key features:

  • Adds a robust convert.py script that reads data.csv and generates four OMOP-compliant files: person.csv, measurement.csv, observation.csv, and visit_occurrence.csv.
  • Ensures reproducibility and safety by appending rows with fixed starting IDs and without modifying or reading existing rows.
  • Includes verification.py, a lightweight validation script to check primary key uniqueness and referential integrity across all output files.
  • Enhances analytics and answer repository/service logic with UPSERT behavior and transactional integrity (e.g., add_or_update() with PostgreSQL's ON CONFLICT, session commits).

TEST PLAN

  • ✅ Ran convert.py locally on a sample data.csv, verified correct file generation and row counts.

  • ✅ Verified content and structure of all four OMOP output files (.csv) via manual inspection and sample slicing.

  • ✅ Executed verification.py to validate:

    • uniqueness of primary keys,
    • referential integrity of foreign keys across tables.
  • ✅ Confirmed that new logic in analytics_repository.py and answer_repository.py preserves data as expected across repeated API invocations (tested with mocked sessions and live PostgreSQL).

  • ✅ All modified services and repository functions were manually tested for correctness, including access control checks and commit persistence.


Pre-merge author checklist

  • I've clearly explained:

    • What problem this PR is solving.
    • How this problem was solved.
    • How reviewers can test my changes.
  • I've included tests I've run to ensure my changes work.

  • I've added unit tests for any new code, if applicable.

  • I've documented any added code.

@hoangsonww hoangsonww requested a review from Copilot July 11, 2025 21:27
@hoangsonww hoangsonww self-assigned this Jul 11, 2025
@hoangsonww hoangsonww added documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers labels Jul 11, 2025
@hoangsonww hoangsonww closed this Jul 11, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a full pipeline for converting ML-format CSV data into OMOP-compliant CSVs, plus validation and repository changes to support safe persistence and upsert behavior.

  • Introduces convert.py to generate four OMOP CSV outputs (person.csv, measurement.csv, observation.csv, visit_occurrence.csv).
  • Adds verification.py to perform primary-key uniqueness and referential-integrity checks on those outputs.
  • Updates service and repository layers to commit sessions and use PostgreSQL ON CONFLICT upserts.

Reviewed Changes

Copilot reviewed 7 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/user/service/configuration_service.py Commit session after processing CSV rows
src/answer/service/answer_service.py Correct typo in user_email variable
src/answer/repository/answer_repository.py Add commit in add_answer(), annotate with Session type
src/analytics/service/analytics_service.py Enhance record_metrics() with timezone, durations, upsert call
src/analytics/repository/analytics_repository.py Implement add_or_update() using PostgreSQL ON CONFLICT upsert
script/transform_csv/verification.py New validation script for OMOP CSV outputs
script/transform_csv/convert.py New conversion script from ML-format CSV to OMOP-format CSVs
Comments suppressed due to low confidence (2)

script/transform_csv/convert.py:1

  • There are no automated tests covering the conversion logic in convert.py. Adding unit tests for ID generation, field mappings, and output file creation will improve reliability and make future refactoring safer.
#!/usr/bin/env python3

src/analytics/service/analytics_service.py:37

  • Missing import for get_user_email_from_jwt(), which will cause a NameError at runtime. Please import it (for example from src.user.auth_utils import get_user_email_from_jwt) at the top of the file.
        user_email: str = get_user_email_from_jwt()

result["status"] = "failed"
responses.append(result)

self.repository.session.commit()
Copy link

Copilot AI Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Committing directly in the service layer couples transaction management with business logic. Consider encapsulating commit behavior inside the repository or using a dedicated transaction manager to centralize persistence concerns.

Suggested change
self.repository.session.commit()
self.repository.commit_transaction()

Copilot uses AI. Check for mistakes.
print(f" {status:<4}{name}")
all_ok &= result

print("\nOverall:", "✔︎ ALL CHECKS PASSED" if all_ok else "✘ SOME CHECKS FAILED")
Copy link

Copilot AI Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation script prints a PASS/FAIL summary but always exits with status 0. Consider using sys.exit(1) when all_ok is false so that CI jobs can detect failures automatically.

Suggested change
print("\nOverall:", "✔︎ ALL CHECKS PASSED" if all_ok else "✘ SOME CHECKS FAILED")
print("\nOverall:", "✔︎ ALL CHECKS PASSED" if all_ok else "✘ SOME CHECKS FAILED")
if not all_ok:
sys.exit(1)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request good first issue Good for newcomers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants