Skip to content

APP-356 Convert POTNTL_DUP_INV_SUM (Potential Duplicate Investigations)#3131

Open
krista-skylight wants to merge 50 commits intomainfrom
kc/convert_POTNTL_DUP_INV_SUM
Open

APP-356 Convert POTNTL_DUP_INV_SUM (Potential Duplicate Investigations)#3131
krista-skylight wants to merge 50 commits intomainfrom
kc/convert_POTNTL_DUP_INV_SUM

Conversation

@krista-skylight
Copy link
Copy Markdown
Contributor

@krista-skylight krista-skylight commented Apr 6, 2026

Description

Please include a summary of the changes and any key information a reviewer may need.

Tickets

Checklist for adding a library:

  • Copy this checklist into your (draft) PR description
  • Check the existing description for the library in apps/modernization-api/src/main/resources/db/changelog/report/execution/03_ODSE_Data_Report_Library_Init.sql
    • If the report execution service is not yet in use by any STLT (even beta), you should update this file
    • Update the description in the library-specific migration
  • Add a migration to apps/modernization-api/src/main/resources/db/report/execution/libraries named <your library name>.sql
    • Copy another migration and update the variables at the top of the file
  • Add the migration to apps/modernization-api/src/main/resources/db/report-execution-changelog.yml. It should be added to the latest changeSet since the last release - this could require a new changeSet if there isn't one since last release
  • Check the migration works by building/running the containerized dev setup
  • Add a python library file to apps/report-execution/src/libraries named in lowercase, but generally following the naming convention of SAS (needs to be human recognizable as the same library)
  • Follow the pattern established in existing libraries to create your library. The execute function is the required method and its signature will (someday) be checked for validity
    • Make sure pertinent text embedded in the reports is included in the description of the report
    • Note in the doc string any deviation from the SAS
  • Keep an eye out for things in the sas library that are/could be centralized to pre- or post-processing or a util would be helpful for (e.g. common formatting needs)
  • Add integration tests in a apps/report-execution/tests/libraries/<your_library_here>py file following the conventions established for other libraries
    • The subset_sql can be assumed to be well tested by the modernization-api, so focus on any logic and additional joins/queries/analysis that is added in the library
    • The database used in integration testing is our standard golden db + seed used in other testing - if additional data is needed to test the report, add it to the seeding process
  • Add e2e test coverage for the report
    • If this step is not set up, create a ticket to track this as follow on work needed for this report
  • Check that the report has parity with the NBS 6 version using (process TBD)
    • If this step is not set up, create a ticket to track this as follow on work needed for this report
  • Keep an eye out for opportunities to:
    • Consolidate reports by using parameters (parameter design tbd)
    • Improve the report spec or result expectations
  • Update the SAS to Python tracking spreadsheet with information about the newly converted library
  • Update/clarify these instructions in the epic based on your experience

Krista Chan and others added 30 commits April 1, 2026 13:44
…into kc/convert_POTNTL_DUP_INV_SUM

merge branches
…into kc/convert_POTNTL_DUP_INV_SUM

merge main
…into kc/convert_POTNTL_DUP_INV_SUM

merge from main
…into kc/convert_POTNTL_DUP_INV_SUM

merge with main
…into kc/convert_POTNTL_DUP_INV_SUM

merge changes from main
…into kc/convert_POTNTL_DUP_INV_SUM

merge from main
Comment thread apps/report-execution/src/libraries/potntl_dup_inv_sum.py
Comment thread apps/report-execution/src/execute_report.py Outdated
for col in self.expected_columns:
data.get_column(col)

def test_execute_report_with_small_days_value(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if a negative number of days is passed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I believe no data is returned since, in my query, I filter for anything less than the days_value. Should I have the library throw an error instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the front end doesn't let you enter negatives, so I don't think it's high risk in practice. We probably just want to make sure it doesn't explode in an opaque way and if it does make it error more clearly - add a test for this case?

Comment on lines +118 to +121
assert result.subheader == 'Duplicate Investigations Time Frame: 30 Days'

# Just verify it runs and returns data
assert len(result.content.data) >= 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test functionally different than the first one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's just testing a different days_value. delete?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, let's 🪓

result_30 = execute_report(spec_30)

# 3650 days should return more or equal rows than 30 days
assert len(result.content.data) >= len(result_30.content.data)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect more always for that difference - do we ever see equal?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it could be equal if all potential duplicate investigations in the data are already less than 30 days.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the data is largely a fixed data set, is that true of our data? I'd expect not given the randomness at play

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not true of our data. I can modify to >, but add a little comment about this.

assert result.content_type == 'table'

# Verify only selected diseases appear
disease_cds = result.content.get_column('Disease Code')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test seems to be testing the subset sql not really the library - remove?

Copy link
Copy Markdown
Contributor

@JordanGuinn JordanGuinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good!! 🔥

library_name: str = Field(min_length=1)
data_source_name: str = Field(min_length=1)
subset_query: str = Field(min_length=1)
days_value: int | None = None # Specific to potntl_dup_inv_sum
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(thought, nb): Probably not worth addressing at this moment, but if there end up being multiple reports that require unique properties like this, maybe we end up sketching out a custom_props Object field or similar to capture them all? Just in the spirit of minimizing the amount of library-specific bits on the ReportSpec model.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea def a good idea to revisit as we work through more translations!

sas:
image: ghcr.io/cdcent/nbs7-sas-linux:v1.0.4
platform: linux/amd64
container_name: "${COMPOSE_PROJECT_NAME}-sas"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q, nb): Any particular reason for this addition?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know enough to understand why but when I was troubleshooting issues with @mcmcgrath13 doing sas translations, she suggested making this change to get around an error message that I can no longer recall lol.

Mary, do you think I should push this for everyone or leave it as something only I modify?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to push - makes working with the sas container easier via compose


db_tables = [t['table_name'] for t in schema['tables']]
fk_tables = schema['config']['nbs']['fk_tables']
fk_tables = schema['config'].get('nbs', {}).get('fk_tables', [])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q, nb): What's this change for?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if there are no fk tables specified (not always relevant), then we default to empty list

]

def test_execute_report_with_days_value(self, snapshot):
"""Test with a specific days value (e.g., 365 days)."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q, nb): Can/should we do any actual date comparisons of the underlying results returned by the report lib, to ensure they actually match the days value provided?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, unfortunately, the report doesn't retain data about the two event dates for the potential duplicate or how close in days they are to each other. That's why I did a comparison between 3650 and 30 days to check that 3650 has more potential dupilcates than 30.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

@JordanGuinn JordanGuinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

- column_name: ILLNESS_ONSET_DATE
type: string
data: None
null_percentage: 1.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q, nb): why include the column, but have everything be null?

data: f"{fake.date_between(start_date='-1y', end_date='today').strftime('%Y-%m-%d')} 12:00:00.000"
null_percentage: 0.05

- column_name: EVENT_DATE_TYPE
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be linked off of the non-null ness of EVENT_DATE? (similar to how in phdc the state is dependent on state cd)


# KLUDGE: NULL writing is not always correct
result = result.replace(' nan,', ' NULL,')
result = result.replace('nan', ' NULL')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(q, nb): why did we need to add this one? there's a risk that a valid part of a string with nan in it ill now be turned into NULL is it the opening paren case of (nan,?

with db_transaction(conn_string) as trx:
# Tables with foreign keys pointing to the table we want to replace need to
# be backed up and cleared out to avoid FK constraint violations

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@mcmcgrath13
Copy link
Copy Markdown
Contributor

I only see the SAS output and not the python in the report catalog - can you update with the python?

Also, could you update the spreadsheet tracker and add the e2e ticket?

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants