Create usable Nextflow regression test Action #16

Merged (62 commits) on Mar 5, 2024

@nwiltsie nwiltsie commented Mar 1, 2024

Description

This follows on from #12 to create a functional Nextflow regression test Action (technically it's a reusable workflow). All that must be done to use it is for pipelines to add this workflow...

---
on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  tests:
    uses: uclahs-cds/tool-Nextflow-action/.github/workflows/nextflow-tests.yml@main

... which will then automatically discover and run any test files. See uclahs-cds/pipeline-recalibrate-BAM#52 for a working example.

Interface

Test File Format

There are a few changes to the test format I laid out in #12.

First, test files are now required to have filenames starting with configtest. That makes them automatically discoverable, so they do not need to be registered individually.
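As a rough illustration of prefix-based discovery, here is a minimal Python sketch. The function name discover_tests and the recursive search are assumptions made for illustration; the actual discover-tests job may locate files differently.

```python
from pathlib import Path


def discover_tests(repo_root: str, prefix: str = "configtest") -> list[str]:
    """Return sorted, repo-relative paths of files whose names start with prefix.

    Hypothetical helper: the real workflow may search differently.
    """
    root = Path(repo_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob(f"{prefix}*")  # match the filename at any depth
        if path.is_file()                     # skip directories with the prefix
    )
```

Because discovery keys only on the filename, a new test file is picked up without editing any workflow configuration.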

Second, I removed the empty_files and mapped_files fields. In discussing this with various lab members I couldn't justify any reason for them to exist - configurations that depend upon the contents of untracked files are bad configurations. The loss of empty_files does mean that most tests will need to add the mock check_paths: "".

Third, I added the field nextflow_version. The updated test files written out at runtime have that value updated based on the container, which should help protect against version mismatches. Currently it's still hardcoded in a few places to 23.10.0 but I want to support multiple versions simultaneously.

Fourth, I'm sanitizing the expected_results fields to avoid spurious diffs. All dates in dated_fields are replaced with 19970704T165655Z (Pathfinder's landing), and all Object.toString()-like values (e.g. [Ljava.lang.String;@49c7b90e) have the hash replaced with dec0ded (although they really shouldn't be showing up anyway).
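The sanitization described above could look roughly like the following Python sketch. The regular expressions, the sanitize function, and its is_dated flag are assumptions for illustration (in particular, the date pattern is inferred from the shape of the placeholder value); only the two placeholder values come from this PR.

```python
import re

PLACEHOLDER_DATE = "19970704T165655Z"  # Pathfinder's landing
PLACEHOLDER_HASH = "dec0ded"

# Assumption: dates have the same shape as the placeholder, e.g. 20240301T194400Z.
DATE_PATTERN = re.compile(r"\d{8}T\d{6}Z")
# Java's default Object.toString() renders as ClassName@hexHashCode.
HASH_PATTERN = re.compile(r"([\w.$\[;]+@)[0-9a-f]{1,8}\b")


def sanitize(value: str, is_dated: bool = False) -> str:
    """Replace unstable hashes (and dates, for dated fields) with fixed values."""
    value = HASH_PATTERN.sub(rf"\g<1>{PLACEHOLDER_HASH}", value)
    if is_dated:
        value = DATE_PATTERN.sub(PLACEHOLDER_DATE, value)
    return value
```

With fixed placeholders, two runs of the same configuration produce identical expected_results regardless of when they ran or where the JVM allocated objects.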

Results Reporting

Each test gets a status check on the PR:

[Screenshot: status_checks]

The specific changes are reported as annotations within the code view:

[Screenshot: annotation]

Each test, successful or not, attaches the output file to the check run. That makes it easy to download and overwrite test files when changes occur.

[Screenshot: artifacts]

Internals

Action Reusable Workflow Structure

The workflow has two jobs:

discover-tests... discovers all the tests.

run-test has a matrix strategy such that it runs each test independently and in parallel. For some unfathomable reason I could not get this to work with a standard uses: docker:// line, as GitHub insists on pulling the image at the start of the job before I can provide any credentials.
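One common way to connect a discovery job to a matrix job is to publish the discovered files as a JSON list in a job output. The Python sketch below shows that hand-off under stated assumptions: the output name tests and the helper matrix_output_line are hypothetical, and the actual workflow may pass the list differently.

```python
import json


def matrix_output_line(test_files: list[str], name: str = "tests") -> str:
    """Format discovered test files as a single `name=<JSON list>` output line.

    Hypothetical helper; the output name "tests" is an assumption.
    """
    # Compact separators keep the value on one line with no extra spaces.
    payload = json.dumps(sorted(test_files), separators=(",", ":"))
    return f"{name}={payload}"
```

A job step can append such a line to the file named by the GITHUB_OUTPUT environment variable, and a downstream job can expand it into a matrix with the fromJSON expression function, so that each test file becomes its own parallel run.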

Docker Image

Rather than a Python script that calls Docker, the entire test process is now bundled up in a pre-built Docker image (https://github.com/uclahs-cds/tool-Nextflow-action/pkgs/container/nextflow-config-tests). The Dockerfile is tracked in this repository, but the build action is a little unusual - it only builds on commits to main that change files in the run-nextflow-tests/ folder or the build workflow itself. It also adds two custom tags: one for the Nextflow version, and one for the Nextflow version plus the current date (e.g. 23.10.0 and 23.10.0-2024-03-01). It does not build on git tags.
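The two-tag scheme can be expressed as a short Python sketch. The function image_tags is hypothetical and only mirrors the tag format shown in the example; the build workflow itself may compute the tags another way.

```python
from datetime import date


def image_tags(nextflow_version: str, build_date: date) -> list[str]:
    """Return the version tag and the version-plus-date tag for an image build.

    Hypothetical helper mirroring the format 23.10.0 / 23.10.0-2024-03-01.
    """
    return [
        nextflow_version,
        f"{nextflow_version}-{build_date.isoformat()}",  # ISO date: YYYY-MM-DD
    ]
```

The version-only tag moves with each rebuild, while the dated tag gives a fixed reference to one specific build.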

The image itself is mostly the same - the key changes are that I made the Nextflow hijacking process a little more robust and installed python.


I deleted all of the bundled test files, but once this gets merged I intend to open a variety of PRs to establish them in their corresponding pipelines.

Checklist

  • This PR does NOT contain Protected Health Information (PHI). A repo may need to be deleted if such data is uploaded.
    Disclosing PHI is a major problem [1] - Even a small leak can be costly [2].

  • This PR does NOT contain germline genetic data [3], RNA-Seq, DNA methylation, microbiome or other molecular data [4].

  • This PR does NOT contain other non-plain text files, such as: compressed files, images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other output files.

  To automatically exclude such files using a .gitignore file, see here for example.

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have set up or verified the main branch protection rule following the github standards before opening this pull request.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

Footnotes

  1. UCLA Health reaches $7.5m settlement over 2015 breach of 4.5m patient records

  2. The average healthcare data breach costs $2.2 million, despite the majority of breaches releasing fewer than 500 records.

  3. Genetic information is considered PHI.
    Forensic assays can identify patients with as few as 21 SNPs

  4. RNA-Seq, DNA methylation, microbiome, or other molecular data can be used to predict genotypes (PHI) and reveal a patient's identity.

@nwiltsie nwiltsie requested review from a team March 1, 2024 19:44

kiarod commented Mar 5, 2024

This is awesome work Nick! LGrTM!

@yashpatel6 yashpatel6 left a comment

Looks good! If no other concerns, I'm ok with merging
