Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add capability for more complicated mocks #26

Merged
merged 2 commits into from
Mar 12, 2024
Merged

Conversation

nwiltsie
Copy link
Member

@nwiltsie nwiltsie commented Mar 9, 2024

Description

This PR adds the capability for more complicated mocks that can return different values depending upon the arguments.

This is specifically to solve handle configs like this:

methods {
    get_ids_from_bams = {
        params.samples_to_process = [] as Set
        params.input.each { k, v ->
            v.each { sampleMap ->
                def bam_path = sampleMap['path']
                def bam_header = bam_parser.parse_bam_header(bam_path)
                def sm_tags = bam_header['read_group'].collect{ it['SM'] }.unique()
                if (sm_tags.size() != 1) {
                    throw new Exception("${bam_path} contains multiple samples! Please run pipeline with single sample BAMs.")
                }
                if (params.samples_to_process.any { it.orig_id == sm_tags[0] }) {
                    throw new Exception("Sample ${sm_tags[0]} was found in multiple BAMs. Please provide only one BAM per sample")
                }

The pipeline accepts paired BAMs, and it performs the reasonable check that those BAMs do not have matching sample IDs. However, our only mockable hook in there is parse_bam_header, and returning the same value twice will cause the configuration to fail.

So. I'm adding a layer of complexity with "dynamic mocks" that look like this:

  "mocks": {
    "DYNAMIC|parse_bam_header": {
      "[\"/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/HG002.N-n2.bam\"]": {
        "read_group": [
          {
            "SM": "098765"
          }
        ]
      },
      "[\"/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/S2.T-n2.bam\"]": {
        "read_group": [
          {
            "SM": "52345245"
          }
        ]
      }
    }
  }

The function name has DYNAMIC| prepended to it, and there is another level that maps from the function's arguments to the return value. If the mocked function is called with any other arguments it will report them and fail:

	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at betterconfig.print_configuration(betterconfig.groovy:223)
	at betterconfig.run(betterconfig.groovy:241)
Caused by: java.lang.Exception: Dynamic mock parse_bam_header does not contain ["/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/HG002.N-n2.bam"]
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)

The function arguments are also presented as a JSONified string of an Object []. That means that it is always an array, even for zero- or one-argument functions.

Checklist

  • This PR does NOT contain Protected Health Information (PHI). A repo may need to be deleted if such data is uploaded.
    Disclosing PHI is a major problem1 - Even a small leak can be costly2.

  • This PR does NOT contain germline genetic data3, RNA-Seq, DNA methylation, microbiome or other molecular data4.

  • This PR does NOT contain other non-plain text files, such as: compressed files, images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other output files.

  To automatically exclude such files using a .gitignore file, see here for example.

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have set up or verified the main branch protection rule following the github standards before opening this pull request.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

Footnotes

  1. UCLA Health reaches $7.5m settlement over 2015 breach of 4.5m patient records

  2. The average healthcare data breach costs $2.2 million, despite the majority of breaches releasing fewer than 500 records.

  3. Genetic information is considered PHI.
    Forensic assays can identify patients with as few as 21 SNPs

  4. RNA-Seq, DNA methylation, microbiome, or other molecular data can be used to predict genotypes (PHI) and reveal a patient's identity.

@nwiltsie nwiltsie requested a review from a team March 9, 2024 00:19
@nwiltsie
Copy link
Member Author

nwiltsie commented Mar 9, 2024

I'm completely open to feedback here. If I hadn't just spent a week littering our organization with regression tests I probably would have added a dynamic_mocks field instead of overloading mocks.

I'm also trying to walk the line between enforcing a restricted coding standard on the pipeline configurations and supporting reasonable use cases. In this case specifically the config seemed reasonable as-written, so I wanted to support it with added complexity in this tool.

@nwiltsie
Copy link
Member Author

Here's a quick rundown of possible solutions and why I don't like them:

Approach Problem
Mocked functions keep a counter, return "result-1", "result-2", etc Two problems: first, the mocked functions can return arbitrarily complex objects, and there would need to be logic for how to updated the innermost key. Second, this would be sensitive to innocuous changes in the config logic; if I changed data.each(...) to data.sorted().each(...) and the tests broke, I would be mad.
Embedded groovy code that can be executed I don't really want to get into protecting eval'd code... there shouldn't actually be any risk, as users can change the configs to execute arbitrary code, but I just don't want to open the door.
Overloading mocks with this "dynamic mock" approach It's interleaving data and logic, which I don't love. It's also very verbose and cryptic (why are there escaped double quotes?). However, it is very explicit, and should be stable against configuration logic changes.
Adding new dynamic_mocks key A little cleaner than the above, but with the downside that I would need to fix all of the existing tests to add in this key. That's probably better, but...

The least worst approach seems to be this dynamic mocking, but I welcome any discussion about that.

@yashpatel6
Copy link
Contributor

Upon further consideration, I can't think of any other solutions that wouldn't more complicated than the ones above. I'm ok with adopting this dynamic mocking if no concerns/additional suggestions from others

@kiarod
Copy link

kiarod commented Mar 11, 2024

agreed, all other solutions come with their own headaches. I like that this approach won't affect the current tests.

@sorelfitzgibbon
Copy link

LGTM, assuming this would work with any number and combination of input tumor and normal BAM files.

@nwiltsie nwiltsie merged commit 48de7af into main Mar 12, 2024
1 check passed
@nwiltsie nwiltsie deleted the nwiltsie-dynamic-mocks branch March 12, 2024 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants