Add capability for more complicated mocks #26

nwiltsie · 2024-03-09T00:19:35Z

Description

This PR adds the capability for more complicated mocks that can return different values depending upon the arguments.

This is specifically to solve handle configs like this:

methods {
    get_ids_from_bams = {
        params.samples_to_process = [] as Set
        params.input.each { k, v ->
            v.each { sampleMap ->
                def bam_path = sampleMap['path']
                def bam_header = bam_parser.parse_bam_header(bam_path)
                def sm_tags = bam_header['read_group'].collect{ it['SM'] }.unique()
                if (sm_tags.size() != 1) {
                    throw new Exception("${bam_path} contains multiple samples! Please run pipeline with single sample BAMs.")
                }
                if (params.samples_to_process.any { it.orig_id == sm_tags[0] }) {
                    throw new Exception("Sample ${sm_tags[0]} was found in multiple BAMs. Please provide only one BAM per sample")
                }

The pipeline accepts paired BAMs, and it performs the reasonable check that those BAMs do not have matching sample IDs. However, our only mockable hook in there is parse_bam_header, and returning the same value twice will cause the configuration to fail.

So. I'm adding a layer of complexity with "dynamic mocks" that look like this:

  "mocks": {
    "DYNAMIC|parse_bam_header": {
      "[\"/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/HG002.N-n2.bam\"]": {
        "read_group": [
          {
            "SM": "098765"
          }
        ]
      },
      "[\"/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/S2.T-n2.bam\"]": {
        "read_group": [
          {
            "SM": "52345245"
          }
        ]
      }
    }
  }

The function name has DYNAMIC| prepended to it, and there is another level that maps from the function's arguments to the return value. If the mocked function is called with any other arguments it will report them and fail:

	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at betterconfig.print_configuration(betterconfig.groovy:223)
	at betterconfig.run(betterconfig.groovy:241)
Caused by: java.lang.Exception: Dynamic mock parse_bam_header does not contain ["/hot/software/pipeline/pipeline-SQC-DNA/Nextflow/development/test-input/HG002.N-n2.bam"]
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)

The function arguments are also presented as a JSONified string of an Object []. That means that it is always an array, even for zero- or one-argument functions.

Checklist

This PR does NOT contain Protected Health Information (PHI). A repo may need to be deleted if such data is uploaded.
Disclosing PHI is a major problem¹ - Even a small leak can be costly².
This PR does NOT contain germline genetic data³, RNA-Seq, DNA methylation, microbiome or other molecular data⁴.

This PR does NOT contain other non-plain text files, such as: compressed files, images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other output files.

To automatically exclude such files using a .gitignore file, see here for example.

I have read the code review guidelines and the code review best practice on GitHub check-list.
I have set up or verified the main branch protection rule following the github standards before opening this pull request.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

UCLA Health reaches $7.5m settlement over 2015 breach of 4.5m patient records ↩
The average healthcare data breach costs $2.2 million, despite the majority of breaches releasing fewer than 500 records. ↩
Genetic information is considered PHI.
Forensic assays can identify patients with as few as 21 SNPs ↩
RNA-Seq, DNA methylation, microbiome, or other molecular data can be used to predict genotypes (PHI) and reveal a patient's identity. ↩

nwiltsie · 2024-03-09T00:27:57Z

I'm completely open to feedback here. If I hadn't just spent a week littering our organization with regression tests I probably would have added a dynamic_mocks field instead of overloading mocks.

I'm also trying to walk the line between enforcing a restricted coding standard on the pipeline configurations and supporting reasonable use cases. In this case specifically the config seemed reasonable as-written, so I wanted to support it with added complexity in this tool.

nwiltsie · 2024-03-11T21:54:21Z

Here's a quick rundown of possible solutions and why I don't like them:

Approach	Problem
Mocked functions keep a counter, return "result-1", "result-2", etc	Two problems: first, the mocked functions can return arbitrarily complex objects, and there would need to be logic for how to updated the innermost key. Second, this would be sensitive to innocuous changes in the config logic; if I changed `data.each(...)` to `data.sorted().each(...)` and the tests broke, I would be mad.
Embedded groovy code that can be executed	I don't really want to get into protecting `eval`'d code... there shouldn't actually be any risk, as users can change the configs to execute arbitrary code, but I just don't want to open the door.
Overloading `mocks` with this "dynamic mock" approach	It's interleaving data and logic, which I don't love. It's also very verbose and cryptic (why are there escaped double quotes?). However, it is very explicit, and should be stable against configuration logic changes.
Adding new `dynamic_mocks` key	A little cleaner than the above, but with the downside that I would need to fix all of the existing tests to add in this key. That's probably better, but...

The least worst approach seems to be this dynamic mocking, but I welcome any discussion about that.

yashpatel6 · 2024-03-11T23:32:45Z

Upon further consideration, I can't think of any other solutions that wouldn't more complicated than the ones above. I'm ok with adopting this dynamic mocking if no concerns/additional suggestions from others

kiarod · 2024-03-11T23:35:09Z

agreed, all other solutions come with their own headaches. I like that this approach won't affect the current tests.

sorelfitzgibbon · 2024-03-12T00:15:07Z

LGTM, assuming this would work with any number and combination of input tumor and normal BAM files.

nwiltsie added 2 commits March 8, 2024 15:59

Add dynamic mocks

70c4492

Update CHANGELOG

5ee6e60

nwiltsie requested a review from a team March 9, 2024 00:19

nwiltsie mentioned this pull request Mar 9, 2024

Add standard pipeline workflows uclahs-cds/pipeline-generate-SQC-BAM#37

Merged

8 tasks

kiarod approved these changes Mar 11, 2024

View reviewed changes

yashpatel6 assigned kiarod, aholmes, dan-knight, yashpatel6, j2salmingo, sorelfitzgibbon and Faizal-Eeman Mar 11, 2024

nwiltsie merged commit 48de7af into main Mar 12, 2024
1 check passed

nwiltsie deleted the nwiltsie-dynamic-mocks branch March 12, 2024 16:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add capability for more complicated mocks #26

Add capability for more complicated mocks #26

nwiltsie commented Mar 9, 2024

nwiltsie commented Mar 9, 2024

nwiltsie commented Mar 11, 2024

yashpatel6 commented Mar 11, 2024

kiarod commented Mar 11, 2024

sorelfitzgibbon commented Mar 12, 2024

Add capability for more complicated mocks #26

Add capability for more complicated mocks #26

Conversation

nwiltsie commented Mar 9, 2024

Description

Checklist

Footnotes

nwiltsie commented Mar 9, 2024

nwiltsie commented Mar 11, 2024

yashpatel6 commented Mar 11, 2024

kiarod commented Mar 11, 2024

sorelfitzgibbon commented Mar 12, 2024