Fix test subset script #519

vivbak · 2023-07-19T23:42:40Z

Addresses #514

Several changes including

Script now queries analyses (previously this was not implemented)
Script now upserts SG's (previously only samples and assays)
Fixes to additional_samples and additional_families inputs, so they are more compatible with each other.
Implementation of multiple graphql queries (replacing api endpoints)* Not all endpoints have been replaced.
Fixes analyses bug. I think this script was previously responsible for test being overpopulated with analyses entries. We weren't handling upserting correctly here.

Querying has been significantly simplified in this PR, although unfortunately upserting is more complex. So net neutral.

Note: There is another pending ticket to improve and refactor this script. This doesn't address that, and the main goal was to get the script functional to unblock everyone. Happy to take on all feedback of course, but keen to move improvements upon the initial script to another ticket.

codecov-commenter · 2023-07-19T23:44:23Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (68532b8) 72.05% compared to head (8c7d0ff) 72.05%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@           Coverage Diff           @@
##              dev     #519   +/-   ##
=======================================
  Coverage   72.05%   72.05%           
=======================================
  Files          90       90           
  Lines        7746     7746           
=======================================
  Hits         5581     5581           
  Misses       2165     2165

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

- As we run scripts via AR, the interactivity never makes sense. - Most of the validation is inaccurate now, as those use cases have been handled

scripts/create_test_subset.py

daniaki · 2023-08-02T01:48:56Z

scripts/create_test_subset.py

+    existing_data = query(EXISTING_DATA_QUERY, {'project': target_project})
+
+    samples = original_project_subset_data.get('project').get('samples')
+    transfer_samples_sgs_assays(


Is there a way to add a dry_run cmd line option so that these transfer_ methods print what they will do, rather than actually doing the thing? For example rather than calling aapi.create_analaysis it would log the details of the analysis that will be created. Would be good to have so that a user can run a quick validation check of the operations that will be performed.

This is a great suggestion and I just looked into implementing it -- however I think a dry-run is less trivial than I anticipated because for example, transfer_samples_sgs_assays is dependent on transfer_participants having happened first. Happy to create a ticket to think about this further, but perhaps beyond the scope of this PR?

scripts/create_test_subset.py

MattWellie · 2023-09-18T05:00:12Z

I haven't dug all the way into this, but a couple of usage thoughts. Apologies for farting this into a comment box, I think it would be useful to have as a conversation, possibly with a lot of pen + paper. Maybe tearing some of it into little shapes and moving them around.

I'm not a fan of the interactions between random family (int) and specific family (list) (same applies to samples). A real world use case would be 'I only want to transfer trio X to test for some manual interrogation' - In that instance this script will also throw in another 7 unwanted samples just to hit the default sample size. Likewise if I want Family X and 2 additional families, the current process subtracts specific families from the random family count, which is misleading. It would be nice to make this procedural so that these arguments can all act more independently. I've had a crack at this here:

Default params of 0 for sample_n and family_n, empty set for samples and families
Did you ask for specific SGs? Cool - they're in the set of SGs already
Did you ask for specific families? Cool - add all their member SGs to the set of SGs
Did you ask for random families?
- Find family IDs from any SGs selected specifically, remove from consideration
- Do a size-weighted selection from remaining families
- Get all SGs from the random families and add them to the set of SGs
Did you ask for random SGs?
- remove any collected so far, and randomly sample from remaining SGs

This framing means that any combination of the 4 arguments can be used and should never clash. I do think that any time a SG is transferred to test, it should also be accompanied by its family members and their data too. Not sure how that interacts with this.

I'm a big fan of not deleting the current test project contents by default. Is there a partner script that can be used for 'delete everything in this project' so we can fully separate the transfer and delete behaviours (and so that we can enable anyone to clear the test projects without needing your assistance)?
The weighted family selection has been dropped! There are simpler ways to implement it using random.choices(population, weights, sample_size) but the idea is cool.
There's a load of methods remaining in this script that are no longer called, happy to delete them?
I have strong feelings about Click... I think in this context ArgParse is just much better as it allows an arbitrary number of arguments after a flag, which usability wise is the difference between

create_test_subset.py --add-family FAM1 --add-family FAM2 --add-family FAM3 --add-sample SAM1 --add-sample SAM2

and

create_test_subset.py --add-family FAM1 FAM2 FAM3 --add-sample SAM1 SAM2

Co-authored-by: vivbak <[email protected]>

vivbak · 2023-09-26T05:16:25Z

Hello
Optional Reviewers:
@MattWellie
@jmarshall
@illusional

Please re-review in line with changes requested:
@daniaki

MattWellie · 2023-09-26T05:22:22Z

Looks delightful 👌

vivbak added 3 commits July 12, 2023 14:15

add family sg function + silence the linter

2252b27

Merge branch 'dev' into fix-test-subset-analysis

589606c

Lots of changes tbh

24fa8d3

vivbak added 2 commits July 20, 2023 09:51

Remove interactivity. Remove validation

b04f6ae

- As we run scripts via AR, the interactivity never makes sense. - Most of the validation is inaccurate now, as those use cases have been handled

Fixes + Cleanup

8c7d0ff

vivbak requested review from daniaki and jmarshall July 21, 2023 01:42

daniaki suggested changes Aug 2, 2023

View reviewed changes