Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test subset script #519

Merged
merged 12 commits into from
Sep 27, 2023
Merged

Fix test subset script #519

merged 12 commits into from
Sep 27, 2023

Conversation

vivbak
Copy link
Contributor

@vivbak vivbak commented Jul 19, 2023

Addresses #514

Several changes including

  • Script now queries analyses (previously this was not implemented)
  • Script now upserts SG's (previously only samples and assays)
  • Fixes to additional_samples and additional_families inputs, so they are more compatible with each other.
  • Implementation of multiple graphql queries (replacing api endpoints)* Not all endpoints have been replaced.
  • Fixes analyses bug. I think this script was previously responsible for test being overpopulated with analyses entries. We weren't handling upserting correctly here.

Querying has been significantly simplified in this PR, although unfortunately upserting is more complex. So net neutral.

Note: There is another pending ticket to improve and refactor this script. This doesn't address that, and the main goal was to get the script functional to unblock everyone. Happy to take on all feedback of course, but keen to move improvements upon the initial script to another ticket.

@codecov-commenter
Copy link

codecov-commenter commented Jul 19, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (68532b8) 72.05% compared to head (8c7d0ff) 72.05%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #519   +/-   ##
=======================================
  Coverage   72.05%   72.05%           
=======================================
  Files          90       90           
  Lines        7746     7746           
=======================================
  Hits         5581     5581           
  Misses       2165     2165           

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

- As we run scripts via AR, the interactivity never makes sense.
- Most of the validation is inaccurate now, as those use cases have been handled
@vivbak vivbak requested review from daniaki and jmarshall July 21, 2023 01:42
scripts/create_test_subset.py Outdated Show resolved Hide resolved
scripts/create_test_subset.py Show resolved Hide resolved
existing_data = query(EXISTING_DATA_QUERY, {'project': target_project})

samples = original_project_subset_data.get('project').get('samples')
transfer_samples_sgs_assays(
Copy link
Contributor

@daniaki daniaki Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to add a dry_run cmd line option so that these transfer_ methods print what they will do, rather than actually doing the thing? For example rather than calling aapi.create_analaysis it would log the details of the analysis that will be created. Would be good to have so that a user can run a quick validation check of the operations that will be performed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great suggestion and I just looked into implementing it -- however I think a dry-run is less trivial than I anticipated because for example, transfer_samples_sgs_assays is dependent on transfer_participants having happened first. Happy to create a ticket to think about this further, but perhaps beyond the scope of this PR?

scripts/create_test_subset.py Outdated Show resolved Hide resolved
scripts/create_test_subset.py Outdated Show resolved Hide resolved
scripts/create_test_subset.py Outdated Show resolved Hide resolved
scripts/create_test_subset.py Outdated Show resolved Hide resolved
scripts/create_test_subset.py Outdated Show resolved Hide resolved
@vivbak vivbak mentioned this pull request Sep 18, 2023
@MattWellie
Copy link
Contributor

MattWellie commented Sep 18, 2023

I haven't dug all the way into this, but a couple of usage thoughts. Apologies for farting this into a comment box, I think it would be useful to have as a conversation, possibly with a lot of pen + paper. Maybe tearing some of it into little shapes and moving them around.

  • I'm not a fan of the interactions between random family (int) and specific family (list) (same applies to samples). A real world use case would be 'I only want to transfer trio X to test for some manual interrogation' - In that instance this script will also throw in another 7 unwanted samples just to hit the default sample size. Likewise if I want Family X and 2 additional families, the current process subtracts specific families from the random family count, which is misleading. It would be nice to make this procedural so that these arguments can all act more independently. I've had a crack at this here:
  1. Default params of 0 for sample_n and family_n, empty set for samples and families
  2. Did you ask for specific SGs? Cool - they're in the set of SGs already
  3. Did you ask for specific families? Cool - add all their member SGs to the set of SGs
  4. Did you ask for random families?
    • Find family IDs from any SGs selected specifically, remove from consideration
    • Do a size-weighted selection from remaining families
    • Get all SGs from the random families and add them to the set of SGs
  5. Did you ask for random SGs?
    • remove any collected so far, and randomly sample from remaining SGs

This framing means that any combination of the 4 arguments can be used and should never clash. I do think that any time a SG is transferred to test, it should also be accompanied by its family members and their data too. Not sure how that interacts with this.

  • I'm a big fan of not deleting the current test project contents by default. Is there a partner script that can be used for 'delete everything in this project' so we can fully separate the transfer and delete behaviours (and so that we can enable anyone to clear the test projects without needing your assistance)?

  • The weighted family selection has been dropped! There are simpler ways to implement it using random.choices(population, weights, sample_size) but the idea is cool.

  • There's a load of methods remaining in this script that are no longer called, happy to delete them?

  • I have strong feelings about Click... I think in this context ArgParse is just much better as it allows an arbitrary number of arguments after a flag, which usability wise is the difference between

create_test_subset.py --add-family FAM1 --add-family FAM2 --add-family FAM3 --add-sample SAM1 --add-sample SAM2

and

create_test_subset.py --add-family FAM1 FAM2 FAM3 --add-sample SAM1 SAM2

@MattWellie MattWellie self-requested a review September 18, 2023 06:28
@vivbak
Copy link
Contributor Author

vivbak commented Sep 26, 2023

Hello
Optional Reviewers:
@MattWellie
@jmarshall
@illusional

Please re-review in line with changes requested:
@daniaki

@vivbak vivbak requested a review from daniaki September 26, 2023 05:16
@MattWellie
Copy link
Contributor

Looks delightful 👌

@vivbak vivbak merged commit 44563e0 into dev Sep 27, 2023
2 checks passed
@vivbak vivbak deleted the fix-test-subset-analysis branch September 27, 2023 03:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants