-
Notifications
You must be signed in to change notification settings - Fork 6
Pbinder/task feat changes #404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: michelle/de_checks
Are you sure you want to change the base?
Conversation
1f50514
to
a75a668
Compare
09ca021
to
cb77ff1
Compare
@@ -0,0 +1,406 @@ | |||
import argparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewer note: this will be removed before PR is merged.
@@ -0,0 +1,484 @@ | |||
# This tests that the perturbation prediection task with the new data formats matches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewer note: this will be removed before PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments. Will test your branch later today and add additional feedback.
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
self.adata.var.index = model_adata.var.index | ||
|
||
# Apply cell barcode ordering | ||
self.adata.uns["cell_barcode_index"] = model_adata.obs.index.astype(str).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check for the existence of this key when the file is opened. Please add the expectations for data to the planned documentation updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added and will add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving as a reminder for the documentation addition
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
pred_lfc = cell_representation[np.ix_(condition_idx, gene_indices)].mean( | ||
axis=0 | ||
) - cell_representation[np.ix_(control_idx, gene_indices)].mean(axis=0) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here to me to ensure we check that the input data do not look like counts (i.e. no fractional components)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this may be something for you to pick up if you have time. Let's discuss on Friday after I speak with Laksshman today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment on this here: https://github.com/chanzuckerberg/cz-benchmarks/pull/404/files#r2372452575
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it!
default=0.55, | ||
help="Minimum standardized mean difference for DE filtering (used when --metric=t-test)", | ||
) | ||
parser.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the help under "metric", could we list the two possibilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, all default values should match what the default is set to in the respective method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure the values match the defaults
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percent genes to mask does not mask (that's the one I was looking at when I wrote this). The rest are indeed idential. In the dataset class:
percent_genes_to_mask: float = 0.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, metric should be removed as an arg from the script -- it's been deleted from the dataset/task since there is only one possibility right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
"The file should have: cell representations in .X, gene names in .var.index, " | ||
"and cell identifiers in .obs.index. " | ||
"The gene names and cell identifiers should match the task input, although the ordering does not need to be the same.", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 113 notes
TODO: Once PR 381 is merged, use the new load_local_dataset function
PR 381 has been merged. Can this be done or should this comment be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed, thanks for catching it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the resolution here? Is it not possible to use the new function?
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
This creates a PerturbationExpressionPredictionTaskInput from stored files, | ||
allowing the task to be instantiated without going through the full dataset | ||
loading process. | ||
Load perturbation task inputs from saved separate files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this function now that the output artifacts have been simplified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be useful to fully process the dataset, then to run the tasks. (This saves time, and ensures consistency if there's random sampling)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. If we're going to keep it, let's do a quick check for "cell_barcode_condition_index"
in adata.uns since it's a direct input to the task class. (I realize it's also done in validate, but that won't catch this one.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Show resolved
Hide resolved
def __init__( | ||
self, | ||
metric: str = "wilcoxon", | ||
control_prefix: str = "non-targeting", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use nomenclature and default value from the dataset class:
control_name: str = "ctrl"
In the example script, we can use the hydra config values from the dataset yaml config to set control_prefix and condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look like it's non-targeting everywhere including in the dataset.yaml
src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py
Outdated
Show resolved
Hide resolved
Validates the following: | ||
- Condition format must be one of: | ||
- ``{control_name}`` or ``{control_name}_{perturb}`` for matched control samples. | ||
- ``{control_name}_{perturb}`` for matched control samples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, this should run on the data before it's control matched too, so I think we need to leave the ability to match against just the control_name and update the end of this to say "unmatched or matched control samples". Does that sound right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
src/czbenchmarks/tasks/utils.py
Outdated
return adata.obsm[obsm_key] | ||
|
||
|
||
def guess_is_lognorm( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thought of this -- it looks very similar to the other repo, including the title. If so we will need to do SWIPAT checks before release, which would impact our release. I think there are other ways to do this check that might not require that (and we should change the function name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it up!
run: | | ||
echo "VERSION=$(uv version --short)" >> $GITHUB_OUTPUT | ||
- name: Display version being published |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also looks like the merge didn't quite work -- I had this issue with my PR too. I'd merged, but for some reason github didn't detect it. I had to do the merge within the PR on GitHub even though it was trivial.
Changes in the dataset and the task features.