polinabinder1
Collaborator

Changes in the dataset and the task features.

@polinabinder1 polinabinder1 changed the base branch from main to michelle/de_checks September 16, 2025 19:08
@polinabinder1 polinabinder1 force-pushed the pbinder/task_feat_changes branch from 1f50514 to a75a668 Compare September 17, 2025 19:33
@polinabinder1 polinabinder1 force-pushed the pbinder/task_feat_changes branch from 09ca021 to cb77ff1 Compare September 18, 2025 04:18
@@ -0,0 +1,406 @@
import argparse
Collaborator

Reviewer note: this will be removed before PR is merged.

@@ -0,0 +1,484 @@
# This tests that the perturbation prediction task with the new data formats matches
Collaborator

Reviewer note: this will be removed before PR is merged.

Collaborator

@mlgill mlgill left a comment

Some initial comments. Will test your branch later today and add additional feedback.

self.adata.var.index = model_adata.var.index

# Apply cell barcode ordering
self.adata.uns["cell_barcode_index"] = model_adata.obs.index.astype(str).values
Collaborator

Should check for the existence of this key when the file is opened. Please add the expectations for data to the planned documentation updates.

Collaborator Author

added and will add

Collaborator

Unresolving as a reminder for the documentation addition

pred_lfc = cell_representation[np.ix_(condition_idx, gene_indices)].mean(
    axis=0
) - cell_representation[np.ix_(control_idx, gene_indices)].mean(axis=0)

Collaborator

Can you add a comment here to me to ensure we check that the input data do not look like counts (i.e. no fractional components)?

Collaborator

Actually, this may be something for you to pick up if you have time. Let's discuss on Friday after I speak with Laksshman today.

Collaborator Author

changed it!
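
For reference, the count-likeness check discussed in this thread could be sketched as follows. This is only an illustration under assumptions: `looks_like_raw_counts` is a hypothetical helper, not the function added in this PR, and it assumes a dense array (for a sparse matrix one would pass its stored values instead).

```python
import numpy as np


def looks_like_raw_counts(values, atol=1e-8):
    """Hypothetical helper: True if every entry is a non-negative whole number."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0 or np.any(arr < 0):
        return False
    # Raw counts have no fractional component; normalized/log data usually does.
    return bool(np.all(np.abs(arr - np.round(arr)) <= atol))


# Toy data: integer counts and a log-normalized version of the same matrix.
counts = np.array([[0, 3, 1], [2, 0, 5]])
lognorm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * 1e4)

if looks_like_raw_counts(counts):
    print("Input looks like raw counts; expected normalized values.")
```

A check like this is a heuristic only: normalized data that happens to be integer-valued would still be flagged.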

default=0.55,
help="Minimum standardized mean difference for DE filtering (used when --metric=t-test)",
)
parser.add_argument(
Collaborator

For the help under "metric", could we list the two possibilities?

Collaborator

Also, all default values should match what the default is set to in the respective method.

Collaborator Author

added

Collaborator Author

I'm pretty sure the values match the defaults

Collaborator

Percent genes to mask does not match (that's the one I was looking at when I wrote this). The rest are indeed identical. In the dataset class:
percent_genes_to_mask: float = 0.5

Collaborator

@mlgill mlgill Sep 23, 2025

Also, metric should be removed as an arg from the script -- it's been deleted from the dataset/task since there is only one possibility right now.

Collaborator Author

done!
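
One way to keep script defaults from drifting away from the dataset defaults, as raised above, is to build the parser from a single source of truth. The sketch below is an assumption-laden illustration: `DatasetDefaults` is a hypothetical stand-in for the dataset class, and `min_smd` is a made-up field name; only the values 0.5 and 0.55 come from this thread.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DatasetDefaults:
    """Hypothetical stand-in for the dataset class's keyword defaults."""

    percent_genes_to_mask: float = 0.5
    min_smd: float = 0.55  # hypothetical name for the SMD threshold


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Derive each CLI flag and its default from the dataclass field,
    # so the script cannot silently disagree with the dataset class.
    for f in fields(DatasetDefaults):
        parser.add_argument(
            f"--{f.name.replace('_', '-')}",
            type=f.type,
            default=f.default,
            help=f"(default: {f.default})",
        )
    return parser


args = build_parser().parse_args([])
print(args.percent_genes_to_mask, args.min_smd)
```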

"The file should have: cell representations in .X, gene names in .var.index, "
"and cell identifiers in .obs.index. "
"The gene names and cell identifiers should match the task input, although the ordering does not need to be the same.",
)
Collaborator

@mlgill mlgill Sep 19, 2025

Line 113 notes

TODO: Once PR 381 is merged, use the new load_local_dataset function

PR 381 has been merged. Can this be done or should this comment be removed?

Collaborator Author

removed, thanks for catching it.

Collaborator

What was the resolution here? Is it not possible to use the new function?

This creates a PerturbationExpressionPredictionTaskInput from stored files,
allowing the task to be instantiated without going through the full dataset
loading process.
Load perturbation task inputs from saved separate files.
Collaborator

Do we still need this function now that the output artifacts have been simplified?

Collaborator Author

It could be useful to fully process the dataset, then to run the tasks. (This saves time, and ensures consistency if there's random sampling)

Collaborator

Cool. If we're going to keep it, let's do a quick check for "cell_barcode_condition_index" in adata.uns since it's a direct input to the task class. (I realize it's also done in validate, but that won't catch this one.)

Collaborator Author

added
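
A minimal sketch of the kind of key check requested above, assuming `adata.uns` behaves like a mapping. `require_uns_keys` is a hypothetical helper name, not the code added in this PR; the key name is the one quoted in the review comment.

```python
from typing import Iterable, Mapping


def require_uns_keys(uns: Mapping, required: Iterable[str]) -> None:
    """Hypothetical helper: fail early if required keys are absent from adata.uns."""
    missing = [key for key in required if key not in uns]
    if missing:
        raise ValueError(
            f"adata.uns is missing required key(s): {', '.join(missing)}"
        )


# A plain dict stands in for adata.uns here.
uns = {"cell_barcode_condition_index": ["AAAC-1", "AAAG-1"]}
require_uns_keys(uns, ["cell_barcode_condition_index"])  # passes silently
```

Running the check when the file is loaded surfaces a missing key with a clear message, rather than as a `KeyError` later inside the task.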

def __init__(
    self,
    metric: str = "wilcoxon",
    control_prefix: str = "non-targeting",
Collaborator

@mlgill mlgill Sep 19, 2025

Please use nomenclature and default value from the dataset class:

control_name: str = "ctrl"

https://github.com/chanzuckerberg/cz-benchmarks/blob/main/src/czbenchmarks/datasets/single_cell_perturbation.py#L92

In the example script, we can use the hydra config values from the dataset yaml config to set control_prefix and condition.

Collaborator Author

Looks like it's non-targeting everywhere, including in the dataset.yaml

Validates the following:
- Condition format must be one of:
- ``{control_name}`` or ``{control_name}_{perturb}`` for matched control samples.
- ``{control_name}_{perturb}`` for matched control samples.
Collaborator

IIRC, this should run on the data before it's control matched too, so I think we need to leave the ability to match against just the control_name and update the end of this to say "unmatched or matched control samples". Does that sound right?

Collaborator Author

fixed it
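
To illustrate the two control-label forms discussed above (unmatched `{control_name}` and matched `{control_name}_{perturb}`), here is a rough sketch. `is_control_label` is a hypothetical helper and covers only control labels; how the actual validator treats perturbed-cell labels is not shown in this thread.

```python
import re


def control_label_pattern(control_name: str) -> re.Pattern:
    """Match '{control_name}' or '{control_name}_{perturb}' (hypothetical helper)."""
    return re.compile(rf"{re.escape(control_name)}(?:_(?P<perturb>.+))?")


def is_control_label(label: str, control_name: str = "non-targeting") -> bool:
    # fullmatch requires the whole label to fit one of the two forms.
    return control_label_pattern(control_name).fullmatch(label) is not None


print(is_control_label("non-targeting"))        # unmatched control
print(is_control_label("non-targeting_GENE1"))  # control matched to a perturbation
print(is_control_label("GENE1"))                # not a control label
```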

return adata.obsm[obsm_key]


def guess_is_lognorm(
Collaborator

Just thought of this -- it looks very similar to the other repo, including the title. If so we will need to do SWIPAT checks before release, which would impact our release. I think there are other ways to do this check that might not require that (and we should change the function name).

Collaborator Author

changed

Collaborator Author

changed it up!

run: |
echo "VERSION=$(uv version --short)" >> $GITHUB_OUTPUT
- name: Display version being published
Collaborator

Also looks like the merge didn't quite work -- I had this issue with my PR too. I'd merged, but for some reason github didn't detect it. I had to do the merge within the PR on GitHub even though it was trivial.

2 participants