DO NOT MERGE tracking commits to equivalency testing files #377

mlgill · 2025-08-19T23:20:07Z

PR for tracking efforts to validate the perturbation task. For review by @polinabinder1 only.
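
The equivalency checks tracked here come down to comparing the metric values from a reference run against those from the refactored task. Below is a minimal sketch of such a comparison. `compare_metrics` and its tolerance defaults are hypothetical and are not part of czbenchmarks.

```python
import math
from typing import Dict, Mapping, Optional, Tuple

# (reference_value, candidate_value); None marks a metric missing on that side.
Mismatch = Tuple[Optional[float], Optional[float]]


def compare_metrics(
    reference: Mapping[str, float],
    candidate: Mapping[str, float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> Dict[str, Mismatch]:
    """Hypothetical equivalency check between two runs of the same task.

    Returns {metric_name: (reference_value, candidate_value)} for every metric
    that is missing from one side or differs beyond the given tolerances.
    An empty dict means the two runs are equivalent.
    """
    mismatches: Dict[str, Mismatch] = {}
    for name in sorted(set(reference) | set(candidate)):
        ref = reference.get(name)
        new = candidate.get(name)
        if ref is None or new is None:
            # Metric reported by only one of the two implementations.
            mismatches[name] = (ref, new)
        elif not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
            # Present in both, but outside tolerance (NaN never compares close).
            mismatches[name] = (ref, new)
    return mismatches
```

The sketch compares within a tolerance instead of requiring exact equality, because refactors often change floating-point evaluation order.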

* WIP data stuff
* Types cleanup
* First pass of datasets proposal
* Cleanup of dataset files
* All tasks except perturbation prediction
* Updates
* Add simple caching method
* Update note
* Ruff
* Remove datatype enum and datatypespec
* Add simple Embedding type
* Remove unload_data
* WIP decoupling all anndata references
* Move random seed to top level constants file
* Type conversion
* Fix imports, purge DataType
* More import purges
* Add back some base dataset validation
* Base dataset tests pass
* Single cell datasets pass
* Dataset tests all pass
* Working on task tests
* WIP task tests
* WIP
* More cleanup
* Note
* One task test bug fixed
* WIP
* WIP
* One more test fixed, two minor test issues remain
* Merge geneexpression type with embedding
* Improve arg name and error messages
* Fixing function calls
* Cross species test fixed, one test remains
* WIP perturb prediction task
* Remove unused conftest from tasks
* Oops
* Update src/czbenchmarks/tasks/base.py
  Co-authored-by: Andrew Tolopko <[email protected]>
* Add cli todo
* FIXME for task.display_name
* Update type to CellRepresentation
* Improve error message in perturbseq dataset
* Validation for multiple organisms
* Bug fix
* Remove debugging code
* Add a fixme note
* Perturbation test passes ... finally! 🥳🫠
* Cleanup task inputs
* Unused fixtures
* Final changes
* Add fixme
---------
Co-authored-by: Andrew Tolopko <[email protected]>

* BaseDataset updates
  - rename to Dataset
  - new scaffolding to serialize task-specific inputs from dataset:
  - constructor takes a dataset_type_name param, used for output dir path during serialization
  - constructor requires a Path type (not str)
  - replace cache_data() with store_task_inputs()
* Change single cell dataset class hierarchy
  - SingleCellDataset is now base class of SingleCellLabeledDataset and SingleCellPerturbationDataset.
  - These two subclasses are now independent and enforce different schemas on the AnnData object.
  - SingleCellPerturbationDataset no longer requires "cell_type" (label) column in AnnData object.
  - SingleCellLabeledDataset takes label_column_key parameter in constructor, defaulting to "cell_type"
  - One dataset class per module
* Dataset validation refactoring and test refactoring
  - move dataset path validation to constructor, as it is too late to check this in _validate()
  - pull up common validations from SingleCellLabeledDataset into SingleCellDataset
  - have concrete dataset class tests all run parent class tests, to ensure concrete classes conform to the parent class validations
  - wrap dataset tests into test classes to support this test inheritance scheme
* Update example.py
* Initial updates to datasets docs (wip)
* Types refactoring
  - move CellRepresentation to czb.tasks.types
  - move ListLike to czb.types
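
Here is a rough sketch of the dataset validation hierarchy described above, written with plain Python classes. It is a hypothetical, simplified illustration and not the czbenchmarks code. A dict of column lists stands in for the AnnData `obs` table, the non-empty check in `SingleCellDataset._validate()` is a placeholder for the real shared validations, the `Dataset` base class is left out, and perturbation-specific checks are omitted.

```python
from pathlib import Path
from typing import Dict, List


class SingleCellDataset:
    """Hypothetical base class holding the checks shared by all single-cell datasets."""

    def __init__(self, path: Path, obs: Dict[str, List[str]]) -> None:
        # The constructor requires a Path, not a str.
        if not isinstance(path, Path):
            raise TypeError("path must be a pathlib.Path")
        # Path validation happens eagerly here; _validate() would be too late.
        if not path.exists():
            raise FileNotFoundError(path)
        self.path = path
        self.obs = obs  # stand-in for AnnData.obs: column name -> values

    def _validate(self) -> None:
        # Placeholder for the common validations pulled up from the subclasses.
        if not self.obs:
            raise ValueError("obs must contain at least one column")


class SingleCellLabeledDataset(SingleCellDataset):
    def __init__(
        self, path: Path, obs: Dict[str, List[str]], label_column_key: str = "cell_type"
    ) -> None:
        super().__init__(path, obs)
        self.label_column_key = label_column_key

    def _validate(self) -> None:
        super()._validate()  # always run the parent-class checks first
        if self.label_column_key not in self.obs:
            raise ValueError(f"missing label column {self.label_column_key!r}")


class SingleCellPerturbationDataset(SingleCellDataset):
    """Does not require a label column; perturbation-specific checks are not shown."""
```

Because each subclass calls `super()._validate()`, any test that exercises the parent-class checks also applies to every concrete subclass. The test refactoring above relies on this same idea.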

- Remove "Base" prefixes from BaseTask → Task and validator classes
- Rename BaseDatasetValidator → DatasetValidator and BaseSingleCellLabeledValidator → SingleCellLabeledValidator
- Rename base.py → task.py to match the Task class name
- Update all imports and references throughout the codebase
- Add Task export to package-level __init__.py for easier importing
- Update documentation to reflect all changes
- Regenerate autoapi documentation

* Remove asserts
* Fix logic
* REVERT ME: remove excess tests
* Pydantic refactor: base class and clustering
* Pydantic embedding task
* Pydantic metadata integration
* Pydantic metadata prediction
* Pydantic cross species
* Pydantic perturbation
* Clean up imports
* Revert "REVERT ME: remove excess tests"
  This reverts commit 09ab74f.
* Remove var from clustering task
* Change set_baseline -> compute_baseline
* Move all task display_name instance variables to class attribute
* Remove outdated comment about cache PR
* Evaluate changing ListLike type hints to collections.abc.Sequence
* Fix type issue
* Comment clean up
* pydantic class for task output
* Lint code
* Fix cli issue
* Fix accidental change
* update test readme
* feature: datasets write task inputs to files (#299)
* BaseDataset updates
  - rename to Dataset
  - new scaffolding to serialize task-specific inputs from dataset:
  - constructor takes a dataset_type_name param, used for output dir path during serialization
  - constructor requires a Path type (not str)
  - replace cache_data() with store_task_inputs()
* Change single cell dataset class hierarchy
  - SingleCellDataset is now base class of SingleCellLabeledDataset and SingleCellPerturbationDataset.
  - These two subclasses are now independent and enforce different schemas on the AnnData object.
  - SingleCellPerturbationDataset no longer requires "cell_type" (label) column in AnnData object.
  - SingleCellLabeledDataset takes label_column_key parameter in constructor, defaulting to "cell_type"
  - One dataset class per module
* Dataset validation refactoring and test refactoring
  - move dataset path validation to constructor, as it is too late to check this in _validate()
  - pull up common validations from SingleCellLabeledDataset into SingleCellDataset
  - have concrete dataset class tests all run parent class tests, to ensure concrete classes conform to the parent class validations
  - wrap dataset tests into test classes to support this test inheritance scheme
* Update example.py
* Initial updates to datasets docs (wip)
* Types refactoring
  - move CellRepresentation to czb.tasks.types
  - move ListLike to czb.types
* Fix merge issue. Tests pass.
* Rename in progress
* Fix imports
* Fix list type
* Fixes and cleanup
* Removed metricinput dataclass
* Formatting
* AutoAPI build issue from a different PR.
* feat: Update example.py to use new task input classes (#319)
* rename BaseTask to Task and add to package exports
  - Rename BaseTask class to Task in tasks/base.py
  - Update all imports and inheritance references from BaseTask to Task
  - Add Task to tasks/__init__.py exports for easier importing
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* update documentation to reflect BaseTask → Task refactoring
  - Update all references from BaseTask to Task in developer guides
  - Fix broken links in datasets documentation
  - Regenerate autoapi documentation to reflect code changes
  - Update how-to guides and API reference documentation
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* rename validator classes to remove Base prefixes
  - Rename BaseDatasetValidator to DatasetValidator
  - Rename BaseSingleCellLabeledValidator to SingleCellLabeledValidator
  - Rename file base_dataset_validator.py to dataset_validator.py
  - Rename file base_single_cell_validator.py to single_cell_validator.py
  - Update all imports and references throughout codebase
  - Update package exports in validators/__init__.py
  - All tests passing (38 passed, 2 skipped)
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* rename base.py to task.py to match the Task class
  - Rename base.py to task.py in tasks directory
  - Update all imports from .base to .task throughout codebase
  - Update imports in tasks/__init__.py, cli/types.py, tests/utils.py
  - Update imports in all task implementation files
  - All tests passing (48 passed, 2 skipped, 14 deselected)
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* update documentation for base.py to task.py rename
  - Update developer_guides/tasks.md to reference task.py instead of base.py
  - Update how_to_guides/add_new_task.md to reference task.py instead of base.py
  - Update import statements from tasks.base to tasks.task
  - Update file path references from base.py to task.py
  - Regenerate autoapi documentation to reflect new file structure
  - All task classes now correctly inherit from czbenchmarks.tasks.task.Task
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* Update example.py to use new TaskInput structure
  - Add imports for TaskInput classes from all three tasks
  - Replace old task_kwargs/metric_kwargs pattern with proper TaskInput objects
  - Update ClusteringTask to use ClusteringTaskInput
  - Update EmbeddingTask to use EmbeddingTaskInput
  - Update MetadataLabelPredictionTask to use MetadataLabelPredictionTaskInput
  - Combine all results into single dictionary and output as formatted JSON
  🤖 Generated with [Claude Code](https://claude.ai/code)
  Co-Authored-By: Claude <[email protected]>
* use underscore for unused task input param names
* add more tasks to example.py
* feat: Remove Base prefixes from classes and rename files (#318)
  - Remove "Base" prefixes from BaseTask → Task and validator classes
  - Rename BaseDatasetValidator → DatasetValidator and BaseSingleCellLabeledValidator → SingleCellLabeledValidator
  - Rename base.py → task.py to match the Task class name
  - Update all imports and references throughout the codebase
  - Add Task export to package-level __init__.py for easier importing
  - Update documentation to reflect all changes
  - Regenerate autoapi documentation
* ruff
---------
Co-authored-by: Claude <[email protected]>
---------
Co-authored-by: Andrew Tolopko <[email protected]>
Co-authored-by: Claude <[email protected]>

- Add regression tests for the clustering, embedding, label prediction, and cross-species tasks (perturbation skipped, since it has not been validated by compbio). The tests use real, pre-generated model embeddings as task inputs and compare against real, previously validated results.
- Add end-to-end integration test (loading dataset and running tasks)
- Use @pytest.mark.integration for selective test execution
- Update example.py to use new TaskInput structure with proper baseline workflow
- Update test README
- Ignore cli for code coverage

- Refactor and reorganize cli code
- Disable the "run" command: running tasks on model outputs is only partially implemented, so for now tasks can only be run from Python code
- Registry-based cli
- Remove unused file_cache.py module
- Fancy output for cli list command
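
In a registry-based CLI, a decorator usually maps names to classes, and commands such as `list` read from that registry. The sketch below assumes this pattern. `TASK_REGISTRY`, `register_task`, `main`, and the stub task classes are all hypothetical and are not the czbenchmarks CLI. Only a `list` subcommand appears, since `run` is disabled.

```python
import argparse
from typing import Callable, Dict, List

# Hypothetical registry: task name -> task class.
TASK_REGISTRY: Dict[str, type] = {}


def register_task(name: str) -> Callable[[type], type]:
    """Class decorator that adds a task class to the registry under `name`."""

    def decorator(cls: type) -> type:
        if name in TASK_REGISTRY:
            raise ValueError(f"task {name!r} already registered")
        TASK_REGISTRY[name] = cls
        return cls

    return decorator


@register_task("clustering")
class ClusteringTask:
    """Placeholder stub; a real task would implement its run/compute logic."""


@register_task("embedding")
class EmbeddingTask:
    """Placeholder stub."""


def main(argv: List[str]) -> List[str]:
    parser = argparse.ArgumentParser()
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="list registered tasks")
    args = parser.parse_args(argv)
    if args.command == "list":
        # The CLI never hard-codes task names; it reads them from the registry.
        names = sorted(TASK_REGISTRY)
        for name in names:
            print(name)
        return names
    return []
```

Registering a new task makes it appear in `list` automatically, with no edits to the CLI code.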

…s for new perturb seq task (#364)

* Remove other perturb datasets and split key
* Temporarily deprecate older perturbseq datasets
* Ready to start testing
* Dataset directory update
* Debugging
* Dataset load_data runs to completion
* nit
* Formatting and improving pandas efficiency
* Update
* Fix utils tests
* WIP tests
* Additional working test
* No more split columns
* store_task_inputs works
* 10 genes
* Linting
* Fixing test
* Joblib runs
* Remove dual condition, validation works
* Tests pass
* Linting
* Unused imports causing linter failures
* Add example script
* update
* nit
* Docstring updates for errors
* Fix dimension
* Fixed docstring formatting in API
* Remove joblib since it doesn't accelerate results
* nit
* Add gene masking as an input param
* Cleanup joblib imports
* Update filters
* WIP updates to stored inputs
* Fixed tests
* Clean up FIXMEs
* Bug fix and expose min_de_genes
* Fix
* Expose them all
* FIXME
* Fix path
* Fix test
* Filter de_results
* Commit equivalency test to this PR
* equivalency test updates
* Add filter for DE, fixes comparison
* Feedback from Jasleen, Part 1/2
* control cells list order not being preserved
* Add ability to load in backed mode
* Ready for benchmarking
* default wilcoxon
* Fixed
* Cleanup
* Formatting
* Ruff
* Hard code parameters
* Add test
* Improve tests
* Bug fix
* gene --> condition
* Ruff
* Remove testing files, cleanup single cell dataset class comments
* Clean up utils
* Updates
* Requested comment
* Ruff
* Ruff
* Update variable name
* Fix tests
* uv
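
Several commits above filter differential-expression (DE) results and expose `min_de_genes`. The following is a hypothetical sketch of such a filter. It assumes each DE result row has `condition`, `gene`, `pval_adj`, and `logfoldchange` fields. The function name, field names, and default thresholds are assumptions, and the statistical test (for example Wilcoxon) that would produce the rows is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence


def filter_de_results(
    de_results: Sequence[Mapping],
    pval_threshold: float = 0.05,      # assumed default
    min_logfoldchange: float = 1.0,    # assumed default
    min_de_genes: int = 5,             # assumed default
) -> Dict[str, List[str]]:
    """Return {condition: [genes passing the DE filters]}.

    Conditions with fewer than `min_de_genes` passing genes are dropped,
    so downstream comparisons only use conditions with enough signal.
    """
    passing: Dict[str, List[str]] = defaultdict(list)
    for row in de_results:
        significant = row["pval_adj"] <= pval_threshold
        large_effect = abs(row["logfoldchange"]) >= min_logfoldchange
        if significant and large_effect:
            passing[row["condition"]].append(row["gene"])
    # Dicts keep insertion order, so conditions stay in first-seen order.
    return {cond: genes for cond, genes in passing.items() if len(genes) >= min_de_genes}
```

Applying the same filter to both sides before comparing keeps an equivalency check from flagging conditions that only differ in genes below threshold.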

mlgill and others added 30 commits June 20, 2025 11:25

feat: Remove Models and Docker code

e2804c4

removed cli and runner

5a69cc8

removed cli test cases

f34ff6f

remove build and docs for model

4b40c90

removed scripts

92d3626

Merge branch 'main' into v1.0-byomodel

0acfbcc

more removal of model-related docs (#304)

f2b3c22

Addressed review comments and code formatting

115d577

restore CLI as is - not working

0c5a3d6

CLI list working. Model execution commented

7685352

cli - removed model execution related commented code

29a1c9c

Merged v1.0-byomodel-remove-models into v1.0-byomodel, resolved conflicts

b8d169c

minor updates in index.rst

75f66f9

feat: 308 improve usability of datasets (#311)

425fd2d

* datasets.utils.load_dataset complete
* reuse existing s3 remote download code
* cache implemented; created file_utils with cache support
---------
Co-authored-by: atolopko <[email protected]>
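
Here is a minimal sketch of a download cache like the one described above, which wraps existing remote-download code. `cached_download` and its `fetch` callback are hypothetical and are not the czbenchmarks `file_utils` API. The real S3 download code would be passed in as `fetch`.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_download(
    remote_url: str,
    cache_dir: Path,
    fetch: Callable[[str, Path], None],
) -> Path:
    """Return a local path for `remote_url`, calling `fetch` only on a cache miss.

    `fetch(url, dest)` performs the actual download (e.g. existing S3 code);
    it is injected so the caching logic can be tested without network access.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on a hash of the URL, keeping the original file name readable.
    key = hashlib.sha256(remote_url.encode("utf-8")).hexdigest()[:16]
    file_name = remote_url.rstrip("/").rsplit("/", 1)[-1]
    local_path = cache_dir / f"{key}_{file_name}"
    if not local_path.exists():
        tmp_path = local_path.with_name(local_path.name + ".part")
        fetch(remote_url, tmp_path)
        # Rename only after fetch returns, so an interrupted download is never
        # mistaken for a cached file.
        tmp_path.replace(local_path)
    return local_path
```

Downloading to a temporary name and then renaming is a common way to keep partial files out of the cache.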

Align model inputs and outputs (#327)

b3c5f76

chore: reverted #307 (#336)

68ef972

* reverted #307

docs: examples updated, added model fine tuning (#330)

c662dc0

Created 2 new example notebooks for the bring-your-own-model use case:
- Basic use with model output embeddings
- Use with external model
- Use with external model with fine tuning

docs: verified and updated docstrings for accuracy and better API documentation (#332)

ae22852

Updated doc strings for accuracy and better API documentation.

feat: Added few test cases. Validated existing test cases for new code changes (#338)

428b55b

* added new test cases. Validated existing test cases for new code changes

docs: docs updated for byomodel release (#337)

6dfa716

* docs: docs updated for byomodel release

docs: 340 add update notebook examples (#341)

5b34f8f

Example Notebooks for using czbenchmarks

Remove other perturb datasets and split key

94be5f4

Temporarily deprecate older perturbseq datasets

5f3078a

mlgill and others added 28 commits August 19, 2025 16:33

array fix

3557998

fixes

83a7c8d

deleting file

0a9d0b3

Merge branch 'feat/perturbation_task' into feat/perturbation_task_validation

59759d6

lint

1f11a16

lint fix2

01efdcf

remove extra tests

beac74c

adding text to example

0e47ed8

small fix

dac36b9

dataset comparisons

ce0da99

small test fix

3e56862

Merge branch 'feat/perturbation_task' into feat/perturbation_task_validation

31467eb

lint

e6ffe0e

Fix uv.lock

7cbe8de

fixes to test file names

1cc7f4d

some fixes

311c0e4

remove uv.lock

299b817

Merge branch 'feat/perturbation_task' into feat/perturbation_task_validation

19c6c7c

metric changes

d8fafa5

Merge branch 'feat/perturbation_task' into feat/perturbation_task_validation

8a243cb

tests run

4363e1f

tests run

ae45525

Integration test for SingleCellPerturbationDataset and `PerturbationExpressionPredictionTask` (#379)

b8a36c3

* Draft test
* Update name
* cleanup
* ruff

Update examples/example_perturbation_expression_prediction.py

3cb9ccc

Co-authored-by: Andrew Tolopko <[email protected]>

Update src/czbenchmarks/tasks/single_cell/perturbation_expression_prediction.py

72e8a2a

Co-authored-by: Andrew Tolopko <[email protected]>

addressing PR

de6961a

Merge branch 'feat/perturbation_task' into feat/perturbation_task_validation

d4a1349

fixing task

c833140

Base automatically changed from feat/perturbation_task to v1.0-byomodel August 23, 2025 00:09

Base automatically changed from v1.0-byomodel to main September 16, 2025 20:26
