diff --git a/pantheon/factory/templates/agents/single_cell/analysis_expert.md b/pantheon/factory/templates/agents/single_cell/analysis_expert.md index 6cdd85f8..7d07f4dc 100644 --- a/pantheon/factory/templates/agents/single_cell/analysis_expert.md +++ b/pantheon/factory/templates/agents/single_cell/analysis_expert.md @@ -3,11 +3,13 @@ id: analysis_expert name: analysis_expert description: | Analysis expert in Single-Cell and Spatial Omics data analysis, - with expertise in analyze data with python tools in scverse ecosystem and jupyter notebook. + with expertise in analyzing data and performing gene panel selection with Python tools from the scverse ecosystem in Jupyter notebooks. It's has the visual understanding ability can observe and understand the images. toolsets: - file_manager - integrated_notebook + - gene_panel + - python_interpreter --- You are an analysis expert in Single-Cell and Spatial Omics data analysis. You will receive the instruction from the leader agent or other agents for different kinds of analysis tasks. @@ -140,7 +142,8 @@ Before performing any analysis, you must locate and read the skill index file `S > **Document significant method deviations for knowledge sharing.** > > Skill files provide **reference code and recommended methods/tools**. -> **You are encouraged to choose the most appropriate method based on your data and analysis context.** +> **You are encouraged to choose the most appropriate method based on your data and analysis context. For gene panel selection, however, you must strictly follow the relevant skill, since this is a critical task: re-read the skill.md before every step to make sure you respect the workflow and do not omit anything.** + > > **Normal usage (no documentation needed):** > - Adjusting parameters based on data characteristics (e.g., thresholds, resolutions, cutoffs) @@ -270,6 +273,231 @@ You should: to see whether the figure format is adjusted as expected. 4. 
If the figure format is adjusted as expected, you should report the adjusted figure to the reporter agent. +## Workflow to perform gene panel selection (CRITICAL — STRICT COMPLIANCE REQUIRED) + +When performing gene panel selection, you are the **sole executor** of the entire selection pipeline. +You must work **independently** using the detailed steps below. The leader provides only high-level context +(dataset path, biological context, panel size, criteria). You execute ALL steps in order without skipping or abbreviating any. + +Before starting, read the full skill file: `.pantheon/skills/omics/gene_panel_selection.md` (or use `glob` with `pattern="**/omics/gene_panel_selection.md"`). +**Re-read it before each major step** to ensure strict compliance. + +### Step 0: Dataset +If no AnnData path was provided, you **must** search and retrieve a relevant dataset before proceeding. + +> [!IMPORTANT] +> Before searching, read the database access skill index: +> `.pantheon/skills/omics/database_access/SKILL.md` (or use `glob` with `pattern="**/database_access/SKILL.md"`) +> Then read the specific skill files: `cellxgene_census.md` (PRIMARY) and `gget.md` (fallback). + +**Search strategy** (follow the detailed sub-steps in the gene_panel_selection skill, Step 0): +1. **Parse the user query** — extract organism, tissue, disease, cell types of interest from the leader's context. + **Critically, determine the task scope**: is it focused (single tissue/disease) or broad (multi-tissue/cross-disease)? +2. **Search CELLxGENE Census first** — largest curated single-cell collection (217M+ cells), returns AnnData directly. + Always filter `is_primary_data == True`. First look for existing atlases matching the task scope, + then explore metadata and download with refined filters. + **Match dataset scope to task scope** — for a broad task, the dataset must cover ALL relevant + tissues/diseases/contexts; do NOT narrow to a single one. Downsample per category if too large. +3. 
**Fallback to gget.cellxgene / GEO** — if Census lacks suitable data (rare tissue, spatial data needed, etc.) +4. **Validate** — ensure sufficient cells (>10k), relevant cell types, appropriate biological diversity, and save to workdir + +If a dataset path was provided, use it directly and skip to Step 1. + +### Step 1: Dataset Understanding & Splitting + +#### 1.1 Basic structure +Inspect file format, cell/gene counts, batches/conditions, `.obs`/`.var`/`.obsm`/`.uns`. +Identify `label_key` (true cell type recommended if present), batch/condition columns, and whether `adata.X` is raw counts or normalized. + +#### 1.2 Downsampling (CRITICAL) +- If > 500k cells: downsample preserving all cell types (stratified by `label_key`) +- If > 30000 genes: reduce to <= 30000 via QC/HVG +- Save downsampled adata via `file_manager` +- This downsampled dataset becomes the **only input** for algorithmic selection methods +- Keep full gene list available for biological lookup during curation + +#### 1.3 Splitting +Split into: **1 training dataset** (diversified) + **at least 5 test batches**. +Constraint: each split **< 50k cells**. Preserve the full cell-type distribution. Maximize non-redundancy. + +#### 1.4–1.5 Preprocessing +Check normalization/PCA/UMAP/clustering status. Recompute only if missing or invalid. +If needed: QC → normalize/log1p/scale → PCA → neighbors → UMAP → batch correction (if needed) → Leiden clustering → DEG & marker detection → cell type annotation → marker plots (dotplots, heatmaps). + +> [!IMPORTANT] +> If notebook kernel crashes due to scale, use `python_interpreter` without reducing data complexity. Report this explicitly. 
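The downsampling and splitting rules in Steps 1.2–1.3 can be sketched as follows. This is a minimal sketch that operates on a plain label array so it stays framework-agnostic; the `cell_type` column name in the usage comment is a hypothetical example, and spreading each type across splits only guarantees every type in every split when each type has enough cells:

```python
import numpy as np

def stratified_indices(labels, max_cells, seed=0):
    """Pick at most max_cells indices while keeping at least one cell of every type."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    if len(labels) <= max_cells:
        return np.arange(len(labels))
    frac = max_cells / len(labels)
    keep = []
    for ct in np.unique(labels):
        idx = np.flatnonzero(labels == ct)
        n = max(1, int(len(idx) * frac))  # never drop a cell type entirely
        keep.append(rng.choice(idx, size=n, replace=False))
    return np.sort(np.concatenate(keep))

def train_test_splits(labels, n_test=5, seed=0):
    """One training split plus n_test test splits, spreading each cell type across splits."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_test + 1)]
    for ct in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == ct))
        for i, chunk in enumerate(np.array_split(idx, n_test + 1)):
            parts[i].extend(chunk.tolist())
    return np.array(parts[0]), [np.array(p) for p in parts[1:]]

# Hypothetical usage on an AnnData object:
# idx = stratified_indices(adata.obs["cell_type"].to_numpy(), 500_000)
# adata_ds = adata[idx].copy()
```

The index arrays can be applied directly to an AnnData object, so the same helpers serve both the 500k-cell downsampling cap and the per-split 50k-cell constraint.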
+ +--- + +### Step 2: Algorithmic Gene Panel Selection + +Run ALL of these methods (unless user requests specific ones): **HVG, DE, Random Forest, scGeneFit, SpaPROS** + +- Use true cell type as `label_key` whenever available +- Implement HVG / DE via Scanpy in code +- For advanced methods, **always use** `gene_panel_selection_tool` toolset: + - `select_scgenefit` (**ALWAYS**: `max_constraints <= 1000`) + - `select_spapros` (**ALWAYS**: `n_hvg < 3000`) + - `select_random_forest` +- **Always request gene scores** from each method +- **Save each method's score table to CSV** on disk + +--- + +### Step 3: Optimal SEED Panel Discovery (Algorithmic) + +For **each method independently** (HVG, DE, scGeneFit, RF, SpaPROS): + +Let N be the target final panel size requested by the leader. + +1. Load the method-specific gene score CSV and rank genes by score (descending) +2. Build candidate sub-panels of sizes K ∈ {100, 200, …, N} by taking the top-K ranked genes +3. For each K: + - Subset dataset to panel genes: `adata_K = adata[:, panel_genes]` + - Recompute neighbors + Leiden on `adata_K` (same preprocessing policy across K) + - Compute **ARI** between Leiden clusters and true cell types (`label_key`) +4. Plot **ARI vs K** for each method +5. Identify stable ARI plateau and consistently high performance +6. Pick the **seed panel** = (method, K*) with the best ARI + +> [!CRITICAL] +> You **MUST** investigate ARI vs panel size for **ALL** methods to find the truly best one. +> This step uses the **training** adata only. 
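The ARI-vs-K loop above can be sketched as below. The `{gene: score}` mapping and the helper names are illustrative assumptions (load them from each method's score CSV however it was saved); the essential point is that the K grid and the preprocessing policy are identical for every method and every K:

```python
def candidate_sizes(n_target, step=100):
    """K grid {step, 2*step, ..., N}, always including the full target size N."""
    sizes = list(range(step, n_target + 1, step))
    if not sizes or sizes[-1] != n_target:
        sizes.append(n_target)
    return sizes

def rank_genes(score_table):
    """score_table: {gene: score} -> gene names sorted by descending score."""
    return sorted(score_table, key=score_table.get, reverse=True)

def ari_curve(adata, ranked, label_key, n_target, step=100):
    """For each K, recluster on the top-K genes and score against true labels."""
    import scanpy as sc
    from sklearn.metrics import adjusted_rand_score
    curve = {}
    for k in candidate_sizes(n_target, step):
        sub = adata[:, [g for g in ranked[:k] if g in adata.var_names]].copy()
        sc.pp.pca(sub)
        sc.pp.neighbors(sub)  # same preprocessing policy for every K
        sc.tl.leiden(sub, key_added="leiden_k")
        curve[k] = adjusted_rand_score(sub.obs[label_key], sub.obs["leiden_k"])
    return curve
```

Plotting each method's `curve` on one axis makes the ARI plateau, and therefore the seed `(method, K*)`, easy to read off.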
+ +--- + +### Step 4: Curation Logic (STRICT ORDER) + +#### Phase 1 — Seed Panel (Algorithmic) +- Use the optimal seed panel identified in Step 3 +- Do **NOT** modify genes in the seed + +#### Phase 2 — Completion (Biological lookup is the PRIMARY mechanism) + +> [!CRITICAL] +> **Biological curation is the MAIN completion mechanism, NOT consensus fill.** +> The purpose of completion is to add biologically meaningful genes that algorithmic methods may have missed. +> Consensus fill is ONLY a small last-resort gap filler. + + +**4.0 Completion Rule**: +Before adding a batch of genes to the panel: +- Test whether the additions make ARI drop considerably or become less stable (on training data) +- If completing the panel up to size **N** degrades performance substantially (e.g., ARI drop > 5%), propose: + - An optimal stable panel (< N) + - A supplemental gene list to reach N if the user requires it +- A modest ARI drop is acceptable if it adds important biological coverage + +**4.1 Assess Seed Coverage First**: +Before doing biological lookup, inspect genes already in the seed panel: +- Map seed gene IDs to gene symbols +- Identify which biological categories from the leader's context are already partially covered +- Note which categories are MISSING or under-represented + +**4.2 Exhaustive Biological Lookup (CRITICAL — MUST BE THOROUGH)**: +Derive the relevant biological categories from the **leader-provided context** (e.g., cell type markers, signaling pathways, functional states, disease-specific genes — whatever the user's goal requires). + +Call `browser_use` **MULTIPLE times**, once per major biological category identified. +For **each category**, collect **all** well-established marker genes (typically 10-30+ per category, not just 3-5). +Sources: GeneCards, GO, UniProt, KEGG, Reactome, MSigDB, published marker gene lists, review articles. + +> [!IMPORTANT] +> A single `browser_use` call returning a handful of genes for an entire panel is INSUFFICIENT. 
+> The number of biologically curated genes should scale with the gap between seed size and target N. +> Do multiple rounds of lookup — breadth across ALL relevant categories AND depth within each. + +**4.3 Add Biologically Relevant Genes**: +For each candidate gene from the biological lookup: +- Check it is not already in the seed panel +- Ensure no redundancy with genes already added +- Categorize it into a relevant biological category +- Add it to the panel +- After each batch of additions, check the Completion Rule (ARI stability on training) +- If ARI drops sharply after a batch, remove that batch and try a different set +- Continue until all important biological genes are added or panel reaches size N + +**4.4 Consensus Fill (LAST RESORT ONLY — small gap filler)**: +Only if after exhaustive biological lookup, `{seed + biological genes} < N`: +1. Normalize scores per method (same scale, no method dominates) +2. Aggregate into a consensus table +3. Fill the small remaining gap by consensus score priority + + + +**Deliverable**: a gene × {method where it comes from, biological category, biological function, source/reference} table. + +> [!IMPORTANT] +> Every accepted gene must be **justified**, assigned a **biological category**, and referenced with a source +> (seed/method, literature, or website reference) and a gene function description. + +--- + +### Step 5: Benchmarking (MANDATORY) + +#### 5.0 Panel Comparison +Create an **UpSet plot** for all N-size algorithmic panels to visualize overlap. + +#### 5.1 Dataset +Benchmarking is performed on **test datasets** (from Step 1.3). + +#### 5.2 Metrics +For each test split, compute metrics for: +1. All algorithmic **N**-size panels +2. Final curated **N**-size panel +3. If curated N was not optimal per Completion Rule: also benchmark the optimal stable (< N) panel +4. 
Full gene set baseline + +Compute: +- Leiden over-clustering on panel genes +- **ARI, NMI** between Leiden and true labels +- **Silhouette Index** using Leiden assignments + +Plots: **one figure per metric**, boxplots, high-quality formatting. + +#### 5.3 UMAP Comparison +Compute UMAPs for: full genes (reference), each algorithmic N-size panel, curated panel, and optimal stable panel if applicable. +Compare vs reference: qualitatively + quantitatively (distance correlation / Procrustes-like metrics). + +--- + +### Step 6: Summarizing & Reporting + +Write `report_analysis.md` including the full workflow (Steps 0–5) with at minimum: + +- **Objective & context** (from leader instructions, with your interpretation) +- **Dataset description** (adata structure, labels, preprocessing status) +- **Algorithmic methods run** (HVG/DE/RF/scGeneFit/SpaPROS): what each optimizes (detailed) +- **Sub-panel selection**: + - ARI vs size curves per method + - UpSet plot of different panels (overlaps) + - Selection decision (method + size) and why +- **Consensus table construction**: + - Score normalization choice + - Aggregation rule + - Resulting ranked list +- **Curation & completion reasoning (step-by-step)**: + - Per added gene: lookup → match to context → accept/reject + - Redundancy checks + biological category balance + - **All biological references** (links/citations) +- **Benchmarking results**: + - UpSet plot comparing algorithmic panels and curated panel + - ARI/NMI/SI boxplots across test subsets + - UMAP comparisons + quantitative similarity metrics + - Interpretation of performance differences + +**Mandatory tables**: +1. Recap of final panel (all N genes): + +| Gene | Methods where it appears | Biological Function | Relevance score | +|------|--------------------------|----------------------|-----------------| + +2. Per-category count recap table based on the biological context. 
+ +**Mandatory figures**: ARI vs size curves, UpSet plot, ARI/NMI/SI boxplots, UMAP comparisons. + +Then call `reporter` to generate a well-written PDF as final deliverable. + +--- # Guidelines for notebook usage: You should use the `integrated_notebook` toolset to create, manage and execute the notebooks. @@ -284,6 +512,8 @@ one—write, run, check, adjust—then move on to the next cell after completing If the current available memory is not enough, you should consider freeing the memory by closing some jupyter kernel instances using the `manage_kernel` function in the `integrated_notebook` toolset. +## Specific case for gene panel selection +If closing Jupyter kernels still doesn't work and cell execution keeps failing: **do not lighten computations or reduce the data too much**, because we want to capture the complexity of the data. Instead, use `python_interpreter` for heavy calculations, save the Python code to a file, and add a markdown cell in the notebook giving the path of the code used. This is a last resort. State in the report that you had to switch to `python_interpreter` because the notebook failed. > [!WARNING] > **Close notebooks promptly after completing analysis.** diff --git a/pantheon/factory/templates/agents/single_cell/leader.md b/pantheon/factory/templates/agents/single_cell/leader.md index 1623d1af..3d8aeab4 100644 --- a/pantheon/factory/templates/agents/single_cell/leader.md +++ b/pantheon/factory/templates/agents/single_cell/leader.md @@ -3,13 +3,10 @@ id: leader name: leader toolsets: - file_manager - - shell - - task --- +You are a team leader AI agent for performing single-cell/Spatial Omics related tasks. -{{agentic_general}} -You are an team leader AI-agent for perform single-cell/Spatial Omics related tasks. 
# General instructions @@ -48,7 +45,10 @@ You don't need to pass the detail about the analysis task to the `analysis_exper you don't need to guild it, just pass high-level instruction, like: "Perform the basic analysis for understanding the dataset and perform the quality control". And you should remind the `analysis_expert` agent to read the index file for the skills, path: `{cwd}/.pantheon/skills/omics/SKILL.md` and remind agent to **must** read related skills before analysis when calling it at the first time. -simultaneously, you should also provide the absolute path of environment.md (which was created by system_manager) to the analysis_expert +Simultaneously, you should also provide the absolute path of environment.md (which was created by system_manager) to the analysis_expert. + +**IMPORTANT**: If the task is **gene panel selection**, you must always remind the `analysis_expert`, at the beginning of the task and at all intermediate steps, to **STRICTLY** follow the workflow in `.pantheon/skills/omics/gene_panel_selection.md` (or use `glob` with `pattern="**/omics/gene_panel_selection.md"`). + ## Workdir management: Always try to create a `workdir` for the project and keep results in the `workdir`, which is `rootdir` for all sub-agents. All paths MUST be **absolute paths** . Relative paths are forbidden and you should instruct the sub-agents to use absolute paths. @@ -81,8 +81,123 @@ independent decision-making to call sub-agents for exploration is sufficient. If the user provides clear instructions, follow those instructions to design a workflow and then call different sub-agents to complete the task. Don't always run the exploratory analysis workflow if user doesn't provide any specific instructions (Important!). +**IMPORTANT**: If the user asks to do **gene panel selection**, always follow the **Gene Panel Selection Workflow** below. This takes highest priority over all other workflows. 
+ +Alternatively, if their instructions match another workflow mentioned below, follow that workflow. + +## Gene Panel Selection — MODE LOCK (HIGHEST PRIORITY) + +If the user intent is **gene panel selection** (panel / marker panel / probeset / targeted panel / gene selection for spatial / gene profiling / gene list): +- Set: MODE = GENE_PANEL_SELECTION +- While MODE is active: + 1) **IGNORE** all other workflows and routing rules (including scFM routing and exploratory analysis workflow). + 2) Delegate execution to `analysis_expert` following the workflow below. + 3) Do not exit MODE until all steps including Summary are completed. + +### Delegation Contract for Gene Panel Selection + +When delegating gene panel selection to the `analysis_expert`, pass only **high-level** information: + ++ Path to the dataset, workdir path, shared data directory path ++ Computational environment context (path to `environment.md`) ++ Biological context and criteria sought ++ Target panel size (N) ++ High-level description of the goal + +You do **NOT** need to pass: + ++ Software, packages, version details ++ Code examples ++ Specific analysis steps or algorithms + +The `analysis_expert` knows **independently** how to: +- Analyze and preprocess the dataset +- Run all pre-established selection algorithms (HVG, DE, RF, scGeneFit, SpaPROS) +- Find the optimal seed panel via ARI analysis +- Curate and complete the panel with biological context +- Benchmark the final panel + +**No other agent should intervene in the selection process** (from algorithmic selection through final panel completion). The `analysis_expert` performs this **independently**. + +**IMPORTANT**: When calling `analysis_expert`, always remind it to **STRICTLY** follow the workflow in `.pantheon/skills/omics/gene_panel_selection.md` (or use `glob` with `pattern="**/omics/gene_panel_selection.md"`). Remind it at the beginning of the task and when delegating each major phase. 
+ +### Gene Panel Selection Workflow + +#### 0. Dataset +If the user did **not** provide an AnnData object or dataset path, instruct `analysis_expert` to +**search and retrieve** a relevant dataset from public databases before proceeding. + +When delegating, remind `analysis_expert` to: +- Read the database access skills: `.pantheon/skills/omics/database_access/SKILL.md` + (especially `cellxgene_census.md` as primary source and `gget.md` as fallback) +- Follow the detailed dataset search workflow in Step 0 of `gene_panel_selection.md` +- Extract search parameters (organism, tissue, disease, cell types) from the user's biological context +- Search **CELLxGENE Census first** (largest curated collection, returns AnnData directly) +- Validate the dataset before proceeding (sufficient cells, relevant annotations) + +If the user provided a dataset path, pass it directly to `analysis_expert` and skip dataset retrieval. + +#### 1. Understanding + +**1.a Existing results**: Check for previously generated results. Avoid recomputing. + +**1.b Computational environment**: Check for `environment.md`. If missing, call `system_manager` to gather hardware/software info. Install missing packages via `system_manager`. + +**1.c Dataset understanding**: Call `analysis_expert` to perform dataset inspection, QC, downsampling if needed (>500k cells), gene subsetting if needed (>30k genes). Pass environment info and the path to `environment.md`. +If downsampled, that dataset becomes the only input for algorithmic selection. + +#### 2. Full Selection Pipeline (Steps 2–5) +Pass the biological context, target panel size, algorithms to run, and goal to `analysis_expert`. 
+Let `analysis_expert` execute the **full selection pipeline independently** following the skill workflow: +- Step 2: Algorithmic methods (HVG, DE, RF, scGeneFit, SpaPROS) +- Step 3: Optimal SEED panel discovery (ARI vs panel size) +- Step 4: Curation (biological completion + consensus fill) +- Step 5: Benchmarking on test splits + +The `analysis_expert` will handle all of these steps autonomously. Do not micromanage individual steps. +After `analysis_expert` completes major milestones, call `biologist` **ONLY to interpret results**. +The `biologist` must **NOT** intervene in the algorithmic seed selection or panel curation — interpretation only. + +**IMPORTANT**: Always ensure `analysis_expert` **STRICTLY** respects the workflow in `.pantheon/skills/omics/gene_panel_selection.md`. + +#### 3. Planning +Based on dataset structure, selection methods, and computational environment, +create a project plan in `todolist.md` (markdown checklist format). + +#### 4. Summary +Call the `reporter` agent to generate the final PDF report. 
+ +Pass all paths/results from all sub-agents: +- figures +- tables +- markdown descriptions +- biological interpretations + +--- + +The final report must include **AT LEAST**: + +- A detailed description of the **selection pipeline** from the `analysis_expert` +- All pre-established algorithm results +- Completion logic and reasoning for determining the optimal size for cell-type separability +- Figures including **ARI vs panel size** curves +- Recap table: + +| Gene | Methods where it appears | Biological relevance (context) | Relevance score | +|------|--------------------------|--------------------------------|-----------------| + +- UpSet plot showing intersections between pre-established algorithm outputs +- Benchmarking section with: + - dataset splitting strategy + - ARI/NMI/SI boxplots + - UMAP comparisons + - quantitative UMAP similarity + +**Workdir:** `` -Alternatively, if their instructions match a workflow mentioned in the paragraph below, follow that workflow. +**Always** ask the reporter agent to generate a well-written PDF report: `report.pdf` in the workdir. +When calling the reporter agent, pass only high-level instructions and result paths — +**do not specify report content explicitly**. 
## Workflow for perform the exploratory analysis of single-cell/Spatial Omics data(Important!): diff --git a/pantheon/factory/templates/skills/omics/SKILL.md b/pantheon/factory/templates/skills/omics/SKILL.md index 78075417..fe757794 100644 --- a/pantheon/factory/templates/skills/omics/SKILL.md +++ b/pantheon/factory/templates/skills/omics/SKILL.md @@ -13,6 +13,20 @@ When performing specific analysis tasks, load the relevant skill files to guide ## Available Skills +### Gene Panel Selection Workflow + +End-to-end workflow for designing **biologically meaningful** and **algorithmically robust** +gene panels in scRNA-seq and spatial transcriptomics (HVG/DE/RF/scGeneFit/SpaPROS), +with sub-panel discovery (ARI vs size), consensus scoring, biological completion +(Completion Rule), and benchmarking on test splits. + +**Skill file**: [gene_panel_selection.md](./gene_panel_selection.md) + +**When to use**: +- You need to design a gene panel +- You need to benchmark (ARI/NMI/Silhouette + UMAP similarity) existing panels +- **IMPORTANT**: When you do gene panel selection, you must **STRICTLY** respect this workflow + ### Single-Cell to Spatial Mapping If you have both single-cell and spatial data for the same/similar sample, diff --git a/pantheon/factory/templates/skills/omics/gene_panel_selection.md b/pantheon/factory/templates/skills/omics/gene_panel_selection.md new file mode 100644 index 00000000..9931fbaf --- /dev/null +++ b/pantheon/factory/templates/skills/omics/gene_panel_selection.md @@ -0,0 +1,513 @@ +--- +id: gene_panel_selection +name: Gene Panel Selection Workflow +description: | + End-to-end workflow for gene panel design in scRNA-seq and spatial transcriptomics that must be **STRICTLY** followed: + dataset understanding + smart downsampling + train/test splits, + algorithmic selection (HVG/DE/RF/scGeneFit/SpaPROS), + optimal sub-panel discovery (ARI vs size), + biological completion with a stability gate (Completion Rule), consensus scoring and 
completion (only if there is still room), + and benchmarking on test splits (ARI/NMI/Silhouette + UMAP similarity). +tags: [gene-panel, selection, scrna-seq, spatial, scanpy, scverse, benchmarking, spapros, scgenefit, random-forest] +--- + +# Gene Panel Selection Workflow + +This skill is used when you need to construct **biologically meaningful** and **algorithmically robust** gene panels. You will receive context from the `leader` agent; use this context and **STRICTLY** follow this **Gene Panel Selection Workflow**. + +## Workflow Enforcement (MANDATORY) + +Determine which stage of the workflow (Steps 1–5) is required for the current task, +and **STRICTLY** follow the corresponding step(s). + +Once a step is entered, all its mandatory sub-steps must be executed. +No partial execution or silent degradation is allowed. + + + +## Workdir +Always work in the workdir provided by the leader agent. + +## Calling other agents +You can call other agents by using the `call_agent(agent_name, instruction)` function. + +- **Call the `browser_use` agent** for information collection: + When you encounter software or biological knowledge you are not familiar with, call `browser_use` to search the web and collect the necessary information. + +- **Call the `system_manager` agent** for software environment installation: + When you need to install software packages, call `system_manager` to install them. + +- **Call the `biologist` agent** for results interpretation: + When you plot figures, compute a panel, or have intermediate results, call `biologist` to ask for interpretations and include them in your report. + +- **Call the `reporter` agent** when all the results are obtained, to produce a well-written PDF. + +## Visual understanding +Use the `observe_images` function in the `file_manager` toolset to examine images and figures. +If a figure is not publication-quality, replot it. 
+ +## Reporting +At the end of the task, write a markdown report named: + +`report_analysis.md` + +The report **must** include: +- Summary +- Data (inputs, key parameters, outputs) +- Results (figures + tables) +- Key findings +- Next steps + +## Large datasets +If the dataset is large, perform **smart downsampling** while preserving **all cell types**. + +--- + +# Workflow (IMPORTANT: STRICTLY FOLLOW THE NEEDED STEPS) + +## 0. Dataset + +**If the user provided an AnnData object / dataset path → skip to Step 1.** + +If no dataset was provided, you **must** search and retrieve a relevant dataset +before proceeding. Follow the sub-steps below **in order**. + +> [!IMPORTANT] +> Before starting, read the database access skill index: +> `.pantheon/skills/omics/database_access/SKILL.md` +> (or use `glob` with `pattern="**/database_access/SKILL.md"`) +> and the relevant skill files it references (especially `cellxgene_census.md` and `gget.md`). + +### 0.1 Parse the user query +Extract search parameters from the leader-provided context: +- **Organism**: e.g., "Homo sapiens", "Mus musculus" +- **Tissue / organ**: e.g., "lung", "brain", "bone marrow", "tumor" +- **Disease context**: e.g., "COVID-19", "cancer", "normal" +- **Cell types of interest**: e.g., "immune cells", "T cells", "neurons" +- **Assay preference**: e.g., scRNA-seq, spatial transcriptomics +- **Scope**: Is this a **focused** task (single tissue/disease/system) or a **broad** task + (multi-tissue, pan-disease, cross-system)? This is critical for dataset selection. + +> [!CRITICAL] +> **Match dataset scope to task scope.** The dataset(s) you retrieve must be +> representative of the **full biological diversity** the panel is designed for. 
+> - **Focused task** (e.g., "brain cortex panel", "kidney disease panel") → +> fetch data from that specific tissue/disease/system +> - **Broad / cross-tissue task** (e.g., "pan-cancer panel", "whole-body immune panel", +> "multi-organ developmental panel") → you **must** include data from **all relevant +> tissues, diseases, or biological contexts** so the panel captures both shared and +> context-specific biology. **Do NOT narrow down to a single tissue or disease.** +> - In general: the biological diversity in the retrieved dataset should reflect the +> biological diversity that the final gene panel must resolve. If the panel needs +> to distinguish 10 tissues, the dataset must contain cells from those 10 tissues. + +### 0.2 Search CELLxGENE Census (PRIMARY source) +CELLxGENE Census is the largest curated single-cell collection (217M+ cells) +and returns AnnData objects directly — **always try this first**. + +Read the full skill: `.pantheon/skills/omics/database_access/cellxgene_census.md` + +Strategy: + +**A) First, look for existing atlases / large integrated datasets** that already match +the task scope. CELLxGENE hosts many curated cross-tissue and disease-specific atlases +(e.g., Tabula Sapiens, Human Cell Atlas collections, organ-specific atlases, +disease-focused atlases). A single well-curated atlas is far better than stitching +together cells from separate studies (avoids batch effects, inconsistent annotations, etc.). 
+ +```python +import cellxgene_census +with cellxgene_census.open_soma() as census: + # List all datasets and inspect their descriptions + datasets = census["census_info"]["datasets"].read().concat().to_pandas() + # Browse dataset titles/collections to find relevant atlases + print(datasets[["dataset_id", "collection_name", "dataset_title"]].head(30)) + # Score datasets by their biological diversity + obs_df = cellxgene_census.get_obs( + census, "", + value_filter="is_primary_data == True", + column_names=["dataset_id", "tissue_general", "disease", "cell_type"], + ) + diversity = obs_df.groupby("dataset_id").agg( + n_cells=("cell_type", "size"), + n_tissues=("tissue_general", "nunique"), + n_diseases=("disease", "nunique"), + n_cell_types=("cell_type", "nunique"), + ).sort_values("n_tissues", ascending=False) + print(diversity.head(20)) +``` + +Pick the dataset that best matches the task scope. For broad tasks, prioritize datasets +with highest tissue/disease/cell-type diversity. For focused tasks, prioritize relevance +to the specific tissue/disease. Prefer datasets with >50k cells and existing cell type annotations. + +**B) If no single atlas suffices**, build a composite query across multiple tissues/diseases: + +1. **Explore available data** — query cell metadata to estimate dataset sizes: + ```python + with cellxgene_census.open_soma() as census: + obs_df = cellxgene_census.get_obs( + census, "", + value_filter="tissue_general == '' and is_primary_data == True", + column_names=["cell_type", "tissue", "tissue_general", "disease", "assay", "dataset_id"], + ) + print(f"Total cells: {len(obs_df)}") + print(obs_df["cell_type"].value_counts().head(20)) + print(obs_df["disease"].value_counts().head(10)) + print(obs_df["tissue_general"].value_counts().head(15)) + ``` +2. **Refine filters — but preserve the task scope**: + - For **broad tasks**: keep multiple tissues/diseases/contexts in the filter. 
+ Sample a **balanced** number of cells per category to avoid one dominating. + - For **focused tasks**: narrow to the specific tissue/disease/context. + - Always check the diversity of cell types, tissues, and diseases after filtering + to confirm the dataset matches the task scope. + +3. **Download the dataset** as AnnData: + ```python + with cellxgene_census.open_soma() as census: + adata = cellxgene_census.get_anndata( + census, + organism="", + obs_value_filter=" and is_primary_data == True", + column_names={ + "obs": ["cell_type", "tissue", "tissue_general", "disease", "sex", + "assay", "donor_id", "dataset_id", "development_stage"], + }, + ) + ``` +4. **Always filter `is_primary_data == True`** to avoid duplicate cells +5. If the dataset is very large (>500k cells), **downsample per category** rather than + dropping entire tissues/diseases. For example, sample up to N cells per + (tissue, disease) combination to keep diversity while controlling size. + Alternatively, use the streaming API (`ExperimentAxisQuery`) — see the skill file. + +### 0.3 Alternative sources (if Census is insufficient) +If CELLxGENE Census does not have suitable data (e.g., rare tissue, specific organism, +spatial data needed), try these alternatives **in order of preference**: + +1. **gget.cellxgene** — query CZ CELLxGENE Discover for specific datasets: + Read: `.pantheon/skills/omics/database_access/gget.md` + ```python + import gget + gget.setup("cellxgene") + adata = gget.cellxgene(species="homo_sapiens", tissue="", + cell_type=["", ""]) + ``` +2. **GEO / ArrayExpress** — call `browser_use` to search for accession numbers, + then download via `gget` or direct URL +3. **Human Cell Atlas (HCA)** / **Tabula Sapiens** / **Broad Single Cell Portal** + — call `browser_use` for specific dataset URLs + +Prefer datasets that already provide **processed count matrices** +(h5ad, loom, mtx format) with cell type annotations and metadata. 
+
+### 0.4 Validate the retrieved dataset
+Before proceeding to Step 1, verify:
+- [ ] Dataset is loaded as an AnnData object
+- [ ] Sufficient number of cells (ideally >10k for robust panel selection)
+- [ ] Cell type annotations exist (check `.obs` columns) — if not, they will be computed in Step 1
+- [ ] The dataset is relevant to the user's biological context
+- [ ] Save the dataset to workdir via `file_manager`
+
+> [!NOTE]
+> Document in the notebook which database was queried, what filters were used,
+> and why this dataset was selected. This information goes into the final report (Step 6).
+
+## 1) Dataset Understanding and Splitting
+
+Start with exploratory inspection using an **integrated notebook**.
+
+### 1.1 Basic structure
+Inspect:
+- file format (h5ad or other)
+- number of cells / genes
+- batches / conditions
+- `.obs`, `.var`, `.obsm`, `.uns`
+- whether the dataset has spatial or multimodal components
+
+Checklist:
+- [ ] Identify `label_key` (true cell type recommended if present)
+- [ ] Identify batch/condition columns
+- [ ] Confirm whether `adata.X` is raw counts or normalized/log1p
+
+---
+
+### 1.2 Downsampling (CRITICAL)
+Rules:
+- Downsample to **< 500k cells**, **preserving all cell types**
+- If genes > **30000**, reduce to **<= 30000** via QC/HVG for compute-heavy steps
+- Save the downsampled `adata` to a new file in workdir via `file_manager`
+
+> [!IMPORTANT]
+> Prefer stratified downsampling by `label_key` if available; otherwise stratify by clustering labels.
+
+---
+
+### 1.3 Splitting
+If given a single dataset, split it so that the cell type distribution is preserved across all splits:
+- 1 training dataset (diversified)
+- several test batches (**at least 5**)
+- constraint: each split **< 50k cells**
+- make splits as non-redundant as possible and represent **all cell types**
+
+---
+
+### 1.4 Preprocessing status
+Check:
+- normalization
+- PCA
+- UMAP
+- clustering
+
+Recompute only if missing or invalid.
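The stratified split described in 1.3 can be sketched as follows. This is a minimal illustration with a hypothetical helper name (`stratified_splits`); it assigns each cell to the training set or to one of the test batches, round-robin per cell type so every batch represents all cell types. The per-split `< 50k cells` cap and non-redundancy checks still need to be applied on top.

```python
import numpy as np
import pandas as pd

def stratified_splits(labels: pd.Series, n_test: int = 5,
                      train_frac: float = 0.5, seed: int = 0) -> pd.Series:
    """Assign each cell to 'train' or one of n_test test batches, stratified by cell type."""
    rng = np.random.default_rng(seed)
    assignment = np.empty(len(labels), dtype=object)
    for _, idx in labels.groupby(labels).indices.items():
        idx = rng.permutation(np.asarray(idx))
        n_train = max(1, int(round(train_frac * len(idx))))
        assignment[idx[:n_train]] = "train"
        for i, j in enumerate(idx[n_train:]):
            assignment[j] = f"test_{i % n_test}"  # round-robin keeps every type in every batch
    return pd.Series(assignment, index=labels.index)

labels = pd.Series(["T"] * 20 + ["B"] * 10)
split = stratified_splits(labels)
```

In practice the split column would be written into `adata.obs` and each subset saved to workdir via `file_manager`.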
+
+---
+
+### 1.5 Preprocessing (if needed)
+- QC (follow the QC skill if available)
+- Normalize / log1p / scale
+- PCA / neighbors / UMAP
+- Batch correction (if needed)
+- Leiden clustering
+- DEG & marker detection
+- Cell type annotation
+- Marker plots (dotplots, heatmaps)
+
+> [!IMPORTANT]
+> If heavy steps are slow or unstable in the notebook, run them as standalone scripts via the `python_interpreter` toolset.
+
+---
+
+## 2) Algorithmic Gene Panel Selection
+
+### 2.1 Pre-established methods
+Algorithmic Methods = `{HVG, DE, Random Forest, scGeneFit, SpaPROS}`
+
+- Use true cell type as `label_key` whenever available
+- Implement HVG / DE via Scanpy in code
+- For more advanced methods, **always use** the `gene_panel_selection_tool` toolset:
+```python
+from pantheon.toolsets.gene_panel_selection_tool import GenePanelToolSet
+
+selection_tool = GenePanelToolSet(
+    name="gene_panel_selection",
+    default_adata_path="adapath",
+    default_workdir="workdir",
+)
+
+# Advanced methods (tool calls)
+# - select_scgenefit (ALWAYS: max_constraints <= 1000)
+# - select_spapros (ALWAYS: n_hvg < 3000)
+# - select_random_forest
+#
+# Example calls (adjust args as needed):
+await selection_tool.select_scgenefit(label_key="cell_type", n_top_genes="200", max_constraints="1000")
+await selection_tool.select_spapros(label_key="cell_type", num_markers="200", n_hvg="2500")
+await selection_tool.select_random_forest(label_key="cell_type", n_top_genes="1000")
+```
+
+- Always request **gene scores**
+- Save each method's score table to disk (CSV)
+
+---
+
+## 3) Optimal SEED Panel Discovery
+
+For **each method independently (HVG, DE, scGeneFit, RF, SpaPROS)**:
+
+Let N be the target final panel size requested by the leader.
+
+1. Load the method-specific gene score CSV and rank genes (descending score).
+2. Build candidate sub-panels of sizes K ∈ {100, 200, …, N} by taking the top-K ranked genes.
+3. 
For each method and each K:
+   - Subset the dataset to panel genes only: adata_K = adata[:, panel_genes]
+   - Recompute neighbors + Leiden on adata_K (same preprocessing policy across K)
+   - Compute ARI between Leiden clusters and true cell types (label_key).
+4. Plot ARI vs K for each method.
+5. Pick the **seed panel** = (method, K*) with the best ARI.
+
+**Note**: The **SEED STEP** is performed on the training `adata`. It is **IMPORTANT** to investigate ARI vs panel size for all methods (HVG, DE, scGeneFit, RF, SpaPROS) when possible, to make sure you pick the best one!
+
+---
+
+## 4) Curation Logic
+
+### 4.1 Curation pipeline (STRICT ORDER)
+
+The final panel is built in **two phases**:
+
+#### Phase 1 — Seed panel (algorithmic)
+- Use the optimal seed panel identified in Step 3 as the seed subpanel
+- Do **not** change genes in the seed
+
+#### Phase 2 — Completion (biological lookup is the PRIMARY mechanism)
+
+> **CRITICAL**: Biological curation is the MAIN completion mechanism, NOT consensus fill.
+> The purpose of completion is to add biologically meaningful genes that algorithmic methods may have missed.
+> Consensus fill is ONLY a small last-resort gap filler. If you find yourself adding more consensus-fill genes
+> than biological genes, you have NOT done enough biological lookup.
+
+**0) Completion Rule**
+Before adding a batch of genes:
+- test whether it makes ARI drop considerably or become less stable (training)
+- If completing the panel up to size **N** degrades performance substantially (e.g., ARI drop > 5%), propose:
+  - an optimal stable panel (< N)
+  - a supplemental gene list to reach N if required
+- a modest ARI drop is acceptable if it adds important biological coverage
+Check this on the training dataset.
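The shared primitive behind the Step 3 K-sweep and the Completion Rule check above is "cluster on a gene subset, then score ARI against true labels". A minimal sketch follows, with hypothetical helper names; KMeans is used here only as a lightweight stand-in for the neighbors + Leiden step prescribed above (swap in `sc.pp.neighbors` + `sc.tl.leiden` in the real pipeline).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def panel_ari(X, gene_idx, labels, n_clusters, seed=0):
    """Cluster cells using only the given genes and score agreement with true labels."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[:, gene_idx])
    return adjusted_rand_score(labels, clusters)

def ari_vs_k(X, scores, labels, ks, n_clusters, seed=0):
    """Sweep panel sizes K over the top-K scored genes (Step 3)."""
    order = np.argsort(-np.asarray(scores))
    return {k: panel_ari(X, order[:k], labels, n_clusters, seed) for k in ks}

# Synthetic check: gene 0 cleanly separates the two cell types
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
labels = np.array([0] * 50 + [1] * 50)
X[labels == 1, 0] += 10.0
scores = np.zeros(20)
scores[0] = 1.0
curve = ari_vs_k(X, scores, labels, ks=[1, 5], n_clusters=2)
```

The same `panel_ari` call, applied to `{seed + candidate batch}` on the training split, implements the Completion Rule's ARI stability test.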
+ +**1) Assess Seed Coverage First** +Before biological lookup, inspect genes in the seed panel: +- Map seed gene IDs to symbols +- Identify which biological categories from the leader's context are already covered +- Note which categories are MISSING or under-represented + +**2) Exhaustive Biological Lookup (CRITICAL — MUST BE THOROUGH)** +Derive the relevant biological categories from the **leader-provided context** (e.g., cell type markers, signaling pathways, functional states, disease-specific genes — whatever the user's goal requires). + +Call `browser_use` **MULTIPLE times**, once per major biological category identified. +For **each category**, collect **all** well-established marker genes (typically 10-30+ per category, not just 3-5). +Sources: GeneCards, GO, UniProt, KEGG, Reactome, MSigDB, published marker gene lists, review articles. + +> A single browser_use call returning a handful of genes for an entire panel is INSUFFICIENT. +> The number of biologically curated genes should scale with the gap between seed size and target N. +> Do multiple rounds of lookup — breadth across ALL relevant categories AND depth within each. 
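The lookup results need the bookkeeping that the deliverable table below demands (gene, method, category, function, source). A minimal sketch with a hypothetical helper name (`screen_candidates`) that merges per-category candidates while skipping genes already covered:

```python
import pandas as pd

def screen_candidates(seed_panel, candidates):
    """Build the per-gene justification table, skipping genes already in the seed panel.

    candidates maps category -> list of (gene, function, source) tuples from lookup.
    """
    rows = [{"gene": g, "method": "seed", "category": "seed",
             "function": "", "source": "seed score"} for g in seed_panel]
    seen = set(seed_panel)
    for category, genes in candidates.items():
        for gene, function, source in genes:
            if gene in seen:
                continue  # no duplicates across the seed panel / earlier categories
            seen.add(gene)
            rows.append({"gene": gene, "method": "biological lookup",
                         "category": category, "function": function, "source": source})
    return pd.DataFrame(rows)

table = screen_candidates(
    ["CD3E"],
    {"T cell": [("CD3E", "TCR complex component", "GeneCards"),
                ("CD8A", "cytotoxic T cell marker", "GeneCards")]},
)
```

Each batch accepted from this table must still pass the Completion Rule's ARI stability check before it is kept.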
+
+**3) Add Biologically Relevant Genes**
+For each candidate gene:
+  - check it is not already in the seed panel
+  - ensure no redundancy
+  - maintain balanced biological coverage across categories
+  - categorize it into a relevant biological category (from leader context, or inferred)
+  - after each batch of additions, check the Completion Rule (ARI stability on training)
+  - if ARI drops sharply, try a different set; a modest drop for strong biological coverage is acceptable
+  - continue until all important biological genes are added or the panel reaches size N
+
+**4) Consensus Fill (LAST RESORT ONLY — small gap filler)**
+Only if, after exhaustive biological lookup, `{seed + biological genes} < N`:
+  - normalize scores per method (same scale, so no method dominates)
+  - aggregate into a consensus table
+  - fill the small remaining gap by score priority, excluding genes already present
+
+**Deliverable: a gene × {method where it comes from, biological category, biological function, source/reference} table.**
+
+**Note**: Every accepted gene must be **justified**, assigned a **biological category**, and referenced with a source (seed/method score or website/literature) and a gene function description.
+
+---
+
+## 5) Benchmarking (MANDATORY)
+
+### 5.0 Panel gene comparison
+Create an **UpSet plot** for all **N**-size algorithmic panels to see their overlap.
+
+Use the **full original dataset** for evaluation.
+
+### 5.1 Dataset
+Benchmarking is performed on **test datasets**.
+
+### 5.2 Metrics
+For each subset compute (across test splits):
+1. all algorithmic **N**-size panels
+2. the final curated **N**-size panel
+3. if the curated **N**-size panel was not optimal per the **Completion Rule**, also benchmark the optimal stable panel (< N)
+ falls back to the default.
+    - Ignores extra keys not present in the signature.
+    - Passes through extra kwargs (e.g. context_variables) untouched.
+ """ + sig = inspect.signature(func) + param_names = list(sig.parameters.keys()) + + @wraps(func) + async def wrapper(*args, **kwargs): + # Detect if a dict was passed as the first positional arg (after self) + dict_positional = len(args) > 1 and isinstance(args[1], dict) + + # Or nested under the first non-self parameter name via kwargs + first_nonself = next((p for p in param_names if p != "self"), None) + dict_in_kwargs = ( + first_nonself in kwargs and isinstance(kwargs[first_nonself], dict) + ) + + if dict_positional or dict_in_kwargs: + if dict_positional: + params = args[1] + bound = sig.bind_partial(*args[:1]) + else: + params = kwargs.pop(first_nonself) + bound = sig.bind_partial(*args) + + for name, param in sig.parameters.items(): + if name == "self" or name in bound.arguments: + continue + v = params.get(name, param.default) if isinstance(params, dict) else param.default + if v in ("", None) and param.default is not inspect._empty: + v = param.default + if v is not inspect._empty: + bound.arguments[name] = v + + for k, v in kwargs.items(): + if k not in bound.arguments: + bound.arguments[k] = v + + return await func(*bound.args, **bound.kwargs) + + return await func(*args, **kwargs) + + return wrapper + + +class GenePanelToolSet(ToolSet): + """Toolset for algorithmic gene panel selection methods. + + Exposes SpaPROS, Random Forest, and scGeneFit as async tool calls. + HVG and DE are intentionally handled by the LLM in the notebook + (via scanpy) since they are simple one-liners. + + Args: + name: Toolset name registered with the agent runtime. + default_adata_path: Fallback .h5ad path when the caller omits one. + default_workdir: Fallback directory for saving outputs. 
+ """ + + def __init__( + self, + name: str = "gene_panel_selection", + default_adata_path: Optional[str] = None, + default_workdir: str = ".", + **kwargs, + ): + super().__init__(name, **kwargs) + self.default_adata_path = default_adata_path + self.default_workdir = default_workdir + logger.info( + f"GenePanelToolSet initialized " + f"(name={name}, adata={default_adata_path}, workdir={default_workdir})" + ) + + # ------------------------------------------------------------------ # + # Helpers # + # ------------------------------------------------------------------ # + + @staticmethod + def _coerce(value, default, cast): + """Safe type conversion with fallback to *default*.""" + if value is None: + return default + if isinstance(value, str) and value.strip().lower() in ("", "none", "null"): + return default + try: + return cast(value) + except (ValueError, TypeError): + return default + + def _resolve(self, adata_path, workdir): + """Return (path, workdir) after applying defaults.""" + path = adata_path or self.default_adata_path + wdir = workdir or self.default_workdir + return path, wdir + + # ------------------------------------------------------------------ # + # SpaPROS # + # ------------------------------------------------------------------ # + + @tool + @unwrap_llm_dict_call + async def select_spapros( + self, + adata_path: Optional[str] = None, + label_key: str = "", + num_markers: str = "100", + n_hvg: str = "3000", + return_scores: str = "false", + workdir: Optional[str] = None, + ) -> dict: + """ + Select marker genes using SpaPROS probe-set selection. + + Args: + adata_path (str): Path to .h5ad dataset. Falls back to default. + label_key (str): Column in .obs for cell groups. + num_markers (str): Number of markers to select. + n_hvg (str): HVG pre-filter size (must be < 3000). + return_scores (str): "true" to include per-gene importance scores. + workdir (str): Output directory. Falls back to default. 
+ + Returns: + dict with keys: used_dataset, top_n, saved_to, genes. + """ + try: + import scanpy as sc + import pandas as pd + import numpy as np + import spapros as sp + + path, workdir = self._resolve(adata_path, workdir) + if not path: + return {"error": "No dataset path provided."} + + out_dir = os.path.join(workdir, "gene_panels", "spapros") + os.makedirs(out_dir, exist_ok=True) + + num_markers = self._coerce(num_markers, 100, int) + n_hvg = self._coerce(n_hvg, 3000, int) + return_scores = str(return_scores).lower() in ("true", "yes", "1") + + adata = sc.read_h5ad(path) + + # Pre-filter to HVGs for tractable computation + sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=n_hvg) + adata = adata[:, adata.var["highly_variable"]] + + if not label_key or label_key not in adata.obs.columns: + return {"error": f"label_key '{label_key}' not found in adata.obs."} + + selector = sp.se.ProbesetSelector( + adata, + n=num_markers, + celltype_key=label_key, + verbosity=1, + save_dir=None, + ) + selector.select_probeset() + + df = selector.probeset.copy() + df.index.name = "gene" + + # Save full results table + full_path = os.path.join(out_dir, "spapros_full_table.csv") + df.to_csv(full_path) + + selected = df[df["selection"] == True].index.tolist() + panel_path = os.path.join(out_dir, f"spapros_top_{num_markers}.csv") + pd.DataFrame({"gene": selected}).to_csv(panel_path, index=False) + + if return_scores and "importance_score" in df.columns: + score_list = [ + {"gene": g, "score": float(row.get("importance_score", np.nan))} + for g, row in df.iterrows() + ] + score_path = os.path.join(out_dir, "spapros_scores.csv") + pd.DataFrame(score_list).to_csv(score_path, index=False) + return { + "used_dataset": path, + "top_n": num_markers, + "saved_to": {"panel": panel_path, "full_table": full_path, "scores": score_path}, + "genes": score_list, + } + + return { + "used_dataset": path, + "top_n": num_markers, + "saved_to": {"panel": panel_path, "full_table": 
full_path}, + "genes": selected, + } + + except Exception as e: + import traceback + logger.error(f"select_spapros failed: {e}\n{traceback.format_exc()}") + return {"error": f"SpaPROS failed: {e}"} + + # ------------------------------------------------------------------ # + # Random Forest # + # ------------------------------------------------------------------ # + + @tool + @unwrap_llm_dict_call + async def select_random_forest( + self, + adata_path: Optional[str] = None, + label_key: str = "", + n_top_genes: str = "1000", + return_scores: str = "false", + random_state: str = "42", + workdir: Optional[str] = None, + ) -> dict: + """ + Rank genes by Random Forest feature importance. + + Trains an RF classifier on the expression matrix (X) to predict + cell labels, then returns genes ranked by Gini importance. + + Args: + adata_path (str): Path to .h5ad dataset. Falls back to default. + label_key (str): Column in .obs for cell labels. + n_top_genes (str): How many top genes to save (default 1000). + return_scores (str): "true" to return all genes with scores. + random_state (str): Random seed. + workdir (str): Output directory. Falls back to default. + + Returns: + dict with keys: used_dataset, top_n, saved_to, genes. 
+ """ + try: + import scanpy as sc + import numpy as np + import pandas as pd + from sklearn.ensemble import RandomForestClassifier + + path, workdir = self._resolve(adata_path, workdir) + if not path: + return {"error": "No dataset path provided."} + + out_dir = os.path.join(workdir, "gene_panels", "random_forest") + os.makedirs(out_dir, exist_ok=True) + + n_top_genes = self._coerce(n_top_genes, 1000, int) + random_state = self._coerce(random_state, 42, int) + return_scores = str(return_scores).lower() in ("true", "1", "yes") + + adata = sc.read_h5ad(path) + + if not label_key or label_key not in adata.obs.columns: + return {"error": f"label_key '{label_key}' not found in adata.obs."} + + X = adata.X.toarray() if not isinstance(adata.X, np.ndarray) else adata.X + y = adata.obs[label_key].astype("category").cat.codes.values + + clf = RandomForestClassifier( + n_estimators=300, random_state=random_state, n_jobs=-1 + ) + clf.fit(X, y) + + ranked = sorted( + [{"gene": g, "score": float(s)} for g, s in zip(adata.var_names, clf.feature_importances_)], + key=lambda d: d["score"], + reverse=True, + ) + + save_path = os.path.join(out_dir, f"rf_top_{n_top_genes}.csv") + pd.DataFrame(ranked[:n_top_genes]).to_csv(save_path, index=False) + + return { + "used_dataset": path, + "top_n": n_top_genes, + "saved_to": save_path, + "genes": ranked if return_scores else [x["gene"] for x in ranked[:n_top_genes]], + } + + except Exception as e: + logger.error(f"select_random_forest failed: {e}") + return {"error": str(e)} + + # ------------------------------------------------------------------ # + # scGeneFit # + # ------------------------------------------------------------------ # + + @tool + @unwrap_llm_dict_call + async def select_scgenefit( + self, + adata_path: Optional[str] = None, + label_key: Optional[str] = None, + n_top_genes: str = "200", + method: str = "centers", + epsilon_param: str = "1.0", + sampling_rate: str = "1.0", + n_neighbors: str = "3", + max_constraints: str = 
"1000", + redundancy: str = "0.01", + return_scores: str = "false", + workdir: Optional[str] = None, + ) -> dict: + """ + Select marker genes via scGeneFit (LP-based marker selection). + + Solves a linear program that finds a sparse weight vector over genes + such that labeled cell groups remain separable. Weights serve as + per-gene importance scores. + + Args: + adata_path (str): Path to .h5ad. Falls back to default. + label_key (str): Column in .obs for cell labels. + n_top_genes (str): Number of markers to select (default 200). + method (str): Constraint strategy: "centers" | "pairwise" | "pairwise_centers". + epsilon_param (str): LP epsilon scaling factor (default 1.0). + sampling_rate (str): Fraction of cells to sample for pairwise methods. + n_neighbors (str): Neighbours for pairwise constraint building. + max_constraints (str): Hard cap on constraint rows (keep <= 1000). + redundancy (str): Redundancy param for center summarisation. + return_scores (str): "true" to return all genes with LP weights. + workdir (str): Output directory. Falls back to default. + + Returns: + dict with keys: used_dataset, top_n, saved_to, genes. 
+ """ + try: + import time + import scanpy as sc + import numpy as np + import pandas as pd + import scipy.sparse as sps + import scGeneFit.functions as gf + + path, workdir = self._resolve(adata_path, workdir) + if not path: + return {"error": "No dataset path provided."} + + out_dir = os.path.join(workdir, "gene_panels", "scgenefit") + os.makedirs(out_dir, exist_ok=True) + + n_top_genes = self._coerce(n_top_genes, 200, int) + epsilon_param = self._coerce(epsilon_param, 1.0, float) + sampling_rate = self._coerce(sampling_rate, 1.0, float) + n_neighbors = self._coerce(n_neighbors, 3, int) + max_constraints = self._coerce(max_constraints, 1000, int) + redundancy = self._coerce(redundancy, 0.01, float) + return_scores = str(return_scores).lower() in ("true", "1", "yes") + + logger.info(f"scGeneFit: loading {path}") + adata = sc.read_h5ad(path) + if getattr(adata, "isbacked", False): + adata = adata.to_memory() + + if not label_key or label_key not in adata.obs.columns: + return {"error": f"label_key '{label_key}' not found in adata.obs."} + + logger.info( + f"scGeneFit: {adata.shape}, method={method}, " + f"n_top_genes={n_top_genes}, max_constraints={max_constraints}" + ) + + # Dense matrix required by the LP solver + if sps.issparse(adata.X): + X = adata.X.toarray() + else: + X = np.asarray(adata.X) + + y = adata.obs[label_key].astype("category").values + d = X.shape[1] + + # Access internal scGeneFit routines + _sample = getattr(gf, "__sample") + _pairwise = getattr(gf, "__select_constraints_pairwise") + _pairwise_cent = getattr(gf, "__select_constraints_centers") + _summarised = getattr(gf, "__select_constraints_summarized") + _lp_markers = getattr(gf, "__lp_markers") + + t0 = time.time() + samples, samples_labels, _ = _sample(X, y, sampling_rate) + logger.info(f"scGeneFit: sampled {len(samples)} cells in {time.time() - t0:.1f}s") + + t0 = time.time() + if method == "pairwise_centers": + constraints, smallest_norm = _pairwise_cent(X, y, samples, samples_labels) + 
elif method == "pairwise": + constraints, smallest_norm = _pairwise(X, y, samples, samples_labels, n_neighbors) + else: + constraints, smallest_norm = _summarised(X, y, redundancy) + logger.info( + f"scGeneFit: {constraints.shape[0]} constraints built in {time.time() - t0:.1f}s" + ) + + # Cap constraints to keep LP tractable + if constraints.shape[0] > max_constraints: + rng = np.random.default_rng(42) + idx = rng.permutation(constraints.shape[0])[:max_constraints] + constraints = constraints[idx, :] + logger.info(f"scGeneFit: capped to {max_constraints} constraints") + + t0 = time.time() + sol = _lp_markers(constraints, n_top_genes, smallest_norm * epsilon_param) + logger.info(f"scGeneFit: LP solved in {time.time() - t0:.1f}s") + + weights = np.asarray(sol["x"][:d], dtype=float) + + if return_scores: + ranked = sorted( + [{"gene": g, "score": float(s)} for g, s in zip(adata.var_names, weights)], + key=lambda d: d["score"], + reverse=True, + ) + save_path = os.path.join(out_dir, "scgenefit_scores.csv") + pd.DataFrame(ranked).to_csv(save_path, index=False) + return { + "used_dataset": path, + "top_n": len(ranked), + "saved_to": save_path, + "genes": ranked, + } + + order = np.argsort(-weights)[:n_top_genes] + top = adata.var_names[order].tolist() + save_path = os.path.join(out_dir, f"scgenefit_top_{n_top_genes}.csv") + pd.DataFrame({"gene": top}).to_csv(save_path, index=False) + + return { + "used_dataset": path, + "top_n": n_top_genes, + "saved_to": save_path, + "genes": top, + } + + except Exception as e: + import traceback + logger.error(f"scGeneFit failed: {e}\n{traceback.format_exc()}") + return {"error": str(e)} + + +__all__ = ["GenePanelToolSet"]