
Issue with running pipeline since previous update 5 months ago. #109

@JakeLehle


Report

Hey, I love the pipeline and what you are doing by merging multiple machine learning models to get a popularity vote. I've been using the package without a problem for about a year now in my single-cell pipeline, but I recently rebuilt my conda env and now everything is broken. I'm having a heck of a time figuring out exactly why. This brings up both a bug report and an enhancement request for the documentation, to make the pipeline easier for others to use in the future.

Here is the code I typically run using the default of all models.

#%% Automated cell type analysis with popv
import popv
# popV needs raw counts in adata.X (no normalization)
# Use the QC_on_adata_normal() function to remove low-quality cells as a pre-processing step
# The resulting adata_tmp object is the input for the steps below

# Select a pre-trained model
huggingface_repo = "popV/tabula_sapiens_All_Cells"
# The query batch key is what will be used by bbknn for batch correction
query_batch_key = "run_accession"
#%% Perform annotation using a premade model
import numba
hmo = popv.hub.HubModel.pull_from_huggingface_hub(huggingface_repo, cache_dir="tmp/tabula_sapiens")

#%%
adata_tmp_an = hmo.annotate_data(
    adata_tmp,
    query_batch_key=query_batch_key,
    prediction_mode="inference",  # "fast" does not integrate reference and query.
    gene_symbols="feature_name", # "Uncomment if using gene symbols."
)

for col in adata_tmp_an.obs.columns:
    adata_tmp_an.obs[col] = adata_tmp_an.obs[col].astype(str)

adata_tmp_an.write("adata_popv_an.h5ad")

However, this now hits an error when running the OnClass model. It seems related to a change in pandas v3.0.0 that makes it difficult to write AnnData objects containing Arrow types. I've been banging my head against a wall trying to sort out the dependency issues. If I pin pandas=2.2.3 and anndata=0.12.10, the popV step gets further but still fails. I also tried dropping the OnClass model by manually listing all the other models, but there is very little documentation in the IPython notebook on how to do this.
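One workaround that may be worth trying is casting string-extension columns back to plain object dtype before writing. This is only a sketch under the assumption that the write failure comes from string-dtype columns in a table such as adata.obs or adata.var, which I have not confirmed. The helper name strings_to_object and the example table are hypothetical and not part of popV or anndata.

```python
import pandas as pd


def strings_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with string-extension columns cast to object dtype.

    Hypothetical helper for illustration; not part of popV or anndata.
    """
    out = df.copy()
    for col in out.columns:
        # pd.StringDtype matches the "string" extension dtype, including
        # Arrow-backed variants; other columns are left untouched.
        if isinstance(out[col].dtype, pd.StringDtype):
            out[col] = out[col].astype(object)
    return out


# Small stand-in for an annotation table such as adata.obs (made-up values).
example = pd.DataFrame(
    {
        "cell_type": pd.array(["T cell", "B cell"], dtype="string"),
        "n_genes": [1200, 950],
    }
)
converted = strings_to_object(example)
print(converted.dtypes)
```

If this assumption holds, the same helper could be applied to the obs and var tables of the annotated object right before the write call.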

So I need two things:

  1. Please try to run the pipeline with a fresh install and the most current versions of scanpy, anndata, and pandas. If you can get it to work, please tell me your package versions so I can copy them. If it fails, let me know and I can help troubleshoot a fix.

  2. Add documentation to the tutorial page showing users how to manually select one or all of the models to be used as inputs for the .annotate_data() function. Please also update the API docs to list more clearly the names of the models that can be passed to .annotate_data(). Right now, based on the Python files in the source, it looks like they need to be called _onclass, _celltypist, _harmony, etc. But that isn't clear, and those names threw errors when I tried setting them in my code.
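To make comparing environments easier for the first request, here is a minimal sketch that prints the installed versions of the relevant packages using only the standard library. The package list is my assumption about which distributions matter here; adjust the names as needed.

```python
from importlib import metadata

# Assumed list of distributions relevant to this issue; edit as needed.
PACKAGES = [
    "popv",
    "scanpy",
    "anndata",
    "pandas",
    "numpy",
    "scvi-tools",
    "celltypist",
    "OnClass",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(PACKAGES)
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both a working and a broken environment gives two short lists that can be compared line by line.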

Version information

Here is the conda env YAML file I use to build my virtual environment for running popV.

name: sc_pre
channels:
  - conda-forge
  - bioconda
  - plotly
dependencies:
  # Core Python
  - python=3.11
  - cython
  - numpy
  - pandas=2.2.3
  - scipy
  - scikit-learn

  # Scanpy ecosystem
  - scanpy
  - anndata=0.12.10
  - leidenalg
  - louvain
  - python-igraph
  - bbknn
  - umap-learn
  - pynndescent
  - fa2

  # Cell annotation
  - celltypist
  - cellxgene-census
  - scvi-tools

  # CNV analysis
  - gffutils

  # Visualization
  - matplotlib-base
  - seaborn
  - plotly-orca
  - hvplot
  - adjusttext

  # Data handling
  - pybiomart
  - goatools
  - geoparse

  # SRA tools (for SRAscraper compatibility)
  - awscli
  - parallel-fastq-dump
  - pysradb
  - python-wget
  - sra-tools>=3.0.0

  # Fix for the OpenSSL Version Mismatch
  - aws-c-cal
  - awscrt
  - openssl

  # Utilities
  - pyyaml
  - jupyter_core
  - jupyterlab

  # MultiQC for reports
  - multiqc

  # GSEA
  - bioconda::gseapy

  # GUI tools (optional, can be removed for headless)
  - pyqt
  - qt
  - firefox
  - pygraphviz

  # pip dependencies
  - pip
  - pip:
    - scrublet  # For doublet detection
    - popv  # For cell annotation consensus
    - cytotrace2-py  # For stemness scoring (install from github if needed)
    - infercnvpy  # For CNV analysis


Labels: bug (Something isn't working)
