llomics is a simple prototype package for accessing and processing metadata from Sequence Read Archive (SRA) records.
It aims to help automate the handling of large numbers of records with frequently ambiguous metadata, and to provide structured output with standardized sample identification and classification.
To do this, llomics relies on prompts to an LLM (currently any OpenAI model via their API) to interpret metadata and determine whether, and which, perturbations (mutations, depletions, etc.) or controls are present in an experiment.
The goal of llomics is to maximize accuracy while minimizing user effort without fine-tuning the LLM.
Currently, all that is needed is an OpenAI API account and an NCBI API key.
Note
This project was recently renamed from SRAgent to llomics to disambiguate it from a recently published tool of the same name. The tool has not yet been tested after the rename.
llomics is actively under development. Expect bugs! 🐛 🐛 🐛
- Prompts and output are currently designed specifically for ChIP-seq experiments, i.e. we expect certain controls in each project, but we plan to test and optimize for all types of records in the near term.
- Development and testing have been done with records of yeast histone PTM ChIP-seq; a near-term goal is to check prompt language and response accuracy for other organisms.
- Issues and contributions are welcome!
Clone and cd into this repo, then install:

```bash
git clone git@github.com:mniederhuber/llomics.git
cd llomics
pip install .
```

llomics currently relies on the OpenAI API to access their LLMs, and an NCBI API key to access SRA metadata.
You'll need an account and API key for both services to use llomics.
llomics looks for the following environment variables:

- `OPENAI_API_KEY`
- `ENTREZ_API_KEY`
- `ENTREZ_EMAIL`
You can add these to a configuration file like .bash_profile:

```bash
export OPENAI_API_KEY='My_OpenAI_Key'
export ENTREZ_API_KEY='My_Entrez_Key'
export ENTREZ_EMAIL='My_NCBI_Acct_Email'
```
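Before running llomics, it can be useful to confirm the variables are actually visible to Python. This is a standalone sketch, not part of the llomics API:

```python
import os

# The environment variables llomics expects
REQUIRED_VARS = ["OPENAI_API_KEY", "ENTREZ_API_KEY", "ENTREZ_EMAIL"]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [v for v in REQUIRED_VARS if not env.get(v)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```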
llomics is built around a two-step process:
- fetch record metadata
- summarize and annotate metadata (with LLM)
These two steps are combined in the single function annotate(), which takes either a single BioProject ID, a list of BioProject IDs, or a previously generated metadata table (as a pandas DataFrame).
annotate() takes the following required arguments:
- `input`: BioProject ID(s) or a pandas DataFrame
- `model`: one of the OpenAI models, e.g. 'gpt-4o', 'gpt-3.5-turbo-0125'
Additional default arguments:
- `validate=True`, bool: whether to check for obvious disagreements in sample classification within a project
- `tag=True`, bool: whether to generate a sample 'tag' or 'id' in the format `{chip_target}{perturbation}{perturbation_type}_{timepoint}`, with some variation for WT and non-timecourse experiments
- `sample=None`, int: number of sub-samples to process within a project; mainly useful for testing
- `summary_reps=1`, int: number of times to summarize a project before the classification step (deprecated; to be removed)
- `outFile`, str: optional filename for CSV output of the annotated metadata
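As an illustration of the tag format above, a minimal sketch might look like the following. This is a hypothetical helper, not the actual llomics implementation:

```python
def make_tag(chip_target, perturbation, perturbation_type, timepoint=None):
    """Assemble a sample tag of the form
    {chip_target}{perturbation}{perturbation_type}_{timepoint},
    omitting the timepoint suffix for non-timecourse experiments."""
    tag = f"{chip_target}{perturbation}{perturbation_type}"
    if timepoint is not None:
        tag += f"_{timepoint}"
    return tag
```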
```python
import llomics

llomics.annotate('PRJNA721183',  # a small project with 6 samples, good for testing
                 'gpt-4o',
                 outFile='test.csv')
```

The current output is a pandas DataFrame.