A Python-based distributed computing workflow for efficiently downloading and processing large-scale Sequence Read Archive (SRA) data. The project optimizes resource allocation across compute nodes using the HTCondor workload management system.
- Smart Query Generation: Search SRA database with custom date ranges and keywords
- Automated Load Balancing: Distributes SRA downloads across compute nodes based on file sizes
- Resource Optimization: Dynamically adjusts CPU and disk allocation based on workload
- HTCondor Integration: Generates optimized submit files for distributed processing
- Python 3.9
- pysradb
- pandas
- loguru
git clone https://github.com/William-Gardner-Biotech/SRA_Dispatch.git
cd SRA_Dispatch
pip install -r requirements.txt
Create a config.json file in the config/ directory with the following structure:
{
  "dates": {
    "start": "dd-mm-yyyy",
    "end": "dd-mm-yyyy"
  },
  "query": {
    "keyword1": "your-keyword",
    "keyword2": "your-keyword"
  },
  "process_configs": {
    "on_chtc": true,
    "cpu_per_node": 8,
    "max_cpu_request": 32,
    "minimum_submissions_for_balancing": 10
  },
  "directory": {
    "output_results": "path/to/output"
  },
  "files": {
    "sra_list_folder": "sras_to_process",
    "sra_query_file": "sra_queue.txt"
  }
}
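As a quick sanity check before running the workflow, the structure above can be loaded and inspected with a few lines of Python. This is a minimal sketch, not the project's own loader: the function name `load_config` and the `REQUIRED_SECTIONS` tuple are illustrative assumptions, and only the top-level section names are taken from the example above.

```python
import json
from pathlib import Path

# Top-level sections taken from the example config; the check itself is illustrative.
REQUIRED_SECTIONS = ("dates", "query", "process_configs", "directory", "files")


def load_config(path):
    """Read a config.json file and confirm the expected top-level sections exist."""
    config = json.loads(Path(path).read_text())
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise KeyError(f"config is missing sections: {', '.join(missing)}")
    return config
```

Calling `load_config("config/config.json")` returns a plain dictionary, so a value such as `config["process_configs"]["cpu_per_node"]` can be read directly.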
- Configure your search parameters and resource requirements in config.json
- Run the workflow:
python3 -m sra_dispatch
The workflow will:
- Query the SRA database based on your parameters
- Balance the workload across available compute nodes
- Generate HTCondor submit files
- Create node-specific SRA lists in the sras_to_process/ directory
The query_SRA_for_size_df function queries the SRA database using the specified date range and keywords, returning a DataFrame of accessions and file sizes.
The balance_nodes function:
- Calculates optimal disk space allocation per node
- Distributes SRA downloads based on file sizes
- Adjusts CPU allocation based on workload distribution
- Generates node-specific SRA lists
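The size-based distribution step can be illustrated with a simple greedy heuristic: sort accessions from largest to smallest and always hand the next one to the node with the least data so far. This is a sketch of one common approach, not necessarily what balance_nodes does internally; the function name `balance_by_size` and the accession IDs in the usage note are hypothetical.

```python
import heapq


def balance_by_size(sizes, n_nodes):
    """Assign accessions to nodes, largest first, always filling the lightest node.

    sizes: mapping of accession -> size (any consistent unit).
    Returns (per-node accession lists, per-node size totals).
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")

    # Min-heap of (current total size, node index) so the lightest node pops first.
    heap = [(0, index) for index in range(n_nodes)]
    heapq.heapify(heap)
    nodes = [[] for _ in range(n_nodes)]

    for accession, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        total, index = heapq.heappop(heap)
        nodes[index].append(accession)
        heapq.heappush(heap, (total + size, index))

    totals = [0] * n_nodes
    for total, index in heap:
        totals[index] = total
    return nodes, totals
```

For example, `balance_by_size({"SRR1": 8, "SRR2": 7, "SRR3": 6, "SRR4": 5, "SRR5": 4}, 2)` splits the five accessions into groups totalling 17 and 13. Each per-node list could then be written to its own file in the SRA list folder.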
The workflow generates HTCondor submit files with computed resource requirements and configuration parameters.
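To give a sense of what such a submit file might contain, the sketch below renders one from computed values. The submit commands used (`executable`, `arguments`, `transfer_input_files`, `request_cpus`, `request_memory`, `request_disk`, `log`, `output`, `error`, `queue`) are standard HTCondor commands, but the function `render_submit_file`, the default memory request, and the file names are assumptions made for illustration; the files this project actually generates may differ.

```python
def render_submit_file(executable, sra_list, cpus, disk_gb, memory_gb=8):
    """Build the text of a minimal HTCondor submit file for one node's SRA list."""
    lines = [
        f"executable = {executable}",
        f"arguments = {sra_list}",
        # Ship the node-specific accession list along with the job.
        f"transfer_input_files = {sra_list}",
        f"request_cpus = {cpus}",
        f"request_memory = {memory_gb}GB",
        f"request_disk = {disk_gb}GB",
        "log = job.log",
        "output = job.out",
        "error = job.err",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical script and list names, shown only to demonstrate the call.
    print(render_submit_file("process_sra.sh", "node_0.txt", cpus=8, disk_gb=120))
```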
- Validates date formats and ranges
- Checks minimum submission thresholds
- Ensures balanced node allocation
- Prevents empty partitions through dynamic adjustment
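The date checks listed above can be sketched with the standard library alone. The example assumes the dd-mm-yyyy format shown in the config and a hypothetical helper name, `validate_date_range`; the project's own validation may apply additional rules.

```python
from datetime import datetime

# Matches the dd-mm-yyyy strings used in the "dates" section of config.json.
DATE_FORMAT = "%d-%m-%Y"


def validate_date_range(start, end):
    """Parse both dates and confirm that start does not fall after end."""
    try:
        start_date = datetime.strptime(start, DATE_FORMAT).date()
        end_date = datetime.strptime(end, DATE_FORMAT).date()
    except ValueError as exc:
        raise ValueError(f"dates must use dd-mm-yyyy: {exc}") from exc
    if start_date > end_date:
        raise ValueError("start date must not be after end date")
    return start_date, end_date
```

Both a malformed date such as `2024-01-01` and a reversed range raise `ValueError`, so a caller can stop before any query is sent.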
- Set on_chtc: false in config for local, non-HTCondor execution
- File sizes are multiplied by 20 to account for the disk space fasterq-dump needs during processing
- The system automatically adjusts CPU allocation when nodes are underutilized
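As a worked example of the 20x multiplier noted above, the sketch below turns a list of SRA file sizes into a disk request in whole gibibytes. The helper name `estimate_disk_gb` and the rounding choice are assumptions for illustration, not the project's exact calculation.

```python
import math

# Multiplier stated in the notes above; applied to raw SRA file sizes.
FASTERQ_DUMP_MULTIPLIER = 20


def estimate_disk_gb(sra_sizes_bytes):
    """Return the disk space to request, in whole GiB, for a node's SRA files."""
    total_bytes = sum(sra_sizes_bytes) * FASTERQ_DUMP_MULTIPLIER
    return math.ceil(total_bytes / 1024**3)
```

Under these assumptions, a node holding one 1 GiB file and one 512 MiB file would request 30 GiB of disk.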