Tools-For-Finding-Comparing-Assembling-NCBI-Genomes

Automatically download, compare, and assemble genomes from NCBI

- findGenome.py (version 5)

Dependencies This script requires Python 3.6+ and the following libraries: pip install biopython

Install the NCBI Datasets CLI for handling nuclear genomes: conda install -c bioconda ncbi-datasets-cli

This script will automatically download and filter organellar and nuclear genomes from the NCBI database

# basic code:
python findGenome.py "group" "outfolder" --genome_type "chloroplast" --length_threshold INT --batch_size 50 --duplicate_removal --max_individuals_per_species INT --overwrite

# examples:
python findGenome5.py -g "ranunculaceae" -o ./genome_ranunculaceae --genome_type "nuclear_genome" --overwrite --email XXX@XXX

python findGenome5.py -g "ranunculaceae" -o ./chloroplast_ranunculaceae --genome_type "chloroplast" --duplicate_removal --max_individuals 2 --overwrite --email XXX@XXX

python findGenome5.py -g "ranunculaceae" -o ./mitogenome_ranunculaceae --genome_type "mitochondrial" --duplicate_removal --max_individuals 2 --overwrite --email XXX@XXX

# usage:
findGenome.py [-h] [--outfolfder OUTFOLDER] [--group GROUP]
                   [--genome_type {chloroplast,mitochondrial,nuclear_genome}] [--batch_size BATCH_SIZE]
                   [--duplicate_removal] [--max_individuals MAX_INDIVIDUALS_PER_SPECIES]
                   [--overwrite] [--email EMAIL]

Download plastid, mitochondrial, or nuclear genomes from NCBI.

options:
  -h, --help            Show this help message and exit
  -o, --outfolder       Output folder for downloaded files.
  -g, --group           Taxonomic group or organism name (e.g., genus, family, order).
  -t, --genome_type {chloroplast,mitochondrial,nuclear_genome}
                        The type of genome to download.
  --batch_size BATCH_SIZE
                        Batch size for downloading genomes (only organellar genomes).
  --duplicate_removal   Remove duplicate sequence files (only for organellar genomes).
  --max_individuals MAX_INDIVIDUALS_PER_SPECIES
                        Maximum number of individuals per species to retain (only organellar genomes).
  --overwrite           Overwrite existing output folder.
  --email               Your email for NCBI Entrez queries.

- assembleGenome.py / assembleGenes.py (version 5)

Dependencies This script requires Java 1.6 or later (for Astral). This script requires Python 3.6+ and the following libraries: pip install biopython

These scripts automatically extract and compare features, and align and assemble sequences/annotated CDS regions from files downloaded from NCBI

# basic code:
python assembleGenome.py -i INPUT -f FEATURE_SUMMARY -g --group_order GROUP_ORDER -s -a -r --output_dir OUTPUT_DIR --select_group SELECT_GROUP --overwrite

# example:
For a small test datasets, run: python findGenome.py -g "ranunculus" -o ./chloroplast_ranunculus --genome_type "chloroplast" --duplicate_removal --max_individuals 2 --overwrite --email XXX@XXX
python assembleGenome.py -i chloroplast_ranunculus/*.gb -f ranunculus_features -s -a -r --output_dir chloroplast_ranunculus/

# usage:
assembleGenome.py [-h] --input INPUT [FILE1.gb FILE2.gb ...] [-f FEATURE_SUMMARY] [-g] [-o GROUP_ORDER [GROUP1 GROUP2 ...]]
                       [--generate_gene_sequences] [--align_sequences] [--run_raxml] [--run_astral]
                       [--output_dir OUTPUT_DIR]
                       [--select_group SELECT_GROUP]
                       [--overwrite]

Process GenBank files and extract gene names and sequences.

options:
  -h, --help            show this help message and exit
  -i, --input INPUT     Path to the GenBank files
  -f, --feature_summary FEATURE_SUMMARY
                        Path for feature summary files (CSV and XLSX)
  -g, --group_feature_summary
                        Whether to generate a section-wise feature summary
  -o, --group_order GROUP_ORDER [GROUP_ORDER ...]
                        Optional: Order of sections in the summary files
  -s, --generate_gene_sequences
                        Generate gene sequences in FASTA format
  -a, --align_sequences
                        Align gene sequences using MAFFT
  -r, --run_raxml       Run RAxML-NG for phylogenetic analysis
  --select_group SELECT_GROUP
                        Limit gene extraction to a specific group
  --output_dir OUTPUT_DIR
                        Output directory for gene sequences and alignment results
  --overwrite           Overwrite existing files if they exist

# basic code:
python assembleGenes.py -i INPUT -s FEATURE_SUMMARY -g --group_order GROUP_ORDER -s -a -r --output_dir OUTPUT_DIR --select_group SELECT_GROUP --overwrite

# example:
For a small test datasets, run: python findGenome5.py -g "ranunculus" -o ./chloroplast_ranunculus --genome_type "chloroplast" --duplicate_removal --max_individuals 2 --overwrite --email XXX@XXX
python assembleGenes.py -i chloroplast_ranunculus/*.gb -o ranunculus_gene_features -s -g -a -r -x 

For a small test datasets, run: python findGenome5.py -g "ranunculales" -o ./mitogenome_ranunculales --genome_type "mitochondrial" --duplicate_removal --max_individuals 2 --overwrite --email XXX@XXX
python assembleGenes.py -i mitogenome_ranunculales/*.gb -o ranunculales_gene_features —feeature_section_summary -o Papaveroideae Fumarioideae Thalictroideae Delphinieae Ranunculeae Anemoneae -s -g -a -r -x

# usage:
assembleGenes.py [-h] --input INPUT [FILE1.gb FILE2.gb ...] [-o GROUP_ORDER [GROUP1 GROUP2 ...]] [--feature_section_summary]
                       [--generate_gene_sequences] [--align_sequences] [--run_raxml] [--run_astral]
                       [--output_dir OUTPUT_DIR]
                       [--select_group SELECT_GROUP]
                       [--overwrite]

Process GenBank files and extract gene names and sequences.

options:
  -h, --help            show this help message and exit
  -i, --input INPUT     Path to the GenBank files
  -f, --feature_section_summary
                        Generate section-wise feature summary
  --group_order GROUP_ORDER [GROUP_ORDER ...]
                        Limit gene extraction to a specific group
  -g, --generate_gene_sequences
                        Generate gene sequences in FASTA format
  -a, --align_sequences
                        Align gene sequences using MAFFT
  -r, --run_raxml       Run RAxML-NG for phylogenetic analysis
  -o, --output_dir OUTPUT_DIR
                        Output directory for gene sequences and alignment results
  --overwrite           Overwrite existing files if they exist

If you use any of the scripts, please cite the following reference until the journal article is published:

Karbstein et al. (2024), BioRxiv (https://doi.org/10.1101/2023.08.08.552429)

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
LICENSE		LICENSE
README.md		README.md
assembleGenes.py		assembleGenes.py
assembleGenome.py		assembleGenome.py
findGenome.py		findGenome.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tools-For-Finding-Comparing-Assembling-NCBI-Genomes

- findGenome.py (version 5)

- assembleGenome.py / assembleGenes.py (version 5)

If you use any of the scripts, please cite the following reference until the journal article is published:

About

Releases

Packages

Languages

License

KK260/NCBI-Genome-Tools

Folders and files

Latest commit

History

Repository files navigation

Tools-For-Finding-Comparing-Assembling-NCBI-Genomes

- findGenome.py (version 5)

- assembleGenome.py / assembleGenes.py (version 5)

If you use any of the scripts, please cite the following reference until the journal article is published:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages