snpgenie_within_group.pl: Question about input file formats #37

Open
annasimonsen opened this issue Aug 7, 2020 · 2 comments

@annasimonsen

I have several thousand FASTA files. Each file represents a single gene and contains all sequenced individuals from a single population. Each file is a nucleotide alignment, which I have attempted to align in frame (by codon) using TranslatorX. I would like to calculate piN/piS for each gene using snpgenie_within_group.pl.

I do not have a GTF file. Can I run snpgenie_within_group.pl without one? If not, can you offer any guidance on how I would format this type of sequence data and generate a correctly formatted GTF file?

Thanks

@singing-scientist
Contributor

singing-scientist commented Aug 7, 2020

Thanks a lot for the question @annasimonsen! Unfortunately I wrote the script with chromosomes (not genes) in mind and did not have enough foresight to allow flexible usage without the GTF. However, I think your pipeline could auto-generate temporary GTF files on the fly. For example, suppose your directory contains three files: gene1.fasta, gene2.fasta, and gene3.fasta. You'll probably be looping through these files somehow, perhaps using a wrapper Unix script. When you hit gene1.fasta, you can determine the length of the sequences inside (I think bioawk has something ready-made, or you could get clever with cat, grep, and awk) and then write a file called gene1.gtf. For example, if gene1.fasta is an alignment of sequences with 693 nucleotides, you'd simply write the 1-line GTF file:

my_temp_gtf my_temp_gtf CDS 1 693 . + 0 gene_id "gene1";

Then provide that temp file as an argument to SNPGenie, and delete it when finished (or whatever you prefer). In other words, every temp GTF file you produce will be a single line that specifies one gene beginning at 1 (in every case) and ending at the last site (i.e., the alignment length).
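
For anyone who would rather script this step than chain cat, grep, and awk, here is a minimal Python sketch of the idea. It is not part of SNPGenie: the function names `alignment_length` and `write_temp_gtf` are made up for illustration, and it assumes each FASTA file is an alignment in which every sequence has the same length. The fields are joined with tabs, since GTF is a tab-delimited format.

```python
from pathlib import Path


def alignment_length(fasta_path):
    """Return the shared sequence length of an aligned FASTA file.

    Hypothetical helper: assumes every record in the file has the same
    length, and raises ValueError if that is not the case.
    """
    lengths = set()
    current = None  # length of the record being read; None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.add(current)
                current = 0
            elif line and current is not None:
                current += len(line)  # sequences may wrap across several lines
    if current is not None:
        lengths.add(current)
    if len(lengths) != 1:
        raise ValueError(
            f"{fasta_path}: expected one alignment length, found {sorted(lengths)}"
        )
    return lengths.pop()


def write_temp_gtf(fasta_path):
    """Write a one-line GTF next to the FASTA file and return its path.

    The gene_id is taken from the file name (gene1.fasta -> gene1); the
    record always starts at site 1 and ends at the alignment length.
    """
    fasta_path = Path(fasta_path)
    length = alignment_length(fasta_path)
    fields = [
        "my_temp_gtf", "my_temp_gtf", "CDS", "1", str(length),
        ".", "+", "0", f'gene_id "{fasta_path.stem}";',
    ]
    gtf_path = fasta_path.with_suffix(".gtf")
    gtf_path.write_text("\t".join(fields) + "\n")
    return gtf_path


if __name__ == "__main__":
    # Loop over every alignment in the current directory.
    for fasta in sorted(Path(".").glob("*.fasta")):
        print(write_temp_gtf(fasta))
```

Each resulting .gtf file can then be passed to SNPGenie alongside its FASTA file and deleted afterwards. Raising an error on unequal sequence lengths is a deliberate choice in this sketch: it flags files that are not actually aligned before they reach SNPGenie.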

Let me know if that helps!

Chase

@annasimonsen
Author

Thanks for your quick reply! I'll let you know how that goes.
