removing models from a list #123

Open
dcopetti opened this issue Mar 3, 2022 · 2 comments

dcopetti commented Mar 3, 2022

Hello,

I would like to use the GFF3toolkit to remove some gene models (all with one isoform, from an external list) from a gff3 file. I first ran
gff3_QC -g assembly_MAKER1.gff -f assembly.fa -o QC_report1 -s QC_stats1
and got this report:

==> QC_report <==
Line_num        Error_code      Error_level     Error_tag
['Line 1']      Esf0014 Error   ["##gff-version" missing from the first line]
['Line 15079']  Esf0012 Info    [Found 5 Ns in CDS feature of length 296 using the external FASTA, consists of 1 segment (start, length): (210940, 5)]

==> QC_stats <==
Error_code      Number_of_problematic_models    Error_level     Error_tag
Esf0014 1       Error   ##gff-version" missing from the first line
Esf0012 1       Info    Found Ns in a feature using the external FASTA

(I can fix the header myself)
I wonder how I can use gff3_fix to remove ~1500 genes (gene, mRNA, exon, and CDS lines): is it possible to create a 4-column file to submit to -qc_r? Can I use any of the error codes that have a "delete_model" function? Is there a way to specify the gene ID instead of the line number?

Also, is there a feature to remove gene models whose protein sequence does not start with M?
Thanks,
Dario

mpoelchau (Contributor) commented

Hi @dcopetti - that's an interesting use case! I suppose you could hack a qc report file to get that done. The qc reports are line-based because not every feature in gff3 is required to have an ID. So you could provide the line number of the gene feature and assign it an error code that uses the delete_model function (https://github.com/NAL-i5K/GFF3toolkit/blob/master/docs/gff3_fix.py-documentation.rst). I've never tried this, but it might work.
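A rough sketch of that idea, untested against gff3_fix itself: look up the line number of each listed gene feature and write rows in the same layout as the QC report shown above. The function names, the tab-separated layout, and the tag text are assumptions made for illustration; the error code is deliberately left as an argument, since it has to be one that the gff3_fix documentation maps to delete_model.

```python
import re


def gene_line_numbers(gff_lines, gene_ids):
    """Map each listed gene ID to the 1-based line number of its gene feature."""
    wanted = set(gene_ids)
    found = {}
    for num, line in enumerate(gff_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match and match.group(1) in wanted:
            found[match.group(1)] = num
    return found


def qc_report_rows(line_numbers, error_code, level="Error",
                   tag="model listed for removal"):
    """Format report rows; the column layout is assumed to be tab-separated."""
    header = "Line_num\tError_code\tError_level\tError_tag"
    rows = [f"['Line {n}']\t{error_code}\t{level}\t[{tag}]"
            for n in sorted(line_numbers)]
    return [header] + rows


# Small in-memory example; real use would read the GFF3 file and the ID list.
demo = [
    "##gff-version 3",
    "\t".join(["chr1", "maker", "gene", "1", "900", ".", "+", ".", "ID=geneA"]),
    "\t".join(["chr1", "maker", "gene", "2000", "2900", ".", "-", ".", "ID=geneB"]),
]
hits = gene_line_numbers(demo, ["geneB"])
# "REPLACE_WITH_REAL_CODE" is a placeholder, not a real GFF3toolkit error code.
report = qc_report_rows(hits.values(), "REPLACE_WITH_REAL_CODE")
```

Any listed ID that does not appear in `hits` was not found as a gene feature, which is worth checking before handing the file to gff3_fix.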

GFF3toolkit doesn't have a function to flag or delete models with partial protein sequences.
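Outside the toolkit, one possible workaround is to translate the models to a protein FASTA with another tool (gffread's -y option is one way to get such a file) and list the records whose sequence does not begin with M. The sketch below assumes such a protein FASTA already exists; the function names are made up for this example.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # The record ID is taken as the first word of the header line.
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ids_without_start_met(lines):
    """Return IDs of proteins whose sequence does not start with methionine."""
    return [name for name, seq in read_fasta(lines)
            if not seq.upper().startswith("M")]


# In-memory example; real use would pass an open protein FASTA file handle.
demo = [">t1 gene=g1", "MKLV", "AAG", ">t2", "KLVA", ">t3", "mstp"]
partial = ids_without_start_met(demo)
```

The resulting IDs are whatever the FASTA headers carry (often transcript IDs rather than gene IDs), so they may need mapping back to genes before being used as a removal list.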

dcopetti commented Mar 4, 2022

Thanks, I will try it next time!
I found a way with gffread, using --nids and --keep-genes - mine was not a new problem after all.
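For anyone without gffread at hand, the same kind of ID-based removal can be sketched in plain Python. This is not what gffread does internally, just a minimal illustration that drops the listed genes together with every feature that descends from them through Parent attributes; the function names are hypothetical, and it assumes the features carry ID and Parent attributes.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key/value pairs."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def _id_and_parents(line):
    """Return (ID, [parents]) for a feature line, or None for other lines."""
    if line.startswith("#") or line.count("\t") < 8:
        return None
    attrs = parse_attributes(line.rstrip("\n").split("\t")[8])
    return attrs.get("ID"), attrs.get("Parent", "").split(",")


def drop_models(gff_lines, gene_ids):
    """Return the lines with the listed genes and all their descendants removed."""
    lines = list(gff_lines)
    removed = set(gene_ids)
    # Collect descendant IDs until nothing new turns up, so line order is irrelevant.
    changed = True
    while changed:
        changed = False
        for line in lines:
            parsed = _id_and_parents(line)
            if parsed is None:
                continue
            fid, parents = parsed
            if fid and fid not in removed and any(p in removed for p in parents):
                removed.add(fid)
                changed = True
    kept = []
    for line in lines:
        parsed = _id_and_parents(line)
        # A feature is dropped if it, or any one of its parents, is marked;
        # this fits the single-isoform case described in this issue.
        if parsed is not None and (
            parsed[0] in removed or any(p in removed for p in parsed[1])
        ):
            continue
        kept.append(line)
    return kept
```

Comment and directive lines pass through unchanged, so the ##gff-version header is preserved.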
