Problem with GTF file when running ribo-former #2

JC-therea · 2023-08-31T08:42:20Z

Good morning,

I'm currently facing an issue with your program. I've managed to install it successfully, but when using my own data, I'm encountering some problems. The error message I'm encountering is as follows:

$ riboformer template.yaml
Loading assembly data...
INFO:root:Extracted GTF attributes: ['gene_id', 'ID', 'Name', 'transcript_id', 'Parent', 'original_biotype']
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ribo-former/bin/riboformer", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/riboformer.py", line 46, in main
    process_data(args)
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/data.py", line 63, in process_data
    f = parse_transcriptome(f, args.gtf_path, args.fa_path)
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/transcript_transformer/data.py", line 83, in parse_transcriptome
    gtf = gtf.with_columns(pl.col("exon_number").cast(pl.Int32, strict=False))
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 6602, in with_columns
    self.lazy()
  File "/opt/miniconda3/envs/ribo-former/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1438, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ColumnNotFoundError: exon_number

Error originated just after this operation:
DF ["seqname", "source", "feature", "start"]; PROJECT */14 COLUMNS; SELECTION: "None"

I work with yeast, and therefore, the GTF file I'm using follows the same format as the human GTF, but the attribute column is different; it contains less information in my case. Is the 'exon_number' field taken into account during the execution of ribo-former? If this parameter is indeed required, do you think there might be a way to parse the GTF file to correctly add the field 'exon_number' ?

Thank you very much for your assistance.

The text was updated successfully, but these errors were encountered:

JC-therea · 2023-08-31T09:25:15Z

If you have any toy data to test your program in my workstation it would be great too.

jdcla · 2023-08-31T13:48:12Z

The exon_number field is currently required as the code obtains the the transcript sequences from the fasta file and expects introns to be present.
Could you send me a link to the assembly files that you are using, I might take a look to see how they are formatted.
I think I might be able to fix this on my side by simply allowing no exon number to be present if only one entry of exons (feature = exon) is found for a given transcript ID.
Thanks for your feedback.

JC-therea · 2023-08-31T16:52:09Z

Perhaps more fields are missing because I don't know which fields are needed in the attribute column of the gtf... It would be great if you could provide an example of the GTF format you are utilizing to maybe fix it by my side.

Nonetheless, here is included the link to the files I am currently using: https://drive.google.com/drive/folders/1N5FgqSZDGSURKGUj0YojXXiVAiUUvFTD?usp=drive_link

jdcla · 2023-08-31T17:06:10Z

The whole study is build upon the Ensembl human genome, which definitely works.
However, I have allowed flexibility for several fields, so not all fields provided are necessary.
Specifically, all information that would be necessary to parse sequence information from the fasta file is required, such as strand, start, end, exon_number. Additionally, it is necessary to have a transcript_id.

edit: start_codon is also a required field.

jdcla · 2023-09-01T13:29:29Z

I've included some toy data in the RIBO_former repository. Please also update transcript_transformer, as I have fixed a regression bug.

I looked at your gtf and there are definitely multiple transcripts with multiple exons. I believe there is a way forward to parse GTF files without using the exon number feature, as exons can be sorted is either through using the ID field or by simply looking at the coordinates and strand information. The latter seems more robust.

However, I'm somewhat hesitant to do this at this point. Your gtf file furthermore has the problem of not containing any start codon fields, which is the feature used to obtain the posive set (Translation initiation sites). I'm not sure how common it is to have assemblies without the exon_number and start_codon feature. For the human genome, not every coding sequence has a start codon feature. If I remember correctly, this is because not every transcript isoform is valid (wrong splicing etc.)

I assume these gtf files have been generated through some program locally? Could you share how you obtained these?

jdcla · 2023-09-01T13:46:38Z

There is another problem, namely that the script currently does not support non-human transcriptome fine-tuning. More precise, it does not search for seqnames that are not part of the human transcriptome.

There is no inherent blocker to use the pre-trained model for non-human data, but I have currently implemented the code in a way that it only looks at and splits the data according to the seqnames the models were pre-trained on (https://github.com/jdcla/transcript_transformer/blob/main/transcript_transformer/pretrained/riboformer_models/50perc_06_23.yml).

I added this feature to the roadmap and hope to implement this soon. See the README.md to track this.

JC-therea · 2023-09-01T14:51:56Z

Hello! Thank you for continuing to assist with the issue I've encountered. The GTF file I'm using is an adaptation of a GFF3 file obtained from the most up-to-date S. pombe genome page at https://www.pombase.org/data/genome_sequence_and_features/gff3/. As you can see, it doesn't contain much of the information you mentioned. It's a limitation due to the data types I'm working with, which are inherently less comprehensive than human annotations.

I generated the GTF using AGAT (https://github.com/NBISweden/AGAT/tree/master), which is quite robust for making these types of modifications.

For now, I'll attempt to use the toy data to gain a more detailed understanding of how the program operates.

Once again, thank you very much for the effort you've put into helping me with this issue.

jdcla · 2023-09-01T18:20:11Z

You're welcome. I appreciate the feedback.
I will keep this issue open as it is relevant to address the exon_number problem.

jdcla closed this as completed Sep 1, 2023

jdcla reopened this Sep 1, 2023

JC-therea closed this as completed Sep 1, 2023

jdcla reopened this Sep 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem with GTF file when running ribo-former #2

Problem with GTF file when running ribo-former #2

JC-therea commented Aug 31, 2023

JC-therea commented Aug 31, 2023

jdcla commented Aug 31, 2023 •

edited

Loading

JC-therea commented Aug 31, 2023

jdcla commented Aug 31, 2023 •

edited

Loading

jdcla commented Sep 1, 2023

jdcla commented Sep 1, 2023

JC-therea commented Sep 1, 2023

jdcla commented Sep 1, 2023

Problem with GTF file when running ribo-former #2

Problem with GTF file when running ribo-former #2

Comments

JC-therea commented Aug 31, 2023

JC-therea commented Aug 31, 2023

jdcla commented Aug 31, 2023 • edited Loading

JC-therea commented Aug 31, 2023

jdcla commented Aug 31, 2023 • edited Loading

jdcla commented Sep 1, 2023

jdcla commented Sep 1, 2023

JC-therea commented Sep 1, 2023

jdcla commented Sep 1, 2023

jdcla commented Aug 31, 2023 •

edited

Loading

jdcla commented Aug 31, 2023 •

edited

Loading