[Feature Request] Restore information from Matrix Market files in recount3 #40

jiapeiyuan17 · 2023-11-28T00:53:04Z

Hi Ben and Kasper,

We are currently working on a project that uses data from the GTEx project. We are particularly interested in the resource provided by recount3 and would like clarification on two specific points:

1. In your methods, you mention that "When STAR performs spliced alignment, it outputs a high-confidence collection of splice-junction calls in a file named (SJ.out.tab)". recount3 also provides Matrix Market files. Can you confirm whether these aggregated files contain the information found in the last three columns of the SJ.out.tab file?
2. If so, is there a way to convert the Matrix Market file back to a BED file with the junction read counts?

Your prompt response to these inquiries would be greatly appreciated. Thank you for your attention to this matter.

Best,
Jiapei

ChristopherWilks · 2023-11-29T17:06:18Z

Hi Jiapei,

Thanks for your interest in recount3!

For 1., the recount3 Matrix Market files are derived from the SJ.out.tab files aggregated across the samples of a particular study (or tissue, in the case of GTEx v8). I'll have to double-check whether we did any additional filtering (it's been a while), but they should contain the vast majority of what was in the SJ.out.tab files.

For 2., given that you want the splice junctions in a BED file of counts, you're probably best off using Snaptron's re-formatted version of the GTEx v8 junctions in recount3:

https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz

The header file is:
https://snaptron.cs.jhu.edu/data/junctions.header.tsv

You'll also want to (minimally) download the GTEx samples description TSV:
https://snaptron.cs.jhu.edu/data/gtexv2/samples.tsv

Here the rail_id column (the first column) is the sample ID. Each junction's samples field in the junctions file is a comma-delimited list of these IDs, identifying the GTEx samples the junction appears in (those with at least one supporting read). That field also carries the junction's spliced read count for each sample, e.g. <sample_id>:<spliced_read_count>,...

Chris
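
As an illustration of the samples field described above, here is a minimal Python sketch that expands one junction into BED-like rows, one per sample. It is not part of recount3 or Snaptron: the function names are hypothetical, and it assumes the junction coordinates are 1-based and inclusive (so the BED start is start - 1). Check the header file for the actual column order before relying on it.

```python
from typing import Dict, Iterator, Tuple


def parse_samples_field(field: str) -> Dict[str, int]:
    """Parse '<sample_id>:<spliced_read_count>,...' into {sample_id: count}."""
    counts: Dict[str, int] = {}
    for token in field.split(","):
        token = token.strip()
        if not token:  # tolerate a leading or trailing comma
            continue
        sample_id, _, count = token.partition(":")
        counts[sample_id] = int(count)
    return counts


def junction_to_bed_rows(
    chrom: str, start: int, end: int, strand: str, samples_field: str
) -> Iterator[Tuple[str, int, int, str, int, str]]:
    """Yield one (chrom, bed_start, bed_end, sample_id, count, strand) per sample.

    Assumption: input coordinates are 1-based inclusive, so only the start
    is shifted to get BED's 0-based, half-open interval.
    """
    bed_start = start - 1
    for sample_id, count in parse_samples_field(samples_field).items():
        yield (chrom, bed_start, end, sample_id, count, strand)
```

The sample_id values can then be joined against the rail_id column of samples.tsv to recover the GTEx sample metadata.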

ChristopherWilks · 2023-11-29T17:19:14Z

Also, I should point out that the .bgz file is in a gzip-compatible block-gzip format, so it can be read by gzip or pigz. There's also a Tabix index file, https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz.tbi, which you can use to quickly query junctions in a genomic coordinate range.
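
Because the file is gzip-compatible, it can also be streamed with Python's standard gzip module. The sketch below does a plain linear scan for junctions overlapping a region; it does not use the Tabix index, which would be much faster on the full file. The function name is hypothetical, and the column positions (chromosome, start, end in the second to fourth columns) are an assumption that should be checked against the header file.

```python
import gzip
from typing import Iterator, List

# Assumed 0-based column positions; verify against junctions.header.tsv.
CHROM_COL, START_COL, END_COL = 1, 2, 3


def scan_junctions(
    path: str, chrom: str, region_start: int, region_end: int
) -> Iterator[List[str]]:
    """Yield the tab-split fields of junctions overlapping the given region."""
    # gzip.open reads multi-member gzip streams, which is what block-gzip is.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= END_COL or fields[CHROM_COL] != chrom:
                continue
            start, end = int(fields[START_COL]), int(fields[END_COL])
            if start <= region_end and end >= region_start:
                yield fields
```

For example, scan_junctions("junctions.bgz", "chr1", 1000000, 2000000) would iterate over the junctions on chr1 that overlap that range, assuming the file has been downloaded locally.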

lcolladotor · 2023-11-30T23:29:26Z

Hi,

It sounds like, thanks to Chris, we can close this issue. Is that right, Jiapei?

Best,
Leo

lcolladotor added the "question" label (Further information is requested) on Nov 30, 2023
