[Feature Request] Restore information from Matrix Market files in recount3 #40

Open
jiapeiyuan17 opened this issue Nov 28, 2023 · 3 comments
Labels
question Further information is requested

Comments

@jiapeiyuan17

Hi Ben and Kasper,

We are currently working on a project that uses data from the GTEx project. We are particularly interested in the resource provided by recount3 and would like clarification on two points:

  1. In your methods, you mention that "When STAR performs spliced alignment, it outputs a high-confidence collection of splice-junction calls in a file named (SJ.out.tab)". recount3 provides the aggregated junction data as Matrix Market files. Can you confirm whether these aggregated files contain the information found in the last three columns of the SJ.out.tab file?
  2. If so, is there a way to convert the Matrix Market file back to a BED file with the junction read counts?

A prompt response to these questions would be greatly appreciated. Thank you for your attention to this matter.

Best,
Jiapei

@ChristopherWilks
Collaborator

Hi Jiapei,

Thanks for your interest in recount3!

For 1., the recount3 Matrix Market files are derived from the aggregated SJ.out.tab files across the samples of a particular study (or tissue, in the case of GTEx v8). I'll have to double-check whether we did any additional filtering (it's been a while), but they should contain the vast majority of what was in the SJ.out.tab files.
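
For anyone who still wants to work from a Matrix Market file directly, here is a minimal sketch of what a conversion could look like. It relies only on the standard Matrix Market coordinate layout (a `%%MatrixMarket` banner, optional `%` comments, a size line, then 1-based `row col value` entries). Everything else is an assumption for illustration: the `junctions` list (one coordinate tuple per matrix row) and the `sample_ids` list (one ID per matrix column) are hypothetical inputs you would have to build from the companion annotation files, and the coordinates are assumed to be 1-based inclusive, as in STAR's SJ.out.tab.

```python
import io
from typing import Iterable, Iterator, Sequence, TextIO, Tuple

Junction = Tuple[str, int, int, str]  # (chrom, start, end, strand); hypothetical layout


def iter_matrix_market(handle: TextIO) -> Iterator[Tuple[int, int, int]]:
    """Yield 0-based (row, col, value) triples from a coordinate-format file."""
    banner = handle.readline()
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")
    if "coordinate" not in banner:
        raise ValueError("only the sparse 'coordinate' layout is handled")
    size_seen = False
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip comments and blank lines
        if not size_seen:
            size_seen = True  # first data line is "rows cols nonzeros"
            continue
        row, col, value = line.split()[:3]
        yield int(row) - 1, int(col) - 1, int(float(value))


def to_bed_lines(
    entries: Iterable[Tuple[int, int, int]],
    junctions: Sequence[Junction],
    sample_ids: Sequence[str],
) -> Iterator[str]:
    """Emit one BED-like line per (junction, sample) pair with a nonzero count."""
    for row, col, count in entries:
        chrom, start, end, strand = junctions[row]
        # Assumption: input is 1-based inclusive; BED is 0-based, half-open.
        yield f"{chrom}\t{start - 1}\t{end}\t{sample_ids[col]}\t{count}\t{strand}"


# Made-up toy data, only to show the shape of the inputs.
demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "2 2 3\n"
    "1 1 5\n"
    "2 1 7\n"
    "2 2 1\n"
)
junctions = [("chr1", 1000, 2000, "+"), ("chr2", 500, 900, "-")]
sample_ids = ["sampleA", "sampleB"]
for bed_line in to_bed_lines(iter_matrix_market(demo), junctions, sample_ids):
    print(bed_line)
```

The name column carries the sample ID and the score column carries the raw read count, so the output is BED-like rather than strictly valid BED (where the score is capped at 1000).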

For 2., given that you want the splice junctions as a BED file of counts, you're probably best off using Snaptron's reformatted version of the GTEx v8 junctions in recount3:

https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz

The header file is:
https://snaptron.cs.jhu.edu/data/junctions.header.tsv

You'll also want to (minimally) download the GTEx samples description TSV:
https://snaptron.cs.jhu.edu/data/gtexv2/samples.tsv

where the rail_id column (the first column) is the sample ID. These IDs appear in the comma-delimited list in the samples field of the junctions file, which records, for each junction, the GTEx samples it appears in (those with at least one supporting read). That field also contains the junction's spliced read count for each sample, e.g. <sample_id>:<spliced_read_count>,...
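
As a rough sketch of how that samples field could be unpacked into per-sample BED-like lines: the `<sample_id>:<spliced_read_count>` layout is as described above, but the column names `chromosome`, `start`, `end`, and `strand` are assumptions here and should be checked against the header file, as should the assumption that coordinates are 1-based inclusive. The example record is made up.

```python
from typing import Dict, Iterator, List


def parse_samples_field(field: str) -> Dict[int, int]:
    """Turn '<sample_id>:<count>,<sample_id>:<count>,...' into {sample_id: count}."""
    counts: Dict[int, int] = {}
    for item in field.split(","):
        if not item:
            continue  # tolerate a leading or trailing comma
        sample_id, read_count = item.split(":")
        counts[int(sample_id)] = int(read_count)
    return counts


def junction_to_bed(header: List[str], line: str) -> Iterator[str]:
    """Yield one BED-like line per sample that supports the junction."""
    record = dict(zip(header, line.rstrip("\n").split("\t")))
    # Assumption: coordinates are 1-based inclusive; BED is 0-based, half-open.
    bed_start = int(record["start"]) - 1
    for sample_id, count in parse_samples_field(record["samples"]).items():
        yield "\t".join(
            [
                record["chromosome"],
                str(bed_start),
                record["end"],
                str(sample_id),  # rail_id, to be joined against samples.tsv
                str(count),
                record["strand"],
            ]
        )


# Illustrative subset of columns and a made-up junction record.
header = ["snaptron_id", "chromosome", "start", "end", "strand", "samples"]
line = "0\tchr1\t1000\t2000\t+\t,12:5,34:2\n"
for bed_line in junction_to_bed(header, line):
    print(bed_line)
```

Looking columns up by name from the header line, rather than by fixed position, keeps the sketch from silently breaking if the real column order differs from the one assumed here.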

Chris

@ChristopherWilks
Collaborator

Also, I should point out that the .bgz file is in a gzip-compatible block-gzip format that can be read by gzip or pigz. There's also a Tabix index file, https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz.tbi, which you can use to quickly query junctions within a genomic coordinate range.
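
Because the file is gzip-compatible, it can also be streamed with Python's standard `gzip` module. The sketch below does a plain linear scan for junctions overlapping a coordinate range; it does not use the Tabix index, so it is much slower than an indexed query on a large file. The default column positions (chromosome, start, end in the second to fourth columns) are an assumption and should be checked against the header file.

```python
import gzip
from typing import Iterator, List


def junctions_in_range(
    path: str,
    chrom: str,
    start: int,
    end: int,
    chrom_col: int = 1,  # assumed 0-based column positions; verify with the header
    start_col: int = 2,
    end_col: int = 3,
) -> Iterator[List[str]]:
    """Yield tab-split records on `chrom` that overlap [start, end]."""
    # gzip.open reads concatenated gzip members, which is what block-gzip produces.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[chrom_col] != chrom:
                continue
            if int(fields[start_col]) <= end and int(fields[end_col]) >= start:
                yield fields
```

A call would look like `junctions_in_range("junctions.bgz", "chr1", 150, 300)`, where the local file name is whatever you saved the download as.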

@lcolladotor lcolladotor added the question Further information is requested label Nov 30, 2023
@lcolladotor
Member

Hi,

It sounds like, thanks to Chris, we can close this issue. Is that right, Jiapei?

Best,
Leo
