Marker files conversion #164

Closed
shawntanzk opened this issue Aug 16, 2021 · 16 comments
@shawntanzk
Collaborator

I have the NS-Forest marker files from Brian; however, they are not in the right format.

https://github.com/obophenotype/brain_data_standards_ontologies/tree/dosdp_based_pipeline/src/markers/raw

We need to convert the gene names into Ensembl IDs - @hkir-dev, could you help write something that can do this?

Currently the files are all keyed by cluster name too, and these need to be matched to taxonomy IDs - happy to do this manually if it is not easily scriptable, but I think the cluster names match the pref labels in the csv files perfectly, so we should be able to write some R code that does this too.
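A rough sketch of how the cluster-name matching could be scripted, in Python rather than R. The column names (`accession_id`, `pref_label`) and the example rows are made up for illustration; the real csv files may be laid out differently. It also reports any cluster names that do not match a label exactly, so those can be handled manually.

```python
import csv
import io

def build_label_index(taxonomy_csv_text, label_col="pref_label", id_col="accession_id"):
    """Map each preferred label to its ID (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(taxonomy_csv_text))
    return {row[label_col].strip(): row[id_col].strip() for row in reader}

def match_clusters(cluster_names, label_index):
    """Return (matched name -> ID, list of names with no exact label match)."""
    matched, unmatched = {}, []
    for name in cluster_names:
        key = name.strip()
        if key in label_index:
            matched[key] = label_index[key]
        else:
            unmatched.append(key)
    return matched, unmatched

# Made-up rows, only to show the shape of the data.
example_csv = "accession_id,pref_label\nEX_001,Cluster A\nEX_002,Cluster B\n"
index = build_label_index(example_csv)
matched, unmatched = match_clusters(["Cluster A", "Cluster C"], index)
print(matched)    # names that matched a pref label exactly
print(unmatched)  # names that still need manual attention
```

Exact string matching is the simplest option if the names really do match the pref labels perfectly; anything that ends up in `unmatched` is a sign that they do not.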

@hkir-dev
Contributor

Some markers (such as LOC100392984, FAM19A1) don't exist in our marker file (src/patterns/data/bds/ensmusg_data.tsv). Can we use one of the REST services from https://rest.ensembl.org/ to automatically search for these markers?
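A minimal sketch of what such a lookup might look like, using the `xrefs/symbol/:species/:symbol` endpoint of the Ensembl REST API. The helper names are hypothetical, and the species strings (for example `homo_sapiens` or `callithrix_jacchus`) are an assumption about what we would pass in. Note that the public REST server answers from the current Ensembl release, which may not be the release the dataset was built against.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"

def xrefs_url(species, symbol):
    """Build the URL for the Ensembl REST xrefs/symbol endpoint."""
    return "{}/xrefs/symbol/{}/{}?content-type=application/json".format(
        ENSEMBL_REST, urllib.parse.quote(species), urllib.parse.quote(symbol)
    )

def http_get_json(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def gene_ids_for_symbol(species, symbol, fetch=http_get_json):
    """Return the sorted Ensembl gene IDs for a symbol, or [] if none are found.

    `fetch` is injectable so the parsing can be checked without network access.
    """
    hits = fetch(xrefs_url(species, symbol))
    return sorted({hit["id"] for hit in hits if hit.get("type") == "gene"})

# Example call (needs network access, so it is left commented out):
# print(gene_ids_for_symbol("homo_sapiens", "FAM19A1"))
```

An empty list means the symbol did not resolve; more than one ID means the symbol is ambiguous and needs a decision.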

@shawntanzk
Collaborator Author

Yeah, I think that's what Brian was asking about with the release version and all, see issue #163.
But yeah, as long as we get an Ensembl ID that can resolve, it should be ok.

@shawntanzk
Collaborator Author

@BAevermann - I think we briefly talked about this yesterday; I was wondering which version of Ensembl we should look at to match your dataset?
I've tried a few versions and lots of genes still come up as NA.

Thanks

@shawntanzk
Collaborator Author

shawntanzk commented Aug 17, 2021

So I just learnt that a gene symbol can have multiple Ensembl IDs (I tried using EnsDb.Hsapiens.v86 and biomaRt to convert and kept getting extra lines), so I'm not sure how to handle the conversion from symbols to Ensembl IDs accurately.
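One way to at least see where the extra lines come from is to group the mapping by symbol and separate the symbols with exactly one ID from those with several. This is only a sketch: the column names (`symbol`, `ensembl_id`) and the rows are placeholders, not the layout of any real export.

```python
import csv
import io
from collections import defaultdict

def group_ids_by_symbol(tsv_text, symbol_col="symbol", id_col="ensembl_id"):
    """Collect the set of IDs seen for each symbol (column names are assumed)."""
    ids = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        ids[row[symbol_col]].add(row[id_col])
    return ids

def split_by_ambiguity(ids_by_symbol):
    """Return (symbol -> single ID, symbol -> sorted list of competing IDs)."""
    unique = {s: next(iter(v)) for s, v in ids_by_symbol.items() if len(v) == 1}
    ambiguous = {s: sorted(v) for s, v in ids_by_symbol.items() if len(v) > 1}
    return unique, ambiguous

# Placeholder rows: GENE2 maps to two IDs, so it yields an "extra line".
example = "symbol\tensembl_id\nGENE1\tID_1\nGENE2\tID_2\nGENE2\tID_3\n"
unique, ambiguous = split_by_ambiguity(group_ids_by_symbol(example))
print(unique)
print(ambiguous)
```

The `unique` part can be converted automatically; the `ambiguous` part is the list that would need a rule or a manual call.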

We are currently basing it off our marker file (src/patterns/data/bds/ensmusg_data.tsv), but we aren't sure where this came from. Also, I'm not sure how the mouse marker file was generated. Was it you guys, @jeremymiller? If so, would it be possible to help us with human and marmoset too?

@hkir-dev has already got the accession IDs from the cluster names, so it's just converting gene symbols to Ensembl IDs now.

@shawntanzk
Collaborator Author

shawntanzk commented Aug 17, 2021

I have downloaded the dataset from BioMart, hope it is correct:
1fb5bc0
Currently I don't think some of the LOC terms are in Ensembl, which is a bit of an issue.

@shawntanzk
Collaborator Author

Items missing in the db:

For marmoset: {'LOC103788553', 'LOC108588895', 'GABRG1', 'CALN1', 'CCDC129', 'FAM19A1', 'LOC100401328', 'PCDH11X', 'LOC103789268', 'LOC103793569', 'LOC100392984', 'LOC100384959', 'SEPP1', 'LOC103791740', 'LOC103789461', 'LOC108588539', 'LOC108587679', 'LOC103788313', 'FAM179A', 'LOC108593203', 'MOBP', 'LOC103787232', 'FYB', 'LOC108588071', 'LOC103793609', 'LOC100403193', 'LOC108589948', 'LOC103795407', 'LOC100405319', 'LOC103795617', 'LOC108588801', 'LOC100406856', 'LOC100408486', 'MYO5B', 'LOC108588466', 'LOC103788721'}

For human: {'LOC105376081', 'LOC105369818', 'LOC105369890', 'LOC105378334', 'LOC100506497', 'LOC284825', 'LOC105376372', 'LOC105373642', 'LOC100128108', 'LOC101928964', 'LOC105374392', 'LOC101927281', 'ZFPM2_AS1', 'FAM19A1', 'LOC105370315', 'LOC100996671', 'LOC101927874', 'FER1L6_AS2', 'LOC101928114', 'LOC100132891', 'LOC105370019', 'LOC105373893', 'NPSR1_AS1', 'LOC105374971', 'FAM150B', 'LOC105374524', 'LOC105377183', 'LOC101929028', 'LOC105376917', 'LOC101927389', 'LOC105378486', 'LOC105373454', 'LOC105371310', 'LOC100128497', 'LOC105379168', 'LOC105370456', 'LOC101927459', 'LOC101927843', 'LOC105379064', 'LOC101927745', 'LOC101927668', 'LOC105377209', 'LOC101929680', 'LOC105370610', 'LOC105378031', 'LOC101928278', 'LOC105371832', 'SOX2_OT', 'LOC105376457', 'LOC101927286', 'RNF219_AS1', 'LOC101927078', 'LOC105379003', 'LOC105377703', 'LOC101927439', 'LOC105377862', 'LOC105376987', 'LOC105374973', 'LOC105373592', 'LOC101928842', 'LOC401134', 'LOC101926942', 'ADD3_AS1', 'LOC101927199', 'LOC105371663', 'LOC100507562'}
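For reference, lists like the ones above can be produced with a plain set difference between the marker symbols and the symbols present in the db. The sketch below uses hypothetical helper names and a made-up "known" set; only the three missing symbols are taken from the lists above. It also splits out the LOC-prefixed identifiers, since those make up most of what is missing.

```python
def missing_symbols(marker_symbols, known_symbols):
    """Symbols used in the marker file that have no entry in the db."""
    return set(marker_symbols) - set(known_symbols)

def split_loc(symbols):
    """Separate LOC-prefixed identifiers from the other symbols."""
    loc = sorted(s for s in symbols if s.startswith("LOC"))
    other = sorted(s for s in symbols if not s.startswith("LOC"))
    return loc, other

# "KNOWN_GENE" and the known set are invented for illustration.
markers = ["LOC284825", "FAM19A1", "SOX2_OT", "KNOWN_GENE"]
known = {"KNOWN_GENE"}
loc, other = split_loc(missing_symbols(markers, known))
print(loc)
print(other)
```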

@shawntanzk
Collaborator Author

Oligo L3-6 OPALIN LRP4-AS1 exists in the NS-Forest marker file, but not anywhere else. @BAevermann - I need some clarification here, thanks.

@jeremymiller
Collaborator

Hi Shawn. I think Brian is the best person to answer all of the above questions. BioMart is an appropriate place to do gene conversions, but is not the ONLY place, so Brian may need to weigh in. For all of the Allen Institute generated data sets (including human M1), you can download the reference transcriptome file here: https://portal.brain-map.org/atlases-and-data/rnaseq/reference-genome-files, which includes the NCBI Entrez Gene IDs, but not Ensembl IDs.

@BAevermann
Collaborator

Hey Jeremy!

So 2 of 3 links on the GTF page appear dead. The mouse SMART-seq one worked though. The file contains Ensembl and Havana (so GENCODE, I presume?) annotation; unfortunately, it doesn't seem to track the overall version of the annotation. Any chance that it's still referred to somewhere in the data processing scripts?

b.

@jeremymiller
Collaborator

All three links work for me. Try hard-reloading your browser? The mouse SMART-seq file probably does not match what was done in the M1 mouse paper. Unfortunately, I don't know the answers to your other questions. Zizhen might know.

@BAevermann
Collaborator

Ok. I'll try.

Anyway, we will need to set up infrastructure... perhaps just a metadata field... that captures the annotation version used. We are moving on to the Human/Marmoset datasets; would Zizhen also know about them?

@jeremymiller
Collaborator

jeremymiller commented Aug 17, 2021 via email

@shawntanzk
Collaborator Author

I noticed that antisense genes (e.g. FER1L6_AS2) use a - instead of a _ in the database.
I will change them in the NS-Forest file from Brian accordingly.
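If this is done by script rather than by hand, something along these lines might work. It is a sketch that assumes the only change needed is the separator before a trailing `AS` suffix (optionally followed by digits); other suffixes, such as the `_OT` in SOX2_OT, are deliberately left alone here and would need checking separately.

```python
import re

# Matches a trailing "_AS" or "_AS<digits>" suffix, e.g. "_AS1" or "_AS2".
_ANTISENSE_SUFFIX = re.compile(r"_(AS\d*)$")

def normalise_antisense(symbol):
    """Rewrite GENE_AS1 as GENE-AS1; any other symbol is returned unchanged."""
    return _ANTISENSE_SUFFIX.sub(r"-\1", symbol)

for symbol in ["FER1L6_AS2", "ZFPM2_AS1", "FAM19A1", "SOX2_OT"]:
    print(symbol, "->", normalise_antisense(symbol))
```

Anchoring the pattern to the end of the symbol avoids touching underscores that appear elsewhere in a name.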

@dosumis
Contributor

dosumis commented Aug 25, 2021

@jeremymiller - IIRC we got the actual Ensembl reference file used for the mouse analysis. Can we get the same for human and marmoset? Is Nik the right person to ask for these? This should be shared with @BAevermann too as ref gene names and IDs should come from the same genome build/Ensembl release.

@jeremymiller
Collaborator

@dosumis: there is an email thread going around about this, which I'll add you to. I'm not sure of the status at the moment, and most of our references use NCBI gene IDs rather than Ensembl, so the conversion might still be an issue. Let's resolve this in the email thread.

@shawntanzk
Collaborator Author

I think we have a plan for standardisation. I am going to close this ticket because the name is confusing - I will look through the tickets to see if there is an open one about using the same reference gene files; if not, I will write one up with what stage we are up to now.
