Marker files conversion #164
Comments
Some markers (such as LOC100392984 and FAM19A1) don't exist in our marker file (src/patterns/data/bds/ensmusg_data.tsv). Can we use one of the REST services from https://rest.ensembl.org/ to look these markers up automatically?
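For reference, a minimal sketch of what such a lookup could look like, assuming the `xrefs/symbol/:species/:symbol` endpoint of the Ensembl REST API. The function names are made up for illustration, and the `fetch` parameter exists only so the parsing can be tried without network access. Note that the live service answers from the current Ensembl release, which may not match the release used for the dataset.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def xref_url(species, symbol):
    """Build the xrefs/symbol URL for one species and one gene symbol."""
    return (
        f"{ENSEMBL_REST}/xrefs/symbol/"
        f"{quote(species, safe='')}/{quote(symbol, safe='')}"
        "?content-type=application/json"
    )


def http_get_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def lookup_gene_ids(species, symbol, fetch=http_get_json):
    """Return the sorted Ensembl gene IDs reported for a symbol.

    The endpoint returns a list of objects with "id" and "type" keys;
    only entries of type "gene" are kept. An empty list means no hit.
    """
    hits = fetch(xref_url(species, symbol))
    return sorted({hit["id"] for hit in hits if hit.get("type") == "gene"})
```

Calling `lookup_gene_ids("homo_sapiens", "FAM19A1")` would query the live service. A symbol can come back with more than one ID, so the caller has to decide how to handle that.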
Yeah, I think that's what Brian was asking about with the release version and all; see issue #163.
@BAevermann - I think we briefly talked about this yesterday. Which version of Ensembl should we look at to match your dataset? Thanks
So I just learnt that a gene symbol can have multiple Ensembl IDs (tried using We are currently basing it off our marker file (src/patterns/data/bds/ensmusg_data.tsv), but we aren't sure where this came from. Also, I'm not sure how the mouse marker file was generated. Was it you guys, @jeremymiller? If so, would it be possible to help us with human and marmoset too? @hkir-dev already got the accession ID from the cluster name, so it's just converting gene symbols to Ensembl IDs now.
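To see how often one symbol maps to several IDs, something like the sketch below could be run over a tab-separated marker file. The column names `gene_id` and `gene_name` are assumptions, since the real header of ensmusg_data.tsv isn't shown in this thread.

```python
import csv
from collections import defaultdict


def load_symbol_index(tsv_file, id_col="gene_id", symbol_col="gene_name"):
    """Map each gene symbol to the set of IDs listed for it.

    tsv_file is any open text file with a header row; the default
    column names are hypothetical and should be adjusted to the file.
    """
    index = defaultdict(set)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        symbol = row[symbol_col].strip()
        if symbol:
            index[symbol].add(row[id_col].strip())
    return dict(index)


def ambiguous_symbols(index):
    """Return only the symbols that map to more than one ID."""
    return {sym: sorted(ids) for sym, ids in index.items() if len(ids) > 1}
```

Symbols returned by `ambiguous_symbols` are the ones where a plain symbol-to-ID conversion would need a manual decision or an extra rule.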
I have downloaded the dataset from BioMart; hope it is correct:
Items missing in the db:
For marmoset: {'LOC103788553', 'LOC108588895', 'GABRG1', 'CALN1', 'CCDC129', 'FAM19A1', 'LOC100401328', 'PCDH11X', 'LOC103789268', 'LOC103793569', 'LOC100392984', 'LOC100384959', 'SEPP1', 'LOC103791740', 'LOC103789461', 'LOC108588539', 'LOC108587679', 'LOC103788313', 'FAM179A', 'LOC108593203', 'MOBP', 'LOC103787232', 'FYB', 'LOC108588071', 'LOC103793609', 'LOC100403193', 'LOC108589948', 'LOC103795407', 'LOC100405319', 'LOC103795617', 'LOC108588801', 'LOC100406856', 'LOC100408486', 'MYO5B', 'LOC108588466', 'LOC103788721'}
For human: {'LOC105376081', 'LOC105369818', 'LOC105369890', 'LOC105378334', 'LOC100506497', 'LOC284825', 'LOC105376372', 'LOC105373642', 'LOC100128108', 'LOC101928964', 'LOC105374392', 'LOC101927281', 'ZFPM2_AS1', 'FAM19A1', 'LOC105370315', 'LOC100996671', 'LOC101927874', 'FER1L6_AS2', 'LOC101928114', 'LOC100132891', 'LOC105370019', 'LOC105373893', 'NPSR1_AS1', 'LOC105374971', 'FAM150B', 'LOC105374524', 'LOC105377183', 'LOC101929028', 'LOC105376917', 'LOC101927389', 'LOC105378486', 'LOC105373454', 'LOC105371310', 'LOC100128497', 'LOC105379168', 'LOC105370456', 'LOC101927459', 'LOC101927843', 'LOC105379064', 'LOC101927745', 'LOC101927668', 'LOC105377209', 'LOC101929680', 'LOC105370610', 'LOC105378031', 'LOC101928278', 'LOC105371832', 'SOX2_OT', 'LOC105376457', 'LOC101927286', 'RNF219_AS1', 'LOC101927078', 'LOC105379003', 'LOC105377703', 'LOC101927439', 'LOC105377862', 'LOC105376987', 'LOC105374973', 'LOC105373592', 'LOC101928842', 'LOC401134', 'LOC101926942', 'ADD3_AS1', 'LOC101927199', 'LOC105371663', 'LOC100507562'}
Oligo L3-6 OPALIN LRP4-AS1 exists in the NS-Forest marker file but not anywhere else. @BAevermann - need some clarification here, thanks
Hi Shawn. I think Brian is the best person to answer all of the above questions. BioMart is an appropriate place to do gene conversions, but it is not the ONLY place, so Brian may need to weigh in. For all of the Allen Institute generated data sets (including human M1), you can download the reference transcriptome file here: https://portal.brain-map.org/atlases-and-data/rnaseq/reference-genome-files, which includes the NCBI Entrez Gene IDs, but not Ensembl IDs.
Hey Jeremy! So 2 of the 3 links on the GTF page appear dead. The mouse SMART-seq one worked though. The file contains Ensembl and Havana (so GENCODE, I presume?) annotation; unfortunately, it doesn't seem to track the overall version of the annotation. Any chance that it's still referred to somewhere in the data processing scripts? b.
All three links work for me. Try hard-reloading your browser? The mouse SMART-seq file probably does not match what was done in the M1 mouse paper. Unfortunately, I don't know the answers to your other questions. Zizhen might know.
Ok, I'll try. Anyway, we will need to set up infrastructure (perhaps just a metadata field) that captures the annotation version used. We are moving on to the human/marmoset datasets; would Zizhen also know about them?
Noticed that antisense genes (e.g. FER1L6_AS2) use a - instead of a _ in the database (i.e. FER1L6-AS2).
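A small sketch of how the missing-marker check could retry with the hyphenated form before reporting a symbol as missing. It only rewrites a trailing `_AS<number>` suffix, which is the one pattern confirmed above; whether other suffixes (such as `_OT`) follow the same convention is not established here. The `known` set stands in for whatever symbols the database actually holds.

```python
import re

# Matches a trailing antisense suffix such as "_AS1" or "_AS2".
_ANTISENSE_SUFFIX = re.compile(r"_(AS\d+)$")


def normalise_symbol(symbol):
    """Rewrite e.g. 'FER1L6_AS2' as 'FER1L6-AS2'; leave other symbols alone."""
    return _ANTISENSE_SUFFIX.sub(r"-\1", symbol)


def split_missing(markers, known):
    """Split markers into those found in `known` and those still missing.

    Each marker is tried as written first, then in its normalised form.
    Returns (found, missing), where `found` maps the original marker
    to the spelling that matched.
    """
    found, missing = {}, set()
    for marker in markers:
        for candidate in (marker, normalise_symbol(marker)):
            if candidate in known:
                found[marker] = candidate
                break
        else:
            missing.add(marker)
    return found, missing
```

With this, the `_AS1`/`_AS2` entries in the missing lists above should drop out if their hyphenated forms are present, leaving mostly the LOC identifiers and renamed genes to sort out by hand.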
@jeremymiller - IIRC we got the actual Ensembl reference file used for the mouse analysis. Can we get the same for human and marmoset? Is Nik the right person to ask for these? This should be shared with @BAevermann too, as ref gene names and IDs should come from the same genome build/Ensembl release.
@dosumis: there is an email thread going around about this, which I'll add you to. I'm not sure of the status at the moment, and most of our references use NCBI gene IDs rather than Ensembl, so the conversion might still be an issue. Let's resolve it in the email thread.
I think we have a plan for standardisation. I'm going to close this ticket because the name is confusing. I'll look through the tickets to see if there is an open one about using the same reference gene files; if not, I'll write one up with the stage we are at now.
I have gotten the NS-Forest marker files from Brian; however, they are not in the right format.
https://github.com/obophenotype/brain_data_standards_ontologies/tree/dosdp_based_pipeline/src/markers/raw
We need to convert the gene names into Ensembl IDs - @hkir-dev could you help write something that can do this?
Currently it is all keyed by cluster name too and needs to be matched to taxonomy IDs - happy to do this manually if it is not easily scriptable, but I think the cluster names match the pref labels in the csv files perfectly, so we should be able to write some R code that does this too.
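As a rough illustration of the cluster-name matching (written in Python here rather than R), assuming the taxonomy csv has one column with the pref label and one with the ID. The column names `prefLabel` and `ID` are placeholders, not the actual headers.

```python
import csv


def load_label_to_id(csv_file, label_col="prefLabel", id_col="ID"):
    """Read a taxonomy csv into a {pref label: taxonomy ID} dict.

    csv_file is an open text file with a header row; the default
    column names are hypothetical.
    """
    return {
        row[label_col].strip(): row[id_col].strip()
        for row in csv.DictReader(csv_file)
        if row[label_col].strip()
    }


def match_clusters(cluster_names, label_to_id):
    """Match cluster names to taxonomy IDs by exact pref-label equality.

    Returns (matched, unmatched) so that any names that do not match
    exactly can be handled manually.
    """
    matched = {n: label_to_id[n] for n in cluster_names if n in label_to_id}
    unmatched = [n for n in cluster_names if n not in label_to_id]
    return matched, unmatched
```

If the names really do match the pref labels perfectly, `unmatched` comes back empty. Otherwise it gives the short list to fix by hand.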