Adding new Namespace datasets
A 'standard' format is available for adding new namespaces to the resource-generation pipeline. This format is a tab-delimited text file with the following columns:
- ID - a unique identifier for the namespace value (required)
- ALTIDS - any alternative ids
- LABEL - the preferred label for the namespace value (required)
- SYNONYM - alternative labels, pipe-delimited
- DESCRIPTION - documentation text
- TYPE - the encoding for the namespace value, e.g., 'O' for pathology, 'C' for complex (required)
- SPECIES - the species associated with the namespace value, if any
- XREF - equivalent values from other BEL namespaces, pipe-delimited. Must include a recognized prefix to be used for generating equivalences
- OBSOLETE - flag obsolete values with '1'
- PARENTS - any parent terms, valid for ID isA PARENT
- CHILDREN - any child terms, valid for CHILD isA ID

General information can be included at the top of the file, but must be preceded with a '#'.
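For illustration only, a file in this format might look like the following. All values here are invented, a column header row is shown for clarity, and the columns are tab-separated (displayed with extra spacing for readability):

```
# my-namespace-name
# General information about the file goes in '#' lines at the top.
ID      ALTIDS   LABEL                SYNONYM                    DESCRIPTION              TYPE   SPECIES   XREF          OBSOLETE   PARENTS   CHILDREN
MY0001           example parent term  example term|sample term   An invented example.     C                                                    MY0002
MY0002  MY0010   example child term                              Another invented value.  C      9606      CHEBI:00000              MY0001
```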
Examples of namespace data in this format can be found for the following namespaces:
- SFAM - Selventa protein families
- SCHEM - Selventa legacy chemical names
- SCOMP - Selventa named complexes
- SDIS - Selventa legacy diseases
To add a namespace dataset in this format to your resource-generator pipeline, the following steps are required:
- Add to configuration.py:
- initialize the data object (NOTE - the data object is expected to be named using the prefix for your namespace, followed by "_data"):
my_data = StandardCustomData(name='my-namespace-name', prefix='my')
- configure the dataset by adding it to baseline_data. baseline_data is an ordered dictionary containing information for all of the data files used by gp_baseline.py; it maps data file names to a tuple containing (1) the file location, (2) the file parser (in parsers.py), and (3) the data object that stores the parsed data.
baseline_data['my_file_name'] = ('file_location', parsers.NamespaceParser, my_data)
- Create header templates for .belns and .beleq files
- The templates are optional for running the pipeline, but without them the header information will need to be added manually to your .belns and .beleq files before they can be used with current versions of the framework. A rough sketch of the header layout follows the file names below.
- Add to templates and name as follows:
- my-namespace-name.belns
- my-namespace-name.beleq
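As a rough sketch only, assuming the header follows the published BEL .belns file layout (the field values below are placeholders, not this repository's actual templates), a my-namespace-name.belns header template might look like:

```
[Namespace]
Keyword=MY
NameString=My Namespace Name
DomainString=Other
SpeciesString=all
DescriptionString=Short description of the namespace
VersionString=20140228
CreatedDateTime=2014-02-28T12:00:00

[Author]
NameString=Your Organization
ContactInfoString=support@example.org

[Citation]
NameString=Source of the namespace values

[Processing]
CaseSensitiveFlag=yes
DelimiterString=|
CacheableFlag=yes
```

The .beleq header template is similar, typically with an [Equivalence] section in place of [Namespace].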
- To also generate ID-based namespace files, add to configuration.py after your dataset initialization:
my_data.ids = True
- Create templates my-namespace-name-ids.belns and my-namespace-name-ids.beleq
Note - the default is to generate .beleq files with a new UUID for each value in your namespace.
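In other words, the [Values] section of a generated .beleq file pairs each namespace value with a UUID; for example (invented values and UUIDs):

```
[Values]
example parent term|8c5386e0-6e33-4e43-9d1a-2f5a7c9b1e44
example child term|1d2e3f40-5a6b-4c7d-8e9f-0a1b2c3d4e5f
```

Values that are equivalenced via equiv.py share a UUID with the matching value in the xref namespace rather than receiving a new one.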
- Confirm that the root namespace is included in the equiv_root_data list (add it if necessary). Any namespace that you are equivalencing to must generate its .beleq files before your namespace does.
- Add to the equiv() function in equiv.py:
elif str(d) == 'my': resolve_xrefs(d, 'chebi', 'chebi_id_eq', verbose)
(here, 'chebi' is the prefix of the xref data, and chebi_id_eq is an equivalence dictionary created within the equiv module.)
More generally, the high-level process for adding a dataset that is not in the 'standard' format is:
- Configure the dataset in configuration.py
- Write a parser for the data in parsers.py
- Define the dictionary format for the parsed data object in parsed.py
- Create a data object class in datasets.py inheriting from NamespaceDataSet (or format your parsed dictionary to use the StandardCustomData format); a rough sketch follows this list
- Add handling of your dataset in equiv.py, if desired
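The actual interface to implement is defined by NamespaceDataSet in datasets.py; the class below is only a rough, hypothetical sketch (the method and attribute names are assumptions for illustration, not the repository's API) of how a custom data object can wrap a parsed dictionary:

```python
# Hypothetical sketch only: the real class should inherit from
# NamespaceDataSet in datasets.py; names here are illustrative assumptions.

class MyNamespaceData(object):
    """Wraps a parsed dictionary of namespace terms keyed by term ID."""

    def __init__(self, dictionary, name='my-namespace-name', prefix='my'):
        self._dict = dictionary   # e.g. {term_id: {'label': ..., 'encoding': ..., 'synonyms': [...]}}
        self._name = name         # used to name the .belns/.beleq output files
        self._prefix = prefix     # short lowercase prefix, e.g. 'my'

    def get_values(self):
        """Yield each term ID (one namespace value per ID)."""
        for term_id in self._dict:
            yield term_id

    def get_label(self, term_id):
        """Return the preferred label for a term."""
        return self._dict[term_id].get('label')

    def get_encoding(self, term_id):
        """Return the BEL encoding for a term (e.g. 'O', 'C')."""
        return self._dict[term_id].get('encoding')

    def __str__(self):
        # The pipeline identifies datasets by prefix (compare str(d) == 'my' in the equiv.py example above).
        return self._prefix
```

If your parsed dictionary can be shaped to match what StandardCustomData expects, using that class directly (as in the configuration example above) avoids writing a new data object.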