Adding new Namespace datasets

File format

A 'standard' format is available for adding new namespaces to the resource-generation pipeline. The format is a tab-delimited text file with the following columns:

ID - a unique identifier for the namespace value (required)
ALTIDS - any alternative ids
LABEL - the preferred label for the namespace value (required)
SYNONYM - alternative labels, pipe-delimited
DESCRIPTION - documentation text
TYPE - the encoding for the namespace value (e.g., 'O' for pathology, 'C' for complex) (required)
SPECIES - the species associated with the namespace value, if any
XREF - equivalent values from other BEL namespaces, pipe-delimited. Must include a recognized prefix to be used for generating equivalences.
OBSOLETE - flag obsolete values with '1'
PARENTS - any parent terms, valid for ID isA PARENT
CHILDREN - any child terms, valid for CHILD isA ID

General information can be included at the top of the file, but must be preceded by a '#'.
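
As a rough illustration of the format, the sketch below reads such a file with Python's standard csv module. It is not the pipeline's own parser. It assumes that the columns appear in the order listed above, that trailing empty columns may be omitted, and that a header row starting with "ID" may or may not be present. The function name read_namespace_rows and the sample values are invented for this example. ALTIDS, PARENTS and CHILDREN are left as raw strings because their delimiter is not stated here.

```python
import csv
import io

# Column order as listed above (assumed to match the order in the file).
COLUMNS = ["ID", "ALTIDS", "LABEL", "SYNONYM", "DESCRIPTION", "TYPE",
           "SPECIES", "XREF", "OBSOLETE", "PARENTS", "CHILDREN"]
REQUIRED = ("ID", "LABEL", "TYPE")
PIPE_DELIMITED = ("SYNONYM", "XREF")


def read_namespace_rows(handle):
    """Yield one dict per namespace value, skipping '#' lines and blank lines."""
    data_lines = (line for line in handle
                  if line.strip() and not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields[0] == "ID":  # tolerate an optional header row (assumption)
            continue
        # Pad short rows so that trailing empty columns may be omitted.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        missing = [col for col in REQUIRED if not row[col]]
        if missing:
            raise ValueError(
                f"row {row['ID']!r} is missing required column(s): {', '.join(missing)}")
        for col in PIPE_DELIMITED:
            row[col] = [value for value in row[col].split("|") if value]
        row["OBSOLETE"] = row["OBSOLETE"] == "1"
        yield row


# Hypothetical sample data, invented for illustration only.
SAMPLE = (
    "# Example namespace file (made-up values)\n"
    "MY:0001\t\tExample complex\texample cplx|ex complex\tA made-up entry\tC\t9606\tEX:123\t\t\t\n"
    "MY:0002\t\tOld example\t\t\tO\t\t\t1\t\t\n"
)

rows = list(read_namespace_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row["ID"], row["LABEL"], row["SYNONYM"], row["OBSOLETE"])
```

Running a check like this before adding a file to the pipeline can catch rows that lack one of the three required columns.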

Example data

Examples of namespace data in this format can be found for the following namespaces:

SFAM - Selventa protein families
SCHEM - Selventa legacy chemical names
SCOMP - Selventa named complexes
SDIS - Selventa legacy diseases

Integration into resource-generator pipeline

To add a namespace dataset in this format to your resource-generator pipeline, the following steps are required:

Add to configuration.py:
Initialize the data object. NOTE - the data object is expected to be named using the prefix for your namespace, followed by "_data":
    my_data = StandardCustomData(name='my-namespace-name', prefix='my', domain=['my-namespace-domain'])
Configure the dataset by adding it to baseline_data. baseline_data is an ordered dictionary containing information for all of the data files used by gp_baseline.py. It maps each data file name to a tuple containing [1] the file location, [2] the file parser (in parsers.py), and [3] the data object that stores the parsed data:
    baseline_data['my_file_name'] = ('file_location', parsers.NamespaceParser, my_data)
Create header templates for .belns and .beleq files
The header templates are optional for running the pipeline, but without them the headers will need to be added manually to your .belns and .beleq files before those files can be used with current versions of the framework.
Add to templates and name as follows:
- my-namespace-name.belns
- my-namespace-name.beleq

Optional - Add .belns and .beleq files for IDs as well as labels

Add to configuration.py, after your dataset initialization:
    my_data.ids = True
Create templates my-namespace-name-ids.belns and my-namespace-name-ids.beleq
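
The template naming convention can be summarized in a few lines of Python. This is only a sketch: expected_templates and missing_templates are hypothetical helpers, not part of the resource-generator, and the location of the template files is passed in by the caller rather than assumed.

```python
from pathlib import Path


def expected_templates(namespace_name, ids=False):
    """List the header template file names for a namespace.

    With ids=True, the '-ids' variants are listed as well.
    """
    stems = [namespace_name]
    if ids:
        stems.append(namespace_name + "-ids")
    return [stem + ext for stem in stems for ext in (".belns", ".beleq")]


def missing_templates(template_dir, namespace_name, ids=False):
    """Return the expected template files that are not present in template_dir."""
    template_dir = Path(template_dir)
    return [name for name in expected_templates(namespace_name, ids)
            if not (template_dir / name).is_file()]


print(expected_templates("my-namespace-name", ids=True))
# ['my-namespace-name.belns', 'my-namespace-name.beleq',
#  'my-namespace-name-ids.belns', 'my-namespace-name-ids.beleq']
```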

Optional - Create equivalences to an existing BEL namespace using XREFS in your dataset

Note - the default is to generate .beleq files with a new UUID for each value in your namespace.

Confirm that the root namespace is included in the equiv_root_data list (add if necessary). Any namespace that you are equivalencing to must generate .beleq files prior to your namespace.
Add to the equiv function in equiv.py:
    elif str(d) == 'my':
        resolve_xrefs(d, 'chebi', 'chebi_id_eq', verbose)
Here, 'chebi' is the prefix of the xref data, and chebi_id_eq is an equivalence dictionary created within the equiv module.
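
To make the idea concrete, the sketch below shows one way xref-based equivalencing could work. A value whose XREF points to an entry in the root namespace's equivalence dictionary reuses that entry's UUID, and every other value gets a new UUID, matching the default described above. This is not the code of resolve_xrefs. The function assign_equivalence_ids, the 'PREFIX:ID' xref layout, and the sample UUID are assumptions made for illustration.

```python
import uuid


def assign_equivalence_ids(values, root_prefix, root_equiv):
    """Map each namespace value to an equivalence UUID.

    values      -- dict: label -> list of xref strings, assumed to look like 'CHEBI:15377'
    root_prefix -- prefix of the namespace being equivalenced to, e.g. 'chebi'
    root_equiv  -- dict: root namespace ID -> UUID string (generated beforehand)
    """
    result = {}
    for label, xrefs in values.items():
        matched = None
        for xref in xrefs:
            prefix, _, ref_id = xref.partition(":")
            if prefix.lower() == root_prefix and ref_id in root_equiv:
                matched = root_equiv[ref_id]
                break
        # Fall back to a fresh UUID when no usable xref is found.
        result[label] = matched if matched is not None else str(uuid.uuid4())
    return result


# Made-up equivalence dictionary for the root namespace.
chebi_id_eq = {"15377": "11111111-2222-3333-4444-555555555555"}

# Made-up values from a custom namespace: label -> xrefs.
my_values = {
    "water": ["CHEBI:15377"],
    "mystery compound": [],
}

equiv = assign_equivalence_ids(my_values, "chebi", chebi_id_eq)
for label, equivalence_id in equiv.items():
    print(label, equivalence_id)
```

The lookup only succeeds if the root dictionary already exists, which is why the root namespace has to generate its .beleq files before yours.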

If you need to add a namespace dataset in your own format

(This is a high-level overview)

configure in configuration.py
write parser for parsers.py
write dictionary format for parsed data object in parsed.py
create data object class in datasets.py inheriting from NamespaceDataSet (or format your parsed dictionary to use the StandardCustomData format)
add (if desired) handling of your dataset in equiv.py
