Implement download tracker and pipeline execution change #24

Merged
merged 12 commits into from
Aug 6, 2020
34 changes: 27 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,25 +4,45 @@

[PubTator](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/) and its 2.0 version ([PubTator Central](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTatorCentral/)) use text mining to tag PubMed abstracts/articles with standardized concepts. This repository retrieves and processes PubTator annotations for use in [`greenelab/snorkeling`](https://github.com/greenelab/snorkeling) and elsewhere.

## Environment
# Get Started

Install the [conda](https://conda.io) environment specified in [`environment.yml`](environment.yml) by running:
## Set-up Environment

### Conda

1. Install [conda](https://conda.io).
2. Create the pubtator environment by running:

```sh
conda create --name pubtator python=3.8
```

3. Activate it with `conda activate pubtator`.
4. Install packages via pip by running the following:

```sh
pip install -r requirements.txt
```

### Pip

1. Make sure you have Python **3.8** installed.
2. Install packages by running the following:

```sh
conda env create --file environment.yml
pip install -r requirements.txt
```

Activate with `conda activate pubtator`.

## Execution

To download and extract Pubtator Central's data (default), run the following:
To start processing Pubtator/Pubtator Central, run the following command:

```sh
bash execute.sh {email address here}
python execute.py --config config_files/pubtator_central_config.json
```

If the original Pubtator is desired, run the above command with the following flag: `--pubtator`. You do not need to provide your email address when running the first version of Pubtator.
If the original Pubtator is desired, replace `pubtator_central_config.json` with `pubtator_config.json`. The JSON file contains all the parameters needed to run the pipeline. More information about the JSON file can be found [here](config_files).
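
The config-driven flow can be pictured roughly as follows. This is a minimal sketch, not the actual `execute.py`: `run_pipeline` and the `handlers` registry are hypothetical names, and the URL and file names in the sample are placeholders. The only behaviors taken from the documentation are that steps run in file order, that `skip` bypasses a step, and that steps the runner does not know about are ignored.

```python
import json


def run_pipeline(config, handlers):
    """Run each configured step in file order, honoring the `skip` flag.

    `handlers` maps a step name to a callable (hypothetical design).
    Returns the names of the steps that actually ran.
    """
    executed = []
    for step_name, params in config.items():  # dicts keep the JSON key order
        handler = handlers.get(step_name)
        if handler is None:
            continue  # unknown steps are ignored
        if params.get("skip", False):
            continue  # the user asked to bypass this step
        # Pass every parameter except `skip` to the step.
        kwargs = {key: value for key, value in params.items() if key != "skip"}
        handler(**kwargs)
        executed.append(step_name)
    return executed


config = json.loads("""
{
  "repository_download": {"url": "ftp://example.org/file.gz", "download_folder": "download", "skip": true},
  "extract_tags": {"input": "in.xml", "output": "out.tsv", "skip": false}
}
""")

calls = []
handlers = {
    "repository_download": lambda url, download_folder: calls.append("download"),
    "extract_tags": lambda input, output: calls.append((input, output)),
}
executed = run_pipeline(config, handlers)
```

Keeping the step order in the file itself means resuming a run only requires flipping `skip` flags, with no extra command-line options.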

## License

96 changes: 96 additions & 0 deletions config_files/CONFIG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
# Configuration Files Explained

This file explains the pipeline steps and parameters needed for each step.
**Note: any added parameter or step will be ignored unless `execute.py` is manually changed.**

## Repository Download

This is the first step of the Pubtator pipeline.
This step downloads the Pubtator/Pubtator Central annotation file from NCBI's FTP server.

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| url | The URL to download the file from | a string with a URL path |
| download_folder | The folder to hold the downloaded file | a string name for the folder |
| skip | Tell execute.py to ignore this step and continue | true or false |
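
As a rough illustration of how `url` and `download_folder` relate to the `documents` path used by the next step, here is a minimal sketch using only the standard library. The function names `target_path` and `download` are hypothetical and the repository's own download code may work differently.

```python
import os
import urllib.request
from urllib.parse import urlparse


def target_path(url, download_folder):
    """Return where the file named in `url` would be saved."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(download_folder, filename)


def download(url, download_folder):
    """Fetch `url` into `download_folder`, creating the folder if needed."""
    os.makedirs(download_folder, exist_ok=True)
    destination = target_path(url, download_folder)
    urllib.request.urlretrieve(url, destination)  # also handles ftp:// URLs
    return destination


# Only the path is computed here; calling download() would hit the network.
path = target_path(
    "ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz",
    "download",
)
```

With the default Pubtator Central config, the computed path corresponds to the `documents` value of the `pubtator_to_xml` step.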

## Pubtator to XML

This is the second step of the Pubtator pipeline.
This step converts the Pubtator/Pubtator Central annotation file into XML format.
**Note: This step may take a while to complete.**

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| documents | The file path pointing to the downloaded file from the previous step. | a string for the file path |
| output | The file path to save the xml file. Make sure to keep the xz extension. | a string for the file path |
| skip | Tell execute.py to ignore this step and continue | true or false |
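
To give a feel for what this conversion involves, here is a minimal sketch that turns PubTator-format lines into a small XML tree. It assumes the usual PubTator layout (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines). The function name `pubtator_to_xml`, the `passage`/`annotation` element layout, and the sample lines are illustrative assumptions and are simpler than the BioC XML the repository writes.

```python
import xml.etree.ElementTree as ET


def pubtator_to_xml(lines):
    """Convert PubTator-format lines into a simplified XML collection."""
    collection = ET.Element("collection")
    documents = {}

    def get_document(pmid):
        # Create the <document> element the first time a PMID is seen.
        if pmid not in documents:
            doc = ET.SubElement(collection, "document")
            ET.SubElement(doc, "id").text = pmid
            documents[pmid] = doc
        return documents[pmid]

    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            passage = ET.SubElement(get_document(parts[0]), "passage")
            passage.set("type", "title" if parts[1] == "t" else "abstract")
            passage.text = parts[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            pmid, start, end, mention, entity_type = fields[:5]
            tag = ET.SubElement(get_document(pmid), "annotation")
            tag.set("type", entity_type)
            tag.set("offset", start)
            tag.set("end", end)
            tag.text = mention
    return collection


# Made-up sample lines, for illustration only.
sample = [
    "1|t|A title",
    "1|a|An abstract about aspirin.",
    "1\t26\t33\taspirin\tChemical\tMESH:D001241",
]
root = pubtator_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
```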

## Extract Tags

This is the third step of the Pubtator pipeline.
This step extracts Pubtator/Pubtator Central's annotations from the xml file.
It outputs a tsv file that contains all extracted annotations.
**Note: This step may take a while to complete.**

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| input | The file path pointing to the xml file from the previous step. Make sure to keep the xz extension. | a string for the file path |
| output | The file path to save the tsv file. Make sure to keep the xz extension. | a string for the file path |
| skip | Tell execute.py to ignore this step and continue | true or false |
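
The "keep the xz extension" advice suggests that compression is tied to the file name. One way that could work is sketched below with the standard `lzma` module. The helper name `open_text` and the column names in the sample are assumptions, and the repository may handle compression differently.

```python
import lzma
import os
import tempfile


def open_text(path, mode="rt"):
    """Open `path` as UTF-8 text, using xz compression when it ends in .xz."""
    if path.endswith(".xz"):
        return lzma.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tags.tsv.xz")
    with open_text(path, "wt") as handle:
        handle.write("pubmed_id\ttype\n")  # hypothetical column names
    with open_text(path) as handle:
        first_line = handle.readline()
    with open(path, "rb") as handle:
        magic = handle.read(6)  # xz files start with a fixed 6-byte signature
```

A helper like this lets the same step read the uncompressed sample files used by the tests config and the compressed files used by the full configs.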

## Hetnet ID Extractor

This is the fourth step of the Pubtator pipeline.
This step filters the extracted annotations, keeping only tags found in [Hetionet's Database](https://het.io/).

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| input | The file path pointing to the tsv file from the previous step. Make sure to keep the xz extension. | a string for the file path |
| output | The file path to save the tsv file. Make sure to keep the xz extension. | a string for the file path |
| skip | Tell execute.py to ignore this step and continue | true or false |
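
Conceptually, the filtering amounts to keeping rows whose identifier belongs to an allowed set. The sketch below illustrates that idea only: `filter_tags`, the `identifier` column name, and the `ID_*` values are hypothetical, and how the repository obtains Hetionet's identifiers is not shown here.

```python
import csv
import io


def filter_tags(rows, allowed_ids, id_column="identifier"):
    """Yield only the rows whose identifier is in `allowed_ids`."""
    for row in rows:
        if row[id_column] in allowed_ids:
            yield row


# Made-up tags table, for illustration only.
tags_tsv = (
    "pubmed_id\ttype\tidentifier\n"
    "1\tChemical\tID_A\n"
    "1\tGene\tID_B\n"
    "2\tDisease\tID_C\n"
)
reader = csv.DictReader(io.StringIO(tags_tsv), delimiter="\t")
kept = list(filter_tags(reader, allowed_ids={"ID_A", "ID_C"}))
```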

## Map PMIDS to PMCIDS

This is the fifth step of the Pubtator pipeline.
This step queries NCBI's PMID-to-PMCID mapper to retrieve PMCIDs.
**Note: Downloading full text requires PMCIDs. PMIDs will not work.**

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| input | The file path pointing to the tsv file from the extract tags step. Make sure to keep the xz extension. | a string for the file path |
| output | The file path to save the tsv file. | a string for the file path |
| debug | This is a flag for debugging purposes. Feel free to ignore and leave as false. | true or false |
| skip | Tell execute.py to ignore this step and continue | true or false |
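
A tags file usually lists the same PMID once per annotation, so a mapping step like this one would plausibly deduplicate IDs and query them in groups. The sketch below shows that preparation only. `batched_unique_ids` and the batch size are assumptions, and the actual query code is not reproduced here.

```python
def batched_unique_ids(ids, batch_size):
    """Deduplicate `ids` (keeping first-seen order) and split into batches."""
    unique = list(dict.fromkeys(ids))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]


batches = batched_unique_ids(["11", "12", "11", "13", "12", "14", "15"], batch_size=2)
```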

## Download Full Text

This is the sixth step of the Pubtator pipeline.
This step queries Pubtator Central's API and downloads annotated full text if text is present.

Parameters for this step:
| Param | Description | Accepted Values |
| --- | --- | --- |
| input | The file path pointing to the tsv file from the previous step. | a string for the file path |
| output | The file path to save the xml file. | a string for the file path |
| temp_dir | The folder to hold temporary batch files for this step of the pipeline | a string for the folder path |
| log_file | A log file that keeps track of the IDs that have already been queried. It is used to monitor progress in case the process is interrupted. Make sure it has the tsv extension. | a file path for the file |
| skip | Tell execute.py to ignore this step and continue | true or false |
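
The role of `log_file` can be illustrated with a small resumable batch loop. This is a sketch under assumptions: `load_done_ids`, `process_batches`, the one-ID-per-line log layout, and the `PMC1`-style placeholder IDs are all hypothetical, and `fetch` stands in for whatever actually queries the API.

```python
import csv
import os
import tempfile


def load_done_ids(log_file):
    """Return the IDs already recorded in the log (empty if no log exists)."""
    if not os.path.exists(log_file):
        return set()
    with open(log_file, newline="") as handle:
        return {row[0] for row in csv.reader(handle, delimiter="\t") if row}


def process_batches(ids, batch_size, log_file, fetch):
    """Call `fetch` on each batch of not-yet-logged IDs, logging as we go."""
    done = load_done_ids(log_file)
    pending = [doc_id for doc_id in ids if doc_id not in done]
    with open(log_file, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            fetch(batch)
            writer.writerows([doc_id] for doc_id in batch)
            handle.flush()  # persist progress before the next batch


fetched = []
with tempfile.TemporaryDirectory() as folder:
    log_file = os.path.join(folder, "batch_log.tsv")
    process_batches(["PMC1", "PMC2", "PMC3"], 2, log_file, fetched.append)
    # A second run with one new ID only fetches what is not in the log yet.
    process_batches(["PMC1", "PMC2", "PMC3", "PMC4"], 2, log_file, fetched.append)
```

Logging after each batch rather than at the end means an interrupted run loses at most one batch of work.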

## Extract Full Text Tags

This is the seventh step of the Pubtator pipeline.
This step extracts tags from full text documents.
Please refer to [Extract Tags Section](#extract-tags) for parameter details.

## Hetnet ID Extractor Full Text

This is the last step of the Pubtator pipeline.
This step filters the full-text tags, keeping only Hetionet tags.
Please refer to [Hetnet ID Extractor Section](#hetnet-id-extractor) for parameter details.

37 changes: 37 additions & 0 deletions config_files/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Configuration Files
Member

Just so I understand the purpose of these config files - you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Just wondering if you need to document what each of the fields mean somewhere. Most of them are fairly obvious from the name, so I think it's probably not necessary but if you expect users to be changing things by hand a lot I might feel differently.

Contributor Author

> you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Correct. The idea is to provide the fields already, so a user can change directories as needed.

> Just wondering if you need to document what each of the fields mean somewhere.

Good idea. I'll add documentation to this PR.


## File Description

| File | Description |
| --- | --- |
| [pubtator central config](pubtator_central_config.json) | This is a configuration file for parsing Pubtator Central. |
| [pubtator config](pubtator_config.json) | This is a configuration file for parsing Pubtator (older version of Pubtator Central). |
| [tests config](tests_config.json) | This is a configuration file for testing the pubtator system. Feel free to ignore this file. |

## Usage

Each configuration file is in JSON format and contains the parameters for each step of the pubtator pipeline.
Steps are listed in order of execution: the first step appears at the top and each subsequent step follows it.
Every step can be skipped, which allows one to resume the pipeline at any step.
Please refer to [CONFIG.md](CONFIG.md) for more details on each pipeline step and its parameters.

Example config file:
```json
{
"pipeline step 1":{
"param1":"param1_value",
"param2":"param2_value",
"skip":false
},
"pipeline step 2":{
"param1":"param1_value",
"param2":"param2_value",
"skip":false
},
"pipeline step 3":{
"param1":"param1_value",
"param2":"param2_value",
"skip":false
}
}
```
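
Because each step's `input` is expected to match the `output` of an earlier step, a small consistency check can catch path typos before a long run. This is a hedged sketch: `find_dangling_inputs` is a hypothetical helper that is not part of the repository, and the step and file names below are placeholders.

```python
def find_dangling_inputs(config):
    """Return (step, path) pairs whose `input` no earlier step produces."""
    produced = set()
    dangling = []
    for step_name, params in config.items():
        path = params.get("input")
        if path is not None and path not in produced:
            dangling.append((step_name, path))
        if "output" in params:
            produced.add(params["output"])
    return dangling


config = {
    "step_a": {"documents": "raw.gz", "output": "docs.xml.xz", "skip": False},
    "step_b": {"input": "docs.xml.xz", "output": "tags.tsv.xz", "skip": False},
    # The stray leading space below is the kind of typo this check reports.
    "step_c": {"input": " tags.tsv.xz", "output": "final.tsv.xz", "skip": False},
}
dangling = find_dangling_inputs(config)
```

A reported path is not always an error, since a user may skip earlier steps and supply an existing file, so the result is best treated as a warning.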
54 changes: 54 additions & 0 deletions config_files/pubtator_central_config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
{
"repository_download":{
"url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz",
"download_folder":"download",
"skip":false
},

"pubtator_to_xml": {
"documents":"download/bioconcepts2pubtatorcentral.offset.gz",
"output":"data/pubtator-central-docs.xml.xz",
"skip":false
},

"extract_tags":{
"input":"data/pubtator-central-docs.xml.xz",
"output":"data/pubtator-central-tags.tsv.xz",
"skip":false
},

"hetnet_id_extractor":{
"input":"data/pubtator-central-tags.tsv.xz",
"output":"data/pubtator-central-hetnet-tags.tsv.xz",
"skip":false
},

"map_pmids_to_pmcids":{
"input":"data/pubtator-central-tags.tsv.xz",
"output":"data/pubtator-pmids-to-pmcids.tsv",
"debug":false,
"skip":false
},

"download_full_text":{
"input":"data/pubtator-pmids-to-pmcids.tsv",
"document_batch":100,
"output":"data/pubtator-central-full-text.xml",
"temp_dir":"data/temp",
"log_file":"batch_log.tsv",
"skip":false
},

"extract_full_text_tags":{
"input":"data/pubtator-central-full-text.xml",
"output":"data/pubtator-central-full-text-tags.tsv.xz",
"skip":false
},

"hetnet_id_extractor_full_text":{
"input":"data/pubtator-central-full-text-tags.tsv.xz",
"output":"data/pubtator-central-full-hetnet-tags.tsv.xz",
"skip":false
}

}
25 changes: 25 additions & 0 deletions config_files/pubtator_config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
{
"repository_download":{
"url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/bioconcepts2pubtator_offsets.gz",
"download_folder":"download",
"skip":false
},

"pubtator_to_xml": {
"documents":"download/bioconcepts2pubtator_offsets.gz",
"output":"data/pubtator-docs.xml.xz",
"skip":false
},

"extract_tags":{
"input":"data/pubtator-docs.xml.xz",
"output":"data/pubtator-tags.tsv.xz",
"skip":false
},

"hetnet_id_extractor":{
"input":"data/pubtator-tags.tsv.xz",
"output":"data/pubtator-hetnet-tags.tsv.xz",
"skip":false
}
}
54 changes: 54 additions & 0 deletions config_files/tests_config.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
{
"repository_download":{
"url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz",
"download_folder":"download",
"skip":true
},

"pubtator_to_xml": {
"documents":"data/example/1-sample-annotations.txt",
"output":"data/example/2-sample-docs.xml",
"skip":false
},

"extract_tags":{
"input":"data/example/2-sample-docs.xml",
"output":"data/example/3-sample-tags.tsv",
"skip":false
},

"hetnet_id_extractor":{
"input":"data/example/3-sample-tags.tsv",
"output":"data/example/4-hetnet-tags.tsv",
"skip":false
},

"map_pmids_to_pmcids":{
"input":"data/example/3-sample-tags.tsv",
"output":"data/example/5-sample-pmids-to-pmcids.tsv",
"debug":true,
"skip":false
},

"download_full_text":{
"input":"data/example/5-sample-pmids-to-pmcids.tsv",
"document_batch":100,
"output":"data/example/6-sample-full-text.xml",
"temp_dir":"data/temp",
"log_file":"batch_log.tsv",
"skip":false
},

"extract_full_text_tags":{
"input":"data/example/6-sample-full-text.xml",
"output":"data/example/7-sample-full-text-tags.tsv",
"skip":false
},

"hetnet_id_extractor_full_text":{
"input":"data/example/7-sample-full-text-tags.tsv",
"output":"data/example/8-hetnet-full-text-tags.tsv",
"skip":false
}

}
2 changes: 1 addition & 1 deletion data/example/2-sample-docs.xml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
<collection>
<source>Pubtator</source>
<date>2020/03/02</date>
<date>2020/08/03</date>
<key>Pubtator.key</key>
<document>
<id>1560033</id>