greenelab · danich1 · Aug 6, 2020 · Aug 3, 2020 · Aug 3, 2020 · Aug 3, 2020
diff --git a/README.md b/README.md
@@ -4,25 +4,45 @@
 
 [PubTator](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/) and its 2.0 version ([PubTator Central](https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTatorCentral/)) use text mining to tag PubMed abstracts/articles with standardized concepts. This repository retrieves and processes PubTator annotations for use in [`greenelab/snorkeling`](https://github.com/greenelab/snorkeling) and elsewhere.
 
-## Environment
+# Get Started
 
-Install the [conda](https://conda.io) environment specified in [`environment.yml`](environment.yml) by running:
+## Set Up Environment
+
+### Conda
+
+1. Install [conda](https://conda.io).
+2. Create the pubtator environment by running:
+
+```sh
+conda create --name pubtator python=3.8
+```
+
+3. Activate the environment with `conda activate pubtator`.
+4. Install packages via pip by running the following:
+
+```sh
+pip install -r requirements.txt
+```
+
+### Pip
+
+1. Make sure you have Python version **3.8** installed.
+2. Install packages by running the following:
 
 ```sh
-conda env create --file environment.yml
+pip install -r requirements.txt
 ```
 
-Activate with `conda activate pubtator`.
 
 ## Execution
 
-To download and extract Pubator Central's data (default) run the following:
+To start processing Pubtator/Pubtator Central, run the following command:
 
 ```sh
-bash execute.sh {email address here}
+python execute.py --config config_files/pubtator_central_config.json
 ```
 
-If the original Pubtator is desired run the above command with the following flag: --pubtator. You do not need to provide your email address when running the first version of Pubtator.
+To process the original Pubtator instead, replace `pubtator_central_config.json` with `pubtator_config.json`. The JSON file contains all the parameters needed for a run. More information about the configuration files can be found [here](config_files).
 
 ## License
 

diff --git a/config_files/CONFIG.md b/config_files/CONFIG.md
@@ -0,0 +1,96 @@
+# Configuration Files Explained
+
+This file explains the pipeline steps and parameters needed for each step.
+**Note: any added parameter or step will be ignored unless `execute.py` is manually changed.**
+
+## Repository Download
+
+This is the first step of the Pubtator pipeline.
+This step downloads the Pubtator/Pubtator Central annotation file from NCBI's FTP server.
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| url | The URL to download the file from. | a string with a url path |
+| download_folder | The folder to hold the downloaded file. | a string name for the folder |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Pubtator to XML
+
+This is the second step of the Pubtator pipeline.
+This step converts Pubtator/Pubtator Central's annotation file into xml format.
+**Note: This step may take a while to complete.**
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| documents | The file path pointing to the downloaded file from the previous step. | a string for the file path |
+| output | The file path to save the xml file. Make sure to keep the xz extension. | a string for the file path |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Extract Tags
+
+This is the third step of the Pubtator pipeline.
+This step extracts Pubtator/Pubtator Central's annotations from the xml file.
+It outputs a tsv file that contains all extracted annotations.
+**Note: This step may take a while to complete.**
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| input | The file path pointing to the xml file from the previous step. Make sure to keep the xz extension. | a string for the file path |
+| output | The file path to save the tsv file. Make sure to keep the xz extension. | a string for the file path |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Hetnet ID Extractor
+
+This is the fourth step of the Pubtator pipeline.
+This step filters the extracted annotations, keeping only tags that are present in [Hetionet's Database](https://het.io/).
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| input | The file path pointing to the tsv file from the previous step. Make sure to keep the xz extension. | a string for the file path |
+| output | The file path to save the tsv file. Make sure to keep the xz extension. | a string for the file path |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Map PMIDS to PMCIDS
+
+This is the fifth step of the Pubtator pipeline.
+This step queries NCBI's PMID-to-PMCID mapper to retrieve PMCIDs.
+**Note: Downloading full text requires PMCIDs; PMIDs will not work.**
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| input | The file path pointing to the tsv file from the Extract Tags step. Make sure to keep the xz extension. | a string for the file path |
+| output | The file path to save the tsv file. | a string for the file path |
+| debug | This is a flag for debugging purposes. Feel free to ignore and leave as false. | true or false |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Download Full Text
+
+This is the sixth step of the Pubtator pipeline.
+This step queries Pubtator Central's API and downloads the annotated full text when it is available.
+
+This step takes the following parameters:
+| Param | Description | Accepted Values |
+| --- | --- | --- |
+| input | The file path pointing to the tsv file from the previous step. | a string for the file path |
+| output | The file path to save the xml file. | a string for the file path |
+| temp_dir | The folder to hold temporary batch files for this step of the pipeline. | a string for the folder path |
+| log_file | A log file that keeps track of the IDs that have already been queried. It is used to monitor progress in case the process is interrupted. Make sure it has the tsv extension. | a string for the file path |
+| skip | Tell execute.py to ignore this step and continue. | true or false |
+
+## Extract Full Text Tags
+
+This is the seventh step of the Pubtator pipeline.
+This step extracts tags from full text documents.
+Please refer to [Extract Tags Section](#extract-tags) for parameter details.
+
+## Hetnet ID Extractor Full Text
+
+This is the last step of the Pubtator pipeline.
+This step filters the full text tags, keeping only those present in Hetionet.
+Please refer to [Hetnet ID Extractor Section](#hetnet-id-extractor) for parameter details.
+
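The `log_file` and `temp_dir` parameters described for the Download Full Text step suggest a resumable batch loop. The following is only a minimal sketch of that idea, not the code in `execute.py`: the function names `pending_batches` and `mark_done`, the one-ID-per-row log layout, and the sample IDs are all assumptions made for illustration.

```python
import csv
import os
import tempfile
from pathlib import Path


def pending_batches(ids, log_path, batch_size=100):
    """Yield batches of IDs that are not yet recorded in the log file."""
    done = set()
    log = Path(log_path)
    if log.exists():
        with log.open(newline="") as handle:
            # Assumed log layout: one already-queried ID per tab-separated row.
            done = {row[0] for row in csv.reader(handle, delimiter="\t") if row}
    todo = [doc_id for doc_id in ids if doc_id not in done]
    for start in range(0, len(todo), batch_size):
        yield todo[start:start + batch_size]


def mark_done(batch, log_path):
    """Append one finished batch to the log, one ID per row."""
    with open(log_path, "a", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows([doc_id] for doc_id in batch)


# Simulate an interrupted run: finish the first batch, then start over.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "batch_log.tsv")
    ids = ["PMC1", "PMC2", "PMC3"]  # hypothetical IDs
    first_pass = list(pending_batches(ids, log_path, batch_size=2))
    mark_done(first_pass[0], log_path)
    second_pass = list(pending_batches(ids, log_path, batch_size=2))

print(first_pass)
print(second_pass)
```

On the second pass only the IDs missing from the log are batched again, which is the behaviour the `log_file` description implies after an interruption.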
diff --git a/config_files/README.md b/config_files/README.md
@@ -0,0 +1,37 @@
+# Configuration Files
+
+## File Description
+
+| File | Description |
+| --- | --- | 
+| [pubtator central config](pubtator_central_config.json) | This is a configuration file for parsing Pubtator Central. |
+| [pubtator config](pubtator_config.json) | This is a configuration file for parsing Pubtator (older version of Pubtator Central). |
+| [tests config](tests_config.json) | This is a configuration file for testing the pubtator system. Feel free to ignore this file. |
+
+## Usage
+
+Each configuration file is in JSON format and contains parameters for each step within the pubtator pipeline.
+Steps are listed in execution order: the first step appears at the top and each subsequent step follows it.
+Every step can be skipped, so the pipeline can be resumed from any step.
+Please refer to [CONFIG.md](CONFIG.md) for more details on each pipeline step and its parameters.
+
+Example config file:
+```json
+{
+  "pipeline step 1":{
+    "param1":"param1_value",
+    "param2":"param2_value",
+    "skip":false
+  },
+  "pipeline step 2":{
+    "param1":"param1_value",
+    "param2":"param2_value",
+    "skip":false
+  },
+  "pipeline step 3":{
+    "param1":"param1_value",
+    "param2":"param2_value",
+    "skip":false
+  }
+}
+```
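The ordering and `skip` behaviour described in the Usage section can be sketched as follows. This is an illustrative reading of the config layout, not the logic in `execute.py`; the helper name `runnable_steps` and the placeholder URL and paths are assumptions.

```python
import json


def runnable_steps(config_text):
    """Return step names, in file order, whose "skip" flag is not true."""
    config = json.loads(config_text)  # dicts keep key order in Python 3.7+
    return [name for name, params in config.items() if not params.get("skip", False)]


# Placeholder config: the first step is skipped, the rest would run in order.
example = """
{
  "repository_download": {"url": "ftp://example.invalid/file.gz", "download_folder": "download", "skip": true},
  "pubtator_to_xml": {"documents": "download/file.gz", "output": "data/docs.xml.xz", "skip": false},
  "extract_tags": {"input": "data/docs.xml.xz", "output": "data/tags.tsv.xz", "skip": false}
}
"""

print(runnable_steps(example))
```

Setting `"skip": true` on the earlier steps is how a run would resume partway through the pipeline under this reading.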
diff --git a/config_files/pubtator_central_config.json b/config_files/pubtator_central_config.json
@@ -0,0 +1,54 @@
+{
+    "repository_download":{
+        "url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz",
+        "download_folder":"download",
+        "skip":false
+    },
+
+    "pubtator_to_xml": {
+        "documents":"download/bioconcepts2pubtatorcentral.offset.gz",
+        "output":"data/pubtator-central-docs.xml.xz",
+        "skip":false
+    },
+
+    "extract_tags":{
+        "input":"data/pubtator-central-docs.xml.xz",
+        "output":"data/pubtator-central-tags.tsv.xz",
+        "skip":false
+    },
+
+    "hetnet_id_extractor":{
+        "input":"data/pubtator-central-tags.tsv.xz",
+        "output":"data/pubtator-central-hetnet-tags.tsv.xz",
+        "skip":false
+    },
+
+    "map_pmids_to_pmcids":{
+        "input":"data/pubtator-central-tags.tsv.xz",
+        "output":"data/pubtator-pmids-to-pmcids.tsv",
+        "debug":false,
+        "skip":false
+    },
+
+    "download_full_text":{
+        "input":"data/pubtator-pmids-to-pmcids.tsv",
+        "document_batch":100,
+        "output":"data/pubtator-central-full-text.xml",
+        "temp_dir":"data/temp",
+        "log_file":"batch_log.tsv",
+        "skip":false
+    },
+
+    "extract_full_text_tags":{
+        "input":"data/pubtator-central-full-text.xml",
+        "output":"data/pubtator-central-full-text-tags.tsv.xz",
+        "skip":false
+    },
+
+    "hetnet_id_extractor_full_text":{
+        "input":"data/pubtator-central-full-text-tags.tsv.xz",
+        "output":"data/pubtator-central-full-hetnet-tags.tsv.xz",
+        "skip":false
+    }
+
+}
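Because each step's `input` is expected to be the `output` of an earlier step, a small consistency check can catch path typos such as stray whitespace before a long run starts. The sketch below is a hypothetical helper, not part of this repository; `unmatched_inputs` and the sample paths are assumptions, and it only inspects the `input` and `output` keys.

```python
def unmatched_inputs(config):
    """List (step, path) pairs whose input is not an earlier step's output."""
    produced = set()
    problems = []
    for step, params in config.items():
        path = params.get("input")
        if path is not None and path not in produced:
            problems.append((step, path))
        output = params.get("output")
        if output is not None:
            produced.add(output)
    return problems


# Sample config with a deliberate leading space in one output path.
config = {
    "pubtator_to_xml": {"documents": "download/sample.gz", "output": "data/docs.xml.xz"},
    "extract_tags": {"input": "data/docs.xml.xz", "output": " data/tags.tsv.xz"},
    "hetnet_id_extractor": {"input": "data/tags.tsv.xz", "output": "data/hetnet-tags.tsv.xz"},
}

print(unmatched_inputs(config))
```

A step that legitimately reads a file produced outside the config would also be reported, so the result is a list of paths to review rather than a list of definite errors.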
diff --git a/config_files/pubtator_config.json b/config_files/pubtator_config.json
@@ -0,0 +1,25 @@
+{
+    "repository_download":{
+        "url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/bioconcepts2pubtator_offsets.gz",
+        "download_folder":"download",
+        "skip":false
+    },
+
+    "pubtator_to_xml": {
+        "documents":"download/bioconcepts2pubtator_offsets.gz",
+        "output":"data/pubtator-docs.xml.xz",
+        "skip":false
+    },
+
+    "extract_tags":{
+        "input":"data/pubtator-docs.xml.xz",
+        "output":"data/pubtator-tags.tsv.xz",
+        "skip":false
+    },
+
+    "hetnet_id_extractor":{
+        "input":"data/pubtator-tags.tsv.xz",
+        "output":"data/pubtator-hetnet-tags.tsv.xz",
+        "skip":false
+    }
+}
diff --git a/config_files/tests_config.json b/config_files/tests_config.json
@@ -0,0 +1,54 @@
+{
+    "repository_download":{
+        "url":"ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz",
+        "download_folder":"download",
+        "skip":true
+    },
+
+    "pubtator_to_xml": {
+        "documents":"data/example/1-sample-annotations.txt",
+        "output":"data/example/2-sample-docs.xml",
+        "skip":false
+    },
+
+    "extract_tags":{
+        "input":"data/example/2-sample-docs.xml",
+        "output":"data/example/3-sample-tags.tsv",
+        "skip":false
+    },
+
+    "hetnet_id_extractor":{
+        "input":"data/example/3-sample-tags.tsv",
+        "output":"data/example/4-hetnet-tags.tsv",
+        "skip":false
+    },
+
+    "map_pmids_to_pmcids":{
+        "input":"data/example/3-sample-tags.tsv",
+        "output":"data/example/5-sample-pmids-to-pmcids.tsv",
+        "debug":true,
+        "skip":false
+    },
+
+    "download_full_text":{
+        "input":"data/example/5-sample-pmids-to-pmcids.tsv",
+        "document_batch":100,
+        "output":"data/example/6-sample-full-text.xml",
+        "temp_dir":"data/temp",
+        "log_file":"batch_log.tsv",
+        "skip":false
+    },
+
+    "extract_full_text_tags":{
+        "input":"data/example/6-sample-full-text.xml",
+        "output":"data/example/7-sample-full-text-tags.tsv",
+        "skip":false
+    },
+
+    "hetnet_id_extractor_full_text":{
+        "input":"data/example/7-sample-full-text-tags.tsv",
+        "output":"data/example/8-hetnet-full-text-tags.tsv",
+        "skip":false
+    }
+
+}
diff --git a/data/example/2-sample-docs.xml b/data/example/2-sample-docs.xml
@@ -1,7 +1,7 @@
 <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
 <collection>
   <source>Pubtator</source>
-  <date>2020/03/02</date>
+  <date>2020/08/03</date>
   <key>Pubtator.key</key>
 <document>
   <id>1560033</id>