Implement download tracker and pipeline execution change #24

danich1 · 2020-08-03T19:37:45Z

This PR builds on #22 and #23. To bring you up to speed: Pubtator is a repository designed to download abstracts and full text from Pubtator Central. I used it for my first aim, extracting hetnet relationships from biomedical text. Given the implementation, some things needed to change, which is why this PR exists. Feel free to look at the code as deeply or as shallowly as you want. The small tests run with no issues, but I'm curious to see what you find.

Overall changes for personal note:

Generated configuration files for each run
Execute.sh is now a Python script that reads in a configuration file
Downloading full text from Pubtator Central has a log tracker
Changed from relying on conda to relying on pip for package installation
Updated README to be more clear on how to set up this repository

jjc2718

Looks good! Just a few stylistic comments/clarification questions, nothing major.

jjc2718 · 2020-08-04T17:04:52Z

config_files/README.md

@@ -0,0 +1,38 @@
+# Configuration Files


Just so I understand the purpose of these config files - you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Just wondering if you need to document what each of the fields mean somewhere. Most of them are fairly obvious from the name, so I think it's probably not necessary but if you expect users to be changing things by hand a lot I might feel differently.

danich1 · 2020-08-05

> you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Correct. The idea is to provide the fields already, so a user can change directories as needed.

> Just wondering if you need to document what each of the fields mean somewhere.

Good idea. I'll add documentation to this PR.

jjc2718 · 2020-08-04T17:16:40Z

scripts/download_full_text.py


+            except Exception as e:
+                print(f"There is an error processing batch {idx}.")


Do you want these error messages to print to stderr instead? Not sure if you plan on having this script print any other text (looks like it writes everything to files currently).
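A minimal sketch of what routing the message to stderr could look like. The `process_batch` function and the sample batches are hypothetical stand-ins, not the script's actual code; the only point is the `file=sys.stderr` argument to `print`.

```python
import sys


def process_batch(batch):
    """Hypothetical stand-in for the real download step; fails on empty batches."""
    if not batch:
        raise ValueError("empty batch")
    return len(batch)


def run(batches):
    """Process each batch, reporting failures on stderr instead of stdout."""
    processed = []
    for idx, batch in enumerate(batches):
        try:
            processed.append(process_batch(batch))
        except Exception as e:
            # file=sys.stderr keeps diagnostics out of stdout, so piped or
            # redirected output is not mixed with error text.
            print(f"There is an error processing batch {idx}: {e}", file=sys.stderr)
    return processed
```

With this in place, something like `python scripts/download_full_text.py 2> errors.log` would collect the failures separately from any regular output.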

jjc2718 · 2020-08-04T17:32:07Z

scripts/download_full_text.py

+    if not Path(f"{temp_dir}/{log_file}").exists():
+        log = pd.DataFrame([], columns=["batch", "pmcid"])
+        log.to_csv(
+            f"{temp_dir}/{log_file}", 


Do you want to use a pathlib path here instead of hard-coding the forward slash? I guess it's unlikely that anyone will be using this on Windows, but it might be good to keep file paths cross-compatible wherever possible.

The simplest thing might be just to store it in a variable, something like

log_file_path = Path(f"{temp_dir}/{log_file}")

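Then you can check that for existence and pass log_file_path itself to to_csv (pandas accepts path objects; log_file_path.name would be only the bare filename, without temp_dir). That gives the file path a single point of truth, rather than having to repeat it.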
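A rough sketch of the single-variable approach, assuming `temp_dir` and `log_file` are plain strings as in the diff. The function name `load_or_create_log` is made up for illustration. The sketch joins the parts with the `/` operator and passes the whole path object to pandas, since `Path.name` is only the final path component and would drop `temp_dir`.

```python
from pathlib import Path

import pandas as pd


def load_or_create_log(temp_dir, log_file):
    """Return the download log, creating an empty one on first use."""
    # Build the path once; the / operator joins with the OS separator.
    log_file_path = Path(temp_dir) / log_file

    if not log_file_path.exists():
        log = pd.DataFrame([], columns=["batch", "pmcid"])
        # pandas accepts Path objects, so the full path is passed as-is.
        log.to_csv(log_file_path, sep="\t", index=False)
    else:
        log = pd.read_csv(log_file_path, sep="\t")
    return log
```

The same `log_file_path` variable then serves the existence check, the write, and the later read.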

jjc2718 · 2020-08-04T17:34:29Z

scripts/download_full_text.py

+            sep="\t", index=False
+        )
+    else:
+        log = pd.read_csv(f"{temp_dir}/{log_file}", sep="\t")


Same comment here as above (pathlib path vs. hardcoded string path).

jjc2718 · 2020-08-04T17:36:53Z

scripts/download_full_text.py


+        # Measure the ids that haven't been seen by the logger
+        already_seen = (
+            set(pmcid_batch_df.PMCID.values.tolist())


You might not need the .tolist() here - I think you can just pass the .values (which should be a Numpy array) directly to the set() function and it should work the same.
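A quick check of that claim on toy data. The DataFrame contents and the `logged` set are invented for illustration; only the column name `PMCID` comes from the diff.

```python
import pandas as pd

# Toy batch with one duplicated id.
pmcid_batch_df = pd.DataFrame({"PMCID": ["PMC1", "PMC2", "PMC2", "PMC3"]})

# set() iterates over either container element by element,
# so both forms produce the same set of ids.
from_list = set(pmcid_batch_df.PMCID.values.tolist())
from_array = set(pmcid_batch_df.PMCID.values)

# Hypothetical ids already recorded in the log.
logged = {"PMC2"}
not_yet_seen = from_array - logged
```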

jjc2718 · 2020-08-04T18:38:09Z

execute.py

+    filter_tags(
+        configuration["hetnet_id_extractor_full_text"]["input"], 
+        configuration["hetnet_id_extractor_full_text"]["output"]
+    )


Just as a sanity check, one thing you could do is to make sure there are no steps in the config file that aren't implemented in execute.py (i.e. the steps in the config should be a subset of what's in this file). As it stands now, it looks like an extra step in a config file would just silently not be executed.

This may help in the future if you add a step to a config but forget to add it here. Up to you, though - I don't feel too strongly about this.
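One possible shape for that check, assuming the loaded configuration is a dict keyed by step name (as the `configuration["hetnet_id_extractor_full_text"]` lookups suggest). The `IMPLEMENTED_STEPS` set and both function names are hypothetical, and `download_full_text` is used as a step name only for illustration.

```python
# Hypothetical set of step names that execute.py knows how to run.
IMPLEMENTED_STEPS = {
    "download_full_text",
    "hetnet_id_extractor_full_text",
}


def find_unimplemented_steps(configuration, implemented=IMPLEMENTED_STEPS):
    """Return config step names that have no matching implementation."""
    return sorted(set(configuration) - set(implemented))


def check_configuration(configuration):
    """Raise instead of silently skipping steps the script cannot run."""
    unknown = find_unimplemented_steps(configuration)
    if unknown:
        raise ValueError(f"Unimplemented steps in config: {', '.join(unknown)}")
```

Calling `check_configuration(configuration)` right after the config file is loaded would surface a misspelled or forgotten step before any downloads start.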

danich1 · 2020-08-05T20:33:56Z

@jjc2718 Thanks for reviewing.
@dhimmel or @cgreene can one of you add me as an admin to the repository, so I can push these changes into the repository?

dhimmel · 2020-08-05T23:18:24Z

Made you admin. Feel free to change the 1 approval required setting

danich1 added 7 commits August 3, 2020 15:06

Added batch monitor patch and updated interface

ae80866

Update README.md to reflect new changes

e924e0c

Created README.md for configuration files

1278a27

Updated files for ease of readability

781a4a4

Merge branch 'download_tracker' of github.com:danich1/pubtator into download_tracker

1d4857a

fixed execute.py

ff61ed3

file name change

2746b8d

danich1 requested a review from jjc2718 August 3, 2020 19:37

jjc2718 approved these changes Aug 4, 2020

danich1 and others added 5 commits August 5, 2020 15:37

Applying Jake's suggestions from code review

7842f2e

Co-authored-by: Jake Crawford <[email protected]>

Added documentation and made changes per jjc2718 suggestions

c46d6a3

Fixed anchors for read me

d87c64f

Finalized config readme

304eec3

No really this config.md is finalized now.

0c5575d

danich1 merged commit ecd2ac5 into greenelab:master Aug 6, 2020

danich1 deleted the download_tracker branch August 6, 2020 14:35
