Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement download tracker and pipeline execution change #24

Merged
merged 12 commits into from
Aug 6, 2020

Conversation

danich1
Copy link
Contributor

@danich1 danich1 commented Aug 3, 2020

This PR is based off of #22 and #23. To fill you in on the loop basically Pubtator is a repository designed to download abstracts and full text from Pubtator Central. I used this for my first aim to extract hetnet relationships from biomedical text. Given the implementation somethings needed to be changed, which is why this PR exists. Feel free to look as deep or shallow as you want with the code. The small tests run with no issues, but curious to see what you find.

Overall changes for personal note:

  • Generated configuration files for each run
  • Execute.sh is now a python script that reads in a configuration file
  • Downloading full text from Pubtator Central has a log tracker
  • Changed from relying on conda to relying on pip for package installments
  • Updated README to be more clear on how to set up this repository

@danich1 danich1 requested a review from jjc2718 August 3, 2020 19:37
Copy link
Member

@jjc2718 jjc2718 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few stylistic comments/clarification questions, nothing major.

config_files/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,38 @@
# Configuration Files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand the purpose of these config files - you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Just wondering if you need to document what each of the fields mean somewhere. Most of them are fairly obvious from the name, so I think it's probably not necessary but if you expect users to be changing things by hand a lot I might feel differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?

Correct. The idea is to provide the fields already, so a user can change directories as needed.

Just wondering if you need to document what each of the fields mean somewhere.

Good idea. I'll add documentation to this PR.


except Exception as e:
print(f"There is an error processing batch {idx}.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want these error messages to print to stderr instead? Not sure if you plan on having this script print any other text (looks like it writes everything to files currently).

if not Path(f"{temp_dir}/{log_file}").exists():
log = pd.DataFrame([], columns=["batch", "pmcid"])
log.to_csv(
f"{temp_dir}/{log_file}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to use a pathlib path here instead of hard-coding the forward slash? I guess it's unlikely that anyone will be using this on Windows, but it might be good to keep file paths cross-compatible wherever possible.

The simplest thing might be just to store it in a variable, something like

log_file_path = Path(f"{temp_dir}/{log_file}")

Then you can check that for existence and pass log_file_path.name to to_csv. That gives the filename a single point of truth, rather than having to repeat it.

sep="\t", index=False
)
else:
log = pd.read_csv(f"{temp_dir}/{log_file}", sep="\t")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here as above (pathlib path vs. hardcoded string path).


# Measure the ids that haven't been seen by the logger
already_seen = (
set(pmcid_batch_df.PMCID.values.tolist())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might not need the .tolist() here - I think you can just pass the .values (which should be a Numpy array) directly to the set() function and it should work the same.

filter_tags(
configuration["hetnet_id_extractor_full_text"]["input"],
configuration["hetnet_id_extractor_full_text"]["output"]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a sanity check, one thing you could do is to make sure there are no steps in the config file that aren't implemented in execute.py (i.e. the steps in the config should be a subset of what's in this file). As it stands now, it looks like an extra step in a config file would just silently not be executed.

This may help in the future if you add a step to a config but forget to add it here. Up to you, though - I don't feel too strongly about this.

@danich1
Copy link
Contributor Author

danich1 commented Aug 5, 2020

@jjc2718 Thanks for reviewing.
@dhimmel or @cgreene can one of you add me as an admin to the repository, so I can push these changes into the repository?

@dhimmel
Copy link
Collaborator

dhimmel commented Aug 5, 2020

Made you admin. Feel free to change the 1 approval required setting

@danich1 danich1 merged commit ecd2ac5 into greenelab:master Aug 6, 2020
@danich1 danich1 deleted the download_tracker branch August 6, 2020 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants