-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement download tracker and pipeline execution change #24
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few stylistic comments/clarification questions, nothing major.
@@ -0,0 +1,38 @@ | |||
# Configuration Files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand the purpose of these config files - you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true
or changing the output filenames)?
Just wondering if you need to document what each of the fields mean somewhere. Most of them are fairly obvious from the name, so I think it's probably not necessary but if you expect users to be changing things by hand a lot I might feel differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't expect users to add or remove fields, correct? They would just change the fields if necessary (for example setting skip:true or changing the output filenames)?
Correct. The idea is to provide the fields already, so a user can change directories as needed.
Just wondering if you need to document what each of the fields mean somewhere.
Good idea. I'll add documentation to this PR.
scripts/download_full_text.py
Outdated
|
||
except Exception as e: | ||
print(f"There is an error processing batch {idx}.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want these error messages to print to stderr instead? Not sure if you plan on having this script print any other text (looks like it writes everything to files currently).
scripts/download_full_text.py
Outdated
if not Path(f"{temp_dir}/{log_file}").exists(): | ||
log = pd.DataFrame([], columns=["batch", "pmcid"]) | ||
log.to_csv( | ||
f"{temp_dir}/{log_file}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to use a pathlib
path here instead of hard-coding the forward slash? I guess it's unlikely that anyone will be using this on Windows, but it might be good to keep file paths cross-compatible wherever possible.
The simplest thing might be just to store it in a variable, something like
log_file_path = Path(f"{temp_dir}/{log_file}")
Then you can check that for existence and pass log_file_path.name
to to_csv
. That gives the filename a single point of truth, rather than having to repeat it.
scripts/download_full_text.py
Outdated
sep="\t", index=False | ||
) | ||
else: | ||
log = pd.read_csv(f"{temp_dir}/{log_file}", sep="\t") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here as above (pathlib
path vs. hardcoded string path).
scripts/download_full_text.py
Outdated
|
||
# Measure the ids that haven't been seen by the logger | ||
already_seen = ( | ||
set(pmcid_batch_df.PMCID.values.tolist()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might not need the .tolist()
here - I think you can just pass the .values
(which should be a Numpy array) directly to the set()
function and it should work the same.
filter_tags( | ||
configuration["hetnet_id_extractor_full_text"]["input"], | ||
configuration["hetnet_id_extractor_full_text"]["output"] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a sanity check, one thing you could do is to make sure there are no steps in the config file that aren't implemented in execute.py
(i.e. the steps in the config should be a subset of what's in this file). As it stands now, it looks like an extra step in a config file would just silently not be executed.
This may help in the future if you add a step to a config but forget to add it here. Up to you, though - I don't feel too strongly about this.
Made you admin. Feel free to change the 1 approval required setting |
This PR is based off of #22 and #23. To fill you in on the loop basically Pubtator is a repository designed to download abstracts and full text from Pubtator Central. I used this for my first aim to extract hetnet relationships from biomedical text. Given the implementation somethings needed to be changed, which is why this PR exists. Feel free to look as deep or shallow as you want with the code. The small tests run with no issues, but curious to see what you find.
Overall changes for personal note: