Adds retraction status #314
Test fail due to:
Let's not open into
Let's not commit the CSV @geemi725 - it's pretty big
paperqa/clients/retractions.py
Outdated
def _write_csv_to_gcs(self, retraction_dataframe: pd.DataFrame) -> None:
    retraction_dataframe.to_csv(self.retraction_data_path, index=False)

def _populate_retracted_dois(self, filtered_data: pd.DataFrame) -> set[str]:
Let's not use pandas - just use python standard CSV
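A minimal sketch of what the standard-library version might look like. The column name `OriginalPaperDOI` and both helper names are assumptions for illustration, not taken from this PR.

```python
import csv
from pathlib import Path

# Assumed column name; adjust to whatever the retraction CSV actually uses.
DOI_COLUMN = "OriginalPaperDOI"


def write_rows(path: Path, rows: list[dict[str, str]], columns: list[str]) -> None:
    """Write rows to a CSV file with a header row, without pandas."""
    with path.open("w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not in `columns`.
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def load_retracted_dois(path: Path, doi_column: str = DOI_COLUMN) -> set[str]:
    """Read the CSV and return the set of non-empty, lowercased DOIs."""
    with path.open(newline="", encoding="utf-8") as f:
        return {
            doi
            for row in csv.DictReader(f)
            if (doi := (row.get(doi_column) or "").strip().lower())
        }
```

Reading into a `set[str]` keeps the later DOI membership check cheap and avoids holding a full dataframe in memory.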
paperqa/clients/retractions.py
Outdated
        ) from e
    return pd.DataFrame(columns=self.columns)  # NOTE: this is empty

def _write_csv_to_gcs(self, retraction_dataframe: pd.DataFrame) -> None:
GCS = google cloud specific stuff. Let's get a new name
tests/test_clients.py
Outdated
@pytest.mark.vcr
@pytest.mark.asyncio
async def test_crossref_retraction_status():
    async with aiohttp.ClientSession() as session:
        crossref_client = DocMetadataClient(
Nice
paperqa/clients/retractions.py
Outdated
async def _download_raw_retracted(self) -> None:
    retries = 3
    delay = 5
    url = "https://api.labs.crossref.org/data/retractionwatch"
Can you somehow move this functionality to paperqa/clients/crossref.py?
Can just make a free function there
paperqa/clients/retractions.py
Outdated
progress_bar = tqdm(
    unit="iB", unit_scale=True, desc=self.retraction_data_path
)
while True:
    chunk = await response.content.read(1024)
    if not chunk:
        break
    await f.write(chunk)
    progress_bar.update(len(chunk))
progress_bar.close()
Can we use a with statement here over manual close (that currently won't get called if an Exception happens)?
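To illustrate the point, here is a small sketch using only the standard library. `Progress` is a hypothetical stand-in for the progress bar, and `contextlib.closing` is used to guarantee `close()` runs even when the loop raises.

```python
from contextlib import closing


class Progress:
    """Hypothetical stand-in for a progress bar with update/close methods."""

    def __init__(self) -> None:
        self.total = 0
        self.closed = False

    def update(self, n: int) -> None:
        self.total += n

    def close(self) -> None:
        self.closed = True


def consume(chunks: list, bar: Progress) -> None:
    """Count chunk sizes; a None chunk simulates a failed read."""
    for chunk in chunks:
        if chunk is None:
            raise OSError("simulated read failure")
        bar.update(len(chunk))


bar = Progress()
try:
    # closing() calls bar.close() on exit, whether or not the body raises.
    with closing(bar):
        consume([b"abc", None, b"def"], bar)
except OSError:
    pass

print(bar.closed, bar.total)  # True 3
```

With the manual `progress_bar.close()` at the end of the loop, the same failure would leave the bar open.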
paperqa/clients/retractions.py
Outdated
try:
    return DOIQuery(doi=doc_details.doi, **kwargs)
except ValidationError:
    logger.debug("Must have a valid doi to query retraction data.")
Can you place the DOI into an f-string here?
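Something along these lines, as a sketch. `log_invalid_doi` is a hypothetical helper and `doi` stands in for `doc_details.doi`; the exact wording of the message is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def log_invalid_doi(doi: object) -> str:
    """Log a debug message that includes the offending DOI and return it."""
    # !r shows the value unambiguously, including None or an empty string.
    message = f"Must have a valid DOI to query retraction data, got {doi!r}."
    logger.debug(message)
    return message
```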
Removing redundant

@pytest.mark.asyncio
async def test_crossref_retraction_status():
Nice test 👍
paperqa/clients/retractions.py
Outdated
from anyio import open_file
from pydantic import ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential
from tqdm.asyncio import tqdm
tqdm is not one of our dependencies - will have to drop that.
paperqa/clients/retractions.py
Outdated
with tqdm(
    unit="iB", unit_scale=True, desc=self.retraction_data_path
) as progress_bar:
    while True:
Let's just use logging here and log some progress info
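A rough sketch of progress logging without tqdm. It is synchronous for brevity (the code in this PR is async), and `copy_with_progress` and its parameters are hypothetical names.

```python
import logging
from typing import BinaryIO

logger = logging.getLogger(__name__)


def copy_with_progress(
    src: BinaryIO,
    dst: BinaryIO,
    chunk_size: int = 1024,
    log_every: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks, logging about once per `log_every` bytes."""
    written = 0
    next_log = log_every
    while chunk := src.read(chunk_size):
        dst.write(chunk)
        written += len(chunk)
        if written >= next_log:
            logger.info("Downloaded %d bytes so far", written)
            # Schedule the next message relative to the current position.
            next_log = written + log_every
    logger.info("Finished download: %d bytes", written)
    return written
```

Logging on a byte threshold rather than on every chunk keeps the output readable for a large CSV.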
paperqa/types.py
Outdated
if self.is_retracted:
    return f"RETRACTED ARTICLE! Original doi: {self.doi}. Retrieved from http://retractiondatabase.org/."
I think we should keep the original citation though - so that you can see title, author, etc.
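For example, the retraction notice could wrap the existing citation instead of replacing it. This is only a sketch: `formatted_citation` is a hypothetical free function, and the notice wording is an assumption.

```python
def formatted_citation(citation: str, is_retracted: bool) -> str:
    """Return the citation, prefixed with a retraction notice when retracted."""
    if not is_retracted:
        return citation
    # Keep the original citation so title, authors, etc. stay visible.
    return (
        f"RETRACTED ARTICLE! Citation: {citation} "
        "Retrieved from http://retractiondatabase.org/."
    )
```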
LGTM!
This PR checks whether a DOI has been retracted.
Retraction data downloaded on 09-94-24 is added to paperqa/clients/client_data.
The retraction dataset will be updated every 30 days.
Pulled from the spetember-branch
Closes #313
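One way the 30-day refresh could be decided is from the data file's modification time. This is a sketch under that assumption; `is_stale` and `MAX_AGE` are hypothetical names, and the PR may track the download date differently.

```python
import os
import time
from datetime import timedelta
from typing import Optional

MAX_AGE = timedelta(days=30)


def is_stale(
    path: str, max_age: timedelta = MAX_AGE, now: Optional[float] = None
) -> bool:
    """Return True if the file is missing or older than `max_age`."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        # No local copy yet, so a download is needed.
        return True
    current = time.time() if now is None else now
    return (current - mtime) > max_age.total_seconds()
```

The `now` parameter exists only so the check can be exercised in tests without waiting or patching the clock.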