Adds retraction status #314

Merged
whitead merged 11 commits into september-2024-release from retraction
Sep 11, 2024

Conversation

geemi725
Contributor

@geemi725 geemi725 commented Sep 5, 2024

This PR checks whether a DOI has been retracted.

Retraction data downloaded on 09-04-24 is added to paperqa/clients/client_data

Will update the retraction dataset every 30 days.

Pulled from the september branch

Closes #313
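
The 30-day refresh mentioned above could be driven by a simple staleness check on the local data file. This is a minimal sketch, not the PR's implementation: the function name `is_stale`, the constant `REFRESH_INTERVAL_SECONDS`, and the file name `retractions.csv` are all hypothetical.

```python
import os
import tempfile
import time

# 30 days, per the PR description; the constant name is an assumption.
REFRESH_INTERVAL_SECONDS = 30 * 24 * 60 * 60


def is_stale(path, max_age_seconds=REFRESH_INTERVAL_SECONDS, now=None):
    """Return True if the file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    current_time = time.time() if now is None else now
    return current_time - os.path.getmtime(path) > max_age_seconds


with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "retractions.csv")  # hypothetical file name
    missing_is_stale = is_stale(data_path)  # file has not been created yet

    with open(data_path, "w") as f:
        f.write("placeholder\n")
    fresh_is_stale = is_stale(data_path)  # just written

    # Pretend 31 days have passed instead of changing the file's mtime.
    in_31_days = time.time() + 31 * 24 * 60 * 60
    old_is_stale = is_stale(data_path, now=in_31_days)
```

A caller would re-download the dataset only when `is_stale` returns True, which keeps the check cheap on every startup.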

@geemi725
Contributor Author

geemi725 commented Sep 5, 2024

Tests fail due to: ModuleNotFoundError: No module named 'pandas'
Use dict instead?

@jamesbraza
Collaborator

Let's not open into main, open into september release

@geemi725 geemi725 changed the base branch from main to september-2024-release September 5, 2024 21:02
@whitead
Collaborator

whitead commented Sep 6, 2024

Let's not commit the CSV @geemi725 - it's pretty big

Comment on lines 86 to 89
def _write_csv_to_gcs(self, retraction_dataframe: pd.DataFrame) -> None:
    retraction_dataframe.to_csv(self.retraction_data_path, index=False)

def _populate_retracted_dois(self, filtered_data: pd.DataFrame) -> set[str]:
Collaborator

Let's not use pandas - just use Python's standard csv module
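
A minimal sketch of what the standard-library route could look like, reading the retraction CSV into a set of DOIs with `csv.DictReader` instead of a DataFrame. The function name `load_retracted_dois` is hypothetical, and the column names `OriginalPaperDOI` and `RetractionNature` are assumptions about the dataset's header, not something shown in this thread.

```python
import csv
import io


def load_retracted_dois(
    csv_file,
    doi_column="OriginalPaperDOI",      # assumed column name
    nature_column="RetractionNature",   # assumed column name
):
    """Collect lower-cased DOIs of retracted papers from an open CSV file."""
    retracted = set()
    for row in csv.DictReader(csv_file):
        doi = (row.get(doi_column) or "").strip().lower()
        nature = (row.get(nature_column) or "").strip()
        # Skip rows without a DOI and rows that are not retractions.
        if doi and nature == "Retraction":
            retracted.add(doi)
    return retracted


# In-memory sample standing in for the downloaded file.
sample = io.StringIO(
    "OriginalPaperDOI,RetractionNature\n"
    "10.1000/ABC,Retraction\n"
    "10.1000/xyz,Correction\n"
    ",Retraction\n"
)
retracted_dois = load_retracted_dois(sample)
```

A set gives constant-time membership checks, so a lookup such as `doi.lower() in retracted_dois` covers the retraction check without any pandas dependency.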

        ) from e
    return pd.DataFrame(columns=self.columns)  # NOTE: this is empty

def _write_csv_to_gcs(self, retraction_dataframe: pd.DataFrame) -> None:
Collaborator

GCS = google cloud specific stuff. Let's get a new name

Comment on lines 332 to 336
@pytest.mark.vcr
@pytest.mark.asyncio
async def test_crossref_retraction_status():
    async with aiohttp.ClientSession() as session:
        crossref_client = DocMetadataClient(
Collaborator

Nice

async def _download_raw_retracted(self) -> None:
    retries = 3
    delay = 5
    url = "https://api.labs.crossref.org/data/retractionwatch"
Collaborator

Can you somehow move this functionality to paperqa/clients/crossref.py?

Can just make a free function there
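
One possible shape for such a free function, sketched with the standard library only. The name `download_with_retries` and the injected `fetch` coroutine are hypothetical. The retry count and delay mirror the `retries = 3` and `delay = 5` values in the snippet above. Catching `OSError` is an assumption about what the real HTTP call would raise.

```python
import asyncio


async def download_with_retries(fetch, url, retries=3, delay=5.0):
    """Call `await fetch(url)` up to `retries` times, sleeping between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except OSError as e:  # assumed failure type for network errors
            last_error = e
            if attempt < retries:
                await asyncio.sleep(delay)
    raise RuntimeError(
        f"Failed to download {url} after {retries} attempts"
    ) from last_error


# Stand-in for the real HTTP call: fails twice, then succeeds.
attempts = []


async def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise OSError("temporary network failure")
    return b"doi,status\n"


payload = asyncio.run(
    download_with_retries(
        flaky_fetch,
        "https://api.labs.crossref.org/data/retractionwatch",
        delay=0,  # no waiting in this demo
    )
)
```

Because the function takes no `self`, it can live at module level and be unit-tested with a fake `fetch`, as the demo does.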

Comment on lines 67 to 76
progress_bar = tqdm(
    unit="iB", unit_scale=True, desc=self.retraction_data_path
)
while True:
    chunk = await response.content.read(1024)
    if not chunk:
        break
    await f.write(chunk)
    progress_bar.update(len(chunk))
progress_bar.close()
Collaborator

Can we use a with statement here instead of the manual close? The close currently won't get called if an Exception happens.
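
To illustrate the concern, here is a small sketch showing that a `with` block still closes the resource when the body raises. `FakeProgressBar` is a hypothetical stand-in for any object with `update` and `close` methods, and `contextlib.closing` supplies the context manager.

```python
from contextlib import closing


class FakeProgressBar:
    """Hypothetical stand-in for a progress bar with update/close methods."""

    def __init__(self):
        self.total = 0
        self.closed = False

    def update(self, n):
        self.total += n

    def close(self):
        self.closed = True


bar = FakeProgressBar()
try:
    # closing() calls bar.close() on exit, even if the body raises.
    with closing(bar) as progress_bar:
        progress_bar.update(3)
        raise RuntimeError("download interrupted")
except RuntimeError:
    pass
```

With the manual `progress_bar.close()` placed after the loop, the same exception would skip the close entirely.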

try:
    return DOIQuery(doi=doc_details.doi, **kwargs)
except ValidationError:
    logger.debug("Must have a valid doi to query retraction data.")
Collaborator

Can you place the DOI into an f-string here?
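
A rough sketch of the idea, with the offending DOI included in the debug message. The function `build_doi_query` and its simple prefix check are hypothetical stand-ins for the `DOIQuery` validation in the snippet above.

```python
import logging

logger = logging.getLogger(__name__)


def build_doi_query(doi):
    """Return a query dict for a plausible DOI, or None after logging the bad value."""
    # Hypothetical validity check standing in for the real model validation.
    if not doi or not doi.startswith("10."):
        logger.debug(
            f"Must have a valid DOI to query retraction data, got {doi!r}."
        )
        return None
    return {"doi": doi}


valid_query = build_doi_query("10.1000/abc")
invalid_query = build_doi_query("not-a-doi")
```

Including the value makes the log line actionable, since the reader can see which document failed validation.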

@geemi725
Contributor Author

Removing the redundant RetractionDataPostProcessor calls in test_client.py worked!


@pytest.mark.asyncio
async def test_crossref_retraction_status():
Collaborator

Nice test 👍

from anyio import open_file
from pydantic import ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential
from tqdm.asyncio import tqdm
Collaborator

tqdm is not one of our dependencies - will have to drop that.

Comment on lines 77 to 80
with tqdm(
    unit="iB", unit_scale=True, desc=self.retraction_data_path
) as progress_bar:
    while True:
Collaborator

Let's just use logging here and log some progress info
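
A minimal sketch of logging-based progress reporting in place of a progress bar. The function name `write_chunks_with_logging` and the `log_every` threshold are hypothetical, and the sketch uses a plain blocking `open` to stay within the standard library, whereas the PR imports an async file API.

```python
import asyncio
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


async def write_chunks_with_logging(chunks, path, log_every=1024 * 1024):
    """Write an async stream of byte chunks to `path`, logging progress periodically."""
    written = 0
    next_log = log_every
    with open(path, "wb") as f:  # blocking I/O, kept simple for the sketch
        async for chunk in chunks:
            f.write(chunk)
            written += len(chunk)
            if written >= next_log:
                logger.info("Downloaded %d bytes to %s", written, path)
                next_log = written + log_every
    logger.info("Finished download: %d bytes written to %s", written, path)
    return written


async def fake_stream():
    """Stand-in for an HTTP response body: four 512-byte chunks."""
    for _ in range(4):
        yield b"x" * 512


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "retractions.csv")  # hypothetical file name
    total_written = asyncio.run(
        write_chunks_with_logging(fake_stream(), target, log_every=1024)
    )
    size_on_disk = os.path.getsize(target)
```

Logging at a byte threshold rather than on every chunk keeps the output short for large downloads.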

paperqa/types.py Outdated
Comment on lines 544 to 547

if self.is_retracted:
    return f"RETRACTED ARTICLE! Original doi: {self.doi}. Retrieved from http://retractiondatabase.org/."

Collaborator

I think we should keep the original citation though - so that you can see title, author, etc.
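
One way this suggestion might look, sketched with a hypothetical `Paper` dataclass rather than the PR's actual type: the retraction notice is added in front of the original citation instead of replacing it.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical stand-in for the document-details type in paperqa/types.py."""

    citation: str
    doi: str
    is_retracted: bool = False

    @property
    def formatted_citation(self) -> str:
        if self.is_retracted:
            # Keep the original citation so title, authors, etc. stay visible.
            return (
                f"RETRACTED ARTICLE! Original citation: {self.citation} "
                f"Original doi: {self.doi}. "
                "Retrieved from http://retractiondatabase.org/."
            )
        return self.citation


paper = Paper(
    citation="Doe, J. Example Title. Example Journal, 2020.",  # made-up sample
    doi="10.1000/abc",
    is_retracted=True,
)
formatted = paper.formatted_citation
```

Readers then see both the warning and enough bibliographic detail to identify the paper.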

@whitead whitead self-requested a review September 11, 2024 02:09
Collaborator

@whitead whitead left a comment

LGTM!

@whitead whitead merged commit 4ba4386 into september-2024-release Sep 11, 2024
3 checks passed
@whitead whitead deleted the retraction branch September 11, 2024 02:39
This was referenced Sep 11, 2024
Development

Successfully merging this pull request may close these issues.

Add retraction status to queries
3 participants