Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken title search ut #411

Merged
merged 15 commits into from
Sep 16, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion tests/test_clients.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
from paperqa.clients.journal_quality import JournalQualityPostProcessor
from paperqa.clients.retractions import RetrationDataPostProcessor

from utils.paper_helpers import compare_formatted_citations

@pytest.mark.vcr
@pytest.mark.parametrize(
Expand Down Expand Up @@ -112,11 +113,17 @@ async def test_title_search(paper_attributes: dict[str, str]) -> None:
),
)
details = await client.query(title=paper_attributes["title"])

# compares the citation without the specific number of citations
assert compare_formatted_citations(
paper_attributes['formatted_citation'], details.formatted_citation
), "Formatted citation should match"

assert set(details.other["client_source"]) == set( # type: ignore[union-attr]
paper_attributes["source"]
), "Should have the correct source"
for key, value in paper_attributes.items():
if key not in {"is_oa", "source"}:
if key not in {"is_oa", "source", "formatted_citation"}:
nadolskit marked this conversation as resolved.
Show resolved Hide resolved
assert getattr(details, key) == value, f"Should have the correct {key}"
elif key == "is_oa":
assert (
Expand Down
18 changes: 18 additions & 0 deletions tests/utils/paper_helpers.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
import re

def compare_formatted_citations(expected: str, actual: str) -> bool:
"""
Compares two formatted citation strings; ignoring the citation_count value.

:param expected: The expected formatted citation string.
nadolskit marked this conversation as resolved.
Show resolved Hide resolved
:param actual: The actual formatted citation string.
:return: True if the citations match except for the citation count, False otherwise.
"""
# https://regex101.com/r/lCN8ET/1
citation_pattern = r"(This article has )\d+( citations?)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex is closely coupled to DocDetails. Can you move this regex to a ClassVar[str] on DocDetails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!


# between group 1 and 2, replace with the character "n"
expected_cleaned = re.sub(citation_pattern, r"\1n\2", expected).strip()
actual_cleaned = re.sub(citation_pattern, r"\1n\2", actual).strip()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you got the capturing groups wrong, you want r"This article has (\d+) citations?"

We don't care to extract the text, we want to extract the number

Then you can just directly compare the int, no need for the weird r"\1n\2"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair.
This original approach using \1n\2 was because it's focusing on comparing the structure of the citation text while ignoring the citation count itself; just ignoring whatever number is there.

But you make a valid point. I'll get this updated!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw (I am not sure your experience levels with regex) going from r"(This article has )\d+( citations?)" to r"This article has (\d+) citations?" will still match the surrounding text too. A capturing group () is here to make it easy to capture and extract a value

And plz point out if we're on the same page already haha. Hope you have a good night

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I've addressed this accordingly. Thx James!


return expected_cleaned == actual_cleaned
Loading