
@benjaminking (Collaborator) commented Oct 30, 2025

This PR improves quote convention detection for Paratext projects, especially messy projects that are inconsistent in their quote conventions. It does this by implementing a weighted voting scheme across the books of a project. It also makes a small change to the way quote convention similarity is calculated to accommodate the weighted voting. Finally, it adds a new quote convention that was recently observed in a project.

On a set of 57 real projects submitted to Serval, this improves the accuracy of quote convention detection from 40% to 95%.
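Roughly, the idea is that each book is scored independently against the standard conventions and the per-book scores are then combined as weighted votes. A minimal sketch under that assumption (the function and parameter names here are hypothetical, not the actual machine.punctuation_analysis API):

    from collections import defaultdict

    def detect_convention_by_weighted_voting(
        book_scores: list[dict[str, float]],
        book_weights: list[float],
    ) -> str:
        """book_scores[i] maps convention name -> similarity score for book i;
        book_weights[i] is that book's voting weight (e.g., how many
        quotation marks it contains)."""
        totals: dict[str, float] = defaultdict(float)
        for scores, weight in zip(book_scores, book_weights):
            for convention, score in scores.items():
                totals[convention] += weight * score
        # The convention with the highest weighted total wins the vote.
        return max(totals, key=totals.get)

Weighting by the amount of evidence in each book keeps a handful of short, inconsistent books from outvoting the convention used by the bulk of the project.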



@benjaminking requested a review from @Enkidu93 October 30, 2025 18:59
@ddaspit (Contributor) left a comment

@ddaspit reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


machine/punctuation_analysis/quote_convention_detector.py line 53 at r1 (raw file):

        return STANDARD_QUOTE_CONVENTIONS.score_all_quote_conventions(self._quotation_mark_tabulator)

    def detect_quote_convention_and_get_tabulated_quotation_marks(

Could we expose the tabulated quotation marks from the QuoteConventionAnalysis instead of returning it separately?


machine/punctuation_analysis/quote_convention.py line 64 at r1 (raw file):

    def __hash__(self) -> int:
        return hash((tuple(self.level_conventions)))

Are the extra parentheses necessary? Doesn't tuple return a tuple?

@Enkidu93 (Collaborator) left a comment

@Enkidu93 reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @benjaminking)


tests/punctuation_analysis/test_quote_convention_set.py line 1254 at r1 (raw file):

    assert all_three_quote_convention_set.find_most_similar_convention(noisy_multiple_english_quotes_tabulator) == (
        standard_english_quote_convention,
        approx(0.8333333333333, rel=1e-9),
    )

I'm noticing that these scores all seem to be going down. Why is that the case? If the intent is to use the same logic as before but weight it by book, shouldn't these stay the same? Is it because of this comment: `# The scores of greater depths depend on the scores of shallower depths`? Is the idea that if the top-level quotes are off, that should be compounded into the score for deeper quotes? Was this motivated by particular examples?

We aren't thresholding on these values at the moment anywhere, are we? If so, we need to make sure we update those threshold values.


machine/punctuation_analysis/quote_convention_analysis.py line 17 at r1 (raw file):

        self._convention_scores = convention_scores
        if len(convention_scores) > 0:
            self._best_quote_convention_score = max(convention_scores.items(), key=lambda item: item[1])[1]

You should combine this with setting `self._best_quote_convention_score` so you don't have to calculate `max(...)` twice.

@codecov-commenter commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 96.45390% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.97%. Comparing base (9d71953) to head (2e699e2).

Files with missing lines                                 Patch %   Lines
tests/corpora/test_usfm_manual.py                         20.00%   4 Missing ⚠️
.../zip_paratext_project_quote_convention_detector.py     50.00%   1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #239      +/-   ##
==========================================
+ Coverage   90.93%   90.97%   +0.04%     
==========================================
  Files         337      338       +1     
  Lines       21726    21804      +78     
==========================================
+ Hits        19756    19837      +81     
+ Misses       1970     1967       -3     


@benjaminking (Collaborator, Author) commented

For some reason, Reviewable crashes every time I open this PR, so I'm just going to respond on GitHub.

Could we expose the tabulated quotation marks from the QuoteConventionAnalysis instead of returning it separately?

I have refactored this to avoid having to either expose additional fields or return multiple values from a method.

Are the extra parentheses necessary? Doesn't tuple return a tuple?

Yes, I have removed the extraneous parentheses.
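For reference, the cleaned-up method is simply:

    def __hash__(self) -> int:
        return hash(tuple(self.level_conventions))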

I'm noticing that these scores all seem to be going down. Why is that the case? If the intent is to use the same logic as before but weight it by book, shouldn't these stay the same? Is it because of this: # The scores of greater depths depend on the scores of shallower depths? Is the idea that if the top-level quotes are off, that should be compounded into the score for deeper quotes? Was this motivated by particular examples?

Yes, the changes in score are due to the way that the quotation matching score is calculated (as you mentioned). Yes, the idea is that we shouldn't give credit for matching 2nd-level quotation marks unless the 1st-level quotation marks match (to some extent). A situation that came up in a few cases motivated this change.

Suppose that a project has some books that use convention X and some that use convention Y. Now suppose that convention Z has the same 1st-level quotation marks as X and the same 2nd-level quotation marks as Y. In a lot of cases, Z was being identified as the quote convention for the project, even though none of the books used Z. So I made the change to only give a convention credit for having matching 2nd-level quotation marks to the extent that it also matches the 1st-level quotation marks.

There should also be a number of situations where the score will increase under the new calculation.
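A minimal sketch of the compounding idea (the function and names here are hypothetical; the actual scoring code in machine/punctuation_analysis may differ):

    def compounded_similarity(per_depth_scores: list[float]) -> float:
        """per_depth_scores[d] is the raw quotation-mark match score at depth
        d + 1. A depth only earns credit to the extent that all shallower
        depths also match."""
        if not per_depth_scores:
            return 0.0
        total = 0.0
        running_match = 1.0
        for raw_score in per_depth_scores:
            running_match *= raw_score  # gate deeper credit on shallower matches
            total += running_match
        return total / len(per_depth_scores)

Under this sketch, convention Z from the example above scores [0.0, 1.0] against books that use Y: the matching 2nd-level marks compound with the mismatched 1st-level marks to give 0 instead of averaging to 0.5, so Z can no longer beat X and Y on mixed projects.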

We aren't thresholding on these values at the moment anywhere, are we? If so, we need to make sure we update those threshold values.

I don't believe we have set any thresholds based on this score.

You should combine this with setting self._best_quote_convention_score so you don't have to calculate max(...) twice.

Done.
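Presumably something along these lines, unpacking a single `max(...)` call (assuming a `_best_quote_convention` field alongside the score; exact names may differ):

    self._convention_scores = convention_scores
    if len(convention_scores) > 0:
        best_convention, best_score = max(convention_scores.items(), key=lambda item: item[1])
        self._best_quote_convention = best_convention
        self._best_quote_convention_score = best_score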

@Enkidu93 (Collaborator) left a comment

Reviewable is also crashing for me :/.

Great, looks good. Thank you for the details on the depth-based adjustment.

@Enkidu93 requested a review from @ddaspit November 7, 2025 15:43
@ddaspit (Contributor) left a comment

@ddaspit reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @benjaminking)

@ddaspit (Contributor) left a comment

:lgtm:

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

@benjaminking merged commit 0c3cd9c into main Nov 10, 2025
13 of 14 checks passed
@benjaminking deleted the quote_convention_detection_by_voting branch November 10, 2025 21:14