Implement conversion from LaTeX to our Markup XML#4787

Merged
mbollmann merged 14 commits into python-dev from python-normalize-and-latex-import on Jun 1, 2025


@mbollmann mbollmann commented Mar 5, 2025

This reimplements functionality from bin/latex_to_unicode.py within the new library, needed for #4766.

Work in progress. The conversion works in principle, but it needs many more test cases to ensure feature parity with the previous implementation. Also, some normalization steps (as done in the old latex_to_unicode() function) are not yet ported.

  • bin/latex_to_unicode.py implements some heuristics to determine if e.g. % or ~ are LaTeX symbols or plain text — we should add that somehow, maybe as a parameter to the conversion functions?
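
As a rough illustration of the parameter idea, here is a minimal stdlib-only sketch. The function name `latex_specials_to_text` and the `treat_as_latex` flag are hypothetical and are not part of this PR or of bin/latex_to_unicode.py. It only shows how a caller-supplied switch could decide whether `%` and `~` get their LaTeX meaning (comment start, non-breaking space) or are kept as plain text.

```python
import re


def latex_specials_to_text(text: str, treat_as_latex: bool = True) -> str:
    """Hypothetical helper: interpret '%' and '~' as LaTeX or as plain text."""
    if not treat_as_latex:
        # Plain-text mode: leave both characters untouched.
        return text
    # LaTeX mode: an unescaped '%' starts a comment running to end of line.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # An escaped '\%' stands for a literal percent sign.
    text = text.replace(r"\%", "%")
    # An unescaped '~' is a non-breaking space (U+00A0).
    text = re.sub(r"(?<!\\)~", "\u00a0", text)
    return text


print(latex_specials_to_text(r"50\% of cases"))
print(latex_specials_to_text("100% sure", treat_as_latex=False))
```

A real implementation would still need a heuristic to choose the flag automatically, and would have to handle edge cases this sketch ignores, such as an escaped backslash directly before `%`.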

This reimplements most of `bin/latex_to_unicode.py` within the new library.
More tests are needed, and some conversions done in `latex_to_unicode` are still missing.
@mbollmann mbollmann added the python-library Concerning the acl-anthology-py library label Mar 5, 2025
@mbollmann mbollmann self-assigned this Mar 5, 2025

codecov Bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 99.13043% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.67%. Comparing base (f474131) to head (ecaa270).
Report is 19 commits behind head on python-dev.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/acl_anthology/utils/latex.py | 98.63% | 1 Missing ⚠️ |

Additional details and impacted files
@@              Coverage Diff               @@
##           python-dev    #4787      +/-   ##
==============================================
+ Coverage       93.49%   93.67%   +0.17%     
==============================================
  Files              35       35              
  Lines            2675     2782     +107     
==============================================
+ Hits             2501     2606     +105     
- Misses            174      176       +2     

| Files with missing lines | Coverage Δ |
|---|---|
| python/acl_anthology/exceptions.py | 89.47% <ø> (ø) |
| python/acl_anthology/text/markuptext.py | 94.73% <100.00%> (-0.27%) ⬇️ |
| python/acl_anthology/utils/__init__.py | 100.00% <100.00%> (ø) |
| python/acl_anthology/utils/text.py | 96.77% <100.00%> (+3.91%) ⬆️ |
| python/acl_anthology/utils/xml.py | 98.57% <100.00%> (+0.18%) ⬆️ |
| python/acl_anthology/utils/latex.py | 99.34% <98.63%> (-0.66%) ⬇️ |

... and 7 files with indirect coverage changes


@mbollmann
Member Author

@davidweichiang If you have a minute, I’d appreciate it if you could have a look at this. I have tried to port the logic of our bin/latex_to_unicode.py, which I think you mainly authored, to the new library, relying on pylatexenc rather than custom parsing + latexcodec. I created several test cases to check that the functionality is as expected. Could you look through them and see if you can think of anything else that is important, or maybe tricky to cover, when ingesting LaTeX and converting it to our XML format?

The test cases are here: https://github.com/acl-org/acl-anthology/pull/4787/files#diff-e559d67d054b0d61eb1f86a702d5373d2ea14dc6e1ff04aee432e7bcc6e912b3
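
For readers without access to the diff, test cases of this kind might look roughly like the following. This is a stdlib-only sketch and not the code in this PR: the function `latex_to_markup`, the command-to-tag mapping (`\textbf` to `<b>`, `\textit` and `\emph` to `<i>`), and the example inputs are all assumptions made for illustration, and the sketch uses a regular expression where the PR relies on pylatexenc.

```python
import re
from xml.sax.saxutils import escape

# Assumed mapping from LaTeX commands to markup tags (illustrative only).
COMMAND_TO_TAG = {"textbf": "b", "textit": "i", "emph": "i"}

# Matches flat (non-nested) formatting commands such as \textbf{...}.
PATTERN = re.compile(r"\\(textbf|textit|emph)\{([^{}]*)\}")


def latex_to_markup(text: str) -> str:
    """Hypothetical converter: formatting commands become XML tags."""
    parts = []
    pos = 0
    for match in PATTERN.finditer(text):
        # Escape the plain text between commands so it is XML-safe.
        parts.append(escape(text[pos:match.start()]))
        tag = COMMAND_TO_TAG[match.group(1)]
        parts.append(f"<{tag}>{escape(match.group(2))}</{tag}>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


# Input/expected pairs in the style of a parametrized test.
CASES = [
    (r"A \textbf{bold} claim", "A <b>bold</b> claim"),
    (r"\emph{Very} important", "<i>Very</i> important"),
    ("Q&A systems", "Q&amp;A systems"),
]

for source, expected in CASES:
    assert latex_to_markup(source) == expected
```

Nested commands, math mode, and accents are exactly the tricky inputs such a flat sketch cannot handle, which is why a proper LaTeX parser is the more robust choice.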

@mbollmann mbollmann marked this pull request as ready for review June 1, 2025 10:07
@mbollmann mbollmann merged commit 154dfd2 into python-dev Jun 1, 2025
14 checks passed
@mbollmann mbollmann deleted the python-normalize-and-latex-import branch June 1, 2025 10:12