Description
Branching this out from #4373 since that one got a bit too big. This is mostly for my internal tracking of what’s left to be done before we can fully use the new Python library for data modification and ingestion of new proceedings.
TODOs
- MarkupText needs to be instantiable from LaTeX, with all the modifications and normalizations we currently do
  - Should some Unicode normalizations maybe be moved into a separate function, like MarkupText.normalize(), if they are independent of LaTeX conversion issues?
- Implement minimal-diff XML saving (#5337)
- Adapt ingest_mitpress.py and test with CL ingestion from David, e.g. on 40c8044 (#5359: Adapt MIT Press ingestion script to use our library)
- Adapt remaining ingestion scripts
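On the open question in the TODO list about a separate normalization function, here is a minimal sketch of what a LaTeX-independent step could look like. The function name `normalize_text` and the two normalizations chosen (NFC composition and whitespace collapsing) are assumptions for illustration only; which normalizations actually belong in something like MarkupText.normalize() is exactly what still needs to be decided.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical stand-in for a LaTeX-independent normalization step."""
    # NFC composes base characters with combining marks,
    # e.g. "e" followed by U+0301 becomes the single character "é".
    text = unicodedata.normalize("NFC", text)
    # Assumption: runs of whitespace are collapsed into single spaces.
    return " ".join(text.split())
```

Keeping this kind of step separate would make it testable on plain strings, without going through any LaTeX parsing at all.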
Previous comments
I thought that I could have a working implementation of this ready in 2025 Q1, but it seems I forgot about the quite complex machinery we have for importing LaTeX, in the form of https://github.com/acl-org/acl-anthology/blob/master/bin/normalize_anth.py, https://github.com/acl-org/acl-anthology/blob/master/bin/latex_to_unicode.py, and https://github.com/acl-org/acl-anthology/blob/master/bin/fixedcase. In order for the idea here (= to add full ingestion functionality into the new library) to work, these would have to be integrated into this library, which is ... highly non-trivial.
Some complications with the currently existing functions:
- They operate on XML elements, which is conceptually incompatible with the new library. They should work on single LaTeX inputs instead. This can probably be adapted.
- They don’t have type annotations, which are required in the new library, and there are many, many functions to annotate.
- They build on latexcodec==1.0.7 and include hard-coded rules based on the behaviour of latexcodec in that version (as documented by comments in the scripts). latexcodec is at 3.0.0 currently and its use is discouraged by its own developer, which is why I have already switched to pylatexenc in the new library. Incorporating code that relies on latexcodec 1.0.7 into the new library seems undesirable.
- We don’t have tests for how the currently existing functions are supposed to behave, so it’s not clear how to port this functionality to pylatexenc while ensuring that it’s functionally the same.
Based on these considerations, I think simply copying the existing files over to the new library is not a good idea. I would rather:
- Compile an extensive set of test cases that document how these functions currently work. (This seems like a good idea in any case.)
- Start a reimplementation with pylatexenc and compare against the test cases.
- Copy over individual parts from the old scripts as needed to satisfy all test cases.
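The three steps above could be driven by a small table of recorded input/output pairs that any candidate implementation is checked against. The following is only a sketch: the cases, `find_mismatches`, and `toy_convert` are hypothetical names, and the real expected outputs would have to be recorded from the current behaviour of the old scripts rather than written by hand.

```python
from typing import Callable

# Hypothetical (LaTeX input, expected output) pairs; real cases would be
# recorded from what the existing scripts currently produce.
CASES: list[tuple[str, str]] = [
    ("plain text", "plain text"),
    (r"caf\'e", "café"),
]


def find_mismatches(
    convert: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Return (input, expected, actual) for every case where convert disagrees."""
    mismatches = []
    for source, expected in cases:
        actual = convert(source)
        if actual != expected:
            mismatches.append((source, expected, actual))
    return mismatches


def toy_convert(latex: str) -> str:
    """Placeholder converter that only knows a single accent macro."""
    return latex.replace(r"\'e", "é")
```

Calling `find_mismatches(toy_convert, CASES)` returns an empty list for this toy converter. A pylatexenc-based reimplementation could be passed in the same way, and every remaining mismatch would point at a rule that still needs to be ported from the old scripts.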
Based on the complexity of this, I might turn this into a separate issue/PR.
Originally posted by @mbollmann in #4373 (comment)