Missing functionality for modifying and ingesting data with Python lib

Branching this out from #4373 since that one got a bit too big. This is mostly for my internal tracking of what’s left to be done before we can fully use the new Python library for data modification and ingestion of new proceedings.

## TODOs

- [x] MarkupText needs to be instantiable from LaTeX, with all the modifications and normalizations we currently do
    - Should some Unicode normalizations maybe be moved into a separate function, like `MarkupText.normalize()`, if they’re independent of LaTeX conversion issues?
- [x] Implement minimal-diff XML saving – #5337 
- [x] Adapt `ingest_mitpress.py` and test with CL ingestion from David, e.g. on 40c8044aa – #5359
- [ ] Adapt remaining ingestion scripts
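
To make the open question about a separate normalization step more concrete, here is a minimal sketch of what a LaTeX-independent normalization could look like. The function name `normalize_text` and the two normalizations it applies (NFC composition, whitespace collapsing) are assumptions for illustration only; which normalizations actually belong in something like `MarkupText.normalize()` is not settled here.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical standalone normalization, independent of LaTeX conversion."""
    # NFC composes base characters with their combining marks,
    # e.g. "e" + U+0301 (combining acute accent) becomes "é".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim both ends. Note that str.split()
    # also treats non-breaking spaces as whitespace; whether that is wanted
    # is an assumption of this sketch.
    return " ".join(text.split())


print(normalize_text("Cafe\u0301  au   lait"))
```

Keeping this apart from the LaTeX conversion would let it run on any input string, regardless of where the text came from.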

## Previous comments

I thought that I could have a working implementation of this ready in 2025 Q1, but it seems I forgot about the quite complex machinery we have for importing LaTeX, in the form of https://github.com/acl-org/acl-anthology/blob/master/bin/normalize_anth.py, https://github.com/acl-org/acl-anthology/blob/master/bin/latex_to_unicode.py, and https://github.com/acl-org/acl-anthology/blob/master/bin/fixedcase. In order for the idea here (= to add full ingestion functionality into the new library) to work, these would have to be integrated into this library, which is ... highly non-trivial.

Some complications with the currently existing functions:

- They operate on XML elements, which is conceptually incompatible with the new library. They should work on single LaTeX inputs instead. This can probably be adapted.
- They don’t have type annotations, which are required in the new library, and there are many, many functions to annotate.
- They build on `latexcodec==1.0.7` and include hard-coded rules based on the behaviour of latexcodec in that version (as documented by comments in the scripts). latexcodec is at 3.0.0 currently and its use is discouraged by its own developer, which is why I have switched to `pylatexenc` already in the new library. Incorporating code that relies on latexcodec 1.0.7 into the new library seems undesirable.
- We don’t have tests for how the currently existing functions are supposed to behave, so it’s not clear how to port this functionality to pylatexenc while ensuring that it’s functionally the same.
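
The first two points could be addressed together by making the conversion a typed, pure string-to-string function and keeping any XML handling in a thin wrapper. The following is only a sketch under that assumption: `apply_to_element` and `strip_braces` are hypothetical names, and `strip_braces` is a toy stand-in, not the behaviour of the existing scripts.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def strip_braces(latex: str) -> str:
    """Toy stand-in for a real converter: only removes curly braces."""
    return latex.replace("{", "").replace("}", "")


def apply_to_element(elem: ET.Element, convert: Callable[[str], str]) -> None:
    """Apply a string converter to all text under ``elem``, in place."""
    for node in elem.iter():  # iter() yields elem itself, then its descendants
        if node.text:
            node.text = convert(node.text)
        # The tail of elem itself lies outside the element, so leave it alone.
        if node is not elem and node.tail:
            node.tail = convert(node.tail)


title = ET.fromstring("<title>A {B}ig <i>{T}est</i> case</title>")
apply_to_element(title, strip_braces)
print(ET.tostring(title, encoding="unicode"))
```

With this split, the converter can be tested on plain strings, and only the wrapper needs to know about XML.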

Based on these considerations, I think simply copying the existing files over to the new library is not a good idea. I would rather:

1. Compile an extensive set of test cases that document how these functions currently work. (This seems like a good idea in any case.)
2. Start a reimplementation with pylatexenc and compare against the test cases.
3. Copy over individual parts from the old scripts as needed to satisfy all test cases.
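
The steps above could be driven by a small table of input/output pairs that any candidate implementation is run against. This is a sketch only: `Case`, `CASES`, `failing_cases`, and `toy_convert` are hypothetical names, and the two cases are placeholders rather than recorded output of the old scripts.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    latex: str
    expected: str


# Placeholder cases; real ones would be recorded from the current scripts.
CASES = [
    Case(r"caf\'e", "café"),
    Case(r'na\"ive', "naïve"),
]


def failing_cases(
    convert: Callable[[str], str], cases: list[Case]
) -> list[tuple[Case, str]]:
    """Return every case whose converted output differs from the expectation."""
    failures = []
    for case in cases:
        actual = convert(case.latex)
        if actual != case.expected:
            failures.append((case, actual))
    return failures


def toy_convert(latex: str) -> str:
    """Deliberately incomplete converter: handles only the acute accent on e."""
    return latex.replace(r"\'e", "é")


for case, actual in failing_cases(toy_convert, CASES):
    print(f"{case.latex!r}: expected {case.expected!r}, got {actual!r}")
```

A reimplementation would then be considered done for a given feature once `failing_cases` returns an empty list for the corresponding cases.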

Given the complexity of this, I might turn it into a separate issue/PR.

_Originally posted by @mbollmann in https://github.com/acl-org/acl-anthology/issues/4373#issuecomment-2676768119_
            

Missing functionality for modifying and ingesting data with Python lib #4766
