
Missing functionality for modifying and ingesting data with Python lib #4766

@mbollmann


Branching this out from #4373 since that one got a bit too big. This is mostly for my internal tracking of what’s left to be done before we can fully use the new Python library for data modification and ingestion of new proceedings.

TODOs

Previous comments

I thought that I could have a working implementation of this ready in 2025 Q1, but it seems I forgot about the quite complex machinery we have for importing LaTeX, in the form of https://github.com/acl-org/acl-anthology/blob/master/bin/normalize_anth.py, https://github.com/acl-org/acl-anthology/blob/master/bin/latex_to_unicode.py, and https://github.com/acl-org/acl-anthology/blob/master/bin/fixedcase. In order for the idea here (= to add full ingestion functionality into the new library) to work, these would have to be integrated into this library, which is ... highly non-trivial.

Some complications with the currently existing functions:

  • They operate on XML elements, which is conceptually incompatible with the new library. They should work on single LaTeX inputs instead. This can probably be adapted.
  • They don’t have type annotations, which are required in the new library, and there are many, many functions to annotate.
  • They build on latexcodec==1.0.7 and include hard-coded rules based on the behaviour of latexcodec in that version (as documented by comments in the scripts). latexcodec is at 3.0.0 currently and its use is discouraged by its own developer, which is why I have switched to pylatexenc already in the new library. Incorporating code that relies on latexcodec 1.0.7 into the new library seems undesirable.
  • We don’t have tests for how the currently existing functions are supposed to behave, so it’s not clear how to port this functionality to pylatexenc while ensuring that it’s functionally the same.
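
To illustrate the first two points, here is a minimal sketch of how the string-level conversion could be separated from the XML traversal, with type annotations on both parts. The names `unescape_percent` and `apply_to_element` are hypothetical and the converter is a toy stand-in, not the logic of the existing scripts.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def unescape_percent(text: str) -> str:
    """Toy string-level converter (hypothetical): turns LaTeX '\\%' into '%'."""
    return text.replace(r"\%", "%")


def apply_to_element(elem: ET.Element, convert: Callable[[str], str]) -> None:
    """Apply a string-level converter to all text inside an XML element.

    The converter itself never sees XML; only this thin wrapper does.
    """
    if elem.text:
        elem.text = convert(elem.text)
    for child in elem:
        apply_to_element(child, convert)
        # Text following a child element is stored in the child's tail.
        if child.tail:
            child.tail = convert(child.tail)


root = ET.fromstring(r"<title>50\% of <b>100\%</b> done\%</title>")
apply_to_element(root, unescape_percent)
result = ET.tostring(root, encoding="unicode")
```

With this split, only the string-level function would need to move into the new library; the XML wrapper could stay with the old scripts for as long as they are still in use.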

Based on these considerations, I think simply copying the existing files over to the new library is not a good idea. I would rather:

  1. Compile an extensive set of test cases that document how these functions currently work. (This seems like a good idea in any case.)
  2. Start a reimplementation with pylatexenc and compare against the test cases.
  3. Copy over individual parts from the old scripts as needed to satisfy all test cases.
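
As a rough sketch of steps 1 and 2, the test cases could be stored as plain input/expected pairs and run against any candidate implementation. Everything here is an assumption for illustration: the names `Case`, `Mismatch`, and `find_mismatches` are hypothetical, the two cases are made up rather than recorded from the old scripts, and `toy_convert` only stands in for a future pylatexenc-based function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    latex: str      # input as it would arrive during ingestion
    expected: str   # output recorded from the current scripts


@dataclass(frozen=True)
class Mismatch:
    case: Case
    actual: str


def find_mismatches(
    convert: Callable[[str], str], cases: Iterable[Case]
) -> list[Mismatch]:
    """Run a candidate converter over all cases and collect the differences."""
    mismatches = []
    for case in cases:
        actual = convert(case.latex)
        if actual != case.expected:
            mismatches.append(Mismatch(case, actual))
    return mismatches


# Illustrative cases only; real ones would be recorded from the old scripts.
CASES = [
    Case(r"caf\'e", "café"),
    Case(r"50\% done", "50% done"),
]


def toy_convert(text: str) -> str:
    """Stand-in for a reimplementation; handles just the two cases above."""
    return text.replace(r"\'e", "é").replace(r"\%", "%")


remaining = find_mismatches(toy_convert, CASES)
```

An empty `remaining` list would mean the candidate matches the recorded behaviour on every case. Any entries left over would point to the parts of the old scripts that still need to be copied over (step 3).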

Based on the complexity of this, I might turn this into a separate issue/PR.

Originally posted by @mbollmann in #4373 (comment)

Labels: python-library (Concerning the acl-anthology-py library)
