Document Structure in Long Document Transformers




Jan Buchmann, Max Eichler, Jan-Micha Bodensohn, Ilia Kuznetsov, and Iryna Gurevych

UKP Lab, TU Darmstadt

This repository contains the code for the paper "Document Structure in Long Document Transformers", accepted at EACL 2024.

The corresponding data can be found here: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4111

From the abstract of the paper:

Long documents often exhibit structure with hierarchically organized elements of different functions, such as section headers and paragraphs. Despite the omnipresence of document structure, its role in natural language processing (NLP) remains opaque. Do long-document Transformer models acquire an internal representation of document structure during pre-training? How can structural information be communicated to a model after pre-training, and how does it influence downstream performance? To answer these questions, we develop a novel suite of probing tasks to assess structure-awareness of long-document Transformers, propose general-purpose structure infusion methods, and evaluate the effects of structure infusion on QASPER and Evidence Inference, two challenging long-document NLP tasks.

We build our experiments on Intertext Graphs (ITG) [1] as the common data format and employ two long-document Transformers: LED [2] and LongT5 [3]. We perform downstream task experiments on QASPER [4] and Evidence Inference [5].
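For illustration, here is a minimal, hypothetical sketch of what an intertext graph looks like as a data structure. It follows the description in [1] (typed nodes for structural elements, plus typed edges for the section hierarchy and the linear reading order), but the class and field names are illustrative assumptions, not the API of the actual ITG library.

```python
from dataclasses import dataclass, field

# Simplified, hypothetical sketch of an intertext graph (ITG) [1]:
# nodes hold structural elements (title, section headers, paragraphs),
# and typed edges encode the document tree ("parent_of") and the
# linear reading order ("next"). All names here are illustrative only.

@dataclass
class Node:
    ix: str       # unique node id, e.g. "sec-1-par-1"
    ntype: str    # e.g. "article-title", "section-title", "p"
    content: str  # the node's text

@dataclass
class Edge:
    src_ix: str
    tgt_ix: str
    etype: str    # "parent_of" (hierarchy) or "next" (linear order)

@dataclass
class IntertextDocument:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

# A tiny two-node document: a section header and its paragraph.
doc = IntertextDocument(
    nodes=[
        Node("sec-1", "section-title", "Introduction"),
        Node("sec-1-par-1", "p", "Long documents often exhibit structure..."),
    ],
    edges=[
        Edge("sec-1", "sec-1-par-1", "parent_of"),
        Edge("sec-1", "sec-1-par-1", "next"),
    ],
)
```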

Contact: Jan Buchmann, [email protected]

Repository Structure

The repository is split into three parts. Each part has its own README explaining usage and a requirements.txt listing its dependencies.

├── infusion            # code for downstream task experiments and pretraining
├── probing             # code for probing experiments
└── structure_datasets  # code to create the downstream task datasets in the intertext graph format

Note that the downstream task datasets in ITG format are available at https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4111, so you should not need to recreate them.

Usage

To reproduce the experiments from the paper, please download the datasets in ITG format from https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4111 and unpack the zip files.

Move the contents of infusion_datasets to infusion/data/datasets/. See infusion/README.md for basic instructions to run experiments.

Move the contents of probing_datasets to probing/data/. See probing/README.md for basic instructions to run experiments.
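If you prefer to script this setup, the following sketch unpacks both archives and moves their contents into the locations above. The archive names are assumptions based on the folder names mentioned in this README; adjust them (and the handling of nesting, depending on how the zips are laid out) to match the actual download.

```python
import shutil
import zipfile
from pathlib import Path

# Assumed archive names; adjust to the actual files from the download page.
LAYOUT = {
    "infusion_datasets.zip": Path("infusion/data/datasets"),
    "probing_datasets.zip": Path("probing/data"),
}

for archive, target in LAYOUT.items():
    staging = Path(archive).with_suffix("")  # e.g. infusion_datasets/
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(staging)
    target.mkdir(parents=True, exist_ok=True)
    # Move the *contents* of the unpacked folder, as instructed above.
    # If the zip contains an extra top-level folder, descend into it first.
    for item in staging.iterdir():
        shutil.move(str(item), str(target / item.name))
```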

Citation

If you find our paper or this repository useful, please consider citing:

Jan Buchmann, Max Eichler, Jan-Micha Bodensohn, Ilia Kuznetsov, and Iryna Gurevych. 2024. Document Structure in Long Document Transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1056–1073, St. Julian’s, Malta. Association for Computational Linguistics.

@inproceedings{buchmann-etal-2024-document,
    title = "Document Structure in Long Document Transformers",
    author = "Buchmann, Jan  and
      Eichler, Max  and
      Bodensohn, Jan-Micha  and
      Kuznetsov, Ilia  and
      Gurevych, Iryna",
    editor = "Graham, Yvette  and
      Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.64",
    pages = "1056--1073",
}

References

[1] Ilia Kuznetsov, Jan Buchmann, Max Eichler, and Iryna Gurevych. 2022. Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review. Computational Linguistics, 48(4):949–986. https://doi.org/10.1162/coli_a_00455

[2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.

[3] Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. 2022. LongT5: Efficient Text-To-Text Transformer for Long Sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.

[4] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, Online. Association for Computational Linguistics.

[5] Jay DeYoung, Eric Lehman, Benjamin Nye, Iain Marshall, and Byron C. Wallace. 2020. Evidence Inference 2.0: More Data, Better Models. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 123–132, Online. Association for Computational Linguistics.

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
