Toward Structured Related Work Generation with Novelty Statements @ the Fourth Workshop on Scholarly Document Processing (SDP 2024)


Toward Structured Related Work Generation with Novelty Statements

OMRON SINIC X

Kazuya Nishimura, Kuniaki Saito, Tosho Hirasawa, Yoshitaka Ushiku

[Paper][Poster]

To do

  • Code to process STRoGeNS arXiv
  • Code to process conference proceedings
  • Code for comparisons
  • Code for metrics evaluation

 

Dataset description
| Dataset | Pairs | Words (Input) | Words (Output) | Input Docs | #Para. |
| --- | --- | --- | --- | --- | --- |
| Multi-XScience | 40,528 | 778.1 | 116.4 | 4.4 | 1 |
| S2ORC | 136,655 | 1,067.4 | 148.7 | 5.0 | 1 |
| Delve | 78,927 | 622.6 | 228.6 | 3.7 | 1 |
| TAS2 | 117,700 | 1,036.0 | 134.8 | 4.8 | 1 |
| TAD | 218,255 | 1,071.4 | 162.3 | 5.2 | 1 |
| BigSurvey-MDS | 4,478 | 11,893.1 | 1,051.7 | 76.3 | 1 |
| SciReviewGen | 10,130 | 11,734.4 | 7,193.8 | 68.1 | 1 |
| STRoGeNS-arXiv22 (Ours) | 85,853 | 3,046.2 | 514.3 | 16.6 | 4.22 |
| STRoGeNS-conf22 (Ours) | 15,079 | 3,669.1 | 508.5 | 20.4 | 4.27 |
| STRoGeNS-conf23 (Ours) | 4,762 | 4,836.6 | 504.6 | 25.7 | 4.04 |
Data format
{
  "title": "{title}",
  "abstract": "{abst}",
  "related_work": "{related work}",
  "cited": {
    "[1]": {
      "title": "{ref_1:title}",
      "abstract": "{ref_1:abst}"
    },
    ...,
    "[N]": {
      "title": "{ref_N:title}",
      "abstract": "{ref_N:abst}"
    }
  }
}
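For illustration, records in this format can be read as follows. This is a hypothetical loader, not code from the repository; it assumes the records are stored one JSON object per line in a jsonl file.

```python
import json

def iter_records(path):
    """Yield (title, related_work, cited_titles) tuples from a jsonl file
    whose records follow the format above."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            # "cited" maps citation markers like "[1]" to title/abstract dicts
            refs = {k: v.get("title", "") for k, v in rec.get("cited", {}).items()}
            yield rec["title"], rec["related_work"], refs
```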

Example of data

Coming soon

Environment

Requirements

Run on Docker

git clone https://github.com/omron-sinicx/STRoGeNS.git
cd STRoGeNS

bash ./docker/build.sh
bash ./docker/run.sh
bash ./docker/exec.sh

huggingface-cli login

*Note: our Docker environment does not include the nougat environment. To use nougat, please follow the nougat docs.

Data Preparation

Unfortunately, due to licensing issues, we cannot release all of the datasets we created. We release the part of STRoGeNS-arXiv22 consisting of articles with an open license. To reproduce the entire dataset, we also release the processing code used to create it.

STRoGeNS-arXiv22

Step 1. Download the arXiv dataset from unarXive22

See the download instructions at https://github.com/IllDepence/unarXive

Decompress the xz file with the following command:

tar -xvf unarXive_230324_open_subset.tar.xz

Put the extracted files in ./data/STRoGeNS-arXiv22/rawdata

Folder Structure
data/STRoGeNS-arXiv22
├── rawdata
│   ├── 00
│   │   ├── arXiv_src_0005_001.jsonl
│   │   ...
│   ├── 01
│   ...
├── rw
└── rw_wabst
Execute all preprocessing

scripts/preproc/arxiv2022_processing.sh

Individual pre-processing codes

Execute each processing.

Step 2. Extract titles and other metadata from the jsonl files.

python ./preprocessing/arxiv_dataset/unarxiv_title_ext.py \
    --data_dir data/STRoGeNS-arXiv22/rawdata \
    --output_path data/STRoGeNS-arXiv22/rw --log_dir {logs_directory} \
    --n_dir 0 # resume from the n-th intermediate directory

Step 3. Retrieve abstracts of cited references from Semantic Scholar

Please set S2_API_KEY = "{S2APIKEY}" on line 14 of unarxiv_add_ref_info.py. See https://www.semanticscholar.org/product/api#api-key for details.

python ./preprocessing/arxiv_dataset/unarxiv_add_ref_info.py \
    --data_dir data/STRoGeNS-arXiv22/rw \
    --output_path data/STRoGeNS-arXiv22/rw_wabst --log_dir {logs_directory} \
    --n_dir 0 # resume from the n-th intermediate directory
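The Semantic Scholar lookup can be sketched as below. This is a simplified, hypothetical version of what unarxiv_add_ref_info.py does, using only the standard library and the public Semantic Scholar Graph API; rate limiting and error handling are omitted.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

S2_API_KEY = "{S2APIKEY}"  # your key, see semanticscholar.org/product/api#api-key
API = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_query(title: str) -> str:
    """Build a paper-search URL requesting only the fields the dataset needs."""
    params = {"query": title, "fields": "title,abstract", "limit": 1}
    return API + "?" + urllib.parse.urlencode(params)

def fetch_abstract(title: str) -> Optional[str]:
    """Return the abstract of the best-matching paper, or None if no hit."""
    req = urllib.request.Request(build_query(title), headers={"x-api-key": S2_API_KEY})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    hits = data.get("data", [])
    return hits[0].get("abstract") if hits else None
```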

Step 4. Convert the results to Hugging Face format

python ./preprocessing/arxiv_processing/unarxiv_convert2hg_format.py \
        --data_dir data/STRoGeNS-arXiv22/rw_wabst \
        --output_dir data/STRoGeNS-arXiv22/hg_format 

STRoGeNS-conf22, 23

We put examples of the dataset in the data directory (STRoGeNS-conf22 and STRoGeNS-conf23).

Step 1. Download PDFs from conference proceedings

Proceedings Python Library Requirements

  • Selenium, BeautifulSoup, slugify, jsonlines, tqdm

Script

scripts/preproc/conf_download_pdfs.sh
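The download step boils down to scraping PDF links from a proceedings page. A minimal illustration of the idea, with a hypothetical page structure; the actual scripts use Selenium and BeautifulSoup, while this sketch uses only the standard library.

```python
from html.parser import HTMLParser

class PdfLinkExtractor(HTMLParser):
    """Collect href values that point at PDF files."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith(".pdf"):
                self.links.append(href)

def extract_pdf_links(html: str) -> list:
    """Return every PDF link found in an HTML string, in document order."""
    parser = PdfLinkExtractor()
    parser.feed(html)
    return parser.links
```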

Step 2. Convert to markdown by NOUGAT

To set up the nougat environment, please refer to the nougat docs (RUN NOUGAT).

scripts/preproc/conf_nougat.sh

See NOUGAT for details.

Execute all preprocessing

scripts/preproc/conf_processing.sh

Individual pre-processing codes

Step 3. Extract titles from the parsed markdown

python ./preprocessing/conf_dataset/conf_title_ext.py --input_dir Conf{2023 or 2022} --output_path {title_extracted_data_dir}
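Title extraction can be sketched as below, assuming (as a simplification) that NOUGAT emits the paper title as the first level-1 markdown heading; the actual logic in conf_title_ext.py may handle more cases.

```python
import re

def extract_title(md_text: str):
    """Return the first level-1 heading of a NOUGAT markdown file, or None."""
    match = re.search(r"^#\s+(.+)$", md_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None
```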

Step 4. Retrieve abstracts of cited references from Semantic Scholar

Please set S2_API_KEY = "{S2APIKEY}" on line 14 of conf_add_ref_info.py.

python ./preprocessing/conf_dataset/conf_add_ref_info.py --input_dir {title_extracted_data_dir} --output_path {final_data_dir}

Run comparisons

Scripts

bash ./scripts/comparisons/{method_name}.sh

Run evaluation for metrics

# check the human evaluation results
python ./metrics/human_eval/calcurate_corr.py \
  --annot_dir data/annotations/humanevaluation \
  --llama2_autoeval outputs/llama2_eval.csv

# run novelty evaluation
python ./metrics/novelty_eval/accuracy_eval.py

Folder structure

{STRoGeNS ROOT}
├── comparisons: implementations of the comparison methods
│   ├── casual_generation: LLAMA2
│   ├── conditional_generation: BART, PEGASUS
│   ...
├── metrics: code for the evaluation metrics
├── preprocessing: code to build our dataset
└── scripts: run scripts

License

The codes are licensed under the MIT LICENSE.

Acknowledgements

Code for comparisons is based on BART, PEGASUS, LLAMA (PEFT), LexRank, TextRank, and LED.

If you find our work useful in your research, please consider citing:

@inproceedings{kazuya2024toward,
    title = "Toward Structured Related Work Generation with Novelty Statements",
    author = "Nishimura, Kazuya and Saito, Kuniaki and Hirasawa, Tosho and Ushiku, Yoshitaka",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing",
    year = "2024",
}
