Kazuya Nishimura, Kuniaki Saito, Tosho Hirasawa, Yoshitaka Ushiku
- Code to process STRoGeNS arXiv
- Code to process conference proceedings
- Code for comparisons
- Code for metrics evaluation
About dataset
Dataset | # Pairs | Input Words | Output Words | # Input Docs | # Paragraphs
---|---|---|---|---|---
Multi-XScience | 40,528 | 778.1 | 116.4 | 4.4 | 1 |
S2ORC | 136,655 | 1067.4 | 148.7 | 5.0 | 1 |
Delve | 78,927 | 622.6 | 228.6 | 3.7 | 1 |
TAS2 | 117,700 | 1036.0 | 134.8 | 4.8 | 1 |
TAD | 218,255 | 1071.4 | 162.3 | 5.2 | 1 |
BigSurvey-MDS | 4,478 | 11,893.1 | 1,051.7 | 76.3 | 1 |
SciReviewGen | 10,130 | 11,734.4 | 7,193.8 | 68.1 | 1 |
STRoGeNS-arXiv22 (Ours) | 85,853 | 3,046.2 | 514.3 | 16.6 | 4.22 |
STRoGeNS-conf22 (Ours) | 15,079 | 3,669.1 | 508.5 | 20.4 | 4.27 |
STRoGeNS-conf23 (Ours) | 4,762 | 4,836.6 | 504.6 | 25.7 | 4.04 |
Data format
{
  "title": "{title}",
  "abstract": "{abst}",
  "related_work": "{related work}",
  "cited": {
    "[1]": {
      "title": "{ref_1:title}",
      "abstract": "{ref_1:abst}"
    },
    ...,
    "[N]": {
      "title": "{ref_N:title}",
      "abstract": "{ref_N:abst}"
    }
  }
}
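For reference, a record in this format can be parsed as follows. This is a minimal sketch: the field values and the in-memory `record` are placeholders we made up, not data from the release.

```python
import json

# One record following the data format above (all values are placeholders).
record = {
    "title": "Example Paper",
    "abstract": "An example abstract.",
    "related_work": "Prior work [1] studied ...",
    "cited": {
        "[1]": {"title": "Cited Paper 1", "abstract": "Abstract of [1]."},
    },
}

# Round-trip through a JSON line, then look up each cited reference.
line = json.dumps(record)
parsed = json.loads(line)
for key, ref in parsed["cited"].items():
    print(key, ref["title"])
```

The keys of `cited` ("[1]" ... "[N]") match the citation markers that appear in the `related_work` text.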
Coming soon
Requirements
- NVIDIA Driver >= 535
- LLAMA2 permission
  Please refer to https://huggingface.co/meta-llama/Llama-2-7b
- API key for Semantic Scholar
  See https://www.semanticscholar.org/product/api#api-key for details.
  *Please replace {S2APIKEY} with your token in line 14 of unarxiv_add_ref_info.py and line 14 of conf_add_ref_info.py.
- API key for GPT
  See https://openai.com/index/openai-api/ for details.
  *Please replace {GPT_API_KEY} with your token in line 12 of accuracy_eval.py, line 12 of novelty_eval.py, and line 18 of gpt_estimation.py.
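Instead of editing each file by hand, the placeholders can be patched with a small script. This is a convenience sketch of our own, not part of the repository; the helper name and the use of an environment variable are our assumptions.

```python
import os
from pathlib import Path

def set_api_key(path: str, placeholder: str, env_var: str) -> None:
    """Replace `placeholder` in the file at `path` with the value of `env_var`.

    Raises KeyError if the environment variable is unset.
    """
    key = os.environ[env_var]
    text = Path(path).read_text()
    Path(path).write_text(text.replace(placeholder, key))

# Example usage (file paths as in the instructions above):
# set_api_key("preprocessing/arxiv_dataset/unarxiv_add_ref_info.py",
#             "{S2APIKEY}", "S2_API_KEY")
```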
git clone https://github.com/omron-sinicx/STRoGeNS.git
cd STRoGeNS
bash ./docker/build.sh
bash ./docker/run.sh
bash ./docker/exec.sh
huggingface-cli login
*Our Docker environment does not include nougat. To use nougat, please follow the nougat documentation.
Unfortunately, due to licensing issues, we cannot release all of the datasets we created. We release the subset of STRoGeNS-arXiv22 consisting of articles with an open license, together with the processing code needed to recreate the full datasets.
STRoGeNS-arXiv22
See the download instructions at https://github.com/IllDepence/unarXive
Decompress the xz file with the following command.
tar -xvf unarXive_230324_open_subset.tar.xz
Put the extracted files in ./data/STRoGeNS-arXiv22/rawdata
Folder Structure
data/STRoGeNS-arXiv22
├── rawdata
│   ├── 00
│   │   ├── arXiv_src_0005_001.jsonl
│   │   ...
│   ├── 01
│   ...
└── rw
scripts/preproc/arxiv2022_processing.sh
Individual pre-processing steps
Execute each step in order.
python ./preprocessing/arxiv_dataset/unarxiv_title_ext.py \
--data_dir data/STRoGeNS-arXiv22/rawdata \
--output_path data/STRoGeNS-arXiv22/rw --log_dir {logs_directory}\
    --n_dir 0 # resume from the n-th intermediate directory
Please set S2_API_KEY = "{S2APIKEY}" in line 14 of unarxiv_add_ref_info.py, replacing the placeholder with your token. See https://www.semanticscholar.org/product/api#api-key for details.
python ./preprocessing/arxiv_dataset/unarxiv_add_ref_info.py \
--data_dir data/STRoGeNS-arXiv22/rw \
--output_path data/STRoGeNS-arXiv22/rw_wabst --log_dir {logs_directory}\
    --n_dir 0 # resume from the n-th intermediate directory
python ./preprocessing/arxiv_processing/unarxiv_convert2hg_format.py \
--data_dir data/STRoGeNS-arXiv22/rw_wabst \
--output_dir data/STRoGeNS-arXiv22/hg_format
STRoGeNS-conf22, 23
We provide dataset examples in the data directory (STRoGeNS-conf22 and STRoGeNS-conf23).
Proceedings Python Library Requirements
- Selenium, BeautifulSoup, slugify, jsonlines, tqdm
Script
scripts/preproc/conf_download_pdfs.sh
To set up the nougat environment, please refer to the nougat documentation: RUN NOUGAT
scripts/preproc/conf_nougat.sh
See NOUGAT for details.
scripts/preproc/conf_processing.sh
Individual pre-processing steps
python ./preprocessing/conf_dataset/conf_title_ext.py --input_dir Conf{2023 or 2022} --output_path {title_extracted_data_dir}
Please set S2_API_KEY = "{S2APIKEY}" in line 14 of conf_add_ref_info.py.
python ./preprocessing/conf_dataset/conf_add_ref_info.py --input_dir {title_extracted_data_dir} --output_path {final_data_dir}
bash ./scripts/comparisons/{method_name}.sh
# check human evaluation results
python ./metrics/human_eval/calcurate_corr.py \
--annot_dir data/annotations/humanevaluation\
--llama2_autoeval outputs/llama2_eval.csv
# run novelty evaluation
python ./metrics/novelty_eval/accuracy_eval.py
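As background on what calcurate_corr.py reports, agreement between human annotations and automatic scores is typically measured with a rank correlation. Below is a minimal pure-Python Spearman sketch; the score lists are illustrative, and Spearman itself is our assumption here (the script may use a different coefficient).

```python
def rank(values):
    # Average ranks (1-based); tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation computed on the ranks.
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [4, 2, 5, 3, 1]          # e.g. human ratings (illustrative)
auto = [3.8, 2.1, 4.9, 3.0, 1.2]  # e.g. automatic scores (illustrative)
print(spearman(human, auto))      # rank-consistent inputs give rho ~ 1.0
```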
{STRoGENS ROOT}
|- comparisons: implementations of comparison methods
|  |- casual_generation: LLAMA2
|  |- conditional_generation: BART, PEGASUS
|  ...
|- metrics: code for evaluation metrics
|- preprocessing: code to build our dataset
└── scripts: run scripts
The codes are licensed under the MIT LICENSE.
Code for comparisons is based on BART, PEGASUS, LLAMA (Peft), LexRank, TextRank, LED.
If you find our work useful in your research, please consider citing our work:
@inproceedings{kazuya2024toward,
    title = "Toward Structured Related Work Generation with Novelty Statements",
    author = "Nishimura, Kazuya and Saito, Kuniaki and Hirasawa, Tosho and Ushiku, Yoshitaka",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing",
    year = "2024",
}