DUNKS

This repository contains the source code and data accompanying the paper "DUNKS: Chunking and Summarizing Large and Heterogeneous Web Data for Dataset Search".

With the vast influx of open datasets published on the Web, dataset search has become an established and increasingly prominent problem. Existing solutions primarily cater to data in a single format, such as tabular or RDF datasets, despite the diverse formats of Web data. In this paper, to address data heterogeneity, we propose to transform major data formats into unified data chunks, each consisting of triples describing an entity. Furthermore, to fit data chunks into the limited input capacity of dense ranking models based on pre-trained language models, we devise a multi-chunk summarization method that extracts representative triples from representative chunks. We conduct experiments on two test collections for ad hoc dataset retrieval, and the results demonstrate the effectiveness of dense ranking over summarized data chunks.


Requirements

This code requires Python 3.9+. A partial list of the required packages is as follows.

  • beautifulsoup4
  • camelot_py
  • contractions
  • pikepdf
  • python_docx
  • python_magic
  • rdflib
  • tika
  • xmltodict
  • flag-embedding
  • torch
  • transformers
  • ranx
The full dependency list is provided in requirements.txt and can be installed with:

pip install -r requirements.txt

Unified Data Chunking

python ./code/unified-data-chunking/graph_builder.py [-i|-p] <input_file|input_path> -o <output_path>
  • [-i|--input_file]: path to a single file

  • [-p|--input_path]: path to the input folder

  • [-o|--output_path]: path to the output folder

Note: only one of -i and -p can be used.

The structure of the input folder:

    ./input_folder
    |--dataset1
        |--file1.json
        |--file2.csv
    |--dataset2
        |--file1.json
        |--file2.csv
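
For example, assuming the folder layout above, all datasets under the input folder can be chunked in one run (the paths here are illustrative):

    python ./code/unified-data-chunking/graph_builder.py -p ./input_folder -o ./output_folder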

The input dataset can contain multiple heterogeneous data files. Currently supported data formats include:

  • .txt, .pdf, .html, .doc, .docx
  • .csv, .xls, .xlsx
  • .json, .xml
  • .rdf, .nt, .owl

The generated files in the output folder:

    ./output_folder
    |--term.tsv
    |--text.tsv
    |--triple.tsv

The structure of the output files is as follows:

  • term.tsv: dataset_id\tterm_id\tterm_text
  • text.tsv: dataset_id\tpassage_id\tpassage_text
  • triple.tsv: dataset_id\tsubject_id\tpredicate_id\tobject_id
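
As a minimal sketch of how these files can be consumed, the following Python snippet loads term.tsv and prints human-readable triples from triple.tsv. It assumes the files have no header row and that the subject, predicate, and object ids in triple.tsv refer to term ids of the same dataset in term.tsv; the file paths are illustrative.

    from collections import defaultdict

    def load_terms(path):
        # dataset_id -> term_id -> term_text
        terms = defaultdict(dict)
        with open(path, encoding='utf-8') as f:
            for line in f:
                dataset_id, term_id, term_text = line.rstrip('\n').split('\t', 2)
                terms[dataset_id][term_id] = term_text
        return terms

    terms = load_terms('./output_folder/term.tsv')
    with open('./output_folder/triple.tsv', encoding='utf-8') as f:
        for line in f:
            dataset_id, s, p, o = line.rstrip('\n').split('\t')
            lookup = terms[dataset_id]
            print(dataset_id, lookup.get(s, s), lookup.get(p, p), lookup.get(o, o))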

Multi-Chunk Summarization

python ./code/multi-chunk-summarization/summary_generator.py -i <input_path> -o <output_path> -n <chunk_num> -k <chunk_size>
  • [-i|--input_path]: path to the input folder, typically the output folder of the previous step

  • [-o|--output_path]: path to the output folder

  • [-n|--chunk_num]: the maximum number of chunks retained in the summary

  • [-k|--chunk_size]: the maximum number of triples in a summarized chunk

The structure of the input folder:

    ./input_folder
    |--term.tsv
    |--triple.tsv
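
For example, to summarize the chunks produced by the previous step into at most 10 chunks of at most 20 triples each (the values and paths are illustrative):

    python ./code/multi-chunk-summarization/summary_generator.py -i ./output_folder -o ./summary_folder -n 10 -k 20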

The generated files in the output folder:

    ./output_folder
    |--summary.tsv

The structure of the output file is as follows:

  • summary.tsv: dataset_id\tchunk_id\tsubject_id\tpredicate_id\tobject_id
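
As a minimal sketch, the following snippet groups the summarized triples by chunk; as above, it assumes no header row, and the ids can be resolved against term.tsv in the same way as for triple.tsv (the path is illustrative).

    from collections import defaultdict

    # (dataset_id, chunk_id) -> list of (subject_id, predicate_id, object_id)
    chunks = defaultdict(list)
    with open('./summary_folder/summary.tsv', encoding='utf-8') as f:
        for line in f:
            dataset_id, chunk_id, s, p, o = line.rstrip('\n').split('\t')
            chunks[(dataset_id, chunk_id)].append((s, p, o))

    for (dataset_id, chunk_id), triples in chunks.items():
        print(dataset_id, chunk_id, len(triples), 'triples')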

Chunk-based Dataset Reranking

We implement monoBERT, BGE, and BGE-reranker as dense reranking models; see the code in ./code/chunk-based-dataset-reranking/ for details. We use ranx to normalize and fuse the metadata-based and data-based relevance scores.
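
A minimal sketch of score normalization and fusion with ranx is shown below; the run file names and fusion weights are illustrative placeholders, not the settings used in the paper.

    from ranx import Run, fuse

    # One TREC run with metadata-based scores and one with data-based scores
    metadata_run = Run.from_file('metadata_run.txt', kind='trec')
    data_run = Run.from_file('data_run.txt', kind='trec')

    fused = fuse(
        runs=[metadata_run, data_run],
        norm='min-max',                  # per-query min-max score normalization
        method='wsum',                   # weighted sum of normalized scores
        params={'weights': [0.5, 0.5]},  # illustrative weights
    )
    fused.save('fused_run.txt')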


Evaluation

All results of the reranking experiments on the NTCIR-E and ACORDAR test collections are provided in ./data/results in TREC run format, for example:

1 Q0 32907 1 1.3371933160530833 mixed
1 Q0 31665 2 1.2344413177975981 mixed
1 Q0 1670 3 0.816091260131519 mixed
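
A run file in this format can be scored with ranx, as in the minimal sketch below; the qrels path is a placeholder for the relevance judgments of the corresponding test collection, and the run file name and metrics are illustrative.

    from ranx import Qrels, Run, evaluate

    qrels = Qrels.from_file('qrels.txt', kind='trec')           # placeholder path
    run = Run.from_file('./data/results/run.txt', kind='trec')  # placeholder file name

    print(evaluate(qrels, run, ['ndcg@10', 'map']))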

Citation