DSEBench

DSEBench is a test collection for Dataset Search with Examples (DSE), a task aimed at returning top-ranked candidate datasets based on both a textual query and a set of target datasets that clarify the user's search intent. Unlike traditional dataset retrieval tasks that rely solely on a query, DSE incorporates example datasets to improve search relevance. We conduct experiments on both the DSE task and field-level relevance explanation. For further details about this test collection, please refer to our paper.

Datasets

We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file (available at Zenodo) provides the id, title, description, tags, author, and summary of each dataset in JSON format.

{ 
  "id": "0000de36-24e5-42c1-959d-2772a3c747e7", 
  "title": "Montezuma National Wildlife Refuge: January - April, 1943", 
  "description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...", 
  "tags": ["annual-narrative", "behavior", "populations"], 
  "author": "Fish and Wildlife Service", 
  "summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
}

Below is an example of how to load and use the datasets.json file:

import json

# Load the dataset file
with open('datasets.json', 'r') as f:
    datasets_data = json.load(f)
    
    # Iterate through each dataset
    for dataset in datasets_data:
        dataset_id = dataset['id']  # Get the dataset ID
        title = dataset['title']  # Get the title
        
        # Other code to process the dataset metadata...
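
The metadata fields can also be combined into a single text string per dataset, e.g., as input to a retrieval model. Below is a minimal sketch of this idea (illustrative only; the exact field combination used in our experiments is described in the paper and code):

import json

# Load the dataset file
with open('datasets.json', 'r') as f:
    datasets_data = json.load(f)

# Concatenate the metadata fields of each dataset into one text string
# (illustrative only; the exact fields and their treatment may differ
# from the configuration used in the paper)
dataset_texts = {}
for dataset in datasets_data:
    parts = [
        dataset.get('title', ''),
        dataset.get('description', ''),
        ' '.join(dataset.get('tags', [])),
        dataset.get('author', ''),
        dataset.get('summary', ''),
    ]
    dataset_texts[dataset['id']] = ' '.join(p for p in parts if p)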

Queries

The "./Data/queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id and query_text. The queries can be divided into two categories: generated queries, created from the metadata of datasets, and NTCIR queries, imported from the English part of the NTCIR dataset. Queries with IDs starting with "GEN_" are generated queries, while those starting with "NTCIR" are NTCIR queries.

Below is an example of how to load and use the ./Data/queries.tsv file:

# Load the queries file
with open('Data/queries.tsv', 'r') as f:
    # Iterate through each line
    for line in f:
        query_id, query_text = line.rstrip('\n').split('\t')  # Get the query ID and the query text

        # Other code to process the data...

Test and Training Cases

In DSEBench, each input consists of a case, which includes a query and a set of target datasets that are known to be relevant to the query. The "./Data/cases.tsv" file provides 141 test cases and 5,699 training cases. Each row represents a case with three "\t"-separated columns: case_id, query_id, and target_dataset_id.

Test cases are identified by a purely numeric case_id and are adapted from highly relevant query-dataset pairs in the NTCIR dataset. The remaining cases are training cases: those with a case_id starting with l1_ are adapted from partially relevant query-dataset pairs from NTCIR, while those starting with gen_ are synthetic training cases whose queries are generated queries.

Below is an example of how to load and use the ./Data/cases.tsv file:

# Load the cases file
with open('Data/cases.tsv', 'r') as f:
    # Iterate through each line
    for line in f:
        case_id, query_id, target_dataset_id = line.rstrip('\n').split('\t')  # Get the case ID, the query ID, and the target dataset ID

        # Other code to process the data...
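
Because the case type is encoded in the case_id, the cases can also be partitioned programmatically. Below is a minimal sketch based only on the ID conventions described above:

# Partition cases by the case_id conventions described above:
# purely numeric IDs are test cases, IDs starting with "l1_" are
# NTCIR-derived training cases, and IDs starting with "gen_" are
# synthetic training cases
test_cases, l1_training_cases, gen_training_cases = [], [], []

with open('Data/cases.tsv', 'r') as f:
    for line in f:
        case_id, query_id, target_dataset_id = line.rstrip('\n').split('\t')
        if case_id.isdigit():
            test_cases.append(case_id)
        elif case_id.startswith('l1_'):
            l1_training_cases.append(case_id)
        elif case_id.startswith('gen_'):
            gen_training_cases.append(case_id)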

Relevance Judgments

The "./Data/human_annotated_judgments.json" file contains 7,415 human-annotated judgments, and the "./Data/llm_annotated_judgments.json" file contains 122,585 judgments generated by a large language model (LLM). Each JSON object has eight keys: query_id, target_dataset_id, candidate_dataset_id, case_id (the ID of the input), query_rel (relevance of the candidate dataset to the query, 0: irrelevant; 1: partially relevant; 2: highly relevant), field_query_rel, target_sim (similarity of the candidate dataset to the target datasets, 0: dissimilar; 1: partially similar; 2: highly similar), and field_target_sim. The field_query_rel and field_target_sim are both lists of length 5 consisting of 0 and 1, corresponding to the fields [title, description, tags, author, summary].

{
    "query_id": "NTCIR_200000", 
    "target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84", 
    "candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9", 
    "case_id": "1", 
    "query_rel": 1, 
    "field_query_rel": [1, 1, 1, 0, 0], 
    "target_sim": 2, 
    "field_target_sim": [1, 1, 1, 1, 1]
}

Below is an example of how to load and use the ./Data/human_annotated_judgments.json file:

import json

# Load the judgments file
with open('Data/human_annotated_judgments.json', 'r') as f:
    judgments_data = json.load(f)
    
    # Iterate through each judgment
    for judgment in judgments_data:
        case_id = judgment['case_id']  # Get the case ID
        candidate_dataset_id = judgment['candidate_dataset_id']  # Get the candidate dataset ID
        query_rel = judgment['query_rel']  # Get the query relevance score
        field_query_rel = judgment['field_query_rel']  # Get the field-level query relevance scores (title, description, tags, author, summary)
        
        # Other code to process the judgment data...
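
For convenience, the judgments can be grouped by case, e.g., to look up the labels of a candidate dataset within a case. Below is a minimal sketch (a convenience structure, not a prescribed format):

import json
from collections import defaultdict

# Group the judgments by case, mapping each candidate dataset to its
# query relevance and target similarity labels
with open('Data/human_annotated_judgments.json', 'r') as f:
    judgments_data = json.load(f)

judgments_by_case = defaultdict(dict)
for judgment in judgments_data:
    judgments_by_case[judgment['case_id']][judgment['candidate_dataset_id']] = {
        'query_rel': judgment['query_rel'],
        'target_sim': judgment['target_sim'],
    }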

Splits for Training, Validation, and Test Sets

To ensure that evaluation results are comparable, we provide predefined train-validation-test splits. The "./Data/Splits/5-Fold_split" folder contains five sub-folders, each providing three qrel files for training, validation, and test sets. The "./Data/Splits/Annotators_split" folder contains three qrel files for the training, validation, and test sets as well.

These files are used in the same way as the relevance judgment files above.
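
For example, the qrel files of one fold can be loaded like the relevance judgment files. In the sketch below, the sub-folder and file names are placeholders; use the actual names found in the split folders:

import json
import os

# Load the three qrel files of one fold; both the sub-folder name and the
# file names below are placeholders -- use the actual names found in
# ./Data/Splits/5-Fold_split
split_dir = 'Data/Splits/5-Fold_split/fold_1'  # placeholder sub-folder name
splits = {}
for split_name in ('train', 'valid', 'test'):
    path = os.path.join(split_dir, f'{split_name}.json')  # placeholder file name
    with open(path, 'r') as f:
        splits[split_name] = json.load(f)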

Baselines for DSE

For retrieval, we evaluated two sparse retrieval models: (1) TF-IDF-based cosine similarity and (2) BM25, as well as five dense retrieval models: (3) BGE, (4) GTE, (5) contextualized late interaction over BERT (ColBERTv2), (6) coCondenser, and (7) Dense Passage Retrieval (DPR). For reranking, we evaluated four models: (1) Stella, (2) SFR-Embedding-Mistral, (3) BGE-reranker, and (4) GLM-4-Plus.

The details of the experiments are given in the corresponding section of our paper.

The "./Baselines/Retrieval" folder and the "./Baselines/Reranking" folder contain the results for each baseline method, formatted as: {case_id: {candidate_dataset_id: score, ...}, ...}.

Below is an example of how to load and process the retrieval results from the ./Baselines/Retrieval/BM25_results.json file:

import json

# Load the BM25 retrieval result file
with open('Baselines/Retrieval/BM25_results.json', 'r') as f:
    results = json.load(f)
    
    # Iterate through the results for each case
    for case_id, ranking_dict in results.items():
        for candidate_dataset_id, ranking_score in ranking_dict.items():
            print(f"Case ID: {case_id}, Candidate Dataset ID: {candidate_dataset_id}, Score: {ranking_score}")

            # Additional processing for the results...
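
Because each case maps candidate datasets to scores, a ranked list can be obtained by sorting the scores. Below is a minimal sketch:

import json

# Convert the per-case score dictionaries into ranked candidate lists
with open('Baselines/Retrieval/BM25_results.json', 'r') as f:
    results = json.load(f)

top_k = 10
rankings = {}
for case_id, ranking_dict in results.items():
    ranked = sorted(ranking_dict.items(), key=lambda item: item[1], reverse=True)
    rankings[case_id] = [dataset_id for dataset_id, score in ranked[:top_k]]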

Baselines for Explanation

We employed post-hoc explanation methods to identify which fields of the candidate dataset are relevant to the query or similar to the target datasets. We evaluated four explainers, measured by F1-score: (1) Feature Ablation, (2) LIME, (3) SHAP, and (4) an LLM explainer. The first three methods are combined with retrieval models, while the LLM explainer operates independently.

The results for each explainer are stored in the "./Baselines/Explanation" folder, formatted as: {case_id: {candidate_dataset_id: {explanation_type: [0,1,1,0,0], ...}, ...}, ...}.

Details of these experiments are provided in the corresponding sections of our paper.

Below is an example of how to load and process the explanation results from the ./Baselines/Explanation/FeatureAblation/BM25_result.json file:

import json

fields = ['title', 'description', 'tags', 'author', 'summary']

# Load the Feature Ablation explainer combined with BM25 retriever results
with open('Baselines/Explanation/FeatureAblation/BM25_result.json', 'r') as f:
    results = json.load(f)
    
    # Iterate through the results for each case
    for case_id, explanation_dict in results.items():
        for candidate_dataset_id, sub_explanation_dict in explanation_dict.items():
            print(f"Case ID: {case_id}, Candidate Dataset ID: {candidate_dataset_id}")

            # Display field-level query relevance
            print("Field-level Query Relevance:")
            field_level_query_rel = sub_explanation_dict['query']
            for field_idx, field in enumerate(fields):
                if field_level_query_rel[field_idx] == 1:
                    print(f"{field} field is relevant to the query")
                elif field_level_query_rel[field_idx] == 0:
                    print(f"{field} field is irrelevant to the query")

            # Display field-level target similarity
            print("Field-level Target Similarity:")
            field_level_target_sim = sub_explanation_dict['dataset']
            # Additional processing for target similarity...
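
As a rough illustration of the F1-score evaluation mentioned above, the sketch below compares an explainer's field-level target-similarity vectors against the human-annotated field_target_sim labels using scikit-learn; the exact evaluation protocol follows the paper:

import json
from sklearn.metrics import f1_score

# Human field-level labels, keyed by (case_id, candidate_dataset_id)
with open('Data/human_annotated_judgments.json', 'r') as f:
    judgments = json.load(f)
gold = {(j['case_id'], j['candidate_dataset_id']): j['field_target_sim'] for j in judgments}

# Explainer predictions for the 'dataset' (target similarity) explanation type
with open('Baselines/Explanation/FeatureAblation/BM25_result.json', 'r') as f:
    results = json.load(f)

y_true, y_pred = [], []
for case_id, explanation_dict in results.items():
    for candidate_dataset_id, sub_explanation_dict in explanation_dict.items():
        key = (case_id, candidate_dataset_id)
        if key in gold:
            y_true.extend(gold[key])
            y_pred.extend(sub_explanation_dict['dataset'])

print('Field-level F1 (target similarity):', f1_score(y_true, y_pred))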

Source Code

All implementation source code is available in the ./Code directory.

Dependencies

To run the code, ensure you have the following dependencies installed:

  • Python 3.9
  • rank-bm25
  • scikit-learn
  • sentence-transformers
  • faiss-gpu
  • ragatouille
  • tevatron
  • torch
  • shap
  • lime
  • zhipuai
  • FlagEmbedding

Retrieval Models

Detailed documentation and code examples for retrieval models are provided in the ./Code/Retrieval/README.md.

The retrieval models include:

  • Sparse Retrieval Models:
    • BM25
    • TF-IDF
  • Dense Retrieval Models:
    • Unsupervised Dense Retrieval Models:
      • BGE (bge-large-en-v1.5)
      • GTE (gte-large)
    • Supervised Dense Retrieval Models:
      • coCondenser
      • ColBERTv2
      • DPR

Reranking Models

Documentation and code examples for reranking models are provided in the ./Code/Rerank/README.md.

The reranking models include:

  • Stella
  • SFR-Embedding-Mistral
  • BGE-reranker
  • LLM

Explanation Methods

Documentation and code examples for explanation methods are provided in the ./Code/Explanation/README.md.

The explanation methods include:

  • Feature Ablation
  • LIME
  • SHAP
  • LLM

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
