Error encountered when using metricx with batch_size > 1 #2

Open

samirsalman opened this issue Feb 27, 2024 · 3 comments

@samirsalman
Issue Description:

Hi there,

I'm currently encountering an error while utilizing metricx, specifically when the batch_size parameter is set to a value greater than 1. Below, I've outlined the command I'm using along with the error message received:

Command:

```
CUDA_VISIBLE_DEVICES=2 python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-qe-xl-v2p0 \
  --max_input_length 512 \
  --batch_size 512 \
  --input_file ../testset_metrix.jsonl \
  --output_file ./metricx.jsonl \
  --qe
```

Error Message:

```
RuntimeError: stack expects each tensor to be equal size, but got [76] at entry 0 and [58] at entry 1
```
Issue Analysis:

It seems that the error occurs because the tokenized examples within a batch have different lengths, so they cannot be stacked into a single tensor when batch_size is greater than 1.
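
For illustration only, the same failure can be reproduced outside metricx with a minimal snippet (this is my reading of the traceback, not code from the repository): the default collation step stacks per-example tensors, which requires equal sizes.

```python
import torch

# Two tokenized examples of different lengths, matching the sizes in the traceback.
a = torch.zeros(76, dtype=torch.long)
b = torch.zeros(58, dtype=torch.long)

# Default batching stacks the per-example tensors, which requires equal sizes:
torch.stack([a, b])
# RuntimeError: stack expects each tensor to be equal size, but got [76] at entry 0 and [58] at entry 1
```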

Request for Assistance:

Could you please provide guidance on resolving this issue? Any suggestions or insights would be greatly appreciated.

Thank you!

@nicolasdahan

Hi,

It seems there is a padding issue caused by variations in the lengths of the input sequences within the same batch. To resolve this, all input sequences within a batch must have the same length.

In the get_dataset() function in predict.py, I've added a new function called _pad. This function fills the input sequences with padding tokens so that they all have the same length, equal to max_input_length:

```python
# These imports are needed (they may already be present in predict.py).
import datasets
import torch


def get_dataset(
    input_file: str, tokenizer, max_input_length: int, device, is_qe: bool
):
    """Gets the test dataset for prediction.

    If `is_qe` is true, the input data must have "hypothesis" and "source" fields.
    If it is false, there must be "hypothesis" and "reference" fields.

    Args:
      input_file: The path to the jsonl input file.
      tokenizer: The tokenizer to use.
      max_input_length: The maximum input sequence length.
      device: The ID of the device to put the PyTorch tensors on.
      is_qe: Indicates whether the metric is a QE metric or not.

    Returns:
      The dataset.
    """

    def _make_input(example):
        if is_qe:
            example["input"] = (
                "candidate: "
                + example["hypothesis"]
                + " source: "
                + example["source"]
            )
        else:
            example["input"] = (
                "candidate: "
                + example["hypothesis"]
                + " reference: "
                + example["reference"]
            )
        return example

    def _tokenize(example):
        return tokenizer(
            example["input"],
            max_length=max_input_length,
            truncation=True,
            padding=False,
        )

    def _remove_eos(example):
        example["input_ids"] = example["input_ids"][:-1]
        example["attention_mask"] = example["attention_mask"][:-1]
        return example

    def _pad(example):
        # Pad every example to max_input_length so that all tensors in a
        # batch have the same shape.
        input_ids = example["input_ids"]
        attention_mask = example["attention_mask"]
        padded_input_ids = input_ids + [tokenizer.pad_token_id] * (
            max_input_length - len(input_ids)
        )
        padded_attention_mask = attention_mask + [0] * (
            max_input_length - len(attention_mask)
        )
        example["input_ids"] = torch.tensor(padded_input_ids, dtype=torch.long)
        example["attention_mask"] = torch.tensor(
            padded_attention_mask, dtype=torch.long
        )
        return example

    ds = datasets.load_dataset("json", data_files={"test": input_file})
    ds = ds.map(_make_input)
    ds = ds.map(_tokenize)
    ds = ds.map(_remove_eos)
    ds = ds.map(_pad)
    ds.set_format(
        type="torch",
        columns=["input_ids", "attention_mask"],
        device=device,
        output_all_columns=True,
    )
    return ds
```

@samirsalman
Author

Hi @nicolasdahan,
thank you for your answer. However, this solution is quite inefficient: every example is padded to max_input_length regardless of its actual length. The best solution is to pad to the longest sequence in each batch (see the sketch below).
@danieldeutsch do you have any suggestions?
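
For illustration only, per-batch (dynamic) padding could be done with a custom collate function roughly like the one below; this is a sketch, and `collate_pad_to_longest` is a hypothetical helper, not part of predict.py.

```python
import torch


def collate_pad_to_longest(batch, pad_token_id):
    """Hypothetical collate_fn: pads each batch only up to its longest example."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(list(ex["input_ids"]) + [pad_token_id] * pad)
        attention_mask.append(list(ex["attention_mask"]) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
    }
```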

@rafikg

rafikg commented Jun 3, 2024

@nicolasdahan @samirsalman
You can just edit the _tokenize() function as follows to pad to max_length:


```python
def _tokenize(example):
    return tokenizer(
        example["input"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length",
    )
```

If you need to pad to the longest sentence in the batch, you need to use a DataCollatorWithPadding object as follows:

```python
from transformers import DataCollatorWithPadding


def get_dataset(
    input_file: str, tokenizer, max_input_length: int, device, is_qe: bool
):
    # .......

    def _tokenize(example):
        return tokenizer(
            example["input"],
            max_length=max_input_length,
            truncation=True,
            padding=False,
        )

    # .......

    ds.set_format(
        type="torch",
        columns=["input_ids", "attention_mask"],
        device=device,
        output_all_columns=True,
    )

    # Add data collator for batching mode
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
    return ds, data_collator


# Where the dataset and trainer are built:
ds, data_collator = get_dataset(
    args.input_file,
    tokenizer,
    args.max_input_length,
    device,
    args.qe,
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
)
```
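
With the collator in place, each batch is padded only to its own longest sequence instead of always to max_input_length, which avoids computing over long runs of padding tokens. A hedged sketch of running prediction afterwards (the exact surrounding code in predict.py may differ):

```python
# Run inference; Trainer.predict returns a PredictionOutput whose first
# field holds the model outputs for the "test" split loaded above.
predictions, _, _ = trainer.predict(test_dataset=ds["test"])
```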
