Error encountered when using metricx with batch_size > 1 #2

Open

samirsalman opened this issue Feb 27, 2024 · 3 comments

@samirsalman
Issue Description:

Hi there,

I'm currently encountering an error while utilizing metricx, specifically when the batch_size parameter is set to a value greater than 1. Below, I've outlined the command I'm using along with the error message received:

Command:

```
CUDA_VISIBLE_DEVICES=2 python -m metricx23.predict \
  --tokenizer google/mt5-xl \
  --model_name_or_path google/metricx-23-qe-xl-v2p0 \
  --max_input_length 512 \
  --batch_size 512 \
  --input_file ../testset_metrix.jsonl \
  --output_file ./metricx.jsonl \
  --qe
```

Error Message:

```
RuntimeError: stack expects each tensor to be equal size, but got [76] at entry 0 and [58] at entry 1
```
Issue Analysis:

It seems that the error occurs because the tokenized examples within a batch have different lengths, so they cannot be stacked into a single tensor when batch_size is greater than 1.
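
For illustration only, the same failure can be reproduced outside metricx with a minimal snippet (this is my reading of the traceback, not code from the repository): the default collation step stacks per-example tensors, which requires equal sizes.

```python
import torch

# Two tokenized examples of different lengths, matching the sizes in the traceback.
a = torch.zeros(76, dtype=torch.long)
b = torch.zeros(58, dtype=torch.long)

# Default batching stacks the per-example tensors, which requires equal sizes:
torch.stack([a, b])
# RuntimeError: stack expects each tensor to be equal size, but got [76] at entry 0 and [58] at entry 1
```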

Request for Assistance:

Could you please provide guidance on resolving this issue? Any suggestions or insights would be greatly appreciated.

Thank you!

@nicolasdahan

Hi,

It seems there is a padding issue caused by variations in the lengths of the input sequences within the same batch. To resolve this, all input sequences within a batch must have the same length.

In the get_dataset() function in predict.py, I've added a new function called _pad. This function fills the input sequences with padding tokens so that they all have the same length, equal to max_input_length:

```python
# These imports are needed (they may already be present in predict.py).
import datasets
import torch


def get_dataset(
    input_file: str, tokenizer, max_input_length: int, device, is_qe: bool
):
    """Gets the test dataset for prediction.

    If `is_qe` is true, the input data must have "hypothesis" and "source" fields.
    If it is false, there must be "hypothesis" and "reference" fields.

    Args:
      input_file: The path to the jsonl input file.
      tokenizer: The tokenizer to use.
      max_input_length: The maximum input sequence length.
      device: The ID of the device to put the PyTorch tensors on.
      is_qe: Indicates whether the metric is a QE metric or not.

    Returns:
      The dataset.
    """

    def _make_input(example):
        if is_qe:
            example["input"] = (
                "candidate: "
                + example["hypothesis"]
                + " source: "
                + example["source"]
            )
        else:
            example["input"] = (
                "candidate: "
                + example["hypothesis"]
                + " reference: "
                + example["reference"]
            )
        return example

    def _tokenize(example):
        return tokenizer(
            example["input"],
            max_length=max_input_length,
            truncation=True,
            padding=False,
        )

    def _remove_eos(example):
        example["input_ids"] = example["input_ids"][:-1]
        example["attention_mask"] = example["attention_mask"][:-1]
        return example

    def _pad(example):
        # Pad every example to max_input_length so that all tensors in a
        # batch have the same shape.
        input_ids = example["input_ids"]
        attention_mask = example["attention_mask"]
        padded_input_ids = input_ids + [tokenizer.pad_token_id] * (
            max_input_length - len(input_ids)
        )
        padded_attention_mask = attention_mask + [0] * (
            max_input_length - len(attention_mask)
        )
        example["input_ids"] = torch.tensor(padded_input_ids, dtype=torch.long)
        example["attention_mask"] = torch.tensor(
            padded_attention_mask, dtype=torch.long
        )
        return example

    ds = datasets.load_dataset("json", data_files={"test": input_file})
    ds = ds.map(_make_input)
    ds = ds.map(_tokenize)
    ds = ds.map(_remove_eos)
    ds = ds.map(_pad)
    ds.set_format(
        type="torch",
        columns=["input_ids", "attention_mask"],
        device=device,
        output_all_columns=True,
    )
    return ds
```

@samirsalman
Author

Hi @nicolasdahan,
thank you for your answer. However, this solution is quite inefficient: every example is padded to max_input_length regardless of its actual length. The best solution is to pad to the longest sequence in each batch (see the sketch below).
@danieldeutsch do you have any suggestions?
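
For illustration only, per-batch (dynamic) padding could be done with a custom collate function roughly like the one below; this is a sketch, and `collate_pad_to_longest` is a hypothetical helper, not part of predict.py.

```python
import torch


def collate_pad_to_longest(batch, pad_token_id):
    """Hypothetical collate_fn: pads each batch only up to its longest example."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(list(ex["input_ids"]) + [pad_token_id] * pad)
        attention_mask.append(list(ex["attention_mask"]) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
    }
```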

@rafikg

rafikg commented Jun 3, 2024

@nicolasdahan @samirsalman
You can just edit the _tokenize() function as follows to pad to max_length:


```python
def _tokenize(example):
    return tokenizer(
        example["input"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length",
    )
```

If you need to pad to the longest sentence in the batch, you need to use a DataCollatorWithPadding object as follows:

```python
from transformers import DataCollatorWithPadding


def get_dataset(
    input_file: str, tokenizer, max_input_length: int, device, is_qe: bool
):
    # .......

    def _tokenize(example):
        return tokenizer(
            example["input"],
            max_length=max_input_length,
            truncation=True,
            padding=False,
        )

    # .......

    ds.set_format(
        type="torch",
        columns=["input_ids", "attention_mask"],
        device=device,
        output_all_columns=True,
    )

    # Add data collator for batching mode
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
    return ds, data_collator


# Where the dataset and trainer are built:
ds, data_collator = get_dataset(
    args.input_file,
    tokenizer,
    args.max_input_length,
    device,
    args.qe,
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
)
```
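
With the collator in place, each batch is padded only to its own longest sequence instead of always to max_input_length, which avoids computing over long runs of padding tokens. A hedged sketch of running prediction afterwards (the exact surrounding code in predict.py may differ):

```python
# Run inference; Trainer.predict returns a PredictionOutput whose first
# field holds the model outputs for the "test" split loaded above.
predictions, _, _ = trainer.predict(test_dataset=ds["test"])
```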
