
Consistent batch size #50

Open
kylebgorman opened this issue Feb 8, 2023 · 9 comments
Labels: enhancement (New feature or request)

@kylebgorman
Contributor

I have a need (discussed off-thread) to make it so that every batch is the same size. This seems to me to require the following:

  1. datasets know the length of the longest source, features (if present), and target (an @property is appropriate here)
  2. there is a flag (call it --pad_to_max or something) which, when enabled, causes source, features, and target to be padded to the corresponding lengths from (1); each length just needs to be passed as pad_len to the batches.PaddedTensor constructor (see the sketch below)

I will put this aside until #40 is completed, however, since this may interact in weird ways with that.
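
Roughly, something like the sketch below: a max-length @property on the dataset, and a padding helper that takes an explicit pad_len. The names and the PaddedTensor stand-in are illustrative, not the existing API:

```python
# Illustrative sketch only; names (max_source_length, pad_tensor, etc.) are
# hypothetical and do not correspond to the actual yoyodyne API.
from typing import List, Optional

import torch


class Dataset:
    """Holds pre-encoded samples as lists of symbol indices."""

    def __init__(self, sources: List[List[int]], targets: List[List[int]]):
        self.sources = sources
        self.targets = targets

    @property
    def max_source_length(self) -> int:
        # Length of the longest source string in the dataset.
        return max(len(source) for source in self.sources)

    @property
    def max_target_length(self) -> int:
        # Length of the longest target string in the dataset.
        return max(len(target) for target in self.targets)


def pad_tensor(seq: List[int], pad_len: Optional[int], pad_idx: int = 0) -> torch.Tensor:
    # Stand-in for the batches.PaddedTensor call: pads to pad_len if given
    # (assuming pad_len >= len(seq)), otherwise to the sequence's own length.
    length = pad_len if pad_len is not None else len(seq)
    out = torch.full((length,), pad_idx, dtype=torch.long)
    out[: len(seq)] = torch.tensor(seq, dtype=torch.long)
    return out
```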

@kylebgorman kylebgorman self-assigned this Feb 8, 2023
@kylebgorman kylebgorman added the enhancement New feature or request label Feb 8, 2023
@Adamits
Collaborator

Adamits commented Feb 9, 2023

We could consider following Hugging Face here? Though in our case we would not default to no padding, and I think padding=True is pretty terrible since all the other potential arguments are strings; we should add another string argument instead (or simply leave that behavior as the default).

https://huggingface.co/docs/transformers/pad_truncation

@kylebgorman
Contributor Author

kylebgorman commented Feb 9, 2023

What specifically in that doc do you want to follow? My thoughts:

  • I never want to truncate the source string.
  • I never want to truncate the target string during training; if I have one that is too long, I want to filter it out using simple UNIX tools instead.
  • I am fine with incremental decoding having an upper bound on target length that matches the one used in training; if that results in truncated predictions, so be it.

@Adamits
Collaborator

Adamits commented Feb 9, 2023

Sorry, I thought I had linked to a specific table:

| Truncation    | Padding                           | Instruction |
|---------------|-----------------------------------|-------------|
| no truncation | no padding                        | tokenizer(batch_sentences) |
|               | padding to max sequence in batch  | tokenizer(batch_sentences, padding=True) or tokenizer(batch_sentences, padding='longest') |
|               | padding to max model input length | tokenizer(batch_sentences, padding='max_length') |
|               | padding to specific length        | tokenizer(batch_sentences, padding='max_length', max_length=42) |

Basically, just that we can have a padding argument that accepts strings like longest, max_length, etc., plus an optional max_length argument. Looking more closely and thinking about it, though, most of these options are not really useful for us; I think we just want something like the longest and max_length options. Anyway, I think max_batch_length and max_dataset_length or something similar would be clearer. So maybe ignore HF :).
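
For concreteness, a string-valued padding argument like that might dispatch as follows (purely hypothetical; none of these names exist in the codebase):

```python
# Hypothetical sketch of a string-valued padding policy; not actual yoyodyne code.
from typing import Optional


def resolve_pad_len(
    padding: str,
    batch_max: int,
    dataset_max: int,
    max_length: Optional[int] = None,
) -> int:
    """Maps a padding policy string to a concrete pad length."""
    if padding == "longest":
        # Pad to the longest sequence in the batch (the current default behavior).
        return batch_max
    if padding == "max_length":
        # Pad to a fixed length: the user-supplied cap if given, else the dataset max.
        return max_length if max_length is not None else dataset_max
    raise ValueError(f"Unknown padding policy: {padding}")
```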

@kylebgorman
Contributor Author

Yeah, I think we just want a flag that opts us into padding='max_length' (subject to --max_source_length and/or --max_target_length constraints) rather than the default padding='longest'.

Truncation makes more sense in BERT-ish land for other reasons, seems to me.

@Adamits
Collaborator

Adamits commented Feb 9, 2023

Sounds good!

Yes, when you are training an LM on all the data you can possibly find, truncating is not such a problem :).

@kylebgorman
Contributor Author

Now that #40 is closed and such, I am wondering: what is the best way to enable this?

  • I think it should be a flag --pad_max which causes the padding to be as long as the longest source, features, and target length (respectively), with --no_pad_max the default.
  • Which module is responsible for computing the max for source, target, and features? Dataset? Dataconfig?
  • Is it okay for the warning or exception to be raised as in #51 (Adds Exception for inputs > maximum_source_length) in --pad_max mode? That is, we'd just let it happen naturally, on the first draw from the collator?

This information seems to need to live in the collator as a member. Then when it is asked to make a tensor, we'll pass pad_len=... to the appropriate max.

Any thoughts on this design @Adamits?

@Adamits
Collaborator

Adamits commented Feb 22, 2023

  • I think it should be a flag --pad_max which causes the padding to be as long as the longest source, features, and target length (respectively), with --no_pad_max the default.

This sounds good. We should probably add documentation for padding (the default pads to the batch max; this flag sets it to the dataset max).

  • Which module is responsible for computing the max for source, target, and features? Dataset? Dataconfig?

Dataset makes the most sense to me. I think the DataConfig just knows how to translate an input file into a sample; Dataset, to me, makes sense as the place that tracks all properties of the actual dataset. This should be easy for now, since we store all the samples on the dataset.
If we ever wanted to change this so that we don't store all samples in memory (not sure we need to care about that?), it would probably be a little trickier: we would need a preprocessing pass just to find the max sequence lengths.

  • Is it okay for the warning or exception to be raised as in #51 (Adds Exception for inputs > maximum_source_length) in --pad_max mode?

This is a good question. I think we should let it raise. However, it is worth considering raising a different message if pad_max is used? Probably not necessary, since the obvious solution is just to increase max_source_length.

This information seems to need to live in the collator as a member. Then when it is asked to make a tensor, we'll pass pad_len=... to the appropriate max.

Agreed. So in this way, we can get the pad max on the dataset, pass it to the collator, and if that attribute is not None, we always pad to it when running the collator. I think this sounds good!


@kylebgorman
Contributor Author

Okay, so:

  • --pad_max is the flag.
  • Dataset computes the max(es).
  • This gets passed as the pad_max argument in batches.PaddedTensor calls via the collators (roughly as sketched below).
  • I may add a TODO to tweak the error later.
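
Roughly, the wiring might look like this (illustrative stand-ins only; the real batches.PaddedTensor and collator signatures may differ):

```python
# Illustrative wiring only; class names and signatures are stand-ins, not
# the actual yoyodyne implementation.
from typing import List, Optional

import torch


class PaddedTensor:
    """Stand-in for batches.PaddedTensor."""

    def __init__(self, seqs: List[List[int]], pad_idx: int,
                 pad_len: Optional[int] = None):
        # Pad to pad_len if given; otherwise pad to the batch max.
        length = pad_len if pad_len is not None else max(len(seq) for seq in seqs)
        self.padded = torch.full((len(seqs), length), pad_idx, dtype=torch.long)
        for i, seq in enumerate(seqs):
            self.padded[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)


class Collator:
    def __init__(self, pad_idx: int,
                 max_source_length: Optional[int] = None,
                 max_target_length: Optional[int] = None):
        # With --pad_max, these hold the dataset-wide maxima; otherwise None,
        # and each batch is padded only to its own longest sequence.
        self.pad_idx = pad_idx
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __call__(self, batch: List[dict]) -> dict:
        sources = [item["source"] for item in batch]
        targets = [item["target"] for item in batch]
        return {
            "source": PaddedTensor(sources, self.pad_idx, self.max_source_length),
            "target": PaddedTensor(targets, self.pad_idx, self.max_target_length),
        }
```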

kylebgorman added a commit to kylebgorman/yoyodyne that referenced this issue May 29, 2023
It is not plugged into anything yet.

Working on issue CUNY-CL#50.
@kylebgorman
Contributor Author

I now have an early draft of this here, in the max branch of my fork. Important simplifications while I get the basics right:

  • I assume that we are always padding to the max (for source and target); I can make it optional later.
  • I am ignoring feature models for now; that can be incorporated later.

I set the actual source_max_length to min(max(longest source string in train, longest source string in dev), --max_source_length), and similarly for the target length. I then lightly modify the LSTM (it needs to be told the max source length) and the transformer (it needs to make the positional embedding as large as the max of the longest source and target lengths). Everything else is just plumbing.
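
In other words (the function and variable names here are just illustrative):

```python
# Hypothetical sketch of the length computation described above.
def effective_max_length(train_max: int, dev_max: int, flag_max: int) -> int:
    # Pad to the longest string observed in train or dev, but never beyond
    # the user-specified --max_source_length / --max_target_length cap.
    return min(max(train_max, dev_max), flag_max)


# E.g., with a train max of 14, a dev max of 17, and --max_source_length 128:
# effective_max_length(14, 17, 128) == 17.
```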

If you have some time I'd appreciate your input @Adamits...I could generate an early PR for review if that'd help.
