ByT5 encoder #73

Open
kylebgorman opened this issue Jun 27, 2023 · 0 comments
@kylebgorman (Contributor) commented Jun 27, 2023

As a postlude to #72, I propose we make it possible to use ByT5 as the source (e.g., --source_encoder_arch byt5_base) and/or feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.

This should become much easier to do upon completion of #72: we'd just implement a new ByT5 encoder in yoyodyne/models/modules/byt5.py. In the constructor, you'd use the transformers.T5EncoderModel.from_pretrained class method to instantiate the encoder; ByT5 comes in four sizes (small, base, large, xl) and we could just add all four.
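A minimal sketch of what that module might look like. The class name, the arch flags, and the lazy-loading structure are all hypothetical here; the Hub checkpoint names (google/byt5-small etc.) and the T5EncoderModel class are real, but loading the weights of course requires the transformers package and a download, so it's deferred to a separate method.

```python
# Hypothetical sketch for yoyodyne/models/modules/byt5.py; names and
# structure are illustrative, not the actual implementation.

# Maps the proposed --source_encoder_arch values to HF Hub checkpoints.
PRETRAINED = {
    "byt5_small": "google/byt5-small",
    "byt5_base": "google/byt5-base",
    "byt5_large": "google/byt5-large",
    "byt5_xl": "google/byt5-xl",
}


class ByT5Encoder:
    """Wraps a pretrained ByT5 encoder stack for fine-tuning."""

    def __init__(self, arch: str):
        if arch not in PRETRAINED:
            raise ValueError(f"Unknown ByT5 arch: {arch}")
        self.model_name = PRETRAINED[arch]
        self.model = None  # Loaded lazily; see load().

    def load(self):
        # Deferred import: transformers is only needed when the weights
        # are actually instantiated (downloads the checkpoint).
        from transformers import T5EncoderModel

        self.model = T5EncoderModel.from_pretrained(self.model_name)
        return self.model
```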

I don't think I'd go about adding access to just any HuggingFace encoder, though, as their tokenizers will generally be incompatible with ours. If we expect many more of these, we could add a lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that a given HuggingFace encoder is compatible with this library.
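One way such a built-in registration mechanism could look, as a sketch; the registry name, decorator, and class below are all hypothetical:

```python
# Hypothetical registry: one place to declare that a given HuggingFace
# encoder is known to be compatible with this library.
HF_ENCODERS: dict = {}


def register_hf_encoder(arch: str, model_name: str):
    """Class decorator registering an arch flag -> (class, checkpoint)."""

    def wrap(cls):
        HF_ENCODERS[arch] = (cls, model_name)
        return cls

    return wrap


@register_hf_encoder("byt5_small", "google/byt5-small")
class ByT5SmallEncoder:
    """Placeholder encoder class for illustration."""
```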

The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add byte as a special-case separator option.
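If we do bypass their tokenizer, the byte special case is small: ByT5's tokenizer is just UTF-8 bytes shifted up by three reserved ids (0 = pad, 1 = eos, 2 = unk). A sketch, assuming we want to reproduce that convention directly in our own dataset config (function names here are illustrative):

```python
# ByT5 convention: ids 0-2 are reserved (pad, eos, unk); raw UTF-8
# bytes occupy ids 3-258.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
OFFSET = 3


def byte_tokenize(s: str) -> list[int]:
    """Encodes a string as ByT5-style byte ids, with a trailing EOS."""
    return [b + OFFSET for b in s.encode("utf-8")] + [EOS_ID]


def byte_detokenize(ids: list[int]) -> str:
    """Inverts byte_tokenize, dropping the reserved special ids."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return raw.decode("utf-8", errors="replace")
```

This would let the dataset config treat "byte" as just another separator option, with no HuggingFace tokenizer in the loop.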

Here are some early notes on how to do this.

@kylebgorman kylebgorman added the enhancement New feature or request label Jun 27, 2023