As a postlude to #72, I propose we make it possible to use ByT5 as the source encoder (e.g., `--source_encoder_arch byt5_base`) and/or the feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.
This should become much easier to do upon completion of #72: we'd just implement a new encoder `ByT5` in `yoyodyne/models/modules/byt5.py`. In the constructor, you'd use the `transformers.T5EncoderModel.from_pretrained` class method to instantiate the encoder; there are five sizes (`small`, `base`, `large`, `xl`, `xxl`) and we could just add them all.
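A rough sketch of what that module might look like, assuming #72 lands something like an encoder base class; the size dictionary, the `output_size` property, and the `forward` signature below are placeholders for whatever that issue settles on, not the final API:

```python
# yoyodyne/models/modules/byt5.py -- sketch only; the interface here is a
# guess at what #72 will give us.
import torch
from torch import nn

import transformers


class ByT5(nn.Module):
    """Wraps a pretrained ByT5 encoder stack for fine-tuning."""

    # One entry per checkpoint size exposed via --source_encoder_arch.
    pretrained_names = {
        "byt5_small": "google/byt5-small",
        "byt5_base": "google/byt5-base",
        "byt5_large": "google/byt5-large",
        "byt5_xl": "google/byt5-xl",
        "byt5_xxl": "google/byt5-xxl",
    }

    def __init__(self, arch: str = "byt5_base"):
        super().__init__()
        # T5EncoderModel loads only the encoder stack (no decoder).
        self.encoder = transformers.T5EncoderModel.from_pretrained(
            self.pretrained_names[arch]
        )

    @property
    def output_size(self) -> int:
        return self.encoder.config.d_model

    def forward(self, source: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # `source` holds ByT5 byte IDs; `mask` is 1 for real symbols, 0 for padding.
        return self.encoder(input_ids=source, attention_mask=mask).last_hidden_state
```

Downstream code would then read the encoder's hidden size off `output_size` rather than a user-specified dimension.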
I don't think I'd go about adding access to just any HuggingFace encoder though, as their tokenizers will be incompatible. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.
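If we do end up wanting that, something like the following is about as heavy as I'd make it (the names and fields here are purely illustrative):

```python
# Hypothetical built-in registry: one place to declare that a given
# HuggingFace checkpoint is known to work with this library.
import dataclasses


@dataclasses.dataclass(frozen=True)
class HFEncoderSpec:
    checkpoint: str   # HuggingFace Hub name.
    byte_level: bool  # True if the model consumes raw UTF-8 bytes.


HF_ENCODERS = {
    "byt5_small": HFEncoderSpec("google/byt5-small", byte_level=True),
    "byt5_base": HFEncoderSpec("google/byt5-base", byte_level=True),
    "byt5_large": HFEncoderSpec("google/byt5-large", byte_level=True),
    "byt5_xl": HFEncoderSpec("google/byt5-xl", byte_level=True),
    "byt5_xxl": HFEncoderSpec("google/byt5-xxl", byte_level=True),
}


def get_hf_encoder(arch: str) -> HFEncoderSpec:
    try:
        return HF_ENCODERS[arch]
    except KeyError as err:
        raise NotImplementedError(f"Unknown HuggingFace encoder: {arch}") from err
```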
The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add `byte` as a special-case separator option.
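For concreteness, here is the kind of thing I have in mind for a `byte` separator, bypassing the HuggingFace tokenizer entirely. If I'm reading the ByT5 vocabulary right, it reserves IDs 0–2 for `<pad>`, `</s>`, and `<unk>`, so a byte b maps to ID b + 3; that's worth sanity-checking against `transformers.AutoTokenizer.from_pretrained("google/byt5-small")` before wiring anything in. The function names below are illustrative, not existing API:

```python
# Sketch: split a source string into single-byte symbols ourselves, then
# map those bytes onto ByT5's fixed byte vocabulary (offset of 3 for the
# special tokens).
def byte_split(string: str) -> list[bytes]:
    """Splits a string into single UTF-8 bytes."""
    encoded = string.encode("utf-8")
    return [encoded[i : i + 1] for i in range(len(encoded))]


def bytes_to_byt5_ids(symbols: list[bytes]) -> list[int]:
    """Maps single-byte symbols onto ByT5 input IDs."""
    return [symbol[0] + 3 for symbol in symbols]


assert bytes_to_byt5_ids(byte_split("cab")) == [102, 100, 101]
```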
Here are some early notes on how to do this.