Currently the Transformer is not really implemented as it should be. We should revisit it and implement it like in the original Transformer paper, including always training to predict the next sample (as language models do) and calling the encoder+decoder auto-regressively when producing forecasts. See: Attention Is All You Need
Note from @pennfranc:
This current implementation is fully functional and can already produce some good predictions. However, it is still limited in how it uses the Transformer architecture, because the `tgt` input of `torch.nn.Transformer` is not utilized to its full extent. Currently, we simply pass the last value of the `src` input to `tgt`. To get closer to the way the Transformer is usually used in language models, we should allow the model to consume its own output as part of the `tgt` argument, such that when predicting sequences of values, the input to the `tgt` argument would grow as outputs of the Transformer model are added to it. Of course, the training of the model would have to be adapted accordingly.
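A minimal sketch of what such auto-regressive inference could look like with a bare `torch.nn.Transformer` (the wrapper class, its projection layers, and the `forecast` method below are hypothetical and only illustrate the idea, not the actual Darts implementation):

```python
import torch
import torch.nn as nn


class TinyTransformerForecaster(nn.Module):
    """Illustrative wrapper around torch.nn.Transformer (not the Darts TransformerModel)."""

    def __init__(self, input_dim: int, d_model: int = 64):
        super().__init__()
        self.encoder_proj = nn.Linear(input_dim, d_model)
        self.decoder_proj = nn.Linear(input_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out_proj = nn.Linear(d_model, input_dim)

    @torch.no_grad()
    def forecast(self, src: torch.Tensor, n_steps: int) -> torch.Tensor:
        # src: (batch, src_len, input_dim)
        memory = self.transformer.encoder(self.encoder_proj(src))
        # seed tgt with the last observed value, then let it grow with the model's own outputs
        tgt = src[:, -1:, :]
        preds = []
        for _ in range(n_steps):
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1)).to(src.device)
            dec_out = self.transformer.decoder(self.decoder_proj(tgt), memory, tgt_mask=tgt_mask)
            next_val = self.out_proj(dec_out[:, -1:, :])  # prediction for the next time step
            preds.append(next_val)
            tgt = torch.cat([tgt, next_val], dim=1)  # feed the prediction back into tgt
        return torch.cat(preds, dim=1)
```

Note that in this sketch the encoder memory is computed once and reused, while only `tgt` grows step by step; that is the usual way encoder-decoder Transformers are run auto-regressively at inference time.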
Hi @dennisbader @madtoinou, while working on the RWKV PR I realized that I'm not using teacher forcing during training, which would hinder the training quite a bit. Since it's a big part of this issue, I wanted to ask if I could pick it up, so that I would have a reference point for how its final implementation should look if I get it merged (+ the issue looks really cool ;) )
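For illustration only, here is a hedged sketch of what teacher forcing could look like during training, building on the hypothetical `TinyTransformerForecaster` above: the decoder consumes the ground-truth target shifted right by one step (seeded with the last `src` value), and a square subsequent mask keeps each position from attending to future samples, so every position predicts the next sample in a single forward pass.

```python
import torch
import torch.nn as nn


def training_step(model: "TinyTransformerForecaster",
                  src: torch.Tensor,
                  target: torch.Tensor) -> torch.Tensor:
    # Teacher forcing: decoder input is the ground truth shifted right by one step,
    # seeded with the last observed src value (the hypothetical names mirror the sketch above).
    decoder_input = torch.cat([src[:, -1:, :], target[:, :-1, :]], dim=1)
    tgt_mask = model.transformer.generate_square_subsequent_mask(decoder_input.size(1)).to(src.device)
    memory = model.transformer.encoder(model.encoder_proj(src))
    dec_out = model.transformer.decoder(model.decoder_proj(decoder_input), memory, tgt_mask=tgt_mask)
    preds = model.out_proj(dec_out)  # next-sample prediction at every decoder position
    return nn.functional.mse_loss(preds, target)
```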