Implementation of the Transformer from the "Attention Is All You Need" paper in PyTorch.

TODO
- byte-pair encoding
- label smoothing
- beam search
- more tests
- batch bucketing
- visualize attention
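
As a rough illustration of the pending label smoothing item, here is a minimal sketch in plain Python. It is not code from this repository: the function names `smoothed_targets` and `label_smoothing_loss` are hypothetical, and the formula assumed is the common one that mixes the one-hot target with a uniform distribution over all classes. The default `epsilon=0.1` matches the value reported in the paper. In an actual PyTorch training loop, the `label_smoothing` argument of `torch.nn.CrossEntropyLoss` would likely be the simpler route.

```python
import math
from typing import List


def smoothed_targets(target: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Hypothetical helper: mix a one-hot target with a uniform distribution."""
    if not 0 <= target < num_classes:
        raise ValueError("target index out of range")
    # Every class receives epsilon / num_classes of the probability mass...
    dist = [epsilon / num_classes] * num_classes
    # ...and the true class additionally receives the remaining 1 - epsilon.
    dist[target] += 1.0 - epsilon
    return dist


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothing_loss(logits: List[float], target: int, epsilon: float = 0.1) -> float:
    """Cross-entropy between the smoothed target distribution and the model output."""
    log_probs = log_softmax(logits)
    targets = smoothed_targets(target, len(logits), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


if __name__ == "__main__":
    # Toy example: 4-token vocabulary, true token at index 2.
    print(smoothed_targets(2, 4))
    print(label_smoothing_loss([1.0, 0.5, 2.0, -1.0], target=2))
```

With `epsilon=0.0` the loss reduces to ordinary cross-entropy, which makes the sketch easy to check against an unsmoothed baseline.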