I trained on four A100 GPUs with a total batch size of 36. After 300,000 training iterations, this is the result of the model: which is quite different from the result given in your paper: I did not change the code. What could be the cause?