What is the actual batch size during training when I use 8 GPUs with batch_size=40 in the config yaml?
Is the effective batch size 40x8=320, or is the 40 split across the GPUs (5x8=40)?
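For reference, a minimal sketch of the two interpretations I have in mind (this is only illustrative arithmetic; whether the yaml's batch_size is per-GPU or global depends on how this repo builds its dataloader, which I'm not sure about):

```python
# Illustrative arithmetic only; the repo's dataloader may use either convention.
num_gpus = 8
batch_size_cfg = 40  # value from the config yaml

# Interpretation 1: batch_size is per-GPU (the common DistributedDataParallel convention)
effective_if_per_gpu = batch_size_cfg * num_gpus   # 40 * 8 = 320

# Interpretation 2: batch_size is the global batch, split across GPUs
effective_if_global = batch_size_cfg               # 40 total, i.e. 5 per GPU

print(effective_if_per_gpu, effective_if_global)   # 320 40
```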
Also, I see a training speed of only about 400 steps per hour on my 8xH800 machine,
while the paper says training runs for 2000k steps on 8xA800, which would take more than 5k hours (about 200 days).
So what is going wrong here?
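Here is the rough arithmetic behind the 200-day figure, assuming 400 steps/hour stays constant for the full run (a sketch of my estimate, not a measurement of the full schedule):

```python
# Rough wall-clock estimate for the quoted step count at my observed speed.
total_steps = 2_000_000      # "2000k steps" from the paper
steps_per_hour = 400         # observed on my 8xH800 machine

hours = total_steps / steps_per_hour   # 5000 hours
days = hours / 24                      # ~208 days

print(f"{hours:.0f} hours ≈ {days:.0f} days")
```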