Hi @yaroslavvb
I tried to reproduce "imagenet 18" on my host. It works well with fp16 (75.776% top-1 accuracy at the 27th epoch), but with fp32 I only get 51.018% top-1 accuracy at the 27th epoch.

The entrypoint is as follows. I turned down the batch_size to avoid OOM with fp32; all other arguments are the same as in my fp16 experiment:
```shell
PYTHONPATH=/imagenet18 \
NCCL_DEBUG=VERSION \
stdbuf -oL nohup python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
--master_addr=127.0.0.1 \
--master_port=6010 \
training/train_imagenet_nv.py /data/imagenet \
--logdir /imagenet18/log_small_bs \
--distributed \
--init-bn0 \
--no-bn-wd \
--phases '[{"ep": 0, "sz": 128, "bs": 224, "trndir": "/sz/160"}, {"ep": (0, 7), "lr": (1.0, 2.0)}, {"ep": (7, 13), "lr": (2.0, 0.25)}, {"ep": 13, "sz": 224, "bs": 96, "trndir": "/sz/352", "min_scale": 0.087}, {"ep": (13, 22), "lr": (0.42857142857142855, 0.04285714285714286)}, {"ep": (22, 25), "lr": (0.04285714285714286, 0.004285714285714286)}, {"ep": 25, "sz": 288, "bs": 50, "min_scale": 0.5, "rect_val": True}, {"ep": (25, 28), "lr": (0.0022321428571428575, 0.00022321428571428573)}]' > local_train.log 2>&1 &
```

(Note: the redirection was `2>&1 > local_train.log`, which sends stderr to the terminal rather than the log file; `> local_train.log 2>&1` captures both streams.)
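One possible factor, stated as an assumption rather than a confirmed cause: the `--phases` schedule above was kept unchanged while the batch sizes were lowered, and under the linear LR scaling rule (Goyal et al.) the learning rates in each phase should shrink by the same ratio as the batch size. A minimal sketch, with hypothetical batch-size numbers, of how the phase LRs could be rescaled:

```python
# Sketch of the linear LR scaling rule: when the per-GPU batch size drops
# from old_bs to new_bs, scale the learning rate by new_bs / old_bs.
# The batch sizes below are hypothetical, for illustration only.
def scale_lr(lr, old_bs, new_bs):
    """Scale a phase LR (a float or an (lr_start, lr_end) pair) by new_bs / old_bs."""
    ratio = new_bs / old_bs
    if isinstance(lr, tuple):
        return tuple(v * ratio for v in lr)
    return lr * ratio

# Example: suppose the fp16 run used bs=256 and the fp32 run drops to bs=224.
print(scale_lr((1.0, 2.0), 256, 224))  # -> (0.875, 1.75)
```

If the schedule was tuned for the larger fp16 batches, running it unmodified at the smaller fp32 batch sizes would effectively over-scale the LR, which could explain part of the accuracy gap.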
Have you been able to reproduce the reported result with fp32?