Hi @yaroslavvb
I tried to reproduce "imagenet 18" on my host. It works well with fp16 (75.776% top-1 accuracy at the 27th epoch), but with fp32 I only get 51.018% top-1 accuracy at the 27th epoch.

The entrypoint is as follows. I turned down the batch_size to avoid OOM with fp32; all other arguments are the same as in my fp16 experiment:
```shell
PYTHONPATH=/imagenet18 \
NCCL_DEBUG=VERSION \
stdbuf -oL nohup python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
--master_addr=127.0.0.1 \
--master_port=6010 \
training/train_imagenet_nv.py /data/imagenet \
--logdir /imagenet18/log_small_bs \
--distributed \
--init-bn0 \
--no-bn-wd \
--phases '[{"ep": 0, "sz": 128, "bs": 224, "trndir": "/sz/160"}, {"ep": (0, 7), "lr": (1.0, 2.0)}, {"ep": (7, 13), "lr": (2.0, 0.25)}, {"ep": 13, "sz": 224, "bs": 96, "trndir": "/sz/352", "min_scale": 0.087}, {"ep": (13, 22), "lr": (0.42857142857142855, 0.04285714285714286)}, {"ep": (22, 25), "lr": (0.04285714285714286, 0.004285714285714286)}, {"ep": 25, "sz": 288, "bs": 50, "min_scale": 0.5, "rect_val": True}, {"ep": (25, 28), "lr": (0.0022321428571428575, 0.00022321428571428573)}]' > local_train.log 2>&1 &
```

(Note: the redirection was `2>&1 > local_train.log`, which sends stderr to the terminal rather than the log file; `> local_train.log 2>&1` captures both streams.)
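One possible factor, stated as an assumption rather than a confirmed cause: the `--phases` schedule above was kept unchanged while the batch sizes were lowered, and under the linear LR scaling rule (Goyal et al.) the learning rates in each phase should shrink by the same ratio as the batch size. A minimal sketch, with hypothetical batch-size numbers, of how the phase LRs could be rescaled:

```python
# Sketch of the linear LR scaling rule: when the per-GPU batch size drops
# from old_bs to new_bs, scale the learning rate by new_bs / old_bs.
# The batch sizes below are hypothetical, for illustration only.
def scale_lr(lr, old_bs, new_bs):
    """Scale a phase LR (a float or an (lr_start, lr_end) pair) by new_bs / old_bs."""
    ratio = new_bs / old_bs
    if isinstance(lr, tuple):
        return tuple(v * ratio for v in lr)
    return lr * ratio

# Example: suppose the fp16 run used bs=256 and the fp32 run drops to bs=224.
print(scale_lr((1.0, 2.0), 256, 224))  # -> (0.875, 1.75)
```

If the schedule was tuned for the larger fp16 batches, running it unmodified at the smaller fp32 batch sizes would effectively over-scale the LR, which could explain part of the accuracy gap.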
Have you been able to reproduce the reported result with fp32?