
[bug] how to train fine-tuning classification model (size mismatch for head.bias: copying a param with shape torch.Size([1000]) from checkpoint, the shape in current model is torch.Size([4]).) #297

Open
jeonga0303 opened this issue Jun 5, 2024 · 3 comments

Comments

@jeonga0303

I customized config.py.
How can I train a fine-tuned classification model?

@jeonga0303 jeonga0303 added [solved] to the issue title Jun 9, 2024
@jeonga0303 jeonga0303 reopened this Jun 10, 2024
@jeonga0303 jeonga0303 removed [solved] from the issue title Jun 10, 2024
@jeonga0303

jeonga0303 commented Jun 10, 2024

How do I convert Nc1 (the pretrained head's 1000 classes) to Nc2 (my 4 classes)?


@jeonga0303

jeonga0303 commented Jun 10, 2024


I changed the file configuration in the following order.

I am still training; the dataset is large, so I will report the results later.

  1. Download the pretrained checkpoint (internimage_b_1k_224.pth).

  2. config.py: set the image size, pretrained checkpoint, and number of classes.

_C.DATA.IMG_SIZE = 224
_C.MODEL.PRETRAINED = 'internimage_b_1k_224.pth'
_C.MODEL.NUM_CLASSES = 4

  3. util.py (Nc1: 1000 → Nc2: 4)
     Modify the load_pretrained function so the classification head is re-initialized when the class counts differ (a standalone sketch of the same idea appears right after step 5):
if 'head.bias' in state_dict:
    head_bias_pretrained = state_dict['head.bias']
    Nc1 = head_bias_pretrained.shape[0]  # number of classes in the checkpoint (1000)
    Nc2 = model.head.bias.shape[0]       # number of classes in the current model (4)
    logger.info(f'{Nc1}, {Nc2}')
    if Nc1 != Nc2:
        # Re-initialize the head of the current model and drop the pretrained head
        # weights so load_state_dict does not hit the size mismatch.
        model.head.weight = torch.nn.Parameter(torch.zeros_like(model.head.weight))
        model.head.bias = torch.nn.Parameter(torch.zeros_like(model.head.bias))
        state_dict.pop('head.weight', None)
        state_dict.pop('head.bias', None)
  4. dataset/samplers.py
     Modify the __iter__ method of the sampler:
def __iter__(self):
    # generator seeded by the epoch, used for the final per-rank shuffle below
    g = torch.Generator()
    g.manual_seed(self.epoch)

    # generator with a fixed seed, so every rank sees the same base permutation
    t = torch.Generator()
    t.manual_seed(0)

    indices = torch.randperm(len(self.dataset), generator=t).tolist()
    indices = [i for i in indices if i % self.num_parts == self.rank]

    # add extra samples to make it evenly divisible
    while len(indices) < self.total_size_parts:
        indices += indices[:(self.total_size_parts - len(indices))]
    indices = indices[:self.total_size_parts]
    assert len(indices) == self.total_size_parts, \
        f'Length of indices ({len(indices)}) does not match total_size_parts ({self.total_size_parts})'

    # subsample for this rank
    indices = indices[self.rank // self.num_parts:self.total_size_parts:self.num_replicas // self.num_parts]

    # epoch-dependent shuffle of this rank's subset
    index = torch.randperm(len(indices), generator=g).tolist()
    indices = list(np.array(indices)[index])

    assert len(indices) == self.num_samples, \
        f'Length of indices ({len(indices)}) does not match num_samples ({self.num_samples})'

    return iter(indices)
  5. Launch command
     python -m torch.distributed.launch --nproc_per_node 2 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --data-path [data-path] --pretrained internimage_b_1k_224.pth --batch-size 120
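
For reference, here is a minimal standalone sketch of the head-mismatch handling from step 3. It is an illustration under assumptions, not the repo's actual load_pretrained: load_backbone_only and checkpoint_path are placeholder names, the 'model' checkpoint key is assumed, and the model is only assumed to expose a head linear layer. The idea is to drop the mismatched head keys from the checkpoint and load the remaining weights with strict=False, so the new 4-class head is trained from scratch.

import torch

def load_backbone_only(model, checkpoint_path, logger=None):
    # Load the checkpoint on CPU; many classification checkpoints keep the weights
    # under a 'model' key, so fall back to the raw dict if that key is absent.
    checkpoint = torch.load(checkpoint_path, map_location='cpu')
    state_dict = checkpoint.get('model', checkpoint)

    # If the pretrained head was trained for a different number of classes,
    # drop it so the randomly initialized head of the current model is kept.
    if 'head.bias' in state_dict and \
            state_dict['head.bias'].shape[0] != model.head.bias.shape[0]:
        state_dict.pop('head.weight', None)
        state_dict.pop('head.bias', None)

    # strict=False tolerates the now-missing head keys.
    msg = model.load_state_dict(state_dict, strict=False)
    if logger is not None:
        logger.info(f'load_state_dict: {msg}')
    return model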

My GPUs are 2 × A100.

  • If you use a very large dataset, use the following command with gradient accumulation instead:
    python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/without_lr_decay/internimage_b_1k_224_custom.yaml --batch-size 256 --accumulation-steps 4 --pretrained internimage_b_1k_224.pth --data-path [data-path] --local-rank 1 --output work_dirs
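
For reference on batch sizes (assuming the usual convention that the effective global batch size is per-GPU batch size × number of GPUs × accumulation steps): the first command gives 120 × 2 × 1 = 240, while the large-dataset command gives 256 × 1 × 4 = 1024, so the learning rate may need to be adjusted accordingly.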

2024-06-11: training succeeded (image classification fine-tuning).


@jeonga0303 jeonga0303 added [solved] to the issue title Jun 11, 2024
@jeonga0303

[bug]

Training does not seem to make progress: the loss stays the same at every step.
May I know the reason?
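
One thing worth checking (a hypothetical sketch, not a confirmed cause): the load_pretrained change in step 3 replaces model.head.weight and model.head.bias with brand-new nn.Parameter objects initialized to zero. If the optimizer is built before load_pretrained runs, it may still be tracking the old tensors, so the head never updates, and an all-zero head also produces identical logits (loss ≈ ln 4 ≈ 1.386 for every sample) at the start. Re-initializing the existing tensors in place avoids both issues:

import torch

# Hypothetical in-place re-initialization of the new 4-class head. Modifying the
# existing tensors (instead of creating new nn.Parameter objects) keeps them visible
# to an optimizer that was built earlier, and a small random init avoids starting
# from all-zero logits.
with torch.no_grad():
    torch.nn.init.trunc_normal_(model.head.weight, std=0.02)
    model.head.bias.zero_()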


@jeonga0303 jeonga0303 reopened this Jun 11, 2024
@jeonga0303 jeonga0303 changed the issue title from [solved] back to [bug] Jun 11, 2024