How should I fix the input size during testing? #238

Open
klkl2164 opened this issue May 20, 2024 · 3 comments

@klkl2164

I have modified the backbone of Mask2Former to Vmamba, which requires the model's input size to be fixed, for example 640x640. This is not an issue during training, because the train_dataloader outputs cropped images and I just need to set the crop parameters accordingly. However, I ran into a problem during testing. I am not sure how exactly the test_dataloader operates (I am not very familiar with the detectron2 framework and couldn't find the relevant code). During testing, the width and height of the images are not equal, with one of them being 640. My question is: which part of the code should I modify so that the input images to the model are 640x640 during testing? I don't need any other data augmentation. I would greatly appreciate an answer.
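(For context: detectron2's default test-time mapper typically only resizes the shortest edge to cfg.INPUT.MIN_SIZE_TEST, which is why one side ends up at 640 while the other varies. Below is a minimal sketch of a test loader that forces a fixed 640x640 input, assuming detectron2's standard DatasetMapper, transforms.Resize, and build_detection_test_loader API; the helper name is illustrative, not code from this repo.)

'''
from detectron2.data import DatasetMapper, build_detection_test_loader
import detectron2.data.transforms as T

def build_fixed_size_test_loader(cfg, dataset_name):
    # Replace the default shortest-edge resize with a hard resize to 640x640,
    # so every test image reaches the model at the same spatial size.
    mapper = DatasetMapper(
        cfg,
        is_train=False,
        augmentations=[T.Resize((640, 640))],  # (height, width)
    )
    return build_detection_test_loader(cfg, dataset_name, mapper=mapper)
'''

In a DefaultTrainer-based setup, this would typically be hooked in by overriding the Trainer's build_test_loader classmethod.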

@zhengyuan-xie

Same question. I resize the images in the forward function during inference, but it is not elegant :(
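(A minimal sketch of that forward-time workaround as a wrapper module, purely illustrative and not code from this repo; the class name is made up.)

'''
import torch.nn as nn
import torch.nn.functional as F

class FixedSizeBackbone(nn.Module):
    """Resizes every input to a fixed size before calling the wrapped backbone."""

    def __init__(self, backbone, size=(640, 640)):
        super().__init__()
        self.backbone = backbone
        self.size = size

    def forward(self, x):
        # Bilinear resize to the fixed size the backbone expects.
        x = F.interpolate(x, size=self.size, mode="bilinear", align_corners=False)
        return self.backbone(x)
'''

Note that the predictions then live in the resized geometry, so they may need to be mapped back to the original image size downstream.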

@klkl2164
Author

klkl2164 commented May 22, 2024

Same question. I resize the images in the forward function during inference, but it is not elegant :(

I use HUST's ViM as the backbone (https://github.com/hustvl/Vim/blob/main/vim/models_mamba.py), in which PatchEmbed fixes the input size. Following the Swin Transformer, I added a padding operation so that non-fixed input sizes can be used. Fortunately, neither ViM nor Mask2Former's pixel decoder places many constraints on the input size. You can try modifying PatchEmbed like this:
'''
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import to_2tuple


class PatchEmbedfromswintransformer(nn.Module):
    """Patch embedding with Swin-style padding so non-fixed input sizes work."""

    def __init__(self, img_size=224, patch_size=16, stride=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.grid_size = ((img_size[0] - patch_size[0]) // stride + 1, (img_size[1] - patch_size[1]) // stride + 1)
        self.num_patches = self.grid_size[0] * self.grid_size[1]
        self.flatten = flatten

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=stride)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x):
        """Forward function."""
        # Pad the right/bottom edges so H and W become multiples of the patch size.
        _, _, H, W = x.size()
        if W % self.patch_size[1] != 0:
            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))
        if H % self.patch_size[0] != 0:
            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))

        x = self.proj(x)  # B x embed_dim x Wh x Ww
        if self.flatten:
            x = x.flatten(2).transpose(1, 2)  # B C H W -> B N C
        x = self.norm(x)
        return x
'''
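(A quick sanity check of the padded patch embedding; illustrative only, assuming the class above and a non-square test image.)

'''
import torch

patch_embed = PatchEmbedfromswintransformer(img_size=640, patch_size=16, stride=16, embed_dim=768)
x = torch.randn(1, 3, 640, 600)   # width 600 is not a multiple of 16, so it gets padded to 608
tokens = patch_embed(x)
print(tokens.shape)               # torch.Size([1, 1520, 768]): (640/16) * (608/16) = 40 * 38 patches
'''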

@zhengyuan-xie


Thanks!
