How to work with images in different sizes... #4

Closed

utopic-dev opened this issue Nov 14, 2024 · 4 comments

@utopic-dev

Guys, first of all I want to thank you for the incredible work; it is very elegant and admirable. I have a specific case where my training images are 150x40 px, and when I upscale them to a width of 512 px they become very distorted and the model does not seem to learn much, whereas other trainings with larger images were a success. I would humbly like to know whether it is possible to adjust the scripts to train with smaller images. When I change the `img-size` and patch-size arguments in options.py, and the corresponding values in HTR_VT, it triggers a series of errors that I have not been able to solve so far.

Can anyone help me with how to work with smaller images, or is the standard 512 px size mandatory? Thank you in advance for your attention and help.

@YutingLi0606
Owner

Hi, thank you for your interest! I’m more than happy to answer your question. 😊

To begin with, HTR-VT utilizes a slightly modified ResNet-18 instead of the standard Patch Embedding in ViT. The downsampling size depends on this ResNet configuration. If you wish to use a different image size, you will need to adjust the ResNet accordingly.

In options.py, make sure to align the patch size with the new downsampling size.
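
For instance (a sketch only; the exact flag names and defaults in the repo's options.py may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flags for illustration; check the actual argument names
# and format in options.py. The key point is that patch size must equal
# the ResNet's effective downsampling factor along each axis.
parser.add_argument('--img-size', type=int, nargs=2, default=[512, 64],
                    help='input image width and height')
parser.add_argument('--patch-size', type=int, nargs=2, default=[4, 64],
                    help='effective downsampling of the ResNet backbone')
```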

For example, we use an input size of 512x64 with a downsampling size of 4x64 in ResNet (corresponding to the patch size in options.py). This results in a 128x1 feature input to the transformer encoder. Based on our experiments, maintaining this feature shape as nx1 is optimal for this task.
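
To make that arithmetic concrete, here is a tiny self-contained check (the helper is illustrative, not part of the repo):

```python
# Illustrative helper: number of transformer tokens for a given
# input size and effective downsampling (patch) size per axis.
def token_grid(img_w, img_h, down_w, down_h):
    return img_w // down_w, img_h // down_h

# Default setup: 512x64 input, 4x64 downsampling -> 128x1 tokens.
print(token_grid(512, 64, 4, 64))   # (128, 1)

# A smaller setup, as suggested below: 128x32 input, 4x32 -> 32x1 tokens.
print(token_grid(128, 32, 4, 32))   # (32, 1)
```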

If you plan to use a size like 150x40, here are two suggestions:

  1. Consider resizing the input to 128x32 or using a higher resolution like 256x64.
  2. Modify the ResNet (adjust the stride, add layers, or remove layers) to ensure the downsampling size remains appropriate.
    For instance, with an input size of 128x32, after adjustments, you can obtain a feature size of 32x1. In this case, the patch size would be 4x32. Modify the ResNet configuration as follows:

```python
self.layer2 = self._make_layer(BasicBlock, nb_feat // 2, 2, stride=2)
```

Change to:

```python
self.layer2 = self._make_layer(BasicBlock, nb_feat // 2, 2, stride=(2, 1))
```
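
If it helps to see the effect of such a stride change, here is a minimal, self-contained shape check (plain Conv2d layers standing in for the real BasicBlocks, so this is a sketch of the mechanism, not the repo's code):

```python
import torch
import torch.nn as nn

# Stand-ins for ResNet stages: each conv halves whichever dimension
# its stride applies to (kernel 3, padding 1).
def stage(in_ch, out_ch, stride):
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)

x = torch.randn(1, 1, 32, 128)  # dummy 128x32 image, (batch, C, H, W)

# All-stride-2 stack: both height and width shrink by 8x.
orig = nn.Sequential(stage(1, 8, 2), stage(8, 16, 2), stage(16, 32, 2))
print(orig(x).shape)  # torch.Size([1, 32, 4, 16])

# With stride=(2, 1) in the later stages, only the height keeps
# shrinking; the width (the reading direction) keeps its resolution.
mod = nn.Sequential(stage(1, 8, 2), stage(8, 16, (2, 1)), stage(16, 32, (2, 1)))
print(mod(x).shape)  # torch.Size([1, 32, 4, 64])
```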

Note: The hyperparameters provided above are examples. You should fine-tune them based on your experimental results.

I hope this helps!

Best regards,
Yuting

@utopic-dev
Author

utopic-dev commented Dec 4, 2024

Hello Yuting, first of all, sorry for the delay in responding. Thank you very much for your incredible explanation and help; it only reinforces how amazing you and your work are. Once again, thank you for your attention. I will look you up and add you on social media; I want to show you some projects. Incredible work, congratulations.

You helped a lot!

Thanks again!

Best regards! 🙏

@utopic-dev
Author

Is there somewhere I can connect with you? I really appreciate it.

@YutingLi0606
Owner

Hi~ My email is [email protected]
