Interpretation of Feature Layers #67

Closed
mfischer-ucl opened this issue Jun 24, 2024 · 4 comments

@mfischer-ucl commented Jun 24, 2024

Hi, congrats on the great work on RADIO and the CVPR '24 paper, and thanks for open-sourcing the code!

For DINO, people have looked into this and found that the earlier layers encode positional information while the deeper layers encode semantic information (e.g., this excellent work).

I was wondering whether you have looked into the (semantic) interpretation of the intermediate features learned by RADIO, and whether you have noticed similar behavior?

Thanks again!

@mranzinger (Collaborator)

Hi, thanks for the paper link. We haven't explicitly run the analysis from that paper, but we do know that the information content differs between earlier and later layers: UPerNet on top of a frozen RADIO improves semantic segmentation, and we also have some results suggesting that concatenating the features from different depths yields better VLLM metrics.

@mfischer-ucl (Author)

Thanks, this is very helpful. I can confirm that concatenating the features from different network depths does indeed seem to achieve better results (I used blocks 8, 16, 23, and 31; a rough sketch of this setup follows after this comment).

A quick question on feature dimensions: I noticed that the intermediate layers in DINO have shape [4097, fdim], where fdim is the DINO feature dimension (768) and 4097 is 64x64 patch tokens plus the cls token.

In RADIO, for a 1024px input image I get intermediate features of shape [4112, fdim], which is 4096 + 16, so I assume the first 16 tokens are cls tokens too. Can you confirm this is the case, and if so, that they are prepended, i.e., that [16:, :] gives the spatial features?
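A minimal sketch of one way to capture features from several transformer depths and concatenate them, along the lines of what is described above. The torch.hub entry point follows the RADIO repo README, but the version string and the `model.model.blocks` attribute path are assumptions; adjust them to the checkpoint and module layout you actually use.

```python
import torch

# Assumption: hub entry point as in the RADIO README; version string is a placeholder.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-h')
model.eval()

taps = [8, 16, 23, 31]   # block indices mentioned above
captured = {}

def make_hook(idx):
    def hook(module, args, output):
        captured[idx] = output          # [B, num_tokens, fdim] for that block
    return hook

# Assumption: the underlying ViT blocks are reachable at model.model.blocks;
# rename if the wrapper exposes them differently.
handles = [model.model.blocks[i].register_forward_hook(make_hook(i)) for i in taps]

x = torch.randn(1, 3, 1024, 1024)
with torch.no_grad():
    model(x)

# Concatenate along the feature dimension: [B, num_tokens, len(taps) * fdim]
multi_depth = torch.cat([captured[i] for i in taps], dim=-1)

for h in handles:
    h.remove()
```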

@mranzinger (Collaborator)

Yes, you're correct. The first 16 tokens are the cls and register tokens, so [16:, :] will do what you want (assuming you don't have a batch dimension); a short sketch of the slicing follows at the end of this comment.

> I can confirm that indeed concatenating the features from different network depths seems to achieve better results (I used blocks 8, 16, 23 and 31).

Are you at liberty to share anything about your use case?
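For reference, a minimal sketch of the slicing described above, using a random stand-in tensor with a batch dimension; the fdim value here is just a placeholder, since RADIO's actual feature dimension depends on the backbone.

```python
import torch

B, fdim = 2, 1280                     # fdim is a placeholder, model dependent
feats = torch.randn(B, 4112, fdim)    # stand-in for one intermediate layer at 1024x1024 input

summary = feats[:, :16, :]            # cls + register tokens
spatial = feats[:, 16:, :]            # [B, 4096, fdim] patch tokens

# Optionally fold the patch tokens back onto the 64x64 grid for dense heads.
spatial_map = spatial.reshape(B, 64, 64, fdim).permute(0, 3, 1, 2)
print(spatial_map.shape)              # torch.Size([2, 1280, 64, 64])
```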

@mfischer-ucl (Author)

Thanks for the quick answer, that's reassuring and good to know.

We're using the features as input to a downstream attention block that learns dense segmentation. I can't disclose more at the moment, but I will update here once some work is published :) Thanks again! 👍
