Interpretation of Feature Layers #67

Closed
mfischer-ucl opened this issue Jun 24, 2024 · 4 comments

@mfischer-ucl commented Jun 24, 2024

Hi, congrats on the great work on RADIO and the CVPR '24 paper, and thanks for open-sourcing the code!

For DINO, people have looked into this and found that the earlier layers encode positional information while the deeper layers encode semantic information (e.g., this excellent work).

I was wondering whether you have looked into the (semantic) interpretation of the intermediate features learned by RADIO, and whether you have noticed similar behavior?

Thanks again!

@mranzinger (Collaborator)

Hi, thanks for the paper link. We haven't explicitly run the analysis from that paper, but we do know that the information content differs between earlier and later layers: UPerNet on top of a frozen RADIO improves semantic segmentation, and we also have some results suggesting that concatenating the features from different depths yields better VLLM metrics.

@mfischer-ucl (Author)

Thanks, this is very helpful. I can confirm that concatenating the features from different network depths does indeed seem to achieve better results (I used blocks 8, 16, 23, and 31; a rough sketch of this setup follows after this comment).

A quick question on feature dimensions: I noticed that the intermediate layers in DINO have shape [4097, fdim], where fdim is the DINO feature dimension (768) and 4097 is 64x64 patch tokens plus the cls token.

In RADIO, for a 1024px input image I get intermediate features of shape [4112, fdim], which is 4096 + 16, so I assume the first 16 tokens are cls tokens too. Can you confirm this is the case, and if so, that they are prepended, i.e., that [16:, :] gives the spatial features?
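A minimal sketch of one way to capture features from several transformer depths and concatenate them, along the lines of what is described above. The torch.hub entry point follows the RADIO repo README, but the version string and the `model.model.blocks` attribute path are assumptions; adjust them to the checkpoint and module layout you actually use.

```python
import torch

# Assumption: hub entry point as in the RADIO README; version string is a placeholder.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-h')
model.eval()

taps = [8, 16, 23, 31]   # block indices mentioned above
captured = {}

def make_hook(idx):
    def hook(module, args, output):
        captured[idx] = output          # [B, num_tokens, fdim] for that block
    return hook

# Assumption: the underlying ViT blocks are reachable at model.model.blocks;
# rename if the wrapper exposes them differently.
handles = [model.model.blocks[i].register_forward_hook(make_hook(i)) for i in taps]

x = torch.randn(1, 3, 1024, 1024)
with torch.no_grad():
    model(x)

# Concatenate along the feature dimension: [B, num_tokens, len(taps) * fdim]
multi_depth = torch.cat([captured[i] for i in taps], dim=-1)

for h in handles:
    h.remove()
```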

@mranzinger (Collaborator)

Yes, you're correct. The first 16 tokens are the cls and register tokens, so [16:, :] will do what you want (assuming you don't have a batch dimension); a short sketch of the slicing follows at the end of this comment.

> I can confirm that indeed concatenating the features from different network depths seems to achieve better results (I used blocks 8, 16, 23 and 31).

Are you at liberty to share anything about your use case?
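For reference, a minimal sketch of the slicing described above, using a random stand-in tensor with a batch dimension; the fdim value here is just a placeholder, since RADIO's actual feature dimension depends on the backbone.

```python
import torch

B, fdim = 2, 1280                     # fdim is a placeholder, model dependent
feats = torch.randn(B, 4112, fdim)    # stand-in for one intermediate layer at 1024x1024 input

summary = feats[:, :16, :]            # cls + register tokens
spatial = feats[:, 16:, :]            # [B, 4096, fdim] patch tokens

# Optionally fold the patch tokens back onto the 64x64 grid for dense heads.
spatial_map = spatial.reshape(B, 64, 64, fdim).permute(0, 3, 1, 2)
print(spatial_map.shape)              # torch.Size([2, 1280, 64, 64])
```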

@mfischer-ucl (Author)

Thanks for the quick answer, that's reassuring and good to know.

We're using the features as input to a downstream attention block that learns dense segmentation. I can't disclose more at the moment, but I will update here once some work is published :) Thanks again! 👍
