Internal Discretization Figure #6

Open
baolinv opened this issue Jul 12, 2023 · 6 comments

Comments

@baolinv

baolinv commented Jul 12, 2023

Hi, thanks for your great work!
I am wondering how panel (d), Internal discretization, in Figure 1 of the paper was generated.

From the sentence "(QiKi) is the spatial location for which each specific IDR is responsible", at the end of the fourth page of the paper, I infer that the ID is given by the maximum value of (QiKi).

Could you provide me with the concrete computation process?

@lpiccinelli-eth
Collaborator

Thank you for your appreciation!
The figure you mention was produced with the following recipe. We took the attention maps of the first iteration/attention layer of the ISD heads (the second layer is a residual update, so it was less meaningful). Each selected attention map (see below) was then upsampled to the output resolution (1/4 of the input image resolution) and equalized by thresholding it at 0.5 and 0.99 and rescaling it to [0, 1] with, e.g.:

    low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98)
    attn_map = torch.clamp(attn_map, low_q, up_q)
    attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())

The equalization was done to prevent a fog effect in some maps, which is why there are some gaps in the visualization maps.

The selected attention maps came (I think, but I am not sure, and the code is not immediately at hand) from IDRs number 0, 13, 30, and 31 at the lowest resolution and IDR number 2 at the highest resolution.
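
A minimal sketch of the recipe described above, assuming each attention map is a 2D tensor of shape (h, w) and that the quantiles are computed per map. The function name `equalize_attention`, the random input, and the (120, 160) output size are illustrative assumptions, not taken from the repository. The 0.98 upper quantile follows the snippet in the comment, while the prose mentions 0.99.

```python
import torch
import torch.nn.functional as F


def equalize_attention(attn_map, out_size, low=0.5, high=0.98):
    """Upsample one (h, w) attention map and rescale it to [0, 1]."""
    # F.interpolate expects a (batch, channel, height, width) tensor.
    up = F.interpolate(
        attn_map[None, None].float(), size=out_size,
        mode="bilinear", align_corners=False,
    )[0, 0]
    # Clamp to the chosen quantile range to suppress the low-value "fog".
    low_q = torch.quantile(up, low).item()
    up_q = torch.quantile(up, high).item()
    up = torch.clamp(up, low_q, up_q)
    # Guard against a constant map, where max == min.
    span = (up.max() - up.min()).clamp_min(1e-12)
    return (up - up.min()) / span


# Hypothetical input: 32 IDR attention maps at a coarse 15 x 20 resolution.
torch.manual_seed(0)
attn = torch.rand(32, 15, 20)
selected = [0, 13, 30, 31]  # indices quoted (with uncertainty) in the comment
maps = torch.stack([equalize_attention(attn[i], (120, 160)) for i in selected])
```

Each returned map has values in [0, 1], with everything below the lower quantile mapped to 0, which is what produces the gaps mentioned above.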

@baolinv
Author

baolinv commented Aug 7, 2023

Thank you very much for your detailed reply. However, I still don't get similar semantic regions.

Could you help me check where the problem is in the following code?

Attention:
depth_attn

import torch
import torch.nn as nn
from einops import rearrange


class ISDHead(nn.Module):
    # __init__ is omitted in this excerpt.
    def forward(self, feature_map: torch.Tensor, idrs: torch.Tensor, isshow_attn=True):
        b, c, h, w = feature_map.shape
        # Add the positional encoding and flatten the spatial dimensions.
        feature_map = rearrange(feature_map + self.pixel_pe(feature_map), "b c h w -> b (h w) c")
        depth_attn = None
        for i in range(self.depth):
            update = getattr(self, f"cross_attn_{i + 1}")(feature_map.clone(), idrs)
            feature_map = feature_map + update
            feature_map = feature_map + getattr(self, f"mlp_{i + 1}")(feature_map.clone())

            # Keep the output of the first cross-attention layer for visualization.
            if i == 0:
                depth_attn = update

        out = self.proj_output(feature_map)
        out = rearrange(out, "b (h w) c -> b c h w", h=h, w=w)
        if isshow_attn:
            return out, depth_attn
        return out

Generate ID from Attention:
cls_map

import numpy as np
import torch
import torch.nn.functional as F

# Move the last dimension to the channel axis: (1, h, w, N) -> (1, N, h, w).
attn_map = torch.reshape(attn_map, [1, h, w, -1]).permute(0, 3, 1, 2)
attn_map = F.interpolate(attn_map, size=(120, 160), mode="bilinear", align_corners=True)
low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98)
attn_map = torch.clamp(attn_map, low_q, up_q)
attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())
attn = attn_map.squeeze().cpu().numpy()
# Assign each pixel to the map with the highest value.
cls_map = np.argmax(attn, axis=0).astype(np.uint8)

Is cls_map the IDR map? I suspect the issue is here. On test images, this code does not give semantic regions similar to those in the paper.

I'm looking forward to your response.

@lpiccinelli-eth
Collaborator

The first snippet works fine, and I guess you are also returning depth_attn from the ISD class, as a list with one depth_attn per resolution. The second part should be a bit different.
I believe the pixel-wise argmax returns very noisy maps because some attention maps collapse onto each other, as shown in the paper (i.e., some attention maps become almost identical to others).
What we actually did was select some representative, i.e., different within each other, attention maps, e.g., 5 maps from the list attn, then threshold them at 0.5 (you can skip this depending on what you want to visualize), and then plot them on the image. The indices of the attention maps shown in the teaser figure should be (I am not absolutely certain) the ones listed in my first comment.
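
A minimal sketch of the "threshold, then plot on the image" step described above, as an alternative to the pixel-wise argmax. It assumes the maps are already equalized to [0, 1] and at the image resolution. The function name `overlay_attention`, the alpha blending, the colors, and the random inputs are assumptions for illustration, not the code used for the paper.

```python
import numpy as np


def overlay_attention(image, maps, colors, threshold=0.5, alpha=0.6):
    """Blend each thresholded attention map onto the image in its own color.

    image:  (H, W, 3) float array in [0, 1]
    maps:   (N, H, W) float array in [0, 1]
    colors: (N, 3) float array in [0, 1], one RGB color per map
    """
    out = image.astype(np.float64).copy()
    for attn, color in zip(maps, colors):
        mask = attn > threshold
        # Stronger attention gives a more opaque color; shape (K, 1).
        weight = alpha * attn[mask][:, None]
        out[mask] = (1.0 - weight) * out[mask] + weight * color
    return np.clip(out, 0.0, 1.0)


# Hypothetical inputs standing in for a real image and 5 selected maps.
rng = np.random.default_rng(0)
image = rng.random((120, 160, 3))
maps = rng.random((5, 120, 160))
colors = rng.random((5, 3))
vis = overlay_attention(image, maps, colors)
```

Pixels where no selected map exceeds the threshold keep the original image values, so uncovered regions stay visible instead of being forced into a class as with argmax.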

@baolinv
Author

baolinv commented Aug 9, 2023

Thank you very much for your quick response. I'm sorry to disturb you again.

You say:
"What we actually did was select some representative, i.e., different within each other"
However, it is difficult to define uniform rules for "different within each other".

I tried rules based on thresholds, variance, clustering, and so on, but could not generate an image as clean as the one shown in the paper.

If convenient, could you share your code by email? (my email: [email protected])

I really want to reproduce the result and I'm looking forward to your response.
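
One possible rule for "different within each other", sketched here as a hypothetical heuristic: greedily keep a map only if its cosine similarity to every map already kept stays below a cutoff, which filters out maps that have collapsed onto each other. This is not the authors' procedure (the thread suggests the maps were picked by inspection); the function name `select_distinct_maps` and the 0.9 cutoff are assumptions.

```python
import numpy as np


def select_distinct_maps(maps, k, max_similarity=0.9):
    """Greedily pick up to k map indices with low pairwise cosine similarity.

    maps: (N, H, W) array of attention maps.
    """
    flat = maps.reshape(len(maps), -1).astype(np.float64)
    # Normalize each flattened map so the dot product is the cosine similarity.
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.maximum(norms, 1e-12)
    similarity = flat @ flat.T
    selected = []
    for idx in range(len(maps)):
        if all(similarity[idx, j] < max_similarity for j in selected):
            selected.append(idx)
        if len(selected) == k:
            break
    return selected


# Toy example: maps 0 and 1 are identical (collapsed), map 2 is different.
top = np.zeros((4, 4))
top[:2] = 1.0
bottom = np.zeros((4, 4))
bottom[2:] = 1.0
toy_maps = np.stack([top, top.copy(), bottom])
picked = select_distinct_maps(toy_maps, k=2)  # [0, 2]
```

The result depends on the order of the maps and on the cutoff, so the selected indices will generally not match those quoted earlier in the thread.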

@fanshixiong

Hi. I have already configured the environment, but I don't know how to use your code to estimate the depth of an image. How can I get the depth map?

@suxuanya

suxuanya commented May 14, 2024

I am also wondering how panel (d), Internal discretization, in Figure 1 of the paper was generated. Could you share the code? Thanks.
