Unstable adaptive temperature?

Hi, thanks for the innovative work! I'm trying to reproduce Transover++ with Transover, and I found that the adaptive temperature seems unstable. So I'm wondering if there's any other trick other than a linear layer described in the paper that needs to be applied?

In particular, Eq(3) may involve the following code adding to PhysicsAttention:
In __init__():
```
self.temperature0 = nn.Parameter(torch.ones([1, heads, 1, 1]) * 0.5)
self.to_delta_temp = nn.Linear(head_dim, 1) 
nn.init.zeros_(self.to_delta_tau.weight)
nn.init.zeros_(self.to_delta_tau.bias)
```
In forward():
```
temperature = self.temperature0 + self.to_delta_temp(x_mid)
slice_weights = self.softmax(self.in_project_slice(x_mid) / temperature)
```
With above code, gradients may explode and It seems hard to prevent zeros in the temperature variable above. Is there any trick on this? Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unstable adaptive temperature? #13

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Unstable adaptive temperature? #13

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions