Loss fn might be wrong #1

KangOxford · 2024-09-02T14:24:24Z

They use the standard PPO loss update, so the algorithm is maximising the reward. But the discriminator loss pushes the expert outputs towards zero: the logit (discriminator.forward()) will be 0 for expert behaviour and 1 for fake (generated) behaviour. The reward is the log of that logit, so it is larger when the behaviour is fake. If so, PPO is being rewarded for looking less like the expert, not more.

KangOxford · 2024-09-02T14:25:23Z

In the loss, the generated samples are compared with ones and the expert samples are compared with zeros.
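
To make the sign problem concrete, here is a minimal sketch in plain Python. It assumes the discriminator output is a probability in (0, 1) trained with binary cross-entropy under the labelling described above (expert = 0, generated = 1). The values 0.1 and 0.9 and all function names are made up for illustration and are not taken from the repository; `reward_flipped` is just one possible alternative, not necessarily the intended fix.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one probability p in (0, 1) and a 0/1 label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Labelling described in this issue (assumption about the repo): expert -> 0, generated -> 1.
EXPERT_LABEL, GENERATED_LABEL = 0, 1

# Hypothetical outputs of a discriminator that has learned that labelling.
d_expert = 0.1     # pushed towards 0 for expert behaviour
d_generated = 0.9  # pushed towards 1 for generated behaviour


def reward_log_d(d):
    """Reward as described in the issue: log of the discriminator output."""
    return math.log(d)


def reward_flipped(d):
    """One possible alternative (assumption): log(1 - D), larger for expert-like output."""
    return math.log(1.0 - d)


# The discriminator loss is low when outputs match the labels above.
print("BCE expert vs 0:      ", bce(d_expert, EXPERT_LABEL))
print("BCE generated vs 1:   ", bce(d_generated, GENERATED_LABEL))

# With log(D), the generated sample gets the larger reward.
print("log(D) expert:        ", reward_log_d(d_expert))
print("log(D) generated:     ", reward_log_d(d_generated))

# With log(1 - D), the expert-like sample gets the larger reward.
print("log(1 - D) expert:    ", reward_flipped(d_expert))
print("log(1 - D) generated: ", reward_flipped(d_generated))
```

Since PPO maximises the reward, the `log(D)` form under this labelling favours behaviour the discriminator classifies as generated. Either swapping the labels or flipping the reward would make the two consistent.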
