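The policy is updated with the common PPO loss, so the algorithm is actually maximising the reward. But the discriminator loss pushes the output for expert samples towards zero. In this case the logit (the value returned by `discriminator.forward()`) will be close to 0 for expert behaviour and close to 1 for fake (policy-generated) behaviour. The reward is the log of that logit, so it is larger when the behaviour is fake. PPO would then be rewarding the policy for looking fake rather than for looking like the expert.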
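To make the sign concern concrete, here is a minimal sketch in plain Python. It does not use the repository's code: the function names `log_reward` and `negated_log_reward` and the two sample discriminator outputs are hypothetical, chosen only to illustrate the convention described above (expert near 0, fake near 1). The negated variant is one possible fix under that assumption, not a confirmed description of what the project should do.

```python
import math


def log_reward(d_out: float) -> float:
    """Reward as described in this issue: log of the discriminator output."""
    return math.log(d_out)


def negated_log_reward(d_out: float) -> float:
    """Hypothetical alternative: -log(D), so outputs near 0 (expert) score higher."""
    return -math.log(d_out)


# Assumed discriminator outputs under the convention expert -> 0, fake -> 1.
expert_like = 0.1  # the discriminator thinks this looks like expert behaviour
fake_like = 0.9    # the discriminator thinks this looks policy-generated

for name, fn in (("log(D)", log_reward), ("-log(D)", negated_log_reward)):
    r_expert = fn(expert_like)
    r_fake = fn(fake_like)
    favoured = "fake" if r_fake > r_expert else "expert"
    print(f"{name:8s} expert={r_expert:+.3f} fake={r_fake:+.3f} favours {favoured}")
```

With `log(D)` the fake-looking sample gets the larger reward, which is the behaviour questioned here. With `-log(D)` the ordering flips and a reward-maximising PPO update would favour expert-like behaviour.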