Thank you for this excellent work. I still have a question about `rl_loss`, which is computed as `rl_loss = neg_reward * sample_out.loss`. Here `neg_reward` is obtained as `greedy_rouge - sample_rouge`, and `sample_out.loss` is the cross-entropy loss, i.e. `-LogP()`. However, the self-critical policy gradient training algorithm in the paper uses `LogP()`, which confuses me. Could you please explain this?
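To make my question concrete, here is a minimal sketch of the two formulas I am comparing (the numbers and variable names are placeholders of my own, not taken from this repo):

```python
import torch

# Placeholder numbers, just to make the sign question concrete (not from this repo).
log_p = torch.tensor(-2.5, requires_grad=True)  # LogP() of the sampled summary
sample_rouge = 0.40                             # ROUGE of the sampled sequence
greedy_rouge = 0.35                             # ROUGE of the greedy baseline sequence

# How I read the code:
ce_loss = -log_p                                # sample_out.loss, i.e. -LogP()
neg_reward = greedy_rouge - sample_rouge
rl_loss_code = neg_reward * ce_loss

# How I read the paper's self-critical objective:
# minimize -(sample_rouge - greedy_rouge) * LogP()
rl_loss_paper = -(sample_rouge - greedy_rouge) * log_p

# The two values come out with opposite signs, which is exactly what confuses me.
print(rl_loss_code.item(), rl_loss_paper.item())
```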
Update
I have read the SeqGAN code from SeqGAN. According to the policy gradient, the loss is computed as `loss += -out[j][target.data[i][j]] * reward[j]`, where `out` is the log-softmax output, so the author adds the "-" in order to use gradient descent later.
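For reference, here is a minimal runnable sketch of how I understand that SeqGAN loss (the shapes, names, and random inputs are my own, not the original code):

```python
import torch
import torch.nn.functional as F

# Paraphrase of my reading of the SeqGAN loss (shapes and names are mine, not the original code).
# out:    log-softmax over the vocabulary for each generated token, shape (seq_len, vocab_size)
# target: the sampled token ids, shape (seq_len,)
# reward: per-token reward from the discriminator rollout, shape (seq_len,)
seq_len, vocab_size = 5, 10
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
out = F.log_softmax(logits, dim=-1)
target = torch.randint(0, vocab_size, (seq_len,))
reward = torch.rand(seq_len)

loss = 0.0
for j in range(seq_len):
    # -LogP(token_j) * reward_j: the "-" turns "maximize the reward-weighted
    # log-likelihood" into a loss that gradient descent can minimize
    loss = loss + (-out[j, target[j]]) * reward[j]

loss.backward()  # descending on `loss` ascends on reward * LogP()
```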