
Current progress & Near-term plan

Itomigna2 edited this page Mar 18, 2024 · 5 revisions

Current progress

  • This version uses a CNN-based representation network plus an LSTM-based dynamics network for the RGB-input environment (LunarLander-v2, with RGB states provided by `PixelObservationWrapper(self.env)`).
  • It works on randomly initialized environments (random terrain and acceleration), but does not yet converge reliably to scores above 200.
  • Training with a non-fixed environment seed takes several hours with the deep CNN, requiring experiment lengths longer than 5000.
  • Training with a fixed environment seed takes up to 1 hour with the deep CNN, using experiment lengths of 200~2000. The agent can learn perfectly in this case.
  • Further improvement is needed.
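The bullets above describe feeding RGB frames (obtained via `PixelObservationWrapper`) into a CNN representation network. A minimal sketch of the kind of frame preprocessing such a pipeline typically needs is shown below; the downsampling stride and the 400x600 frame size (LunarLander's default render resolution) are assumptions for illustration, not values taken from this repository:

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, stride: int = 4) -> np.ndarray:
    """Downsample an RGB frame by strided slicing and scale to [0, 1].

    frame: H x W x 3 uint8 array, e.g. the "pixels" entry returned by
    gym's PixelObservationWrapper around LunarLander-v2.
    """
    small = frame[::stride, ::stride, :]      # naive spatial downsample
    return small.astype(np.float32) / 255.0   # normalize for the CNN input

# Example with a dummy 400x600 frame (LunarLander's default render size)
dummy = np.zeros((400, 600, 3), dtype=np.uint8)
print(preprocess_frame(dummy).shape)  # (100, 150, 3)
```

In practice the preprocessed frames would be stacked or fed sequentially into the representation network before the LSTM dynamics network unrolls the latent state.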

Near-term plan

  • Write the docs on the wiki.
  • Check the nni config setting and try to fix issue #5.
  • Optimize the code and the nni experiment settings for efficiency.
  • Try off-policy correction methods such as V-trace and Retrace.
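As a reference for the last point, a hedged numpy sketch of the V-trace target computation (Espeholt et al., IMPALA) is given below. The function name and array layout are assumptions and not part of this repository's code:

```python
import numpy as np

def vtrace_targets(values, next_values, rewards, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets for one trajectory of length T.

    values:      V(x_t) under the current value function, shape (T,)
    next_values: V(x_{t+1}), shape (T,)
    rewards:     r_t, shape (T,)
    rhos:        importance ratios pi(a_t|x_t) / mu(a_t|x_t), shape (T,)
    """
    clipped_rhos = np.minimum(rho_bar, rhos)   # truncated IS weights
    cs = np.minimum(c_bar, rhos)               # "trace cutting" coefficients
    deltas = clipped_rhos * (rewards + gamma * next_values - values)
    # Backward recursion:
    # vs_t - V(x_t) = delta_t + gamma * c_t * (vs_{t+1} - V(x_{t+1}))
    acc = 0.0
    corrections = np.zeros_like(values, dtype=np.float64)
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections
```

With all importance ratios equal to 1 (on-policy) and gamma = 1, the targets reduce to the ordinary returns-to-go, which makes the correction easy to sanity-check before wiring it into the training loop.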