Current progress & Near‐term plan

This version uses a CNN-based representation network and an LSTM-based dynamics network for RGB-input environments (LunarLander-v2, with RGB states obtained by wrapping the environment in PixelObservationWrapper(self.env)).
It works on randomly initialized environments (random terrain and acceleration), but it does not yet reliably converge to a score above 200.
Training takes more than 10 hours and an experiment length above 5000 when using the deep CNN and an environment with a non-fixed random seed.
Further improvement is needed.
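
One way to make the "more than 200 score" target measurable is to track a moving average of episode returns. The sketch below is only an illustration and is not taken from this repository: the function name `first_solved_episode`, the 100-episode window, and the strict `>` comparison are assumptions (a 100-episode average above 200 is a common convention for LunarLander-v2).

```python
from collections import deque


def first_solved_episode(returns, threshold=200.0, window=100):
    """Return the 1-based episode index at which the moving average of the
    last `window` returns first exceeds `threshold`, or None if it never does.

    Hypothetical helper: name, window size, and threshold are assumptions.
    """
    recent = deque(maxlen=window)  # holds the most recent `window` returns
    total = 0.0                    # running sum of the values in `recent`
    for episode, ret in enumerate(returns, start=1):
        if len(recent) == window:
            total -= recent[0]     # this value is evicted by the append below
        recent.append(ret)
        total += ret
        if len(recent) == window and total / window > threshold:
            return episode
    return None
```

Calling it on the logged per-episode returns of a training run gives the first episode at which the run meets the criterion, or None if it never does, which makes runs with different random seeds easier to compare.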
