PyTorch implementation of the Deep Q-Learning (DQN) reinforcement learning algorithm to play the Atari Breakout game.
Demystifying Deep Reinforcement Learning
- OpenAI Gym with the Atari environment. [Installation for Linux] [For Windows]
- OpenCV
- Pytorch
- TensorboardX
- To train DQN: `python main.py --train_dqn`
- To test DQN: `python main.py --test_dqn --resume` (path to model weights)
- Saved Model Weights
Reference paper: Human-level Control through Deep Reinforcement Learning. [Link]
We store states, actions, and rewards in memory for experience replay. We sample random minibatches from this buffer to train the model, which helps decorrelate the inputs. The number of frames that can be stored in this buffer depends on the size of your RAM / GPU memory. In my implementation, I used a cyclic replay buffer of 0.4M frames. [1] For the first 50000 steps of training we do not train the model; we only use these steps to fill the replay buffer to an initial capacity.
An important consideration is the amount of RAM 0.4M frames will consume. If we store scaled versions of the frames (dtype: np.float32) in the buffer, each frame costs about 0.12 MB, so 0.4M frames would require around 45 GB, which we obviously don't want. To make efficient use of memory, do not scale the frames; simply store each frame in np.uint8 format in the buffer and convert it to float32 only when required. In np.uint8 format, the total memory required for 0.4M frames is around 10-11 GB.
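Below is a minimal sketch of such a cyclic buffer. The class and method names are illustrative rather than the exact ones used in this repository, and for brevity it stores the full 4-frame stack for both the current and next state (the actual implementation can save further memory by storing individual frames). The key point is that frames stay in np.uint8 and are only converted to float32 at sampling time.

```python
import numpy as np

class ReplayBuffer:
    """Cyclic replay buffer sketch that keeps frames as np.uint8."""

    def __init__(self, capacity=400_000, state_shape=(4, 84, 84)):
        # 0.4M entries as in the text; uint8 storage keeps memory manageable.
        self.capacity = capacity
        self.states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.next_states = np.zeros((capacity, *state_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.pos = 0   # next slot to overwrite (cyclic)
        self.size = 0  # number of valid entries so far

    def push(self, state, action, reward, next_state, done):
        self.states[self.pos] = state
        self.actions[self.pos] = action
        self.rewards[self.pos] = reward
        self.next_states[self.pos] = next_state
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=32):
        idx = np.random.randint(0, self.size, size=batch_size)
        # Convert uint8 frames to float32 (and rescale) only at sampling time.
        states = self.states[idx].astype(np.float32) / 255.0
        next_states = self.next_states[idx].astype(np.float32) / 255.0
        return states, self.actions[idx], self.rewards[idx], next_states, self.dones[idx]
```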
attari_wrapper_openai.py modifies the original Atari environment to add the functionality implemented in DeepMind's paper. It also applies a pre-processing function that converts the original 210x160x3 frame to an 84x84 grayscale frame and stacks the 4 most recent frames to form the 4x84x84 input forwarded to the CNN model. Make sure you do not set the scale parameter to True, to avoid the memory issues described above.
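For reference, a rough sketch of this pre-processing step using OpenCV is shown below; the function name is illustrative, and the actual wrapper follows the OpenAI-baselines-style wrappers rather than this exact code.

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame):
    """Convert a 210x160x3 RGB Atari frame to an 84x84 grayscale uint8 frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                       # 210x160
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)   # 84x84
    return resized.astype(np.uint8)

# Stacking the 4 most recent pre-processed frames gives the 4x84x84 network input.
frame_stack = deque(maxlen=4)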
We start with an initial epsilon value of 1.0 for the first 50000 steps; over the next 1M steps, epsilon is linearly decreased to a final value of 0.01, which is then kept constant until training terminates.
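A simple linear schedule along those lines might look like the following sketch (the function name and keyword arguments are illustrative, not the exact ones used in the code):

```python
def epsilon_by_step(step, eps_start=1.0, eps_final=0.01,
                    warmup_steps=50_000, decay_steps=1_000_000):
    """Epsilon stays at 1.0 during the 50k warm-up steps, is linearly
    annealed to 0.01 over the next 1M steps, and is constant afterwards."""
    if step < warmup_steps:
        return eps_start
    frac = min(1.0, (step - warmup_steps) / decay_steps)
    return eps_start + frac * (eps_final - eps_start)
```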
From step 50000 onwards, we start to optimize the model: every 4 steps into the episode, we sample a random batch of frames, compute the loss using the policy network and the target network, and update the weights. This is implemented in the optimize_model() method. I ran the training for a total of around 5M steps, roughly 50k episodes.
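A minimal sketch of what such an update step could look like is given below. Here policy_net, target_net, optimizer, and the replay buffer are assumed to already exist, and the actual optimize_model() in this repository may differ in details (loss function, gradient clipping, device handling).

```python
import torch
import torch.nn.functional as F

def optimize_model(policy_net, target_net, optimizer, buffer,
                   batch_size=32, gamma=0.99, device="cpu"):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states = torch.as_tensor(states, device=device)
    actions = torch.as_tensor(actions, device=device).unsqueeze(1)
    rewards = torch.as_tensor(rewards, device=device)
    next_states = torch.as_tensor(next_states, device=device)
    dones = torch.as_tensor(dones, dtype=torch.float32, device=device)

    # Q(s, a) from the policy network for the actions actually taken.
    q_values = policy_net(states).gather(1, actions).squeeze(1)

    # Bootstrapped target from the target network: r + gamma * max_a' Q_target(s', a'),
    # with the bootstrap term zeroed at episode termination.
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```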
Fig. 1: Reward per episode
Fig. 2: Average reward per 100 episodes
Fig. 3: Episode length per episode
Fig. 4: Average episode length per 100 episodes
Fig. 5: Loss per episode
Fig. 6: Average loss per 100 episodes
You can run this code on Google Colab; it takes around 4-5 hours of training to achieve the above results. Make sure you don't change the seed if you want to reproduce them.