Regular updates on deep learning, reinforcement learning, and their applications to combinatorial optimization problems.
This repository provides insights into a range of reinforcement learning algorithms, with clean and well-structured implementations.

- Deep Q Network (DQN)
  Implementation of the foundational DQN algorithm, which combines Q-learning with deep neural networks for decision making.
- Double DQN (DDQN)
  A more stable variant of DQN that reduces overestimation bias in the Q-value estimates.
- Dueling DQN
  An enhancement to DQN that separates the state-value and advantage functions, improving the network's performance (a sketch of the dueling head and the Double DQN target follows this list).
- REINFORCE
  A Monte Carlo policy gradient method that directly optimizes policy performance.
- REINFORCE with Baseline
  A variant of REINFORCE that reduces variance by subtracting a learned baseline value.
- Actor-Critic
  Combines the benefits of value-based and policy-based methods by learning both a policy and a value function.
- Advantage Actor-Critic (A2C)
  A synchronous version of Actor-Critic that uses the advantage function to optimize the policy.
- Proximal Policy Optimization (PPO)
  A state-of-the-art policy gradient method that ensures stable learning by limiting the size of each policy update (the policy-gradient losses are sketched after this list).
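The value-based entries above share two ideas that are easy to show in code: a dueling head that splits the state value V(s) from the action advantages A(s, a), and the Double DQN target, where the online network selects the next action and the target network evaluates it. The following is a minimal PyTorch sketch for illustration, not the repository's exact code; layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling architecture: a shared trunk feeding separate value and advantage streams."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.value(h), self.advantage(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

def double_dqn_target(online_net, target_net, reward, next_obs, done, gamma=0.99):
    """Double DQN target: the online network picks the action, the target network scores it."""
    with torch.no_grad():
        next_action = online_net(next_obs).argmax(dim=-1, keepdim=True)
        next_q = target_net(next_obs).gather(-1, next_action).squeeze(-1)
        return reward + gamma * (1.0 - done) * next_q
```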
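Likewise, the policy-gradient entries differ mainly in their loss functions. Below is a hedged sketch of the REINFORCE loss (with an optional baseline), the A2C actor-critic loss, and PPO's clipped surrogate; the coefficients and tensor shapes are illustrative assumptions.

```python
import torch

def reinforce_loss(log_probs, returns, baseline=None):
    """REINFORCE: maximize E[log pi(a|s) * G]; subtracting a baseline reduces variance."""
    advantage = returns - baseline if baseline is not None else returns
    return -(log_probs * advantage.detach()).mean()

def a2c_loss(log_probs, values, returns, value_coef=0.5):
    """Advantage Actor-Critic: policy term weighted by A = G - V(s), plus a critic regression term."""
    advantage = returns - values
    policy_loss = -(log_probs * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    return policy_loss + value_coef * value_loss

def ppo_clip_loss(new_log_probs, old_log_probs, advantage, clip_eps=0.2):
    """PPO clipped surrogate: limits how far the updated policy can move from the old one."""
    ratio = (new_log_probs - old_log_probs).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```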

- Graph Convolutional Network (GCN)
  Reference: Kipf, T. N., Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. (A minimal GCN layer is sketched after this list.)
- LSTM and Pointer Network, A2C, Greedy & Sampling for TSP
  Reference: Bello, I., Pham, H., Le, Q. V., et al. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016. (A simplified pointer decoding step is sketched after this list.)
- Embedding and Pointer Network, REINFORCE with Rollout Baseline, Greedy & Sampling for VRP
  Reference: Nazari, M., Oroojlooy, A., Snyder, L., et al. Reinforcement learning for solving the vehicle routing problem. Advances in Neural Information Processing Systems, 2018, 31.
- Multi-Head Self-Attention, REINFORCE, Active Search for TSP
  Reference: Bello, I., Pham, H., Le, Q. V., et al. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016. Active search solves each instance from scratch, continuing to learn and adjust the model by DRL during inference (see the active search sketch after this list).
- Multi-Head Self-Attention, REINFORCE, Greedy & Sampling for TSP
  Reference: Kool, W., Van Hoof, H., Welling, M. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018.
- Multi-Head Self-Attention, REINFORCE with Rollout Baseline, Greedy & Sampling for TSP
  Reference: Kool, W., Van Hoof, H., Welling, M. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018. The rollout baseline is a more effective strategy than a moving-average baseline (see the rollout baseline sketch after this list).
- Multi-Head Self-Attention, REINFORCE with Rollout Baseline, Greedy & Sampling for VRP
  Reference: Kool, W., Van Hoof, H., Welling, M. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018.
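
A minimal single-layer GCN in the propagation form of Kipf & Welling, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W), shown here as a dense-matrix sketch for illustration rather than an optimized sparse implementation:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer: normalize the adjacency (with self-loops) and propagate features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, h):
        # adj: (n, n) adjacency matrix, h: (n, in_dim) node features
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=-1).pow(-0.5)                 # D^{-1/2}
        norm_adj = deg_inv_sqrt.unsqueeze(-1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.linear(h))
```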
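The pointer-network and attention models above all decode a route one node at a time. The sketch below shows a simplified decoding step: score each node against the decoder query, mask visited nodes, and pick the next node greedily or by sampling. The plain dot-product scoring is a simplification; the referenced papers use additive attention (Bello et al., Nazari et al.) or scaled dot-product attention with tanh clipping (Kool et al.).

```python
import torch

def pointer_step(query, node_embeddings, visited_mask, sample=True):
    """One decoding step of a pointer mechanism over the remaining nodes."""
    # query: (batch, d), node_embeddings: (batch, n, d), visited_mask: (batch, n) bool
    scores = torch.einsum("bd,bnd->bn", query, node_embeddings)  # compatibility scores
    scores = scores.masked_fill(visited_mask, float("-inf"))     # forbid revisiting nodes
    probs = torch.softmax(scores, dim=-1)
    if sample:
        action = torch.multinomial(probs, 1).squeeze(-1)  # sampling decoding
    else:
        action = probs.argmax(dim=-1)                     # greedy decoding
    return action, probs
```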
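A schematic of active search as used with the Bello et al. model: the policy keeps being updated by REINFORCE on a single test instance at inference time, while the best tour seen so far is retained. `model.sample` and `tour_length` are hypothetical interfaces used only to show the control flow.

```python
def active_search(model, optimizer, instance, n_steps=100, batch_size=64, alpha=0.9):
    """Active search: keep optimizing the policy on one test instance during inference.
    model.sample(instance, batch_size) -> (tours, summed log-probs) and
    tour_length(instance, tours) are hypothetical interfaces for this sketch."""
    best_len, best_tour = float("inf"), None
    baseline = None
    for _ in range(n_steps):
        tours, log_probs = model.sample(instance, batch_size)
        lengths = tour_length(instance, tours)
        if lengths.min().item() < best_len:                 # keep the best tour found so far
            best_len = lengths.min().item()
            best_tour = tours[lengths.argmin()]
        mean_len = lengths.mean()
        baseline = mean_len if baseline is None else alpha * baseline + (1 - alpha) * mean_len
        loss = ((lengths - baseline).detach() * log_probs).mean()  # REINFORCE update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return best_tour, best_len
```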
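And a sketch of the greedy rollout baseline from Kool et al.: the advantage compares the cost of a sampled tour with the cost of a greedy rollout by a frozen copy of the best model so far (which the paper periodically replaces after a paired t-test). `model.sample`, `baseline_model.greedy`, and `tour_length` are hypothetical interfaces.

```python
import torch

def rollout_baseline_loss(model, baseline_model, instances):
    """REINFORCE with a greedy rollout baseline: advantage = sampled cost - greedy cost.
    model.sample, baseline_model.greedy, and tour_length are hypothetical interfaces."""
    tours, log_probs = model.sample(instances)               # stochastic rollout by current policy
    with torch.no_grad():
        greedy_tours, _ = baseline_model.greedy(instances)   # deterministic rollout, frozen weights
    advantage = tour_length(instances, tours) - tour_length(instances, greedy_tours)
    return (advantage.detach() * log_probs).mean()
```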
Stay tuned for more updates!