This repository implements tabular Q-Learning for discrete room-based navigation, comparing convergence properties under two configurations as part of IOTA 5201 (RL for Intelligent Decision Making in CPS).
- Config 1 (Baseline): Initial state B → Goal F (1-step optimal path)
  - Optimal return: 50
  - Convergence: ~50 episodes to 99.84% optimality
- Config 2 (Modified): Initial state A → Goal C (3-step optimal path)
  - Optimal return: 48
  - Convergence: ~100-150 episodes to 99.58% optimality
- Python 3.7+
- numpy (numerical computation)
- matplotlib (visualization)
- pyyaml (configuration files)
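If you need to recreate requirements.txt, a minimal file consistent with the dependencies listed above would look like the sketch below (versions are left unpinned here as an assumption; the repository's file may pin specific versions):

```text
numpy
matplotlib
pyyaml
```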
```bash
# Clone repository
git clone https://github.com/hazelian0619/RLMidterm.git
cd RLMidterm

# Install dependencies
pip install -r requirements.txt

# Run both configurations
python main.py
```
Output files:
- baseline_convergence.png (Config 1 convergence)
- modified_convergence.png (Config 2 convergence)
- comparison.png (Side-by-side comparison)
- q_values_config1.txt (Final Q-table, Config 1)
- q_values_config2.txt (Final Q-table, Config 2)
| Directory | Files | Purpose |
|---|---|---|
| core/ | environment.py, qlearner.py | Q-Learning algorithm & GridWorld environment |
| config/ | defaults.py | Hyperparameter configuration |
| experiments/ | runner.py, configs/*.yaml | Experiment execution & configs |
| analysis/ | metrics.py, plotter.py | Result analysis & visualization |
| utils/ | logging.py | Utility functions |
| Root | main.py, requirements.txt | Entry point & dependencies |
- main.py: Run `python main.py` to execute both configurations
- requirements.txt: Install dependencies with `pip install -r requirements.txt`
- experiments/configs/baseline.yaml: Configuration 1 (B→F, 1-step); see the sketch after this list
- experiments/configs/modified.yaml: Configuration 2 (A→C, 3-step)
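A hypothetical sketch of what experiments/configs/baseline.yaml could contain, based only on the states and hyperparameters described in this README; the actual key names and structure in the repository may differ:

```yaml
# Hypothetical layout -- key names are assumptions, not copied from the repo.
experiment: baseline
environment:
  initial_state: B
  goal_state: F
agent:
  alpha: 0.9      # learning rate
  gamma: 0.8      # discount factor
  epsilon: 0.1    # exploration rate
```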
```text
Q(s,a) ← Q(s,a) + α·[r + γ·max_a' Q(s',a') − Q(s,a)]
```
Where:
- α = 0.9: Learning rate
- γ = 0.8: Discount factor
- ε = 0.1: Exploration rate (ε-greedy)
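A minimal, self-contained Python sketch of this update and the ε-greedy rule, using the hyperparameters above; the names here are illustrative and are not the API of core/qlearner.py:

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.9, 0.8, 0.1   # hyperparameters listed above
rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, n_actions):
    """With probability EPSILON explore uniformly, otherwise act greedily."""
    if rng.random() < EPSILON:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(Q, s, a, r, s_next):
    """Tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
```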
| Metric | Config 1 | Config 2 | Config 2 vs. Config 1 |
|---|---|---|---|
| Episodes to convergence | 50 | 100-150 | 2-3× |
| Mean return (final 100 episodes) | 49.92 | 47.80 | −2.12 (difference) |
| Optimality | 99.84% | 99.58% | ~equal |
| Path length | 1 step | 3 steps | 3× |
| Final Q-table norm | 170.34 | 320.90 | 1.88× |
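The optimality figures are consistent with dividing the final-100-episode mean return by the optimal return; the snippet below is an inference from the numbers in the table, not code taken from analysis/metrics.py:

```python
# Optimality ≈ mean return over the final 100 episodes / optimal return
config1_optimality = 49.92 / 50    # 0.9984  -> 99.84%
config2_optimality = 47.80 / 48    # ~0.9958 -> 99.58%
```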
Config 1 (Baseline): shortest path with a direct connection
- Initial state: B
- Goal state: F
- Reward structure: R(B,F)=50, R(E,F)=50, all other transitions = −1

Config 2 (Modified): longer path requiring goal reorientation
- Initial state: A
- Goal state: C
- Reward structure: R(D,C)=50, R(C,C)=50
- Old goal: R(B,F)=R(E,F)=R(F,F)=−1 (must be unlearned)
- Optimal path: A → E → D → C
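A small sketch of the two reward structures as Python dicts, listing only the transitions named above; the full room adjacency is defined in core/environment.py and may contain additional edges:

```python
DEFAULT_REWARD = -1  # any transition not listed explicitly

# Config 1: goal F
REWARDS_CONFIG1 = {("B", "F"): 50, ("E", "F"): 50}

# Config 2: goal C, with the old goal's rewards reset so the agent must unlearn them
REWARDS_CONFIG2 = {("D", "C"): 50, ("C", "C"): 50,
                   ("B", "F"): -1, ("E", "F"): -1, ("F", "F"): -1}

def reward(table, state, next_state):
    """Look up R(state, next_state), falling back to the default step penalty."""
    return table.get((state, next_state), DEFAULT_REWARD)
```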
After running main.py, you will get:
- baseline_convergence.png: Config 1 training curves (a minimal plotting sketch appears after this list)
  - Left: Episode returns over time
  - Right: Q-table L1 norm convergence
- modified_convergence.png: Config 2 training curves
  - Shows slower convergence due to the 3-step optimal path
- comparison.png: Direct convergence comparison
  - Blue: Config 1 (rapid convergence)
  - Orange: Config 2 (slower convergence)
- q_values_config1.txt / q_values_config2.txt: Final learned Q-values for analysis
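For illustration, a minimal matplotlib sketch of a two-panel convergence figure like baseline_convergence.png; this is not the code in analysis/plotter.py, and the returns/q_norms arrays are hypothetical stand-ins for the logged training data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for per-episode logs produced during training.
returns = np.random.default_rng(0).normal(loc=50, scale=2, size=200)
q_norms = np.cumsum(np.abs(np.random.default_rng(1).normal(size=200)))

fig, (ax_ret, ax_norm) = plt.subplots(1, 2, figsize=(10, 4))
ax_ret.plot(returns)
ax_ret.set(title="Episode returns", xlabel="Episode", ylabel="Return")
ax_norm.plot(q_norms)
ax_norm.set(title="Q-table L1 norm", xlabel="Episode", ylabel="||Q||_1")
fig.tight_layout()
fig.savefig("baseline_convergence.png")
```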
- Convergence Scaling: Training time scales roughly linearly with path complexity (2-3× more episodes for a 3× longer path)
- Optimality Robustness: Both configurations achieve >99% optimality despite different complexity levels
- Exploration Efficiency: Config 2 has lower per-episode deficits (0.0013 vs 0.0016) despite the longer path
- Hyperparameter Transferability: The chosen values (α=0.9, γ=0.8, ε=0.1) work well for both tasks
See report.pdf for complete analysis including:
- Theoretical background and significance
- Environment and reward structure details
- Initial/goal state change impact analysis
- Convergence behavior analysis
- Theoretical validation and CPS applications
Lian Wanchen
Information & Artificial Intelligence Track
The Hong Kong University of Science and Technology (Guangzhou)
Student ID: 50018021
- Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.