The first graph shows three approaches:
- Baseline PPO (Blue)
- GNN with Random Policy Data (Orange)
- GNN with Partial Policy Data (Green)
Key observations:
- All three approaches eventually reach similar performance (~500 reward)
- The baseline PPO actually learns slightly faster initially
- Using GNN embeddings (both random and partial policy) doesn't provide significant advantages
- GNN with partial policy data shows slightly slower initial learning
Main issues:
- CartPole is too simple: basic PPO solves it quickly, leaving little room for improvement
- The GNN embeddings weren't trained with any meaningful objective
- The discretization scheme might be too crude for the continuous state space
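To make the discretization concern concrete, here is a minimal sketch of how a crude binning scheme might map CartPole's continuous 4-D state to graph nodes. The bounds and bin counts are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

# Assumed discretization: bin each state dimension independently, then
# flatten the multi-index into a single node id. Coarse bins like these
# can merge states that need different actions, which is the "too crude"
# failure mode noted above.
BOUNDS = np.array([[-2.4, 2.4],    # cart position
                   [-3.0, 3.0],    # cart velocity (clipped)
                   [-0.21, 0.21],  # pole angle (rad)
                   [-3.0, 3.0]])   # pole angular velocity (clipped)
BINS = np.array([6, 6, 8, 8])      # bins per dimension (illustrative)

def discretize(state):
    """Return a single integer node id for a continuous state."""
    s = np.clip(state, BOUNDS[:, 0], BOUNDS[:, 1])
    frac = (s - BOUNDS[:, 0]) / (BOUNDS[:, 1] - BOUNDS[:, 0])
    idx = np.minimum((frac * BINS).astype(int), BINS - 1)
    # Row-major flattening: one node per occupied cell of the grid.
    return int(np.ravel_multi_index(idx, BINS))

node = discretize(np.array([0.0, 0.5, 0.05, -0.2]))
```

With 6x6x8x8 bins this yields at most 2304 nodes, most of which a policy never visits, so the resulting graph is both coarse and sparse.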
The second graph shows:
- Baseline PPO (Blue)
- Value-Trained GNN + PPO (Orange)
Key observations:
- Initial learning is slower with the GNN approach
- Eventually both converge to similar performance
- More variance in the GNN approach's learning curve
- The final performance is about the same for both (around -100 reward)
Main issues:
- Value prediction training might not be capturing the most useful information
- Only 10 epochs of GNN training might be insufficient
- The embeddings aren't being updated during RL training
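For reference, the value-prediction objective behind these issues can be sketched in a few lines. This is an assumed, simplified setup (plain NumPy, one message-passing step, a linear value head, random stand-in targets), not the actual training code, which presumably used a proper GNN library:

```python
import numpy as np

# Minimal sketch: node embeddings for discretized states are trained to
# regress value targets through one round of neighborhood averaging.
# All sizes and the random targets below are illustrative assumptions.
rng = np.random.default_rng(0)
N, d = 16, 4                            # nodes, embedding size
A = (rng.random((N, N)) < 0.2).astype(float)
A = A + np.eye(N)                       # add self-loops
A = A / A.sum(axis=1, keepdims=True)    # row-normalized adjacency

E = rng.normal(scale=0.1, size=(N, d))  # learnable node embeddings
w = rng.normal(scale=0.1, size=d)       # linear value head
v_target = rng.normal(size=N)           # stand-in value targets

lr = 0.1
losses = []
for _ in range(500):
    H = A @ E                           # one message-passing step
    err = H @ w - v_target              # per-node prediction error
    losses.append(np.mean(err ** 2))
    # Manual MSE gradients through the head and the propagation step.
    dH = (2.0 / N) * np.outer(err, w)
    w -= lr * (2.0 / N) * H.T @ err
    E -= lr * A.T @ dH
```

The limitation called out above is visible here: once this loop stops, `E` is frozen, so nothing the PPO agent later learns can reshape the embeddings.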
The key next steps are to:
- Move to a more challenging environment where graph structure could actually help
- Make the graph representation dynamic and updateable
- Use more sophisticated state abstraction methods
- Train the GNN with multiple relevant objectives
- Better incorporate edge features (actions and rewards)
- Consider exploration-specific metrics in the graph structure
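A few of these directions (dynamic updates, edge features for actions and rewards, an exploration metric) can be combined in one small data structure. This is a sketch under assumed names, not a committed design:

```python
from collections import defaultdict

# Hypothetical dynamic transition graph: edges are keyed by
# (state_node, action) and carry a visit count plus a running mean
# reward, so the graph stays updateable during RL training.
class TransitionGraph:
    def __init__(self):
        self.edges = defaultdict(
            lambda: {"next": None, "count": 0, "mean_reward": 0.0}
        )

    def update(self, node, action, next_node, reward):
        e = self.edges[(node, action)]
        e["next"] = next_node
        e["count"] += 1
        # Incremental mean keeps the edge feature cheap to maintain online.
        e["mean_reward"] += (reward - e["mean_reward"]) / e["count"]

    def novelty(self, node, action):
        # Exploration-specific metric: inverse-visit-count bonus.
        return 1.0 / (1 + self.edges[(node, action)]["count"])

g = TransitionGraph()
g.update(3, 0, 7, 1.0)
g.update(3, 0, 7, 0.0)
```

The `novelty` value could be added to the reward or used to bias action selection; either way the graph structure itself now carries exploration signal rather than just state adjacency.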