Questions Related to Multiagent Evaluation #186
Comments
I would like to provide some corrections and additional details related to my previous post about the multiagent experiment. To state the conclusion first, it appears that the problem I encountered with the multiagent setup also occurs in the single agent scenario. An important detail I initially omitted is that I set |
Hi @paehal

W.r.t. 1., I don't think it's the issue, but you might double-check by following the SB3 instructions to visualize the default models.

W.r.t. 2., I am a bit confused: if the evaluated policy scores the same as you saw in training, that tends to rule out 1. You also say that you obtain twice the reward of the single-agent case, which leads you to believe the learning should be successful, but then that the same problem (what problem?) "also occurs in the single agent scenario".

In general, do not assume that a high reward necessarily means the system is behaving as you desire. RL is known to "game" the simulation: are you sure that the high reward you see can be achieved if and only if the drones move as/where you want?

Observation length and control frequency CAN affect learning and control performance, but they should not be "breaking" anything. Note, however, that the number of steps (and hence the frequency) is proportional to the number of times you collect reward, so it can change the reward's value per episode. |
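The point about step count and reward can be made concrete with a little arithmetic. This is a minimal sketch that assumes a constant reward at every control step; the function name and the numbers are illustrative and not taken from the repository.

```python
def episode_return(reward_per_step: float, ctrl_freq_hz: int, episode_len_s: float) -> float:
    """Return of an episode in which the same reward is collected at every control step."""
    n_steps = int(ctrl_freq_hz * episode_len_s)
    return reward_per_step * n_steps


# Same behaviour and same episode duration, but a different control frequency.
slow = episode_return(1.0, 30, 8.0)  # 240 control steps
fast = episode_return(1.0, 60, 8.0)  # 480 control steps
print(slow, fast)
```

Under this assumption, doubling the control frequency doubles the episode return without any change in behaviour, so returns are only directly comparable between runs that share the same number of steps per episode.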
Apologies for the delayed response, and thank you for your answer. I have figured out the cause of the issue. I was setting the target location for the drone movement in the I have two additional questions related to this multiagent simulation. If you know, could you please enlighten me?
I appreciate your help and look forward to your response. |
W.r.t. 1, I think the easiest way is to modify the desired DRL agent in SB3 to hold a collection of actor and policy networks (a pair for each agent), then simply slice the observation when training and recombine the actions when predicting/testing. Effectively, you have N independent RL problems and agents, but note that the environment of each is no longer stationary.

W.r.t. 2, you should be able to simply force the GUI for the training environment (for example, by changing the defaults in the constructors), but it would lead to incredibly slow training. I am not sure it will work too well, especially with multiple agents. |
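The slice-and-recombine idea for prediction might look roughly like the sketch below. It is not an SB3 modification: `IndependentPolicies` is a hypothetical wrapper, the per-agent policies are plain callables standing in for trained networks, and the shapes (one observation row per agent, a 15-value observation, a 4-value action) are assumptions for illustration.

```python
import numpy as np


class IndependentPolicies:
    """Holds one policy per agent; each policy sees only its own observation row."""

    def __init__(self, policies):
        # `policies`: list of callables mapping a 1-D observation to a 1-D action
        # (hypothetical stand-ins for per-agent networks).
        self.policies = policies

    def predict(self, obs: np.ndarray) -> np.ndarray:
        # Assumed layout: obs has shape (num_agents, obs_dim).
        if obs.shape[0] != len(self.policies):
            raise ValueError("expected one observation row per agent")
        # Slice: agent i only receives row i of the joint observation.
        actions = [policy(obs[i]) for i, policy in enumerate(self.policies)]
        # Recombine into the (num_agents, act_dim) array the environment expects.
        return np.stack(actions, axis=0)


# Two toy policies in place of trained networks.
policies = [lambda o: -o[:4], lambda o: 0.5 * o[:4]]
joint = IndependentPolicies(policies)
obs = np.ones((2, 15))
actions = joint.predict(obs)
print(actions.shape)
```

Training each network on its own slice is the harder part and is not shown here; as noted above, each agent then faces a non-stationary environment because the other agent's policy keeps changing.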
Thank you for your response. Are you suggesting that we should set up multiple models and train them multiple times, as you've described? As I asked earlier, my understanding is that in a simulation with multiple agents, the same policy model is used for all agents. Therefore, if we set up a different model for each agent, does that mean we need to train each of them separately? In any case, it seems like this would be a fairly complex modification, wouldn't it?
|
I would do the modification inside PPO, creating multiple independent networks that operate on different parts of the environment's obs and act vectors. But yes, it requires understanding the SB3 implementation in some depth. |
Thanks, I'll ask the experts on the Stable-Baselines3 GitHub. |
I apologize for any confusion on my part, but I would like to clarify one thing. The |
I am currently testing a task involving two agents, each moving to a specified target location. To facilitate this, I've configured `self.TARGET_POS` in the `__init__` of `MultiHoverAviary.py` to set different target locations for each episode as follows:

Consequently, I have modified `obs_12` in `BaseRLAviary.py`'s `_computeObs` function to include a length-3 vector related to the target position, renaming it to `obs_15`.

Initially, I conducted training and inference with a single agent and confirmed successful learning towards the desired values. When training with multiple agents (N=2) on the same task, the reward obtained during learning was approximately twice that of a single agent, suggesting nearly ideal learning.
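The observation change described above (12 values extended to 15) can be illustrated with a small NumPy sketch. The original modification is not shown in the thread, so this is only an assumption about its shape: `append_target` is a hypothetical helper, and it appends the absolute target position, whereas the actual change may have used a different target-related vector.

```python
import numpy as np


def append_target(obs_12, target_pos):
    """Append a 3-value target vector to a 12-value per-drone observation."""
    obs_12 = np.asarray(obs_12, dtype=np.float32)
    target_pos = np.asarray(target_pos, dtype=np.float32)
    if obs_12.shape != (12,) or target_pos.shape != (3,):
        raise ValueError("expected a 12-value observation and a 3-value target")
    return np.concatenate([obs_12, target_pos])


# Two drones, each with its own (illustrative) target.
targets = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
obs = np.stack([append_target(np.zeros(12), t) for t in targets])
print(obs.shape)
```

Whatever vector is appended, the observation space declared by the environment would also need to match the new length, and each drone's row should carry that drone's own target.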
The reward function is set as follows:
The issue arises in eval mode, where the system doesn't perform well. Specifically, the agents fail to approach the designated targets. Notably, agent 1 always performs better than agent 2.
After several debugging attempts, I suspect a few causes and would appreciate any insights:
`evaluate_policy`, the mean reward is about 3600, which is comparable to the trained values. Hence, I wonder if the problem lies not in `evaluate_policy` but in the following predict function:

Any advice on these issues would be greatly appreciated. Thank you for your time and assistance.
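One way to separate the two suspects is to run a hand-written rollout and compare its return with the figure reported by `evaluate_policy`. The sketch below is not the loop from the original post (which is not shown). It assumes a Gymnasium-style environment (`reset` returns `(obs, info)`, `step` returns a 5-tuple) and a model whose `predict` returns `(action, state)`, as SB3 models do; `ToyEnv` and `ToyModel` are hypothetical stubs so the example runs on its own.

```python
def rollout_return(model, env, deterministic=True, max_steps=10_000):
    """Run one episode with model.predict and return the summed reward."""
    obs, _info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action, _state = model.predict(obs, deterministic=deterministic)
        # Always feed the observation returned by the latest step back into predict.
        obs, reward, terminated, truncated, _info = env.step(action)
        total += float(reward)
        if terminated or truncated:
            break
    return total


class ToyEnv:
    """Hypothetical stub: reward 1.0 per step, truncated after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, False, self.t >= self.horizon, {}


class ToyModel:
    """Hypothetical stub policy that always outputs the same action."""

    def predict(self, obs, deterministic=True):
        return 0.0, None


print(rollout_return(ToyModel(), ToyEnv()))
```

If a loop like this gives a return close to the `evaluate_policy` value for the same model and environment, the prediction path is probably fine and the reward definition itself deserves a closer look; a large gap would point at the custom loop instead, for example a stale observation or a different `deterministic` setting.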