Environments
Checkers
The map contains apples and lemons. The first player (red) is very sensitive and scores 10 for the team for an apple (green square) and −10 for a lemon (yellow square). The second player (blue) is less sensitive and scores 1 for the team for an apple and −1 for a lemon. There is a wall of lemons between the players and the apples. Apples and lemons disappear when collected, and the environment resets when all apples are eaten. The intended strategy is for the sensitive agent to eat the apples, while the less sensitive agent leaves them to its teammate and instead clears the way by eating the obstructing lemons.
- Reference Paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning (Section 4.2)
- Action Space:
0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
- Agent Observation:
Agent coordinates + 3x3 mask around the agent + steps in env.
- Best Score:
NA
Versions:
Name | Description |
---|---|
Checkers-v0 | Each agent receives only its own local observation |
Checkers-v1 | Each agent receives the local observations of all other agents |
Checkers-v3 | Each agent receives only its own local observation and the normalized step count |
Checkers-v4 | Each agent receives the local observations of all other agents and the normalized step count |
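For orientation, below is a minimal random-policy rollout sketch. It assumes `ma-gym` is installed and that importing `ma_gym` registers these IDs with gym, and it follows the list-per-agent API shown in ma-gym's own README (one observation, reward and done flag per agent); details may differ across gym versions. The same loop works for the other environments on this page by swapping the environment ID.

```python
# Minimal random-agent rollout sketch for Checkers-v0 (illustrative, not an official
# example). Assumes `pip install ma-gym` and that importing ma_gym registers the IDs.
import gym
import ma_gym  # noqa: F401  (importing registers the environments)

env = gym.make("Checkers-v0")
obs_n = env.reset()                      # one observation per agent
done_n = [False] * env.n_agents
total_reward = 0.0

while not all(done_n):
    actions = env.action_space.sample()  # one random action per agent
    obs_n, reward_n, done_n, info = env.step(actions)
    total_reward += sum(reward_n)

print("episode team reward:", total_reward)
env.close()
```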
Switch
It's a grid world environment with n agents, where each agent wants to move to its corresponding home location (marked by a box outlined in the same color).
The challenging part of the game is passing through the narrow corridor, through which only one agent can pass at a time. The agents need to coordinate so as not to block the pathway for each other. A reward of +5 is given to each agent for reaching its home cell.
The episode ends when both agents have reached their home cells or after a maximum of 100 steps in the environment.
- Action Space:
0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
- Agent Observation:
Agent coordinates + steps in env.
- Best Score:
NA
- Reference Paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning (Section 4.2)
Versions:
Name | Description |
---|---|
Switch2-v0 | Each agent receives only its own local position coordinates |
Switch2-v1 | Each agent receives the position coordinates of all other agents |
Switch2-v3 | Each agent receives only its own local observation and the normalized step count |
Switch2-v4 | Each agent receives the local observations of all other agents and the normalized step count |
Similar variants are available for Switch4
(Preview images: Switch2-v0 | Switch4-v0)
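To see concretely what the v0/v1 distinction in these version tables means, the sketch below prints the per-agent observation length for both variants. It assumes the same ma_gym registration and list-per-agent API as in the Checkers example above, and that each environment exposes an `n_agents` attribute; exact observation shapes may differ.

```python
# Hedged sketch: compare per-agent observation sizes of Switch2-v0 and Switch2-v1.
# Assumes reset() returns a list of flat per-agent observations, as in ma-gym's README.
import gym
import ma_gym  # noqa: F401

for env_id in ["Switch2-v0", "Switch2-v1"]:
    env = gym.make(env_id)
    obs_n = env.reset()
    print(f"{env_id}: {env.n_agents} agents, "
          f"per-agent observation length = {len(obs_n[0])}")
    env.close()
```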
PredatorPrey
Predator-prey involves a grid world in which multiple predators attempt to capture randomly moving, slow prey. We define the "catching" of a prey as the prey being in a cardinal-direction neighbour cell of at least one predator.
Unlike the general predator-prey setting, a positive reward is given only if multiple predators catch a prey simultaneously, requiring a higher degree of cooperation. The predators get a reward of 1 if two or more of them catch a prey at the same time, but they receive a negative reward of -P if only one predator catches the prey, as sketched below. We keep the value of P = 0.5 (this can be modified by registering a new variant of the environment; a registration sketch appears at the end of this section). The task terminates when all prey are dead or after a maximum of 100 steps in the environment. A prey is considered dead if it is caught by more than one predator.
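A minimal sketch of this capture-reward rule, purely to restate the description above; the helper function is hypothetical and is not ma-gym's actual implementation.

```python
# Hypothetical helper illustrating the cooperative capture reward described above.
def capture_reward(n_catching_predators: int, penalty: float = 0.5) -> float:
    """Reward contribution of a single prey at a single time step."""
    if n_catching_predators >= 2:
        return 1.0        # cooperative catch: team reward of 1
    if n_catching_predators == 1:
        return -penalty   # lone catch: penalty of -P (P = 0.5 by default)
    return 0.0            # prey not caught this step
```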
- Agent Observation:
Agent coordinates + agent ID + coordinates of the prey relative to itself in a 5x5 view, if observed
- Agent Action Space:
0: Down, 1: Left, 2: Up, 3: Right, 4: Noop
- Best Score:
NA
Versions:
We test with two different grid worlds:
- PredatorPrey5x5: a 5 × 5 grid world with two predators and one prey
- PredatorPrey7x7: a 7 × 7 grid world with four predators and two prey
Name | Description |
---|---|
PredatorPrey5x5-v0 | Each agent gets its own local observation |
PredatorPrey5x5-v1 | Each agent gets the local observation of every other agent |
PredatorPrey5x5-v2 | Each agent gets its own local observation, and the prey doesn't move after being randomly initialized |
PredatorPrey5x5-v3 | Each agent gets the local observation of every other agent, and the prey doesn't move after being randomly initialized |
Similar variants are available for PredatorPrey7x7
(Preview images: PredatorPrey5x5-v0 | PredatorPrey7x7-v0)
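As noted in the description, the penalty P can be changed by registering a new environment variant. The sketch below shows one way to do that with gym's registry; note that both the `entry_point` module path and the `penalty` keyword name are assumptions, so check ma_gym's own registration code and the PredatorPrey constructor for the actual names before relying on this.

```python
# Hedged sketch: register a PredatorPrey variant with a harsher lone-catch penalty.
# The entry_point path and the `penalty` kwarg name are assumptions; verify them
# against ma_gym's own env registration before use.
import gym
from gym.envs.registration import register

register(
    id="PredatorPrey5x5HighPenalty-v0",
    entry_point="ma_gym.envs.predator_prey:PredatorPrey",  # assumed module path
    kwargs={"penalty": -1.0},                               # assumed kwarg name
)

env = gym.make("PredatorPrey5x5HighPenalty-v0")
```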
Combat
We simulate a simple battle involving two opposing teams in a 15×15 grid. Each team consists of m = 5 agents, and their initial positions are sampled uniformly in a 5×5 square around the team center, which is picked uniformly in the grid. At each time step, an agent can perform one of the following actions: move one cell in one of four directions; attack another agent by specifying its ID j (there are m attack actions, each corresponding to one enemy agent); or do nothing. If agent A attacks agent B, then B's health points are reduced by 1, but only if B is inside the firing range of A (its surrounding 3×3 area). Agents need one time step of cooldown after an attack, during which they cannot attack. All agents start with 3 health points and die when their health reaches 0. A team wins if all agents in the other team die. The simulation ends when one team wins, or when neither team wins within 40 time steps (a draw).
The model controls one team during training, and the other team consists of bots that follow a hardcoded policy. The bot policy is to attack the nearest enemy agent if it is within its firing range; if not, it approaches the nearest visible enemy agent within visual range. An agent is visible to all bots if it is inside the visual range of any individual bot. This shared vision gives an advantage to the bot team.
When input to a model, each agent is represented by a set of one-hot binary vectors {i, t, l, h, c} encoding its unique ID, team ID, location, health points and cooldown. A model controlling an agent also sees other agents in its visual range (3×3 surrounding area). The model gets a reward of -1 if the team loses or draws at the end of the game. In addition, it also gets a reward of −0.1 times the total health points of the enemy team, which encourages it to attack enemy bots.
Reference: Learning Multiagent Communication with Backpropagation
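To make the {i, t, l, h, c} observation encoding concrete, here is a small sketch that builds the one-hot blocks for a single agent. The block sizes follow the prose (10 agents in total, 2 teams, a 15×15 grid, health 0..3, a binary cooldown flag), but the helper itself is an illustrative assumption, not the environment's actual code.

```python
# Illustrative sketch of the {i, t, l, h, c} one-hot encoding described above.
# Block sizes follow the prose; the helper is hypothetical, not Combat's implementation.
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_agent(agent_id, team_id, row, col, health, cooling_down,
                 n_agents=10, n_teams=2, grid=15, max_hp=3):
    return np.concatenate([
        one_hot(agent_id, n_agents),              # i: unique agent ID
        one_hot(team_id, n_teams),                # t: team ID
        one_hot(row * grid + col, grid * grid),   # l: location (flattened cell index)
        one_hot(health, max_hp + 1),              # h: health points (0..3)
        one_hot(int(cooling_down), 2),            # c: cooling down or not
    ])

# 10 + 2 + 225 + 4 + 2 = 243-dimensional vector for this parameterization.
print(encode_agent(agent_id=0, team_id=1, row=7, col=3, health=2, cooling_down=True).shape)
```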
PongDuel
Two-player Pong game.
- Action Space:
0: Noop, 1: Up, 2: Down
- Agent Observation:
Agent coordinates + ball location (head and tail)
- Best Score:
NA
Lumberjacks
Lumberjacks is a grid world environment where the agents are lumberjacks whose goal is to cut down all the trees in the world. Each tree is assigned a strength, which indicates how many lumberjacks are needed to cut it down. A tree is cut down automatically as soon as the number of agents standing in its cell is equal to or greater than the tree's strength. The episode ends as soon as all trees are cut down or the maximum step limit is reached.
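A minimal sketch of that cutting rule, as an illustration only; the function is hypothetical and not the environment's code.

```python
# Hypothetical illustration of the Lumberjacks cutting rule described above.
def tree_is_cut(n_agents_on_cell: int, tree_strength: int) -> bool:
    """A tree falls once at least `tree_strength` agents stand on its cell."""
    return n_agents_on_cell >= tree_strength

assert tree_is_cut(n_agents_on_cell=2, tree_strength=2)
assert not tree_is_cut(n_agents_on_cell=1, tree_strength=3)
```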
- Action Space:
0: Noop, 1: Down, 2: Left, 3: Up, 4: Right
- Agent Observation:
Agent ID + Agent Coordinate + Steps in env + Agent Surroundings
- Best Score:
NA
- Reference Paper: A Game-Theoretic Model and Best-Response Learning Method for Ad Hoc Coordination in Multiagent Systems
Versions:
Name | Description |
---|---|
Lumberjacks-v0 | Each agent receives only its local observation |
Lumberjacks-v1 | Each agent receives the local observations of all other agents |
TrafficJunction
This consists of a 4-way junction on a 14 × 14 grid. At each time step, "new" cars enter the grid with probability p_arrive from each of the four directions. However, the total number of cars present at any given time is limited to N_max.
Each car occupies a single cell at any given time and is randomly assigned to one of three possible routes (keeping to the right-hand side of the road). At every time step, a car has two possible actions: gas, which advances it by one cell along its route, or brake, which keeps it at its current location. A car is removed once it reaches its destination at the edge of the grid.
Two cars collide if their locations overlap. A collision incurs a reward r_coll = −10, but does not affect the simulation in any other way. To discourage traffic jams, each car also gets a reward of τ * r_time = −0.01τ at every time step, where τ is the number of time steps passed since the car arrived. Therefore, the total reward at time t is

r(t) = C^t * r_coll + Σ_{i=1}^{N^t} τ_i * r_time

where C^t is the number of collisions occurring at time t and N^t is the number of cars present. The simulation is terminated after `max_steps` (default: 40) steps and is classified as a failure if one or more collisions have occurred.
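A small sketch restating this reward computation in code; it is illustrative only, not the environment's implementation, and the function name is hypothetical.

```python
# Illustrative restatement of the Traffic Junction reward r(t) described above:
# r(t) = C^t * r_coll + sum_i tau_i * r_time, with r_coll = -10 and r_time = -0.01.
def traffic_reward(n_collisions, car_ages, r_coll=-10.0, r_time=-0.01):
    """Total team reward at one time step.

    n_collisions: C^t, the number of collisions occurring at this step.
    car_ages: tau_i for each car currently on the grid (steps since arrival).
    """
    return n_collisions * r_coll + sum(tau * r_time for tau in car_ages)

# Example: 1 collision, three cars on the grid for 2, 5 and 9 steps.
print(traffic_reward(1, [2, 5, 9]))  # -10 + (-0.01 * 16) = -10.16
```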
Each car is represented by a set of one-hot binary vectors {n, l, r} encoding its unique ID, current location and assigned route number, respectively. Each agent controlling a car can only observe other cars in its vision range (a surrounding 3 × 3 neighborhood), though low-level communication is allowed in the "v1" version of the game.
The state vector s_j for each agent is thus a concatenation of all these vectors, with dimension (3^2) × (|n| + |l| + |r|).
Reference: Learning Multiagent Communication with Backpropagation, https://papers.nips.cc/paper/6398-learning-multiagent-communication-with-backpropagation.pdf
Versions:
Name | Description |
---|---|
TrafficJunction4-v0 | Each agent gets its own local observation |
TrafficJunction4-v1 | Each agent gets the local observation of every other agent |
Same variants are available for 10 agents as TrafficJunction10-v{0, 1}
(Preview images: TrafficJunction4-v0 | TrafficJunction10-v0)
Fetch
The task tests whether two agents can synchronize their behavior when picking up objects and returning them to a drop point. In the Fetch task, both players start on the same side of the map and have pickup points on the opposite side. A player scores 3 points for the team for a pick-up, and another 5 points for dropping off the item at the drop point near the starting position. The pickup then becomes available to either player again. It is optimal for the agents to cycle such that when one player reaches the pickup point, the other returns to base, ready to pick up again.
- Best Score:
NA
- Status: Not developed at the moment. If interested, please start by raising an issue and we can take it forward from there.
Contributions are Welcome!