Visual Grounding for Object-Level Generalization in Reinforcement Learning

Code for the paper "Visual Grounding for Object-Level Generalization in Reinforcement Learning", accepted at ECCV 2024 [PDF].


Overview of our proposed CLIP-guided Object-grounded Policy Learning (COPL). (left) Visual grounding: the instruction (e.g., "hunt a cow") is converted into a unified 2D confidence map of the target object (e.g., cow) via our modified MineCLIP. (right) Transferring VLM knowledge into RL: the agent takes the confidence map as the task representation and is trained with our proposed focal reward, derived from the confidence map, to guide the agent toward the target object.

Installation

  • Create a conda environment with Python 3.9 and install the Python packages listed in requirements.txt.
  • Install JDK 1.8.0_171, then install our modified MineDojo environment.
  • Download the pre-trained model for Minecraft: run bash downloads.sh to download the MineCLIP model. (A consolidated command sketch is shown below.)
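
The steps above, collected as a shell sketch. The environment name and the MineDojo install path are assumptions and may differ from your local setup; consult the modified MineDojo's own instructions for the exact install step.

```bash
# Create and activate a Python 3.9 conda environment (the name "copl" is arbitrary)
conda create -n copl python=3.9 -y
conda activate copl

# Install the Python dependencies listed in this repository
pip install -r requirements.txt

# After installing JDK 1.8.0_171, install the modified MineDojo environment
# (assumed here to be installed from a local source directory; adjust the path as needed)
pip install -e ./MineDojo

# Download the pre-trained MineCLIP model
bash downloads.sh
```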

Training

To train single-task RL for hunting a sheep with the focal reward:

Run ./scripts/sheep_focal.sh 0, where 0 is the random seed. The --multi_task_config argument in the script specifies the task; it can be pointed to other config files in src/config/env/single_task to train RL on other tasks, such as hunting a cow or hunting a pig.
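
For example (the cow config filename below is hypothetical; use whichever files actually exist in src/config/env/single_task):

```bash
# Train single-task RL for hunting a sheep with the focal reward, using random seed 0
./scripts/sheep_focal.sh 0

# To train another task, point --multi_task_config inside the script at a different
# single-task config, e.g. (hypothetical filename):
#   --multi_task_config src/config/env/single_task/cow.json
```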

To train COPL for the hunting domain:

Run ./scripts/hunt_copl.sh 0, where 0 is the random seed. The --multi_task_config argument in the script specifies the task domain; it can be changed to src/config/env/multi_tasks/harvest.json to train COPL for the harvest domain.
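
For example:

```bash
# Train COPL on the hunting domain with random seed 0
./scripts/hunt_copl.sh 0

# To train COPL on the harvest domain instead, change --multi_task_config in the
# script to the harvest config:
#   --multi_task_config src/config/env/multi_tasks/harvest.json
```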

Demos

Here we present videos of COPL-trained agents performing hunting tasks, along with the confidence maps for different objects produced by our modified MineCLIP. From left to right: raw video, then confidence maps for cow, sheep, and pig, respectively.

  • hunt a cow: cow_viz
  • hunt a sheep: sheep_viz
  • hunt a pig: pig_viz

Citation

If you find our work useful in your research, please cite it as follows:

@inproceedings{jiang2024visual,
      title={Visual Grounding for Object-Level Generalization in Reinforcement Learning}, 
      author={Jiang, Haobin and Lu, Zongqing},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024},
}
