
This repository contains the code for the project "Action Transcript Prediction from Multi-Modal Environments using Vision-Language Models" from the University of Bonn, Computer Science Institute VI, Center for Robotics.

## Dataset and installation

For the packages required by the code, see requirements.txt.
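The dependencies can typically be installed with pip (a standard workflow; exact package versions come from requirements.txt in this repo):

```shell
# Install all packages listed in requirements.txt
pip install -r requirements.txt
```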

To download the ALFRED dataset, see: https://github.com/askforalfred/alfred

If you want to try a lighter backbone, i.e. MobileCLIP, install it from the official repo: https://github.com/apple/ml-mobileclip
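A typical clone-and-install sequence for MobileCLIP might look like the following (a sketch of the usual editable-install workflow; check the official repository's README for the authoritative steps and checkpoint downloads):

```shell
# Clone the official MobileCLIP repository
git clone https://github.com/apple/ml-mobileclip.git
cd ml-mobileclip

# Install the package in editable mode into the current environment
pip install -e .
```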

## Repo structure

- pretraining: all the pretraining and preprocessing code
- model: the end-to-end model for generating action sequences
- dataset: the dataset for end-to-end training