CrafText is a goal-conditioned extension of the Craftax environment, specifically developed as a benchmark for multimodal reinforcement learning. It challenges agents to follow natural language instructions grounded in rich, visual environments inspired by Minecraft. Agents must combine visual perception** and language understanding to execute complex action sequences in dynamic, open-ended worlds.
🔗 Environment Repository: https://github.com/AIRI-Institute/CrafText
- 🟡 PPO-T: A baseline based on the PPO algorithm where the agent encodes instructions using BERT embeddings and processes them jointly with visual observations.
- 🟢 PPO-T+: An enhanced version of PPO-T that leverages high-level plans to improve instruction following.
- 🔵 FiLM: A model using feature-wise modulation (FiLM) to dynamically condition policy behavior on language instructions.
- 🟣 LLM Baseline: Zero-shot evaluation using GPT-4, LLaMA, Qwen, and other large language models via API access and models from Hugging Face.
Below are the available baselines with example commands to launch training runs.
python3 baselines/ppo_conv_rnn.py \
--craftext_settings easy_train \
--env_name "Craftax-Classic-Pixels-v1" \
--num_envs 1024 \
--total_timesteps 6000000000 \
python3 baselines/ppo_conv_rnn.py \
--craftext_settings easy_train \
--env_name "Craftax-Classic-Pixels-v1" \
--num_envs 1024 \
--total_timesteps 6000000000 \
--use_plans True \
python3 baselines/ppo_filmed.py \
--craftext_settings easy_train \
--env_name "Craftax-Classic-Pixels-v1" \
--num_envs 1024 \
--total_timesteps 6000000000 \
python llm_baselines/run.py \
--model_source huggingface \
--model_name Qwen/Qwen2.5-0.5B \
--output_file qwen_results.txt