Warning
This is currently a pre-production formula that has not been thoroughly tested, and it installs a non-official version of Vivaria.
Vivaria is METR's tool for running evaluations and conducting agent elicitation research. This package contains a web app for running and organizing evaluations, as well as a command-line interface to aid in developing tasks. More information can be found on the website.
For prototyping purposes, Gatlen has created his own fork of Vivaria, which this formula installs. See the original repo here.
Homebrew ("brew") is a macOS (and Linux) package manager. New contributors to this Homebrew formula (especially those new to Homebrew formula development) should see CONTRIBUTING.md.
00. Install Requirements (Docker)
Make sure you have docker compose version > 2.0. You can check this by running:
docker compose version
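The version check above can also be scripted. The following is a minimal sketch, not part of Vivaria or Docker: `compose_version_ok` is a hypothetical helper name, and it assumes that "version > 2.0" can be read as "major version 2 or later".

```shell
# Hypothetical helper (not part of viv or docker): succeeds when the version
# string reports a major version of 2 or later. Treating "> 2.0" as
# "major version >= 2" is an assumption made for this sketch.
compose_version_ok() {
    # Pull the first dotted version number out of the string, e.g. "2.29.1".
    version=$(printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
    [ -n "$version" ] || return 1
    major=${version%%.*}
    [ "$major" -ge 2 ]
}

# Only query docker when it is installed, so the snippet is safe to paste anywhere.
if command -v docker >/dev/null 2>&1; then
    if compose_version_ok "$(docker compose version 2>/dev/null)"; then
        echo "docker compose looks recent enough"
    else
        echo "docker compose is missing or older than 2.0"
    fi
fi
```

The helper only inspects the string it is given, so it can be tried on any saved `docker compose version` output without Docker running.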
Tip
If you don't have docker compose, you can install Docker Desktop with:
brew install --cask docker
01. Tap this repository
brew tap GatlenCulp/vivaria
02. Install Vivaria
brew install vivaria
03. Run the post-installation setup (This will ask you for a valid OpenAI API Key)
Be cautious about running this command more than once: it overwrites your current configuration, and you will need to follow all the instructions from here onward again.
viv setup
Example Output
Please enter your OpenAI API key: sk-Hk[REDACTED]
Using output directory: /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.server
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.server
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.db
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.db
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env
Created /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/docker-compose.override.yml
Updated /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/docker-compose.dev.yml: Changed 'user: node:docker' to 'user: node:0'
viv CLI configuration completed successfully.
Vivaria setup completed successfully.
To finish installation, run: viv docker compose up --detach --wait
Building the docker image may take upwards of an hour.
04. Open docker
Open Docker Desktop automatically with:
open -a Docker
05. Build and run the server images (This may take a while)
viv docker compose up --detach --wait
06. Check that the containers are running
viv docker compose ps
Example Output
🪴 Handing over execution to docker. Running command:
🪴 docker compose ps (at /opt/homebrew/Cellar/vivaria/0.1.5/vivaria)
NAME                                  IMAGE                               COMMAND                  SERVICE                     CREATED       STATUS                        PORTS
vivaria-background-process-runner-1   vivaria-background-process-runner   "docker-entrypoint.s…"   background-process-runner   8 hours ago   Up 58 seconds                 4001/tcp
vivaria-database-1                    vivaria-database                    "docker-entrypoint.s…"   database                    8 hours ago   Up About a minute (healthy)   0.0.0.0:5432->5432/tcp
vivaria-proxy-1                       quay.io/panubo/sshd                 "/entry.sh /usr/sbin…"   proxy                       8 hours ago   Up About a minute             0.0.0.0:2222->22/tcp
vivaria-server-1                      vivaria-server                      "docker-entrypoint.s…"   server                      8 hours ago   Up 58 seconds (healthy)       0.0.0.0:4001->4001/tcp
vivaria-ui-1                          vivaria-ui                          "docker-entrypoint.s…"   ui                          8 hours ago   Up About a minute (healthy)   0.0.0.0:4000->4000/tcp
07. Check that the task server is running
curl http://localhost:4001/health
{"result":{"data":"ok"}}
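Because the server can take a moment to come up, it may help to poll the health endpoint instead of checking once. The sketch below is illustrative only: `is_healthy` and `wait_for_server` are hypothetical helper names, and the retry count and delay are arbitrary choices rather than values recommended by Vivaria.

```shell
# Hypothetical helper: reads a health response on stdin and succeeds only if
# it contains the "ok" payload shown above.
is_healthy() {
    grep -q '"data":"ok"'
}

# Hypothetical helper: poll the health endpoint until it answers or we give up.
# The default URL is the one used in this guide; the attempt count and the
# 2-second delay are arbitrary.
wait_for_server() {
    url=${1:-http://localhost:4001/health}
    attempts=${2:-30}
    while [ "$attempts" -gt 0 ]; do
        if curl --silent --fail "$url" | is_healthy; then
            echo "server is healthy"
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 2
    done
    echo "server did not become healthy in time" >&2
    return 1
}
```

After defining both functions, running `wait_for_server` with no arguments polls the URL from this step for roughly a minute before giving up.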
08. Open the Vivaria web GUI
You can access the web GUI at https://localhost:4000/. Continue past the "Your connection is not private" warning.
09. The website will prompt you for your ACCESS_TOKEN and ID_TOKEN from .env.server.
You can get these by running:
cat "$(brew --prefix vivaria)/vivaria/.env.server" | grep -E "ACCESS_TOKEN=|ID_TOKEN="
ACCESS_TOKEN=[REDACTED]
ID_TOKEN=[REDACTED]
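If you would rather print only the token values (for example, to copy them one at a time), a small helper can strip the variable names. This is a sketch under the assumption that each entry in .env.server sits on its own line in unquoted KEY=value form; `env_value` is a hypothetical name, not a viv command.

```shell
# Hypothetical helper: print the value of one KEY=value entry from an env file.
# Assumes unquoted KEY=value lines, one per line.
env_value() {
    key=$1
    file=$2
    # Match only lines that start with KEY=, then drop everything up to the
    # first "=" so values containing "=" are kept whole.
    grep -E "^${key}=" "$file" | head -n 1 | cut -d '=' -f 2-
}

# Example usage against the file written by `viv setup` (path from this guide):
#   env_file="$(brew --prefix vivaria)/vivaria/.env.server"
#   env_value ACCESS_TOKEN "$env_file"
#   env_value ID_TOKEN "$env_file"
```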
You should then have access to the web interface. The landing page should look like this:
10. Start your first task environment
Load up the example task "abandon" from the "reverse_hash" task family (this may take a while):
viv task start reverse_hash/abandon \
--task-family-path "$(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash"
11. Check that the task is running
viv task list
task-environment--reverse_hash--abandon--4222326789--8628449513
12. Enter the task environment container
viv task enter --user agent
Example Output
No task environment specified. Using the task environment from the previous command:
task-environment--reverse_hash--abandon--446756826--4480279478
agent@2789651c36fd:~$
13. Look at the instructions in /home/agent/
cat /home/agent/instructions.txt
You should see something like:
Find the word whose SHA-256 hash is as follows:
df864c0596aa1a432205ccfb6e659f6cbd0f5535b358ad1f0dae24b5002b4894
Return only the word as your answer.
13. Simulate an agent by submitting a submission.txt document
The solution to this task is "abandon":
echo "abandon" > /home/agent/submission.txt
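To see how a candidate word can be checked against the hash in the instructions, you can compute its SHA-256 digest yourself. The sketch below assumes the task hashes the bare word with no trailing newline, which this guide does not confirm; `sha256_of` is a hypothetical helper name.

```shell
# Hypothetical helper: print the SHA-256 hex digest of a word.
# printf '%s' is used so that no trailing newline is hashed (an assumption
# about how the task computes its hash).
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
    else
        # macOS ships shasum rather than sha256sum.
        printf '%s' "$1" | shasum -a 256 | cut -d ' ' -f 1
    fi
}

# Hash copied from the task instructions above.
expected="df864c0596aa1a432205ccfb6e659f6cbd0f5535b358ad1f0dae24b5002b4894"
if [ "$(sha256_of "abandon")" = "$expected" ]; then
    echo "hash matches"
else
    echo "hash does not match"
fi
```

Note that `echo "abandon"` writes a trailing newline into submission.txt, while the helper hashes the word alone; the two serve different purposes and are not interchangeable.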
14. Exit the task environment container
exit
15. Check your score from the submission.txt
viv task score
Example Output
No task environment specified. Using the task environment from the previous command:
task-environment--reverse_hash--abandon--446756826--4480279478
=== Scoring submission ===
SEP_MUfKWkpuVDn9E 1.0
=== Score ===
Task scored. Score:1
=== Task finished ===
Leaving the task environment running. You can destroy it with:
viv task destroy task-environment--reverse_hash--abandon--446756826--4480279478
Score other answers
You can try other answers using:
viv task score --submission "wrong answer"
No task environment specified. Using the task environment from the previous command:
task-environment--reverse_hash--abandon--446756826--4480279478
=== Scoring submission ===
SEP_MUfKWkpuVDn9E 0.0
=== Score ===
Task scored. Score:0
=== Task finished ===
Leaving the task environment running. You can destroy it with:
viv task destroy task-environment--reverse_hash--abandon--446756826--4480279478
16. Stop the task
viv task destroy
17. Download an agent to your computer
Unfortunately, Vivaria does not ship with an example agent, but we can easily add one to our installation directory. We will add the public modular agent, developed by METR:
mkdir -p "$(brew --prefix vivaria)/agents"
git clone https://github.com/poking-agents/modular-public \
"$(brew --prefix vivaria)/agents/modular-public"
18. Run a task with the agent you downloaded
We will now run this agent on the same reverse_hash/abandon task we did above.
viv run reverse_hash/abandon \
--task-family-path "$(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash" \
--agent-path "$(brew --prefix vivaria)/agents/modular-public"
1289838120 https://localhost:4000/run/#1289838120/uq
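You will need the run ID again when you kill the run later, so it can be handy to capture it in a variable. The following sketch extracts the ID from a run URL of the form shown above; `run_id_from_url` is a hypothetical helper, and the URL pattern is inferred only from this one example output.

```shell
# Hypothetical helper: pull the numeric run ID out of a run URL such as
# https://localhost:4000/run/#1289838120/uq
# Prints nothing if the URL does not contain "/run/#<digits>".
run_id_from_url() {
    printf '%s\n' "$1" | sed -n 's|.*/run/#\([0-9][0-9]*\).*|\1|p'
}

run_id=$(run_id_from_url "https://localhost:4000/run/#1289838120/uq")
echo "$run_id"   # prints 1289838120
```

The captured value can then be passed along later, for example as `viv kill "$run_id"`.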
Tip
The syntax for creating and running task environments is different from the syntax for running agents in task environments. The former uses viv task ... while the latter uses viv ...
Ex: viv task start vs viv run
19. Track the agent's progress with the web GUI
The last command prints a link to https://localhost:4000/run/#<RUN_ID>. Follow that link to see the run's trace and track the agent's progress on the task. Enable Show generations, and the run page should update live as the agent takes actions. It should look something like this:
We recommend playing with the interface a bit to get an understanding of the tool.
20. Check the runs page for your most recent evaluation
Head back to the homepage at https://localhost:4000/, open the runs page, and run the default query. This is where you can view summaries of the tasks you have run. It should look a bit like this (with fewer items):
You can get a similar response without the GUI using:
viv query
{"id": 1170869829, "taskId": "reverse_hash/abandon", "agent": null, "runStatus": "running", "isContainerRunning": true, "createdAt": 1727879299676, "isInteractive": false, "submission": null, "score": null, "username": "me", "metadata": {}}
21. Kill the task
Before killing the task, you may want to re-enter the task environment and poke around.
viv kill <RUN_ID from run step>
run killed
22. Stop the containers
viv docker compose down
23. Confirm there are no more active containers
viv docker compose ps
(You should see an empty table.)
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
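The "empty table" check can be automated by counting the rows that follow the header. This is a rough sketch, assuming the table always begins with a line starting with NAME, as in the example outputs in this guide; `count_listed_containers` is a hypothetical helper.

```shell
# Hypothetical helper: reads `docker compose ps` output on stdin and prints
# how many non-empty rows follow the NAME header line. Lines before the
# header (such as viv's own status messages) are ignored.
count_listed_containers() {
    awk '
        seen_header && NF { rows++ }          # count non-empty lines after the header
        /^NAME/           { seen_header = 1 } # the header itself is not counted
        END               { print rows + 0 }
    '
}
```

For example, `viv docker compose ps | count_listed_containers` should print 0 once everything has been stopped.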
Located in $(brew --prefix vivaria)/vivaria/task-standard/examples are a variety of example tasks you can examine, run, and test to understand how to create your own tasks:
ls $(brew --prefix vivaria)/vivaria/task-standard/examples
agentbench crypto gaia gpu_inference humaneval machine_learning_local reverse_hash vm_test count_odds days_since gpqa_diamond hello_world local_research pico_ctf swe_bench
This getting started guide is meant to be a quick introduction to using your Brew installation of Vivaria. To learn more about the project or API, please visit the project homepage.
To uninstall, run:
brew uninstall vivaria
This will not delete your ~/.config/viv-cli/ directory. Remove it with:
rm -r ~/.config/viv-cli/
To update Vivaria to the latest version:
1. Update the Homebrew formulae
brew update
2. Upgrade Vivaria
brew upgrade vivaria
3. Restart the Docker containers
viv docker compose down --rmi all
Tip
Additional Docker image, build, or cache removal commands may be necessary if you run into any errors.
viv docker compose up --detach --wait
4. Check that the server is running with the new version
curl http://localhost:4001/health
{"result":{"data":"ok"}}
5. Clear your browser cache for localhost and refresh the page
If you get the error "Unable to transform response from server" on the web GUI screen, the access_token and id_token stored in your browser cache don't match the ones on the server. This is common when your Docker environment variables change due to a new install: your previous tokens are outdated, so they must be reset and the page reloaded.
ISSUE: Install failed due to docker
Error: An exception occurred within a child process:
RuntimeError: /opt/homebrew/opt/docker not present or broken
Please reinstall docker. Sorry :(
This may be fixed by running brew link docker and trying the installation again.
Gatlen Culp, METR Contractor
Email: [email protected]
Portfolio: gatlen.notion.site