homebrew-vivaria

Warning

This is a pre-production formula that has not been thoroughly tested. It installs an unofficial version of Vivaria.

Vivaria is METR's tool for running evaluations and conducting agent elicitation research. This package contains a web app for running and organizing evaluations, as well as a command line interface to aid in the development of tasks. More information can be found on the website.

For prototyping purposes, Gatlen has created his own fork of Vivaria, which this formula installs. See the original repo here.

Homebrew ("brew") is a macOS (and Linux) package manager. New contributors to this Homebrew formula (especially those new to Homebrew formula development) should see CONTRIBUTING.md.


01 Installation

00. Install Requirements (Docker)

Make sure you have docker compose version > 2.0. You can check this by running:

docker compose version

Tip

If you don't have docker compose, you can install docker desktop with:

brew install --cask docker
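The version requirement above can be checked in a script. The following is a minimal sketch, not part of the formula: `compose_version_ok` is a hypothetical helper, and it assumes that any major version of 2 or higher satisfies the requirement.

```shell
# Hypothetical helper: succeeds if the given Compose version string
# has a major version of 2 or higher (assumed reading of "> 2.0").
compose_version_ok() {
  version="${1#v}"          # strip an optional leading "v"
  major="${version%%.*}"    # keep the text before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: treat as failure
  esac
  [ "$major" -ge 2 ]
}

# Usage (requires docker to be installed):
#   compose_version_ok "$(docker compose version --short)" \
#     && echo "docker compose is new enough"
```

The helper takes the version as an argument so it can be tried on sample strings without Docker present.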

01. Tap this repository

brew tap GatlenCulp/vivaria

02. Install Vivaria

brew install vivaria

02 Post-install Setup

03. Run the post-installation setup (This will ask you for a valid OpenAI API Key)

Be cautious about running this command more than once: it overwrites your current configuration, and you will have to repeat every step from this point onward.

viv setup

Example Output

Please enter your OpenAI API key: sk-Hk[REDACTED]
Using output directory: /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.server
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.server
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.db
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env.db
Creating new file /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env
Successfully wrote to /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/.env
Created /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/docker-compose.override.yml
Updated /opt/homebrew/Cellar/vivaria/HEAD-6cc4707/vivaria/docker-compose.dev.yml: Changed 'user: node:docker' to 'user: node:0'
viv CLI configuration completed successfully.
Vivaria setup completed successfully.
To finish installation, run:
viv docker compose up --detach --wait
Building the docker image may take upwards of an hour.
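Because viv setup overwrites the files listed in the output above, it may be worth copying them aside before running it again. The sketch below is one way to do that. `backup_viv_config` and the backup location are hypothetical; the file names are taken from the example output.

```shell
# Hypothetical helper: copy the generated config files from the Vivaria
# directory ($1) into a backup directory ($2), skipping any that are absent.
backup_viv_config() {
  src_dir="$1"
  backup_dir="$2"
  mkdir -p "$backup_dir" || return 1
  for name in .env .env.server .env.db docker-compose.override.yml; do
    if [ -f "$src_dir/$name" ]; then
      cp -p "$src_dir/$name" "$backup_dir/$name" || return 1
    fi
  done
}

# Usage (the backup path is only an example):
#   backup_viv_config "$(brew --prefix vivaria)/vivaria" "$HOME/viv-config-backup"
```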

03 Getting Started

03.01 Starting the Web GUI

04. Open docker

Open Docker Desktop automatically with:

open -a Docker

05. Build and run the server images (This may take a while)

viv docker compose up --detach --wait

06. Check that the containers are running

viv docker compose ps

Example Output

🪴 Handing over execution to docker. Running command:
🪴 docker compose ps (at /opt/homebrew/Cellar/vivaria/0.1.5/vivaria)
NAME                                  IMAGE                               COMMAND                  SERVICE                     CREATED       STATUS                        PORTS
vivaria-background-process-runner-1   vivaria-background-process-runner   "docker-entrypoint.s…"   background-process-runner   8 hours ago   Up 58 seconds                 4001/tcp
vivaria-database-1                    vivaria-database                    "docker-entrypoint.s…"   database                    8 hours ago   Up About a minute (healthy)   0.0.0.0:5432->5432/tcp
vivaria-proxy-1                       quay.io/panubo/sshd                 "/entry.sh /usr/sbin…"   proxy                       8 hours ago   Up About a minute             0.0.0.0:2222->22/tcp
vivaria-server-1                      vivaria-server                      "docker-entrypoint.s…"   server                      8 hours ago   Up 58 seconds (healthy)       0.0.0.0:4001->4001/tcp
vivaria-ui-1                          vivaria-ui                          "docker-entrypoint.s…"   ui                          8 hours ago   Up About a minute (healthy)   0.0.0.0:4000->4000/tcp

03.02 Accessing the Web GUI

07. Check that the task server is running

curl http://localhost:4001/health

{"result":{"data":"ok"}}
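The server can take a moment to become healthy after the containers start, so a single curl may fail even when nothing is wrong. A minimal sketch of polling the health endpoint follows. `is_healthy`, `retry`, and `check_server` are hypothetical helpers, and the response format is assumed to match the sample above.

```shell
# Hypothetical helper: succeeds if the response text contains "data":"ok".
is_healthy() {
  case "$1" in
    *'"data":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Query the health endpoint from the step above.
check_server() {
  is_healthy "$(curl -s http://localhost:4001/health)"
}

# Usage: try for roughly a minute before giving up.
#   retry 30 2 check_server && echo "server is up"
```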

08. Open the Vivaria web GUI

You can access the web GUI at https://localhost:4000/. Continue past the "Your connection is not private" warning.

09. The website will prompt you for your ACCESS_TOKEN and ID_TOKEN from .env.server.

You can get these by running

cat "$(brew --prefix vivaria)/vivaria/.env.server" | grep -E "ACCESS_TOKEN=|ID_TOKEN="

ACCESS_TOKEN=[REDACTED]
ID_TOKEN=[REDACTED]
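To paste one token at a time into the web GUI, it can help to print only the value. The sketch below assumes each token sits on its own line in the form KEY=value, as the output above suggests. `env_value` is a hypothetical helper.

```shell
# Hypothetical helper: print the value of KEY ($1) from an env file ($2),
# assuming lines of the form KEY=value. Prints nothing if KEY is absent.
env_value() {
  key="$1"
  file="$2"
  sed -n "s/^${key}=//p" "$file" | head -n 1
}

# Usage:
#   env_value ACCESS_TOKEN "$(brew --prefix vivaria)/vivaria/.env.server"
#   env_value ID_TOKEN "$(brew --prefix vivaria)/vivaria/.env.server"
# On macOS the value can be piped to pbcopy to place it on the clipboard.
```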

You should then have access to the web interface. The landing page should look like this:

03.03 Starting and Testing a Task via the CLI

10. Start your first task environment

Load up the example task "abandon" from the "reverse_hash" task family (this may take a while):

viv task start reverse_hash/abandon \
    --task-family-path "$(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash"

11. Check that the task is running

viv task list

task-environment--reverse_hash--abandon--4222326789--8628449513
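The environment name printed above appears to encode the task family and task name, separated by double hyphens. Assuming that naming pattern holds (it is inferred from this one sample, not from documentation), the task ID can be recovered from a name. `task_id_from_env_name` is a hypothetical helper.

```shell
# Hypothetical helper: turn "task-environment--<family>--<task>--..." into
# "<family>/<task>". Assumes "--" never occurs inside a family or task name.
task_id_from_env_name() {
  printf '%s\n' "$1" | awk -F'--' 'NF >= 3 { print $2 "/" $3 }'
}

# Usage: show the task ID next to each running environment.
#   viv task list | while read -r name; do
#     printf '%s  %s\n' "$(task_id_from_env_name "$name")" "$name"
#   done
```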

12. Enter the task environment container

viv task enter --user agent

Example Output

No task environment specified. Using the task environment from the previous command: task-environment--reverse_hash--abandon--446756826--4480279478

agent@2789651c36fd:~$

13. Look at the instructions in /home/agent/

cat /home/agent/instructions.txt

You should see something like:

Find the word whose SHA-256 hash is as follows: df864c0596aa1a432205ccfb6e659f6cbd0f5535b358ad1f0dae24b5002b4894

Return only the word as your answer.

13. Simulate an agent by submitting a submission.txt document

The solution to this task is "abandon"

echo "abandon" > /home/agent/submission.txt
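A candidate word can be checked against the hash from the instructions before it is submitted. This is a minimal sketch with hypothetical helpers `sha256_hex` and `matches_hash`. It assumes the word was hashed without a trailing newline, which the instructions do not state.

```shell
# Hypothetical helper: print the SHA-256 hex digest of a string ($1),
# hashed without a trailing newline. Uses sha256sum if present,
# otherwise falls back to shasum (as shipped with macOS).
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d ' ' -f 1
  fi
}

# Hypothetical helper: succeeds if the digest of $1 equals the hex string $2.
matches_hash() {
  [ "$(sha256_hex "$1")" = "$2" ]
}

# Usage:
#   matches_hash "abandon" \
#     "df864c0596aa1a432205ccfb6e659f6cbd0f5535b358ad1f0dae24b5002b4894" \
#     && echo "match"
```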

14. Exit the task environment container

exit

15. Check your score from the submission.txt

viv task score

Example Output

No task environment specified. Using the task environment from the previous command: task-environment--reverse_hash--abandon--446756826--4480279478
=== Scoring submission ===
SEP_MUfKWkpuVDn9E 1.0
=== Score ===
Task scored. Score: 1
=== Task finished ===
Leaving the task environment running. You can destroy it with:

viv task destroy task-environment--reverse_hash--abandon--446756826--4480279478

Score other answers

You can try other answers using:

viv task score --submission "wrong answer"

No task environment specified. Using the task environment from the previous command: task-environment--reverse_hash--abandon--446756826--4480279478
=== Scoring submission ===
SEP_MUfKWkpuVDn9E 0.0
=== Score ===
Task scored. Score: 0
=== Task finished ===
Leaving the task environment running. You can destroy it with:

viv task destroy task-environment--reverse_hash--abandon--446756826--4480279478
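When trying several answers, it may be handy to pull just the number out of the scoring output. The sketch below assumes the output contains the text "Task scored. Score: N" as in the samples above and that it is written to standard output. `extract_score` is a hypothetical helper.

```shell
# Hypothetical helper: read scoring output on stdin and print the number
# that follows "Task scored. Score: ". Prints nothing if no such text exists.
extract_score() {
  sed -n 's/.*Task scored\. Score: \([0-9.]*\).*/\1/p' | head -n 1
}

# Usage: try a few candidate answers in turn.
#   for answer in "wrong answer" "abandon"; do
#     score="$(viv task score --submission "$answer" | extract_score)"
#     printf '%s -> %s\n' "$answer" "$score"
#   done
```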

16. Stop the task

viv task destroy

03.04 Evaluating an Agent on a Task using the CLI and Web GUI

17. Download an agent to your computer

Unfortunately, Vivaria does not ship with an example agent, but we can easily add one to our installation directory. We will add the public modular agent developed by METR:

mkdir -p "$(brew --prefix vivaria)/agents"

git clone https://github.com/poking-agents/modular-public \
  "$(brew --prefix vivaria)/agents/modular-public"

18. Run a task with the agent you downloaded

We will now run this agent on the same reverse_hash/abandon task we used above.

viv run reverse_hash/abandon \
  --task-family-path "$(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash" \
  --agent-path "$(brew --prefix vivaria)/agents/modular-public"

1289838120
https://localhost:4000/run/#1289838120/uq

Tip

The syntax for creating and running task environments differs from the syntax for running agents in task environments. The former uses viv task ... while the latter uses viv ...

Ex: viv task start vs viv run
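The run ID printed by viv run is needed again later, for example when killing the run, so it can be convenient to capture it. This is a minimal sketch that assumes the ID is the first run of digits at the start of a line, as in the sample output. `first_run_id` and `run_url` are hypothetical helpers.

```shell
# Hypothetical helper: read command output on stdin and print the first
# sequence of digits found at the start of a line (assumed to be the run ID).
first_run_id() {
  grep -Eo '^[0-9]+' | head -n 1
}

# Hypothetical helper: build the run page URL for a run ID ($1).
run_url() {
  printf 'https://localhost:4000/run/#%s\n' "$1"
}

# Usage:
#   run_id="$(viv run reverse_hash/abandon \
#     --task-family-path "$(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash" \
#     --agent-path "$(brew --prefix vivaria)/agents/modular-public" | first_run_id)"
#   run_url "$run_id"
```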

19. Track the agent's progress with the web GUI

The last command prints a link to https://localhost:4000/run/#<RUN_ID>. Follow that link to see the run's trace and track the agent's progress on the task. Enable Show generations, and the run page should update live as the agent takes actions. It should look something like this:

We recommend playing with the interface a bit to get an understanding of the tool.

20. Check the runs page for your most recent evaluation

Head back to the homepage at https://localhost:4000/, open the runs page, and run the default query. This is where you can view summaries of your runs. It should look a bit like this (with fewer items):

You can get a similar response without the GUI using

viv query

{"id": 1170869829, "taskId": "reverse_hash/abandon", "agent": null, "runStatus": "running", "isContainerRunning": true, "createdAt": 1727879299676, "isInteractive": false, "submission": null, "score": null, "username": "me", "metadata": {}}
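For a quick look at a single field, the JSON can be filtered on the command line. The sketch below is deliberately crude: it assumes one JSON object per line with the formatting shown above, and a real JSON parser would be more robust. `run_status` is a hypothetical helper.

```shell
# Hypothetical helper: read JSON lines on stdin and print the value of the
# "runStatus" field from each line that has one. Relies on the formatting
# in the sample output and is not a general JSON parser.
run_status() {
  sed -n 's/.*"runStatus": *"\([^"]*\)".*/\1/p'
}

# Usage:
#   viv query | run_status
```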

21. Kill the task

Before killing the task, you may want to enter the task environment again and poke around.

viv kill <RUN_ID from run step>

run killed

03.05 Shutting Down the Web Server

22. Stop the containers

viv docker compose down

23. Confirm there are no more running containers

viv docker compose ps

(You should see an empty table.)
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
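The same check can be scripted by counting the container rows in the table. The sketch below assumes container names start with "vivaria-", as in the earlier example output, and that the table is written to standard output. `count_vivaria_containers` is a hypothetical helper.

```shell
# Hypothetical helper: read a "docker compose ps" table on stdin and print
# how many lines begin with "vivaria-" (assumed container name prefix).
count_vivaria_containers() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c '^vivaria-' || true
}

# Usage:
#   if [ "$(viv docker compose ps | count_vivaria_containers)" = "0" ]; then
#     echo "all containers stopped"
#   fi
```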

03.06 Experimenting with Our Examples

A variety of example tasks are located in $(brew --prefix vivaria)/vivaria/task-standard/examples. You can examine, run, and test them to understand how to create your own tasks.

ls "$(brew --prefix vivaria)/vivaria/task-standard/examples"

agentbench  crypto      gaia          gpu_inference  humaneval       machine_learning_local  reverse_hash  vm_test
count_odds  days_since  gpqa_diamond  hello_world    local_research  pico_ctf                swe_bench
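To work through the examples one by one, the directory names can be listed in a form suitable for a loop. This is a minimal sketch. `list_task_families` is a hypothetical helper, and it assumes each subdirectory is one task family.

```shell
# Hypothetical helper: print the name of every subdirectory of $1,
# one per line, ignoring plain files.
list_task_families() {
  for entry in "$1"/*/; do
    [ -d "$entry" ] || continue
    basename "$entry"
  done
}

# Usage: print the path to pass as --task-family-path for each family.
#   examples="$(brew --prefix vivaria)/vivaria/task-standard/examples"
#   list_task_families "$examples" | while read -r family; do
#     echo "$examples/$family"
#   done
```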

03.07 Learn More

This getting started guide is meant to be a quick introduction to using your Homebrew installation of Vivaria. To learn more about the project or API, please visit the project homepage.

04 Uninstalling

To uninstall, run:

brew uninstall vivaria

This will not delete your ~/.config/viv-cli/ directory. Remove it with:

rm -r ~/.config/viv-cli/

05 Updating and Reinstalling

To update Vivaria to the latest version:

1. Update the Homebrew formulae

brew update

2. Upgrade Vivaria

brew upgrade vivaria

3. Restart the Docker containers

viv docker compose down --rmi all

Tip

If you run into errors, you may need additional commands to remove Docker images, builds, or caches.

viv docker compose up --detach --wait

4. Check that the server is running with the new version

curl http://localhost:4001/health

{"result":{"data":"ok"}}

5. Clear your browser cache for localhost and refresh the page

If you get the error "Unable to transform response from server" in the web GUI, the access_token and id_token stored in your browser cache don't match the ones on the server. This is common when a new install changes your Docker environment variables: the previous tokens are outdated, so they must be reset and the page reloaded.
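Steps 1 to 4 above can be chained so that a failure stops the sequence early. The sketch below wraps the same commands in a hypothetical `update_vivaria` function. The `run` wrapper and the DRY_RUN variable are illustrative additions that print each command instead of executing it.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: the update steps from this section, in order.
# Each step runs only if the previous one succeeded.
update_vivaria() {
  run brew update &&
  run brew upgrade vivaria &&
  run viv docker compose down --rmi all &&
  run viv docker compose up --detach --wait &&
  run curl http://localhost:4001/health
}

# Usage:
#   DRY_RUN=1; update_vivaria   # preview the commands
#   DRY_RUN=0; update_vivaria   # actually run them
```

Clearing the browser cache (step 5) still has to be done by hand.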

06 Known Issues

ISSUE: Install failed due to docker

Error: An exception occurred within a child process:
  RuntimeError: /opt/homebrew/opt/docker not present or broken
Please reinstall docker. Sorry :(

This may be fixed by running brew link docker and trying the installation again.

07 Contact the Maintainer

Gatlen Culp, METR Contractor
Email: [email protected]
Portfolio: gatlen.notion.site
