diff --git a/README.md b/README.md index 1b41f42..681b903 100644 --- a/README.md +++ b/README.md @@ -2,54 +2,196 @@ **WARNING: This is currently a pre-production Formula that has not been thoroughly tested and which installs a currently non-official version of Vivaria** +![./assets/metr_logo.svg](./assets/metr_logo.svg) +![./assets/logo.png](./assets/logo.png) + [Vivaria](https://vivaria.metr.org/) is METR's tool for running evaluations and conducting agent elicitation research. This package contains a web app which is used for running and organizing evaluations as well as a command line interface to aid in the development of tasks. More information can be found on the website [here](https://vivaria.metr.org/). -For testing purposes and creating installation helper scripts, Gatlen has created [his own fork of Vivaria](https://github.com/GatlenCulp/vivaria/) which this installs. +For testing purposes and creating installation helper scripts, Gatlen has created [his own fork of Vivaria](https://github.com/GatlenCulp/vivaria/) which this installs. 
See the original [here](https://github.com/METR/vivaria) + +## 00 TOC - [homebrew-vivaria](#homebrew-vivaria) + - [00 TOC](#00-toc) - [01 Setup](#01-setup) - [01.01 Installation](#0101-installation) - - [02 Testing](#02-testing) - - [03 Uninstalling](#03-uninstalling) - - [04 Updating](#04-updating) - - [05 Known Issues](#05-known-issues) - - [06 Developing and Maintaining this Formula](#06-developing-and-maintaining-this-formula) - - [06.01 The Source](#0601-the-source) - - [06.02 The Python Dependencies](#0602-the-python-dependencies) - - [06.03 Copying Everything Over](#0603-copying-everything-over) - - [06.04 Debugging](#0604-debugging) - - [06.05 Random Useful Notes](#0605-random-useful-notes) - - [06.06 Roadmap](#0606-roadmap) - - [07 Contact the Maintainer](#07-contact-the-maintainer) - + - [01.02 Web UI](#0102-web-ui) + - [01.03 Starting and Testing a Task via the CLI](#0103-starting-and-testing-a-task-via-the-cli) + - [01.04 Evaluating an Agent on a Task using the CLI and Web UI](#0104-evaluating-an-agent-on-a-task-using-the-cli-and-web-ui) + - [01.05 Shutting down](#0105-shutting-down) + - [01.06 Examples for Testing](#0106-examples-for-testing) + - [02 Uninstalling](#02-uninstalling) + - [03 Updating](#03-updating) + - [04 Known Issues](#04-known-issues) + - [05 Developing and Maintaining this Formula](#05-developing-and-maintaining-this-formula) + - [05.01 The Source](#0501-the-source) + - [05.02 The Python Dependencies](#0502-the-python-dependencies) + - [05.03 Copying Everything Over](#0503-copying-everything-over) + - [05.04 Debugging](#0504-debugging) + - [05.05 Random Useful Notes](#0505-random-useful-notes) + - [05.06 Roadmap](#0506-roadmap) + - [06 Contact the Maintainer](#06-contact-the-maintainer) +--- ## 01 Setup ### 01.01 Installation + 1. Tap this repository -`brew tap GatlenCulp/vivaria` +```bash +brew tap GatlenCulp/vivaria +``` + 2. Install Vivaria -`brew install vivaria` +```bash +brew install vivaria +``` + 3. 
Run the post-installation setup (This will ask you for a valid [OpenAI API Key](https://medium.com/@lorenzozar/how-to-get-your-own-openai-api-key-f4d44e60c327)). *Be cautious running this command multiple times as it will overwrite your current configuration and will require you to follow all the instructions from here onward* -`viv setup` -1. Open docker -`open -a Docker` -1. Build and run the server images (This may take a while) -`viv docker compose up --detach --wait` +```bash +viv setup +``` + +4. Open Docker +```bash +open -a Docker +``` + +5. Build and run the server images (This may take a while) +```bash +viv docker compose up --detach --wait +``` + +### 01.02 Web UI -### 02 Testing 6. Test that the server is up -`docker compose ps` (You should see a list of containers) -`curl http://localhost:4001/health` (You should see a JSON response `{...TODO...}`) +```bash +docker compose ps # You should see a list of containers +curl http://localhost:4001/health # You should see a JSON response {...TODO...} +``` + 7. Open the Vivaria web interface at https://localhost:4000/ and continue past the "insecure connection" warning. -8. The website will prompt you for your `ACCESS_TOKEN` and `ID_TOKEN` from `.env.server` -## 03 Uninstalling +8. The website will prompt you for your `ACCESS_TOKEN` and `ID_TOKEN` from `.env.server`. +![./assets/token-prompt.png](./assets/token-prompt.png) +You can get these by running +```bash +cat "$(brew --prefix vivaria)/vivaria/.env.server" | grep -E "ACCESS_TOKEN=|ID_TOKEN=" +``` + +You should then have access to the web interface. The landing page should look like this: + +TODO: INSERT IMAGE + +### 01.03 Starting and Testing a Task via the CLI + +9. You can start your first task environment by loading up the example task "abandon" from the "reverse_hash" task family: +```bash +viv task start reverse_hash/abandon $(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash +``` + +10. 
Now that the task is running, you should be able to see it by running: +```bash +viv task list +``` + +11. Enter the task environment container by running: +```bash +viv task enter --user agent +``` + +12. Look at the instructions in `/home/agent/`: +```bash +pwd +cat instructions.txt +``` +You should see something like: +>Please find the word whose SHA-256 hash is as follows: 98sdfs923432jsdsdf +> Return only the word as your answer. + +13. Write up a `submission.txt` document. The solution to this task is "abandon": +```bash +echo "abandon" > submission.txt +``` + +14. Exit the container: +```bash +exit +``` + +15. Check your score from the `submission.txt` document or check other solutions: +```bash +viv task score # You should see something like "1" TODO +viv task score --submission "wrong answer" # You should see something like "0" TODO +``` + +16. Stop the task: +```bash +viv task kill +``` + +### 01.04 Evaluating an Agent on a Task using the CLI and Web UI + +17. Unfortunately, Vivaria does not ship with an example agent, but we can easily add one to our installation directory. We will add the public [modular agent](https://github.com/poking-agents/modular-public), developed by METR: +```bash +mkdir -p "$(brew --prefix vivaria)/agents" +git clone https://github.com/poking-agents/modular-public \ + "$(brew --prefix vivaria)/agents/modular-public" +``` + +18. We will now run this agent on the same `reverse_hash/abandon` task we did above. +```bash +viv run reverse_hash/abandon \ + --task-family-path $(brew --prefix vivaria)/vivaria/task-standard/examples/reverse_hash \ + --agent-path $(brew --prefix vivaria)/agents/modular-public +``` + +19. The last command prints a link to [https://localhost:4000/path/to/run](https://localhost:4000/...). Follow that link to see the run's trace and track the agent's progress on the task. The run page should update live as the agent takes actions. 
It should look something like this: + +![./assets/run-page.png](./assets/run-page.png) + +20. Play with the interface a bit to get an understanding of the tool. +21. Head back to the homepage at [https://localhost:4000/](https://localhost:4000/), check out the runs page, and run the default query. This is where you can view summaries of the tasks you have run. It should look a bit like this (with fewer items): + +![./assets/runs-page.png](./assets/runs-page.png) + +22. You can then kill the task (or jump into the task environment and poke around). +```bash +viv task kill +``` + +### 01.05 Shutting down + +23. To shut down the server, you just need to stop the containers: + +```bash +docker compose down +``` + +24. Then confirm there are no more active containers: + +```bash +docker compose ps +``` + +### 01.06 Examples for Testing + +25. Located in `$(brew --prefix vivaria)/vivaria/task-standard/examples` are a variety of example tasks you can examine, run, and test to understand how to create your own tasks. + +```bash +ls $(brew --prefix vivaria)/vivaria/task-standard/examples +``` +> agentbench crypto gaia gpu_inference humaneval machine_learning_local reverse_hash vm_test +count_odds days_since gpqa_diamond hello_world local_research pico_ctf swe_bench + +--- +## 02 Uninstalling To uninstall, run `brew uninstall vivaria`. This will not delete your `~/.config/viv-cli/` directory. That must be removed manually. -## 04 Updating +--- +## 03 Updating To update Vivaria to the latest version: @@ -81,16 +223,16 @@ Note: If you encounter any issues after upgrading, you may need to remove the ol viv docker compose down --rmi all viv docker compose up --detach --wait --build ``` - -## 05 Known Issues +--- +## 04 Known Issues **ISSUE: Install failed due to docker** This may be fixed by running `brew link docker` and trying the installation again. 
--- -## 06 Developing and Maintaining this Formula +## 05 Developing and Maintaining this Formula -### 06.01 The Source +### 05.01 The Source Brew downloads your source and places it in a temporary build directory. This source is commonly pulled in two different ways. @@ -118,7 +260,7 @@ Vivaria requires the `.git` repository files and GitHub does not include those i revision: "d67cc7894064e45f3459104c0f004fc1bd86612b" ``` -### 06.02 The Python Dependencies +### 05.02 The Python Dependencies Brew has a light policy against using the internet during the installation process, which is a problem for pip installing the packages necessary for Vivaria. Luckily, Brew has a decent interface for managing Python packages and setting up a virtual environment. More information can be found here: [https://docs.brew.sh/Python-for-Formula-Authors](https://docs.brew.sh/Python-for-Formula-Authors). @@ -160,7 +302,7 @@ class Vivaria < Formula Making the virtual environment and building the package automatically creates an executable called `viv` in the virtual environment's bin. We then copy this executable to `final_install_path/bin`, which contains executables that are symlinked to Brew's bin (e.g. `/opt/homebrew/bin/`), which is on the user's path, making the `viv` script available to the user anywhere. -### 06.03 Copying Everything Over +### 05.03 Copying Everything Over At this point, the viv-cli is essentially installed, but since the web UI relies on TypeScript and Docker files, we need to maintain a large chunk of the original project files in `final_install_path`. In the rest of the script, we install all the docs into the folder where Brew expects to find them (`final_install_path/share/doc`), delete everything we no longer need, and copy the rest over from the build path to the final install path under `vivaria`. 
@@ -189,7 +331,7 @@ class Vivaria < Formula src_dir.install Dir["*", ".*"].reject { |f| ['.', '..'].include?(File.basename(f)) } ``` -### 06.04 Debugging +### 05.04 Debugging To install the formula in debug mode and receive more verbose errors while developing the formula, you can run: `brew install --formula --debug --verbose ./Formula/vivaria.rb` @@ -201,7 +343,7 @@ I was attempting to set up a Ruby debugger w/ intellisense in VSCode but it wasn `gem install ruby-lsp` `gem install debug` -### 06.05 Random Useful Notes +### 05.05 Random Useful Notes `echo $(brew --prefix vivaria)` can be used to get the [opt-prefix](https://docs.brew.sh/Manpage) for Vivaria. This returns a static path to a symlinked folder pointing to the most recent version of Vivaria. @@ -245,7 +387,7 @@ gatlenculp/vivaria/vivaria Error: 22 problems in 1 formula detected. ``` -### 06.06 Roadmap +### 05.06 Roadmap - [ ] Automatically configure an SSH key for the user to use with viv. ```ruby @@ -261,7 +403,7 @@ File.open(prefix/".env", "a") { |f| f.puts "SSH_PUBLIC_KEY_PATH=#{ssh_key_path}. 
system "viv", "register-ssh-public-key", "#{ssh_key_path}.pub" ``` -## 07 Contact the Maintainer +## 06 Contact the Maintainer Gatlen Culp, METR Contractor Email: gatlen.culp@metr.org diff --git a/assets/logo.png b/assets/logo.png new file mode 100644 index 0000000..0bf9d7e Binary files /dev/null and b/assets/logo.png differ diff --git a/assets/metr_logo.svg b/assets/metr_logo.svg new file mode 100644 index 0000000..5673ca9 --- /dev/null +++ b/assets/metr_logo.svg @@ -0,0 +1,80 @@ + + + + + + + + + + + + + + + + + + diff --git a/assets/playground.png b/assets/playground.png new file mode 100644 index 0000000..777c847 Binary files /dev/null and b/assets/playground.png differ diff --git a/assets/run-page.png b/assets/run-page.png new file mode 100644 index 0000000..6038e81 Binary files /dev/null and b/assets/run-page.png differ diff --git a/assets/runs-page.png b/assets/runs-page.png new file mode 100644 index 0000000..1032051 Binary files /dev/null and b/assets/runs-page.png differ diff --git a/assets/token-prompt.png b/assets/token-prompt.png new file mode 100644 index 0000000..55fc84a Binary files /dev/null and b/assets/token-prompt.png differ