diff --git a/docs/developer_adding_new_models.md b/docs/developer_adding_new_models.md index b91e8e027d..e0e5a58cb7 100644 --- a/docs/developer_adding_new_models.md +++ b/docs/developer_adding_new_models.md @@ -22,7 +22,7 @@ In the second case, if you want to add a model to HELM, you can directly do it i * Include any link justifying the metadata used in `ModelMetadata` such as the release data, number of parameters, capabilities and so on (you should not infer anything). * Check that you are respecting the format used in those files (`ModelMetadata` should be named as `/` and the `ModelDeployment` should be named as `/`, for example `ModelMetadata`: `openai/gpt2` and `ModelDeployment`: `huggingface/gpt2`). Add the appropriate comments and so on. * Run `helm-run --run-entries "mmlu:subject=anatomy,model_deployment=" --suite v1 --max-eval-instances 10` and make sure that everything works. Include the logs from the terminal in your PR. -* Not create unnecessary objects (`Client` `TokenizerCOnfig`, `WindowService`) and if you have to create one of these objects, document in your PR why you had to. Make them general enough so that they could be re-used by other models (especially the `Client`). +* Not create unnecessary objects (`Client` `TokenizerConfig`, `WindowService`) and if you have to create one of these objects, document in your PR why you had to. Make them general enough so that they could be re-used by other models (especially the `Client`). ## Example diff --git a/docs/importing_custom_modules.md b/docs/importing_custom_modules.md index c790269b44..15823793b7 100644 --- a/docs/importing_custom_modules.md +++ b/docs/importing_custom_modules.md @@ -1,36 +1,244 @@ # Importing Custom Modules -HELM is a modular framework with a plug-in architecture. You can write your own implementation for a client, tokenizer, scenario or metric and use them in HELM with HELM installed as a library, without needing to modify HELM itself. +HELM is a modular framework with a plug-in architecture. You can write your own implementation for run specs, clients, tokenizers, scenarios, metrics, annotators, perturbations, and window services and use them in HELM with HELM installed as a library, without needing to modify HELM itself. -The main way for you to use your code in HELM is to write a custom Python class that is a subclass of `Client`, `Tokenizer`, `Scenario` or `Metric` in a Python module. You can then specify a `ClientSpec`, `TokenizerSpec`, `ScenarioSpec` or `MetricSpec` (which are all classes of `ObjectSpec`) where the `class_name` is the name of your custom Python class. +This guide explains: -## Plugin-style registration +- how HELM discovers custom code +- your options for loading plugins +- a complete end-to-end example using Python entry points (recommended) +- when you may need to set `PYTHONPATH` -Extensions must register themselves at import time, and HELM supports four ways to accomplish this: +--- -1. **Python entry points (recommended).** If your custom code is organized as an installable Python package, you can declare a `helm` entry-point group in your `pyproject.toml`: +## How HELM finds your code - ```toml - [project.entry-points.helm] - my_plugin = "my_package.helm_plugin" - ``` +Custom extensions generally work in one of two ways: - This will allow `helm-run` to automatically import your plugin to make it available at runtime. 
Installing your package as a wheel (or in developer mode via `pip install -e .`), ensures helm can always discover your plugin without explicit modification of `PYTHONPATH`. +1. **Run specs (registered by decorator).** + Run specs are registered *at import time* via `@run_spec_function(...)` and are discoverable by name when you invoke `helm-run`. -2. **Namespace packages under the `helm` module.** HELM automatically discovers run specs placed in the `helm.benchmark.run_specs` namespace (via [`pkgutil.iter_modules`](https://docs.python.org/3/library/pkgutil.html#pkgutil.iter_modules)). You can ship a separate package that contributes modules to this namespace (for example, `helm/benchmark/run_specs/my_run_spec.py`) and registers additional run spec functions when imported. In this case your module must be available in the `PYTHONPATH` as described below. +2. **ObjectSpec-backed classes (loaded by class name).** + Scenarios, metrics, clients, tokenizers, annotators, perturbations, and window services are defined as classes. HELM loads them by: + - importing the module portion of your fully qualified name, and then + - looking up the class name you specify in the relevant `ObjectSpec` (e.g., `ScenarioSpec`, `MetricSpec`, `ClientSpec`, `TokenizerSpec`, `AnnotatorSpec`, `PerturbationSpec`, `WindowServiceSpec`). -3. **Explicit imports via `--plugins`.** This option explicitly tells `helm-run` which module contains your plugin code. You can pass either importable module names or filesystem paths to Python files: +**Key idea:** your module must be importable by Python, and (for run specs) it must be imported so registration code runs. - ```bash - helm-run --plugins my_package.helm_plugin /path/to/local_plugin.py ... - ``` +--- - HELM resolves module names with `importlib.import_module` and file paths with `ubelt.import_module_from_path`, so you can load quick experiments without packaging them. Paths are interpreted literally; module names still need to be importable (for example, by adjusting `PYTHONPATH` as described below). +## Ways to load plugins -4. **Write a Python wrapper script**. There is no need to use the `helm-run` entry point, you can instead write a Python wrapper script that calls `helm.benchmark.run.run_benchmark()`. Python will automatically add the directory containing that script to the Python module search path. If your custom classes live in a Python module under that directory, they will automatically be importable by Python. See [Python's documentation](https://docs.python.org/3/library/sys_path_init.html) for more details. +HELM supports four common approaches. Pick the one that matches how "production" vs. "experimental" your plugin is. + +### 1) Python entry points (recommended for reusable plugins) + +If your custom code is an installable Python package, declare a `helm` entry-point group in your `pyproject.toml`: + +```toml +[project.entry-points.helm] +my_plugin = "my_package.helm_plugin" +``` + +When your package is installed (e.g., as a wheel or with `pip install -e .`), `helm-run` can automatically import the entry point module, making your run specs and classes available without manually tweaking `PYTHONPATH`. + +### 2) Explicit imports via `--plugins` (best for quick experiments) + +You can explicitly tell `helm-run` what to import, using either an importable module name or a filesystem path to a `.py` file: + +```bash +helm-run --plugins my_package.helm_plugin /path/to/local_plugin.py ... 
+``` + +- **Module names** must already be importable (e.g., installed or on `PYTHONPATH`). +- **File paths** are loaded directly, which is convenient for one-off local experiments. + +### 3) Namespace packages under `helm.benchmark.run_specs` (legacy name-based method) + +HELM automatically discovers run specs placed in the `helm.benchmark.run_specs` namespace (via [`pkgutil.iter_modules`](https://docs.python.org/3/library/pkgutil.html#pkgutil.iter_modules)). You can ship a separate package that contributes modules to this namespace (for example, `helm/benchmark/run_specs/my_run_spec.py`) and registers additional run spec functions when imported. In this case your module must be available in the `PYTHONPATH` as described below. + +### 4) A Python wrapper script (when you don't want to use `helm-run`) + +There is no need to use the `helm-run` entry point, you can instead write a Python wrapper script that calls `helm.benchmark.run.run_benchmark()`. Python will automatically add the directory containing that script to the Python module search path. If your custom classes live in a Python module under that directory, they will automatically be importable by Python. See [Python's documentation](https://docs.python.org/3/library/sys_path_init.html) for more details. For example, suppose you implemented a custom `Client` subclass named `MyClient` in the `my_client.py` file under your current working directory, and you have a `ClientSpec` specifying the `class_name` as `my_client.MyClient`. Suppose you added a script called `run_helm.py` that calls `helm.benchmark.run.run_benchmark()` directly. When run using `python run_helm.py`, HELM will be able to import your modules without any additional changes. +When you run `python your_script.py`, Python automatically adds the script's directory to the module search path, so modules under that directory are importable without extra `PYTHONPATH` changes. + +--- + +## Example plugin (entry points + run spec + ObjectSpec classes) + +This compact example shows both registration styles in a single module: + +- a **run spec** registered via `@run_spec_function(...)` +- a **scenario** and **metric** referenced via `class_name=...` in `ScenarioSpec`/`MetricSpec` + +We'll use the entry point approach because it's the most robust for repeated runs. 
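+
+The walkthrough below registers a scenario and a metric. Other ObjectSpec-backed components (clients, tokenizers, annotators, and so on) plug in the same way: implement the class, then reference it by its fully qualified `class_name`. For reference, here is a minimal tokenizer sketch that mirrors the interface exercised in HELM's own plugin tests; the file and class names (`my_tokenizer.py`, `MyTokenizer`) are illustrative:
+
+```python
+# my_tokenizer.py -- a stub tokenizer sketch, mirroring HELM's plugin tests.
+from helm.common.tokenization_request import (
+    TokenizationRequest,
+    TokenizationRequestResult,
+    DecodeRequest,
+    DecodeRequestResult,
+    TokenizationToken,
+)
+from helm.tokenizers.tokenizer import Tokenizer
+
+
+class MyTokenizer(Tokenizer):
+    """A stub tokenizer that treats the whole text as a single token."""
+
+    def tokenize(self, request: TokenizationRequest) -> TokenizationRequestResult:
+        return TokenizationRequestResult(
+            success=True,
+            cached=False,
+            text=request.text,
+            tokens=[TokenizationToken(value=request.text)],
+        )
+
+    def decode(self, request: DecodeRequest) -> DecodeRequestResult:
+        return DecodeRequestResult(success=True, cached=False, text="".join(map(str, request.tokens)))
+```
+
+A `TokenizerSpec` whose `class_name` is `my_tokenizer.MyTokenizer` would load it the same way the scenario and metric below are loaded.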
+ +### Prerequisites + +- A compatible Python (this example uses 3.11) +- [`uv`](https://docs.astral.sh/uv/) installed + +### Step 1 - Initialize a packaged project + +From the directory where you want the plugin project: + +```bash +uv init --package my_example_helm_module --python=3.11 +cd my_example_helm_module +``` + +### Step 2 - Create your plugin module + +Create a module for your plugin code: + +```bash +mkdir -p src/my_example_helm_module +touch src/my_example_helm_module/my_submodule_plugin_code.py +``` + +Your directory should look like: + +```text +my_example_helm_module +├── pyproject.toml +├── README.md +└── src + └── my_example_helm_module + ├── __init__.py + └── my_submodule_plugin_code.py +``` + +Paste the following into `src/my_example_helm_module/my_submodule_plugin_code.py`: + +```python +from typing import List, Optional + +from helm.benchmark.run_spec import RunSpec, run_spec_function +from helm.benchmark.adaptation.adapter_spec import AdapterSpec +from helm.benchmark.metrics.metric import Metric, MetricSpec +from helm.benchmark.metrics.statistic import Stat +from helm.benchmark.metrics.metric_service import MetricService +from helm.benchmark.adaptation.request_state import RequestState +from helm.benchmark.scenarios.scenario import Scenario, ScenarioSpec, ScenarioMetadata, Instance +from helm.benchmark.metrics.evaluate_reference_metrics import compute_reference_metrics +from helm.benchmark.scenarios.scenario import TRAIN_SPLIT, TEST_SPLIT, CORRECT_TAG +from helm.benchmark.scenarios.scenario import Input, Output, Reference + + +class CustomScenario(Scenario): + name = "custom_scenario" + description = "A tiny scenario used for testing." + tags = ["custom"] + + def get_instances(self, output_path: str) -> List[Instance]: + # We include 5 TRAIN_SPLIT instances because the generation adapter + # uses a few-shot train instances pool by default. If you return 0 + # train instances, you'll see: "only 0 training instances, wanted 5". + examples = [ + # (question, answer, split) + ("1+1=?", "2", TRAIN_SPLIT), + ("2+2=?", "4", TRAIN_SPLIT), + ("3+3=?", "6", TRAIN_SPLIT), + ("4+4=?", "8", TRAIN_SPLIT), + ("5+5=?", "10", TRAIN_SPLIT), + ("6+6=?", "12", TEST_SPLIT), + ("7+7=?", "14", TEST_SPLIT), + ] + + instances: List[Instance] = [] + train_i = 0 + test_i = 0 + + for q, a, split in examples: + if split == TRAIN_SPLIT: + train_i += 1 + instance_id = f"train-{train_i:03d}" + else: + test_i += 1 + instance_id = f"test-{test_i:03d}" + + instances.append( + Instance( + id=instance_id, + input=Input(text=f"Q: {q}\nA:"), + references=[Reference(output=Output(text=a), tags=[CORRECT_TAG])], + split=split, + ) + ) + + return instances + + def get_metadata(self) -> ScenarioMetadata: + return ScenarioMetadata(name=self.name, main_metric="custom_metric", main_split="test") + + +class CustomMetric(Metric): + """A simple, extensible metric. + + To keep the example compact, we just call HELM's reference-metric helper. 
+ """ + + def __init__(self, names: Optional[List[str]] = None): + self.names = names or ["exact_match"] + + def evaluate_generation( + self, + adapter_spec: AdapterSpec, + request_state: RequestState, + metric_service: MetricService, + eval_cache_path: str, + ) -> List[Stat]: + return compute_reference_metrics( + names=self.names, + adapter_spec=adapter_spec, + request_state=request_state, + metric_service=metric_service, + ) + + +@run_spec_function("my_custom_run_spec") +def build_custom_run_spec() -> RunSpec: + return RunSpec( + name="my_custom_run_spec", + scenario_spec=ScenarioSpec(class_name="my_example_helm_module.my_submodule_plugin_code.CustomScenario"), + adapter_spec=AdapterSpec(method="generation"), + metric_specs=[MetricSpec(class_name="my_example_helm_module.my_submodule_plugin_code.CustomMetric")], + ) +``` + +Two things to notice: + +- The run spec is registered by the decorator **when the module is imported**. +- The scenario and metric are referenced via fully qualified `class_name` strings. + +### Step 3 - Register the plugin via entry points + +Edit `pyproject.toml` and add: + +```toml +[project.entry-points.helm] +my_helm_plugin = "my_example_helm_module.my_submodule_plugin_code" +``` + +Then install your package in editable mode: + +```bash +uv pip install -e . +``` + +### Step 4 - Run with your custom plugin + +Now `helm-run` should discover your plugin through the entry point: + +```bash +helm-run --run-entries my_custom_run_spec:model=openai/gpt2 --suite tutorial --max-eval-instances 10 +``` + +--- + ## Adding the current working directory to PYTHONPATH @@ -38,10 +246,10 @@ HELM will only be able to use custom classes that can be imported by Python. Dep If the custom classes live in a Python module under the current working directory, you should modify `PYTHONPATH` to make that Python module importable. -This is required because - in some environment - Python does not add the current working directory to the Python module search path running when using command line comments / Python entry points such as `helm-run`. See [Python's documentation](https://docs.python.org/3/library/sys_path_init.html) for more details. +This is required because - in some environments - Python does not add the current working directory to the Python module search path when running command line commands / Python entry points such as `helm-run`. See [Python's documentation](https://docs.python.org/3/library/sys_path_init.html) for more details. For example, suppose you implemented a custom `Client` subclass named `MyClient` in the `my_client.py` file under your current working directory, and you have a `ClientSpec` specifying the `class_name` as `my_client.MyClient`. To make your file importable by Python, you have to add `.` to your `PYTHONPATH` so that Python will search in your current working directory for your custom Python modules. -In Bash, you can do this by running `export PYTHONPATH=".:$PYTHONPATH"` before running `helm-run`, or by prefixing `helm-run` with `PYTHONPATH=".:$PYTHONPATH `. +In Bash, you can do this by running `export PYTHONPATH=".:$PYTHONPATH"` before running `helm-run`, or by prefixing `helm-run` with `PYTHONPATH=".:$PYTHONPATH"`. 
diff --git a/src/helm/benchmark/test_plugins.py b/src/helm/benchmark/test_plugins.py index 07faab8b3e..65c8885e92 100644 --- a/src/helm/benchmark/test_plugins.py +++ b/src/helm/benchmark/test_plugins.py @@ -2,7 +2,7 @@ import importlib.metadata import logging import sys - +from textwrap import dedent from helm.benchmark.run import import_user_plugins, load_entry_point_plugins @@ -34,16 +34,18 @@ def test_load_entry_point_plugins_handles_failures(tmp_path, monkeypatch, caplog dist_info = plugin_dir / "entrypoint-0.0.0.dist-info" dist_info.mkdir() - (dist_info / "METADATA").write_text("""\ -Metadata-Version: 2.1 -Name: entrypoint -Version: 0.0.0 -""") - (dist_info / "entry_points.txt").write_text("""\ -[helm_test] -good = good_plugin:FLAG -bad = bad_plugin:FLAG -""") + (dist_info / "METADATA").write_text(dedent( + """ + Metadata-Version: 2.1 + Name: entrypoint + Version: 0.0.0 + """)) + (dist_info / "entry_points.txt").write_text(dedent( + """ + [helm_test] + good = good_plugin:FLAG + bad = bad_plugin:FLAG + """)) monkeypatch.syspath_prepend(str(plugin_dir)) importlib.invalidate_caches() @@ -64,24 +66,23 @@ def test_import_user_plugins_supports_namespace_packages(tmp_path, monkeypatch): (plugin_root / "helm" / "benchmark" / "__init__.py").write_text("") (run_specs_dir / "__init__.py").write_text("") - (run_specs_dir / "custom.py").write_text( + (run_specs_dir / "custom.py").write_text(dedent( """ -from helm.benchmark.adaptation.adapter_spec import AdapterSpec -from helm.benchmark.metrics.metric import MetricSpec -from helm.benchmark.run_spec import RunSpec, run_spec_function -from helm.benchmark.scenarios.scenario import ScenarioSpec - - -@run_spec_function("custom_namespace_run") -def build_run_spec(): - return RunSpec( - name="custom_namespace_run", - scenario_spec=ScenarioSpec(class_name="helm.benchmark.scenarios.scenario.Scenario"), - adapter_spec=AdapterSpec(model="dummy"), - metric_specs=[MetricSpec(class_name="helm.benchmark.metrics.metric.Metric")], - ) -""" - ) + from helm.benchmark.adaptation.adapter_spec import AdapterSpec + from helm.benchmark.metrics.metric import MetricSpec + from helm.benchmark.run_spec import RunSpec, run_spec_function + from helm.benchmark.scenarios.scenario import ScenarioSpec + + + @run_spec_function("custom_namespace_run") + def build_run_spec(): + return RunSpec( + name="custom_namespace_run", + scenario_spec=ScenarioSpec(class_name="helm.benchmark.scenarios.scenario.Scenario"), + adapter_spec=AdapterSpec(model="dummy"), + metric_specs=[MetricSpec(class_name="helm.benchmark.metrics.metric.Metric")], + ) + """)) import helm import helm.benchmark @@ -103,3 +104,108 @@ def build_run_spec(): assert get_run_spec_function("custom_namespace_run") is not None + +def test_import_user_plugins_supports_object_spec_plugins(tmp_path, monkeypatch): + module_name = "custom_component_plugin" + module_file = tmp_path / f"{module_name}.py" + module_file.write_text(dedent( + """ + from typing import List + + from helm.benchmark.adaptation.adapter_spec import AdapterSpec + from helm.benchmark.adaptation.request_state import RequestState + from helm.benchmark.metrics.metric import Metric, MetricSpec + from helm.benchmark.metrics.metric_service import MetricService + from helm.benchmark.metrics.statistic import Stat + from helm.benchmark.run_spec import RunSpec, run_spec_function + from helm.benchmark.scenarios.scenario import Scenario, ScenarioMetadata, ScenarioSpec, Instance + from helm.clients.client import Client + from helm.common.request import Request, RequestResult + 
from helm.common.tokenization_request import ( + TokenizationRequest, + TokenizationRequestResult, + DecodeRequest, + DecodeRequestResult, + TokenizationToken, + ) + from helm.tokenizers.tokenizer import Tokenizer + + + @run_spec_function("custom_plugin_run_spec") + def build_run_spec() -> RunSpec: + return RunSpec( + name="custom_plugin_run_spec", + scenario_spec=ScenarioSpec(class_name="custom_component_plugin.CustomScenario"), + adapter_spec=AdapterSpec(model="dummy"), + metric_specs=[MetricSpec(class_name="custom_component_plugin.CustomMetric")], + ) + + + class CustomScenario(Scenario): + name = "custom_plugin_scenario" + description = "A custom scenario for plugin tests." + tags = ["custom"] + + def get_instances(self, output_path: str) -> List[Instance]: + return [] + + def get_metadata(self) -> ScenarioMetadata: + return ScenarioMetadata(name=self.name, main_metric="custom_metric", main_split="test") + + + class CustomMetric(Metric): + def evaluate_generation( + self, + adapter_spec: AdapterSpec, + request_state: RequestState, + metric_service: MetricService, + eval_cache_path: str, + ) -> List[Stat]: + return [] + + + class CustomClient(Client): + def make_request(self, request: Request) -> RequestResult: + return RequestResult(success=True, cached=False, embedding=[], completions=[]) + + + class CustomTokenizer(Tokenizer): + def tokenize(self, request: TokenizationRequest) -> TokenizationRequestResult: + return TokenizationRequestResult( + success=True, + cached=False, + text=request.text, + tokens=[TokenizationToken(value=request.text)], + ) + + def decode(self, request: DecodeRequest) -> DecodeRequestResult: + return DecodeRequestResult(success=True, cached=False, text="".join(map(str, request.tokens))) + """)) + + monkeypatch.syspath_prepend(tmp_path) + + if module_name in sys.modules: + importlib.invalidate_caches() + del sys.modules[module_name] + + import_user_plugins([module_name]) + + from helm.benchmark.metrics.metric import Metric, MetricSpec, create_metric + from helm.benchmark.model_deployment_registry import ClientSpec + from helm.benchmark.scenarios.scenario import Scenario, ScenarioSpec, create_scenario + from helm.benchmark.run_spec import get_run_spec_function + from helm.benchmark.tokenizer_config_registry import TokenizerSpec + from helm.clients.client import Client + from helm.common.object_spec import create_object + from helm.tokenizers.tokenizer import Tokenizer + + scenario = create_scenario(ScenarioSpec(class_name=f"{module_name}.CustomScenario")) + metric = create_metric(MetricSpec(class_name=f"{module_name}.CustomMetric")) + client = create_object(ClientSpec(class_name=f"{module_name}.CustomClient")) + tokenizer = create_object(TokenizerSpec(class_name=f"{module_name}.CustomTokenizer")) + + assert isinstance(scenario, Scenario) + assert isinstance(metric, Metric) + assert isinstance(client, Client) + assert isinstance(tokenizer, Tokenizer) + assert get_run_spec_function("custom_plugin_run_spec") is not None