Conversation

anticorrelator
Contributor

  • Updates executor/rate limiter port in phoenix.client to be compatible with phoenix.evals changes
  • Enables broad compatibility with Phoenix Evals 2.0 Evaluators for use in experiments
  • Enables functions that return Phoenix Evals 2.0 Scores to be passed into experiments
  • Expands Phoenix Evals 2.0 input_mapping

Input mapping callables can now accept multiple arguments; when an argument name matches a top-level eval input key, the corresponding value is automatically bound to that argument.

Furthermore, the standard "special" evaluator bindings ("input", "output", "expected", "metadata", "example") can be used either as arguments to mapping lambdas or accessed via the jq-like path specification for data.
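The binding rules described above can be sketched as follows. `bind_by_name` and the sample `eval_input` are hypothetical stand-ins for the library's internal helper, shown here only to illustrate the name-matching behavior:

```python
import inspect

def bind_by_name(mapping_function, eval_input):
    """Sketch of name-based binding: parameters whose names all match
    top-level eval_input keys receive those values directly."""
    parameters = inspect.signature(mapping_function).parameters
    if all(name in eval_input for name in parameters):
        return mapping_function(**{name: eval_input[name] for name in parameters})
    # otherwise fall back to legacy behavior: pass the whole eval_input
    return mapping_function(eval_input)

eval_input = {"input": "2+2", "output": "4", "expected": "4"}

# multi-argument callable: "output" and "expected" bind by name
assert bind_by_name(lambda output, expected: output == expected, eval_input) is True

# unmatched parameter name: falls back to receiving the whole dict
assert bind_by_name(lambda row: row["input"], eval_input) == "2+2"
```

In this sketch, name-matching takes precedence over the legacy whole-dict call, mirroring the behavior described above for multi-argument callables.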

@anticorrelator anticorrelator requested a review from a team as a code owner September 9, 2025 01:21
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Sep 9, 2025
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Sep 9, 2025

Comment on lines 32 to 41
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        pass
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Contributor

There appears to be a logic issue in the single parameter binding condition. When a function has exactly one parameter:

  1. If the parameter name exists in eval_input, the code falls through to the multi-parameter binding logic (due to the empty pass statement)
  2. If the parameter name doesn't exist in eval_input, it correctly falls back to legacy behavior

This creates inconsistent handling of single-parameter functions. For backward compatibility, single-parameter functions should either:

  • Always receive the entire eval_input object (legacy behavior)
  • Or consistently use parameter name matching when available

Consider revising to either:

if len(parameters) == 1:
    # Always use legacy behavior for single-parameter functions
    return mapping_function(eval_input)

Or explicitly handle the single parameter case before falling through to multi-parameter logic.
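Assuming `parameters` behaves like the mapping returned by `inspect.signature(...).parameters`, the fall-through described in point 1 can be reproduced with a small sketch (names hypothetical):

```python
def bind_original(mapping_function, eval_input, parameters):
    """Reproduction of the reviewed control flow; parameters is a dict of
    parameter names, as returned by inspect.signature(...).parameters."""
    if len(parameters) == 1:
        parameter_name = next(iter(parameters.keys()))
        if parameter_name in eval_input:
            pass  # no binding call here: execution falls through the if-block
        else:
            return mapping_function(eval_input)
    else:
        return mapping_function(eval_input)
    # reached only for a single parameter whose name matches an eval_input key
    return "fell through"

eval_input = {"output": "4"}
# single parameter named "output": hits the pass branch and falls through
assert bind_original(lambda output: output, eval_input, {"output": None}) == "fell through"
# single parameter with a non-matching name: legacy behavior, whole dict passed
assert bind_original(lambda row: row, eval_input, {"row": None}) == eval_input
```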

Suggested change
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        pass
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

if len(parameters) == 1:
    # Always use legacy behavior for single-parameter functions
    return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

Spotted by Diamond



- `agenerate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`
- `async_generate_text(prompt: str | MultimodalPrompt, **kwargs) -> str`
- `async_generate_object(prompt: str | MultimodalPrompt, schema: dict, method: str, **kwargs) -> dict`
- `async_generate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`

The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods (without the 'a' prefix) for synchronous code, and the async methods (with the 'a' prefix) for asynchronous code.
Contributor

The documentation still refers to methods with the 'a' prefix, but the methods have been renamed to use the 'async_' prefix. This should be updated to: "The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code."

Suggested change
The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods (without the 'a' prefix) for synchronous code, and the async methods (with the 'a' prefix) for asynchronous code.
The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code.
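As a hedged illustration of the corrected naming convention, the hypothetical `TextModel` below pairs each sync method with an `async_`-prefixed counterpart; it is a stand-in, not the actual LLM class:

```python
import asyncio

class TextModel:
    """Hypothetical illustration of the sync/async naming convention:
    each operation has a sync method and an async_-prefixed counterpart."""

    def generate_text(self, prompt: str) -> str:
        return f"echo: {prompt}"

    async def async_generate_text(self, prompt: str) -> str:
        # a real implementation would await an API call here
        return self.generate_text(prompt)

model = TextModel()
assert model.generate_text("hi") == "echo: hi"
assert asyncio.run(model.async_generate_text("hi")) == "echo: hi"
```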


@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 12, 2025

Comment on lines 1391 to 1392
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Contributor

Type safety bug: The code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later. The cast should include proper validation or the type checking should be more restrictive.
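A small self-contained sketch (with a stand-in `EvaluationResult` dataclass) shows why this guard alone does not validate element types:

```python
from collections.abc import Sequence
from dataclasses import dataclass

@dataclass
class EvaluationResult:  # stand-in for the real result type
    score: float

def is_result_sequence(result) -> bool:
    """The guard under review: any Sequence that is not str/bytes/dict."""
    return isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict))

# the guard admits any non-str/bytes/dict sequence...
assert is_result_sequence([EvaluationResult(1.0)])
assert is_result_sequence(["not a result"])   # element types are unchecked
assert not is_result_sequence("a string")     # str is a Sequence but excluded

# ...so element validation has to happen separately
items = [EvaluationResult(1.0), "oops"]
valid = [x for x in items if isinstance(x, EvaluationResult)]
assert len(valid) == 1
```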

Suggested change
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Validate that all items in the sequence are of the expected type
    results_to_submit = [item for item in result if isinstance(item, EvaluationResult)]
    if len(results_to_submit) != len(result):
        logger.warning(
            "Some items in result sequence were not EvaluationResult objects and were filtered out"
        )


Comment on lines 2425 to 2426
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Contributor

Type safety bug: Same issue as the sync version - the code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later.

Suggested change
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Ensure all items in the sequence are valid result objects
    if all(isinstance(item, EvaluationResult) for item in result):
        results_to_submit = list(result)
    else:
        raise TypeError("All items in the sequence must be EvaluationResult objects")


Comment on lines +34 to +41
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Contributor

Logic error: The function has inconsistent parameter handling. When len(parameters) == 1, it first checks if the parameter name exists in eval_input and calls the function with that specific value, but if not found, it falls back to calling with the entire eval_input dict. This creates inconsistent behavior where the same function could receive either a specific value or the entire dict depending on key names, potentially causing runtime errors in the mapping function.
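A minimal reproduction of this control flow (with a hypothetical `bind_single` wrapper) makes the two call shapes concrete:

```python
import inspect

def bind_single(mapping_function, eval_input):
    """Sketch of the single-parameter behavior under review: bind by name
    when the parameter matches an eval_input key, else pass the whole dict."""
    parameters = inspect.signature(mapping_function).parameters
    if len(parameters) == 1:
        parameter_name = next(iter(parameters.keys()))
        if parameter_name in eval_input:
            return mapping_function(eval_input[parameter_name])
        return mapping_function(eval_input)
    return mapping_function(eval_input)

eval_input = {"output": "4"}
# parameter name matches a key: the function receives just that value...
assert bind_single(lambda output: output, eval_input) == "4"
# ...but a non-matching name means it receives the entire dict
assert bind_single(lambda row: row, eval_input) == {"output": "4"}
```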

Suggested change
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        # Parameter name not found in eval_input, raise an error or handle consistently
        raise KeyError(f"Parameter '{parameter_name}' not found in eval_input")
else:
    return mapping_function(eval_input)


Contributor

@ehutt ehutt left a comment


Awesome!!!

@@ -7,6 +8,46 @@


# --- Input Map/Transform Helpers ---
def _bind_mapping_function(
Contributor

Fancy! We should make sure to document this well since it's non-obvious behavior.

__call__ = evaluate
# ensure the callable inherits evaluate's docs for IDE support
__call__.__doc__ = evaluate.__doc__
def bind(self, input_mapping: InputMappingType) -> None:
Contributor

Nice!
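The aliasing pattern in the snippet above can be reduced to a minimal sketch (the `Evaluator` class here is hypothetical):

```python
class Evaluator:
    def evaluate(self, value: int) -> int:
        """Double the value (docstring that IDEs will surface)."""
        return value * 2

    # alias evaluate as __call__ so instances are directly callable
    __call__ = evaluate
    # ensure the callable inherits evaluate's docs for IDE support
    __call__.__doc__ = evaluate.__doc__

e = Evaluator()
assert e(3) == 6
assert Evaluator.__call__.__doc__ == Evaluator.evaluate.__doc__
```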

@@ -230,6 +239,14 @@ def describe(self) -> Dict[str, Any]:
"input_schema": schema,
}

def input_mapping_description(self) -> Dict[str, Any]:
Contributor

This just returns a list of the input_mapping keys, yeah? How do you imagine people using this method? It seems not very useful to me.

Contributor Author

oh, this replicates a method on the old BoundEvaluator object that was enforced by a unit test; I don't think it's super valuable personally

@@ -613,6 +632,9 @@ def _evaluate(self, eval_input: EvalInput) -> List[Score]:
score = _convert_to_score(result, name, source, direction)
return [score]

def __call__(self, *args: Any, **kwargs: Any) -> Any:
Contributor

Very nice. We should make sure to document this behavior.

@github-project-automation github-project-automation bot moved this from 📘 Todo to 👍 Approved in phoenix Sep 12, 2025
    evaluator = (
        create_evaluator(name=name)(value) if not isinstance(value, Evaluator) else value
    )
elif isinstance(obj, Mapping):

Bug: Dead Code in Mapping Evaluation

The _evaluators_by_name function contains an elif isinstance(obj, Mapping): statement immediately following a return statement. This causes a SyntaxError, preventing the code from compiling and correctly processing evaluator inputs provided as mappings.


@anticorrelator anticorrelator merged commit 90e4dbc into main Sep 12, 2025
41 checks passed
@anticorrelator anticorrelator deleted the dustin/experiment-evals-compatibility branch September 12, 2025 15:06
@github-project-automation github-project-automation bot moved this from 👍 Approved to ✅ Done in phoenix Sep 12, 2025
3 participants