Conversation

anticorrelator
Contributor

  • Updates executor/rate limiter port in phoenix.client to be compatible with phoenix.evals changes
  • Enables broad compatibility with Phoenix Evals 2.0 Evaluators for use in experiments
  • Enables functions that return Phoenix Evals 2.0 Scores to be passed into experiments
  • Expands Phoenix Evals 2.0 input_mapping

Input mapping callables can now accept multiple arguments; when an argument name matches a top-level eval input key, the corresponding value is automatically bound to that argument.

Furthermore, the standard "special" evaluator bindings ("input", "output", "expected", "metadata", "example") can be used either as arguments to mapping lambdas or accessed via the jq-like path specification for data.
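The binding rules described above can be sketched as follows. `bind_by_name` and the sample `eval_input` are hypothetical stand-ins for the library's internal helper, shown here only to illustrate the name-matching behavior:

```python
import inspect

def bind_by_name(mapping_function, eval_input):
    """Sketch of name-based binding: parameters whose names all match
    top-level eval_input keys receive those values directly."""
    parameters = inspect.signature(mapping_function).parameters
    if all(name in eval_input for name in parameters):
        return mapping_function(**{name: eval_input[name] for name in parameters})
    # otherwise fall back to legacy behavior: pass the whole eval_input
    return mapping_function(eval_input)

eval_input = {"input": "2+2", "output": "4", "expected": "4"}

# multi-argument callable: "output" and "expected" bind by name
assert bind_by_name(lambda output, expected: output == expected, eval_input) is True

# unmatched parameter name: falls back to receiving the whole dict
assert bind_by_name(lambda row: row["input"], eval_input) == "2+2"
```

In this sketch, name-matching takes precedence over the legacy whole-dict call, mirroring the behavior described above for multi-argument callables.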

@anticorrelator anticorrelator requested a review from a team as a code owner September 9, 2025 01:21
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Sep 9, 2025
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Sep 9, 2025

Comment on lines 32 to 41
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        pass
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Contributor

There appears to be a logic issue in the single parameter binding condition. When a function has exactly one parameter:

  1. If the parameter name exists in eval_input, the code falls through to the multi-parameter binding logic (due to the empty pass statement)
  2. If the parameter name doesn't exist in eval_input, it correctly falls back to legacy behavior

This creates inconsistent handling of single-parameter functions. For backward compatibility, single-parameter functions should either:

  • Always receive the entire eval_input object (legacy behavior)
  • Or consistently use parameter name matching when available

Consider revising to either:

if len(parameters) == 1:
    # Always use legacy behavior for single-parameter functions
    return mapping_function(eval_input)

Or explicitly handle the single parameter case before falling through to multi-parameter logic.
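Assuming `parameters` behaves like the mapping returned by `inspect.signature(...).parameters`, the fall-through described in point 1 can be reproduced with a small sketch (names hypothetical):

```python
def bind_original(mapping_function, eval_input, parameters):
    """Reproduction of the reviewed control flow; parameters is a dict of
    parameter names, as returned by inspect.signature(...).parameters."""
    if len(parameters) == 1:
        parameter_name = next(iter(parameters.keys()))
        if parameter_name in eval_input:
            pass  # no binding call here: execution falls through the if-block
        else:
            return mapping_function(eval_input)
    else:
        return mapping_function(eval_input)
    # reached only for a single parameter whose name matches an eval_input key
    return "fell through"

eval_input = {"output": "4"}
# single parameter named "output": hits the pass branch and falls through
assert bind_original(lambda output: output, eval_input, {"output": None}) == "fell through"
# single parameter with a non-matching name: legacy behavior, whole dict passed
assert bind_original(lambda row: row, eval_input, {"row": None}) == eval_input
```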

Suggested change
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        pass
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

if len(parameters) == 1:
    # Always use legacy behavior for single-parameter functions
    return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

Spotted by Diamond



- `agenerate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`
- `async_generate_text(prompt: str | MultimodalPrompt, **kwargs) -> str`
- `async_generate_object(prompt: str | MultimodalPrompt, schema: dict, method: str, **kwargs) -> dict`
- `async_generate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`

The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods (without the 'a' prefix) for synchronous code, and the async methods (with the 'a' prefix) for asynchronous code.
Contributor

The documentation still refers to methods with the 'a' prefix, but the methods have been renamed to use the 'async_' prefix. This should be updated to: "The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code."

Suggested change
The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods (without the 'a' prefix) for synchronous code, and the async methods (with the 'a' prefix) for asynchronous code.
The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code.
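As a hedged illustration of the corrected naming convention, the hypothetical `TextModel` below pairs each sync method with an `async_`-prefixed counterpart; it is a stand-in, not the actual LLM class:

```python
import asyncio

class TextModel:
    """Hypothetical illustration of the sync/async naming convention:
    each operation has a sync method and an async_-prefixed counterpart."""

    def generate_text(self, prompt: str) -> str:
        return f"echo: {prompt}"

    async def async_generate_text(self, prompt: str) -> str:
        # a real implementation would await an API call here
        return self.generate_text(prompt)

model = TextModel()
assert model.generate_text("hi") == "echo: hi"
assert asyncio.run(model.async_generate_text("hi")) == "echo: hi"
```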


@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Sep 12, 2025

Comment on lines 1391 to 1392
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Contributor

Type safety bug: The code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later. The cast should include proper validation or the type checking should be more restrictive.
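A small self-contained sketch (with a stand-in `EvaluationResult` dataclass) shows why this guard alone does not validate element types:

```python
from collections.abc import Sequence
from dataclasses import dataclass

@dataclass
class EvaluationResult:  # stand-in for the real result type
    score: float

def is_result_sequence(result) -> bool:
    """The guard under review: any Sequence that is not str/bytes/dict."""
    return isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict))

# the guard admits any non-str/bytes/dict sequence...
assert is_result_sequence([EvaluationResult(1.0)])
assert is_result_sequence(["not a result"])   # element types are unchecked
assert not is_result_sequence("a string")     # str is a Sequence but excluded

# ...so element validation has to happen separately
items = [EvaluationResult(1.0), "oops"]
valid = [x for x in items if isinstance(x, EvaluationResult)]
assert len(valid) == 1
```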

Suggested change
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Validate that all items in the sequence are of the expected type
    results_to_submit = [item for item in result if isinstance(item, EvaluationResult)]
    if len(results_to_submit) != len(result):
        logger.warning(
            "Some items in result sequence were not EvaluationResult objects and were filtered out"
        )


Comment on lines 2425 to 2426
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Contributor

Type safety bug: Same issue as the sync version - the code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later.

Suggested change
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Ensure all items in the sequence are valid result objects
    if all(isinstance(item, EvaluationResult) for item in result):
        results_to_submit = list(result)
    else:
        raise TypeError("All items in the sequence must be EvaluationResult objects")


Comment on lines +34 to +41
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Contributor

Logic error: The function has inconsistent parameter handling. When len(parameters) == 1, it first checks if the parameter name exists in eval_input and calls the function with that specific value, but if not found, it falls back to calling with the entire eval_input dict. This creates inconsistent behavior where the same function could receive either a specific value or the entire dict depending on key names, potentially causing runtime errors in the mapping function.
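A minimal reproduction of this control flow (with a hypothetical `bind_single` wrapper) makes the two call shapes concrete:

```python
import inspect

def bind_single(mapping_function, eval_input):
    """Sketch of the single-parameter behavior under review: bind by name
    when the parameter matches an eval_input key, else pass the whole dict."""
    parameters = inspect.signature(mapping_function).parameters
    if len(parameters) == 1:
        parameter_name = next(iter(parameters.keys()))
        if parameter_name in eval_input:
            return mapping_function(eval_input[parameter_name])
        return mapping_function(eval_input)
    return mapping_function(eval_input)

eval_input = {"output": "4"}
# parameter name matches a key: the function receives just that value...
assert bind_single(lambda output: output, eval_input) == "4"
# ...but a non-matching name means it receives the entire dict
assert bind_single(lambda row: row, eval_input) == {"output": "4"}
```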

Suggested change
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)

if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        # Parameter name not found in eval_input, raise an error or handle consistently
        raise KeyError(f"Parameter '{parameter_name}' not found in eval_input")
else:
    return mapping_function(eval_input)


Contributor

@ehutt ehutt left a comment


Awesome!!!

@@ -7,6 +8,46 @@


# --- Input Map/Transform Helpers ---
def _bind_mapping_function(
Contributor

Fancy! We should make sure to document this well since it's non-obvious behavior.

__call__ = evaluate
# ensure the callable inherits evaluate's docs for IDE support
__call__.__doc__ = evaluate.__doc__
def bind(self, input_mapping: InputMappingType) -> None:
Contributor

Nice!
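The aliasing pattern in the snippet above can be reduced to a minimal sketch (the `Evaluator` class here is hypothetical):

```python
class Evaluator:
    def evaluate(self, value: int) -> int:
        """Double the value (docstring that IDEs will surface)."""
        return value * 2

    # alias evaluate as __call__ so instances are directly callable
    __call__ = evaluate
    # ensure the callable inherits evaluate's docs for IDE support
    __call__.__doc__ = evaluate.__doc__

e = Evaluator()
assert e(3) == 6
assert Evaluator.__call__.__doc__ == Evaluator.evaluate.__doc__
```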

@@ -230,6 +239,14 @@ def describe(self) -> Dict[str, Any]:
"input_schema": schema,
}

def input_mapping_description(self) -> Dict[str, Any]:
Contributor

This just returns a list of the input_mapping keys, yeah? How do you imagine people using this method? It seems not very useful to me.

Contributor Author

oh, this replicates a method on the old BoundEvaluator object that was enforced by a unit test; I don't think it's super valuable personally

@@ -613,6 +632,9 @@ def _evaluate(self, eval_input: EvalInput) -> List[Score]:
score = _convert_to_score(result, name, source, direction)
return [score]

def __call__(self, *args: Any, **kwargs: Any) -> Any:
Contributor

Very nice. We should make sure to document this behavior.

@github-project-automation github-project-automation bot moved this from 📘 Todo to 👍 Approved in phoenix Sep 12, 2025
    evaluator = (
        create_evaluator(name=name)(value) if not isinstance(value, Evaluator) else value
    )
elif isinstance(obj, Mapping):

Bug: Dead Code in Mapping Evaluation

The _evaluators_by_name function contains an elif isinstance(obj, Mapping): statement immediately following a return statement. This causes a SyntaxError, preventing the code from compiling and correctly processing evaluator inputs provided as mappings.


@anticorrelator anticorrelator merged commit 90e4dbc into main Sep 12, 2025
41 checks passed
@anticorrelator anticorrelator deleted the dustin/experiment-evals-compatibility branch September 12, 2025 15:06
@github-project-automation github-project-automation bot moved this from 👍 Approved to ✅ Done in phoenix Sep 12, 2025
3 participants