feat: Experiments<->Evals 2.0 compatibility #9442
packages/phoenix-client/src/phoenix/client/resources/experiments/__init__.py (resolved, outdated)
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        pass
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
There appears to be a logic issue in the single-parameter binding condition. When a function has exactly one parameter:
- If the parameter name exists in `eval_input`, the code falls through to the multi-parameter binding logic (due to the empty `pass` statement)
- If the parameter name doesn't exist in `eval_input`, it correctly falls back to legacy behavior

This creates inconsistent handling of single-parameter functions. For backward compatibility, single-parameter functions should either:
- Always receive the entire `eval_input` object (legacy behavior)
- Or consistently use parameter name matching when available
Consider revising to either:
if len(parameters) == 1:
# Always use legacy behavior for single-parameter functions
return mapping_function(eval_input)
Or explicitly handle the single parameter case before falling through to multi-parameter logic.
Suggested change:

if len(parameters) == 1:
    # Always use legacy behavior for single-parameter functions
    return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Spotted by Diamond
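The suggested always-legacy behavior can be sketched in isolation. This is a minimal, hypothetical illustration (the name `bind_eval_input` and the multi-parameter kwargs binding are assumptions, not the actual Phoenix helper):

```python
import inspect
from typing import Any, Callable, Mapping

def bind_eval_input(mapping_function: Callable[..., Any], eval_input: Mapping[str, Any]) -> Any:
    # Hypothetical sketch of the suggested fix: a single-parameter function
    # always receives the entire eval_input mapping (legacy behavior),
    # regardless of whether its parameter name matches a key.
    parameters = inspect.signature(mapping_function).parameters
    if len(parameters) == 1:
        return mapping_function(eval_input)
    # Multi-parameter functions are bound by matching parameter names to keys.
    return mapping_function(**{k: v for k, v in eval_input.items() if k in parameters})

row = {"output": "hello", "expected": "hi"}
# Both single-parameter callables see the same shape: the whole dict.
assert bind_eval_input(lambda x: x, row) == row
assert bind_eval_input(lambda output: output, row) == row
```

With this shape, the result a single-parameter callable receives never depends on what its parameter happens to be named.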
Resolved review threads (outdated):
- packages/phoenix-client/src/phoenix/client/resources/experiments/__init__.py (×6)
- packages/phoenix-client/src/phoenix/client/resources/experiments/evaluators.py (×3)
- packages/phoenix-client/src/phoenix/client/resources/experiments/types.py
docs/evaluation/preview/llm.md (outdated)
- `agenerate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`
- `async_generate_text(prompt: str | MultimodalPrompt, **kwargs) -> str`
- `async_generate_object(prompt: str | MultimodalPrompt, schema: dict, method: str, **kwargs) -> dict`
- `async_generate_classification(prompt: str | MultimodalPrompt, labels: list[str] | dict[str,str], include_explanation: bool = True, description: str | None = None, **kwargs) -> dict`

The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods (without the 'a' prefix) for synchronous code, and the async methods (with the 'a' prefix) for asynchronous code.
The documentation still refers to methods with the 'a' prefix, but the methods have been renamed to use the 'async_' prefix. This should be updated to: "The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code."
Suggested change:

The LLM class provides both synchronous and asynchronous methods for all operations. Use the sync methods for synchronous code, and the async methods (with the 'async_' prefix) for asynchronous code.
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Type safety bug: The code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later. The cast should include proper validation or the type checking should be more restrictive.
Suggested change:

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Validate that all items in the sequence are of the expected type
    results_to_submit = [item for item in result if isinstance(item, EvaluationResult)]
    if len(results_to_submit) != len(result):
        logger.warning(
            "Some items in result sequence were not EvaluationResult objects and were filtered out"
        )
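A standalone sketch of the filter-and-warn approach. `EvaluationResult` is stubbed here as a minimal dataclass; the real class and the surrounding submission logic live in the Phoenix codebase, so treat this only as an illustration of the validation pattern:

```python
import logging
from dataclasses import dataclass
from typing import Any, List, Sequence

logger = logging.getLogger(__name__)

@dataclass
class EvaluationResult:  # minimal stand-in for the real Phoenix class
    score: float

def normalize_results(result: Any) -> List[EvaluationResult]:
    # Filter-and-warn variant: drop items that are not EvaluationResult
    # instead of letting them fail later during submission.
    if isinstance(result, EvaluationResult):
        return [result]
    if isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
        valid = [item for item in result if isinstance(item, EvaluationResult)]
        if len(valid) != len(result):
            logger.warning(
                "Some items in result sequence were not EvaluationResult objects "
                "and were filtered out"
            )
        return valid
    raise TypeError(f"Unsupported result type: {type(result).__name__}")

mixed = [EvaluationResult(1.0), "not-a-result", EvaluationResult(0.5)]
assert [r.score for r in normalize_results(mixed)] == [1.0, 0.5]
```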
elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    results_to_submit = list(result)  # type: ignore[reportUnknownArgumentType]
Type safety bug: Same issue as the sync version - the code checks if result is a Sequence but excludes str, bytes, dict, then immediately casts it to list without proper type validation. If result is a Sequence that contains non-EvaluationResult items, this will cause runtime errors when the results are processed later.
Suggested change:

elif isinstance(result, Sequence) and not isinstance(result, (str, bytes, dict)):
    # Ensure all items in the sequence are valid result objects
    if all(isinstance(item, EvaluationResult) for item in result):
        results_to_submit = list(result)
    else:
        raise TypeError("All items in the sequence must be EvaluationResult objects")
if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        return mapping_function(eval_input)
else:
    return mapping_function(eval_input)
Logic error: The function has inconsistent parameter handling. When len(parameters) == 1, it first checks if the parameter name exists in eval_input and calls the function with that specific value, but if not found, it falls back to calling with the entire eval_input dict. This creates inconsistent behavior where the same function could receive either a specific value or the entire dict depending on key names, potentially causing runtime errors in the mapping function.
Suggested change:

if len(parameters) == 1:
    parameter_name = next(iter(parameters.keys()))
    if parameter_name in eval_input:
        return mapping_function(eval_input[parameter_name])
    else:
        # Parameter name not found in eval_input, raise an error or handle consistently
        raise KeyError(f"Parameter '{parameter_name}' not found in eval_input")
else:
    return mapping_function(eval_input)
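The fail-fast variant suggested here can be sketched in isolation. The name `bind_strict` is illustrative, not the actual Phoenix helper:

```python
import inspect
from typing import Any, Callable, Mapping

def bind_strict(mapping_function: Callable[..., Any], eval_input: Mapping[str, Any]) -> Any:
    # Hypothetical sketch: when the single parameter's name is not a key in
    # eval_input, raise immediately instead of silently passing the whole dict.
    parameters = inspect.signature(mapping_function).parameters
    if len(parameters) == 1:
        parameter_name = next(iter(parameters))
        if parameter_name in eval_input:
            return mapping_function(eval_input[parameter_name])
        raise KeyError(f"Parameter '{parameter_name}' not found in eval_input")
    return mapping_function(eval_input)

assert bind_strict(lambda output: output.upper(), {"output": "hi"}) == "HI"
try:
    bind_strict(lambda score: score, {"output": "hi"})
except KeyError:
    pass  # surfaces the name mismatch at call time rather than inside the function
```

Raising here trades backward compatibility for an early, explicit error when a callable names a parameter that the eval input does not provide.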
Awesome!!!
@@ -7,6 +8,46 @@

# --- Input Map/Transform Helpers ---
def _bind_mapping_function(
Fancy! We should make sure to document this well since it's nonobvious behavior.
__call__ = evaluate
# ensure the callable inherits evaluate's docs for IDE support
__call__.__doc__ = evaluate.__doc__

def bind(self, input_mapping: InputMappingType) -> None:
Nice!
@@ -230,6 +239,14 @@ def describe(self) -> Dict[str, Any]:
        "input_schema": schema,
    }

def input_mapping_description(self) -> Dict[str, Any]:
This just returns a list of the input_mapping keys, yeah? How do you imagine people using this method? It seems not very useful to me.
oh, this replicates a method on the old BoundEvaluator object which was enforced by a unit test, I don't think it's super valuable personally
@@ -613,6 +632,9 @@ def _evaluate(self, eval_input: EvalInput) -> List[Score]:
    score = _convert_to_score(result, name, source, direction)
    return [score]

def __call__(self, *args: Any, **kwargs: Any) -> Any:
Very nice. We should make sure to document this behavior.
    evaluator = (
        create_evaluator(name=name)(value) if not isinstance(value, Evaluator) else value
    )
elif isinstance(obj, Mapping):
Updates `phoenix.client` to be compatible with the `phoenix.evals` changes.

Input mapping callables can now accept multiple arguments, and if the argument names overlap with top-level eval input keys, they will automatically be bound to the arguments. Furthermore, the standard "special" evaluator bindings ("input", "output", "expected", "metadata", "example") can be used both as arguments to mapping lambdas and accessed using the jq-like path specification for data.
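As a rough illustration of the described binding behavior: the following is a simplified sketch, not the actual `phoenix.client` implementation, and `call_with_bound_args` is a hypothetical name:

```python
import inspect
from typing import Any, Callable, Mapping

# The "special" top-level keys mentioned in the PR description.
SPECIAL_KEYS = ("input", "output", "expected", "metadata", "example")

def call_with_bound_args(fn: Callable[..., Any], eval_input: Mapping[str, Any]) -> Any:
    # If every parameter name matches a top-level eval input key
    # (e.g. one of the special keys above), bind them automatically...
    params = inspect.signature(fn).parameters
    if params and all(name in eval_input for name in params):
        return fn(**{name: eval_input[name] for name in params})
    # ...otherwise fall back to passing the whole eval input through.
    return fn(eval_input)

row = {"input": "2+2", "output": "4", "expected": "4", "metadata": {}}
# A mapping lambda can name the keys it wants as arguments:
assert call_with_bound_args(lambda output, expected: output == expected, row) is True
```

The actual client additionally supports the jq-like path specification for reaching into nested data; this sketch covers only the top-level name matching.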