Description
Hello, I'm currently experiencing the following issue with evaluators and their configuration in the azure-ai-evaluation Python package.
Summary
When using the evaluate() function with evaluators that support both conversation-based and individual inputs (e.g., FluencyEvaluator, RelevanceEvaluator), if the target function returns both conversation AND individual inputs like query/response, the evaluators fail with the error:
Cannot provide both 'conversation' and individual inputs at the same time.
This happens even when the evaluator_config column mapping explicitly specifies only one input type (e.g., only conversation).
Expected Behavior
The evaluator_config column mapping should filter which inputs are passed to evaluators, not just rename them. If I configure an evaluator to only use conversation, then query/response should not be passed to that evaluator.
Actual Behavior
The column mapping only renames input columns. All inputs from the target/data are still passed to evaluators. When an evaluator's _convert_kwargs_to_eval_input method receives both conversation and singleton inputs, it raises an exception.
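Concretely, with a mapping that names only conversation, I expected the evaluator to be called with just that argument, but it effectively receives everything the target and data produced. An illustrative sketch of the two call shapes (argument values taken from the reproduction below):
# What I expected, given the column mapping
fluency_evaluator(conversation={"messages": [...]})

# What effectively happens: unmapped columns are passed through as well
fluency_evaluator(
    conversation={"messages": [...]},
    query="test question",
    response="Response to: test question",
)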
Root Cause Analysis
The issue stems from the interaction of several components:
- BatchEngine._apply_column_mapping_to_lines() only maps the specified columns; it does not remove unmapped columns from the input dictionary.
- BatchEngine.__preprocess_inputs() checks whether the evaluator function has **kwargs. Since all built-in evaluators inherit from EvaluatorBase with a __call__(self, *args, **kwargs) signature, has_kwargs is always True, causing ALL inputs to be passed through:
  if has_kwargs:
      return inputs  # All inputs passed, nothing filtered
- EvaluatorBase._convert_kwargs_to_eval_input() explicitly checks for and rejects receiving both conversation and singleton inputs (see the direct-call sketch after this list):
  if conversation is not None and any(singletons.values()):
      raise EvaluationException(
          message=f"{type(self).__name__}: Cannot provide both 'conversation' and individual inputs..."
      )
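To see that last check in isolation: calling a built-in evaluator directly with both input shapes raises the same exception, independent of evaluate(). A minimal sketch, assuming a valid model_config:
from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config)

# Raises: FluencyEvaluator: Cannot provide both 'conversation' and individual inputs at the same time.
fluency(
    response="Response to: test question",
    conversation={
        "messages": [
            {"role": "user", "content": "test question"},
            {"role": "assistant", "content": "Response to: test question"},
        ]
    },
)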
Reproduction Steps
from azure.ai.evaluation import evaluate, FluencyEvaluator
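# model_config is assumed to be an AzureOpenAIModelConfiguration for the judge
# model, defined elsewhere (not shown in the original snippet).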
# Target that returns both conversation and query/response
def my_target(query: str) -> dict:
    response = f"Response to: {query}"
    return {
        "response": response,
        "conversation": {
            "messages": [
                {"role": "user", "content": query},
                {"role": "assistant", "content": response}
            ]
        }
    }

# Even though we only map 'conversation', the evaluator still receives 'response'
result = evaluate(
    data="test_data.jsonl",  # Contains {"query": "test question"}
    target=my_target,
    evaluators={
        "fluency": FluencyEvaluator(model_config)
    },
    evaluator_config={
        "fluency": {
            "conversation": "${target.conversation}"  # Only want conversation
        }
    }
)
# Raises: FluencyEvaluator: Cannot provide both 'conversation' and individual inputs at the same time.
Current Workaround
There are two workarounds: run conversation-based and individual-input evaluators in separate runs with discrete targets, or wrap each evaluator to filter kwargs before passing them to the underlying evaluator:
class ConversationMetricWrapper:
    def __init__(self, evaluator, use_conversation=True):
        self.evaluator = evaluator
        self.use_conversation = use_conversation

    def _filter_kwargs(self, kwargs):
        if self.use_conversation:
            return {"conversation": kwargs.get("conversation")}
        else:
            filtered = kwargs.copy()
            filtered.pop("conversation", None)
            return filtered

    def __call__(self, *args, **kwargs):
        return self.evaluator(*args, **self._filter_kwargs(kwargs))

This workaround also requires additional complexity to maintain isinstance() checks for Azure AI Foundry portal integration, which expects built-in evaluator class types for proper metric display and tracking.
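For reference, wiring the wrapper into a run looks roughly like this (illustrative; reuses my_target, model_config, and ConversationMetricWrapper from above):
result = evaluate(
    data="test_data.jsonl",
    target=my_target,
    evaluators={
        # Conversation-only view of the built-in evaluator
        "fluency": ConversationMetricWrapper(FluencyEvaluator(model_config)),
    },
)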
Suggested Fix
One or more of the following changes could address this:
- Option A: Filter in _apply_column_mapping_to_lines() - Only include columns that are explicitly mapped in the output dictionary. Unmapped columns from the source data should not be passed to evaluators (a rough sketch follows this list).
- Option B: Filter in __preprocess_inputs() - When evaluators have **kwargs, inspect the actual parameter annotations/overloads to determine which inputs are valid, rather than passing everything.
- Option C: Make _convert_kwargs_to_eval_input() more lenient - Instead of raising an error when both conversation and singletons are provided, prioritize one based on configuration or a class-level setting.
- Option D: Add a filter_inputs option to evaluator config - Allow users to explicitly specify which inputs should be filtered out:
  evaluator_config={
      "fluency": {
          "conversation": "${target.conversation}",
          "_filter_inputs": ["query", "response"]  # Explicitly filter these out
      }
  }
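A rough sketch of what Option A could look like, purely illustrative and assuming the method resolves a column-mapping dict against each input line (I have not checked the exact internal signature; _resolve_reference is a hypothetical helper):
def _apply_column_mapping_to_lines(lines, column_mapping):
    """Keep only the columns named in the mapping; drop everything else."""
    mapped_lines = []
    for line in lines:
        mapped = {}
        for target_name, source_expr in column_mapping.items():
            # Resolve "${target.conversation}" / "${data.query}" style references
            value = _resolve_reference(source_expr, line)  # hypothetical helper
            if value is not None:
                mapped[target_name] = value
        # Unmapped columns are intentionally dropped here
        mapped_lines.append(mapped)
    return mapped_lines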
Environment
- azure-ai-evaluation version: 1.13.7
- Python version: 3.12
- OS: macOS
Impact
This issue prevents using conversation-based and individual input-based evaluators in the same run without custom wrapper classes. The workaround also requires dynamic class creation to preserve Azure AI Foundry portal integration, since wrapped evaluators would otherwise fail the isinstance() checks used for metric tracking (a rough sketch follows).
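For context, the dynamic class creation mentioned above looks roughly like this (illustrative sketch, not SDK code; names are my own):
def make_wrapped_evaluator(evaluator_cls, model_config, use_conversation=True):
    """Create a subclass of the built-in evaluator so isinstance() checks still pass,
    while filtering kwargs down to conversation-only (or non-conversation) inputs."""

    def __call__(self, *args, **kwargs):
        if use_conversation:
            kwargs = {"conversation": kwargs.get("conversation")}
        else:
            kwargs.pop("conversation", None)
        return super(wrapped_cls, self).__call__(*args, **kwargs)

    wrapped_cls = type(
        f"Filtered{evaluator_cls.__name__}", (evaluator_cls,), {"__call__": __call__}
    )
    return wrapped_cls(model_config)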