Description
Hello, I'm currently experiencing the following issue with evaluators and their configuration in the azure-ai-evaluation Python package.
Summary
When using the evaluate() function with evaluators that support both conversation-based and individual inputs (e.g., FluencyEvaluator, RelevanceEvaluator), if the target function returns both conversation AND individual inputs like query/response, the evaluators fail with the error:
Cannot provide both 'conversation' and individual inputs at the same time.
This happens even when the evaluator_config column mapping explicitly specifies only one input type (e.g., only conversation).
Expected Behavior
The evaluator_config column mapping should filter which inputs are passed to evaluators, not just rename them. If I configure an evaluator to only use conversation, then query/response should not be passed to that evaluator.
Actual Behavior
The column mapping only renames input columns. All inputs from the target/data are still passed to evaluators. When an evaluator's _convert_kwargs_to_eval_input method receives both conversation and singleton inputs, it raises an exception.
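Concretely, with a mapping that names only conversation, I expected the evaluator to be called with just that argument, but it effectively receives everything the target and data produced. An illustrative sketch of the two call shapes (argument values taken from the reproduction below):
# What I expected, given the column mapping
fluency_evaluator(conversation={"messages": [...]})

# What effectively happens: unmapped columns are passed through as well
fluency_evaluator(
    conversation={"messages": [...]},
    query="test question",
    response="Response to: test question",
)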
Root Cause Analysis
The issue stems from the interaction of several components:
- BatchEngine._apply_column_mapping_to_lines() only maps the specified columns; it does not remove unmapped columns from the input dictionary.
- BatchEngine.__preprocess_inputs() checks whether the evaluator function has **kwargs. Since all built-in evaluators inherit from EvaluatorBase with a __call__(self, *args, **kwargs) signature, has_kwargs is always True, causing ALL inputs to be passed through:
  if has_kwargs:
      return inputs  # All inputs passed, nothing filtered
- EvaluatorBase._convert_kwargs_to_eval_input() explicitly checks for and rejects receiving both conversation and singleton inputs (see the direct-call sketch after this list):
  if conversation is not None and any(singletons.values()):
      raise EvaluationException(
          message=f"{type(self).__name__}: Cannot provide both 'conversation' and individual inputs..."
      )
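To see that last check in isolation: calling a built-in evaluator directly with both input shapes raises the same exception, independent of evaluate(). A minimal sketch, assuming a valid model_config:
from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config)

# Raises: FluencyEvaluator: Cannot provide both 'conversation' and individual inputs at the same time.
fluency(
    response="Response to: test question",
    conversation={
        "messages": [
            {"role": "user", "content": "test question"},
            {"role": "assistant", "content": "Response to: test question"},
        ]
    },
)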
Reproduction Steps
from azure.ai.evaluation import evaluate, FluencyEvaluator
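# model_config is assumed to be an AzureOpenAIModelConfiguration for the judge
# model, defined elsewhere (not shown in the original snippet).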
# Target that returns both conversation and query/response
def my_target(query: str) -> dict:
    response = f"Response to: {query}"
    return {
        "response": response,
        "conversation": {
            "messages": [
                {"role": "user", "content": query},
                {"role": "assistant", "content": response}
            ]
        }
    }

# Even though we only map 'conversation', the evaluator still receives 'response'
result = evaluate(
    data="test_data.jsonl",  # Contains {"query": "test question"}
    target=my_target,
    evaluators={
        "fluency": FluencyEvaluator(model_config)
    },
    evaluator_config={
        "fluency": {
            "conversation": "${target.conversation}"  # Only want conversation
        }
    }
)
# Raises: FluencyEvaluator: Cannot provide both 'conversation' and individual inputs at the same time.
Current Workaround
There are two workarounds: run conversation-based and individual-input evaluators in separate runs with discrete targets, or wrap each evaluator to filter kwargs before passing them to the underlying evaluator:
class ConversationMetricWrapper:
    def __init__(self, evaluator, use_conversation=True):
        self.evaluator = evaluator
        self.use_conversation = use_conversation

    def _filter_kwargs(self, kwargs):
        if self.use_conversation:
            return {"conversation": kwargs.get("conversation")}
        else:
            filtered = kwargs.copy()
            filtered.pop("conversation", None)
            return filtered

    def __call__(self, *args, **kwargs):
        return self.evaluator(*args, **self._filter_kwargs(kwargs))

This workaround also requires additional complexity to maintain isinstance() checks for Azure AI Foundry portal integration, which expects built-in evaluator class types for proper metric display and tracking.
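For reference, wiring the wrapper into a run looks roughly like this (illustrative; reuses my_target, model_config, and ConversationMetricWrapper from above):
result = evaluate(
    data="test_data.jsonl",
    target=my_target,
    evaluators={
        # Conversation-only view of the built-in evaluator
        "fluency": ConversationMetricWrapper(FluencyEvaluator(model_config)),
    },
)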
Suggested Fix
One or more of the following changes could address this:
- Option A: Filter in _apply_column_mapping_to_lines() - Only include columns that are explicitly mapped in the output dictionary. Unmapped columns from the source data should not be passed to evaluators (a rough sketch follows this list).
- Option B: Filter in __preprocess_inputs() - When evaluators have **kwargs, inspect the actual parameter annotations/overloads to determine which inputs are valid, rather than passing everything.
- Option C: Make _convert_kwargs_to_eval_input() more lenient - Instead of raising an error when both conversation and singletons are provided, prioritize one based on configuration or a class-level setting.
- Option D: Add a filter_inputs option to evaluator config - Allow users to explicitly specify which inputs should be filtered out:
  evaluator_config={
      "fluency": {
          "conversation": "${target.conversation}",
          "_filter_inputs": ["query", "response"]  # Explicitly filter these out
      }
  }
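A rough sketch of what Option A could look like, purely illustrative and assuming the method resolves a column-mapping dict against each input line (I have not checked the exact internal signature; _resolve_reference is a hypothetical helper):
def _apply_column_mapping_to_lines(lines, column_mapping):
    """Keep only the columns named in the mapping; drop everything else."""
    mapped_lines = []
    for line in lines:
        mapped = {}
        for target_name, source_expr in column_mapping.items():
            # Resolve "${target.conversation}" / "${data.query}" style references
            value = _resolve_reference(source_expr, line)  # hypothetical helper
            if value is not None:
                mapped[target_name] = value
        # Unmapped columns are intentionally dropped here
        mapped_lines.append(mapped)
    return mapped_lines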
Environment
- azure-ai-evaluation version: 1.13.7
- Python version: 3.12
- OS: macOS
Impact
This issue prevents using conversation-based and individual input-based evaluators in the same run without custom wrapper classes. The workaround also requires dynamic class creation to preserve Azure AI Foundry portal integration, since wrapped evaluators would otherwise fail the isinstance() checks used for metric tracking (a rough sketch follows).
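For context, the dynamic class creation mentioned above looks roughly like this (illustrative sketch, not SDK code; names are my own):
def make_wrapped_evaluator(evaluator_cls, model_config, use_conversation=True):
    """Create a subclass of the built-in evaluator so isinstance() checks still pass,
    while filtering kwargs down to conversation-only (or non-conversation) inputs."""

    def __call__(self, *args, **kwargs):
        if use_conversation:
            kwargs = {"conversation": kwargs.get("conversation")}
        else:
            kwargs.pop("conversation", None)
        return super(wrapped_cls, self).__call__(*args, **kwargs)

    wrapped_cls = type(
        f"Filtered{evaluator_cls.__name__}", (evaluator_cls,), {"__call__": __call__}
    )
    return wrapped_cls(model_config)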