
Conversation

@AveshCSingh AveshCSingh commented Nov 22, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18971/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18971/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18971/merge

What changes are proposed in this pull request?

This PR implements multi-turn evaluation for mlflow.genai.evaluate, enabling scorers to evaluate entire conversation sessions. Support is currently limited to:
(1) Evaluation on a DataFrame or list of traces; dataset support will be implemented in a follow-up PR.
(2) Evaluation using a static "answer sheet"; multi-turn evaluation with predict_fn will be implemented in a future milestone.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

The manual validations below can be re-run. First, set up the following:

export OPENAI_API_KEY='your-key-here'  # For judge tests
pip install litellm  # For judge tests

Manual validation: make_judge with real OpenAI LLM

import os
import mlflow
from mlflow.genai import make_judge

@mlflow.trace(span_type="CHAT_MODEL")
def model(question, session_id):
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    return f"Answer to: {question}"

mlflow.set_experiment("judge_test")
with mlflow.start_run() as run:
    for q in ["What is ML?", "How does it work?", "Example?"]:
        model(q, session_id="conv_1")

    traces = mlflow.search_traces(
        experiment_ids=[run.info.experiment_id],
        filter_string=f'run_id = "{run.info.run_id}"'
    )

    # Using {{ conversation }} makes it session-level automatically
    judge = make_judge(
        name="coherence",
        model="openai:/gpt-4o-mini",
        instructions="Evaluate conversation coherence: {{ conversation }}. Return True if coherent.",
        feedback_value_type=bool
    )

    print(f"Is session-level: {judge.is_session_level_scorer}")

    results = mlflow.genai.evaluate(data=traces, scorers=[judge])

    assessments = results.result_df["coherence/value"].notna().sum()
    print(f"Assessments: {assessments} (expected: 1)")

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

We'll add documentation in a follow-up.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Added multi-turn evaluation support to mlflow.genai.evaluate, enabling scorers to evaluate entire conversation sessions.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

This PR implements multi-turn evaluation capability for mlflow.genai.evaluate,
enabling evaluation of entire conversation sessions grouped by session_id.

Key changes:

1. Environment Variable (mlflow/environment_variables.py):
   - Added MLFLOW_ENABLE_MULTI_TURN_EVALUATION flag (default: False)
   - Feature-gated for safe rollout and testing (see the sketch after this list)

2. Validation Logic (mlflow/genai/evaluation/utils.py):
   - Added _validate_multi_turn_input() to validate multi-turn configuration
   - Checks: feature flag enabled, no predict_fn, DataFrame input required
   - Added FEATURE_DISABLED import for proper error handling

3. Multi-Turn Evaluation (mlflow/genai/evaluation/harness.py):
   - Added _evaluate_multi_turn_scorers() to evaluate session groups
   - Modified run() to classify scorers and handle multi-turn evaluation
   - Groups traces by session_id, evaluates on session groups
   - Logs assessments to chronologically first trace of each session
   - Adds session_id to assessment metadata

4. Integration (mlflow/genai/evaluation/base.py):
   - Added validation call in evaluate() function
   - Imports _validate_multi_turn_input

5. Tests (tests/genai/evaluate/test_utils.py):
   - Added 6 comprehensive validation tests
   - Tests feature flag, predict_fn rejection, DataFrame requirement
   - Tests mixed single-turn and multi-turn scorers
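
Below is a rough, standalone sketch of the flow described in items 1 and 3. The metadata key string comes from the manual-validation script above; the function names and everything else here are illustrative assumptions, not the actual harness.py implementation.

import os
from collections import defaultdict

# Item 1: the feature is gated behind an environment variable (default: False).
os.environ["MLFLOW_ENABLE_MULTI_TURN_EVALUATION"] = "true"

SESSION_KEY = "mlflow.trace.session"  # same metadata key set via mlflow.update_current_trace above

def group_by_session(traces):
    # Item 3: bucket traces by the session id stored in trace metadata; traces without one are skipped.
    groups = defaultdict(list)
    for trace in traces:
        if session_id := trace.info.trace_metadata.get(SESSION_KEY):
            groups[session_id].append(trace)
    return dict(groups)

def first_trace_in_session(session_traces):
    # Item 3: assessments are logged to the chronologically first trace of each session.
    return min(session_traces, key=lambda t: t.info.request_time)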

Implementation follows the multi-turn evaluation plan (PR mlflow#3 + PR mlflow#4 combined).
All tests passing (60 passed, 3 skipped).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Avesh Singh <[email protected]>
@github-actions

@AveshCSingh Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

github-actions bot commented Nov 22, 2025

Documentation preview for 6051ef5 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

AveshCSingh and others added 3 commits November 24, 2025 20:15
This commit addresses several TODO items to improve the multi-turn evaluation
implementation:

- Remove leading underscores from exported utility functions
  - Renamed _classify_scorers -> classify_scorers
  - Renamed _group_traces_by_session -> group_traces_by_session
  - Renamed _get_first_trace_in_session -> get_first_trace_in_session

- Optimize trace retrieval by avoiding redundant get_trace call
  - Find matching eval_result from existing list instead of fetching trace

- Replace hardcoded "session_id" string with TraceMetadataKey.TRACE_SESSION constant
  - Improves maintainability and consistency with other metadata keys

- Rename validation function for clarity
  - Renamed _validate_multi_turn_input -> _validate_session_level_input
  - Updated terminology from "multi_turn" to "session_level" for consistency

- Remove unused data parameter from validation function
  - Simplified function signature by removing parameter that was never used

All tests pass successfully after these changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Avesh Singh <[email protected]>
@AveshCSingh AveshCSingh force-pushed the multi-turn-mlflow-genai-eval branch from f72d951 to 3f6d6e3 Compare November 24, 2025 22:16
@github-actions github-actions bot added the area/evaluation (MLflow Evaluation) and rn/feature (Mention under Features in Changelogs) labels Nov 24, 2025
@AveshCSingh AveshCSingh force-pushed the multi-turn-mlflow-genai-eval branch from 40d9dca to 70c5058 Compare November 24, 2025 23:09
@AveshCSingh AveshCSingh changed the title [wip] Support multi-turn evaluation in mlflow.genai.evaluate [wip] Support multi-turn evaluation in mlflow.genai.evaluate (trace input only) Nov 25, 2025
@AveshCSingh AveshCSingh changed the title [wip] Support multi-turn evaluation in mlflow.genai.evaluate (trace input only) Support multi-turn evaluation in mlflow.genai.evaluate (trace input only) Nov 25, 2025
Signed-off-by: Avesh Singh <[email protected]>
@AveshCSingh AveshCSingh force-pushed the multi-turn-mlflow-genai-eval branch from 0a10fa2 to dbffe7e Compare November 25, 2025 03:16
Comment on lines 19 to 68
def classify_scorers(scorers: list[Scorer]) -> tuple[list[Scorer], list[Scorer]]:
    """
    Separate scorers into single-turn and multi-turn categories.

    Args:
        scorers: List of scorer instances.

    Returns:
        tuple: (single_turn_scorers, multi_turn_scorers)
    """
    single_turn_scorers = []
    multi_turn_scorers = []

    for scorer in scorers:
        if scorer.is_session_level_scorer:
            multi_turn_scorers.append(scorer)
        else:
            single_turn_scorers.append(scorer)

    return single_turn_scorers, multi_turn_scorers


def group_traces_by_session(eval_items: list["EvalItem"]) -> dict[str, list["EvalItem"]]:
    """
    Group evaluation items containing traces by session_id.

    Args:
        eval_items: List of EvalItem objects.

    Returns:
        dict: {session_id: [eval_item, ...]} where eval items are grouped by session.
            Only items with traces that have a session_id are included in the output.
    """
    session_groups = defaultdict(list)

    for idx, item in enumerate(eval_items):
        if not hasattr(item, "trace") or item.trace is None:
            continue

        trace_metadata = item.trace.info.trace_metadata

        if session_id := trace_metadata.get(TraceMetadataKey.TRACE_SESSION):
            session_groups[session_id].append(item)

    return dict(session_groups)


def get_first_trace_in_session(session_items: list["EvalItem"]) -> "EvalItem":
    """
    Find the chronologically first trace in a session based on request_time.
Collaborator Author

These 3 methods were copied over from utils.py.

Comment on lines +56 to +58
# Copy trace metadata
if trace_metadata := info.get("trace_metadata"):
    trace.info.trace_metadata.update(trace_metadata)
Collaborator Author

We previously did not copy trace metadata when copying traces into EvalItems.
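Concretely, group_traces_by_session (shown above) reads the session id from trace.info.trace_metadata and skips items without one, so before this fix an EvalItem built from a copied trace would carry no session id and be silently excluded from session grouping.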

Comment on lines 23 to 254
class _MultiTurnTestScorer:
    """Helper class for testing multi-turn scorers."""

    def __init__(self, name="test_multi_turn_scorer"):
        self.name = name
        self.is_session_level_scorer = True
        self.aggregations = []

    def run(self, session=None, **kwargs):
        return True

    def __call__(self, traces=None, **kwargs):
        return 1.0


# ==================== Tests for classify_scorers ====================


def test_classify_scorers_all_single_turn():
    """Test that all scorers are classified as single-turn when none are multi-turn."""

    @scorer
    def custom_scorer1(outputs):
        return 1.0

    @scorer
    def custom_scorer2(outputs):
        return 2.0

    scorers_list = [custom_scorer1, custom_scorer2]
    single_turn, multi_turn = classify_scorers(scorers_list)

    assert len(single_turn) == 2
    assert len(multi_turn) == 0
    assert single_turn == scorers_list


def test_classify_scorers_all_multi_turn():
    """Test that all scorers are classified as multi-turn.

    When all scorers have is_session_level_scorer=True.
    """
    multi_turn_scorer1 = _MultiTurnTestScorer(name="multi_turn_scorer1")
    multi_turn_scorer2 = _MultiTurnTestScorer(name="multi_turn_scorer2")

    scorers_list = [multi_turn_scorer1, multi_turn_scorer2]
    single_turn, multi_turn = classify_scorers(scorers_list)

    assert len(single_turn) == 0
    assert len(multi_turn) == 2
    assert multi_turn == scorers_list
    # Verify they are actually multi-turn
    assert multi_turn_scorer1.is_session_level_scorer is True
    assert multi_turn_scorer2.is_session_level_scorer is True


def test_classify_scorers_mixed():
    """Test classification of mixed single-turn and multi-turn scorers."""

    @scorer
    def single_turn_scorer(outputs):
        return 1.0

    multi_turn_scorer = _MultiTurnTestScorer(name="multi_turn_scorer")

    scorers_list = [single_turn_scorer, multi_turn_scorer]
    single_turn, multi_turn = classify_scorers(scorers_list)

    assert len(single_turn) == 1
    assert len(multi_turn) == 1
    assert single_turn[0] == single_turn_scorer
    assert multi_turn[0] == multi_turn_scorer
    # Verify properties
    assert single_turn_scorer.is_session_level_scorer is False
    assert multi_turn_scorer.is_session_level_scorer is True


def test_classify_scorers_empty_list():
    """Test classification of an empty list of scorers."""
    single_turn, multi_turn = classify_scorers([])

    assert len(single_turn) == 0
    assert len(multi_turn) == 0


# ==================== Tests for group_traces_by_session ====================


def _create_mock_trace(trace_id: str, session_id: str | None, request_time: int):
    """Helper to create a mock trace with session_id and request_time."""
    trace_metadata = {}
    if session_id is not None:
        trace_metadata[TraceMetadataKey.TRACE_SESSION] = session_id

    trace_info = TraceInfo(
        trace_id=trace_id,
        trace_location=TraceLocation.from_experiment_id("0"),
        request_time=request_time,
        execution_duration=1000,
        state=TraceState.OK,
        trace_metadata=trace_metadata,
        tags={},
    )

    trace = Mock(spec=Trace)
    trace.info = trace_info
    trace.data = TraceData(spans=[])
    return trace


def _create_mock_eval_item(trace):
    """Helper to create a mock EvalItem with a trace."""
    eval_item = Mock(spec=EvalItem)
    eval_item.trace = trace
    return eval_item


def test_group_traces_by_session_single_session():
    """Test grouping traces that all belong to a single session."""
    trace1 = _create_mock_trace("trace-1", "session-1", 1000)
    trace2 = _create_mock_trace("trace-2", "session-1", 2000)
    trace3 = _create_mock_trace("trace-3", "session-1", 3000)

    eval_item1 = _create_mock_eval_item(trace1)
    eval_item2 = _create_mock_eval_item(trace2)
    eval_item3 = _create_mock_eval_item(trace3)

    eval_items = [eval_item1, eval_item2, eval_item3]
    session_groups = group_traces_by_session(eval_items)

    assert len(session_groups) == 1
    assert "session-1" in session_groups
    assert len(session_groups["session-1"]) == 3

    # Check that all traces are included
    session_traces = [item.trace for item in session_groups["session-1"]]
    assert trace1 in session_traces
    assert trace2 in session_traces
    assert trace3 in session_traces


def test_group_traces_by_session_multiple_sessions():
    """Test grouping traces that belong to different sessions."""
    trace1 = _create_mock_trace("trace-1", "session-1", 1000)
    trace2 = _create_mock_trace("trace-2", "session-1", 2000)
    trace3 = _create_mock_trace("trace-3", "session-2", 1500)
    trace4 = _create_mock_trace("trace-4", "session-2", 2500)

    eval_items = [
        _create_mock_eval_item(trace1),
        _create_mock_eval_item(trace2),
        _create_mock_eval_item(trace3),
        _create_mock_eval_item(trace4),
    ]

    session_groups = group_traces_by_session(eval_items)

    assert len(session_groups) == 2
    assert "session-1" in session_groups
    assert "session-2" in session_groups
    assert len(session_groups["session-1"]) == 2
    assert len(session_groups["session-2"]) == 2


def test_group_traces_by_session_excludes_no_session_id():
    """Test that traces without session_id are excluded from grouping."""
    trace1 = _create_mock_trace("trace-1", "session-1", 1000)
    trace2 = _create_mock_trace("trace-2", None, 2000)  # No session_id
    trace3 = _create_mock_trace("trace-3", "session-1", 3000)

    eval_items = [
        _create_mock_eval_item(trace1),
        _create_mock_eval_item(trace2),
        _create_mock_eval_item(trace3),
    ]

    session_groups = group_traces_by_session(eval_items)

    assert len(session_groups) == 1
    assert "session-1" in session_groups
    assert len(session_groups["session-1"]) == 2
    # trace2 should not be included
    session_traces = [item.trace for item in session_groups["session-1"]]
    assert trace1 in session_traces
    assert trace2 not in session_traces
    assert trace3 in session_traces


def test_group_traces_by_session_excludes_none_traces():
    """Test that eval items without traces are excluded from grouping."""
    trace1 = _create_mock_trace("trace-1", "session-1", 1000)

    eval_item1 = _create_mock_eval_item(trace1)
    eval_item2 = Mock()
    eval_item2.trace = None  # No trace

    eval_items = [eval_item1, eval_item2]
    session_groups = group_traces_by_session(eval_items)

    assert len(session_groups) == 1
    assert "session-1" in session_groups
    assert len(session_groups["session-1"]) == 1


def test_group_traces_by_session_empty_list():
    """Test grouping an empty list of eval items."""
    session_groups = group_traces_by_session([])

    assert len(session_groups) == 0
    assert session_groups == {}


# ==================== Tests for get_first_trace_in_session ====================


def test_get_first_trace_in_session_chronological_order():
    """Test that the first trace is correctly identified by request_time."""
    trace1 = _create_mock_trace("trace-1", "session-1", 3000)
    trace2 = _create_mock_trace("trace-2", "session-1", 1000)  # Earliest
    trace3 = _create_mock_trace("trace-3", "session-1", 2000)

    eval_item1 = _create_mock_eval_item(trace1)
    eval_item2 = _create_mock_eval_item(trace2)
    eval_item3 = _create_mock_eval_item(trace3)

    session_items = [eval_item1, eval_item2, eval_item3]

    first_item = get_first_trace_in_session(session_items)

    assert first_item.trace == trace2
    assert first_item == eval_item2

Collaborator Author

These tests as well as the two below (test_get_first_trace_in_session_single_trace and test_get_first_trace_in_session_same_timestamp) were moved from test_utils.py.

@AveshCSingh AveshCSingh changed the title Support multi-turn evaluation in mlflow.genai.evaluate (trace input only) Support multi-turn evaluation in mlflow.genai.evaluate for DataFrame and list input Nov 25, 2025

multi_turn_assessments = {}

for session_id, session_items in session_groups.items():
Collaborator

Feel like we can also run this in parallel? Is this planned?

Collaborator

I'm also curious whether it's possible to run single-turn and multi-turn scorers in parallel with each other, in addition to running each multi-turn scorer in parallel.

Collaborator Author

Yes, we should be able to run multi-turn scorers within the same thread pool in harness.py that single-turn scorers are executed in. This will also fix the progress bar.

Following the example of single-turn scorers, we can also parallelize the evaluation of multi-turn scorers across sessions.

I'm planning to implement this in a follow-up PR, since this one is already growing quite large.
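
For illustration, a rough sketch of what per-session parallelism could look like. The function name, argument names, and worker count here are assumptions rather than the harness.py implementation; the run(session=...) signature follows the scorer changes in this PR.

from concurrent.futures import ThreadPoolExecutor

def evaluate_multi_turn_in_parallel(session_groups, multi_turn_scorers, max_workers=8):
    # session_groups maps session_id -> [EvalItem, ...]; each scorer sees the whole session at once.
    def run_one(session_id, items, scorer):
        traces = [item.trace for item in items]
        return session_id, scorer.name, scorer.run(session=traces)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_one, session_id, items, scorer)
            for session_id, items in session_groups.items()
            for scorer in multi_turn_scorers
        ]
        return [future.result() for future in futures]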

Copilot AI left a comment

Pull request overview

This PR implements multi-turn evaluation support for mlflow.genai.evaluate, enabling scorers to evaluate entire conversation sessions. The implementation focuses on DataFrame and list input evaluation, with dataset support and predict_fn integration planned for future releases.

Key Changes:

  • Added session-level scorer classification and evaluation logic to group traces by session_id
  • Introduced trace metadata copying functionality to preserve session information
  • Created a feature flag (MLFLOW_ENABLE_MULTI_TURN_EVALUATION) to gate the new functionality

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
  • tests/tracing/utils/test_copy.py: New test file covering trace copying with metadata preservation
  • tests/genai/evaluate/test_session_utils.py: New comprehensive test suite for session-level evaluation utilities
  • tests/genai/evaluate/test_utils.py: Refactored to move multi-turn tests to dedicated session_utils test file
  • mlflow/tracing/utils/copy.py: Enhanced to copy trace metadata in addition to tags
  • mlflow/genai/scorers/base.py: Extended run() method signature to accept session parameter
  • mlflow/genai/evaluation/session_utils.py: New module containing all session-level evaluation logic and validation
  • mlflow/genai/evaluation/utils.py: Removed multi-turn helper functions (relocated to session_utils)
  • mlflow/genai/evaluation/harness.py: Integrated multi-turn scorer evaluation into the main evaluation flow
  • mlflow/genai/evaluation/base.py: Added validation for session-level evaluation inputs
  • mlflow/environment_variables.py: Added new feature flag for multi-turn evaluation


@smoorjani smoorjani left a comment

Mostly LGTM! Conditionally stamping to unblock, just assuming all comments are addressed.

)

# Check if data is a DataFrame-like object (has 'columns' attribute)
if not hasattr(data, "columns"):
Collaborator

Should we also raise when we receive a list[dict] type?

Collaborator Author

I've removed this code, since we'll be supporting other evaluation types in a follow-up PR.

if eval_result.eval_item.trace is not None:
    trace_id = eval_result.eval_item.trace.info.trace_id
    try:
        eval_result.eval_item.trace = mlflow.get_trace(trace_id)
Collaborator

This becomes slow once the eval dataset is big.

Collaborator

Addressed with parallelization. We're replacing another get_trace call with this so it should ideally be net-zero.
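
For reference, a minimal sketch of the kind of parallel fetch being described; the helper name and worker count are placeholders, not the merged code.

from concurrent.futures import ThreadPoolExecutor

import mlflow

def refresh_traces_in_parallel(eval_results, max_workers=8):
    # Re-fetch each trace concurrently instead of issuing one blocking get_trace call at a time.
    items = [result.eval_item for result in eval_results if result.eval_item.trace is not None]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fresh_traces = list(pool.map(lambda item: mlflow.get_trace(item.trace.info.trace_id), items))
    for item, trace in zip(items, fresh_traces):
        item.trace = trace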

@serena-ruan serena-ruan left a comment

Overall LGTM! Left two comments about performance concerns.

@smoorjani smoorjani added this pull request to the merge queue Nov 26, 2025
Merged via the queue into mlflow:master with commit 46e3cb6 Nov 26, 2025
52 checks passed

Labels

area/evaluation (MLflow Evaluation), rn/feature (Mention under Features in Changelogs), v3.7.0
