Research Foundation · Kim et al., KAIST/CMU/Stanford (2025)

Dynamic Multi-Turn Evaluation Framework

Traditional single-turn assessments fail to capture adaptability, feedback incorporation, and conversational depth. We model our evaluation on the LLM-as-an-Interviewer paradigm so client solutions reflect the multi-turn rigor described in the research while mitigating benchmark-contamination risk.

The challenge we set out to solve

Why static testing falls short

Single-turn prompts inflate performance by ignoring how candidates respond to clarification and feedback.

Public benchmarks leak into model pre-training corpora, artificially boosting scores by up to 84 percentage points.

Binary pass/fail grading lacks the nuance hiring teams need for coaching and calibration.

What the research uncovered

Dynamic question modification neutralizes contamination while keeping difficulty constant.

Follow-up turns expose changes in reasoning quality that single prompts miss (baseline accuracy improves 19.2 points from @1 to @3).

Interview reports reveal behavioral patterns such as perseverance, clarification, and responsiveness.

How we translate the framework into modeling priorities

Question modification

Automatically paraphrases or parameterizes prompts to remove leaked answers and preserve difficulty. Supports role-specific terminology and scenario variants.
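As a minimal sketch of the parameterization idea, the snippet below fills a templated prompt with randomly chosen variants so a leaked answer no longer matches the surface form while the underlying difficulty stays constant. The template fields and value pools are illustrative assumptions, not the paper's.

```python
import random

def modify_question(template: str, pools: dict, seed: int = None) -> str:
    """Fill each {field} in `template` with a randomly chosen variant,
    producing a fresh surface form of the same underlying question."""
    rng = random.Random(seed)
    return template.format(**{field: rng.choice(options)
                              for field, options in pools.items()})

# Hypothetical role-specific template and scenario variants.
template = "Implement {structure} supporting {op} in O(log n) time."
pools = {
    "structure": ["a balanced BST", "a skip list", "a treap"],
    "op": ["insertion", "deletion", "range queries"],
}
question = modify_question(template, pools, seed=7)
```

Seeding the generator makes a given variant reproducible for calibration, while omitting the seed yields a new variant per candidate.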

Iterative feedback

Classifies error types and issues corrective hints that nudge candidates toward better reasoning without revealing answers outright.
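The feedback step can be sketched as a mapping from a classified error type to a corrective hint that nudges without revealing the answer. The error taxonomy and hint wording below are illustrative assumptions.

```python
# Hypothetical error taxonomy -> non-revealing corrective hints.
HINTS = {
    "off_by_one": "Re-check your loop bounds against the smallest input.",
    "wrong_complexity": "Your approach works, but can you avoid the nested scan?",
    "missed_edge_case": "What happens when the input is empty?",
}

def feedback_for(error_type: str) -> str:
    """Return a hint for a classified error, or a generic probe
    when the error type is unrecognized."""
    return HINTS.get(error_type,
                     "Walk me through your reasoning on the failing case.")
```

The generic fallback keeps the conversation moving even when classification fails, which matters in a live multi-turn interview.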

Follow-up engine

Generates clarification, rationale, elaboration, and modification questions to probe depth and adaptability in real time.
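A minimal sketch of the four follow-up types named above, assuming simple templated generation (the phrasings are hypothetical):

```python
# One template per follow-up type from the framework: clarification,
# rationale, elaboration, modification. Wording is an assumption.
FOLLOW_UP_TEMPLATES = {
    "clarification": "Can you restate what '{topic}' means in your solution?",
    "rationale": "Why did you choose {topic} over the alternatives?",
    "elaboration": "How would {topic} behave at 10x the input size?",
    "modification": "Suppose {topic} is no longer available; how would you adapt?",
}

def follow_up(kind: str, topic: str) -> str:
    """Instantiate a follow-up question of the given type about `topic`."""
    return FOLLOW_UP_TEMPLATES[kind].format(topic=topic)
```

In practice the `topic` would be extracted from the candidate's previous answer, so each follow-up probes the specific choice they just made.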

Interview report synthesis

Aggregates multi-turn metrics, error patterns, and qualitative notes into a structured report for hiring teams and candidates.
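The aggregation step can be sketched as folding per-turn scores, error tags, and notes into one structured summary. The field names and turn schema below are assumptions for illustration.

```python
from collections import Counter

def synthesize_report(turns: list) -> dict:
    """Aggregate multi-turn results into a structured summary for
    hiring teams: score trajectory, recurring errors, and notes."""
    scores = [t["score"] for t in turns]
    errors = Counter(tag for t in turns for tag in t.get("errors", []))
    return {
        "turns": len(turns),
        "final_score": scores[-1],
        "improvement": scores[-1] - scores[0],
        "common_errors": [tag for tag, _ in errors.most_common(3)],
        "notes": [t["note"] for t in turns if t.get("note")],
    }

# Hypothetical three-turn session mirroring the @1 -> @3 progression.
turns = [
    {"score": 0.44, "errors": ["missed_edge_case"], "note": "needed a nudge"},
    {"score": 0.58, "errors": []},
    {"score": 0.65, "errors": [], "note": "clean refinement"},
]
report = synthesize_report(turns)
```

Keeping the report structured (rather than free text) lets hiring teams calibrate across candidates and gives candidates concrete coaching targets.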

Reference metrics from Kim et al. (2025)

Metric                     Research baseline   Published dynamic result
Initial accuracy (@1)      44.4%               45.2%
After feedback (@2)        56.6%               58.1%
After refinement (@3)      63.6%               65.4%
Follow-up accuracy         93%                 94%
Interviewer satisfaction   3.5 / 5.0           3.7 / 5.0

Resources