Research References

Benchmarks & Datasets

Academic benchmarks give us shared points of comparison with the wider research community, while privately held datasets help us study how those findings might translate into real interviews. The resources below inform our modeling work but are not statements about current SchoolsAdmissions performance.

Academic benchmark suite

MATH Dataset

A 500-sample subset, stratified by difficulty, used to evaluate step-by-step mathematical reasoning over iterative feedback turns.
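The exact subset-selection procedure is not published; a minimal sketch of proportional stratified sampling, assuming each problem record carries a hypothetical "level" difficulty field, might look like this:

```python
import random
from collections import defaultdict

def stratified_subset(problems, k=500, key="level", seed=0):
    """Draw a k-item subset whose difficulty strata are proportional
    to their share of the full dataset (at least one item per stratum)."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for p in problems:
        by_level[p[key]].append(p)
    subset = []
    for level, items in sorted(by_level.items()):
        # proportional allocation per difficulty level
        n = max(1, round(k * len(items) / len(problems)))
        subset.extend(rng.sample(items, min(n, len(items))))
    return subset[:k]
```

Fixing the seed keeps the subset reproducible across evaluation runs.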

DepthQA

848 STEM-focused long-form question-answer pairs measuring factual accuracy and explanation quality.

MT-Bench

80 multi-turn instruction-following conversations covering eight task categories for holistic assessment.

MINT Reasoning Subset

316 tool-free reasoning tasks targeting logical inference, decision making, and commonsense reasoning.

Internal research inputs

Proprietary interview corpus

  • 7,361 human-to-human interviews collected between 2018 and 2022
  • 67 professional interviewers spanning 39 countries
  • Average 15-minute duration with 43.8 utterances per session
  • 440 conversations manually diarized for evaluation

Augmented training data

  • 9,208 pseudo-labelled dialogues from Switchboard and BlendedSkillTalk
  • Statistical heuristics simulate diarization errors for robustness
  • 84.4% inter-annotator agreement (κ score) on manual reviews
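The agreement calculation itself is not described in detail; a standard two-annotator Cohen's κ, which the quoted figure presumably reflects, can be sketched as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    chance-corrected agreement = (observed - expected) / (1 - expected)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items both annotators label identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent labelling, from marginal frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Unlike raw percent agreement, κ discounts the agreement two annotators would reach by chance alone.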

Evaluation protocols

  • Dynamic contamination sweeps paraphrase benchmark prompts to prevent leakage
  • Bootstrapped sampling (n ≥ 100) reproduces full-evaluation results (p = 0.809)
  • Ensemble judge configuration cross-checks GPT-4o, Claude 3.5 Sonnet, and Llama-3.1-70B
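The statistical test behind the quoted p-value is not specified; one plausible reading is a two-sided bootstrap test of whether the sampled subset's mean score is consistent with the full-evaluation mean. A minimal sketch under that assumption (function and parameter names are illustrative):

```python
import random

def bootstrap_pvalue(subset_scores, full_mean, n_boot=10_000, seed=0):
    """Two-sided bootstrap test: resample the subset with replacement and
    count how often the resampled mean deviates from the observed subset
    mean by at least as much as the full-evaluation mean does. A high
    p-value means the subset is consistent with the full evaluation."""
    rng = random.Random(seed)
    n = len(subset_scores)
    obs = sum(subset_scores) / n
    gap = abs(obs - full_mean)
    hits = 0
    for _ in range(n_boot):
        sample = [subset_scores[rng.randrange(n)] for _ in range(n)]
        if abs(sum(sample) / n - obs) >= gap:
            hits += 1
    return hits / n_boot
```

A p-value near 0.8, as reported, would indicate no detectable gap between subset and full-run results.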

Performance snapshot

Metric                        | Baseline | Published Result | Source
------------------------------|----------|------------------|-------------------
Speaker Diarization (F1)      | 72.7%    | 93.6%            | Wang et al., 2023
Topic Repetition Rate         | 30.0%    | 6.7%             | Wang et al., 2023
Interview Completion          | 13.3%    | 46.7%            | Wang et al., 2023
Off-Topic Responses           | 20.0%    | 10.0%            | Wang et al., 2023
User Satisfaction (1-5)       | 4.29     | 4.53             | Allbert et al., 2025
Conversational Quality (1-10) | 8.32     | 8.78             | Allbert et al., 2025

Metrics above summarize outcomes reported in the cited publications. We reference these figures when shaping our models, and they should not be interpreted as direct measurements of SchoolsAdmissions systems.

Next steps