Research References

Benchmarks & Datasets

Academic benchmarks give us shared points of comparison with the wider research community, while privately held datasets help us study how those findings might translate into real interviews. The resources below inform our modeling work but are not statements about current SchoolsAdmissions performance.

Academic benchmark suite

MATH Dataset

A 500-sample subset, stratified by difficulty, used to evaluate step-by-step mathematical reasoning over iterative feedback turns.
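The exact subset-selection procedure is not published; a minimal sketch of proportional stratified sampling, assuming each problem record carries a hypothetical "level" difficulty field, might look like this:

```python
import random
from collections import defaultdict

def stratified_subset(problems, k=500, key="level", seed=0):
    """Draw a k-item subset whose difficulty strata are proportional
    to their share of the full dataset (at least one item per stratum)."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for p in problems:
        by_level[p[key]].append(p)
    subset = []
    for level, items in sorted(by_level.items()):
        # proportional allocation per difficulty level
        n = max(1, round(k * len(items) / len(problems)))
        subset.extend(rng.sample(items, min(n, len(items))))
    return subset[:k]
```

Fixing the seed keeps the subset reproducible across evaluation runs.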

DepthQA

848 STEM-focused long-form question-answer pairs measuring factual accuracy and explanation quality.

MT-Bench

80 multi-turn instruction-following conversations covering eight task categories for holistic assessment.

MINT Reasoning Subset

316 tool-free reasoning tasks targeting logical inference, decision making, and commonsense reasoning.

Internal research inputs

Proprietary interview corpus

  • 7,361 human-to-human interviews collected between 2018 and 2022
  • 67 professional interviewers spanning 39 countries
  • Average 15-minute duration with 43.8 utterances per session
  • 440 conversations manually diarized for evaluation

Augmented training data

  • 9,208 pseudo-labelled dialogues from Switchboard and BlendedSkillTalk
  • Statistical heuristics simulate diarization errors for robustness
  • 84.4% inter-annotator agreement (κ score) on manual reviews
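The agreement calculation itself is not described in detail; a standard two-annotator Cohen's κ, which the quoted figure presumably reflects, can be sketched as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    chance-corrected agreement = (observed - expected) / (1 - expected)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items both annotators label identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent labelling, from marginal frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Unlike raw percent agreement, κ discounts the agreement two annotators would reach by chance alone.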

Evaluation protocols

  • Dynamic contamination sweeps paraphrase benchmark prompts to prevent leakage
  • Bootstrapped sampling (n ≥ 100) reproduces full-evaluation results (p = 0.809)
  • Ensemble judge configuration cross-checks GPT-4o, Claude 3.5 Sonnet, and Llama-3.1-70B
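The statistical test behind the quoted p-value is not specified; one plausible reading is a two-sided bootstrap test of whether the sampled subset's mean score is consistent with the full-evaluation mean. A minimal sketch under that assumption (function and parameter names are illustrative):

```python
import random

def bootstrap_pvalue(subset_scores, full_mean, n_boot=10_000, seed=0):
    """Two-sided bootstrap test: resample the subset with replacement and
    count how often the resampled mean deviates from the observed subset
    mean by at least as much as the full-evaluation mean does. A high
    p-value means the subset is consistent with the full evaluation."""
    rng = random.Random(seed)
    n = len(subset_scores)
    obs = sum(subset_scores) / n
    gap = abs(obs - full_mean)
    hits = 0
    for _ in range(n_boot):
        sample = [subset_scores[rng.randrange(n)] for _ in range(n)]
        if abs(sum(sample) / n - obs) >= gap:
            hits += 1
    return hits / n_boot
```

A p-value near 0.8, as reported, would indicate no detectable gap between subset and full-run results.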

Performance snapshot

Metric                        | Baseline | Published Result | Source
------------------------------|----------|------------------|-------------------
Speaker Diarization (F1)      | 72.7%    | 93.6%            | Wang et al., 2023
Topic Repetition Rate         | 30.0%    | 6.7%             | Wang et al., 2023
Interview Completion          | 13.3%    | 46.7%            | Wang et al., 2023
Off-Topic Responses           | 20.0%    | 10.0%            | Wang et al., 2023
User Satisfaction (1-5)       | 4.29     | 4.53             | Allbert et al., 2025
Conversational Quality (1-10) | 8.32     | 8.78             | Allbert et al., 2025

Metrics above summarize outcomes reported in the cited publications. We reference these figures when shaping our models, and they should not be interpreted as direct measurements of SchoolsAdmissions systems.

Next steps