Academic benchmarks give us shared points of comparison with the wider research community, while privately held datasets help us study how those findings might translate into real interviews. The resources below inform our modeling work but are not statements about current SchoolsAdmissions performance.
- A 500-sample subset, stratified by difficulty, for evaluating step-by-step mathematical reasoning across iterative feedback turns.
- 848 STEM-focused long-form question-answer pairs measuring factual accuracy and explanation quality.
- 80 multi-turn instruction-following conversations spanning eight task categories for holistic assessment.
- 316 tool-free reasoning tasks targeting logical inference, decision making, and common-sense evaluation.
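As a minimal sketch of the difficulty-stratified subsetting mentioned above (the function name, difficulty tags, and toy pool are our own illustration, not part of any cited benchmark), each difficulty level can be sampled in proportion to its share of the full dataset:

```python
import random
from collections import defaultdict

def stratified_subset(items, key, n_total, seed=0):
    """Draw a subset whose strata (e.g. difficulty levels) are
    represented proportionally to the full dataset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    subset = []
    for _level, members in sorted(strata.items()):
        # Allocate this stratum's share of the target size.
        k = round(n_total * len(members) / len(items))
        subset.extend(rng.sample(members, min(k, len(members))))
    return subset

# Hypothetical pool of 2,500 problems tagged with difficulty 1-5.
pool = [{"id": i, "difficulty": 1 + i % 5} for i in range(2500)]
sample = stratified_subset(pool, key=lambda p: p["difficulty"], n_total=500)
```

With a uniform pool like this one, each of the five difficulty levels contributes 100 of the 500 sampled items; skewed pools keep their proportions instead.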
| Metric | Baseline | Published Result | Source |
|---|---|---|---|
| Speaker Diarization (F1) | 72.7% | 93.6% | Wang et al., 2023 |
| Topic Repetition Rate | 30.0% | 6.7% | Wang et al., 2023 |
| Interview Completion | 13.3% | 46.7% | Wang et al., 2023 |
| Off-Topic Responses | 20.0% | 10.0% | Wang et al., 2023 |
| User Satisfaction (1-5) | 4.29 | 4.53 | Allbert et al., 2025 |
| Conversational Quality (1-10) | 8.32 | 8.78 | Allbert et al., 2025 |
The metrics above summarize outcomes reported in the cited publications. We reference these figures when shaping our models; they should not be interpreted as direct measurements of SchoolsAdmissions systems.
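For intuition about two of the tabulated metrics, here is a hedged sketch of how an evaluation harness might compute an F1 score and a topic repetition rate (the function names and toy turn data are our own assumptions, not definitions from the cited papers):

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall,
    computed from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def topic_repetition_rate(turn_topics):
    """Fraction of turns that revisit a topic already covered earlier."""
    seen, repeats = set(), 0
    for topic in turn_topics:
        if topic in seen:
            repeats += 1
        seen.add(topic)
    return repeats / len(turn_topics) if turn_topics else 0.0

# Toy interview: 10 turns, 3 of which revisit earlier topics.
turns = ["goals", "academics", "goals", "activities", "academics",
         "essays", "goals", "recommendations", "finances", "timeline"]
rate = topic_repetition_rate(turns)  # → 0.3, i.e. 30%
```

How a published system segments turns into topics or labels diarization errors varies by paper, so these helpers only illustrate the general shape of the calculations.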