Research Foundation · Allbert et al., micro1 (2025)

Optimizing the STT × LLM × TTS Stack

Voice-based interview agents succeed or fail at their weakest component. micro1's analysis of more than 300,000 interviews shows how transcription, reasoning, and speech synthesis interact; we study those findings to prioritize future modeling work and vendor evaluations.

Component comparisons

Speech-to-Text

Allbert et al. (2025) report that transcription errors compound through the rest of the stack, making STT the most critical component.

  Engine        Conversational quality   Technical quality   User satisfaction
  Google STT    8.66                     8.48                4.47
  Whisper       8.33                     8.32                4.29

Large Language Model

The study shows that LLM selection shapes conversational nuance, error recovery, and perceived empathy across the stacks measured.

  Model         Conversational quality   Technical quality   User satisfaction
  GPT-4.1       8.66                     8.48                4.53
  GPT-4o        8.33                     8.32                4.29
  Groq2         8.36                     8.40                4.30

Text-to-Speech

Published findings highlight how natural prosody and pacing boost user trust even when objective accuracy is held constant.

  Voice         Conversational quality   Technical quality   User satisfaction
  Cartesia      8.78                     8.57                4.53
  OpenAI        8.53                     8.39                4.41

Capabilities we prioritize

Speech-to-text fundamentals

  • High diarization quality to preserve who-said-what context.
  • Coverage for diverse accents and languages across client deployments.
  • Confidence scoring that can trigger clarification prompts when accuracy drops.
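The confidence-gated clarification in the last bullet can be sketched as follows. The `Transcript` shape, the 0.75 floor, and the prompt wording are illustrative assumptions, not a vendor API:

```python
# Sketch: gate the agent's next turn on STT confidence, assuming the vendor
# returns a per-utterance confidence in [0, 1]. Threshold is illustrative.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    confidence: float  # vendor-reported STT confidence, 0..1

CONFIDENCE_FLOOR = 0.75  # assumed threshold; tune per deployment

def next_agent_turn(t: Transcript) -> str:
    """Ask the candidate to repeat rather than reason over a shaky transcript."""
    if t.confidence < CONFIDENCE_FLOOR:
        return "Sorry, I didn't quite catch that. Could you repeat your last answer?"
    return f"ACK:{t.text}"  # placeholder for the normal LLM turn
```

In practice the floor would be tuned against diarization and accent-coverage metrics rather than fixed globally.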

Language model behavior

  • Support for long-context interviews without losing prior rationale.
  • Follow-up generation that mirrors the coaching loops described in research.
  • Controls for tone and safety so client interviews stay on brand.

Speech synthesis experience

  • Natural prosody with adjustable pacing for various interview formats.
  • Configurable personas to match client expectations.
  • Low latency to keep dialogue responsive.

Optimization process

Benchmark

Measure every component against standardized interview datasets and human transcripts before release.
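For the STT leg, the standard comparison against human transcripts is word error rate. A minimal sketch, using plain Levenshtein distance over whitespace-split words with no text normalization:

```python
# Minimal word-error-rate (WER) against a human reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# e.g. wer("the quick brown fox", "the quick fox") -> 0.25
```

Production benchmarking would also normalize casing, punctuation, and numerals before scoring.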

A/B Test

Deploy controlled experiments on production traffic to confirm measurable improvements in satisfaction and accuracy.
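One way to confirm a satisfaction improvement is a two-proportion z-test on "satisfied" sessions (e.g. rating at or above 4), one arm per stack variant. The counts below are illustrative, not figures from the study:

```python
# Two-proportion z-test for H0: p_a == p_b, using the pooled standard error.
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)       # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(800, 1000, 860, 1000)  # control stack vs candidate stack
significant = abs(z) > 1.96                 # ~5% two-sided level
```

Since satisfaction is reported on a 1-to-5 scale, a t-test on the raw ratings is an equally reasonable choice; the proportion test is simply easier to gate deployments on.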

Monitor

Track latency, transcription confidence, hallucination flags, and TTS quality in real-time telemetry dashboards.
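A minimal sketch of one such monitor, assuming a rolling window of recent turn latencies and an illustrative 800 ms p95 budget (neither value comes from the paper):

```python
# Rolling-window p95 latency monitor; window size and budget are assumptions.
from collections import deque

class LatencyMonitor:
    def __init__(self, window: int = 100, p95_budget_ms: float = 800.0):
        self.samples = deque(maxlen=window)  # oldest samples fall off
        self.budget = p95_budget_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]  # nearest-rank estimate

    def breached(self) -> bool:
        return bool(self.samples) and self.p95() > self.budget
```

The same window pattern extends naturally to transcription confidence and hallucination-flag rates.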

Update

Re-evaluate stacks quarterly or when new models ship, swapping in a component once it outperforms the incumbent by more than 2% on user satisfaction.
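The swap rule above can be made explicit. We read "more than 2%" as a relative margin on the satisfaction score, which is an assumption about the intended comparison:

```python
# Quarterly swap rule: replace the incumbent only when the challenger's
# satisfaction beats it by more than a relative margin (assumed 2%).
def should_swap(incumbent_sat: float, challenger_sat: float,
                margin: float = 0.02) -> bool:
    return challenger_sat > incumbent_sat * (1 + margin)
```

With the table's own numbers, GPT-4.1 at 4.53 clears GPT-4o's 4.29 by well over 2%, so it would qualify; a 4.30 challenger would not.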

Research-informed checkpoints

Risks surfaced by the literature

  • Component swaps can introduce latency spikes that frustrate candidates.
  • Accent mismatches compound downstream hallucinations.
  • Flat prosody reduces perceived empathy in interviewer agents.

How we apply the findings

  • Test stacks against the published benchmarks before piloting with clients.
  • Instrument deployments for latency, hallucination, and satisfaction telemetry.
  • Iterate with vendor roadmaps while keeping contingency paths ready.

Resources