Research Foundation · Allbert et al., micro1 (2025)

Optimizing the STT × LLM × TTS Stack

Voice-based interview agents succeed or fail at their weakest component. micro1's analysis of more than 300,000 interviews shows how transcription, reasoning, and speech synthesis interact; we study those findings to prioritize future modeling work and vendor evaluations.

Component comparisons

Speech-to-Text

Allbert et al. (2025) report that transcription errors compound through the rest of the stack, making STT the most critical component.

  Engine        Conversational quality   Technical quality   User satisfaction
  Google STT    8.66                     8.48                4.47
  Whisper       8.33                     8.32                4.29

Large Language Model

The study shows that LLM selection shapes conversational nuance, error recovery, and perceived empathy across the stacks measured.

  Model         Conversational quality   Technical quality   User satisfaction
  GPT-4.1       8.66                     8.48                4.53
  GPT-4o        8.33                     8.32                4.29
  Groq2         8.36                     8.40                4.30

Text-to-Speech

Published findings highlight how natural prosody and pacing boost user trust even when objective accuracy is held constant.

  Voice         Conversational quality   Technical quality   User satisfaction
  Cartesia      8.78                     8.57                4.53
  OpenAI        8.53                     8.39                4.41

Capabilities we prioritize

Speech-to-text fundamentals

  • High diarization quality to preserve who-said-what context.
  • Coverage for diverse accents and languages across client deployments.
  • Confidence scoring that can trigger clarification prompts when accuracy drops.
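The confidence-gated clarification in the last bullet can be sketched as follows. The `Transcript` shape, the 0.75 floor, and the prompt wording are illustrative assumptions, not a vendor API:

```python
# Sketch: gate the agent's next turn on STT confidence, assuming the vendor
# returns a per-utterance confidence in [0, 1]. Threshold is illustrative.
from dataclasses import dataclass

@dataclass
class Transcript:
    text: str
    confidence: float  # vendor-reported STT confidence, 0..1

CONFIDENCE_FLOOR = 0.75  # assumed threshold; tune per deployment

def next_agent_turn(t: Transcript) -> str:
    """Ask the candidate to repeat rather than reason over a shaky transcript."""
    if t.confidence < CONFIDENCE_FLOOR:
        return "Sorry, I didn't quite catch that. Could you repeat your last answer?"
    return f"ACK:{t.text}"  # placeholder for the normal LLM turn
```

In practice the floor would be tuned against diarization and accent-coverage metrics rather than fixed globally.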

Language model behavior

  • Support for long-context interviews without losing prior rationale.
  • Follow-up generation that mirrors the coaching loops described in research.
  • Controls for tone and safety so client interviews stay on brand.

Speech synthesis experience

  • Natural prosody with adjustable pacing for various interview formats.
  • Configurable personas to match client expectations.
  • Low latency to keep dialogue responsive.

Optimization process

Benchmark

Measure every component against standardized interview datasets and human transcripts before release.
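For the STT leg, the standard comparison against human transcripts is word error rate. A minimal sketch, using plain Levenshtein distance over whitespace-split words with no text normalization:

```python
# Minimal word-error-rate (WER) against a human reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# e.g. wer("the quick brown fox", "the quick fox") -> 0.25
```

Production benchmarking would also normalize casing, punctuation, and numerals before scoring.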

A/B Test

Deploy controlled experiments on production traffic to confirm measurable improvements in satisfaction and accuracy.
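One way to confirm a satisfaction improvement is a two-proportion z-test on "satisfied" sessions (e.g. rating at or above 4), one arm per stack variant. The counts below are illustrative, not figures from the study:

```python
# Two-proportion z-test for H0: p_a == p_b, using the pooled standard error.
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)       # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(800, 1000, 860, 1000)  # control stack vs candidate stack
significant = abs(z) > 1.96                 # ~5% two-sided level
```

Since satisfaction is reported on a 1-to-5 scale, a t-test on the raw ratings is an equally reasonable choice; the proportion test is simply easier to gate deployments on.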

Monitor

Track latency, transcription confidence, hallucination flags, and TTS quality in real-time telemetry dashboards.
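A minimal sketch of one such monitor, assuming a rolling window of recent turn latencies and an illustrative 800 ms p95 budget (neither value comes from the paper):

```python
# Rolling-window p95 latency monitor; window size and budget are assumptions.
from collections import deque

class LatencyMonitor:
    def __init__(self, window: int = 100, p95_budget_ms: float = 800.0):
        self.samples = deque(maxlen=window)  # oldest samples fall off
        self.budget = p95_budget_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.95 * (len(ordered) - 1))]  # nearest-rank estimate

    def breached(self) -> bool:
        return bool(self.samples) and self.p95() > self.budget
```

The same window pattern extends naturally to transcription confidence and hallucination-flag rates.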

Update

Re-evaluate stacks quarterly or when new models ship, swapping in a component once it outperforms the incumbent by more than 2% on user satisfaction.
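The swap rule above can be made explicit. We read "more than 2%" as a relative margin on the satisfaction score, which is an assumption about the intended comparison:

```python
# Quarterly swap rule: replace the incumbent only when the challenger's
# satisfaction beats it by more than a relative margin (assumed 2%).
def should_swap(incumbent_sat: float, challenger_sat: float,
                margin: float = 0.02) -> bool:
    return challenger_sat > incumbent_sat * (1 + margin)
```

With the table's own numbers, GPT-4.1 at 4.53 clears GPT-4o's 4.29 by well over 2%, so it would qualify; a 4.30 challenger would not.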

Research-informed checkpoints

Risks surfaced by the literature

  • Component swaps can introduce latency spikes that frustrate candidates.
  • Accent mismatches compound downstream hallucinations.
  • Flat prosody reduces perceived empathy in interviewer agents.

How we apply the findings

  • Test stacks against the published benchmarks before piloting with clients.
  • Instrument deployments for latency, hallucination, and satisfaction telemetry.
  • Iterate with vendor roadmaps while keeping contingency paths ready.

Resources