Research direction

Speech, Language, and Agent Workflows

Building ASR + LLM + RAG pipelines and agent workflows for conversational analysis, structured extraction, and evidence-aware reasoning over long-form audio and transcripts.

ASR · Agents · RAG

Current thesis

Speech workflows become genuinely useful when they preserve grounding across audio, transcript, retrieval, and analyst judgment. The interesting work is not another chat wrapper around transcripts; it is a pipeline that keeps evidence aligned while moving toward decisions.

2026 perspective

The 2026 shift is away from short-clip demos and toward long-form, multi-speaker, evidence-carrying audio workflows. Speech systems now need to preserve what transcripts flatten.

Human factors lens

Real users do not want to babysit a black box over a 40-minute call. They need segment-level grounding, fast correction loops, and outputs that match the time pressure of analyst work.

Why this matters now

LongSpeech, released on January 20, 2026, makes the gap explicit by showing how quickly model quality drops once tasks involve long audio, multi-task reasoning, and sustained context. In 2025, WavRAG argued that transcript-only RAG throws away useful audio information; CORAL showed that real conversational retrieval is multi-turn and citation-sensitive; and recent EMNLP work on context discovery for ASR demonstrated that retrieval can improve rare-term recognition with lower latency than heavier LLM-only alternatives. Taken together, these signals shift the research question from recognition alone to end-to-end workflow design.

Current priorities

  • Preserve timestamps, speakers, and segment provenance through every transformation from raw audio to final claim.
  • Combine pre-ASR context discovery with post-ASR retrieval and evidence-aware reasoning instead of treating them as unrelated modules (a pipeline sketch follows this list).
  • Design for long-form, multi-turn, and domain-shifted speech rather than clean clip-level tasks.
  • Turn outputs into structured artifacts that analysts can inspect, correct, and hand off quickly.
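
To make the first two priorities concrete, here is a minimal sketch of a pipeline skeleton in which context discovery runs before ASR and retrieval runs after it, with segments carrying their timestamps and speaker labels throughout. Every name and stage signature here is an illustrative assumption, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Segment:
    """ASR output that keeps its provenance through later stages."""
    start_s: float   # segment start, seconds into the audio
    end_s: float     # segment end, seconds into the audio
    speaker: str     # diarization label, e.g. "SPK_02"
    text: str        # ASR hypothesis for this span

# Hypothetical stage signatures. The point is ordering: context discovery
# runs *before* ASR (to bias recognition toward rare terms) and retrieval
# runs *after* it, inside one pipeline rather than as unrelated modules.
DiscoverContext = Callable[[str], list[str]]            # audio_uri -> bias terms
Transcribe = Callable[[str, list[str]], list[Segment]]  # audio + bias -> segments
Retrieve = Callable[[list[Segment]], list[str]]         # segments -> evidence passages

def run_pipeline(
    audio_uri: str,
    discover: DiscoverContext,
    transcribe: Transcribe,
    retrieve: Retrieve,
) -> tuple[list[Segment], list[str]]:
    """Chain pre-ASR context discovery, biased ASR, and post-ASR retrieval."""
    bias_terms = discover(audio_uri)              # rare names, jargon, entities
    segments = transcribe(audio_uri, bias_terms)  # provenance-carrying segments
    passages = retrieve(segments)                 # evidence for downstream reasoning
    return segments, passages
```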

System design principles

Layered evidence flow

Keep links from raw audio to transcript spans to retrieved passages to final claims.
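
A minimal sketch of what such a chain could look like as data; every type and field name is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioSpan:
    audio_uri: str
    start_s: float
    end_s: float

@dataclass(frozen=True)
class TranscriptSpan:
    text: str
    audio: AudioSpan                   # transcript text keeps its audio span

@dataclass(frozen=True)
class Passage:
    passage_id: str
    spans: tuple[TranscriptSpan, ...]  # retrieval keeps its transcript origins

@dataclass(frozen=True)
class Claim:
    statement: str
    support: tuple[Passage, ...]       # the claim carries its full lineage

def jump_targets(claim: Claim) -> list[AudioSpan]:
    """Resolve a claim back to playable audio spans for reviewer verification."""
    return [span.audio for passage in claim.support for span in passage.spans]
```

The design choice is that each layer stores a reference to the layer below it, so no transformation can silently drop the path back to audio.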

Modality-aware retrieval

Use transcript retrieval where it is sufficient, but keep audio-aware retrieval for cases where prosody, speaker state, or transcription error matters.
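
A hedged sketch of one possible routing rule, assuming per-segment ASR confidence is available; the threshold and both retriever callables are placeholders:

```python
from typing import Callable

def retrieve(
    query: str,
    segment_confidence: float,
    transcript_search: Callable[[str], list[str]],
    audio_search: Callable[[str], list[str]],
    confidence_floor: float = 0.85,   # illustrative threshold, not tuned
) -> list[str]:
    """Route to transcript retrieval when ASR is trustworthy, and to
    audio-aware retrieval when transcription error (or prosody and
    speaker state) is likely to matter."""
    if segment_confidence >= confidence_floor:
        return transcript_search(query)   # cheap and usually sufficient
    return audio_search(query)            # keeps acoustic evidence in play
```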

Incremental review

Let humans verify segments, entities, and extracted fields locally instead of reviewing a large final answer only at the end.
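
As a sketch of local repair under assumed types, a single reviewer fix can be propagated mechanically instead of triggering a full re-read; the naive string replacement is an illustrative simplification:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Segment:
    segment_id: str
    speaker: str
    text: str

def propagate_name_fix(
    segments: list[Segment], wrong: str, right: str
) -> list[Segment]:
    """Apply one reviewer correction (e.g. a misrecognized name) to every
    downstream segment, so the fix is made once rather than re-checked
    in a large final answer."""
    return [
        replace(s, text=s.text.replace(wrong, right)) if wrong in s.text else s
        for s in segments
    ]
```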

Workflow-shaped outputs

Generate summaries, entities, timelines, and alerts in formats that downstream teams can act on or audit.
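
For instance, a timeline entry might be exported as a small JSON artifact; the field names and event types below are assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ExtractedEvent:
    """One timeline entry a downstream team can act on or audit."""
    when_s: float        # offset into the source audio, seconds
    speaker: str
    event_type: str      # e.g. "commitment", "deadline", "risk_flag"
    summary: str
    evidence_span: str   # quoted transcript text backing the entry

def export_timeline(events: list[ExtractedEvent]) -> str:
    """Emit a machine-checkable artifact rather than free-form prose."""
    return json.dumps([asdict(e) for e in events], indent=2)
```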

Human factors and review design

  • Timestamped evidence is more valuable than fluent paraphrase when a reviewer has to verify a claim quickly.
  • Speaker attribution and turn boundaries should survive retrieval and summarization, because many operational errors begin as attribution errors.
  • Correction loops should support local repair: fix a name, speaker, or time span once, then propagate that correction downstream.
  • Good speech interfaces respect attention. Reviewers need compact uncertainty cues and jump-to-audio links rather than long textual justifications.

Evaluation agenda

Recognition quality in context

Measure rare-term accuracy, speaker attribution, and robustness across noisy, accented, or domain-shifted speech.
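
A deliberately simple sketch of a rare-term recall check; real scoring would normalize inflection and tokenization, and rare_terms is assumed to come from a domain lexicon:

```python
def rare_term_recall(reference: str, hypothesis: str, rare_terms: set[str]) -> float:
    """Fraction of rare terms in the reference that the ASR hypothesis kept.
    Token-level and case-insensitive only; a crude but cheap signal."""
    ref_tokens = set(reference.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    targets = {t.lower() for t in rare_terms} & ref_tokens
    if not targets:
        return 1.0  # nothing rare to recognize in this utterance
    return len(targets & hyp_tokens) / len(targets)
```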

Grounded retrieval

Evaluate whether retrieved passages or audio segments actually support the generated claim, citation, or extraction.
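
As a placeholder for that check, a crude lexical-overlap proxy is sketched below; a serious evaluation would use an entailment model or human judgment instead:

```python
def supports(claim: str, passage: str, min_overlap: float = 0.5) -> bool:
    """Rough proxy for 'does this passage support this claim?'. It only
    flags passages that share too little content with the claim to
    plausibly ground it; the threshold is an untuned assumption."""
    claim_tokens = {t for t in claim.lower().split() if len(t) > 3}
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return False
    overlap = len(claim_tokens & passage_tokens) / len(claim_tokens)
    return overlap >= min_overlap
```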

Long-form continuity

Test performance across topic shifts, long meetings, and multi-turn conversations where memory and retrieval interact.
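
One hedged way to make continuity measurable is to bucket probe questions by where their evidence sits in the recording and watch for drift; the window size and input format are assumptions:

```python
def accuracy_by_window(
    probes: list[tuple[float, bool]],   # (evidence timestamp_s, answer_correct)
    window_s: float = 600.0,            # ten-minute buckets, illustrative
) -> dict[int, float]:
    """Group probe outcomes by position in the meeting, so quality drift
    across a long recording becomes visible rather than averaged away."""
    buckets: dict[int, list[bool]] = {}
    for t, correct in probes:
        buckets.setdefault(int(t // window_s), []).append(correct)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```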

Reviewer effort

Track correction time, number of manual fixes, and how often humans must return to raw audio.
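
A small sketch of aggregating those three signals from an assumed review event log; the event kinds are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewEvent:
    kind: str          # "edit", "audio_replay", or "accept" (assumed labels)
    duration_s: float

def effort_summary(events: list[ReviewEvent]) -> dict[str, float]:
    """Aggregate correction time, manual fixes, and returns to raw audio."""
    return {
        "correction_time_s": sum(e.duration_s for e in events if e.kind == "edit"),
        "manual_fixes": float(sum(1 for e in events if e.kind == "edit")),
        "audio_returns": float(sum(1 for e in events if e.kind == "audio_replay")),
    }
```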

Open questions

  • When is transcript-only RAG enough, and when do we need audio-aware retrieval?
  • How should long-form speech benchmarks capture speaker identity, discourse structure, and evidence quality together?
  • What is the right abstraction layer for analyst-facing outputs: transcript spans, structured events, or both?
  • How can ASR context discovery, retrieval, and downstream agent reasoning be evaluated as one pipeline instead of isolated stages?

Signals shaping this direction