Research direction

Speech, Language, and Agent Workflows

Building ASR + LLM + RAG pipelines and agent workflows for conversational analysis, structured extraction, and evidence-aware reasoning over long-form audio and transcripts.

ASR · Agents · RAG

Current thesis

Speech workflows become genuinely useful when they preserve grounding across audio, transcript, retrieval, and analyst judgment. The interesting work is not another chat wrapper around transcripts; it is a pipeline that keeps evidence aligned while moving toward decisions.

2026 perspective

The 2026 shift is away from short-clip demos and toward long-form, multi-speaker, evidence-carrying audio workflows. Speech systems now need to preserve what transcripts flatten.

Human factors lens

Real users do not want to babysit a black box over a 40-minute call. They need segment-level grounding, fast correction loops, and outputs that match the time pressure of analyst work.

Why this matters now

LongSpeech, released on January 20, 2026, makes the gap explicit by showing how quickly model quality drops once tasks involve long audio, multi-task reasoning, and sustained context. In 2025, WavRAG argued that transcript-only RAG throws away useful audio information; CORAL showed that real conversational retrieval is multi-turn and citation-sensitive; and recent EMNLP work on context discovery for ASR demonstrated that retrieval can improve rare-term recognition with lower latency than heavier LLM-only alternatives. Taken together, these signals shift the research question from recognition alone to end-to-end workflow design.

Current priorities

  • Preserve timestamps, speakers, and segment provenance through every transformation from raw audio to final claim.
  • Combine pre-ASR context discovery with post-ASR retrieval and evidence-aware reasoning instead of treating them as unrelated modules (a pipeline sketch follows this list).
  • Design for long-form, multi-turn, and domain-shifted speech rather than clean clip-level tasks.
  • Turn outputs into structured artifacts that analysts can inspect, correct, and hand off quickly.
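
To make the first two priorities concrete, here is a minimal sketch of a pipeline skeleton in which context discovery runs before ASR and retrieval runs after it, with segments carrying their timestamps and speaker labels throughout. Every name and stage signature here is an illustrative assumption, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Segment:
    """ASR output that keeps its provenance through later stages."""
    start_s: float   # segment start, seconds into the audio
    end_s: float     # segment end, seconds into the audio
    speaker: str     # diarization label, e.g. "SPK_02"
    text: str        # ASR hypothesis for this span

# Hypothetical stage signatures. The point is ordering: context discovery
# runs *before* ASR (to bias recognition toward rare terms) and retrieval
# runs *after* it, inside one pipeline rather than as unrelated modules.
DiscoverContext = Callable[[str], list[str]]            # audio_uri -> bias terms
Transcribe = Callable[[str, list[str]], list[Segment]]  # audio + bias -> segments
Retrieve = Callable[[list[Segment]], list[str]]         # segments -> evidence passages

def run_pipeline(
    audio_uri: str,
    discover: DiscoverContext,
    transcribe: Transcribe,
    retrieve: Retrieve,
) -> tuple[list[Segment], list[str]]:
    """Chain pre-ASR context discovery, biased ASR, and post-ASR retrieval."""
    bias_terms = discover(audio_uri)              # rare names, jargon, entities
    segments = transcribe(audio_uri, bias_terms)  # provenance-carrying segments
    passages = retrieve(segments)                 # evidence for downstream reasoning
    return segments, passages
```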

System design principles

Layered evidence flow

Keep links from raw audio to transcript spans to retrieved passages to final claims.
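
A minimal sketch of what such a chain could look like as data; every type and field name is an illustrative assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioSpan:
    audio_uri: str
    start_s: float
    end_s: float

@dataclass(frozen=True)
class TranscriptSpan:
    text: str
    audio: AudioSpan                   # transcript text keeps its audio span

@dataclass(frozen=True)
class Passage:
    passage_id: str
    spans: tuple[TranscriptSpan, ...]  # retrieval keeps its transcript origins

@dataclass(frozen=True)
class Claim:
    statement: str
    support: tuple[Passage, ...]       # the claim carries its full lineage

def jump_targets(claim: Claim) -> list[AudioSpan]:
    """Resolve a claim back to playable audio spans for reviewer verification."""
    return [span.audio for passage in claim.support for span in passage.spans]
```

The design choice is that each layer stores a reference to the layer below it, so no transformation can silently drop the path back to audio.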

Modality-aware retrieval

Use transcript retrieval where it is sufficient, but keep audio-aware retrieval for cases where prosody, speaker state, or transcription error matters.
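
A hedged sketch of one possible routing rule, assuming per-segment ASR confidence is available; the threshold and both retriever callables are placeholders:

```python
from typing import Callable

def retrieve(
    query: str,
    segment_confidence: float,
    transcript_search: Callable[[str], list[str]],
    audio_search: Callable[[str], list[str]],
    confidence_floor: float = 0.85,   # illustrative threshold, not tuned
) -> list[str]:
    """Route to transcript retrieval when ASR is trustworthy, and to
    audio-aware retrieval when transcription error (or prosody and
    speaker state) is likely to matter."""
    if segment_confidence >= confidence_floor:
        return transcript_search(query)   # cheap and usually sufficient
    return audio_search(query)            # keeps acoustic evidence in play
```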

Incremental review

Let humans verify segments, entities, and extracted fields locally instead of reviewing a large final answer only at the end.
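
As a sketch of local repair under assumed types, a single reviewer fix can be propagated mechanically instead of triggering a full re-read; the naive string replacement is an illustrative simplification:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Segment:
    segment_id: str
    speaker: str
    text: str

def propagate_name_fix(
    segments: list[Segment], wrong: str, right: str
) -> list[Segment]:
    """Apply one reviewer correction (e.g. a misrecognized name) to every
    downstream segment, so the fix is made once rather than re-checked
    in a large final answer."""
    return [
        replace(s, text=s.text.replace(wrong, right)) if wrong in s.text else s
        for s in segments
    ]
```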

Workflow-shaped outputs

Generate summaries, entities, timelines, and alerts in formats that downstream teams can act on or audit.
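
For instance, a timeline entry might be exported as a small JSON artifact; the field names and event types below are assumptions, not a fixed schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ExtractedEvent:
    """One timeline entry a downstream team can act on or audit."""
    when_s: float        # offset into the source audio, seconds
    speaker: str
    event_type: str      # e.g. "commitment", "deadline", "risk_flag"
    summary: str
    evidence_span: str   # quoted transcript text backing the entry

def export_timeline(events: list[ExtractedEvent]) -> str:
    """Emit a machine-checkable artifact rather than free-form prose."""
    return json.dumps([asdict(e) for e in events], indent=2)
```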

Human factors and review design

  • Timestamped evidence is more valuable than fluent paraphrase when a reviewer has to verify a claim quickly.
  • Speaker attribution and turn boundaries should survive retrieval and summarization, because many operational errors begin as attribution errors.
  • Correction loops should support local repair: fix a name, speaker, or time span once, then propagate that correction downstream.
  • Good speech interfaces respect attention. Reviewers need compact uncertainty cues and jump-to-audio links rather than long textual justifications.

Evaluation agenda

Recognition quality in context

Measure rare-term accuracy, speaker attribution, and robustness across noisy, accented, or domain-shifted speech.
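
A deliberately simple sketch of a rare-term recall check; real scoring would normalize inflection and tokenization, and rare_terms is assumed to come from a domain lexicon:

```python
def rare_term_recall(reference: str, hypothesis: str, rare_terms: set[str]) -> float:
    """Fraction of rare terms in the reference that the ASR hypothesis kept.
    Token-level and case-insensitive only; a crude but cheap signal."""
    ref_tokens = set(reference.lower().split())
    hyp_tokens = set(hypothesis.lower().split())
    targets = {t.lower() for t in rare_terms} & ref_tokens
    if not targets:
        return 1.0  # nothing rare to recognize in this utterance
    return len(targets & hyp_tokens) / len(targets)
```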

Grounded retrieval

Evaluate whether retrieved passages or audio segments actually support the generated claim, citation, or extraction.
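
As a placeholder for that check, a crude lexical-overlap proxy is sketched below; a serious evaluation would use an entailment model or human judgment instead:

```python
def supports(claim: str, passage: str, min_overlap: float = 0.5) -> bool:
    """Rough proxy for 'does this passage support this claim?'. It only
    flags passages that share too little content with the claim to
    plausibly ground it; the threshold is an untuned assumption."""
    claim_tokens = {t for t in claim.lower().split() if len(t) > 3}
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return False
    overlap = len(claim_tokens & passage_tokens) / len(claim_tokens)
    return overlap >= min_overlap
```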

Long-form continuity

Test performance across topic shifts, long meetings, and multi-turn conversations where memory and retrieval interact.
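
One hedged way to make continuity measurable is to bucket probe questions by where their evidence sits in the recording and watch for drift; the window size and input format are assumptions:

```python
def accuracy_by_window(
    probes: list[tuple[float, bool]],   # (evidence timestamp_s, answer_correct)
    window_s: float = 600.0,            # ten-minute buckets, illustrative
) -> dict[int, float]:
    """Group probe outcomes by position in the meeting, so quality drift
    across a long recording becomes visible rather than averaged away."""
    buckets: dict[int, list[bool]] = {}
    for t, correct in probes:
        buckets.setdefault(int(t // window_s), []).append(correct)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```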

Reviewer effort

Track correction time, number of manual fixes, and how often humans must return to raw audio.
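
A small sketch of aggregating those three signals from an assumed review event log; the event kinds are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewEvent:
    kind: str          # "edit", "audio_replay", or "accept" (assumed labels)
    duration_s: float

def effort_summary(events: list[ReviewEvent]) -> dict[str, float]:
    """Aggregate correction time, manual fixes, and returns to raw audio."""
    return {
        "correction_time_s": sum(e.duration_s for e in events if e.kind == "edit"),
        "manual_fixes": float(sum(1 for e in events if e.kind == "edit")),
        "audio_returns": float(sum(1 for e in events if e.kind == "audio_replay")),
    }
```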

Open questions

  • When is transcript-only RAG enough, and when do we need audio-aware retrieval?
  • How should long-form speech benchmarks capture speaker identity, discourse structure, and evidence quality together?
  • What is the right abstraction layer for analyst-facing outputs: transcript spans, structured events, or both?
  • How can ASR context discovery, retrieval, and downstream agent reasoning be evaluated as one pipeline instead of isolated stages?

Signals shaping this direction