Research direction

Speech, language, and evidence workflows

Building ASR + LLM + RAG workflows for conversation analysis, structured extraction, and grounded reasoning over long audio and transcripts.

ASR · LLM · RAG

Current core view

Speech workflows become genuinely useful when they preserve grounding across audio, transcript, retrieval, and analyst judgment. The interesting work is not another chat wrapper around transcripts; it is a pipeline that keeps evidence aligned while moving toward decisions.

2026 observations

The 2026 shift is away from short-clip demos and toward long-form, multi-speaker, evidence-carrying audio workflows. Speech systems now need to preserve what transcripts flatten.

Human-factors design perspective

Real users do not want to babysit a black box over a 40-minute call. They need segment-level grounding, fast correction loops, and outputs that match the time pressure of analyst work.

Why now

LongSpeech, released on January 20, 2026, makes the gap explicit by showing how quickly model quality drops once tasks involve long audio, multi-task reasoning, and sustained context. In 2025, WavRAG argued that transcript-only RAG throws away useful audio information, CORAL showed that real conversational retrieval is multi-turn and citation-sensitive, and recent EMNLP work on context discovery for ASR demonstrated that retrieval can improve rare-term recognition with lower latency than heavier LLM-only alternatives. Taken together, these signals shift the research question from recognition alone to end-to-end workflow design.

Current priorities

  • Preserve timestamps, speakers, and segment provenance through every transformation from raw audio to final claim.
  • Combine pre-ASR context discovery with post-ASR retrieval and evidence-aware reasoning instead of treating them as unrelated modules (a pipeline sketch follows this list).
  • Design for long-form, multi-turn, and domain-shifted speech rather than clean clip-level tasks.
  • Turn outputs into structured artifacts that analysts can inspect, correct, and hand off quickly.
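
To make the second priority concrete, here is a minimal sketch of one pipeline shape. The functions discover_terms, transcribe, and retrieve_passages are hypothetical stand-ins for a real term index, recognizer, and retriever, not any specific library's API.

```python
from dataclasses import dataclass


def discover_terms(metadata: str) -> list[str]:
    # Pre-ASR context discovery: pull capitalized tokens from meeting
    # metadata (agenda, attendee list) as candidate biasing terms.
    return sorted({w.strip(",.") for w in metadata.split() if w[:1].isupper()})


def transcribe(audio_path: str, hotwords: list[str]) -> str:
    # Placeholder for a recognizer that accepts a biasing list.
    return f"<transcript of {audio_path}, biased toward {len(hotwords)} terms>"


def retrieve_passages(transcript: str, corpus: list[str]) -> list[str]:
    # Placeholder lexical retrieval: keep passages sharing any token
    # with the transcript.
    tokens = set(transcript.lower().split())
    return [p for p in corpus if tokens & set(p.lower().split())]


@dataclass
class PipelineResult:
    transcript: str
    bias_terms: list[str]   # rare terms injected before recognition
    evidence: list[str]     # passages retrieved after recognition


def run_pipeline(audio_path: str, metadata: str, corpus: list[str]) -> PipelineResult:
    bias_terms = discover_terms(metadata)             # pre-ASR stage
    transcript = transcribe(audio_path, bias_terms)   # biased recognition
    evidence = retrieve_passages(transcript, corpus)  # post-ASR stage
    return PipelineResult(transcript, bias_terms, evidence)
```

The point of the shape is that the same bias terms and retrieved passages travel with the transcript, so later stages can see what influenced recognition.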

System design principles

Layered evidence flow

Keep links from raw audio to transcript spans to retrieved passages to final claims.
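
A minimal sketch of what keeping those links can mean in code, assuming simple in-memory records. The type names (AudioSpan, TranscriptSpan, RetrievedPassage, Claim) are illustrative, not from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioSpan:
    source_file: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class TranscriptSpan:
    audio: AudioSpan          # link back to the raw audio
    speaker: str
    text: str


@dataclass(frozen=True)
class RetrievedPassage:
    doc_id: str
    text: str


@dataclass(frozen=True)
class Claim:
    text: str
    transcript_spans: tuple[TranscriptSpan, ...]   # where it was said
    passages: tuple[RetrievedPassage, ...]         # what supports it


def audio_provenance(claim: Claim) -> list[AudioSpan]:
    # Walk the chain back from a final claim to the raw audio spans,
    # which is what a reviewer needs for jump-to-audio verification.
    return [span.audio for span in claim.transcript_spans]
```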

Modality-aware retrieval

Use transcript retrieval where it is sufficient, but keep audio-aware retrieval for cases where prosody, speaker state, or transcription error matters.
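
One way to express that routing decision, assuming each segment carries an ASR confidence score and a prosody flag. The threshold and field names are illustrative assumptions, not tuned values.

```python
from dataclasses import dataclass
from enum import Enum


class RetrievalMode(Enum):
    TRANSCRIPT = "transcript"   # cheap text retrieval is sufficient
    AUDIO = "audio"             # fall back to audio-aware retrieval


@dataclass
class Segment:
    text: str
    asr_confidence: float       # 0.0-1.0, the recognizer's own estimate
    prosody_salient: bool       # e.g., marked stress or speaker-state cues


def choose_mode(seg: Segment, min_confidence: float = 0.85) -> RetrievalMode:
    # Route to audio-aware retrieval when the transcript is likely to be
    # wrong (low ASR confidence) or to have flattened meaning (prosody).
    if seg.asr_confidence < min_confidence or seg.prosody_salient:
        return RetrievalMode.AUDIO
    return RetrievalMode.TRANSCRIPT
```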

Incremental review

Let humans verify segments, entities, and extracted fields locally instead of reviewing a large final answer only at the end.
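
A minimal sketch of segment-level review state, assuming a simple three-status model; the statuses and fields are illustrative.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewStatus(Enum):
    UNREVIEWED = "unreviewed"
    VERIFIED = "verified"
    CORRECTED = "corrected"


@dataclass
class ReviewableSegment:
    segment_id: str
    text: str
    status: ReviewStatus = ReviewStatus.UNREVIEWED
    corrections: list[str] = field(default_factory=list)

    def verify(self) -> None:
        self.status = ReviewStatus.VERIFIED

    def correct(self, new_text: str) -> None:
        # Keep the old text so the correction itself is auditable.
        self.corrections.append(self.text)
        self.text = new_text
        self.status = ReviewStatus.CORRECTED


def review_progress(segments: list[ReviewableSegment]) -> float:
    # Fraction of segments a human has signed off on, verified or fixed.
    done = sum(s.status is not ReviewStatus.UNREVIEWED for s in segments)
    return done / len(segments) if segments else 0.0
```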

Workflow-shaped outputs

Generate summaries, entities, timelines, and alerts in formats that downstream teams can act on or audit.
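
A minimal sketch of such artifacts as typed, serializable records rather than free text; the field choices are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TimelineEvent:
    timestamp_s: float        # position in the source audio
    speaker: str
    description: str
    evidence_segment: str     # id of the transcript segment backing it


@dataclass
class Alert:
    severity: str             # e.g. "info" | "warning" | "critical"
    message: str
    evidence_segment: str


def export_artifacts(events: list[TimelineEvent], alerts: list[Alert]) -> str:
    # Serialize to JSON so downstream teams can act on or audit the
    # output without re-reading the transcript.
    return json.dumps(
        {"timeline": [asdict(e) for e in events],
         "alerts": [asdict(a) for a in alerts]},
        indent=2,
    )
```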

Human factors and review workflow

  • Timestamped evidence is more valuable than fluent paraphrase when a reviewer has to verify a claim quickly.
  • Speaker attribution and turn boundaries should survive retrieval and summarization, because many operational errors begin as attribution errors.
  • Correction loops should support local repair: fix a name, speaker, or time span once, then propagate that correction downstream (a propagation sketch follows this list).
  • Good speech interfaces respect attention. Reviewers need compact uncertainty cues and jump-to-audio links rather than long textual justifications.
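
A minimal sketch of local repair with propagation, assuming downstream artifacts are plain text records; the artifact model and the string replacement are deliberate simplifications.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    kind: str     # "transcript" | "summary" | "entity" | "timeline"
    text: str


def propagate_fix(artifacts: list[Artifact], wrong: str, right: str) -> int:
    # Apply a reviewer's single correction everywhere downstream and
    # report how many artifacts changed, so the fix is auditable.
    changed = 0
    for a in artifacts:
        if wrong in a.text:
            a.text = a.text.replace(wrong, right)
            changed += 1
    return changed


artifacts = [
    Artifact("transcript", "Speaker A: call Jon Smyth tomorrow."),
    Artifact("summary", "Action item assigned to Jon Smyth."),
]
print(propagate_fix(artifacts, "Jon Smyth", "John Smith"))  # -> 2
```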

Evaluation agenda

Recognition quality in context

Measure rare-term accuracy, speaker attribution, and robustness across noisy, accented, or domain-shifted speech.
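
One illustrative rare-term metric, assuming a known list of domain terms; this is a sketch, not a metric taken from the cited papers.

```python
def rare_term_recall(reference: str, hypothesis: str, rare_terms: list[str]) -> float:
    ref = reference.lower()
    hyp = hypothesis.lower()
    # Only score terms that actually occur in the reference.
    present = [t for t in rare_terms if t.lower() in ref]
    if not present:
        return 1.0
    recognized = sum(t.lower() in hyp for t in present)
    return recognized / len(present)


print(rare_term_recall(
    reference="start warfarin and check the INR on friday",
    hypothesis="start war fairing and check the INR on friday",
    rare_terms=["warfarin", "INR"],
))  # -> 0.5
```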

Grounded retrieval

Evaluate whether retrieved passages or audio segments actually support the generated claim, citation, or extraction.
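
A minimal sketch of a support check. A real system would use an entailment model; the lexical-overlap scorer here is a cheap illustrative stand-in, and the stopword list and threshold are assumptions.

```python
def support_score(claim: str, passage: str) -> float:
    # Fraction of the claim's content words that also appear in the
    # passage; 1.0 means every claim token is attested.
    stop = {"the", "a", "an", "of", "to", "and", "is", "was", "on", "in"}
    claim_tokens = {w for w in claim.lower().split() if w not in stop}
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)


def is_grounded(claim: str, passages: list[str], threshold: float = 0.8) -> bool:
    # A claim counts as grounded if at least one retrieved passage
    # covers most of its content words.
    return any(support_score(claim, p) >= threshold for p in passages)
```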

Long-form continuity

Test performance across topic shifts, long meetings, and multi-turn conversations where memory and retrieval interact.
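
One illustrative harness shape for this, assuming the meeting is already split into topic segments and the caller supplies a per-segment scorer.

```python
from typing import Callable


def continuity_drop(segments: list[str], score_fn: Callable[[str], float]) -> float:
    # Score each topic segment independently, then compare early vs. late
    # halves; a positive drop means quality degrades as the meeting runs on.
    scores = [score_fn(seg) for seg in segments]
    mid = len(scores) // 2
    early = sum(scores[:mid]) / max(mid, 1)
    late = sum(scores[mid:]) / max(len(scores) - mid, 1)
    return early - late
```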

Reviewer effort

Track correction time, number of manual fixes, and how often humans must return to raw audio.
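
A minimal sketch of that telemetry, assuming the review UI can emit timed events; the event kinds and fields are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ReviewEvent:
    kind: str        # "fix" | "audio_jump" | "session"
    seconds: float   # time spent on the event


def effort_report(events: list[ReviewEvent]) -> dict[str, float]:
    # The three numbers this agenda calls for: total correction time,
    # count of manual fixes, and returns to raw audio.
    fixes = [e for e in events if e.kind == "fix"]
    jumps = [e for e in events if e.kind == "audio_jump"]
    return {
        "correction_time_s": sum(e.seconds for e in fixes),
        "manual_fixes": float(len(fixes)),
        "audio_returns": float(len(jumps)),
    }
```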

Questions we keep asking

  • When is transcript-only RAG enough, and when do we need audio-aware retrieval?
  • How should long-form speech benchmarks capture speaker identity, discourse structure, and evidence quality together?
  • What is the right abstraction layer for analyst-facing outputs: transcript spans, structured events, or both?
  • How can ASR context discovery, retrieval, and downstream agent reasoning be evaluated as one pipeline instead of isolated stages?

Signals shaping this direction