Research direction

Speech, language, and evidence pipelines

We build workflows that combine ASR, LLMs, and RAG for conversation analysis, structured extraction, and evidence-oriented reasoning over long-form audio and transcripts.

ASR · LLM · RAG

Current central hypothesis

Speech workflows become genuinely useful when they preserve grounding across audio, transcript, retrieval, and analyst judgment. The interesting work is not another chat wrapper around transcripts; it is a pipeline that keeps evidence aligned while moving toward decisions.

Outlook for 2026

The 2026 shift is away from short-clip demos and toward long-form, multi-speaker, evidence-carrying audio workflows. Speech systems now need to preserve what transcripts flatten.

Human-centered design perspective

Real users do not want to babysit a black box over a 40-minute call. They need segment-level grounding, fast correction loops, and outputs that match the time pressure of analyst work.

Why this matters now

LongSpeech, released on January 20, 2026, makes the gap explicit by showing how quickly model quality drops once tasks involve long audio, multi-task reasoning, and sustained context. In 2025, WavRAG argued that transcript-only RAG throws away useful audio information, CORAL showed that real conversational retrieval is multi-turn and citation-sensitive, and recent EMNLP work on context discovery for ASR demonstrated that retrieval can improve rare-term recognition with lower latency than heavier LLM-only alternatives. Taken together, these signals shift the research question from recognition alone to end-to-end workflow design.

Current priorities

  • Preserve timestamps, speakers, and segment provenance through every transformation from raw audio to final claim.
  • Combine pre-ASR context discovery with post-ASR retrieval and evidence-aware reasoning instead of treating them as unrelated modules.
  • Design for long-form, multi-turn, and domain-shifted speech rather than clean clip-level tasks.
  • Turn outputs into structured artifacts that analysts can inspect, correct, and hand off quickly.
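The first priority, preserving provenance through every transformation, can be sketched as a data-structure discipline. The following is a minimal illustrative sketch, not an implementation from this project; the `Segment` and `Claim` names are hypothetical:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a segment record that carries provenance
# (source audio, time span, speaker) through each transformation step.
@dataclass(frozen=True)
class Segment:
    audio_id: str
    start: float          # seconds into the source audio
    end: float
    speaker: str
    text: str

@dataclass
class Claim:
    text: str
    # Every derived claim keeps pointers back to its supporting segments.
    evidence: list = field(default_factory=list)

def merge_segments(segments, claim_text):
    """Build a higher-level claim without discarding segment provenance."""
    return Claim(text=claim_text, evidence=list(segments))

segs = [
    Segment("call_01.wav", 12.4, 18.9, "spk_1", "We shipped the fix on Friday."),
    Segment("call_01.wav", 19.0, 24.2, "spk_2", "Confirmed, the alerts stopped."),
]
claim = merge_segments(segs, "The fix was deployed and resolved the alerts.")
# Any claim can be traced back to exact audio spans and speakers.
print([(s.audio_id, s.start, s.end, s.speaker) for s in claim.evidence])
```

The point of the sketch is that provenance is a field that transformations must copy forward, not metadata reattached after the fact.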

System design principles

Layered evidence flow

Keep links from raw audio to transcript spans to retrieved passages to final claims.
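One way to picture this layering: each layer stores only an id-link to the layer below, so any final claim can be resolved back to raw audio. A toy sketch with hypothetical ids and a `trace` helper (not part of any real system here):

```python
# Hypothetical layered store: audio -> transcript span -> retrieval -> claim.
audio = {"a1": "call_01.wav"}
transcript_spans = {"t7": {"audio": "a1", "start": 102.5, "end": 109.0,
                           "text": "the contract renews in March"}}
retrieved = {"r3": {"span": "t7", "passage": "Renewal clause, section 4.2"}}
claims = {"c1": {"retrieval": "r3", "text": "Renewal is due in March per 4.2"}}

def trace(claim_id):
    """Walk a claim back to its audio source through every layer."""
    r = retrieved[claims[claim_id]["retrieval"]]
    t = transcript_spans[r["span"]]
    return {"file": audio[t["audio"]], "start": t["start"], "end": t["end"]}

print(trace("c1"))  # which file and time span supports claim c1
```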

Modality-aware retrieval

Use transcript retrieval where it is sufficient, but keep audio-aware retrieval for cases where prosody, speaker state, or transcription error matters.
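The routing decision can be made explicit. A deliberately simple sketch with an assumed confidence threshold of 0.85 (the function and threshold are illustrative, not prescribed):

```python
# Hypothetical routing rule: fall back to audio-aware retrieval only when
# the transcript is unreliable (low ASR confidence) or prosody matters.
def choose_retriever(query, segment_confidence, prosody_sensitive=False):
    if prosody_sensitive or segment_confidence < 0.85:
        return "audio"      # e.g. embedding retrieval over raw audio segments
    return "transcript"     # cheaper text retrieval is sufficient

print(choose_retriever("renewal date", 0.95))   # transcript is enough
print(choose_retriever("renewal date", 0.60))   # low ASR confidence
print(choose_retriever("tone of voice", 0.95, prosody_sensitive=True))
```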

Incremental review

Let humans verify segments, entities, and extracted fields locally instead of reviewing a large final answer only at the end.
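Incremental review amounts to tracking verification state per segment rather than per document. A minimal sketch, assuming a simple dict-based segment record:

```python
# Hypothetical segment-level review state: humans verify small units as
# they go instead of auditing one large final answer at the end.
def review_queue(segments):
    """Return only the segments that still need human verification."""
    return [s for s in segments if not s["verified"]]

def mark_verified(segments, seg_id):
    for s in segments:
        if s["id"] == seg_id:
            s["verified"] = True

segments = [
    {"id": "t1", "text": "Order 4417 shipped", "verified": False},
    {"id": "t2", "text": "Invoice was paid", "verified": True},
]
print(review_queue(segments))   # only t1 remains to check
mark_verified(segments, "t1")
print(review_queue(segments))   # empty: everything locally verified
```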

Workflow-shaped outputs

Generate summaries, entities, timelines, and alerts in formats that downstream teams can act on or audit.
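Concretely, "workflow-shaped" means machine-checkable structure rather than free prose. The artifact schema below is a hypothetical example, not a fixed format:

```python
import json

# Hypothetical artifact: entities, timeline, and alerts as JSON so that
# downstream teams can act on it, diff it, and audit it.
artifact = {
    "call": "call_01.wav",
    "entities": [{"name": "Acme Corp", "type": "org", "span": [31.2, 33.0]}],
    "timeline": [{"t": 41.7, "event": "price objection raised"}],
    "alerts": [{"severity": "low", "reason": "competitor mentioned"}],
}
payload = json.dumps(artifact, indent=2)
print(payload)
```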

Human factors and review design

  • Timestamped evidence is more valuable than fluent paraphrase when a reviewer has to verify a claim quickly.
  • Speaker attribution and turn boundaries should survive retrieval and summarization, because many operational errors begin as attribution errors.
  • Correction loops should support local repair: fix a name, speaker, or time span once, then propagate that correction downstream.
  • Good speech interfaces respect attention. Reviewers need compact uncertainty cues and jump-to-audio links rather than long textual justifications.
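The "fix once, propagate" pattern from the third bullet can be sketched in a few lines. The string-replacement approach below is an intentionally crude stand-in for real entity-level correction:

```python
# Hypothetical local-repair sketch: one reviewer correction of an entity
# surface form propagates to every downstream segment that reused it.
def propagate_fix(segments, wrong, right):
    """Apply a single correction across all transcript segments."""
    return [s.replace(wrong, right) for s in segments]

segments = [
    "Mr. Kowalsky approved the budget.",
    "Follow up with Kowalsky next week.",
]
fixed = propagate_fix(segments, "Kowalsky", "Kowalski")
print(fixed)
```

A production version would propagate at the entity-id level rather than by raw string, so that homonyms and partial matches are not over-corrected.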

Evaluation agenda

Recognition quality in context

Measure rare-term accuracy, speaker attribution, and robustness across noisy, accented, or domain-shifted speech.
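Rare-term accuracy has a simple core metric. A sketch of rare-term recall, with invented medical terms purely as example data:

```python
# Hypothetical rare-term recall: of the domain terms known to occur in
# the reference, what fraction appears in the ASR hypothesis?
def rare_term_recall(reference_terms, hypothesis):
    hyp = hypothesis.lower()
    hits = sum(1 for t in reference_terms if t.lower() in hyp)
    return hits / max(len(reference_terms), 1)

terms = ["dapagliflozin", "HbA1c", "metformin"]
hyp = "patient stays on metformin and dapagliflozin"
print(rare_term_recall(terms, hyp))  # 2 of 3 terms recovered
```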

Grounded retrieval

Evaluate whether retrieved passages or audio segments actually support the generated claim, citation, or extraction.
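As a shape for this evaluation, here is a deliberately crude support proxy based on token overlap. Real support checking needs entailment or attribution models; this only illustrates the segment-level interface such a metric would have:

```python
# Hypothetical, crude support proxy: lexical overlap between a claim and
# a retrieved passage. A stand-in for a proper entailment-based check.
def support_score(claim, passage):
    c, p = set(claim.lower().split()), set(passage.lower().split())
    return len(c & p) / max(len(c), 1)

claim = "the contract renews in march"
good = "contract renews automatically in March each year"
bad = "the weather in March is mild"
print(support_score(claim, good), support_score(claim, bad))
```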

Long-form continuity

Test performance across topic shifts, long meetings, and multi-turn conversations where memory and retrieval interact.

Reviewer effort

Track correction time, number of manual fixes, and how often humans must return to raw audio.
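These reviewer-effort measures reduce to a small event log. A sketch with hypothetical field names and invented numbers:

```python
# Hypothetical review log: correction time, manual fixes, and how often
# the reviewer had to return to raw audio.
events = [
    {"segment": "t1", "seconds": 12, "fixed": True,  "replayed_audio": False},
    {"segment": "t2", "seconds": 48, "fixed": True,  "replayed_audio": True},
    {"segment": "t3", "seconds": 5,  "fixed": False, "replayed_audio": False},
]
total_time = sum(e["seconds"] for e in events)
n_fixes = sum(e["fixed"] for e in events)
replay_rate = sum(e["replayed_audio"] for e in events) / len(events)
print(total_time, n_fixes, round(replay_rate, 2))
```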

Ongoing questions

  • When is transcript-only RAG enough, and when do we need audio-aware retrieval?
  • How should long-form speech benchmarks capture speaker identity, discourse structure, and evidence quality together?
  • What is the right abstraction layer for analyst-facing outputs: transcript spans, structured events, or both?
  • How can ASR context discovery, retrieval, and downstream agent reasoning be evaluated as one pipeline instead of isolated stages?

Signals shaping this direction