
Trustworthy AI Systems

Treat reliability, evaluation, human review, and traceability as part of the architecture, not as add-ons bolted on after the fact.

Reliability · Evaluation · Human review

Current core view

The core problem is no longer making agents look autonomous in a demo. It is making them dependable enough that a human can understand their state, trust the right parts, and interrupt or correct them without paying a coordination tax.

2026 observation

In 2026, the practical frontier is workflow reliability. Agents are becoming useful often enough that standards, identity, evaluation, and escalation design matter more than impressive one-shot completions.

A human-factors design perspective

Oversight fails if review costs more than redoing the work. Good interfaces reduce cognitive load by foregrounding evidence, changes, uncertainty, and consequence before they foreground prose.


Why this matters now

The 2025 Stanford AI Index reported that responsible-AI evaluation remains uneven even as documented incidents continue to accumulate. At the same time, NIST's February 2026 AI Agent Standards Initiative, draft benchmark guidance for language-model evaluation, and ongoing ARIA field-testing work all point in the same direction: the next serious step is comparable, operationally meaningful evaluation. My view is that agent progress is now gated less by raw fluency and more by control surfaces such as permissions, observability, interoperability, and human escalation.

Current priorities

  • Move evaluation from one-shot benchmark wins to repeated-run performance, variance, recovery behavior, and policy adherence.
  • Treat traceability as a product feature: tool calls, retrieved evidence, approval checkpoints, and rollback points should be reviewable without exposing unnecessary chain-of-thought.
  • Design bounded autonomy with explicit scopes for read, write, execute, and external communication actions.
  • Measure trust calibration, not just model confidence, so humans can decide when to rely on the system and when to slow it down.

System design principles

Bounded autonomy

Give agents explicit task scopes, tool budgets, and approval gates instead of vague permission to 'go solve it'.
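One way to make such scopes concrete is a small permission object checked before every proposed action. The scope names, `AgentScope` structure, and `authorize` return values below are illustrative assumptions, not the API of any particular framework:

```python
from dataclasses import dataclass, field

# Action scopes an agent may hold; "external" covers outbound
# communication such as email or third-party API calls.
SCOPES = {"read", "write", "execute", "external"}

@dataclass
class AgentScope:
    allowed: set               # subset of SCOPES granted to this agent
    tool_budget: dict          # max invocations per tool name
    needs_approval: set = field(default_factory=set)  # scopes gated on a human
    _used: dict = field(default_factory=dict)

    def authorize(self, scope: str, tool: str) -> str:
        """Return 'allow', 'escalate', or 'deny' for a proposed action."""
        if scope not in self.allowed:
            return "deny"                       # outside the grant entirely
        used = self._used.get(tool, 0)
        if used >= self.tool_budget.get(tool, 0):
            return "escalate"                   # budget exhausted: hand to a human
        self._used[tool] = used + 1             # budget is consumed even when escalating
        if scope in self.needs_approval:
            return "escalate"                   # approval gate before acting
        return "allow"
```

A coding agent might receive `{"read", "write"}` with `external` withheld entirely, so any outbound action is denied outright rather than escalated.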

Inspectable state

Persist the plan, retrieved context, tool outputs, and meaningful diffs so reviewers can reconstruct what happened quickly.
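A minimal sketch of such a persisted trace, assuming an append-only log of JSON records (the event kinds and field names are illustrative):

```python
import json
import time

class RunTrace:
    """Append-only trace a reviewer can replay: plan, evidence,
    tool outputs, and diffs, without raw chain-of-thought."""

    def __init__(self):
        self.events = []

    def record(self, kind: str, **payload) -> None:
        # kind is one of: plan, retrieval, tool_call, diff, approval
        self.events.append({"ts": time.time(), "kind": kind, **payload})

    def reviewable(self, kinds=("diff", "approval", "retrieval")) -> list:
        """Filter to the consequence-bearing events a reviewer sees first."""
        return [e for e in self.events if e["kind"] in kinds]

    def dump(self) -> str:
        """Serialize as JSON Lines for storage or replay."""
        return "\n".join(json.dumps(e) for e in self.events)
```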

Graceful handoff

Escalation should preserve context and evidence rather than forcing a human to restart the task from scratch.
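Handoff can be treated as a data contract rather than a UI afterthought. A sketch of one such contract, where the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """Everything a human needs to continue, not restart, the task."""
    task: str
    completed_steps: list      # work already done, in order
    pending_action: str        # the action awaiting a human decision
    evidence: list             # pointers to retrieved or cited material
    rationale: str             # why the agent stopped here

    def summary(self) -> str:
        """One-line header for the review surface."""
        done = len(self.completed_steps)
        return f"{self.task}: {done} steps done, pending '{self.pending_action}'"
```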

Comparative evaluation

Compare agents against strong non-agent baselines, mixed-initiative workflows, and failure-handling variants instead of weaker prompts alone.

Human factors and the review workflow

  • Review surfaces should foreground consequences: what changed, what external action is pending, and what evidence supports it.
  • Uncertainty needs to be legible at the workflow level, not hidden inside a single confidence score.
  • Operator trust should be calibrated with stable conventions, reproducible traces, and reversible actions rather than persuasive language.
  • When the system hands work to a person, it should preserve context, partial progress, and rationale so the human can continue without reconstructing the task.

Evaluation agenda

Reliability under repetition

Measure variance across repeated runs, near-miss behavior, and whether recovery steps converge or compound error.
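Repeated-run reliability reduces to a handful of statistics. The sketch below assumes each run is summarized as a success flag, a recovery-attempt count, and a score; the record shape is an assumption for illustration:

```python
from statistics import mean, pstdev

def repetition_report(runs: list) -> dict:
    """runs: [{'success': bool, 'recoveries': int, 'score': float}, ...]
    Returns pass rate, run-to-run score spread, and mean recovery
    attempts (low and stable suggests convergence; high suggests
    compounding error)."""
    scores = [r["score"] for r in runs]
    return {
        "pass_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
        "score_mean": mean(scores),
        "score_stdev": pstdev(scores),   # variance across runs, not one-shot best
        "mean_recoveries": mean(r["recoveries"] for r in runs),
    }
```

Reporting the spread alongside the mean is the point: a system with a high best run but a wide `score_stdev` is exactly the kind that looks good in a demo and fails under repetition.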

Evidence quality

Score whether retrieved or cited material actually supports the action or conclusion the agent produced.

Oversight efficiency

Track time-to-review, override quality, and whether humans catch the failures that matter.
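These three quantities can be computed from per-review records. The record fields below are assumptions about what a review log might capture:

```python
from statistics import median

def oversight_report(reviews: list) -> dict:
    """reviews: one record per human review.
    'seconds': time-to-review; 'true_failure': the run actually failed;
    'flagged': the reviewer caught or overrode it."""
    failures = [r for r in reviews if r["true_failure"]]
    return {
        "median_review_seconds": median(r["seconds"] for r in reviews),
        # Fraction of genuine failures the reviewers actually caught.
        "failure_catch_rate": (
            sum(r["flagged"] for r in failures) / len(failures) if failures else None
        ),
    }
```

If `median_review_seconds` approaches the cost of redoing the task, oversight has failed regardless of the catch rate.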

Policy and safety compliance

Test permission boundaries, escalation behavior, and refusal quality when requests conflict with policy or evidence.
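Permission-boundary tests can be written like ordinary unit tests against the authorization layer. This sketch probes every scope and reports leaks; the `authorize` callable and the fixed scope set are invented interfaces for illustration:

```python
def check_permission_boundary(authorize, granted: set) -> list:
    """Probe every scope outside the grant and return the ones the
    policy would wrongly allow. An empty list means the boundary holds."""
    all_scopes = {"read", "write", "execute", "external"}
    return sorted(s for s in all_scopes - granted if authorize(s) == "allow")
```

Run against both a deliberately permissive policy and the real one, so the test itself is known to detect violations.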

Questions I keep asking

  • What is the right granularity for agent permissions in research, coding, and analyst workflows?
  • How should we compare autonomous, mixed-initiative, and step-wise systems when human labor is part of the objective?
  • Which traces improve reviewer judgment, and which ones only create noise?
  • What counts as acceptable failure behavior for an agent allowed to act on external systems?

Signals shaping this direction