Trustworthy AI and Agent Systems

Why this matters now

The 2025 Stanford AI Index reported that responsible-AI evaluation remains uneven even as documented incidents continue to accumulate. At the same time, NIST's February 2026 AI Agent Standards Initiative, draft benchmark guidance for language-model evaluation, and ongoing ARIA field-testing work all point in the same direction: the next serious step is comparable, operationally meaningful evaluation. My view is that agent progress is now gated less by raw fluency and more by control surfaces such as permissions, observability, interoperability, and human escalation.

Current priorities

Move evaluation from one-shot benchmark wins to repeated-run performance, variance, recovery behavior, and policy adherence.
Treat traceability as a product feature: tool calls, retrieved evidence, approval checkpoints, and rollback points should be reviewable without exposing unnecessary chain-of-thought.
Design bounded autonomy with explicit scopes for read, write, execute, and external communication actions.
Measure trust calibration, not just model confidence, so humans can decide when to rely on the system and when to slow it down.

System design principles

Bounded autonomy

Give agents explicit task scopes, tool budgets, and approval gates instead of vague permission to 'go solve it'.

Inspectable state

Persist the plan, retrieved context, tool outputs, and meaningful diffs so reviewers can reconstruct what happened quickly.

Graceful handoff

Escalation should preserve context and evidence rather than forcing a human to restart the task from scratch.

Comparative evaluation

Compare agents against strong non-agent baselines, mixed-initiative workflows, and failure-handling variants instead of weaker prompts alone.

Human factors and review design

Review surfaces should foreground consequences: what changed, what external action is pending, and what evidence supports it.
Uncertainty needs to be legible at the workflow level, not hidden inside a single confidence score.
Operator trust should be calibrated with stable conventions, reproducible traces, and reversible actions rather than persuasive language.
When the system hands work to a person, it should preserve context, partial progress, and rationale so the human can continue without reconstructing the task.

Evaluation agenda

Reliability under repetition

Measure variance across repeated runs, near-miss behavior, and whether recovery steps converge or compound error.

Evidence quality

Score whether retrieved or cited material actually supports the action or conclusion the agent produced.

Oversight efficiency

Track time-to-review, override quality, and whether humans catch the failures that matter.

Policy and safety compliance

Test permission boundaries, escalation behavior, and refusal quality when requests conflict with policy or evidence.

Open questions

What is the right granularity for agent permissions in research, coding, and analyst workflows?
How should we compare autonomous, mixed-initiative, and step-wise systems when human labor is part of the objective?
Which traces improve reviewer judgment, and which ones only create noise?
What counts as acceptable failure behavior for an agent allowed to act on external systems?