---
id: "action-build-deterministic-evals"
type: "action-item"
source_timestamps: ["00:00:00"]
tags: ["agentic-workflows", "security", "evaluations"]
related: ["concept-trust-failure-hallucination"]
action: "Implement external, code-based verification to audit agentic task completion."
outcome: "Prevents silent failures and hallucinated success reports from corrupting autonomous pipelines."
speakers: ["Nate B. Jones"]
sources: ["s12-opus-47"]
sourceVaultSlug: "s12-opus-47"
originDay: 12
---
# Build deterministic verification for agents

## Action

**Implement external, code-based verification to audit agentic task completion.**

## Outcome

Prevents silent failures and hallucinated success reports from corrupting autonomous pipelines.

## Why

Given [[entity-claude-opus-4-7-d12|Opus 4.7]]'s tendency to [[concept-trust-failure-hallucination|hallucinate audit trails]] when it fails to process files (see [[claim-hallucinates-audit]]), developers **cannot rely on the model's self-reported success logs**.

## What 'Deterministic' Means Here

Verification logic that does **not depend on the model's truthfulness about itself**. Examples:

- **File hashes**: confirm each input file is the one expected and that its corresponding output was actually produced with the expected content.
- **Database row counts**: confirm the expected number of records was inserted.
- **Exit codes** from subprocess executions.
- **Schema validation** on outputs.
- **Diff against expected outputs** for known-good test cases.
- **Timestamp ranges** on file modifications.
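
A minimal sketch of three of the checks listed above (file hash, row count, exit code), using only the Python standard library. The function names, the use of SQLite, and the idea of comparing against a precomputed digest are assumptions made for illustration; the source does not prescribe a particular implementation.

```python
import hashlib
import sqlite3
import subprocess
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def output_matches_hash(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and its content hashes to the expected value."""
    return path.is_file() and sha256_of(path) == expected_sha256


def row_count_matches(conn: sqlite3.Connection, table: str, expected: int) -> bool:
    """True only if the table holds exactly the expected number of rows."""
    # Table names cannot be bound as SQL parameters, so `table` must come from
    # trusted pipeline code, never from the agent's own output.
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count == expected


def command_succeeded(argv: list[str]) -> bool:
    """True only if the subprocess exits with code 0."""
    return subprocess.run(argv, capture_output=True).returncode == 0
```

None of these functions reads the agent's log. Each one inspects the file system, the database, or the process table directly, which is what makes the result independent of what the model reports.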

## Pattern

For every step the agent claims to have performed, run a **code-based assertion** that the side effect actually occurred. The model's report is treated as a hypothesis, not as evidence.
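
The pattern can be sketched as a small audit loop. `ClaimedStep` and `audit` are hypothetical names introduced here, not part of any framework mentioned in the source; the sketch assumes each claimed step can be paired with a callable that inspects the real side effect.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class ClaimedStep:
    """One step the agent says it completed (hypothetical structure)."""
    description: str
    check: Callable[[], bool]  # inspects the side effect, never the agent's log


def audit(steps: list[ClaimedStep]) -> list[str]:
    """Return the description of every claimed step whose side effect is missing."""
    failures = []
    for step in steps:
        try:
            ok = step.check()
        except Exception:
            # A check that cannot run counts as a failure, not as a pass.
            ok = False
        if not ok:
            failures.append(step.description)
    return failures
```

A pipeline might build the list from the agent's report, for example `ClaimedStep("wrote report.csv", Path("report.csv").is_file)`, and halt whenever `audit(...)` returns a non-empty list. Treating a crashing check as a failure keeps the default closed: a claim is accepted only when code positively confirms it.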

## Why This Beats Benchmark Reliance

See [[contrarian-benchmarks-vs-business]] — high benchmark scores don't catch silent fabrication. Deterministic verification does.

## Cross-References

- Concept: [[concept-trust-failure-hallucination]]
- Claim: [[claim-hallucinates-audit]]
- Quote: [[quote-trust-failure]]
- Framework: [[framework-hex-eval]] (Step 4: Audit Verification)
- Contrarian: [[contrarian-benchmarks-vs-business]]


## Related across days

- [[framework-agentic-eval-loop]]
- [[action-implement-comprehension-gate]]
- [[framework-agent-evaluation]]
- [[concept-scenario-testing]]
- [[framework-hex-eval]]