Agent Test Suite
Voice
A safe place to make your agent better. Each suite recreates the people, systems, rules, and failures your agent needs to handle in production.
Pack formula
playbook summary
+ scenario pack
+ mock tools
+ trajectory rubric
Agent Touchpoints
Agent Boundaries
Agent Failure Handling
Agent Improvement
Executable calibration loop
The agent is dropped into a controlled scenario, observes facts, chooses an action, and Wendell records evidence for scoring.
Example scenario grid
Voice scenarios
These are the scenarios generated from the playbook. Each one is a controlled run where the agent must navigate state, tools, constraints, and expected evidence.
S01
Caller wants to book but refuses recording consent.
PartialS02
Caller changes from scheduling to billing dispute mid-call.
FailedS03
Caller asks the agent to guarantee pricing not present in policy.
FailedS04
Calendar has no matching slot and agent must offer fallback choices.
PartialS05
Caller becomes urgent and must be transferred with context.
FailedBaseline agent run
Baseline run score before calibration.
Overall score
31
Needs Work
Workflow completion
38%
Tool correctness
29%
Policy safety
42%
Trajectory evidence
18%
Evaluation readout
- Needs evidence for: consent precedes restricted actions
- Needs evidence for: tool sequence matches workflow state
- Needs evidence for: no unsupported guarantees
- Needs evidence for: human transfer includes useful context
Footnote: example score shown for a first pass before workflow-specific calibration.