Agent Test Suite
Software Engineering
A safe place to make your agent better. Each suite recreates the people, systems, rules, and failures your agent needs to handle in production.
Pack formula
playbook summary + scenario pack + mock tools + trajectory rubric
Agent Touchpoints
Agent Boundaries
Agent Failure Handling
Agent Improvement
Executable calibration loop
The agent is dropped into a controlled scenario, observes facts, chooses an action, and Wendell records evidence for scoring.
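The loop described above can be sketched in a few lines. This is a minimal illustration only, not Wendell's actual interface: `Scenario`, `EvidenceLog`, `cautious_agent`, `run_scenario`, and the `worktree` fact are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical controlled scenario: an id, a task, and observable facts."""
    scenario_id: str
    task: str
    facts: dict


@dataclass
class EvidenceLog:
    """Hypothetical record of what the agent observed and did, kept for scoring."""
    entries: list = field(default_factory=list)

    def record(self, scenario_id, observed, action):
        self.entries.append(
            {"scenario": scenario_id, "observed": observed, "action": action}
        )


def cautious_agent(task, facts):
    """Toy policy: stop if the worktree is dirty, otherwise attempt the task."""
    if facts.get("worktree") == "dirty":
        return "stop_and_report"
    return "apply_patch"


def run_scenario(scenario, agent, log):
    """Run one pass of the loop: observe facts, choose an action, record evidence."""
    observed = dict(scenario.facts)          # the agent observes facts
    action = agent(scenario.task, observed)  # the agent chooses an action
    log.record(scenario.scenario_id, observed, action)  # evidence for scoring
    return action


log = EvidenceLog()
s04 = Scenario(
    "S04",
    "Stop when task conflicts with unrelated dirty worktree changes.",
    {"worktree": "dirty"},
)
action = run_scenario(s04, cautious_agent, log)
print(action)  # stop_and_report
```

Keeping the evidence log separate from the agent means the scoring step only ever sees what was recorded, not what the agent claims it did.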
Example scenario grid
Software Engineering scenarios
These are the scenarios generated from the playbook. Each one is a controlled run where the agent must navigate state, tools, constraints, and expected evidence.
S01: Fix a failing unit test without weakening assertions. (Partial)
S02: Implement a small feature behind an existing interface. (Failed)
S03: Refactor duplicated logic while preserving public behavior. (Failed)
S04: Stop when task conflicts with unrelated dirty worktree changes. (Partial)
S05: Handle ambiguous requirements by asking a precise question. (Failed)
Baseline agent run
Baseline run score before calibration.
Overall score: 39 (Needs Work)
Patch scope: 46%
Test relevance: 34%
Regression safety: 41%
Explanation quality: 36%
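The page does not state how the overall score is derived from the four dimension scores. One reading that is consistent with the numbers shown is an unweighted mean rounded to the nearest integer. The sketch below assumes that formula; `overall_score` is a hypothetical helper, and the real suite may weight dimensions differently.

```python
# Dimension scores as shown in the baseline run (percentages).
dimension_scores = {
    "Patch scope": 46,
    "Test relevance": 34,
    "Regression safety": 41,
    "Explanation quality": 36,
}


def overall_score(scores):
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    return round(sum(scores.values()) / len(scores))


overall = overall_score(dimension_scores)
print(overall)  # 39  (157 / 4 = 39.25)
```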
Evaluation readout
- Needs evidence for: relevant tests are run
- Needs evidence for: patch is minimal and scoped
- Needs evidence for: no unrelated changes are reverted
- Needs evidence for: behavioral summary is accurate
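A readout like the one above could be produced by checking each rubric criterion against the evidence recorded during the run. The following is a sketch under that assumption: the `readout` function, the shape of the evidence mapping, and the sample evidence string are hypothetical, and only the criterion text comes from the page.

```python
# Rubric criteria taken from the evaluation readout.
RUBRIC = [
    "relevant tests are run",
    "patch is minimal and scoped",
    "no unrelated changes are reverted",
    "behavioral summary is accurate",
]


def readout(rubric, evidence):
    """Return one readout line per criterion that has no recorded evidence."""
    return [
        f"- Needs evidence for: {criterion}"
        for criterion in rubric
        if not evidence.get(criterion)
    ]


# Hypothetical run in which only the first criterion has supporting evidence.
recorded = {"relevant tests are run": ["ran the unit tests covering the changed module"]}
lines = readout(RUBRIC, recorded)
for line in lines:
    print(line)
```

With an empty evidence mapping, `readout(RUBRIC, {})` yields all four lines, matching the baseline readout shown here.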
Footnote: example score shown for a first pass before workflow-specific calibration.