Wendell

Agent Test Suite

Software Engineering

A safe place to make your agent better. Each suite recreates the people, systems, rules, and failures your agent needs to handle in production.

Pack formula

playbook summary
+ scenario pack
+ mock tools
+ trajectory rubric

Agent Touchpoints

  • Developers
  • Repositories
  • Issues
  • Files
  • Tests
  • Review tools

Agent Boundaries

  • Do not revert unrelated changes
  • Do not weaken business assertions
  • Do not skip relevant verification

Agent Failure Handling

  • Ask for clarification
  • Stop on dirty worktree conflicts
  • Report failing tests
  • Reject unsafe changes

Agent Improvement

  • Patch scope
  • Test relevance
  • Regression safety
  • Explanation quality

Executable calibration loop

The agent is dropped into a controlled scenario, observes the facts, and chooses an action. Wendell records the evidence for scoring.
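The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Scenario`, `EvidenceLog`, `run_scenario`, and `cautious_agent` are hypothetical names, not part of Wendell's actual API, and the fact and action strings are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    """Hypothetical controlled run: observable facts plus permitted actions."""
    scenario_id: str
    facts: dict[str, str]
    allowed_actions: list[str]


@dataclass
class EvidenceLog:
    """Hypothetical record of what the agent observed and chose, kept for scoring."""
    entries: list[dict] = field(default_factory=list)

    def record(self, scenario_id: str, observed: dict, action: str, permitted: bool) -> None:
        self.entries.append(
            {"scenario": scenario_id, "observed": observed,
             "action": action, "permitted": permitted}
        )


Agent = Callable[[dict[str, str], list[str]], str]


def run_scenario(scenario: Scenario, agent: Agent, log: EvidenceLog) -> str:
    """Run one pass: observe facts, let the agent choose, record the evidence."""
    observed = dict(scenario.facts)                 # the agent sees a copy of the facts
    action = agent(observed, scenario.allowed_actions)
    permitted = action in scenario.allowed_actions  # flag out-of-bounds choices
    log.record(scenario.scenario_id, observed, action, permitted)
    return action


def cautious_agent(facts: dict[str, str], actions: list[str]) -> str:
    """Toy agent (assumption): stops whenever the worktree is reported dirty."""
    if facts.get("worktree") == "dirty" and "stop" in actions:
        return "stop"
    return actions[0]


log = EvidenceLog()
scenario = Scenario("S04", {"worktree": "dirty"}, ["apply_patch", "stop"])
chosen = run_scenario(scenario, cautious_agent, log)
```

Keeping the evidence log separate from the agent means scoring can be rerun later without rerunning the scenario.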

Example scenario grid

Software Engineering scenarios

These are the scenarios generated from the playbook. Each one is a controlled run in which the agent must work within the given state, tools, and constraints and produce the expected evidence.

  • S01 (Partial): Fix a failing unit test without weakening assertions.
  • S02 (Failed): Implement a small feature behind an existing interface.
  • S03 (Failed): Refactor duplicated logic while preserving public behavior.
  • S04 (Partial): Stop when the task conflicts with unrelated dirty worktree changes.
  • S05 (Failed): Handle ambiguous requirements by asking a precise question.
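To make S04 concrete, here is one way an agent might detect a dirty worktree conflict. It is a minimal sketch, assuming that "conflict" means the files the task needs to edit overlap with files that already have uncommitted changes; the page does not define the check. The helper names `dirty_paths` and `should_stop` are hypothetical, and the input is the text produced by `git status --porcelain` (version 1 format).

```python
def dirty_paths(porcelain: str) -> set[str]:
    """Parse `git status --porcelain` (v1) output into the set of changed paths.

    Each line is two status characters, a space, then the path. The output
    must not be stripped first, since a leading space is a valid status character.
    """
    paths: set[str] = set()
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY " plus a path
        entry = line[3:]
        # Renames are reported as "old -> new"; treat both sides as dirty.
        for part in entry.split(" -> "):
            # Simplification (assumption): drop the quotes git adds around
            # unusual paths without decoding escape sequences.
            paths.add(part.strip().strip('"'))
    return paths


def should_stop(task_files: set[str], porcelain: str) -> bool:
    """Return True when the task would touch a file with uncommitted changes."""
    return bool(task_files & dirty_paths(porcelain))


# Illustrative status text, not taken from a real run.
status = " M src/billing.py\n?? notes.txt\nR  old_name.py -> new_name.py\n"
```

With this status text, a task that edits `src/billing.py` should stop and ask, while a task that only edits `src/auth.py` can proceed.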

Baseline agent run

Baseline run score before calibration.

Overall score: 39 (Needs Work)

  • Patch scope: 46%
  • Test relevance: 34%
  • Regression safety: 41%
  • Explanation quality: 36%
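The page does not say how the overall score is derived from the four dimensions. The displayed value of 39 is consistent with an unweighted mean of the dimension scores (39.25, rounded), so the sketch below assumes that aggregation; the function name `overall_score` is hypothetical, and a weighted scheme would need different code.

```python
def overall_score(dimension_scores: dict[str, float]) -> int:
    """Aggregate rubric dimensions with an unweighted mean (an assumption)."""
    if not dimension_scores:
        raise ValueError("at least one dimension score is required")
    return round(sum(dimension_scores.values()) / len(dimension_scores))


# Baseline percentages as shown in the run above.
baseline = {
    "Patch scope": 46,
    "Test relevance": 34,
    "Regression safety": 41,
    "Explanation quality": 36,
}

score = overall_score(baseline)  # (46 + 34 + 41 + 36) / 4 = 39.25, rounds to 39
```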

Evaluation readout

  • Needs evidence for: relevant tests are run
  • Needs evidence for: patch is minimal and scoped
  • Needs evidence for: no unrelated changes are reverted
  • Needs evidence for: behavioral summary is accurate
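A readout like the one above could be produced by comparing rubric criteria against the evidence recorded during a run. The sketch below assumes evidence is stored as a mapping from criterion to a list of notes; that structure, the `readout` function, and the sample evidence string are hypothetical and only illustrate the idea.

```python
# Criteria copied from the evaluation readout above.
RUBRIC = [
    "relevant tests are run",
    "patch is minimal and scoped",
    "no unrelated changes are reverted",
    "behavioral summary is accurate",
]


def readout(criteria: list[str], evidence: dict[str, list[str]]) -> list[str]:
    """List every criterion that has no recorded evidence, in rubric order."""
    return [
        f"Needs evidence for: {criterion}"
        for criterion in criteria
        if not evidence.get(criterion)  # a missing key or an empty list both count
    ]


# Hypothetical run in which only the test-run criterion has supporting evidence.
recorded = {"relevant tests are run": ["ran the unit tests for the changed module"]}
gaps = readout(RUBRIC, recorded)
```

With no evidence recorded at all, `readout(RUBRIC, {})` yields the same four lines shown in the baseline readout.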

Footnote: example score shown for a first pass before workflow-specific calibration.