Wendell
Private Demo

Agent Calibration World

Software Engineering

A safe place to make your agent better. Each world recreates the people, systems, rules, and failures your agent needs to handle in production.

Pack formula

world model
+ scenario pack
+ mock tools
+ trajectory rubric
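
One way to read the pack formula is as a single record with four parts. The sketch below is illustrative only: the class name `CalibrationPack`, the field names, the `build_pack` helper, and the canned `run_tests` tool are all hypothetical and are not taken from Wendell itself. The touchpoints, scenario text, and rubric dimensions are copied from this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class CalibrationPack:
    """Hypothetical container mirroring the four terms of the pack formula."""
    world_model: dict[str, list[str]]             # touchpoints and rules
    scenarios: list[str]                          # scenario pack
    mock_tools: dict[str, Callable[[], dict]]     # mock tools
    rubric: list[str]                             # trajectory rubric dimensions


def build_pack() -> CalibrationPack:
    # A canned tool result stands in for a real test runner (assumption).
    def run_tests() -> dict:
        return {"passed": 11, "failed": 1}

    return CalibrationPack(
        world_model={
            "touchpoints": ["Developers", "Repositories", "Issues",
                            "Files", "Tests", "Review tools"],
            "boundaries": ["Do not revert unrelated changes"],
        },
        scenarios=["Fix a failing unit test without weakening assertions."],
        mock_tools={"run_tests": run_tests},
        rubric=["Patch scope", "Test relevance",
                "Regression safety", "Explanation quality"],
    )


pack = build_pack()
print(pack.rubric)
print(pack.mock_tools["run_tests"]())
```

Keeping the four parts separate means a scenario pack or rubric could be swapped without touching the world model, which seems to be the point of the "+" notation above.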

Agent Touchpoints

  • Developers
  • Repositories
  • Issues
  • Files
  • Tests
  • Review tools

Agent Boundaries

  • Do not revert unrelated changes
  • Do not weaken business assertions
  • Do not skip relevant verification

Agent Failure Handling

  • Ask for clarification
  • Stop on dirty worktree conflicts
  • Report failing tests
  • Reject unsafe changes
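
As an illustration of the "stop on dirty worktree conflicts" behavior, the sketch below compares the files a task intends to edit against the output of `git status --porcelain`. Treating a conflict as "a task file already has uncommitted changes" is an assumption made for this example, and the function names are hypothetical. Paths that Git quotes (for example, names containing spaces) are not handled.

```python
from __future__ import annotations

import subprocess


def dirty_paths(porcelain_output: str) -> set[str]:
    """Parse `git status --porcelain` (v1) output into a set of paths.

    Each line is two status characters, a space, then the path.
    Leading spaces are significant, so lines are not stripped.
    """
    paths = set()
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        if " -> " in path:            # rename entry: "old -> new"
            path = path.split(" -> ", 1)[1]
        paths.add(path)
    return paths


def worktree_conflicts(task_paths, porcelain_output: str) -> list[str]:
    """Return task files that already carry uncommitted changes."""
    return sorted(set(task_paths) & dirty_paths(porcelain_output))


def read_worktree_status(repo_dir: str) -> str:
    """Run git in repo_dir and return the raw porcelain status text."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example with canned status text instead of a live repository.
sample = " M src/billing.py\n?? notes.txt\n"
conflicts = worktree_conflicts(
    ["src/billing.py", "tests/test_billing.py"], sample
)
if conflicts:
    print("Stop and ask before editing:", conflicts)
```

An agent following this check would stop and report the overlapping files rather than overwrite or revert someone else's uncommitted work.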

Agent Improvement

  • Patch scope
  • Test relevance
  • Regression safety
  • Explanation quality

Executable calibration loop

The agent is dropped into an initial state, observes facts, and chooses an action; the world then produces the next state plus evidence for scoring.
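
The loop described above can be sketched as a small observe/act/step cycle. Everything here is a toy under stated assumptions: the `World` class, the `toy_agent` policy, the action names, and the evidence strings are hypothetical and only mirror the shape of the description, not Wendell's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy world: holds state and accumulates evidence for scoring."""
    state: dict
    evidence: list = field(default_factory=list)

    def observe(self) -> dict:
        # The agent sees a copy of the facts, not the live state.
        return dict(self.state)

    def step(self, action: str) -> None:
        # Produce the next state and record evidence of what happened.
        if action == "run_tests":
            self.state["tests_run"] = True
            self.evidence.append("relevant tests are run")
        elif action == "submit_patch":
            self.state["done"] = True
            self.evidence.append("patch submitted")


def toy_agent(facts: dict) -> str:
    # Hypothetical policy: verify first, then submit.
    return "submit_patch" if facts.get("tests_run") else "run_tests"


def run_episode(world: World, agent, max_steps: int = 10) -> list:
    for _ in range(max_steps):
        if world.state.get("done"):
            break
        facts = world.observe()      # observe facts
        action = agent(facts)        # choose an action
        world.step(action)           # next state plus evidence
    return world.evidence


evidence = run_episode(World({"tests_run": False, "done": False}), toy_agent)
print(evidence)
```

The evidence list is what a trajectory rubric would then inspect, for instance to confirm that tests were run before the patch was submitted.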

Demo scenario grid

Software Engineering scenarios

These are the scenarios generated from the calibration world. Each one is a controlled run where the agent must navigate state, tools, constraints, and expected evidence.

S01 (Partial): Fix a failing unit test without weakening assertions.

S02 (Failed): Implement a small feature behind an existing interface.

S03 (Failed): Refactor duplicated logic while preserving public behavior.

S04 (Partial): Stop when task conflicts with unrelated dirty worktree changes.

S05 (Failed): Handle ambiguous requirements by asking a precise question.

Fresh Pi Agent baseline

Baseline run score before calibration.

Overall score: 39 (Needs Work)

  • Patch scope: 46%
  • Test relevance: 34%
  • Regression safety: 41%
  • Explanation quality: 36%
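
The page does not state how the overall score is derived from the four dimensions. As a quick arithmetic check, an unweighted mean of the four percentages, rounded to the nearest integer, reproduces the displayed value of 39. The aggregation rule in the sketch below is therefore an assumption that happens to fit this one data point, not a documented formula.

```python
# Dimension scores as shown in the baseline panel.
scores = {
    "Patch scope": 46,
    "Test relevance": 34,
    "Regression safety": 41,
    "Explanation quality": 36,
}


def overall_score(dimension_scores: dict) -> int:
    """Unweighted mean, rounded. Assumed aggregation, not confirmed."""
    return round(sum(dimension_scores.values()) / len(dimension_scores))


mean = sum(scores.values()) / len(scores)   # 157 / 4 = 39.25
print(mean, overall_score(scores))
```

A weighted scheme could also yield 39, so this should be read as a consistency check rather than a description of the scoring method.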

Evaluation readout

  • Needs evidence for: relevant tests are run
  • Needs evidence for: patch is minimal and scoped
  • Needs evidence for: no unrelated changes are reverted
  • Needs evidence for: behavioral summary is accurate

Footnote: demo score shown for a fresh Pi Agent run without skills or extensions.