Agent Calibration World
Software Engineering
A safe place to make your agent better. Each world recreates the people, systems, rules, and failures your agent needs to handle in production.
Pack formula
world model
+ scenario pack
+ mock tools
+ trajectory rubric
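The four pack components above can be pictured as one bundled structure. A minimal sketch, with hypothetical names since the page does not show the real schema:

```python
from dataclasses import dataclass

@dataclass
class CalibrationPack:
    """Illustrative sketch of the pack formula:
    world model + scenario pack + mock tools + trajectory rubric."""
    world_model: dict   # facts, systems, and rules the world enforces
    scenarios: list     # controlled runs (e.g. S01..S05 below)
    mock_tools: dict    # tool name -> fake implementation
    rubric: list        # evidence checks applied to each trajectory

pack = CalibrationPack(
    world_model={"domain": "software-engineering"},
    scenarios=["S01", "S02", "S03", "S04", "S05"],
    mock_tools={"run_tests": lambda: {"passed": False}},
    rubric=["relevant tests are run", "patch is minimal and scoped"],
)
print(len(pack.scenarios))  # prints 5
```

All field names here are assumptions for illustration, not the product's actual schema.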
Agent Touchpoints
Agent Boundaries
Agent Failure Handling
Agent Improvement
Executable calibration loop
The agent is dropped into a state, observes the facts, chooses an action, and the world produces the next state plus evidence for scoring.
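The loop described above (state, observation, action, next state plus evidence) can be sketched as follows. Everything here is illustrative, including the toy world and agent, and is not the product's actual API:

```python
def run_calibration(world, agent, max_steps=10):
    """Hypothetical calibration loop: the agent observes the current state,
    picks an action, and the world returns the next state plus evidence
    that is later scored against the rubric."""
    state = world.reset()
    evidence = []
    for _ in range(max_steps):
        observation = world.observe(state)
        action = agent.act(observation)
        state, step_evidence, done = world.step(state, action)
        evidence.extend(step_evidence)
        if done:
            break
    return evidence

# Tiny in-memory world to show the loop shape (purely illustrative).
class ToyWorld:
    def reset(self):
        return {"tests_passing": False}
    def observe(self, state):
        return dict(state)
    def step(self, state, action):
        if action == "run_tests":
            return {"tests_passing": True}, ["relevant tests are run"], True
        return state, [], False

class ToyAgent:
    def act(self, observation):
        return "run_tests"

print(run_calibration(ToyWorld(), ToyAgent()))  # prints ['relevant tests are run']
```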
Demo scenario grid
Software Engineering scenarios
These are the scenarios generated from the calibration world. Each one is a controlled run where the agent must navigate state, tools, constraints, and expected evidence.
S01 (Partial): Fix a failing unit test without weakening assertions.
S02 (Failed): Implement a small feature behind an existing interface.
S03 (Failed): Refactor duplicated logic while preserving public behavior.
S04 (Partial): Stop when task conflicts with unrelated dirty worktree changes.
S05 (Failed): Handle ambiguous requirements by asking a precise question.
Fresh Pi Agent baseline
Baseline run score before calibration.
Overall score
39
Needs Work
Patch scope
46%
Test relevance
34%
Regression safety
41%
Explanation quality
36%
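The overall score appears to be a plain average of the four dimension scores, which the numbers shown bear out. A quick check, assuming that (unstated) aggregation:

```python
# Dimension scores as reported in the baseline readout.
dimensions = {
    "patch_scope": 46,
    "test_relevance": 34,
    "regression_safety": 41,
    "explanation_quality": 36,
}

overall = sum(dimensions.values()) / len(dimensions)
print(overall)       # prints 39.25
print(int(overall))  # prints 39, matching the reported overall score
```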
Evaluation readout
- Needs evidence for: relevant tests are run
- Needs evidence for: patch is minimal and scoped
- Needs evidence for: no unrelated changes are reverted
- Needs evidence for: behavioral summary is accurate
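A readout like the list above could be produced by checking each rubric item against the evidence a trajectory produced. A minimal sketch with hypothetical names:

```python
RUBRIC = [
    "relevant tests are run",
    "patch is minimal and scoped",
    "no unrelated changes are reverted",
    "behavioral summary is accurate",
]

def readout(evidence):
    """Return the rubric items the trajectory produced no evidence for."""
    found = set(evidence)
    return [f"Needs evidence for: {item}" for item in RUBRIC if item not in found]

# A run that produced no evidence fails every check.
for line in readout([]):
    print("-", line)
```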
Footnote: demo score shown for a fresh Pi Agent run without skills or extensions.