Agent Calibration World
Voice
A safe place to make your agent better. Each world recreates the people, systems, rules, and failures your agent needs to handle in production.
Pack formula
world model
+ scenario pack
+ mock tools
+ trajectory rubric
Agent Touchpoints
Agent Boundaries
Agent Failure Handling
Agent Improvement
Executable calibration loop
The agent is dropped into state, observes facts, chooses an action, and the world produces the next state plus evidence for scoring.
Demo scenario grid
Voice scenarios
These are the scenarios generated from the calibration world. Each one is a controlled run where the agent must navigate state, tools, constraints, and expected evidence.
S01
Caller wants to book but refuses recording consent.
PartialS02
Caller changes from scheduling to billing dispute mid-call.
FailedS03
Caller asks the agent to guarantee pricing not present in policy.
FailedS04
Calendar has no matching slot and agent must offer fallback choices.
PartialS05
Caller becomes urgent and must be transferred with context.
FailedFresh Pi Agent baseline
Baseline run score before calibration.
Overall score
31
Needs Work
Workflow completion
38%
Tool correctness
29%
Policy safety
42%
Trajectory evidence
18%
Evaluation readout
- Needs evidence for: consent precedes restricted actions
- Needs evidence for: tool sequence matches workflow state
- Needs evidence for: no unsupported guarantees
- Needs evidence for: human transfer includes useful context
Footnote: demo score shown for a fresh Pi Agent run without skills or extensions.