Wendell - Build worlds. Test agents. Catch failures.

Select Agent

Choose what Wendell should evaluate. Start with built-in baselines, connect a custom agent endpoint, or run through Pi as an agent harness.

Careful Baseline

Follows policies strictly, always verifies accounts first, and escalates when unsure. May be slower but rarely makes critical errors.

Policy-First

Risky Baseline

Optimizes for speed and customer satisfaction. Sometimes skips verification or makes policy exceptions to resolve issues faster.

Speed-First

External

Custom Agent Endpoint

Placeholder for a bring-your-own agent. Represents a hosted agent connected through an HTTP, SDK, or command adapter.

Adapter: HTTP / SDK

External

Pi Harness

External agent harness adapter. Connects to Pi's agent infrastructure for evaluation.

Adapter: Pi

Bring your own agent

Wendell is agent-agnostic. Connect a model, hosted agent, local harness, browser agent, or tool runtime.

Planned

Production agents behind an API

Wendell sends each scenario observation to your hosted agent and records the response, tool calls, and latency.

POST /respond → { message, tool_calls }
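A hosted agent that fits this contract could look like the minimal sketch below, written with Python's standard library only. The request body schema is not specified above, so the `message` input field, the `verify_account` tool name, and the port are all assumptions for illustration. Only the `/respond` path and the `{ message, tool_calls }` response shape come from the description.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def respond(observation: dict) -> dict:
    """Toy agent policy: verify the account before acting on a refund request."""
    text = observation.get("message", "")  # assumed input field name
    if "refund" in text.lower():
        return {
            "message": "I can help with that. Let me verify your account first.",
            # Hypothetical tool name, used only for illustration.
            "tool_calls": [{"name": "verify_account", "arguments": {}}],
        }
    return {"message": "Could you tell me more about the issue?", "tool_calls": []}


class RespondHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/respond":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            observation = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        if not isinstance(observation, dict):
            self.send_error(400, "body must be a JSON object")
            return
        body = json.dumps(respond(observation)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), RespondHandler).serve_forever()
```

Keeping the policy in a plain `respond` function, separate from the HTTP handler, means the same function can be unit-tested or reused without starting a server.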

Planned

CI/eval pipelines and internal platforms

Embed Wendell directly in Python or TypeScript and evaluate your own agent function against scenario suites.

wendell.evaluate({ agent, scenarios })
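Since this integration is marked as planned, the sketch below does not call Wendell. It only illustrates the idea of passing an agent function and a list of scenarios to an evaluation loop. `evaluate_stub`, the scenario fields `name` and `observation`, and the result fields are hypothetical stand-ins, not Wendell's API.

```python
from typing import Callable

# An agent is any function from an observation to a response.
Agent = Callable[[dict], dict]


def my_agent(observation: dict) -> dict:
    """Agent under test; returns the { message, tool_calls } shape."""
    return {"message": f"Received: {observation.get('message', '')}", "tool_calls": []}


def evaluate_stub(agent: Agent, scenarios: list[dict]) -> list[dict]:
    """Hypothetical stand-in for an evaluate call: runs the agent once per scenario."""
    results = []
    for scenario in scenarios:
        response = agent(scenario["observation"])
        results.append({
            "scenario": scenario["name"],
            "response": response,
            "used_tools": bool(response["tool_calls"]),
        })
    return results
```

In a CI pipeline, a loop of this kind would typically run as a test step and fail the build when a scenario result crosses a chosen threshold.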

Available

Local agents, prototypes, LangChain, CrewAI

Any command that accepts JSON on stdin and prints an agent response can be tested by Wendell.

--agent-command "python my_agent.py"
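A `my_agent.py` compatible with this adapter might look like the sketch below: it reads one JSON document from stdin and prints one JSON response to stdout. The exact input and output schema is not given above, so the `message` field and the `{ message, tool_calls }` output shape are assumptions borrowed from the hosted-endpoint description.

```python
import json
import sys


def handle(observation: dict) -> dict:
    """Toy agent: echo the customer's message and call no tools."""
    text = observation.get("message", "")  # assumed input field name
    return {"message": f"You said: {text}", "tool_calls": []}


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    raw = stdin.read()
    # Treat empty input as an empty observation rather than failing.
    observation = json.loads(raw) if raw.strip() else {}
    json.dump(handle(observation), stdout)
    stdout.write("\n")


if __name__ == "__main__":
    main()
```

Keep diagnostic logging on stderr so that stdout carries only the JSON response.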

Planned

Browser/UI agents

Wendell presents a simulated customer or workflow page that browser agents interact with like a real app.

Launch target URL + session token

Planned

Tool-using agents and desktop assistants

Expose simulations as tools: start, observe, send message, call tool, finish run, and get report.

start_simulation, observe, finish_run

Planned

Observability, dashboards, QA systems

Stream scenario, trajectory, assessment, and critical-failure events into your own systems.

scenario.completed, assessment.completed
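On the receiving side, a consumer for these events could be sketched as a small dispatcher keyed by event name. The envelope used here, `{"type": ..., "data": {...}}`, and the `scenario` and `critical_failure` fields are assumptions, since the event schema is not described above. Only the two event names come from the text.

```python
import json
from collections import defaultdict
from typing import Callable

# Maps an event name to the handlers registered for it.
handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

completed: list[str] = []
critical_failures: list[str] = []


def on(event_type: str):
    """Decorator that registers a handler for one event name."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register


@on("scenario.completed")
def record_scenario(event: dict) -> None:
    completed.append(event["data"]["scenario"])  # assumed field


@on("assessment.completed")
def record_assessment(event: dict) -> None:
    if event["data"].get("critical_failure"):  # assumed field
        critical_failures.append(event["data"]["scenario"])


def dispatch(raw: str) -> int:
    """Parse one JSON event and run its handlers; returns how many ran."""
    event = json.loads(raw)
    matched = handlers.get(event.get("type"), [])
    for fn in matched:
        fn(event)
    return len(matched)
```

Unknown event names are ignored rather than rejected, so a consumer written this way keeps working if new event types are added to the stream.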