Ship deterministic agent CI in hours, not weeks.
This guide covers the core workflow: define suites, record tool calls, replay in CI, and gate merges with strict assertions.
pipx install runledger
runledger init
runledger run ./evals --mode record
runledger run ./evals --mode replay
open .agentci/runs/**/report.html
Quickstart
Install
Use pipx for a clean global install in its own isolated environment.
pipx install runledger
Bootstrap
Generate a demo suite with a sample agent and cassette.
runledger init
Replay
Run deterministic evals and publish artifacts to CI.
runledger run ./evals --mode replay
Core Concepts
Suites
A suite bundles cases, tool registry, and budgets into a single CI unit.
Cases
Each case defines a task input and a cassette for deterministic replay.
Cassettes
Record tool inputs and outputs once, then reuse them in CI.
Assertions
Validate JSON output with schemas, required fields, and tool contracts.
Budgets
Enforce hard caps on latency, tool calls, and tool errors.
Baselines
Track regressions and gate PRs when success rate drops.
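To make the cassette concept above concrete, here is a minimal Python sketch of record-then-replay for tool calls. It is not RunLedger's implementation or its on-disk cassette format: the `Cassette` class, its `call` method, the `lookup_order` tool, and the keying scheme (tool name plus canonicalized JSON arguments) are all assumptions made for illustration.

```python
import json


class Cassette:
    """Hypothetical in-memory cassette: maps (tool, args) to a recorded output."""

    def __init__(self, mode, entries=None):
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.entries = dict(entries or {})

    @staticmethod
    def _key(tool, args):
        # Sort keys so logically equal arguments produce the same key.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool, args, live_fn=None):
        key = self._key(tool, args)
        if self.mode == "record":
            # Hit the real tool once and store its output.
            output = live_fn(**args)
            self.entries[key] = output
            return output
        # Replay never touches the live tool; an unrecorded call is an error.
        if key not in self.entries:
            raise KeyError(f"no recorded call for {key}")
        return self.entries[key]


def lookup_order(order_id):
    # Stand-in for a real, possibly nondeterministic tool.
    return {"order_id": order_id, "status": "shipped"}


recorder = Cassette("record")
recorded = recorder.call("lookup_order", {"order_id": "A1"}, live_fn=lookup_order)

player = Cassette("replay", recorder.entries)
replayed = player.call("lookup_order", {"order_id": "A1"})
```

Failing loudly on an unrecorded call is a deliberate choice in this sketch: a replay that silently fell back to the live tool would no longer be deterministic.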
Assertions and budgets are hard gates.
Use JSON Schema and required fields for deterministic checks, then layer budgets for latency and tool usage.
- JSON schema validation for final output
- Required fields and regex guards
- Budget caps for wall time and tool calls
assertions:
  - type: json_schema
    schema_path: schema.json
  - type: required_fields
    fields: ["category", "reply"]
budgets:
  max_wall_ms: 20000
  max_tool_calls: 10
  max_tool_errors: 0
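As a rough illustration of what "hard gate" means for the config above, the following Python sketch evaluates a required-fields assertion and the three budget caps against one run. The function names and the observed-stats keys (`wall_ms`, `tool_calls`, `tool_errors`) are hypothetical, not RunLedger's internals, and the JSON Schema check is left out because it would need a third-party validator.

```python
import json


def check_required_fields(output_text, fields):
    """Return a list of failure messages for missing top-level fields."""
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing required field: {name}" for name in fields if name not in data]


def check_budgets(stats, budgets):
    """Compare observed run stats against caps; any overrun is a failure."""
    # Maps each budget key from the config to a hypothetical stats key.
    observed_key = {
        "max_wall_ms": "wall_ms",
        "max_tool_calls": "tool_calls",
        "max_tool_errors": "tool_errors",
    }
    failures = []
    for budget, cap in budgets.items():
        value = stats[observed_key[budget]]
        if value > cap:
            failures.append(f"{budget} exceeded: {value} > {cap}")
    return failures


budgets = {"max_wall_ms": 20000, "max_tool_calls": 10, "max_tool_errors": 0}
stats = {"wall_ms": 1850, "tool_calls": 3, "tool_errors": 1}
output = '{"category": "billing"}'

failures = check_required_fields(output, ["category", "reply"]) + check_budgets(stats, budgets)
passed = not failures
```

In this example the run fails on two counts: the `reply` field is missing, and one tool error exceeds the cap of zero. Any single failure is enough to fail the case.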
Artifacts and reporting
Run logs
Every event is captured to JSONL for auditing and diffs.
run.jsonl
CI output
JUnit and summary JSON integrate directly with CI dashboards.
junit.xml
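For teams that want to post-process the JUnit file themselves, a short Python sketch of pulling failed cases out of JUnit-style XML follows. The sample document is made up for illustration and is not actual RunLedger output; only the common JUnit conventions (`testcase` elements with `failure` or `error` children) are assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real junit.xml may carry more attributes.
SAMPLE_JUNIT = """<testsuite name="evals" tests="2" failures="1" errors="0">
  <testcase name="t1" classname="demo"/>
  <testcase name="t2" classname="demo">
    <failure message="missing required field: reply"/>
  </testcase>
</testsuite>"""


def failed_cases(junit_xml):
    """Return (case name, message) for each testcase with a failure or error child."""
    root = ET.fromstring(junit_xml)
    results = []
    # iter() also handles a <testsuites> wrapper around several suites.
    for case in root.iter("testcase"):
        for tag in ("failure", "error"):
            problem = case.find(tag)
            if problem is not None:
                results.append((case.get("name"), problem.get("message", "")))
                break
    return results


failed = failed_cases(SAMPLE_JUNIT)
```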
Shareable report
A static HTML report that opens anywhere, no server needed.
report.html
Summary metrics
Use summary.json for baseline diffs and regression gates.
summary.json
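A baseline gate built on the summary file could look roughly like the Python sketch below. The `success_rate` field, the `max_drop` tolerance, and the `regression_failures` helper are assumptions for illustration; the real summary.json schema may differ.

```python
import json


def regression_failures(baseline, current, max_drop=0.0):
    """Flag a regression when success rate falls more than max_drop below baseline."""
    drop = baseline["success_rate"] - current["success_rate"]
    if drop > max_drop:
        return [f"success_rate dropped by {drop:.2f} (allowed {max_drop:.2f})"]
    return []


# Hypothetical summary payloads; real summary.json fields may differ.
baseline = json.loads('{"success_rate": 0.95}')
current = json.loads('{"success_rate": 0.80}')

failures = regression_failures(baseline, current, max_drop=0.05)
# A nonzero exit code is what would block the merge in CI.
exit_code = 1 if failures else 0
```

A small nonzero `max_drop` can absorb noise when a suite has few cases, while `0.0` treats any drop as a regression.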
Ready for the CLI deep dive?
Explore every command, config field, and protocol message.